SharePoint 2013 Search enables users to modify the managed properties of crawled items before they are indexed by calling out to an external content enrichment web service. The ability to modify managed properties for items during content processing is helpful when performing tasks such as data cleansing, entity extraction, classification and tagging.
The content processing component is built around a fixed pipeline made up of several processing stages arranged in sequence, each performing a distinct activity while a document is processed for indexing.
While the content processing component provides several improvements over SharePoint 2010 enterprise search (not FAST Search for SharePoint 2010), it introduces a bottleneck wherever custom processing is needed in the pipeline. For custom processing, SharePoint 2013 provides a single mechanism, the Content Enrichment Web Service (CEWS), shown as Web Service Callout in the diagram above. This is in principle a hook in the pipeline for an external WCF service.
The two major drawbacks of this process are:
- Whatever custom processing we need must be performed within this single WCF service call. Only one CEWS registration is allowed per pipeline, and only one pipeline is allowed per Search Service Application. This becomes a bottleneck when documents passing through the pipeline require several kinds of external processing.
- Once registered, the CEWS applies to all content sources in that Search Service Application. In practice, a Search Service Application may have multiple content sources with different requirements, and there is no built-in way to give each content source its own enrichment service. For example:
Content Source 1 -> needs to process managed property values from Repository 1
Content Source 2 -> does not need to process any values from an external repository
Content Source 3 -> needs to process managed property values from Repository 2
Content Source 2 needs no Content Enrichment Web Service call at all, while Content Source 1 and Content Source 3 need to call two completely different repositories to get their managed property values. A single web service call for both does not solve the problem. Registering a WCF Routing Service as the CEWS can resolve it, as discussed in the sections below.
Our demo solution has two content sources in one Search Service Application. The requirements are:
- For one content source, we need to generate the document preview using the Longitude Preview Service from BA-Insight and, at the same time, populate some managed property values from a SQL Server database. Let’s name it “Content Source CEWS Multiple”. BA-Insight uses its own Content Enrichment Web Service to generate the document preview.
- For the other content source, we need to generate the document preview only. Let’s name it “Content Source CEWS Single”. It only needs to call the BA-Insight preview generator Content Enrichment Service.
Introducing WCF Workflow Service
A WCF Workflow Service can call more than one WCF service. Instead of registering a simple WCF service as the Content Enrichment web service endpoint, we can register a WCF Workflow Service as the endpoint and call our custom WCF services from it. The Workflow Service first calls the BA-Insight preview generator service to generate the preview. It then calls our custom WCF service, which gets the managed property values from the SQL Server database. After getting the values, the Workflow Service creates the output properties and sends them back to the SharePoint pipeline, where SharePoint populates the managed property values. However, this would apply to both content sources, which is not what we want: the second content source, “Content Source CEWS Single”, only needs to call the BA-Insight preview service to generate document previews. To resolve this we need the WCF Routing Service, described below.
WCF Routing Service
WCF 4.0 introduces a new service called the Routing Service. Its purpose is to pick up a request from a client and, based on the routing logic, direct it to the proper endpoint or downstream service. These downstream services may be hosted on the same machine or distributed across several machines in a server farm. So instead of registering the WCF Workflow Service as the CEWS endpoint, we register the WCF Routing Service. During a crawl, the SharePoint pipeline calls the WCF Routing Service with some input and output properties. Based on an input property value, the routing service then forwards the request to either the WCF Workflow Service or the BA-Insight Preview Service. To understand this in detail, we first need to look at the SharePoint Content Enrichment Web Service itself.
SharePoint Content Enrichment Web Service Components
The following are the key Content Enrichment Web Service parameters that can be defined when the service is registered.
1. InputProperties: The InputProperties parameter specifies the managed properties sent to the service.
2. OutputProperties: The OutputProperties parameter specifies the managed properties returned by the service.
Note that both are case sensitive. All managed properties referenced need to be created in advance.
3. Trigger: A trigger condition that represents a predicate to execute for every item being processed. If a trigger condition is used, the external web service is called only when the trigger evaluates to true. If no trigger condition is used, all items are sent to the external web service.
4. SendRawData: A SendRawData switch that sends the raw data of an item in binary form. This is useful when more metadata is required than what can be retrieved from the parsed version of the item. In our case we need to set it to true since the BA
5. TimeOut: The amount of time, in milliseconds, until the web service call times out. Valid range: 100 - 30000. In our case we’ll set it to a higher value, since we are calling multiple services and at some point they will be heavily loaded.
The details of the configuration options and of the Content Enrichment Web Service can be found on MSDN. The following is a sample PowerShell script to deploy the CEWS.
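# Get the Search Service Application that owns the content processing pipeline
$ssa = Get-SPEnterpriseSearchServiceApplication
# Create an empty content enrichment configuration
$config = New-SPEnterpriseSearchContentEnrichmentConfiguration
# Endpoint of the service to call (placeholder URL; must be a quoted string)
$config.Endpoint = "http://Site_URL/<service name>.svc"
$config.InputProperties = "OriginalPath,Body"
$config.OutputProperties = "OpProp1,OpProp2,OpProp3,OpProp4"
$config.SendRawData = $True
# Maximum raw data size, in kilobytes
$config.MaxRawDataSize = 8192
# Timeout in milliseconds (valid range 100 - 30000)
$config.TimeOut = 10000
# Register the configuration with the Search Service Application
Set-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa -ContentEnrichmentConfiguration $config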
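The routing approach in this article depends on the content source of each item, so the ContentSource managed property has to be among the input properties sent to the endpoint; the sample script sends only OriginalPath and Body. The following is a minimal sketch of that adjustment, not the exact configuration used in the demo. It assumes a configuration is already registered. The Trigger line is an assumption for the three-content-source scenario described earlier, where "Content Source 2" needs no callout; check the expression against the trigger expression syntax reference before using it.

```powershell
# Sketch only: assumes a content enrichment configuration is already registered.
$ssa = Get-SPEnterpriseSearchServiceApplication
$config = Get-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa

# Send ContentSource along so the routing service has a value to filter on.
$config.InputProperties = "OriginalPath,Body,ContentSource"

# Assumed trigger expression: skip the callout for items from a content
# source that needs no enrichment ("Content Source 2" is a hypothetical name).
$config.Trigger = 'ContentSource != "Content Source 2"'

# Re-register the updated configuration.
Set-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa -ContentEnrichmentConfiguration $config
```

A trigger can only decide whether the single registered endpoint is called at all. Choosing between different downstream services still has to happen behind that endpoint.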
Putting It All Together
Schematic flow diagram for Overall Search Enrichment Process
The schematic diagram above shows the entire logic of the search enrichment process.
The WCF Routing Service is configured as the endpoint of the Content Enrichment configuration. Only the content in the “Content Source CEWS Multiple” content source needs to be updated with managed property values from the SQL Server database, so we forward the content processing request to the WCF Workflow Service only when the document being crawled belongs to that content source. For documents in the other content source, we only need to generate the document preview from BA-Insight. Therefore, instead of routing the request to the WCF Workflow Service, we send it directly to the BA-Insight Longitude Preview Generation Service.
The routing service routes each request based on a routing filter. In this case the filter is configured on the managed property named “ContentSource” and its value. The same approach extends to more content sources that need different repositories to populate their managed property values. One thing to remember is that the code needs to be very efficient: document processing involves some very heavy work, and the SharePoint Search service (noderunner.exe) is itself very memory hungry.
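Because several services sit behind one callout, it may help to decide up front what happens when the chain is slow or fails. The sketch below is one possible setting, not a recommendation from the demo: it raises the timeout to the top of the valid range and switches the failure mode so that a failed callout does not fail the item. The values are assumptions to tune for your own farm.

```powershell
# Sketch only: assumes a content enrichment configuration is already registered.
$ssa = Get-SPEnterpriseSearchServiceApplication
$config = Get-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa

# 30000 ms is the upper limit of the valid timeout range (100 - 30000).
$config.Timeout = 30000

# "WARNING": if the callout fails or times out, the item is indexed without enrichment.
# "ERROR" (the default): the item is not indexed and a crawl error is logged.
$config.FailureMode = "WARNING"

Set-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa -ContentEnrichmentConfiguration $config
```

With "WARNING", items from a slow repository still become searchable, at the cost of missing managed property values until the next successful crawl of those items.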