De-Duplication

If duplicate, or near-duplicate documents are a concern in your index, de-duplication may be worth implementing.

Preventing duplicate or near duplicate documents from entering an index or tagging documents with a signature/fingerprint for duplicate field collapsing can be efficiently achieved with a low collision or fuzzy hash algorithm. Solr natively supports de-duplication techniques of this type via the Signature class and allows for the easy addition of new hash/signature implementations. A Signature can be implemented in a few ways:

  • MD5Signature: 128-bit hash used for exact duplicate detection.

  • Lookup3Signature: 64-bit hash used for exact duplicate detection. This is much faster than MD5 and smaller to index.

  • TextProfileSignature: Fuzzy hashing implementation from Apache Nutch for near duplicate detection. It’s tunable but works best on longer text.

Other, more sophisticated algorithms for fuzzy/near hashing can be added later.

Adding in the de-duplication process will change the allowDups setting so that it applies to an update term (with signatureField in this case) rather than the unique field Term.

Of course the signatureField could be the unique field, but generally you want the unique field to be unique. When a document is added, a signature will automatically be generated and attached to the document in the specified signatureField.

Configuration Options

There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.

In solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of an Update Request Processor Chain, as in this example:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The SignatureUpdateProcessorFactory takes several properties:

signatureClass

A Signature implementation for generating a signature hash. The default is org.apache.solr.update.processor.Lookup3Signature.

The full classpath of the implementation must be specified. The available options are described above, the associated classpaths to use are:

  • org.apache.solr.update.processor.Lookup3Signature

  • org.apache.solr.update.processor.MD5Signature

  • org.apache.solr.update.process.TextProfileSignature

fields
The fields to use to generate the signature hash in a comma separated list. By default, all fields on the document will be used.
signatureField
The name of the field used to hold the fingerprint/signature. The field should be defined in schema.xml. The default is signatureField.
enabled
Set to false to disable de-duplication processing. The default is true.
overwriteDupes
If true, the default, when a document exists that already matches this signature, it will be overwritten.

In schema.xml

If you are using a separate field for storing the signature, you must have it indexed:

<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />

Be sure to change your update handlers to use the defined chain, as below:

<requestHandler name="/update" class="solr.UpdateRequestHandler" >
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
...
</requestHandler>

This example assumes you have other sections of your request handler defined.

The update processor can also be specified per request with a parameter of update.chain=dedupe.