public class TextProfileSignature
extends MD5Signature
This implementation is copied from Apache Nutch.
An implementation of a page signature. It calculates an MD5 hash
of a plain text "profile" of a page.
The algorithm to calculate a page "profile" takes the plain text version of
a page and performs the following steps:
- remove all characters except letters and digits, and bring all characters
to lower case,
- split the text into tokens (all consecutive non-whitespace characters),
- discard tokens of length equal to or shorter than MIN_TOKEN_LEN (default 2 characters),
- sort the list of tokens by decreasing frequency,
- round down the counts of tokens to the nearest multiple of QUANT
(QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and
maxFreq is the maximum token frequency). If maxFreq is higher than 1, then
QUANT is always at least 2 (which means that tokens with frequency 1 are
always discarded),
- discard tokens whose frequency after quantization falls below QUANT,
- create a list of tokens and their quantized frequencies, separated by spaces,
in the order of decreasing frequency.
This list is then submitted to an MD5 hash calculation.
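The steps listed above can be sketched in plain Java as follows. This is an illustrative reading of the description, not the Solr or Nutch source: the class name TextProfileSketch and its methods profile and md5Hex are hypothetical. The sketch also makes a few assumptions the text does not state: every character that is not a letter or digit is treated as a token boundary, QUANT falls back to 1 when maxFreq is 1 or less, ties in frequency are broken alphabetically, and profile entries are joined with newlines.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; hypothetical class, not the Solr or Nutch source. */
final class TextProfileSketch {

    /** Builds the text profile: one "token count" entry per line, most frequent first. */
    static String profile(String text, int minTokenLen, float quantRate) {
        // Keep letters and digits (lower-cased); any other character ends the
        // current token (assumption). Tokens of length <= minTokenLen are dropped.
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            // A trailing space acts as a sentinel that flushes the last token.
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
                continue;
            }
            if (cur.length() > minTokenLen) {
                int n = counts.merge(cur.toString(), 1, Integer::sum);
                maxFreq = Math.max(maxFreq, n);
            }
            cur.setLength(0);
        }

        // QUANT = QUANT_RATE * maxFreq, but at least 2 when maxFreq > 1.
        // The fallback to 1 for maxFreq <= 1 is an assumption of this sketch.
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            quant = maxFreq > 1 ? 2 : 1;
        }

        // Round each count down to a multiple of QUANT; drop counts below QUANT.
        Map<String, Integer> quantized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                quantized.put(e.getKey(), q);
            }
        }

        // Sort by decreasing quantized frequency; alphabetical tie-break (assumption).
        List<String> tokens = new ArrayList<>(quantized.keySet());
        tokens.sort((a, b) -> {
            int byCount = Integer.compare(quantized.get(b), quantized.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });

        StringBuilder out = new StringBuilder();
        for (String t : tokens) {
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(t).append(' ').append(quantized.get(t));
        }
        return out.toString();
    }

    /** Returns the MD5 digest of the UTF-8 bytes of s as lower-case hex. */
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Not expected on a standard JDK, which ships an MD5 provider.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "The quick brown fox. The quick brown dog. The quick cat, the end.";
        String p = profile(text, 2, 0.01f);
        System.out.println(p);
        System.out.println(md5Hex(p));
    }
}
```

For the sample text, maxFreq is 4 ("the"), so QUANT_RATE * maxFreq rounds to 0 and QUANT is raised to 2. The counts 4, 3 and 2 quantize to 4, 2 and 2, and every token that occurs once is discarded, leaving the entries "the 4", "brown 2" and "quick 2". Because case, punctuation and rare tokens do not affect the profile, near-duplicate pages tend to produce the same hash.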
Fields inherited from class org.apache.solr.update.processor.MD5Signature
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
public void init(SolrParams params)
init in class
public byte[] getSignature()
getSignature in class
public void add(String content)
add in class
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.