Class MinHashFilter
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.TokenFilter
        - org.apache.lucene.analysis.minhash.MinHashFilter

All Implemented Interfaces:
- Closeable, AutoCloseable, Unwrappable<TokenStream>
 
public class MinHashFilter extends TokenFilter

Generates min hash tokens from an incoming stream of tokens. The incoming tokens would typically be 5-word shingles.

The number of hashes used and the number of minimum values kept for each hash can be set. You could use 1 hash and keep the 100 lowest values, or use 100 hashes and keep the lowest value for each. Hashes can also be bucketed in ranges over the 128-bit hash space.

A 128-bit hash is used internally. 5-word shingles drawn from a vocabulary of 10^5 words generate 10^25 combinations, so a 64-bit hash (about 1.8e19 values) would have collisions. When using different hashes, 32 bits are used for the hash position, leaving 96 bits, or about 8e28 unique hashes. A single hash will use all 128 bits.
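The sizing argument above can be checked with a few lines of arbitrary-precision arithmetic. The following is a minimal sketch using only the JDK; the class and method names (HashSpaceCheck, shingleCombinations, hashSpace) are hypothetical and are not part of Lucene.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of Lucene: checks the hash-space arithmetic
// quoted in the class description.
final class HashSpaceCheck {

    /** Number of distinct shingles of the given size over a vocabulary. */
    static BigInteger shingleCombinations(long vocabularySize, int shingleSize) {
        return BigInteger.valueOf(vocabularySize).pow(shingleSize);
    }

    /** Number of distinct values representable in the given number of bits. */
    static BigInteger hashSpace(int bits) {
        return BigInteger.ONE.shiftLeft(bits);
    }

    public static void main(String[] args) {
        // 5-word shingles over 10^5 words.
        BigInteger shingles = shingleCombinations(100_000L, 5);

        System.out.println("shingle combinations: " + shingles);
        System.out.println("64-bit hash space:    " + hashSpace(64));
        System.out.println("96-bit hash space:    " + hashSpace(96));

        // More shingles than 64-bit values means collisions are unavoidable.
        System.out.println("fits in 64 bits: " + (shingles.compareTo(hashSpace(64)) <= 0));
        // 128 bits minus 32 bits of hash position leaves 96 bits.
        System.out.println("fits in 96 bits: " + (shingles.compareTo(hashSpace(96)) <= 0));
    }
}
```

The 96-bit figure corresponds to the 128-bit hash with 32 bits reserved for the hash position, as described above.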
Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource:
- AttributeSource.State
 
Field Summary

Fields (Modifier and Type, Field):
- static int DEFAULT_BUCKET_COUNT
- static int DEFAULT_HASH_COUNT
- static int DEFAULT_HASH_SET_SIZE

Fields inherited from class org.apache.lucene.analysis.TokenFilter:
- input

Fields inherited from class org.apache.lucene.analysis.TokenStream:
- DEFAULT_TOKEN_ATTRIBUTE_FACTORY
 
Constructor Summary

- MinHashFilter(TokenStream input, int hashCount, int bucketCount, int hashSetSize, boolean withRotation): Creates a MinHash filter.
Method Summary

Instance methods (Modifier and Type, Method):
- void end()
- boolean incrementToken()
- void reset()

Methods inherited from class org.apache.lucene.analysis.TokenFilter:
- close, unwrap

Methods inherited from class org.apache.lucene.util.AttributeSource:
- addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
 
Field Detail

DEFAULT_HASH_COUNT
public static final int DEFAULT_HASH_COUNT
- See Also: Constant Field Values

DEFAULT_HASH_SET_SIZE
public static final int DEFAULT_HASH_SET_SIZE
- See Also: Constant Field Values

DEFAULT_BUCKET_COUNT
public static final int DEFAULT_BUCKET_COUNT
- See Also: Constant Field Values
 
Constructor Detail

MinHashFilter
public MinHashFilter(TokenStream input, int hashCount, int bucketCount, int hashSetSize, boolean withRotation)

Creates a MinHash filter.

Parameters:
- input - the token stream
- hashCount - the number of hashes
- bucketCount - the number of buckets for hashing
- hashSetSize - the number of min hashes to keep
- withRotation - whether to rotate hashes while incrementing tokens
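To illustrate what hashCount and hashSetSize trade off, here is a minimal JDK-only sketch of the two extremes mentioned in the class description: one hash keeping the k lowest values, and k hashes keeping the lowest value each. It is not Lucene's implementation. The class MinHashSketch, its methods, and the 64-bit stand-in hash are assumptions made for brevity (the filter itself uses a 128-bit hash), and bucketCount and withRotation are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration, not Lucene code.
final class MinHashSketch {

    /** SplitMix64-style bit mixer, used only as a stand-in 64-bit hash. */
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Hashes a shingle with the hash function selected by seed. */
    static long hash(String shingle, int seed) {
        return mix(shingle.hashCode() * 0x9E3779B97F4A7C15L + seed);
    }

    /** Builds word shingles of the given size from whitespace-separated text. */
    static List<String> shingles(String text, int size) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)));
        }
        return out;
    }

    /** One hash, keep the k lowest values (like hashCount = 1, hashSetSize = k). */
    static TreeSet<Long> lowestK(List<String> shingles, int k) {
        TreeSet<Long> kept = new TreeSet<>(Long::compareUnsigned);
        for (String s : shingles) {
            kept.add(hash(s, 0));
            if (kept.size() > k) {
                kept.pollLast(); // drop the largest value
            }
        }
        return kept;
    }

    /** k hashes, keep the lowest value of each (like hashCount = k, hashSetSize = 1). */
    static long[] lowestPerHash(List<String> shingles, int k) {
        long[] mins = new long[k];
        Arrays.fill(mins, -1L); // all bits set is the unsigned maximum
        for (String s : shingles) {
            for (int i = 0; i < k; i++) {
                long h = hash(s, i);
                if (Long.compareUnsigned(h, mins[i]) < 0) {
                    mins[i] = h;
                }
            }
        }
        return mins;
    }

    public static void main(String[] args) {
        List<String> input = shingles("the quick brown fox jumps over the lazy dog", 5);
        System.out.println("shingles: " + input.size());
        System.out.println("1 hash, 3 lowest values: " + lowestK(input, 3).size());
        System.out.println("3 hashes, lowest of each: " + lowestPerHash(input, 3).length);
    }
}
```

Both variants depend only on the set of shingles, not on their order, which is what makes the kept values usable as a document signature.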
 
 
Method Detail

incrementToken
public final boolean incrementToken() throws IOException
- Specified by: incrementToken in class TokenStream
- Throws: IOException

end
public void end() throws IOException
- Overrides: end in class TokenFilter
- Throws: IOException

reset
public void reset() throws IOException
- Overrides: reset in class TokenFilter
- Throws: IOException