Class MinHashFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.minhash.MinHashFilter
- All Implemented Interfaces:
Closeable
,AutoCloseable
,Unwrappable<TokenStream>
Generate min hash tokens from an incoming stream of tokens. The incoming tokens would typically
be 5 word shingles.
The number of hashes used and the number of minimum values for each hash can be set. You could have 1 hash and keep the 100 lowest values or 100 hashes and keep the lowest one for each. Hashes can also be bucketed in ranges over the 128-bit hash space,
A 128-bit hash is used internally. 5 word shingles from 10e5 words generate 10e25 combinations So a 64 bit hash would have collisions (1.8e19)
When using different hashes 32 bits are used for the hash position leaving scope for 8e28 unique hashes. A single hash will use all 128 bits.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Modifier and TypeFieldDescriptionstatic final int
static final int
static final int
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
ConstructorDescriptionMinHashFilter
(TokenStream input, int hashCount, int bucketCount, int hashSetSize, boolean withRotation) create a MinHash filter -
Method Summary
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
DEFAULT_HASH_COUNT
public static final int DEFAULT_HASH_COUNT- See Also:
-
DEFAULT_HASH_SET_SIZE
public static final int DEFAULT_HASH_SET_SIZE- See Also:
-
DEFAULT_BUCKET_COUNT
public static final int DEFAULT_BUCKET_COUNT- See Also:
-
-
Constructor Details
-
MinHashFilter
public MinHashFilter(TokenStream input, int hashCount, int bucketCount, int hashSetSize, boolean withRotation) create a MinHash filter- Parameters:
input
- the token streamhashCount
- the no. of hashesbucketCount
- the no. of buckets for hashinghashSetSize
- the no. of min hashes to keepwithRotation
- whether rotate or not hashes while incrementing tokens
-
-
Method Details
-
incrementToken
- Specified by:
incrementToken
in classTokenStream
- Throws:
IOException
-
end
- Overrides:
end
in classTokenFilter
- Throws:
IOException
-
reset
- Overrides:
reset
in classTokenFilter
- Throws:
IOException
-