Class MinHashFilter
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.TokenFilter
        - org.apache.lucene.analysis.minhash.MinHashFilter
- All Implemented Interfaces:
Closeable, AutoCloseable
public class MinHashFilter extends TokenFilter
Generates min hash tokens from an incoming stream of tokens. The incoming tokens would typically be 5-word shingles. The number of hashes used and the number of minimum values kept for each hash can be set: you could have 1 hash and keep the 100 lowest values, or 100 hashes and keep the lowest one for each. Hashes can also be bucketed in ranges over the 128-bit hash space.
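The trade-off between "1 hash, keep the k lowest values" and "k hashes, keep the lowest for each" can be illustrated with a minimal, standard-library-only sketch. It is not the filter's implementation: the class `MinHashSketch`, its methods, and the seeded 64-bit FNV-1a hash are hypothetical stand-ins (the filter uses a 128-bit hash), and bucketing is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class MinHashSketch {

    // Hypothetical stand-in hash: 64-bit FNV-1a whose start value is varied by a seed,
    // so that each seed behaves like a different hash function.
    public static long hash(String token, int seed) {
        long h = 0xcbf29ce484222325L ^ seed;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // For each of hashCount seeded hashes, keep the keepLowest smallest hash values,
    // comparing the 64-bit values as unsigned numbers.
    public static List<TreeSet<Long>> signature(List<String> shingles, int hashCount, int keepLowest) {
        Comparator<Long> unsigned = Long::compareUnsigned;
        List<TreeSet<Long>> result = new ArrayList<>();
        for (int seed = 0; seed < hashCount; seed++) {
            TreeSet<Long> lowest = new TreeSet<>(unsigned);
            for (String shingle : shingles) {
                lowest.add(hash(shingle, seed));
                if (lowest.size() > keepLowest) {
                    lowest.pollLast(); // drop the current largest value
                }
            }
            result.add(lowest);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> shingles = Arrays.asList("a b c d e", "b c d e f", "c d e f g");
        // One hash function, keep the two lowest values.
        System.out.println(signature(shingles, 1, 2));
        // Three hash functions, keep only the lowest value of each.
        System.out.println(signature(shingles, 3, 1));
    }
}
```

Either configuration yields a fixed-size signature that does not depend on the order in which the shingles arrive, which is what makes signatures of two documents comparable.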
A 128-bit hash is used internally. 5-word shingles drawn from 1e5 words generate 1e25 combinations, so a 64-bit hash, which has only about 1.8e19 distinct values, would have collisions.
When using different hashes, 32 bits are used for the hash position, leaving scope for about 8e28 unique hashes. A single hash will use all 128 bits.
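The figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes a vocabulary of 1e5 words and uses only `java.math.BigInteger`; the class and method names are illustrative and not part of Lucene.

```java
import java.math.BigInteger;

public class HashSpaceArithmetic {

    // Number of distinct shingles of the given length over a vocabulary of the given size.
    public static BigInteger shingleCombinations(long vocabularySize, int shingleLength) {
        return BigInteger.valueOf(vocabularySize).pow(shingleLength);
    }

    // Number of distinct values representable in the given number of bits (2^bits).
    public static BigInteger hashSpace(int bits) {
        return BigInteger.ONE.shiftLeft(bits);
    }

    public static void main(String[] args) {
        BigInteger combos = shingleCombinations(100_000L, 5);
        System.out.println("5-word shingles over 1e5 words: " + combos);
        System.out.println("64-bit hash space:  " + hashSpace(64));
        System.out.println("96-bit hash space:  " + hashSpace(96));
        System.out.println("128-bit hash space: " + hashSpace(128));
        // More possible shingles than 64-bit values means collisions are unavoidable.
        System.out.println("combinations exceed 64 bits: " + (combos.compareTo(hashSpace(64)) > 0));
        // With 32 of the 128 bits reserved for the hash position, 96 bits remain.
        System.out.println("combinations fit in 96 bits: " + (combos.compareTo(hashSpace(96)) < 0));
    }
}
```

Here 2^64 is 18446744073709551616 (about 1.8e19) and 2^96 is 79228162514264337593543950336 (about 8e28), which matches the numbers in the description.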
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
Fields (Modifier and Type, Field):
static int DEFAULT_BUCKET_COUNT
static int DEFAULT_HASH_COUNT
static int DEFAULT_HASH_SET_SIZE
-
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors (Constructor, Description):
MinHashFilter(TokenStream input, int hashCount, int bucketCount, int hashSetSize, boolean withRotation)
Creates a MinHash filter.
-
Method Summary
Methods (Modifier and Type, Method):
void end()
boolean incrementToken()
void reset()
-
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Detail
-
DEFAULT_HASH_COUNT
public static final int DEFAULT_HASH_COUNT
- See Also:
- Constant Field Values
-
DEFAULT_HASH_SET_SIZE
public static final int DEFAULT_HASH_SET_SIZE
- See Also:
- Constant Field Values
-
DEFAULT_BUCKET_COUNT
public static final int DEFAULT_BUCKET_COUNT
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
MinHashFilter
public MinHashFilter(TokenStream input, int hashCount, int bucketCount, int hashSetSize, boolean withRotation)
Creates a MinHash filter.
- Parameters:
input - the token stream
hashCount - the number of hashes
bucketCount - the number of buckets for hashing
hashSetSize - the number of min hashes to keep
withRotation - whether to rotate hashes while incrementing tokens
-
-
Method Detail
-
incrementToken
public final boolean incrementToken() throws IOException
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
end
public void end() throws IOException
- Overrides:
end in class TokenFilter
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides:
reset in class TokenFilter
- Throws:
IOException
-
-