Generate min hash tokens from an incoming stream of tokens. The incoming tokens would typically be 5 word shingles.
The number of hashes used and the number of minimum values for each hash can be set. You could have 1 hash and keep
the 100 lowest values or 100 hashes and keep the lowest one for each. Hashes can also be bucketed in ranges over the
128-bit hash space,
A 128-bit hash is used internally. 5 word shingles from 10e5 words generate 10e25 combinations So a 64 bit hash would
have collisions (1.8e19)
When using different hashes 32 bits are used for the hash position leaving scope for 8e28 unique hashes. A single
hash will use all 128 bits.
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource