Class MinHashFilter

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class MinHashFilter
    extends TokenFilter
    Generates min hash tokens from an incoming stream of tokens. The incoming tokens would typically be 5-word shingles.

    The number of hashes used and the number of minimum values kept for each hash can be set. You could have 1 hash and keep the 100 lowest values, or 100 hashes and keep the lowest one for each. Hashes can also be bucketed in ranges over the 128-bit hash space.
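
    The two configurations described above can be sketched in plain Java. This is only an illustration of the idea, not the filter's implementation: the class MinHashSketch, its methods, and the seeded 64-bit FNV-1a hash are hypothetical stand-ins (the filter itself uses a 128-bit hash), and bucketing is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
final class MinHashSketch {

    // Assumption: a seeded 64-bit FNV-1a hash stands in for the filter's 128-bit hash.
    static long hash(String token, int seed) {
        long h = 0xcbf29ce484222325L ^ seed;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // One hash function, keep the k lowest values (compared as unsigned numbers).
    static List<Long> lowestK(List<String> tokens, int k) {
        Comparator<Long> unsignedOrder = Long::compareUnsigned;
        TreeSet<Long> kept = new TreeSet<>(unsignedOrder);
        for (String token : tokens) {
            kept.add(hash(token, 0));
            if (kept.size() > k) {
                kept.pollLast(); // drop the current largest value
            }
        }
        return new ArrayList<>(kept);
    }

    // hashCount hash functions, keep only the lowest value for each one.
    static long[] lowestPerHash(List<String> tokens, int hashCount) {
        long[] mins = new long[hashCount];
        Arrays.fill(mins, -1L); // all bits set: the largest unsigned value
        for (String token : tokens) {
            for (int i = 0; i < hashCount; i++) {
                long h = hash(token, i);
                if (Long.compareUnsigned(h, mins[i]) < 0) {
                    mins[i] = h;
                }
            }
        }
        return mins;
    }

    public static void main(String[] args) {
        List<String> shingles = List.of(
                "the quick brown fox jumps",
                "quick brown fox jumps over",
                "brown fox jumps over the");
        System.out.println("1 hash, 2 lowest values: " + lowestK(shingles, 2));
        System.out.println("3 hashes, lowest of each: "
                + Arrays.toString(lowestPerHash(shingles, 3)));
    }
}
```

    Both variants depend only on the set of input tokens, not on their order, so two documents that share most of their shingles tend to produce overlapping signatures.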

    A 128-bit hash is used internally. 5-word shingles drawn from 10^5 words generate 10^25 combinations, so a 64-bit hash (about 1.8e19 values) would have collisions.

    When using different hashes, 32 bits are used for the hash position, leaving the remaining 96 bits for about 8e28 unique hashes. A single hash will use all 128 bits.
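
    The sizes quoted above can be checked with a few lines of BigInteger arithmetic. The sketch below assumes a vocabulary of 10^5 words and 5-word shingles; the class HashSpaceArithmetic and its method names are hypothetical and exist only for this illustration.

```java
import java.math.BigInteger;

// Hypothetical illustration class; not part of Lucene.
final class HashSpaceArithmetic {

    // Number of distinct shingles of the given length over a vocabulary.
    static BigInteger shingleCombinations(long vocabularySize, int shingleLength) {
        return BigInteger.valueOf(vocabularySize).pow(shingleLength);
    }

    // Number of distinct values representable in the given number of bits.
    static BigInteger hashSpace(int bits) {
        return BigInteger.ONE.shiftLeft(bits);
    }

    public static void main(String[] args) {
        BigInteger combinations = shingleCombinations(100_000L, 5); // 10^25

        System.out.println("shingle combinations: " + combinations);
        System.out.println("64-bit hash space:    " + hashSpace(64));
        System.out.println("96-bit hash space:    " + hashSpace(96));
        System.out.println("128-bit hash space:   " + hashSpace(128));

        // More possible shingles than 64-bit values: collisions cannot be ruled out.
        System.out.println("exceeds 64 bits:  " + (combinations.compareTo(hashSpace(64)) > 0));
        // The 96 bits left after the 32-bit hash position still cover all combinations.
        System.out.println("fits in 96 bits:  " + (combinations.compareTo(hashSpace(96)) < 0));
    }
}
```

    Under these assumptions, 10^25 exceeds 2^64 (about 1.8e19) but is well below both 2^96 (about 7.9e28) and 2^128.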

    • Constructor Detail

      • MinHashFilter

        public MinHashFilter(TokenStream input,
                             int hashCount,
                             int bucketCount,
                             int hashSetSize,
                             boolean withRotation)
        Creates a MinHash filter.
        Parameters:
        input - the token stream
        hashCount - the number of hashes
        bucketCount - the number of buckets for hashing
        hashSetSize - the number of min hashes to keep
        withRotation - whether to rotate hashes while incrementing tokens