Class FuzzySet

java.lang.Object
org.apache.lucene.codecs.bloom.FuzzySet
All Implemented Interfaces:
Accountable

public class FuzzySet extends Object implements Accountable
A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.

The set is "lossy" in that it cannot definitively state that is does contain a value but it can definitively say if a value is not in the set. It can therefore be used as a Bloom Filter. Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.

This class is NOT threadsafe.

Internally a Bitset is used to record values and once a client has finished recording a stream of values the downsize(float) method can be used to create a suitably smaller set that is sized appropriately for the number of values recorded and desired saturation levels.

WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Field Details

  • Method Details

    • hashFunctionForVersion

      public static HashFunction hashFunctionForVersion(int version)
    • getNearestSetSize

      public static int getNearestSetSize(int maxNumberOfBits)
      Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number. Use this method where controlling memory use is paramount.
    • getNearestSetSize

      public static int getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
      Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.
      Parameters:
      desiredSaturation - A number between 0 and 1 expressing the % of bits set once all values have been recorded
      Returns:
      The size of the set nearest to the required size
    • createSetBasedOnMaxMemory

      public static FuzzySet createSetBasedOnMaxMemory(int maxNumBytes)
    • createSetBasedOnQuality

      public static FuzzySet createSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
    • contains

      public FuzzySet.ContainsResult contains(BytesRef value)
      The main method required for a Bloom filter which, given a value determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.
      Returns:
      NO or MAYBE
    • serialize

      public void serialize(DataOutput out) throws IOException
      Serializes the data set to file using the following format:
      • FuzzySet -->FuzzySetVersion,HashFunctionName,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords
      • HashFunctionName --> String The name of a ServiceProvider registered HashFunction
      • FuzzySetVersion --> Uint32 The version number of the FuzzySet class
      • BloomSize --> Uint32 The modulo value used to project hashes into the field's Bitset
      • NumBitSetWords --> Uint32 The number of longs (as returned from FixedBitSet.getBits())
      • BitSetWord --> Long A long from the array returned by FixedBitSet.getBits()
      Parameters:
      out - Data output stream
      Throws:
      IOException - If there is a low-level I/O error
    • deserialize

      public static FuzzySet deserialize(DataInput in) throws IOException
      Throws:
      IOException
    • addValue

      public void addValue(BytesRef value) throws IOException
      Records a value in the set. The referenced bytes are hashed and then modulo n'd where n is the chosen size of the internal bitset.
      Parameters:
      value - the key value to be hashed
      Throws:
      IOException - If there is a low-level I/O error
    • downsize

      public FuzzySet downsize(float targetMaxSaturation)
      Parameters:
      targetMaxSaturation - A number between 0 and 1 describing the % of bits that would ideally be set in the result. Lower values have better accuracy but require more space.
      Returns:
      a smaller FuzzySet or null if the current set is already over-saturated
    • getEstimatedUniqueValues

      public int getEstimatedUniqueValues()
    • getEstimatedNumberUniqueValuesAllowingForCollisions

      public static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
    • getSaturation

      public float getSaturation()
    • ramBytesUsed

      public long ramBytesUsed()
      Specified by:
      ramBytesUsed in interface Accountable
    • toString

      public String toString()
      Overrides:
      toString in class Object