org.apache.lucene.codecs.bloom
Class FuzzySet

java.lang.Object
  extended by org.apache.lucene.codecs.bloom.FuzzySet

public class FuzzySet
extends Object

A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.

The set is "lossy" in that it cannot definitively state that is does contain a value but it can definitively say if a value is not in the set. It can therefore be used as a Bloom Filter.

Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.

This class is NOT threadsafe.

Internally a Bitset is used to record values and once a client has finished recording a stream of values the downsize(float) method can be used to create a suitably smaller set that is sized appropriately for the number of values recorded and desired saturation levels.

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
static class FuzzySet.ContainsResult
          Result from contains(BytesRef): can never return definitively YES (always MAYBE), but can sometimes definitely return NO.
 
Field Summary
static int VERSION_CURRENT
           
static int VERSION_SPI
           
static int VERSION_START
           
 
Method Summary
 void addValue(BytesRef value)
          Records a value in the set.
 FuzzySet.ContainsResult contains(BytesRef value)
          The main method required for a Bloom filter which, given a value determines set membership.
static FuzzySet createSetBasedOnMaxMemory(int maxNumBytes)
           
static FuzzySet createSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
           
static FuzzySet deserialize(DataInput in)
           
 FuzzySet downsize(float targetMaxSaturation)
           
static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
           
 int getEstimatedUniqueValues()
           
static int getNearestSetSize(int maxNumberOfBits)
          Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number.
static int getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
          Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.
 float getSaturation()
           
static HashFunction hashFunctionForVersion(int version)
           
 void serialize(DataOutput out)
          Serializes the data set to file using the following format: FuzzySet -->FuzzySetVersion,HashFunctionName,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords HashFunctionName --> String The name of a ServiceProvider registered HashFunction FuzzySetVersion --> Uint32 The version number of the FuzzySet class BloomSize --> Uint32 The modulo value used to project hashes into the field's Bitset NumBitSetWords --> Uint32 The number of longs (as returned from FixedBitSet.getBits()) BitSetWord --> Long A long from the array returned by FixedBitSet.getBits()
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VERSION_SPI

public static final int VERSION_SPI
See Also:
Constant Field Values

VERSION_START

public static final int VERSION_START
See Also:
Constant Field Values

VERSION_CURRENT

public static final int VERSION_CURRENT
See Also:
Constant Field Values
Method Detail

hashFunctionForVersion

public static HashFunction hashFunctionForVersion(int version)

getNearestSetSize

public static int getNearestSetSize(int maxNumberOfBits)
Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number. Use this method where controlling memory use is paramount.


getNearestSetSize

public static int getNearestSetSize(int maxNumberOfValuesExpected,
                                    float desiredSaturation)
Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.

Parameters:
desiredSaturation - A number between 0 and 1 expressing the % of bits set once all values have been recorded
Returns:
The size of the set nearest to the required size

createSetBasedOnMaxMemory

public static FuzzySet createSetBasedOnMaxMemory(int maxNumBytes)

createSetBasedOnQuality

public static FuzzySet createSetBasedOnQuality(int maxNumUniqueValues,
                                               float desiredMaxSaturation)

contains

public FuzzySet.ContainsResult contains(BytesRef value)
The main method required for a Bloom filter which, given a value determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.

Returns:
NO or MAYBE

serialize

public void serialize(DataOutput out)
               throws IOException
Serializes the data set to file using the following format:

Parameters:
out - Data output stream
Throws:
IOException - If there is a low-level I/O error

deserialize

public static FuzzySet deserialize(DataInput in)
                            throws IOException
Throws:
IOException

addValue

public void addValue(BytesRef value)
              throws IOException
Records a value in the set. The referenced bytes are hashed and then modulo n'd where n is the chosen size of the internal bitset.

Parameters:
value - the key value to be hashed
Throws:
IOException - If there is a low-level I/O error

downsize

public FuzzySet downsize(float targetMaxSaturation)
Parameters:
targetMaxSaturation - A number between 0 and 1 describing the % of bits that would ideally be set in the result. Lower values have better accuracy but require more space.
Returns:
a smaller FuzzySet or null if the current set is already over-saturated

getEstimatedUniqueValues

public int getEstimatedUniqueValues()

getEstimatedNumberUniqueValuesAllowingForCollisions

public static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize,
                                                                      int numRecordedBits)

getSaturation

public float getSaturation()


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.