Package org.apache.lucene.codecs.bloom
Class FuzzySet
java.lang.Object
org.apache.lucene.codecs.bloom.FuzzySet
- All Implemented Interfaces:
Accountable
A class used to represent a set of many, potentially large, values (e.g. many long strings such
as URLs), using a significantly smaller amount of memory.
The set is "lossy" in that it cannot definitively state that is does contain a value but it can definitively say if a value is not in the set. It can therefore be used as a Bloom Filter. Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.
This class is NOT threadsafe.
Internally a Bitset is used to record values and once a client has finished recording a stream
of values the downsize(float)
method can be used to create a suitably smaller set that
is sized appropriately for the number of values recorded and desired saturation levels.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
Nested Class Summary
Modifier and TypeClassDescriptionstatic enum
Result fromcontains(BytesRef)
: can never return definitively YES (always MAYBE), but can sometimes definitely return NO. -
Field Summary
Modifier and TypeFieldDescriptionstatic final int
static final int
static final int
Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE
-
Method Summary
Modifier and TypeMethodDescriptionvoid
Records a value in the set.The main method required for a Bloom filter which, given a value determines set membership.static FuzzySet
createSetBasedOnMaxMemory
(int maxNumBytes) static FuzzySet
createSetBasedOnQuality
(int maxNumUniqueValues, float desiredMaxSaturation) static FuzzySet
deserialize
(DataInput in) downsize
(float targetMaxSaturation) static int
getEstimatedNumberUniqueValuesAllowingForCollisions
(int setSize, int numRecordedBits) int
static int
getNearestSetSize
(int maxNumberOfBits) Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number.static int
getNearestSetSize
(int maxNumberOfValuesExpected, float desiredSaturation) Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.float
static HashFunction
hashFunctionForVersion
(int version) long
void
serialize
(DataOutput out) Serializes the data set to file using the following format: FuzzySet -->FuzzySetVersion,HashFunctionName,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords HashFunctionName -->String
The name of a ServiceProvider registeredHashFunction
FuzzySetVersion -->Uint32
The version number of theFuzzySet
class BloomSize -->Uint32
The modulo value used to project hashes into the field's Bitset NumBitSetWords -->Uint32
The number of longs (as returned fromFixedBitSet.getBits()
) BitSetWord -->Long
A long from the array returned byFixedBitSet.getBits()
toString()
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface org.apache.lucene.util.Accountable
getChildResources
-
Field Details
-
VERSION_SPI
public static final int VERSION_SPI- See Also:
-
VERSION_START
public static final int VERSION_START- See Also:
-
VERSION_CURRENT
public static final int VERSION_CURRENT- See Also:
-
-
Method Details
-
hashFunctionForVersion
-
getNearestSetSize
public static int getNearestSetSize(int maxNumberOfBits) Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number. Use this method where controlling memory use is paramount. -
getNearestSetSize
public static int getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation) Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.- Parameters:
desiredSaturation
- A number between 0 and 1 expressing the % of bits set once all values have been recorded- Returns:
- The size of the set nearest to the required size
-
createSetBasedOnMaxMemory
-
createSetBasedOnQuality
-
contains
The main method required for a Bloom filter which, given a value determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.- Returns:
- NO or MAYBE
-
serialize
Serializes the data set to file using the following format:- FuzzySet -->FuzzySetVersion,HashFunctionName,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords
- HashFunctionName -->
String
The name of a ServiceProvider registeredHashFunction
- FuzzySetVersion -->
Uint32
The version number of theFuzzySet
class - BloomSize -->
Uint32
The modulo value used to project hashes into the field's Bitset - NumBitSetWords -->
Uint32
The number of longs (as returned fromFixedBitSet.getBits()
) - BitSetWord -->
Long
A long from the array returned byFixedBitSet.getBits()
- Parameters:
out
- Data output stream- Throws:
IOException
- If there is a low-level I/O error
-
deserialize
- Throws:
IOException
-
addValue
Records a value in the set. The referenced bytes are hashed and then modulo n'd where n is the chosen size of the internal bitset.- Parameters:
value
- the key value to be hashed- Throws:
IOException
- If there is a low-level I/O error
-
downsize
- Parameters:
targetMaxSaturation
- A number between 0 and 1 describing the % of bits that would ideally be set in the result. Lower values have better accuracy but require more space.- Returns:
- a smaller FuzzySet or null if the current set is already over-saturated
-
getEstimatedUniqueValues
public int getEstimatedUniqueValues() -
getEstimatedNumberUniqueValuesAllowingForCollisions
public static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits) -
getSaturation
public float getSaturation() -
ramBytesUsed
public long ramBytesUsed()- Specified by:
ramBytesUsed
in interfaceAccountable
-
toString
-