public class FuzzySet extends Object
A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.
The set is "lossy" in that it cannot definitively state that is does contain a value but it can definitively say if a value is not in the set. It can therefore be used as a Bloom Filter.
Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.This class is NOT threadsafe.
Internally a Bitset is used to record values and once a client has finished recording
a stream of values the downsize(float)
method can be used to create a suitably smaller set that
is sized appropriately for the number of values recorded and desired saturation levels.
Modifier and Type | Class and Description |
---|---|
static class |
FuzzySet.ContainsResult
Result from
contains(BytesRef) :
can never return definitively YES (always MAYBE),
but can sometimes definitely return NO. |
Modifier and Type | Field and Description |
---|---|
static int |
VERSION_CURRENT |
static int |
VERSION_SPI |
static int |
VERSION_START |
Modifier and Type | Method and Description |
---|---|
void |
addValue(BytesRef value)
Records a value in the set.
|
FuzzySet.ContainsResult |
contains(BytesRef value)
The main method required for a Bloom filter which, given a value determines set membership.
|
static FuzzySet |
createSetBasedOnMaxMemory(int maxNumBytes) |
static FuzzySet |
createSetBasedOnQuality(int maxNumUniqueValues,
float desiredMaxSaturation) |
static FuzzySet |
deserialize(DataInput in) |
FuzzySet |
downsize(float targetMaxSaturation) |
static int |
getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize,
int numRecordedBits) |
int |
getEstimatedUniqueValues() |
static int |
getNearestSetSize(int maxNumberOfBits)
Rounds down required maxNumberOfBits to the nearest number that is made up
of all ones as a binary number.
|
static int |
getNearestSetSize(int maxNumberOfValuesExpected,
float desiredSaturation)
Use this method to choose a set size where accuracy (low content saturation) is more important
than deciding how much memory to throw at the problem.
|
float |
getSaturation() |
static HashFunction |
hashFunctionForVersion(int version) |
long |
ramBytesUsed() |
void |
serialize(DataOutput out)
Serializes the data set to file using the following format:
FuzzySet -->FuzzySetVersion,HashFunctionName,BloomSize,
NumBitSetWords,BitSetWordNumBitSetWords
HashFunctionName -->
String The
name of a ServiceProvider registered HashFunction
FuzzySetVersion --> Uint32 The version number of the FuzzySet class
BloomSize --> Uint32 The modulo value used
to project hashes into the field's Bitset
NumBitSetWords --> Uint32 The number of
longs (as returned from FixedBitSet.getBits() )
BitSetWord --> Long A long from the array
returned by FixedBitSet.getBits()
|
public static final int VERSION_SPI
public static final int VERSION_START
public static final int VERSION_CURRENT
public static HashFunction hashFunctionForVersion(int version)
public static int getNearestSetSize(int maxNumberOfBits)
public static int getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
desiredSaturation
- A number between 0 and 1 expressing the % of bits set once all values have been recordedpublic static FuzzySet createSetBasedOnMaxMemory(int maxNumBytes)
public static FuzzySet createSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
public FuzzySet.ContainsResult contains(BytesRef value)
public void serialize(DataOutput out) throws IOException
String
The
name of a ServiceProvider registered HashFunction
Uint32
The version number of the FuzzySet
classUint32
The modulo value used
to project hashes into the field's BitsetUint32
The number of
longs (as returned from FixedBitSet.getBits()
)Long
A long from the array
returned by FixedBitSet.getBits()
out
- Data output streamIOException
- If there is a low-level I/O errorpublic static FuzzySet deserialize(DataInput in) throws IOException
IOException
public void addValue(BytesRef value) throws IOException
value
- the key value to be hashedIOException
- If there is a low-level I/O errorpublic FuzzySet downsize(float targetMaxSaturation)
targetMaxSaturation
- A number between 0 and 1 describing the % of bits that would ideally be set in the
result. Lower values have better accuracy but require more space.public int getEstimatedUniqueValues()
public static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
public float getSaturation()
public long ramBytesUsed()
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.