java.lang.Object
  org.apache.lucene.search.similarities.Similarity
    org.apache.lucene.search.similarities.TFIDFSimilarity
      org.apache.lucene.search.similarities.DefaultSimilarity
public class DefaultSimilarity extends TFIDFSimilarity
Expert: Default scoring implementation which encodes norm values as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value.
This encoding/decoding reduces index size at the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.875.
Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms - for all documents - are maintained in memory.
The rationale for such lossy compression of norm values is that, given how difficult it is for users to express their true information need in a query, only big differences matter.
Last, note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity |
---|
Similarity.SimScorer, Similarity.SimWeight |
Field Summary | |
---|---|
protected boolean | discountOverlaps: True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length. |
Constructor Summary | |
---|---|
DefaultSimilarity() | Sole constructor: parameter-free |
Method Summary | |
---|---|
float | coord(int overlap, int maxOverlap): Implemented as overlap / maxOverlap. |
float | decodeNormValue(long norm): Decodes the norm value, assuming it is a single byte. |
long | encodeNormValue(float f): Encodes a normalization factor for storage in an index. |
boolean | getDiscountOverlaps(): Returns true if overlap tokens are discounted from the document's length. |
float | idf(long docFreq, long numDocs): Implemented as log(numDocs/(docFreq+1)) + 1. |
float | lengthNorm(FieldInvertState state): Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap(). |
float | queryNorm(float sumOfSquaredWeights): Implemented as 1/sqrt(sumOfSquaredWeights). |
float | scorePayload(int doc, int start, int end, BytesRef payload): The default implementation returns 1. |
void | setDiscountOverlaps(boolean v): Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norm. |
float | sloppyFreq(int distance): Implemented as 1 / (distance + 1). |
float | tf(float freq): Implemented as sqrt(freq). |
String | toString() |
Methods inherited from class org.apache.lucene.search.similarities.TFIDFSimilarity |
---|
computeNorm, computeWeight, idfExplain, idfExplain, simScorer |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
protected boolean discountOverlaps
Constructor Detail |
---|
public DefaultSimilarity()
Method Detail |
---|
public float coord(int overlap, int maxOverlap)
Implemented as overlap / maxOverlap.
Specified by: coord in class TFIDFSimilarity
Parameters:
overlap - the number of query terms matched in the document
maxOverlap - the total number of terms in the query
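The ratio can be sketched in plain Java as follows. This is a minimal standalone illustration, not the Lucene source: the class name `CoordSketch` is hypothetical, and the cast to `float` is an assumption made here so that the division is not truncated to an integer.

```java
final class CoordSketch {
    /** Fraction of query terms matched in the document. */
    static float coord(int overlap, int maxOverlap) {
        // Cast one operand so Java performs floating-point, not integer, division.
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        System.out.println(coord(2, 4)); // 2 of 4 query terms matched: prints 0.5
        System.out.println(coord(4, 4)); // every query term matched: prints 1.0
    }
}
```

A document that matches more of the query terms therefore receives a larger factor, up to 1 when all terms match.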
public float queryNorm(float sumOfSquaredWeights)
Implemented as 1/sqrt(sumOfSquaredWeights).
Specified by: queryNorm in class TFIDFSimilarity
Parameters:
sumOfSquaredWeights - the sum of the squares of query term weights
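As a rough illustration of what this factor does, the standalone sketch below computes 1/sqrt(sumOfSquaredWeights) for two made-up query term weights. The class name `QueryNormSketch` and the weights 3 and 4 are hypothetical and chosen only to keep the arithmetic easy to follow.

```java
final class QueryNormSketch {
    /** Normalization factor 1/sqrt(sumOfSquaredWeights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        float[] weights = {3f, 4f};           // hypothetical query term weights
        float sumOfSquares = 0f;
        for (float w : weights) {
            sumOfSquares += w * w;            // 9 + 16 = 25
        }
        float norm = queryNorm(sumOfSquares); // 1 / sqrt(25)
        System.out.println(norm);
    }
}
```

Multiplying each weight by the factor scales the weight vector to roughly unit length, which is what makes scores from different queries more comparable.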
public final long encodeNormValue(float f)
The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.
Specified by: encodeNormValue in class TFIDFSimilarity
See Also: Field.setBoost(float), SmallFloat
public final float decodeNormValue(long norm)
Decodes the norm value, assuming it is a single byte.
Specified by: decodeNormValue in class TFIDFSimilarity
See Also: encodeNormValue(float)
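The lossy round trip between a float and a single byte can be sketched without Lucene on the classpath. The code below is a standalone approximation based on the description above (zero-exponent point at 15). The class name `NormCodecSketch` and the exact bit manipulation are assumptions of this sketch, the real encoding lives in SmallFloat, and the sketch works on a `byte` directly rather than on the `long` used by the signatures above.

```java
final class NormCodecSketch {
    // Offset separating representable values from those too small to encode.
    private static final int ZERO_POINT = (63 - 15) << 3;

    /** Compresses a float to one byte by keeping only its top bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);         // drop the low 21 bits
        if (small <= ZERO_POINT) {
            // Zero and negatives map to 0; tiny positives to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1;                 // too large: largest code (0xFF)
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Expands a byte produced by encode back to a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float original = 0.89f;
        float roundTripped = decode(encode(original));
        System.out.println(original + " -> " + roundTripped); // prints "0.89 -> 0.875"
    }
}
```

Under these assumptions, negative inputs encode to zero, very large inputs saturate at the largest code, and nearby values such as 0.875 and 0.89 collapse to the same byte.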
public float lengthNorm(FieldInvertState state)
Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap().
Specified by: lengthNorm in class TFIDFSimilarity
Parameters:
state - statistics of the current field (such as length, boost, etc)
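The effect of the overlap setting on numTerms can be sketched as below. This is a standalone illustration that takes the field statistics as plain parameters instead of a FieldInvertState. The page does not define the inner lengthNorm(numTerms), so the sketch assumes 1/sqrt(numTerms); the class name `LengthNormSketch` is hypothetical.

```java
final class LengthNormSketch {
    /**
     * boost * lengthNorm(numTerms), where overlap tokens are subtracted
     * from the field length only when discountOverlaps is true.
     */
    static float lengthNorm(float boost, int length, int numOverlap,
                            boolean discountOverlaps) {
        int numTerms = discountOverlaps ? length - numOverlap : length;
        // Assumption of this sketch: the inner lengthNorm is 1/sqrt(numTerms).
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A field of 5 tokens, one of which has a position increment of zero.
        System.out.println(lengthNorm(1f, 5, 1, true));  // counts 4 terms
        System.out.println(lengthNorm(1f, 5, 1, false)); // counts 5 terms
    }
}
```

With discounting enabled, synonyms injected at the same position do not make the field look longer, so they do not lower its norm.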
public float tf(float freq)
Implemented as sqrt(freq).
Specified by: tf in class TFIDFSimilarity
Parameters:
freq - the frequency of a term within a document
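A short standalone sketch shows how the square root dampens repeated occurrences of a term. The class name `TfSketch` and the sample frequencies are hypothetical.

```java
final class TfSketch {
    /** Term-frequency factor sqrt(freq). */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Quadrupling the frequency only doubles the factor.
        for (float freq : new float[] {1f, 4f, 16f}) {
            System.out.println(freq + " -> " + tf(freq));
        }
    }
}
```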
public float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
Specified by: sloppyFreq in class TFIDFSimilarity
Parameters:
distance - the edit distance of this sloppy phrase match
See Also: PhraseQuery.setSlop(int)
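The falloff with distance can be sketched in standalone Java. The class name `SloppyFreqSketch` is hypothetical, and the float literal is an assumption made here so that the division is not performed on integers.

```java
final class SloppyFreqSketch {
    /** Weight of a sloppy phrase match at the given edit distance. */
    static float sloppyFreq(int distance) {
        // 1.0f forces floating-point division; 1 / (distance + 1) on ints
        // would yield 0 for every distance above zero.
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        System.out.println(sloppyFreq(0)); // exact phrase match: prints 1.0
        System.out.println(sloppyFreq(1)); // prints 0.5
        System.out.println(sloppyFreq(3)); // prints 0.25
    }
}
```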
public float scorePayload(int doc, int start, int end, BytesRef payload)
The default implementation returns 1.
Specified by: scorePayload in class TFIDFSimilarity
Parameters:
doc - The docId currently being scored.
start - The start position of the payload
end - The end position of the payload
payload - The payload byte array to be scored
public float idf(long docFreq, long numDocs)
Implemented as log(numDocs/(docFreq+1)) + 1.
Specified by: idf in class TFIDFSimilarity
Parameters:
docFreq - the number of documents which contain the term
numDocs - the total number of documents in the collection
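The formula can be sketched in standalone Java as below. The class name `IdfSketch` and the document counts are hypothetical. The sketch assumes the natural logarithm and casts to `double` before dividing, since dividing two `long` values would truncate the ratio.

```java
final class IdfSketch {
    /** Inverse document frequency: log(numDocs / (docFreq + 1)) + 1. */
    static float idf(long docFreq, long numDocs) {
        // Assumption of this sketch: "log" is the natural logarithm.
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        System.out.println("rare term:   " + idf(1, numDocs));
        System.out.println("common term: " + idf(500, numDocs));
    }
}
```

Terms that occur in few documents get a larger factor than terms that occur in many, and the +1 in the denominator avoids division by zero when docFreq is 0.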
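public void setDiscountOverlaps(boolean v)
Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norm.
See Also: TFIDFSimilarity.computeNorm(org.apache.lucene.index.FieldInvertState)
public boolean getDiscountOverlaps()
Returns true if overlap tokens are discounted from the document's length.
See Also: setDiscountOverlaps(boolean)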
public String toString()
Overrides: toString in class Object