public class ClassicSimilarity extends TFIDFSimilarity
ClassicSimilarity encodes norm values as a single byte before they are stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value.
This encoding/decoding, while reducing index size, comes at the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.875.
Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms - for all documents - are maintained in memory.
The rationale for such lossy compression of norm values is that, given how difficult it is for users to express their true information need accurately in a query, only big differences matter.
Last, note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search, because the norm values have already been encoded and stored in the index.
Nested classes/interfaces inherited from class Similarity: Similarity.SimScorer, Similarity.SimWeight
Modifier and Type | Field and Description |
---|---|
protected boolean | discountOverlaps: True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length. |
Constructor and Description |
---|
ClassicSimilarity(): Sole constructor, parameter-free. |
Modifier and Type | Method and Description |
---|---|
float | coord(int overlap, int maxOverlap): Implemented as overlap / maxOverlap. |
float | decodeNormValue(long norm): Decodes the norm value, assuming it is a single byte. |
long | encodeNormValue(float f): Encodes a normalization factor for storage in an index. |
boolean | getDiscountOverlaps(): Returns true if overlap tokens are discounted from the document's length. |
float | idf(long docFreq, long docCount): Implemented as log((docCount+1)/(docFreq+1)) + 1. |
Explanation | idfExplain(CollectionStatistics collectionStats, TermStatistics termStats): Computes a score factor for a simple term and returns an explanation for that score factor. |
float | lengthNorm(FieldInvertState state): Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap(). |
float | queryNorm(float sumOfSquaredWeights): Implemented as 1/sqrt(sumOfSquaredWeights). |
float | scorePayload(int doc, int start, int end, BytesRef payload): The default implementation returns 1. |
void | setDiscountOverlaps(boolean v): Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norm. |
float | sloppyFreq(int distance): Implemented as 1 / (distance + 1). |
float | tf(float freq): Implemented as sqrt(freq). |
String | toString() |
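The closed-form factors listed in the method summary can be sketched in plain Java without any Lucene dependency. This is only an illustration of the formulas as written in the table, not the library's source: the class name ClassicFactorsSketch is hypothetical, and the sketch assumes that "log" means the natural logarithm.

```java
// Illustrative sketch only; ClassicFactorsSketch is a hypothetical name.
final class ClassicFactorsSketch {

    // overlap / maxOverlap, computed in floating point
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // log((docCount+1)/(docFreq+1)) + 1, assuming a natural logarithm
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    // 1/sqrt(sumOfSquaredWeights)
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // 1 / (distance + 1)
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // sqrt(freq)
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        System.out.println(coord(2, 4));     // 0.5
        System.out.println(tf(4f));          // 2.0
        System.out.println(sloppyFreq(1));   // 0.5
        System.out.println(queryNorm(4f));   // 0.5
        System.out.println(idf(9, 99));      // ln(10) + 1
    }
}
```

Note how tf and idf both grow sublinearly, so a term that occurs four times contributes only twice the tf of a term that occurs once.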
Methods inherited from class TFIDFSimilarity: computeNorm, computeWeight, idfExplain, simScorer
protected boolean discountOverlaps
public float coord(int overlap, int maxOverlap)
Implemented as overlap / maxOverlap.
Specified by: coord in class TFIDFSimilarity
Parameters:
overlap - the number of query terms matched in the document
maxOverlap - the total number of terms in the query
public float queryNorm(float sumOfSquaredWeights)
Implemented as 1/sqrt(sumOfSquaredWeights).
Specified by: queryNorm in class TFIDFSimilarity
Parameters:
sumOfSquaredWeights - the sum of the squares of query term weights
public final long encodeNormValue(float f)
The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.
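Specified by: encodeNormValue in class TFIDFSimilarity
See Also: Field.setBoost(float), SmallFloat
public final float decodeNormValue(long norm)
Decodes the norm value, assuming it is a single byte.
Specified by: decodeNormValue in class TFIDFSimilarity
See Also: encodeNormValue(float)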
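The single-byte round trip performed by encodeNormValue and decodeNormValue can be illustrated with a small standalone sketch. It is assumed here to mirror the SmallFloat scheme referenced above (mantissa parameter 3, zero-exponent point at 15); the class name NormByteSketch and its method names are hypothetical, and the code is not the library's implementation.

```java
// Illustrative sketch only; NormByteSketch is a hypothetical name.
final class NormByteSketch {
    private static final int MANTISSA_BITS = 3;  // mantissa parameter from the description
    private static final int ZERO_EXP = 15;      // zero-exponent point from the description
    private static final int FZERO = (63 - ZERO_EXP) << MANTISSA_BITS;

    // Compress a float into one byte by keeping only its top bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - MANTISSA_BITS);
        if (small <= FZERO) {
            // Zero and negatives map to 0; tiny positives map to the smallest positive code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return (byte) -1; // too large: clamp to the largest representable value
        }
        return (byte) (small - FZERO);
    }

    // Expand the byte back into a float; the dropped low bits are gone for good.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - MANTISSA_BITS);
        bits += (63 - ZERO_EXP) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(0.89f))); // 0.875
        System.out.println(decode(encode(1.0f)));  // 1.0
        System.out.println(encode(-5.0f));         // 0
    }
}
```

Because the low-order bits of the float are discarded during encoding, nearby values such as 0.89 and 0.875 collapse to the same byte, which is the precision loss described in the class overview.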
public float lengthNorm(FieldInvertState state)
Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap().
Specified by: lengthNorm in class TFIDFSimilarity
Parameters:
state - statistics of the current field (such as length, boost, etc)
public float tf(float freq)
Implemented as sqrt(freq).
Specified by: tf in class TFIDFSimilarity
Parameters:
freq - the frequency of a term within a document
public float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
Specified by: sloppyFreq in class TFIDFSimilarity
Parameters:
distance - the edit distance of this sloppy phrase match
See Also: PhraseQuery.getSlop()
public float scorePayload(int doc, int start, int end, BytesRef payload)
The default implementation returns 1.
Specified by: scorePayload in class TFIDFSimilarity
Parameters:
doc - The docId currently being scored.
start - The start position of the payload
end - The end position of the payload
payload - The payload byte array to be scored
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Description copied from class: TFIDFSimilarity
The default implementation uses:
idf(docFreq, docCount);
Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.
Overrides: idfExplain in class TFIDFSimilarity
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
public float idf(long docFreq, long docCount)
Implemented as log((docCount+1)/(docFreq+1)) + 1.
Specified by: idf in class TFIDFSimilarity
Parameters:
docFreq - the number of documents which contain the term
docCount - the total number of documents in the collection
public void setDiscountOverlaps(boolean v)
See Also: TFIDFSimilarity.computeNorm(org.apache.lucene.index.FieldInvertState)
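The effect of the discountOverlaps flag on the length norm can be sketched as follows. This is a hedged illustration rather than the library's code: the class name LengthNormSketch is hypothetical, the plain parameters stand in for FieldInvertState.getLength(), FieldInvertState.getNumOverlap() and the field boost, and the sketch assumes that lengthNorm(numTerms) is 1/sqrt(numTerms), which this page does not state.

```java
// Illustrative sketch only; LengthNormSketch is a hypothetical name.
final class LengthNormSketch {

    // length, numOverlap and boost are stand-ins for the FieldInvertState values.
    static float lengthNorm(int length, int numOverlap, float boost, boolean discountOverlaps) {
        // When overlaps are discounted, tokens with a position increment of zero do not count.
        int numTerms = discountOverlaps ? length - numOverlap : length;
        // Assumption: the per-length factor is 1/sqrt(numTerms).
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A 20-token field in which 4 tokens are overlaps (e.g. injected synonyms).
        System.out.println(lengthNorm(20, 4, 1.0f, true));   // 0.25, as for a 16-term field
        System.out.println(lengthNorm(20, 4, 1.0f, false));  // smaller, all 20 tokens count
    }
}
```

With discounting enabled, adding overlap tokens such as synonyms at the same position does not make a field look longer, so it is not penalized for them.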
public boolean getDiscountOverlaps()
See Also: setDiscountOverlaps(boolean)
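To make the idfExplain remark about sparse fields concrete, the sketch below compares the idf formula when fed a per-field document count against the same formula fed the total number of documents. All numbers are invented for illustration, the class name SparseFieldIdfSketch is hypothetical, and a natural logarithm is assumed.

```java
// Illustrative sketch only; SparseFieldIdfSketch and all figures are hypothetical.
final class SparseFieldIdfSketch {

    // log((docCount+1)/(docFreq+1)) + 1, assuming a natural logarithm
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000; // every document in the index
        long docCount = 1_000;    // documents that actually have a value for the field
        long docFreq = 900;       // documents whose field contains the term

        // Within the field the term is very common, so its idf should be low.
        System.out.println("idf with docCount: " + idf(docFreq, docCount));
        // Using the whole index size makes the same term look rare.
        System.out.println("idf with numDocs:  " + idf(docFreq, numDocs));
    }
}
```

In this made-up case the term appears in 90% of the documents that have the field, yet dividing by the full index size would score it as a rare term; using the per-field count avoids that skew.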
Copyright © 2000-2017 Apache Software Foundation. All Rights Reserved.