public class ClassicSimilarity extends TFIDFSimilarity
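Consider using BM25Similarity instead, which is generally considered superior to TF-IDF.
Nested classes inherited from class Similarity: Similarity.SimScorer, Similarity.SimWeight
Fields inherited from class TFIDFSimilarity: discountOverlaps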
Constructor and Description |
---|
ClassicSimilarity(): Sole constructor: parameter-free |
Modifier and Type | Method and Description |
---|---|
float | idf(long docFreq, long docCount): Implemented as log((docCount+1)/(docFreq+1)) + 1. |
Explanation | idfExplain(CollectionStatistics collectionStats, TermStatistics termStats): Computes a score factor for a simple term and returns an explanation for that score factor. |
float | lengthNorm(int numTerms): Implemented as 1/sqrt(length). |
float | scorePayload(int doc, int start, int end, BytesRef payload): The default implementation returns 1. |
float | sloppyFreq(int distance): Implemented as 1 / (distance + 1). |
float | tf(float freq): Implemented as sqrt(freq). |
String | toString() |
Methods inherited from class TFIDFSimilarity: computeNorm, computeWeight, getDiscountOverlaps, idfExplain, setDiscountOverlaps, simScorer
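As a rough illustration of the idf formula listed in the method summary, the sketch below re-implements log((docCount+1)/(docFreq+1)) + 1 in plain Java. It assumes that log means the natural logarithm (Math.log). The class name IdfSketch is hypothetical and not part of Lucene.

```java
// Standalone sketch of the documented idf formula; not Lucene source code.
// Assumption: "log" is the natural logarithm (Math.log).
final class IdfSketch {
    private IdfSketch() {}

    /** log((docCount+1)/(docFreq+1)) + 1, computed in double precision. */
    static float idf(long docFreq, long docCount) {
        // Cast to double before dividing so the ratio is not truncated.
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long docCount = 1_000;
        // Rarer terms (smaller docFreq) receive a larger idf.
        for (long docFreq : new long[] {1, 10, 100, 1_000}) {
            System.out.printf("docFreq=%d idf=%.4f%n", docFreq, idf(docFreq, docCount));
        }
    }
}
```

Under this formula a term that occurs in every document (docFreq equal to docCount) gets an idf of exactly 1, which is the smallest value the formula produces when docFreq does not exceed docCount.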
public float lengthNorm(int numTerms)
Implemented as 1/sqrt(length).
Specified by: lengthNorm in class TFIDFSimilarity
Parameters:
numTerms - the number of terms in the field, optionally discounting overlaps
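To make the effect of 1/sqrt(length) concrete, here is a minimal plain-Java sketch, assuming that length is the numTerms argument. LengthNormSketch is a hypothetical name; the class only mirrors the documented formula.

```java
// Standalone sketch of the documented lengthNorm formula; not Lucene source code.
// Assumption: "length" in 1/sqrt(length) is the numTerms argument.
final class LengthNormSketch {
    private LengthNormSketch() {}

    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Longer fields get a smaller norm, so a match in a short field weighs more.
        for (int numTerms : new int[] {1, 4, 25, 100}) {
            System.out.println(numTerms + " terms -> " + lengthNorm(numTerms));
        }
    }
}
```

With this formula a 4-term field gets a norm of 0.5, while a 100-term field gets a norm of about 0.1.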
public float tf(float freq)
Implemented as sqrt(freq).
Specified by: tf in class TFIDFSimilarity
Parameters:
freq - the frequency of a term within a document
public float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
Specified by: sloppyFreq in class TFIDFSimilarity
Parameters:
distance - the edit distance of this sloppy phrase match
See Also: PhraseQuery.getSlop()
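The tf and sloppyFreq formulas can be tried together in a small plain-Java sketch. It assumes that the sloppyFreq values of the individual phrase matches in a document are summed, and that this sum is the frequency passed to tf. The class SloppyTfSketch and its phraseTf helper are hypothetical names, not Lucene API.

```java
// Standalone sketch of the documented tf and sloppyFreq formulas; not Lucene source code.
final class SloppyTfSketch {
    private SloppyTfSketch() {}

    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Hypothetical helper. Assumption: each sloppy match contributes
    // sloppyFreq(distance), and the sum is used as the frequency for tf.
    static float phraseTf(int... distances) {
        float freq = 0f;
        for (int distance : distances) {
            freq += sloppyFreq(distance);
        }
        return tf(freq);
    }

    public static void main(String[] args) {
        System.out.println("exact match:        " + phraseTf(0));
        System.out.println("one match, slop 3:  " + phraseTf(3));
        System.out.println("exact plus slop 3:  " + phraseTf(0, 3));
    }
}
```

An exact match (distance 0) contributes a full 1 to the frequency, while a match at distance 3 contributes only 0.25, so looser matches raise the score less.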
public float scorePayload(int doc, int start, int end, BytesRef payload)
The default implementation returns 1.
Specified by: scorePayload in class TFIDFSimilarity
Parameters:
doc - The docId currently being scored.
start - The start position of the payload
end - The end position of the payload
payload - The payload byte array to be scored
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Description copied from class: TFIDFSimilarity
The default implementation uses: idf(docFreq, docCount);
Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.
Overrides: idfExplain in class TFIDFSimilarity
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
public float idf(long docFreq, long docCount)
Implemented as log((docCount+1)/(docFreq+1)) + 1.
Specified by: idf in class TFIDFSimilarity
Parameters:
docFreq - the number of documents which contain the term
docCount - the total number of documents in the collection
Copyright © 2000-2018 Apache Software Foundation. All Rights Reserved.