public class ClassicSimilarity extends TFIDFSimilarity
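Consider using BM25Similarity instead, which is generally considered superior to TF-IDF.
Nested classes inherited from class Similarity: Similarity.SimScorer
Fields inherited from class TFIDFSimilarity: discountOverlaps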
Constructor and Description |
---|
ClassicSimilarity(): Sole constructor, parameter-free. |
Modifier and Type | Method and Description |
---|---|
float | idf(long docFreq, long docCount): Implemented as log((docCount+1)/(docFreq+1)) + 1. |
Explanation | idfExplain(CollectionStatistics collectionStats, TermStatistics termStats): Computes a score factor for a simple term and returns an explanation for that score factor. |
float | lengthNorm(int numTerms): Implemented as 1/sqrt(length). |
float | tf(float freq): Implemented as sqrt(freq). |
String | toString() |
Methods inherited from class TFIDFSimilarity: computeNorm, getDiscountOverlaps, idfExplain, scorer, setDiscountOverlaps
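
The three formulas in the method summary can be written out in plain Java to make them easier to check by hand. This is a minimal sketch, not Lucene source code: the class name `ClassicFormulasSketch` is hypothetical, and `log` is assumed to be the natural logarithm, as with Java's `Math.log`.

```java
// Plain-Java sketch of the documented formulas; not the Lucene implementation.
final class ClassicFormulasSketch {

    // tf(freq) = sqrt(freq)
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf(docFreq, docCount) = log((docCount+1)/(docFreq+1)) + 1
    // Assumption: natural logarithm. The division is done in double
    // to avoid integer truncation.
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    // lengthNorm(numTerms) = 1/sqrt(numTerms)
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(tf(4f));         // term occurs 4 times: sqrt(4) = 2.0
        System.out.println(idf(9L, 99L));   // term in 9 of 99 documents: ln(10) + 1, about 3.30
        System.out.println(lengthNorm(16)); // field with 16 terms: 1/sqrt(16) = 0.25
    }
}
```

Under these formulas, term frequency grows sublinearly, rare terms get a larger idf, and longer fields get a smaller length norm.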
public float lengthNorm(int numTerms)
Implemented as 1/sqrt(length), where length is the value of numTerms.
Specified by: lengthNorm in class TFIDFSimilarity
Parameters:
numTerms - the number of terms in the field, optionally discounting overlaps
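
The phrase "optionally discounting overlaps" may be easier to follow with numbers. The sketch below assumes that overlap tokens (for example, synonyms indexed at the same position) are subtracted from the field length when discounting is enabled. The class `LengthNormSketch` and the helper `effectiveTerms` are hypothetical and exist only for illustration.

```java
// Hypothetical sketch: how discounting overlaps changes the input to lengthNorm.
final class LengthNormSketch {

    // 1/sqrt(numTerms), as documented for lengthNorm
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumption: with discounting on, overlap tokens do not count toward the length.
    static int effectiveTerms(int fieldLength, int numOverlaps, boolean discountOverlaps) {
        return discountOverlaps ? fieldLength - numOverlaps : fieldLength;
    }

    public static void main(String[] args) {
        int fieldLength = 100; // tokens in the field, overlaps included
        int numOverlaps = 36;  // tokens that share a position with another token

        // Discounting on: 1/sqrt(64) = 0.125
        System.out.println(lengthNorm(effectiveTerms(fieldLength, numOverlaps, true)));
        // Discounting off: 1/sqrt(100) = 0.1
        System.out.println(lengthNorm(effectiveTerms(fieldLength, numOverlaps, false)));
    }
}
```

With discounting enabled, a field that is long only because of overlapping tokens is penalized less.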
public float tf(float freq)
Implemented as sqrt(freq).
Specified by: tf in class TFIDFSimilarity
Parameters:
freq - the frequency of a term within a document
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Description copied from class: TFIDFSimilarity
The default implementation uses:
idf(docFreq, docCount);
Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.
Overrides: idfExplain in class TFIDFSimilarity
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
public float idf(long docFreq, long docCount)
Implemented as log((docCount+1)/(docFreq+1)) + 1.
Specified by: idf in class TFIDFSimilarity
Parameters:
docFreq - the number of documents which contain the term
docCount - the total number of documents in the collection
Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.
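
The idfExplain note that CollectionStatistics.docCount() "does not skew when fields are sparse" can be illustrated with the documented idf formula. This is a sketch with made-up numbers: the class name `SparseFieldIdfSketch` and all three counts are hypothetical, and `log` is assumed to be the natural logarithm.

```java
// Hypothetical sketch: idf computed against a field's docCount versus the index-wide numDocs.
final class SparseFieldIdfSketch {

    // log((docCount+1)/(docFreq+1)) + 1, natural log assumed
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L; // made-up: all documents in the index
        long docCount = 1_000L;    // made-up: documents that have this (sparse) field
        long docFreq = 500L;       // made-up: documents containing the term in this field

        // Against the field's own documents the term is common: about 1.69
        System.out.println(idf(docFreq, docCount));
        // Against the whole index the same term looks rare: about 8.60
        System.out.println(idf(docFreq, numDocs));
    }
}
```

In this example the term appears in about half of the documents that have the field, yet using the index-wide count would score it as a rare term.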