public class ClassicSimilarity extends TFIDFSimilarity
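Consider using BM25Similarity instead, which is generally considered superior to TF-IDF.

Nested classes inherited from class Similarity: Similarity.SimScorer

Fields inherited from class TFIDFSimilarity: discountOverlaps

| Constructor and Description |
|---|
| ClassicSimilarity() Sole constructor: parameter-free |
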
| Modifier and Type | Method and Description |
|---|---|
| float | idf(long docFreq, long docCount) Implemented as log((docCount+1)/(docFreq+1)) + 1. |
| Explanation | idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) Computes a score factor for a simple term and returns an explanation for that score factor. |
| float | lengthNorm(int numTerms) Implemented as 1/sqrt(length). |
| float | tf(float freq) Implemented as sqrt(freq). |
| String | toString() |
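
The three formulas in the method summary can be sketched in plain Java with no Lucene dependency. This is a minimal illustration, not the library's source: the class name ClassicFormulasSketch is hypothetical, and reading log as the natural logarithm (Math.log) is an assumption, since the summary only says log.

```java
// Hypothetical stand-alone sketch of the documented formulas; not part of Lucene.
final class ClassicFormulasSketch {

    // tf: square root of the term's frequency within a document.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf: log((docCount+1)/(docFreq+1)) + 1.
    // Assumption: log is the natural logarithm.
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    // lengthNorm: 1/sqrt(numTerms), so longer fields get a smaller norm.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(tf(4f));          // prints 2.0
        System.out.println(idf(9, 9));       // term in every document: log(1) + 1, prints 1.0
        System.out.println(lengthNorm(16));  // prints 0.25
    }
}
```

With these definitions, a term that occurs in fewer documents receives a larger idf, and the +1 terms keep the ratio finite even when docFreq is 0.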

Methods inherited from class TFIDFSimilarity: computeNorm, getDiscountOverlaps, idfExplain, scorer, setDiscountOverlaps

public float lengthNorm(int numTerms)
Implemented as 1/sqrt(length), where length is the numTerms argument.
Specified by: lengthNorm in class TFIDFSimilarity
Parameters: numTerms - the number of terms in the field, optionally discounting overlaps

public float tf(float freq)
Implemented as sqrt(freq).
Specified by: tf in class TFIDFSimilarity
Parameters: freq - the frequency of a term within a document

public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Description copied from class: TFIDFSimilarity
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, docCount);
Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used: when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.
Overrides: idfExplain in class TFIDFSimilarity
Parameters: collectionStats - collection-level statistics
termStats - term-level statistics for the term

public float idf(long docFreq, long docCount)
Implemented as log((docCount+1)/(docFreq+1)) + 1.
Specified by: idf in class TFIDFSimilarity
Parameters: docFreq - the number of documents which contain the term
docCount - the total number of documents in the collection

Copyright © 2000-2020 Apache Software Foundation. All Rights Reserved.