public class ClassicSimilarity extends TFIDFSimilarity
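Consider using BM25Similarity instead, which is generally considered superior to
TF-IDF.

Nested classes inherited from class Similarity: Similarity.SimScorer, Similarity.SimWeight

Fields inherited from class TFIDFSimilarity: discountOverlaps

| Constructor and Description |
|---|
| ClassicSimilarity(): Sole constructor, parameter-free. |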
| Modifier and Type | Method and Description |
|---|---|
| float | idf(long docFreq, long docCount): Implemented as log((docCount+1)/(docFreq+1)) + 1. |
| Explanation | idfExplain(CollectionStatistics collectionStats, TermStatistics termStats): Computes a score factor for a simple term and returns an explanation for that score factor. |
| float | lengthNorm(int numTerms): Implemented as 1/sqrt(length). |
| float | scorePayload(int doc, int start, int end, BytesRef payload): The default implementation returns 1. |
| float | sloppyFreq(int distance): Implemented as 1 / (distance + 1). |
| float | tf(float freq): Implemented as sqrt(freq). |
| String | toString() |
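
The formulas listed in the method summary can be tried out without a Lucene dependency. The following is a minimal sketch, not the library's source: the class name ClassicFormulas and its main method are hypothetical, the logarithm is assumed to be the natural logarithm (Math.log), and the length passed to lengthNorm is taken to be numTerms.

```java
// Hypothetical stand-alone sketch of the formulas documented for ClassicSimilarity.
// It does not call Lucene and is not the library's implementation.
final class ClassicFormulas {

    // Term-frequency factor: sqrt(freq).
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: log((docCount+1)/(docFreq+1)) + 1.
    // Assumption: "log" is the natural logarithm.
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    // Length normalization: 1/sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Sloppy phrase factor: 1 / (distance + 1).
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a 16-term field, present in 9 of 99 documents.
        System.out.println("tf         = " + tf(4.0f));
        System.out.println("idf        = " + idf(9L, 99L));
        System.out.println("lengthNorm = " + lengthNorm(16));
        System.out.println("sloppyFreq = " + sloppyFreq(1));
    }
}
```

With these definitions, a rarer term (smaller docFreq) gets a larger idf, a longer field gets a smaller lengthNorm, and a sloppier phrase match (larger distance) gets a smaller sloppyFreq.
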
Methods inherited from class TFIDFSimilarity: computeNorm, computeWeight, getDiscountOverlaps, idfExplain, setDiscountOverlaps, simScorer

public float lengthNorm(int numTerms)

Implemented as 1/sqrt(length).

Specified by: lengthNorm in class TFIDFSimilarity

Parameters: numTerms - the number of terms in the field, optionally discounting overlaps

public float tf(float freq)

Implemented as sqrt(freq).

Specified by: tf in class TFIDFSimilarity

Parameters: freq - the frequency of a term within a document

public float sloppyFreq(int distance)

Implemented as 1 / (distance + 1).

Specified by: sloppyFreq in class TFIDFSimilarity

Parameters: distance - the edit distance of this sloppy phrase match

See also: PhraseQuery.getSlop()

public float scorePayload(int doc, int start, int end, BytesRef payload)

The default implementation returns 1.

Specified by: scorePayload in class TFIDFSimilarity

Parameters:

doc - the docId currently being scored

start - the start position of the payload

end - the end position of the payload

payload - the payload byte array to be scored

public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Description copied from class: TFIDFSimilarity

Computes a score factor for a simple term and returns an explanation for that score factor.

The default implementation uses:

idf(docFreq, docCount);

Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used: when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.

Overrides: idfExplain in class TFIDFSimilarity

Parameters:

collectionStats - collection-level statistics

termStats - term-level statistics for the term

public float idf(long docFreq, long docCount)

Implemented as log((docCount+1)/(docFreq+1)) + 1.

Specified by: idf in class TFIDFSimilarity

Parameters:

docFreq - the number of documents which contain the term

docCount - the total number of documents in the collection

Copyright © 2000-2017 Apache Software Foundation. All Rights Reserved.