Class BM25Similarity
java.lang.Object
org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.BM25Similarity
BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline
Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text
REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
Similarity.SimScorer
-
Constructor Summary
Constructors
Constructor / Description
BM25Similarity()
BM25 with these default values: k1 = 1.2, b = 0.75, discountOverlaps = true.
BM25Similarity(boolean discountOverlaps)
BM25 with these default values: k1 = 1.2, b = 0.75, and the supplied discountOverlaps value.
BM25Similarity(float k1, float b)
BM25 with the supplied parameter values.
BM25Similarity(float k1, float b, boolean discountOverlaps)
BM25 with the supplied parameter values.
-
Method Summary
Modifier and Type / Method / Description
protected float avgFieldLength(CollectionStatistics collectionStats)
The default implementation computes the average as sumTotalTermFreq / docCount.
final float getB()
Returns the b parameter.
final float getK1()
Returns the k1 parameter.
protected float idf(long docFreq, long docCount)
Implemented as log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5)).
idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
idfExplain(CollectionStatistics collectionStats, TermStatistics[] termStats)
Computes a score factor for a phrase.
final Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
toString()
Methods inherited from class org.apache.lucene.search.similarities.Similarity
computeNorm, getDiscountOverlaps
-
Constructor Details
-
BM25Similarity
public BM25Similarity(float k1, float b, boolean discountOverlaps)
BM25 with the supplied parameter values.
- Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
- Throws:
IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
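The argument rules in the Throws clause can be illustrated with a small standalone check. This is a minimal sketch, not the library's code: the class `Bm25ParameterCheckSketch` and its `check` method are hypothetical names, and rejecting NaN is an assumption that the text above does not state.

```java
/** Hypothetical helper, not part of Lucene: mirrors the documented argument rules. */
final class Bm25ParameterCheckSketch {

    private Bm25ParameterCheckSketch() {}

    /**
     * Throws IllegalArgumentException if k1 is infinite or negative,
     * or if b is outside [0..1].
     * Assumption: NaN is also rejected for both parameters.
     */
    static void check(float k1, float b) {
        // k1 must be a finite, non-negative value.
        if (Float.isNaN(k1) || Float.isInfinite(k1) || k1 < 0f) {
            throw new IllegalArgumentException("illegal k1 value: " + k1);
        }
        // b must lie in the closed range [0, 1].
        if (Float.isNaN(b) || b < 0f || b > 1f) {
            throw new IllegalArgumentException("illegal b value: " + b);
        }
    }
}
```

Under these assumptions, `check(1.2f, 0.75f)` returns normally, while `check(-1f, 0.75f)` and `check(1.2f, 1.5f)` throw.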
-
BM25Similarity
public BM25Similarity(float k1, float b)
BM25 with the supplied parameter values.
- Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
- Throws:
IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
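To make the roles of k1 and b concrete, the sketch below evaluates one common formulation of the BM25 term-frequency component. This page does not state the exact expression the class uses, so the formula is an assumption (some presentations also multiply the numerator by k1 + 1), and `Bm25SaturationSketch` and `tfComponent` are hypothetical names.

```java
/** Hypothetical illustration of how k1 and b shape the term-frequency component. */
final class Bm25SaturationSketch {

    private Bm25SaturationSketch() {}

    /**
     * Assumed form: freq / (freq + k1 * (1 - b + b * docLength / avgDocLength)).
     * Expects freq > 0 and avgDocLength > 0.
     */
    static double tfComponent(double freq, double docLength, double avgDocLength,
                              float k1, float b) {
        // b = 0 ignores document length; b = 1 normalizes fully by relative length.
        double lengthNorm = 1.0 - b + b * (docLength / avgDocLength);
        // Larger k1 makes the value approach its upper bound of 1 more slowly.
        return freq / (freq + k1 * lengthNorm);
    }
}
```

In this form the value grows with freq but never reaches 1 when k1 > 0, which is the saturation effect, and with b > 0 a longer-than-average document scores lower than a shorter one at the same freq.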
-
BM25Similarity
public BM25Similarity(boolean discountOverlaps)
BM25 with these default values:
k1 = 1.2
b = 0.75
- Parameters:
discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
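The effect of discountOverlaps on document length can be shown with a toy token stream. This is only a sketch of the counting rule described above: `OverlapDiscountSketch` and `fieldLength` are hypothetical names, and the array of position increments stands in for whatever the analyzer actually produces.

```java
/** Hypothetical illustration of how overlap tokens affect the counted field length. */
final class OverlapDiscountSketch {

    private OverlapDiscountSketch() {}

    /**
     * Counts tokens in a field. A token with a position increment of zero
     * (for example an injected synonym) is an overlap token and is skipped
     * when discountOverlaps is true.
     */
    static int fieldLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int increment : positionIncrements) {
            boolean overlap = increment == 0;
            if (!(discountOverlaps && overlap)) {
                length++;
            }
        }
        return length;
    }
}
```

For increments `{1, 0, 1, 1, 0}` the length is 3 with discounting and 5 without, so stacked tokens do not make the document look longer when the flag is true.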
-
BM25Similarity
public BM25Similarity()
BM25 with these default values:
k1 = 1.2
b = 0.75
discountOverlaps = true
-
-
Method Details
-
idf
protected float idf(long docFreq, long docCount)
Implemented as log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5)).
-
avgFieldLength
The default implementation computes the average as sumTotalTermFreq / docCount
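The division can be sketched with plain numbers in place of a CollectionStatistics object. `AvgFieldLengthSketch` is a hypothetical class, and converting to double before dividing is an assumption made here to avoid integer truncation.

```java
/** Hypothetical standalone version of the documented average-length computation. */
final class AvgFieldLengthSketch {

    private AvgFieldLengthSketch() {}

    /**
     * sumTotalTermFreq: total number of tokens in the field across the collection.
     * docCount: number of documents that have at least one term for the field.
     */
    static float avgFieldLength(long sumTotalTermFreq, long docCount) {
        // Divide in floating point so that 7 / 2 yields 3.5 rather than 3.
        return (float) (sumTotalTermFreq / (double) docCount);
    }
}
```

For example, 1000 tokens spread over 40 documents give an average field length of 25.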
-
idfExplain
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, docCount);
Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.
- Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
- Returns:
an Explain object that includes both an idf score factor and an explanation for the term.
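The idf value used here follows the formula documented for idf(long, long). A minimal plain-Java sketch of that formula, with the hypothetical class name `Bm25IdfSketch` and plain long arguments standing in for the statistics objects, might look like this:

```java
/** Hypothetical standalone version of the documented idf formula. */
final class Bm25IdfSketch {

    private Bm25IdfSketch() {}

    /** log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), using the natural logarithm. */
    static float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }
}
```

The value stays positive even when a term occurs in every document, and it grows as docFreq shrinks relative to docCount, so rare terms contribute more to the score.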
-
idfExplain
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
- Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the terms in the phrase
- Returns:
an Explain object that includes both an idf score factor for the phrase and an explanation for each term.
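The summation over phrase terms can be sketched as follows. `PhraseIdfSketch` is a hypothetical class; it takes raw document frequencies rather than TermStatistics objects and returns only the summed factor, not an explanation.

```java
/** Hypothetical illustration of summing per-term idf factors for a phrase. */
final class PhraseIdfSketch {

    private PhraseIdfSketch() {}

    /** Per-term idf: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    /** Sums the idf factor of each term in the phrase, using one shared docCount. */
    static double phraseIdf(long docCount, long... docFreqs) {
        double sum = 0d;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount);
        }
        return sum;
    }
}
```

A call such as `PhraseIdfSketch.phraseIdf(100, 5, 20)` returns the idf of a term found in 5 of 100 documents plus the idf of a term found in 20 of them.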
-
scorer
public final Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
Description copied from class: Similarity
Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
- Specified by:
scorer in class Similarity
- Parameters:
boost - a multiplicative factor to apply to the produced scores
collectionStats - collection-level statistics, such as the number of tokens in the collection.
termStats - term-level statistics, such as the document frequency of a term across the collection.
- Returns:
a Similarity.SimScorer object with the information this Similarity needs to score a query.
-
toString
-
getK1
public final float getK1()
Returns the k1 parameter.
-
getB
public final float getB()
Returns the b parameter.
-