Class BM25Similarity

java.lang.Object
org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.BM25Similarity

public class BM25Similarity extends Similarity
BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
  • Constructor Details

    • BM25Similarity

      public BM25Similarity(float k1, float b, boolean discountOverlaps)
      BM25 with the supplied parameter values.
      Parameters:
      k1 - Controls non-linear term frequency normalization (saturation).
      b - Controls to what degree document length normalizes tf values.
      discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
      Throws:
      IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
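      To make the roles of k1 and b concrete, the following is a minimal sketch of the textbook BM25 term-frequency component, tf / (tf + k1 * (1 - b + b * dl / avgdl)). The class and method names are hypothetical, and this is an illustration of the general formula, not a copy of this class's internals.

```java
// Illustrative sketch only: hypothetical names, not Lucene's implementation.
final class Bm25TfSketch {

    /**
     * Textbook BM25 term-frequency component.
     * k1 controls how quickly the contribution of tf saturates;
     * b controls how strongly document length scales that saturation.
     */
    static double tfNorm(double tf, double docLen, double avgDocLen, double k1, double b) {
        // b = 0 disables length normalization; b = 1 applies it fully.
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        // With an average-length document, the value grows with tf but stays below 1.
        for (int tf : new int[] {1, 2, 5, 20, 100}) {
            System.out.printf("tf=%d -> %.4f%n", tf, tfNorm(tf, 100, 100, 1.2, 0.75));
        }
    }
}
```

      In this sketch, b = 0 makes the result independent of document length, and k1 = 0 reduces the component to a binary "term present" signal.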
    • BM25Similarity

      public BM25Similarity(float k1, float b)
      BM25 with the supplied parameter values.
      Parameters:
      k1 - Controls non-linear term frequency normalization (saturation).
      b - Controls to what degree document length normalizes tf values.
      Throws:
      IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
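      The documented parameter constraints can be sketched as a standalone check. The helper below is hypothetical and only mirrors the conditions stated above; how a NaN value of k1 is treated is not stated here, so the sketch leaves it unaddressed.

```java
// Illustrative sketch only: a hypothetical helper mirroring the documented constraints.
final class Bm25ParamCheckSketch {

    /** Throws IllegalArgumentException if k1 is infinite or negative, or if b is outside [0..1]. */
    static void check(float k1, float b) {
        if (Float.isInfinite(k1) || k1 < 0) {
            throw new IllegalArgumentException("illegal k1 value: " + k1);
        }
        // Written as a negated range test so that a NaN value of b is also rejected.
        if (!(b >= 0 && b <= 1)) {
            throw new IllegalArgumentException("illegal b value: " + b);
        }
    }
}
```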
    • BM25Similarity

      public BM25Similarity(boolean discountOverlaps)
      BM25 with these default values:
      • k1 = 1.2
      • b = 0.75
      and the supplied parameter value:
      Parameters:
      discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
    • BM25Similarity

      public BM25Similarity()
      BM25 with these default values:
      • k1 = 1.2
      • b = 0.75
      • discountOverlaps = true
  • Method Details

    • idf

      protected float idf(long docFreq, long docCount)
      Implemented as log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5)).
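      As a rough illustration, the formula above can be written as a standalone function. The class name is hypothetical and the natural logarithm is assumed; the sketch takes plain long values rather than index statistics.

```java
// Illustrative sketch only: the documented idf formula as a standalone function.
final class Bm25IdfSketch {

    /** log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), using the natural logarithm. */
    static float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        // Rare terms receive a larger idf than common ones.
        System.out.println(idf(1, 1000));
        System.out.println(idf(500, 1000));
        System.out.println(idf(1000, 1000));
    }
}
```

      Because of the leading 1 inside the logarithm, the value stays positive even when a term occurs in every document.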
    • avgFieldLength

      protected float avgFieldLength(CollectionStatistics collectionStats)
      The default implementation computes the average as sumTotalTermFreq / docCount.
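      A minimal sketch of this average, assuming the two statistics are available as plain long values (the class name is hypothetical). The division is done in floating point, since integer division would truncate the result.

```java
// Illustrative sketch only: average field length from two collection-level counts.
final class AvgFieldLengthSketch {

    /** sumTotalTermFreq / docCount, computed in floating point to avoid integer truncation. */
    static float avgFieldLength(long sumTotalTermFreq, long docCount) {
        return (float) (sumTotalTermFreq / (double) docCount);
    }
}
```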
    • idfExplain

      public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
      Computes a score factor for a simple term and returns an explanation for that score factor.

      The default implementation uses:

       idf(docFreq, docCount);
       
      Note that CollectionStatistics.docCount() is used instead of IndexReader.numDocs() because TermStatistics.docFreq() is also used: when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() is not skewed when fields are sparse.
      Parameters:
      collectionStats - collection-level statistics
      termStats - term-level statistics for the term
      Returns:
      an Explanation object that includes both an idf score factor and an explanation for the term.
    • idfExplain

      public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics[] termStats)
      Computes a score factor for a phrase.

      The default implementation sums the idf factor for each term in the phrase.

      Parameters:
      collectionStats - collection-level statistics
      termStats - term-level statistics for the terms in the phrase
      Returns:
      an Explanation object that includes both an idf score factor for the phrase and an explanation for each term.
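      The summation described for phrases can be sketched as follows. The class name and the use of a long[] of per-term document frequencies are assumptions made for illustration; the real method works on TermStatistics objects and returns an Explanation.

```java
// Illustrative sketch only: phrase idf as the sum of per-term idf values.
final class PhraseIdfSketch {

    /** Per-term idf: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    static float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    /** Sums the idf factor of every term in the phrase. */
    static float phraseIdf(long[] docFreqs, long docCount) {
        double sum = 0;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount);
        }
        return (float) sum;
    }
}
```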
    • scorer

      public final Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
      Description copied from class: Similarity
      Computes any collection-level weights (e.g. IDF, average document length) needed for scoring a query.
      Specified by:
      scorer in class Similarity
      Parameters:
      boost - a multiplicative factor to apply to the produced scores
      collectionStats - collection-level statistics, such as the number of tokens in the collection.
      termStats - term-level statistics, such as the document frequency of a term across the collection.
      Returns:
      SimScorer object with the information this Similarity needs to score a query.
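      The split between collection-level work and per-document work can be sketched as a small two-phase object. Everything below is a hypothetical illustration that uses the textbook BM25 form and plain numeric inputs; it is not the SimScorer returned by this method.

```java
// Illustrative sketch only: precompute collection-level values once, then score per document.
final class Bm25ScorerSketch {
    private final double weight;    // boost * idf, fixed for the whole query term
    private final double avgDocLen; // average field length over the collection
    private final double k1;
    private final double b;

    Bm25ScorerSketch(float boost, long docFreq, long docCount, long sumTotalTermFreq,
                     double k1, double b) {
        double idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
        this.weight = boost * idf;
        this.avgDocLen = sumTotalTermFreq / (double) docCount;
        this.k1 = k1;
        this.b = b;
    }

    /** Per-document step: only the term frequency and document length vary here. */
    double score(double freq, double docLen) {
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        return weight * freq / (freq + norm);
    }
}
```

      Keeping the idf and the average length out of the per-document step means they are computed once per query term rather than once per matching document.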
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • getK1

      public final float getK1()
      Returns the k1 parameter.
    • getB

      public final float getB()
      Returns the b parameter.