Class ClassicSimilarity

public class ClassicSimilarity extends TFIDFSimilarity
Expert: Historical scoring implementation. You might want to consider using BM25Similarity instead, which is generally considered superior to TF-IDF.
  • Constructor Details

    • ClassicSimilarity

      public ClassicSimilarity()
      Sole constructor; takes no parameters.
  • Method Details

    • lengthNorm

      public float lengthNorm(int numTerms)
      Implemented as 1/sqrt(numTerms).
      Specified by:
      lengthNorm in class TFIDFSimilarity
      Parameters:
      numTerms - the number of terms in the field, optionally discounting overlaps
      Returns:
      a length normalization value
      WARNING: This API is experimental and might change in incompatible ways in the next release.
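
As a rough illustration of the formula, the standalone sketch below recomputes the documented value for a few field lengths. It does not call Lucene; the class name `LengthNormSketch` and the sample lengths are assumptions made for this example only.

```java
// Standalone sketch: mirrors the documented formula and does not call Lucene.
class LengthNormSketch {
    // 1/sqrt(numTerms), as stated in the description of lengthNorm.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Hypothetical field lengths, chosen only to show the trend.
        for (int numTerms : new int[] {1, 4, 16, 100}) {
            System.out.println(numTerms + " terms -> " + lengthNorm(numTerms));
        }
    }
}
```

Longer fields get a smaller factor, so a match in a short field counts for more than the same match in a long one.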
    • tf

      public float tf(float freq)
      Implemented as sqrt(freq).
      Specified by:
      tf in class TFIDFSimilarity
      Parameters:
      freq - the frequency of a term within a document
      Returns:
      a score factor based on a term's within-document frequency
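
The square root dampens repeated occurrences of a term. The following standalone sketch, which assumes nothing beyond the formula above and does not call Lucene, shows the effect; `TfSketch` and the sample frequencies are hypothetical.

```java
// Standalone sketch: mirrors the documented formula and does not call Lucene.
class TfSketch {
    // sqrt(freq), as stated in the description of tf.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Hypothetical within-document frequencies.
        for (float freq : new float[] {1f, 4f, 9f, 100f}) {
            System.out.println("freq " + freq + " -> tf " + tf(freq));
        }
    }
}
```

Under this formula a term that occurs 100 times contributes 10 times as much as a term that occurs once, not 100 times as much.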
    • idfExplain

      public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
      Description copied from class: TFIDFSimilarity
      Computes a score factor for a simple term and returns an explanation for that score factor.

      The default implementation uses:

       idf(docFreq, docCount);
      Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used: when docFreq() is inaccurate, docCount() is inaccurate in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.
      Overrides:
      idfExplain in class TFIDFSimilarity
      Parameters:
      collectionStats - collection-level statistics
      termStats - term-level statistics for the term
      Returns:
      an Explanation object that includes both an idf score factor and an explanation for the term.
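
The sparse-field remark can be illustrated with made-up numbers. The sketch below is standalone and does not call Lucene: it applies the formula this class documents for idf, log((docCount+1)/(docFreq+1)) + 1, once with a total document count and once with the count of documents that have the field. The natural logarithm, the class name `SparseFieldIdfSketch`, and all figures are assumptions for illustration.

```java
// Standalone sketch: the numbers are hypothetical and nothing here calls Lucene.
class SparseFieldIdfSketch {
    // The formula this class documents for idf; natural log is assumed.
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L; // hypothetical: all documents in the index
        long docCount = 1_000L;    // hypothetical: documents that have the field
        long docFreq = 900L;       // hypothetical: documents containing the term

        // With numDocs the term looks rare, although it occurs in 90% of the
        // documents that actually have the field.
        System.out.println("idf with numDocs:  " + idf(docFreq, numDocs));
        System.out.println("idf with docCount: " + idf(docFreq, docCount));
    }
}
```

With the field-level count the factor stays close to 1, which matches the intuition that the term is common within that field.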
    • idf

      public float idf(long docFreq, long docCount)
      Implemented as log((docCount+1)/(docFreq+1)) + 1.
      Specified by:
      idf in class TFIDFSimilarity
      Parameters:
      docFreq - the number of documents which contain the term
      docCount - the total number of documents in the collection
      Returns:
      a score factor based on the term's document frequency
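
A minimal standalone sketch of the formula above, assuming `log` denotes the natural logarithm (`Math.log`). It does not call Lucene; `IdfSketch` and the sample counts are hypothetical.

```java
// Standalone sketch: mirrors the documented formula and does not call Lucene.
class IdfSketch {
    // log((docCount+1)/(docFreq+1)) + 1, with natural log assumed.
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long docCount = 1_000L; // hypothetical collection size
        // Hypothetical document frequencies, from rare to present everywhere.
        for (long docFreq : new long[] {1L, 10L, 100L, 1_000L}) {
            System.out.println("docFreq " + docFreq + " -> idf " + idf(docFreq, docCount));
        }
    }
}
```

The +1 terms inside the logarithm avoid division by zero when docFreq is 0, and the trailing +1 keeps the factor at 1 or above whenever docFreq does not exceed docCount, so a term found in every document still contributes to the score.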
    • toString

      public String toString()
      Overrides:
      toString in class Object