Class ClassicSimilarity


public class ClassicSimilarity extends TFIDFSimilarity
Expert: Historical scoring implementation based on TF-IDF. Consider using BM25Similarity instead, which is generally considered superior to TF-IDF.
  • Constructor Details

    • ClassicSimilarity

      public ClassicSimilarity()
      Sole constructor; takes no parameters.
  • Method Details

    • lengthNorm

      public float lengthNorm(int numTerms)
      Implemented as 1/sqrt(numTerms).
      Specified by:
      lengthNorm in class TFIDFSimilarity
      Parameters:
      numTerms - the number of terms in the field, optionally discounting overlaps
      Returns:
      a length normalization value
      WARNING: This API is experimental and might change in incompatible ways in the next release.
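      As a rough illustration of the formula above, the following standalone sketch evaluates 1/sqrt(numTerms) for a short and a long field. The class LengthNormSketch and its helper are hypothetical names used only for this example; they are not part of Lucene and are not presented as the library's own code.

```java
// Hypothetical standalone sketch; not part of Lucene.
final class LengthNormSketch {

    // Mirrors the documented formula 1/sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A 4-term field gets a larger normalization value than a 100-term field,
        // so a match in a shorter field contributes more to the score.
        System.out.println(lengthNorm(4));   // 0.5
        System.out.println(lengthNorm(100)); // 0.1
    }
}
```

      The value shrinks as the field grows, which is how longer fields are discounted.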
    • tf

      public float tf(float freq)
      Implemented as sqrt(freq).
      Specified by:
      tf in class TFIDFSimilarity
      Parameters:
      freq - the frequency of a term within a document
      Returns:
      a score factor based on a term's within-document frequency
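      To show how the square root dampens repeated occurrences, here is a minimal standalone sketch of the documented formula sqrt(freq). TfSketch is a hypothetical class name chosen for this example and does not exist in Lucene.

```java
// Hypothetical standalone sketch; not part of Lucene.
final class TfSketch {

    // Mirrors the documented formula sqrt(freq).
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Frequency grows 1 -> 4 -> 9, but the score factor only grows 1 -> 2 -> 3.
        System.out.println(tf(1f)); // 1.0
        System.out.println(tf(4f)); // 2.0
        System.out.println(tf(9f)); // 3.0
    }
}
```

      Nine occurrences of a term therefore count three times as much as one occurrence, not nine times.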
    • idfExplain

      public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
      Description copied from class: TFIDFSimilarity
      Computes a score factor for a simple term and returns an explanation for that score factor.

      The default implementation uses:

       idf(docFreq, docCount);
       
      Note that CollectionStatistics.docCount() is used instead of IndexReader.numDocs() because it is paired with TermStatistics.docFreq(): when docFreq() is inaccurate, docCount() is inaccurate too, and in the same direction. In addition, CollectionStatistics.docCount() is not skewed when fields are sparse.
      Overrides:
      idfExplain in class TFIDFSimilarity
      Parameters:
      collectionStats - collection-level statistics
      termStats - term-level statistics for the term
      Returns:
      an Explanation object that includes both an idf score factor and an explanation for the term.
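      The remark about sparse fields can be illustrated with a small standalone sketch. It assumes the formula log((docCount+1)/(docFreq+1)) + 1 with a natural logarithm, and uses made-up numbers: an index of 1,000,000 documents in which only 100 contain the field at all. SparseFieldIdfSketch and all values are hypothetical and are not taken from Lucene.

```java
// Hypothetical standalone sketch; not part of Lucene. All numbers are invented.
final class SparseFieldIdfSketch {

    // Assumes the natural logarithm, as provided by Math.log.
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L; // every document in the index
        long docCount = 100L;      // documents that actually contain the field
        long docFreq = 50L;        // documents containing the term in that field

        // The term appears in half of the documents that have the field,
        // so it is not rare within the field.
        float withDocCount = idf(docFreq, docCount);
        // Using the whole index makes the same term look extremely rare.
        float withNumDocs = idf(docFreq, numDocs);

        System.out.println("idf with docCount: " + withDocCount);
        System.out.println("idf with numDocs:  " + withNumDocs);
    }
}
```

      With the index-wide count, the factor is inflated only because most documents lack the field, which is the skew the note above refers to.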
    • idf

      public float idf(long docFreq, long docCount)
      Implemented as log((docCount+1)/(docFreq+1)) + 1.
      Specified by:
      idf in class TFIDFSimilarity
      Parameters:
      docFreq - the number of documents which contain the term
      docCount - the total number of documents in the collection
      Returns:
      a score factor based on the term's document frequency
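      A short worked example may help with the formula above. The sketch below assumes that log denotes the natural logarithm (as with Java's Math.log); IdfSketch is a hypothetical class name used only here and is not part of Lucene.

```java
// Hypothetical standalone sketch; not part of Lucene.
final class IdfSketch {

    // Mirrors the documented formula log((docCount+1)/(docFreq+1)) + 1,
    // assuming a natural logarithm.
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // A term present in every document: log(1) + 1 = 1.0, the minimum factor.
        System.out.println(idf(10L, 10L));
        // A term in 9 of 99 documents: log(100/10) + 1, about 3.30.
        System.out.println(idf(9L, 99L));
    }
}
```

      The +1 inside the logarithm avoids division by zero when docFreq is 0, and the trailing +1 keeps the factor positive even for terms that occur in every document.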
    • toString

      public String toString()
      Overrides:
      toString in class Object