Class LMSimilarity

Direct Known Subclasses:
IndriDirichletSimilarity, LMDirichletSimilarity, LMJelinekMercerSimilarity

public abstract class LMSimilarity extends SimilarityBase
Abstract superclass for language modeling Similarities. The following inner types are introduced:
  • LMSimilarity.LMStats, which defines a new statistic, the probability that the collection language model generates the current term;
  • LMSimilarity.CollectionModel, which is a strategy interface for objects that compute the collection language model p(w|C);
  • LMSimilarity.DefaultCollectionModel, an implementation of LMSimilarity.CollectionModel that computes the term probability as the number of occurrences of the term in the collection, divided by the total number of tokens.
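The strategy described in the list above can be illustrated with a minimal, self-contained sketch. The types `CollectionModelSketch` and `RelativeFrequencyModel` are hypothetical stand-ins written for this example, not the Lucene classes, and the method signature is an assumption. The library's actual implementation may differ in detail, for example in how it handles zero counts.

```java
// Hypothetical stand-in for a collection-model strategy; not the Lucene interface.
interface CollectionModelSketch {
    /** Returns p(w|C) for a term seen termOccurrences times among totalTokens tokens. */
    double computeProbability(long termOccurrences, long totalTokens);
}

// Relative-frequency model: occurrences of the term divided by total tokens.
final class RelativeFrequencyModel implements CollectionModelSketch {
    @Override
    public double computeProbability(long termOccurrences, long totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return (double) termOccurrences / (double) totalTokens;
    }
}

final class CollectionModelDemo {
    public static void main(String[] args) {
        CollectionModelSketch model = new RelativeFrequencyModel();
        // A term occurring 25 times in a collection of 10,000 tokens.
        System.out.println(model.computeProbability(25, 10_000));
    }
}
```

Keeping the computation behind an interface lets a different estimate of p(w|C) be swapped in without touching the scoring code that consumes it.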
WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Constructor Details

    • LMSimilarity

      public LMSimilarity(LMSimilarity.CollectionModel collectionModel)
      Creates a new instance with the specified collection language model.
    • LMSimilarity

      public LMSimilarity()
      Creates a new instance with the default collection language model.
  • Method Details

    • newStats

      protected BasicStats newStats(String field, double boost)
      Description copied from class: SimilarityBase
      Factory method that returns a custom stats object.
      Overrides:
      newStats in class SimilarityBase
    • fillBasicStats

      protected void fillBasicStats(BasicStats stats, CollectionStatistics collectionStats, TermStatistics termStats)
      Computes the collection probability of the current term in addition to the usual statistics.
      Overrides:
      fillBasicStats in class SimilarityBase
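As a rough illustration of "the collection probability in addition to the usual statistics", the sketch below fills a stats object in two steps. All types here (`BasicStatsSketch`, `LMStatsSketch`, `FillStatsDemo`) and their field names are hypothetical stand-ins for this example. They only mirror the idea of a stats subclass carrying one extra value and do not reproduce the Lucene classes or their signatures.

```java
// Hypothetical stand-ins for illustration; not the Lucene types.
class BasicStatsSketch {
    long totalTermFreq;        // occurrences of the term in the collection
    long numberOfFieldTokens;  // total number of tokens in the collection
}

// Stats subclass that carries one extra statistic, p(w|C).
class LMStatsSketch extends BasicStatsSketch {
    double collectionProbability;
}

final class FillStatsDemo {
    /** Step 1: record the usual statistics. */
    static void fillUsual(BasicStatsSketch stats, long totalTermFreq, long numberOfFieldTokens) {
        stats.totalTermFreq = totalTermFreq;
        stats.numberOfFieldTokens = numberOfFieldTokens;
    }

    /** Step 2: fill the usual statistics, then derive the collection probability. */
    static void fill(LMStatsSketch stats, long totalTermFreq, long numberOfFieldTokens) {
        if (numberOfFieldTokens <= 0) {
            throw new IllegalArgumentException("numberOfFieldTokens must be positive");
        }
        fillUsual(stats, totalTermFreq, numberOfFieldTokens);
        stats.collectionProbability =
            (double) stats.totalTermFreq / (double) stats.numberOfFieldTokens;
    }

    public static void main(String[] args) {
        LMStatsSketch stats = new LMStatsSketch();
        fill(stats, 3, 12);
        System.out.println(stats.collectionProbability);
    }
}
```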
    • explain

      protected void explain(List<Explanation> subExpls, BasicStats stats, double freq, double docLen)
      Description copied from class: SimilarityBase
      Subclasses should implement this method to explain the score. expl already contains the score, the name of the class and the doc id, as well as the term frequency and its explanation; subclasses can add additional clauses to explain details of their scoring formulae.

      The default implementation does nothing.

      Overrides:
      explain in class SimilarityBase
      Parameters:
      subExpls - the list of details of the explanation to extend
      stats - the corpus level statistics.
      freq - the term frequency.
      docLen - the document length.
    • getName

      public abstract String getName()
      Returns the name of the LM method. The values of the parameters should be included as well.

      Used in toString().
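To make the naming contract concrete, here is a small sketch of how a subclass might include its parameter values in getName() and how a toString() could build on it. The classes `NamedLMSketch` and `DirichletLikeSketch`, the parameter `mu`, and the exact output format ("LM <name>" with an optional " - <model>" suffix) are assumptions made for this example. They are not taken from the Lucene source.

```java
import java.util.Locale;

// Hypothetical base class; mirrors only the getName()/toString() relationship.
abstract class NamedLMSketch {
    // Name of a custom collection model, or null when the default model is used.
    private final String collectionModelName;

    NamedLMSketch(String collectionModelName) {
        this.collectionModelName = collectionModelName;
    }

    /** Name of the LM method, including its parameter values. */
    abstract String getName();

    @Override
    public String toString() {
        // Append the collection model name only when a custom model is in use.
        return collectionModelName == null
            ? "LM " + getName()
            : "LM " + getName() + " - " + collectionModelName;
    }
}

// Hypothetical subclass with a single smoothing parameter.
final class DirichletLikeSketch extends NamedLMSketch {
    private final float mu;

    DirichletLikeSketch(float mu, String collectionModelName) {
        super(collectionModelName);
        this.mu = mu;
    }

    @Override
    String getName() {
        // Locale.ROOT keeps the decimal separator stable across platforms.
        return String.format(Locale.ROOT, "Dirichlet(%f)", mu);
    }
}

final class NameDemo {
    public static void main(String[] args) {
        System.out.println(new DirichletLikeSketch(2000f, null));
        System.out.println(new DirichletLikeSketch(2000f, "MyModel"));
    }
}
```

Including parameter values in the name makes two differently configured instances distinguishable in logs and explanations.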

    • toString

      public String toString()
      Returns the name of the LM method. If a custom collection model strategy is used, its name is included as well.
      Specified by:
      toString in class SimilarityBase