Class IBSimilarity


public class IBSimilarity extends SimilarityBase
Provides a framework for the family of information-based models, as described in Stéphane Clinchant and Eric Gaussier. 2010. Information-based models for ad hoc IR. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '10). ACM, New York, NY, USA, 234-241.

The retrieval function is of the form RSV(q, d) = ∑_w -x_w^q log Prob(X_w ≥ t_w^d | λ_w), where the sum runs over the query terms w and

  • x_w^q is the query boost of w;
  • X_w is a random variable that counts the occurrences of word w;
  • t_w^d is the normalized term frequency of w in document d;
  • λ_w is a parameter of the distribution of X_w.
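
To make the summation concrete, here is a minimal sketch of the retrieval function in plain Java. It is not Lucene code: the class RsvSketch and its methods are hypothetical names, and the log-logistic tail Prob(X ≥ t | λ) = λ / (t + λ) is assumed here as one example of a distribution from the cited paper. Any other tail probability can be passed in its place.

```java
import java.util.function.DoubleBinaryOperator;

/** Illustrative sketch of the information-based retrieval function; not Lucene code. */
final class RsvSketch {

  /** Assumed log-logistic tail: Prob(X >= t | lambda) = lambda / (t + lambda). */
  static double logLogisticTail(double tfn, double lambda) {
    return lambda / (tfn + lambda);
  }

  /**
   * Sums -boost * log(tail(tfn, lambda)) over the matching query terms.
   * The three arrays are parallel: index i describes the i-th term.
   */
  static double rsv(double[] boosts, double[] tfns, double[] lambdas, DoubleBinaryOperator tail) {
    double sum = 0.0;
    for (int i = 0; i < boosts.length; i++) {
      double prob = tail.applyAsDouble(tfns[i], lambdas[i]);
      sum += -boosts[i] * Math.log(prob);
    }
    return sum;
  }

  public static void main(String[] args) {
    // Two query terms with made-up statistics, for illustration only.
    double[] boosts = {1.0, 2.0};
    double[] tfns = {3.0, 1.0};
    double[] lambdas = {0.1, 0.5};
    System.out.println(rsv(boosts, tfns, lambdas, RsvSketch::logLogisticTail));
  }
}
```

Under this assumed tail, a smaller λ or a higher normalized frequency gives a smaller tail probability and therefore a larger contribution to the score.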

The framework described in the paper has many similarities to the DFR framework (see DFRSimilarity). It is possible that the two Similarities will be merged at some point.

To construct an IBSimilarity, you must specify the implementations for all three components of the Information-Based model.

  1. Distribution: Probabilistic distribution used to model term occurrence
  2. Lambda: λ_w parameter of the probability distribution
    • LambdaDF: N_w/N, the fraction of the N documents in the collection in which w occurs
    • LambdaTTF: F_w/N, the average number of occurrences of w per document in the collection
  3. Normalization: Term frequency normalization
    • Any supported DFR normalization (listed in DFRSimilarity)
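
The two lambda estimates listed above are plain ratios, which the following sketch spells out. The class LambdaSketch and its method names are hypothetical and only mirror the definitions given here; any smoothing or edge-case handling that a real implementation applies is omitted, and the number of documents is assumed to be greater than zero.

```java
/** Illustrative sketch of the two lambda estimates; not Lucene code. */
final class LambdaSketch {

  /** DF-based estimate: N_w / N, the fraction of documents that contain w. */
  static double lambdaDf(long docFreq, long numDocs) {
    return (double) docFreq / numDocs;
  }

  /** TTF-based estimate: F_w / N, the mean number of occurrences of w per document. */
  static double lambdaTtf(long totalTermFreq, long numDocs) {
    return (double) totalTermFreq / numDocs;
  }

  public static void main(String[] args) {
    // Made-up collection statistics, for illustration only.
    long numDocs = 1_000;      // N
    long docFreq = 50;         // N_w
    long totalTermFreq = 120;  // F_w
    System.out.println("lambda (DF)  = " + lambdaDf(docFreq, numDocs));
    System.out.println("lambda (TTF) = " + lambdaTtf(totalTermFreq, numDocs));
  }
}
```

Because a term occurs at least once in every document that contains it, the TTF-based value is never smaller than the DF-based one for the same term.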
WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Field Details

    • distribution

      protected final Distribution distribution
      The probabilistic distribution used to model term occurrence.
    • lambda

      protected final Lambda lambda
      The lambda (λ_w) parameter.
    • normalization

      protected final Normalization normalization
      The term frequency normalization.
  • Constructor Details

    • IBSimilarity

      public IBSimilarity(Distribution distribution, Lambda lambda, Normalization normalization)
      Creates an IBSimilarity from the three components.

      Note that null values are not allowed: to disable normalization, pass Normalization.NoNormalization instead.

      Parameters:
      distribution - probabilistic distribution modeling term occurrence
      lambda - the distribution's λ_w parameter
      normalization - term frequency normalization
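
The sketch below illustrates the construction rule with stand-in types: three components are supplied, none may be null, and a pass-through object takes the place of "no normalization". Everything in it is hypothetical (IbCompositionSketch, Stats, TailModel, LambdaEstimate, TfNormalizer and the three constants are not Lucene types), the null checks are a choice made for this sketch, and the log-logistic formula is assumed from the cited paper.

```java
import java.util.Objects;

/** Illustrative stand-in for a three-component similarity; not the Lucene class. */
final class IbCompositionSketch {

  /** Hypothetical per-term corpus statistics. */
  static final class Stats {
    final long numDocs;  // N
    final long docFreq;  // N_w

    Stats(long numDocs, long docFreq) {
      this.numDocs = numDocs;
      this.docFreq = docFreq;
    }
  }

  interface TailModel { double score(double tfn, double lambda); }
  interface LambdaEstimate { double lambda(Stats stats); }
  interface TfNormalizer { double tfn(double freq, double docLen); }

  /** Assumed log-logistic model: -log(lambda / (tfn + lambda)). */
  static final TailModel LOG_LOGISTIC = (tfn, lambda) -> -Math.log(lambda / (tfn + lambda));

  /** DF-based lambda: N_w / N. */
  static final LambdaEstimate LAMBDA_DF = stats -> (double) stats.docFreq / stats.numDocs;

  /** Pass-through normalization, used instead of null when none is wanted. */
  static final TfNormalizer NO_NORMALIZATION = (freq, docLen) -> freq;

  private final TailModel distribution;
  private final LambdaEstimate lambda;
  private final TfNormalizer normalization;

  IbCompositionSketch(TailModel distribution, LambdaEstimate lambda, TfNormalizer normalization) {
    // Reject nulls up front so that scoring never has to check for them.
    this.distribution = Objects.requireNonNull(distribution, "distribution");
    this.lambda = Objects.requireNonNull(lambda, "lambda");
    this.normalization = Objects.requireNonNull(normalization, "normalization");
  }

  /** Normalizes the frequency, estimates lambda, then applies the distribution. */
  double score(double boost, Stats stats, double freq, double docLen) {
    double tfn = normalization.tfn(freq, docLen);
    return boost * distribution.score(tfn, lambda.lambda(stats));
  }

  public static void main(String[] args) {
    IbCompositionSketch sim =
        new IbCompositionSketch(LOG_LOGISTIC, LAMBDA_DF, NO_NORMALIZATION);
    // Made-up statistics: the term occurs in 50 of 1,000 documents.
    System.out.println(sim.score(1.0, new Stats(1_000, 50), 3.0, 120.0));
  }
}
```

With Lucene itself on the classpath, the corresponding call would look something like new IBSimilarity(new DistributionLL(), new LambdaDF(), new Normalization.NoNormalization()).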
  • Method Details

    • score

      protected double score(BasicStats stats, double freq, double docLen)
      Description copied from class: SimilarityBase
      Scores the document doc.

      Subclasses must apply their scoring formula in this method.

      Specified by:
      score in class SimilarityBase
      Parameters:
      stats - the corpus level statistics.
      freq - the term frequency.
      docLen - the document length.
      Returns:
      the score.
    • explain

      protected void explain(List<Explanation> subs, BasicStats stats, double freq, double docLen)
      Description copied from class: SimilarityBase
      Subclasses should implement this method to explain the score. The explanation already contains the score, the name of the class and the doc id, as well as the term frequency and its explanation; subclasses can add further clauses to subs to explain details of their scoring formulae.

      The default implementation does nothing.

      Overrides:
      explain in class SimilarityBase
      Parameters:
      subs - the list of details of the explanation to extend
      stats - the corpus level statistics.
      freq - the term frequency.
      docLen - the document length.
    • explain

      protected Explanation explain(BasicStats stats, Explanation freq, double docLen)
      Description copied from class: SimilarityBase
      Explains the score. The implementation here provides a basic explanation in the format score(name-of-similarity, doc=doc-id, freq=term-frequency), computed from:, and attaches the score (computed via the SimilarityBase.score(BasicStats, double, double) method) and the explanation for the term frequency. Subclasses that are content with this format can add further details in SimilarityBase.explain(List, BasicStats, double, double).
      Overrides:
      explain in class SimilarityBase
      Parameters:
      stats - the corpus level statistics.
      freq - the term frequency and its explanation.
      docLen - the document length.
      Returns:
      the explanation.
    • toString

      public String toString()
      The names of IB methods follow the pattern IB <distribution> <lambda><normalization>. The name of the distribution is the same as in the original paper; for the names of lambda parameters, refer to the javadoc of the Lambda classes.
      Specified by:
      toString in class SimilarityBase
    • getDistribution

      public Distribution getDistribution()
      Returns the distribution.
    • getLambda

      public Lambda getLambda()
      Returns the distribution's lambda parameter.
    • getNormalization

      public Normalization getNormalization()
      Returns the term frequency normalization.