Class IndriDirichletSimilarity


public class IndriDirichletSimilarity extends LMSimilarity
Bayesian smoothing using Dirichlet priors as implemented in the Indri search engine (http://www.lemurproject.org/indri.php). Indri Dirichlet smoothing:
 P(t|E) = (tf_E + mu*P(t|D)) / (documentLength + documentMu)
 where P(t|D) = (mu*P(t|C) + tf_D) / (doclen + mu)
 

A larger value of mu produces more smoothing. Smoothing matters most for short documents, where the probability estimates are more granular.
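The effect of mu on P(t|D) can be illustrated with a small standalone sketch. It is not the Lucene implementation: the class name `DirichletSmoothingSketch`, the method `smoothedProbability`, and all numeric inputs are hypothetical and chosen only to show how short and long documents respond to different values of mu.

```java
import java.util.Locale;

/** Illustrative sketch only; the class and method names are hypothetical, not part of Lucene. */
final class DirichletSmoothingSketch {

  /** Computes P(t|D) = (mu*P(t|C) + tf_D) / (doclen + mu). */
  static double smoothedProbability(double tf, double docLen, double collectionProb, double mu) {
    return (mu * collectionProb + tf) / (docLen + mu);
  }

  public static void main(String[] args) {
    double collectionProb = 0.001; // assumed P(t|C), made up for the example
    double[] mus = {500.0, 2000.0, 8000.0};

    // Both documents have the same unsmoothed estimate tf/doclen = 0.1,
    // but the short one carries far less evidence.
    for (double mu : mus) {
      double shortDoc = smoothedProbability(2, 20, collectionProb, mu);
      double longDoc = smoothedProbability(200, 2000, collectionProb, mu);
      System.out.printf(Locale.ROOT, "mu=%.0f short=%.6f long=%.6f%n", mu, shortDoc, longDoc);
    }
  }
}
```

As mu grows, both estimates move toward the collection probability, and the short document moves much further than the long one for any given mu. A term that does not occur in the document (tf_D = 0) still receives a non-zero probability, which is the point of smoothing.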

  • Constructor Details

    • IndriDirichletSimilarity

      public IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel, float mu)
      Instantiates the similarity with the provided collection model and μ parameter.
    • IndriDirichletSimilarity

      public IndriDirichletSimilarity(float mu)
      Instantiates the similarity with the provided μ parameter.
    • IndriDirichletSimilarity

      public IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel)
      Instantiates the similarity with the provided collection model and the default μ value of 2000.
    • IndriDirichletSimilarity

      public IndriDirichletSimilarity()
      Instantiates the similarity with the default μ value of 2000.
  • Method Details

    • score

      protected double score(BasicStats stats, double freq, double docLen)
      Description copied from class: SimilarityBase
      Scores the document.

      Subclasses must apply their scoring formula in this method.

      Specified by:
      score in class SimilarityBase
      Parameters:
      stats - the corpus level statistics.
      freq - the term frequency.
      docLen - the document length.
      Returns:
      the score.
    • explain

      protected void explain(List<Explanation> subs, BasicStats stats, double freq, double docLen)
      Description copied from class: SimilarityBase
      Subclasses should implement this method to explain the score. The explanation already contains the score, the name of the class and the doc id, as well as the term frequency and its explanation; subclasses can add clauses to subs to explain the details of their scoring formulae.

      The default implementation does nothing.

      Overrides:
      explain in class LMSimilarity
      Parameters:
      subs - the list of details of the explanation to extend
      stats - the corpus level statistics.
      freq - the term frequency.
      docLen - the document length.
    • getMu

      public float getMu()
      Returns the μ parameter.
    • getName

      public String getName()
      Description copied from class: LMSimilarity
      Returns the name of the LM method. The values of the parameters should be included as well.

      Used in LMSimilarity.toString().

      Specified by:
      getName in class LMSimilarity