Class IndriDirichletSimilarity


  • public class IndriDirichletSimilarity
    extends LMSimilarity
    Bayesian smoothing using Dirichlet priors, as implemented in the Indri search engine (http://www.lemurproject.org/indri.php). Indri Dirichlet smoothing:

     P(t|E) = (tf_E + mu*P(t|D)) / (documentLength + documentMu)

     where P(t|D) = (mu*P(t|C) + tf_D) / (doclen + mu)
     

    A larger value of mu produces more smoothing. Smoothing matters most for short documents, where the probabilities are more granular.
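
    To make the effect of mu concrete, the following is a minimal sketch of the P(t|D) formula only, written in plain Java without any Lucene dependency. The class name DirichletSmoothingSketch, the method smoothedProbability, and all numeric inputs are hypothetical and chosen for illustration; this is not the class's actual score implementation.

```java
public class DirichletSmoothingSketch {

    /**
     * Illustrative only: P(t|D) = (mu * P(t|C) + tf_D) / (doclen + mu).
     *
     * @param tf             term frequency in the document (tf_D)
     * @param docLen         document length (doclen)
     * @param collectionProb probability of the term in the collection, P(t|C)
     * @param mu             smoothing parameter
     */
    static double smoothedProbability(double tf, double docLen,
                                      double collectionProb, double mu) {
        return (mu * collectionProb + tf) / (docLen + mu);
    }

    public static void main(String[] args) {
        // Hypothetical inputs: a short document of 10 terms in which
        // the query term occurs twice, and a rare term in the collection.
        double tf = 2.0;
        double docLen = 10.0;
        double collectionProb = 0.001;

        // As mu grows, the estimate moves away from the raw ratio
        // tf / docLen and toward the collection probability.
        for (double mu : new double[] {0.0, 200.0, 2000.0}) {
            double p = smoothedProbability(tf, docLen, collectionProb, mu);
            System.out.printf("mu=%.0f  P(t|D)=%.6f%n", mu, p);
        }
    }
}
```

    With mu = 0 the estimate is just tf_D / doclen; as mu increases, it is pulled toward P(t|C), which is why short documents are affected most.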

    • Constructor Detail

      • IndriDirichletSimilarity

        public IndriDirichletSimilarity​(LMSimilarity.CollectionModel collectionModel,
                                        float mu)
        Instantiates the similarity with the provided collection model and μ parameter.
      • IndriDirichletSimilarity

        public IndriDirichletSimilarity​(float mu)
        Instantiates the similarity with the provided μ parameter.
      • IndriDirichletSimilarity

        public IndriDirichletSimilarity​(LMSimilarity.CollectionModel collectionModel)
        Instantiates the similarity with the provided collection model and the default μ value of 2000.
      • IndriDirichletSimilarity

        public IndriDirichletSimilarity()
        Instantiates the similarity with the default μ value of 2000.
    • Method Detail

      • score

        protected double score​(BasicStats stats,
                               double freq,
                               double docLen)
        Description copied from class: SimilarityBase
        Scores the document doc.

        Subclasses must apply their scoring formula in this method.

        Specified by:
        score in class SimilarityBase
        Parameters:
        stats - the corpus level statistics.
        freq - the term frequency.
        docLen - the document length.
        Returns:
        the score.
      • explain

        protected void explain​(List<Explanation> subs,
                               BasicStats stats,
                               double freq,
                               double docLen)
        Description copied from class: SimilarityBase
        Subclasses should implement this method to explain the score. expl already contains the score, the name of the class and the doc id, as well as the term frequency and its explanation; subclasses can add additional clauses to explain details of their scoring formulae.

        The default implementation does nothing.

        Overrides:
        explain in class LMSimilarity
        Parameters:
        subs - the list of details of the explanation to extend
        stats - the corpus level statistics.
        freq - the term frequency.
        docLen - the document length.
      • getMu

        public float getMu()
        Returns the μ parameter.