Class IndriDirichletSimilarity
java.lang.Object
org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.SimilarityBase
org.apache.lucene.search.similarities.LMSimilarity
org.apache.lucene.search.similarities.IndriDirichletSimilarity
Bayesian smoothing using Dirichlet priors, as implemented in the Indri search engine
(http://www.lemurproject.org/indri.php). Indri Dirichlet smoothing:
P(t|E) = (tf_E + mu*P(t|D)) / (documentLength + documentMu)
where P(t|D) = (mu*P(t|C) + tf_D) / (doclen + mu)
A larger value for mu produces more smoothing. Smoothing is most important for short documents, where the probabilities are more granular.
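As a rough illustration of the document-level formula P(t|D) above, the following standalone sketch evaluates the smoothed probability for a short document at several values of mu. It does not use Lucene; the class name `DirichletSmoothingSketch`, the method `smoothedProbability`, and the sample statistics are all hypothetical and chosen only for illustration.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Lucene: evaluates a Dirichlet-smoothed term probability. */
final class DirichletSmoothingSketch {

    /** Computes P(t|D) = (mu * P(t|C) + tf_D) / (docLen + mu). */
    static double smoothedProbability(double tf, double docLen, double collectionProb, double mu) {
        return (mu * collectionProb + tf) / (docLen + mu);
    }

    public static void main(String[] args) {
        double collectionProb = 0.001; // assumed P(t|C) for the term
        double tf = 2;                 // assumed term frequency in the document
        double docLen = 20;            // assumed (short) document length

        // As mu grows, the estimate moves away from tf/docLen and toward P(t|C).
        for (double mu : new double[] {0, 200, 2000, 20000}) {
            double p = smoothedProbability(tf, docLen, collectionProb, mu);
            System.out.printf(Locale.ROOT, "mu=%.0f P(t|D)=%.6f%n", mu, p);
        }
    }
}
```

With mu = 0 the estimate is simply tf_D / doclen; increasing mu pulls it toward the collection probability, which is the sense in which a larger mu produces more smoothing.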
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, and Description
static class
    Models p(w|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1.
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.LMSimilarity
LMSimilarity.CollectionModel, LMSimilarity.DefaultCollectionModel, LMSimilarity.LMStats
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
Similarity.SimScorer
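The collection model described in the nested class summary above can be sketched as a single division. This is a minimal illustration that follows the wording of the description only; the class name `CollectionModelSketch` and the parameter names `totalTermFreq` and `numberOfFieldTokens` are hypothetical, and the actual Lucene implementation may differ in detail.

```java
/** Hypothetical helper, not part of Lucene: p(w|C) as described in the class summary. */
final class CollectionModelSketch {

    /**
     * Returns occurrences of the term in the collection divided by
     * (total number of tokens + 1). The + 1 keeps the denominator
     * non-zero even for an empty collection.
     */
    static double collectionProbability(long totalTermFreq, long numberOfFieldTokens) {
        return totalTermFreq / (numberOfFieldTokens + 1.0);
    }

    public static void main(String[] args) {
        // Assumed statistics: the term occurs 5 times among 999 tokens.
        System.out.println(collectionProbability(5, 999));
    }
}
```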
-
Field Summary
Fields inherited from class org.apache.lucene.search.similarities.LMSimilarity
collectionModel
-
Constructor Summary
Constructors
Constructor and Description
IndriDirichletSimilarity()
    Instantiates the similarity with the default μ value of 2000.
IndriDirichletSimilarity(float mu)
    Instantiates the similarity with the provided μ parameter.
IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel)
    Instantiates the similarity with the default μ value of 2000.
IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel, boolean discountOverlaps, float mu)
    Instantiates the similarity with the provided parameters.
IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel, float mu)
    Instantiates the similarity with the provided μ parameter.
-
Method Summary
Modifier and Type, Method, and Description
protected void explain(List<Explanation> subs, BasicStats stats, double freq, double docLen)
    Subclasses should implement this method to explain the score.
float getMu()
    Returns the μ parameter.
String getName()
    Returns the name of the LM method.
protected double score(BasicStats stats, double freq, double docLen)
    Scores the document doc.
Methods inherited from class org.apache.lucene.search.similarities.LMSimilarity
fillBasicStats, newStats, toString
Methods inherited from class org.apache.lucene.search.similarities.SimilarityBase
explain, log2, scorer
Methods inherited from class org.apache.lucene.search.similarities.Similarity
computeNorm, getDiscountOverlaps
-
Constructor Details
-
IndriDirichletSimilarity
public IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel, boolean discountOverlaps, float mu)
Instantiates the similarity with the provided parameters.
-
IndriDirichletSimilarity
public IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel, float mu)
Instantiates the similarity with the provided μ parameter.
-
IndriDirichletSimilarity
public IndriDirichletSimilarity(float mu)
Instantiates the similarity with the provided μ parameter.
-
IndriDirichletSimilarity
public IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel)
Instantiates the similarity with the default μ value of 2000.
-
IndriDirichletSimilarity
public IndriDirichletSimilarity()
Instantiates the similarity with the default μ value of 2000.
-
-
Method Details
-
score
protected double score(BasicStats stats, double freq, double docLen)
Description copied from class: SimilarityBase
Scores the document doc. Subclasses must apply their scoring formula in this class.
- Specified by:
score in class SimilarityBase
- Parameters:
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.
- Returns:
the score.
-
explain
protected void explain(List<Explanation> subs, BasicStats stats, double freq, double docLen)
Description copied from class: SimilarityBase
Subclasses should implement this method to explain the score. expl already contains the score, the name of the class and the doc id, as well as the term frequency and its explanation; subclasses can add additional clauses to explain details of their scoring formulae.
The default implementation does nothing.
- Overrides:
explain in class LMSimilarity
- Parameters:
subs - the list of details of the explanation to extend
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.
-
getMu
public float getMu()
Returns the μ parameter.
-
getName
public String getName()
Description copied from class: LMSimilarity
Returns the name of the LM method. The values of the parameters should be included as well.
Used in LMSimilarity.toString().
- Specified by:
getName in class LMSimilarity
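To illustrate the contract that the name should include the parameter values, here is a minimal standalone sketch. The class `LmNameSketch` is hypothetical and does not extend any Lucene class, and the label "IndriDirichlet" and the number format are assumptions made for illustration; this page does not state the exact string that is returned.

```java
import java.util.Locale;

/** Hypothetical stand-in, not part of Lucene: builds a name that includes the mu value. */
final class LmNameSketch {
    private final float mu;

    LmNameSketch(float mu) {
        this.mu = mu;
    }

    float getMu() {
        return mu;
    }

    /** Assumed format: method label followed by the parameter value in parentheses. */
    String getName() {
        // Locale.ROOT keeps the decimal separator stable across default locales.
        return String.format(Locale.ROOT, "IndriDirichlet(%f)", getMu());
    }

    public static void main(String[] args) {
        System.out.println(new LmNameSketch(2000f).getName());
    }
}
```

Including the parameter in the name means that two similarities configured with different μ values remain distinguishable in any output that relies on toString().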
-