public class IndriDirichletSimilarity extends LMSimilarity
$$P(t \mid E) = \frac{tf_E + \mu \cdot P(t \mid D)}{\text{documentLength} + \text{documentMu}}, \quad \text{where} \quad P(t \mid D) = \frac{\mu \cdot P(t \mid C) + tf_D}{\text{doclen} + \mu}$$

A larger value of mu produces more smoothing. Smoothing is most important for short documents, where the probabilities are more granular.
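
The effect of mu on short versus long documents can be illustrated with a minimal sketch of the P(t|D) formula above. This is plain Java with no Lucene dependency; the class name, method name, and all numeric values are hypothetical and chosen only for illustration, and this is not the library's own implementation.

```java
// Illustrative sketch only: evaluates P(t|D) = (tf_D + mu * P(t|C)) / (doclen + mu).
// Class, method, and sample values are assumptions made for this example.
final class DirichletSmoothingSketch {

    /** Dirichlet-smoothed probability of a term in a document. */
    static double smoothedProbability(double tf, double docLen,
                                      double collectionProb, double mu) {
        return (tf + mu * collectionProb) / (docLen + mu);
    }

    public static void main(String[] args) {
        double collectionProb = 0.001; // assumed P(t|C)

        // Both documents have the same unsmoothed estimate tf/doclen = 0.1,
        // but the short one is pulled toward P(t|C) much more strongly.
        for (double mu : new double[] {0, 500, 2000, 10000}) {
            double shortDoc = smoothedProbability(2, 20, collectionProb, mu);
            double longDoc = smoothedProbability(200, 2000, collectionProb, mu);
            System.out.printf("mu=%.0f short=%.5f long=%.5f%n", mu, shortDoc, longDoc);
        }
    }
}
```

With mu = 0 the estimate reduces to tf_D / doclen; as mu grows, both estimates move toward P(t|C), and a term that does not occur in the document still receives a non-zero probability.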
| Modifier and Type | Class and Description |
|---|---|
| static class | IndriDirichletSimilarity.IndriCollectionModel: Models p(w\|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1. |
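
The collection model described in the table can be sketched as a single division. The class and method names below are hypothetical, and the code follows only the wording of the description (occurrences divided by total tokens + 1); it does not reproduce the library's source.

```java
// Illustrative sketch only: p(w|C) as described in the class summary.
// Names are assumptions; the real model reads these counts from index statistics.
final class CollectionModelSketch {

    /** Occurrences of the term in the collection divided by (total tokens + 1). */
    static double collectionProbability(long totalTermFreq, long totalTokens) {
        return totalTermFreq / (totalTokens + 1.0);
    }

    public static void main(String[] args) {
        // A term seen 5 times in a collection of 999 tokens.
        System.out.println(collectionProbability(5, 999));
        // An empty collection: the + 1 keeps the denominator non-zero.
        System.out.println(collectionProbability(0, 0));
    }
}
```

One visible consequence of the + 1 in the denominator is that the division is defined even when the collection contains no tokens.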

Inherited nested classes: LMSimilarity.CollectionModel, LMSimilarity.DefaultCollectionModel, LMSimilarity.LMStats, Similarity.SimScorer

Inherited fields: collectionModel, discountOverlaps

| Constructor and Description |
|---|
| IndriDirichletSimilarity(): Instantiates the similarity with the default μ value of 2000. |
| IndriDirichletSimilarity(float mu): Instantiates the similarity with the provided μ parameter. |
| IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel): Instantiates the similarity with the default μ value of 2000. |
| IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel, float mu): Instantiates the similarity with the provided μ parameter. |

| Modifier and Type | Method and Description |
|---|---|
| protected void | explain(List<Explanation> subs, BasicStats stats, double freq, double docLen): Subclasses should implement this method to explain the score. |
| float | getMu(): Returns the μ parameter. |
| String | getName(): Returns the name of the LM method. |
| protected double | score(BasicStats stats, double freq, double docLen): Scores the document doc. |

Inherited methods: fillBasicStats, newStats, toString, computeNorm, explain, getDiscountOverlaps, log2, scorer, setDiscountOverlaps

public IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel, float mu)
public IndriDirichletSimilarity(float mu)
public IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel)
public IndriDirichletSimilarity()
protected double score(BasicStats stats, double freq, double docLen)
Description copied from class SimilarityBase: Scores the document doc.
Subclasses must apply their scoring formula in this class.
Specified by: score in class SimilarityBase
Parameters:
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.

protected void explain(List<Explanation> subs, BasicStats stats, double freq, double docLen)
Description copied from class SimilarityBase: Subclasses should implement this method to explain the score. expl
already contains the score, the name of the class and the doc id, as well
as the term frequency and its explanation; subclasses can add additional
clauses to explain details of their scoring formulae.
The default implementation does nothing.
Overrides: explain in class LMSimilarity
Parameters:
subs - the list of details of the explanation to extend
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.

public float getMu()
public String getName()
Description copied from class LMSimilarity: Returns the name of the LM method. Used in LMSimilarity.toString().
Specified by: getName in class LMSimilarity

Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.