Class DFISimilarity
- java.lang.Object
  - org.apache.lucene.search.similarities.Similarity
    - org.apache.lucene.search.similarities.SimilarityBase
      - org.apache.lucene.search.similarities.DFISimilarity
public class DFISimilarity extends SimilarityBase
Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf). DFI is both parameter-free and non-parametric:
- parameter-free: it does not require any parameter tuning or training.
- non-parametric: it does not make any assumptions about word frequency distributions on document collections.
It is highly recommended not to remove stopwords (very common terms: the, of, and, to, a, in, for, is, on, that, etc.) with this similarity.
For more information see: A nonparametric term weighting method for information retrieval based on measuring the divergence from independence
- See Also: IndependenceStandardized, IndependenceSaturated, IndependenceChiSquared
- WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
- Similarity.SimScorer

Field Summary
Fields inherited from class org.apache.lucene.search.similarities.SimilarityBase
- discountOverlaps

Constructor Summary
- DFISimilarity(Independence independenceMeasure): Create DFI with the specified divergence from independence measure.

Method Summary
- protected Explanation explain(BasicStats stats, Explanation freq, double docLen): Explains the score.
- Independence getIndependence(): Returns the measure of independence.
- protected double score(BasicStats stats, double freq, double docLen): Scores the document doc.
- String toString(): Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.

Methods inherited from class org.apache.lucene.search.similarities.SimilarityBase
computeNorm, explain, fillBasicStats, getDiscountOverlaps, log2, newStats, scorer, setDiscountOverlaps

Constructor Detail

DFISimilarity
public DFISimilarity(Independence independenceMeasure)
Create DFI with the specified divergence from independence measure.
- Parameters:
  independenceMeasure - measure of divergence from independence

Method Detail

score
protected double score(BasicStats stats, double freq, double docLen)
Description copied from class: SimilarityBase
Scores the document doc. Subclasses must apply their scoring formula in this method.
- Specified by: score in class SimilarityBase
- Parameters:
  stats - the corpus level statistics.
  freq - the term frequency.
  docLen - the document length.
- Returns: the score.
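
This page does not spell out the scoring formula. As a rough illustration of what a divergence-from-independence score can look like, here is a Lucene-free sketch. It assumes that the expected frequency is (totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1), that a term occurring no more often than expected scores 0, and that the result is boost * log2(measure + 1). These formulas, and the three measure formulas, are assumptions recalled from the Lucene sources rather than anything stated here, and DfiSketch and Measure are hypothetical names that are not part of the API.

```java
import java.util.function.DoubleBinaryOperator;

/** Illustrative, Lucene-free sketch of DFI scoring; every name here is hypothetical. */
final class DfiSketch {

    /** Stand-ins for the three Independence implementations listed under "See Also". */
    enum Measure {
        // Assumed formula: (observed - expected) / sqrt(expected)
        STANDARDIZED((freq, expected) -> (freq - expected) / Math.sqrt(expected)),
        // Assumed formula: (observed - expected) / expected
        SATURATED((freq, expected) -> (freq - expected) / expected),
        // Assumed formula: (observed - expected)^2 / expected
        CHI_SQUARED((freq, expected) -> (freq - expected) * (freq - expected) / expected);

        private final DoubleBinaryOperator formula;

        Measure(DoubleBinaryOperator formula) {
            this.formula = formula;
        }

        double score(double freq, double expected) {
            return formula.applyAsDouble(freq, expected);
        }
    }

    /**
     * Scores one term in one document.
     * totalTermFreq and numberOfFieldTokens play the role of the corpus-level statistics,
     * freq is the term frequency in the document, docLen is the document length.
     */
    static double score(Measure measure, long totalTermFreq, long numberOfFieldTokens,
                        double freq, double docLen, double boost) {
        // Frequency expected if the term were distributed independently of documents
        // (add-one smoothing on both counts is an assumption of this sketch).
        double expected = (totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1);

        // A term that is not over-represented in the document contributes nothing.
        if (freq <= expected) {
            return 0.0;
        }

        double divergence = measure.score(freq, expected);
        return boost * Math.log(divergence + 1) / Math.log(2); // log base 2
    }

    public static void main(String[] args) {
        // Rare term: 99 occurrences in a field of 999,999 tokens, 5 of them in a 100-token document.
        System.out.println(score(Measure.STANDARDIZED, 99, 999_999, 5, 100, 1.0));
        // Stopword-like term: 59,999 occurrences in the same field, also 5 in the document.
        System.out.println(score(Measure.STANDARDIZED, 59_999, 999_999, 5, 100, 1.0));
    }
}
```

Under these assumptions the stopword-like term in the second call has an expected frequency of 6 in a 100-token document, so its observed frequency of 5 scores 0. That is one way to read the recommendation above not to remove stopwords: very common terms tend to be discounted by the model itself.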

getIndependence
public Independence getIndependence()
Returns the measure of independence.

explain
protected Explanation explain(BasicStats stats, Explanation freq, double docLen)
Description copied from class: SimilarityBase
Explains the score. The implementation here provides a basic explanation in the format score(name-of-similarity, doc=doc-id, freq=term-frequency), computed from:, and attaches the score (computed via the SimilarityBase.score(BasicStats, double, double) method) and the explanation for the term frequency. Subclasses content with this format may add additional details in SimilarityBase.explain(List, BasicStats, double, double).
- Overrides: explain in class SimilarityBase
- Parameters:
  stats - the corpus level statistics.
  freq - the term frequency and its explanation.
  docLen - the document length.
- Returns: the explanation.

toString
public String toString()
Description copied from class: SimilarityBase
Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
- Specified by: toString in class SimilarityBase