public class DFISimilarity extends SimilarityBase
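DFI is both parameter-free and non-parametric: it requires no parameter tuning or training, and it makes no assumptions about word frequency distributions in document collections.
It is highly recommended not to remove stopwords (very common terms such as the, of, and, to, a, in, for, is, on, that) with this similarity.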
For more information see: A nonparametric term weighting method for information retrieval based on measuring the divergence from independence
See Also: IndependenceStandardized, IndependenceSaturated, IndependenceChiSquared
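The three measures listed above are not defined on this page. As a minimal sketch, assuming each one compares an observed term frequency with the frequency expected under independence, they could take the forms below. The class and method names are hypothetical and are not Lucene's API. The formulas are an assumption and should be checked against the actual Independence implementations.

```java
// Hypothetical illustration only; not part of Lucene.
final class DfiMeasuresSketch {

    // Assumed standardized form: (observed - expected) / sqrt(expected)
    static float standardized(float freq, float expected) {
        return (freq - expected) / (float) Math.sqrt(expected);
    }

    // Assumed saturated form: (observed - expected) / expected
    static float saturated(float freq, float expected) {
        return (freq - expected) / expected;
    }

    // Assumed chi-squared form: (observed - expected)^2 / expected
    static float chiSquared(float freq, float expected) {
        return (freq - expected) * (freq - expected) / expected;
    }

    public static void main(String[] args) {
        float freq = 9f, expected = 4f;
        System.out.println(standardized(freq, expected)); // prints 2.5
        System.out.println(saturated(freq, expected));    // prints 1.25
        System.out.println(chiSquared(freq, expected));   // prints 6.25
    }
}
```

In Lucene itself the measure is selected by passing an Independence instance to the constructor documented on this page, for example new DFISimilarity(new IndependenceStandardized()).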
Nested classes inherited from class Similarity: Similarity.SimScorer, Similarity.SimWeight
Fields inherited from class SimilarityBase: discountOverlaps
Constructor and Description |
---|
DFISimilarity(Independence independenceMeasure): Create DFI with the specified divergence from independence measure. |
Modifier and Type | Method and Description |
---|---|
Independence | getIndependence(): Returns the measure of independence. |
protected float | score(BasicStats stats, float freq, float docLen): Scores the document doc. |
String | toString(): Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well. |
Methods inherited from class SimilarityBase: computeNorm, computeWeight, explain, explain, fillBasicStats, getDiscountOverlaps, log2, newStats, setDiscountOverlaps, simScorer
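public DFISimilarity(Independence independenceMeasure)
Create DFI with the specified divergence from independence measure
Parameters:
independenceMeasure - measure of divergence from independence

protected float score(BasicStats stats, float freq, float docLen)
Description copied from class: SimilarityBase
Scores the document doc. Subclasses must apply their scoring formula in this method.
Specified by:
score in class SimilarityBase
Parameters:
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.

public Independence getIndependence()
Returns the measure of independence

public String toString()
Description copied from class: SimilarityBase
Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
Specified by:
toString in class SimilarityBase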
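This page does not spell out the formula behind score(BasicStats, float, float). The following is a minimal sketch of how a divergence-from-independence score can be computed from the three inputs the method receives, under stated assumptions: the expected frequency is estimated as (totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1), terms that occur no more often than expected score zero, and the measure is dampened with a base-2 logarithm. The class, the Measure interface, and the parameter names are hypothetical stand-ins, not Lucene's classes.

```java
// Hypothetical illustration only; not Lucene's DFISimilarity.
final class DfiScoreSketch {

    /** Hypothetical stand-in for an independence measure. */
    interface Measure {
        float score(float freq, float expected);
    }

    static float score(long totalTermFreq, long numberOfFieldTokens,
                       float boost, float freq, float docLen, Measure measure) {
        // Assumed estimate of the term frequency expected in a document of
        // this length if the term were distributed independently of documents.
        float expected = (totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1);

        // A term seen no more often than expected carries no evidence.
        if (freq <= expected) {
            return 0f;
        }

        // Divergence of the observed frequency from the expected one.
        float divergence = measure.score(freq, expected);

        // Dampen with log base 2 and apply the query-time boost.
        return boost * (float) (Math.log(divergence + 1) / Math.log(2));
    }

    public static void main(String[] args) {
        // Assumed standardized measure: (observed - expected) / sqrt(expected)
        Measure standardized = (f, e) -> (f - e) / (float) Math.sqrt(e);
        // The expected frequency is 4 in both calls: (3 + 1) * 100 / (99 + 1).
        System.out.println(score(3, 99, 1f, 9f, 100f, standardized));
        System.out.println(score(3, 99, 1f, 4f, 100f, standardized));
    }
}
```

With these inputs the first call scores a term seen 9 times against an expected 4, and the second call returns zero because the observed frequency does not exceed the expected one. The zero floor also suggests why keeping stopwords is recommended above: very common terms tend to occur about as often as expected, so they contribute little or nothing to the score.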
Copyright © 2000-2017 Apache Software Foundation. All Rights Reserved.