Class DFISimilarity


public class DFISimilarity extends SimilarityBase
Implements the Divergence from Independence (DFI) model based on chi-square statistics (i.e., the standardized chi-squared distance from independence in term frequency tf).

DFI is both parameter-free and non-parametric:

  • parameter-free: it does not require any parameter tuning or training.
  • non-parametric: it does not make any assumptions about word frequency distributions on document collections.

It is highly recommended not to remove stopwords (very common terms such as the, of, and, to, a, in, for, is, on, that) when using this similarity.

For more information, see "A nonparametric term weighting method for information retrieval based on measuring the divergence from independence".

WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Constructor Details

    • DFISimilarity

      public DFISimilarity(Independence independenceMeasure)
      Creates a DFI similarity that uses the specified measure of divergence from independence.
      Parameters:
      independenceMeasure - measure of divergence from independence
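
The constructor can be exercised with a short sketch like the one below. It assumes `IndependenceStandardized`, one of the `Independence` implementations that ships in the same Lucene package, as the measure; the class name `DfiUsageSketch` and its helper methods are hypothetical and exist only for illustration. The writer configuration and searcher are taken as parameters so the sketch makes no assumption about how the index is opened.

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.DFISimilarity;
import org.apache.lucene.search.similarities.Independence;
import org.apache.lucene.search.similarities.IndependenceStandardized;
import org.apache.lucene.search.similarities.Similarity;

/** Hypothetical helper class, shown only to illustrate the constructor. */
final class DfiUsageSketch {

    /** Builds a DFI similarity with the standardized measure (an assumed choice). */
    static DFISimilarity newStandardizedDfi() {
        Independence measure = new IndependenceStandardized();
        return new DFISimilarity(measure);
    }

    /** Installs the same similarity for indexing and for searching. */
    static void apply(Similarity similarity, IndexWriterConfig writerConfig, IndexSearcher searcher) {
        writerConfig.setSimilarity(similarity);
        searcher.setSimilarity(similarity);
    }
}
```

Using one similarity instance on both the `IndexWriterConfig` and the `IndexSearcher` keeps index-time and query-time behavior consistent. `getIndependence()` on the returned object gives back the measure passed to the constructor.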
  • Method Details

    • score

      protected double score(BasicStats stats, double freq, double docLen)
      Description copied from class: SimilarityBase
      Scores the document doc.

      Subclasses must apply their scoring formula in this method.

      Specified by:
      score in class SimilarityBase
      Parameters:
      stats - the corpus level statistics.
      freq - the term frequency.
      docLen - the document length.
      Returns:
      the score.
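
To make the idea of a standardized distance from independence concrete, here is a minimal, self-contained sketch that does not depend on Lucene. It is not the source of `DFISimilarity`: the class `DfiScoreSketch`, its method names, the zero cutoff for under-represented terms, and the final `log2(1 + measure)` step are assumptions made for illustration. The real class also works from `BasicStats` and may apply a boost and smoothing that are omitted here.

```java
/** Illustrative sketch only; names and details are assumptions, not Lucene's code. */
final class DfiScoreSketch {

    /** Frequency expected in a document of length docLen if the term were independent of documents. */
    static double expectedFreq(double totalTermFreq, double numberOfFieldTokens, double docLen) {
        return totalTermFreq * docLen / numberOfFieldTokens;
    }

    /** Standardized distance from independence: (observed - expected) / sqrt(expected). */
    static double standardized(double freq, double expected) {
        return (freq - expected) / Math.sqrt(expected);
    }

    /** Assumed scoring rule: zero unless the term is over-represented, else log2(1 + measure). */
    static double score(double freq, double docLen, double totalTermFreq, double numberOfFieldTokens) {
        double expected = expectedFreq(totalTermFreq, numberOfFieldTokens, docLen);
        if (freq <= expected) {
            return 0.0; // no more occurrences than independence predicts
        }
        return Math.log(1.0 + standardized(freq, expected)) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a 100-token document in a 1,000,000-token field,
        // and a term with 10,000 occurrences overall, so the expected frequency is 1.
        System.out.println(score(4.0, 100.0, 10_000.0, 1_000_000.0));
        System.out.println(score(1.0, 100.0, 10_000.0, 1_000_000.0));
    }
}
```

The sketch uses only collection-level counts and the document length, with no tunable constants, which is what the parameter-free property described above amounts to in practice.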
    • getIndependence

      public Independence getIndependence()
      Returns the measure of divergence from independence used by this similarity.
    • explain

      protected Explanation explain(BasicStats stats, Explanation freq, double docLen)
      Description copied from class: SimilarityBase
      Explains the score. The implementation here provides a basic explanation in the format "score(name-of-similarity, doc=doc-id, freq=term-frequency), computed from:" and attaches the score (computed via the SimilarityBase.score(BasicStats, double, double) method) and the explanation for the term frequency. Subclasses content with this format may add additional details in SimilarityBase.explain(List, BasicStats, double, double).
      Overrides:
      explain in class SimilarityBase
      Parameters:
      stats - the corpus level statistics.
      freq - the term frequency and its explanation.
      docLen - the document length.
      Returns:
      the explanation.
    • toString

      public String toString()
      Description copied from class: SimilarityBase
      Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
      Specified by:
      toString in class SimilarityBase