org.apache.lucene.search.similarities (Lucene 5.5.4 API)

Interface Summary
Interface	Description
LMSimilarity.CollectionModel	A strategy for computing the collection language model.

Class Summary
Class	Description
AfterEffect	This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework.
AfterEffect.NoAfterEffect	Implementation used when there is no aftereffect.
AfterEffectB	Model of the information gain based on the ratio of two Bernoulli processes.
AfterEffectL	Model of the information gain based on Laplace's law of succession.
BasicModel	This class acts as the base class for the specific basic model implementations in the DFR framework.
BasicModelBE	Limiting form of the Bose-Einstein model.
BasicModelD	Implements the approximation of the binomial model with the divergence for DFR.
BasicModelG	Geometric as limiting form of the Bose-Einstein model.
BasicModelIF	An approximation of the I(n_e) model.
BasicModelIn	The basic tf-idf model of randomness.
BasicModelIne	Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency.
BasicModelP	Implements the Poisson approximation for the binomial model for DFR.
BasicStats	Stores all statistics commonly used by ranking methods.
BM25Similarity	BM25 Similarity.
ClassicSimilarity	Expert: Default scoring implementation which `encodes` norm values as a single byte before being stored.
DefaultSimilarity	Deprecated. Use `ClassicSimilarity` for equivalent behavior, or consider switching to `BM25Similarity`, which will become the new default in Lucene 6.0.
DFISimilarity	Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf).
DFRSimilarity	Implements the divergence from randomness (DFR) framework introduced by Gianni Amati and Cornelis Joost Van Rijsbergen.
Distribution	The probabilistic distribution used to model term occurrence in information-based models.
DistributionLL	Log-logistic distribution.
DistributionSPL	The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper.
IBSimilarity	Provides a framework for the family of information-based models, as described by Stéphane Clinchant and Eric Gaussier.
Independence	Computes the measure of divergence from independence for DFI scoring functions.
IndependenceChiSquared	Normalized chi-squared measure of distance from independence.
IndependenceSaturated	Saturated measure of distance from independence.
IndependenceStandardized	Standardized measure of distance from independence.
Lambda	The lambda (λ_w) parameter in information-based models.
LambdaDF	Computes lambda as `(docFreq+1) / (numberOfDocuments+1)`.
LambdaTTF	Computes lambda as `(totalTermFreq+1) / (numberOfDocuments+1)`.
LMDirichletSimilarity	Bayesian smoothing using Dirichlet priors.
LMJelinekMercerSimilarity	Language model based on the Jelinek-Mercer smoothing method.
LMSimilarity	Abstract superclass for language modeling Similarities.
LMSimilarity.DefaultCollectionModel	Models `p(w|C)` as the number of occurrences of the term in the collection, divided by the total number of tokens `+ 1`.
LMSimilarity.LMStats	Stores the collection distribution of the current term.
MultiSimilarity	Implements the CombSUM method for combining evidence from multiple similarity values.
Normalization	This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework.
Normalization.NoNormalization	Implementation used when there is no normalization.
NormalizationH1	Normalization model that assumes a uniform distribution of the term frequency.
NormalizationH2	Normalization model in which the term frequency is inversely related to the length.
NormalizationH3	Dirichlet Priors normalization.
NormalizationZ	Pareto-Zipf Normalization.
PerFieldSimilarityWrapper	Provides the ability to use a different `Similarity` for different fields.
Similarity	Similarity defines the components of Lucene scoring.
Similarity.SimScorer	API for scoring "sloppy" queries such as `TermQuery`, `SpanQuery`, and `PhraseQuery`.
Similarity.SimWeight	Stores the weight for a query across the indexed collection.
SimilarityBase	A subclass of `Similarity` that provides a simplified API for its descendants.
TFIDFSimilarity	Implementation of `Similarity` with the Vector Space Model.
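
The two lambda rows in the table are easy to misread, because the numerator and the denominator are each incremented before the division. The sketch below spells out that arithmetic in plain Java. It does not call Lucene; the class name `LambdaSketch` and its method names are hypothetical and chosen only for illustration.

```java
/** Stand-alone arithmetic sketch; not part of the Lucene API. */
public final class LambdaSketch {
    private LambdaSketch() {}

    /** Lambda from document frequency: (docFreq + 1) / (numberOfDocuments + 1). */
    public static float lambdaDF(long docFreq, long numberOfDocuments) {
        return (docFreq + 1F) / (numberOfDocuments + 1F);
    }

    /** Lambda from total term frequency: (totalTermFreq + 1) / (numberOfDocuments + 1). */
    public static float lambdaTTF(long totalTermFreq, long numberOfDocuments) {
        return (totalTermFreq + 1F) / (numberOfDocuments + 1F);
    }

    public static void main(String[] args) {
        // A term found in 9 of 99 documents: (9 + 1) / (99 + 1).
        System.out.println(lambdaDF(9, 99));
        // The same term occurring 24 times overall: (24 + 1) / (99 + 1).
        System.out.println(lambdaTTF(24, 99));
    }
}
```

Adding one to both counts keeps the ratio strictly positive even for a term that never occurs.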

Package org.apache.lucene.search.similarities Description

This package contains the various ranking models that can be used in Lucene. The abstract class Similarity serves as the base for ranking functions. For searching, users can employ the models already implemented or create their own by extending one of the classes in this package.

Summary of the Ranking Methods
Changing Similarity

Summary of the Ranking Methods

DefaultSimilarity is the original Lucene scoring function, based on a highly optimized Vector Space Model. It is deprecated in this release in favor of ClassicSimilarity, which has equivalent behavior. For more information, see TFIDFSimilarity.

BM25Similarity is an optimized implementation of the successful Okapi BM25 model.

SimilarityBase provides a basic implementation of the Similarity contract and exposes a highly simplified interface, which makes it an ideal starting point for new ranking functions. Lucene ships the following methods built on SimilarityBase:

Amati and Rijsbergen's DFR framework;
Clinchant and Gaussier's Information-based models for IR;
The implementation of two language models from Zhai and Lafferty's paper;
Divergence from independence models as described in "IRRA at TREC 2012" (Dinçer).

Since SimilarityBase is not optimized to the same extent as DefaultSimilarity and BM25Similarity, a difference in performance is to be expected when using the methods listed above. However, optimizations can always be implemented in subclasses; see below.

Changing Similarity

Chances are the available Similarities are sufficient for all your searching needs. However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to distinguish between shorter and longer documents (see a "fair" similarity).

To change Similarity, one must do so for both indexing and searching, and the change must be made before either of these actions takes place. Although in theory nothing stops you from changing mid-stream, the resulting behavior is not well-defined.

To make this change, implement your own Similarity (likely you'll want to simply subclass an existing method, be it DefaultSimilarity or a descendant of SimilarityBase), and then register the new class by calling IndexWriterConfig.setSimilarity(Similarity) before indexing and IndexSearcher.setSimilarity(Similarity) before searching.

Extending SimilarityBase

The easiest way to quickly implement a new ranking method is to extend SimilarityBase, which provides basic implementations of the low-level Similarity methods. Subclasses are only required to implement the SimilarityBase.score(BasicStats, float, float) and SimilarityBase.toString() methods.

Another option is to extend one of the frameworks based on SimilarityBase. These Similarities are implemented modularly, e.g. DFRSimilarity delegates computation of the three parts of its formula to the classes BasicModel, AfterEffect and Normalization. Instead of subclassing the Similarity, one can simply introduce a new basic model and tell DFRSimilarity to use it.
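
The modular design described above can be pictured with a small stand-alone sketch. It does not use Lucene: the three nested interfaces are hypothetical stand-ins for BasicModel, AfterEffect and Normalization, their signatures are simplified, the component formulas are toy placeholders, and combining the parts by multiplication is an assumption made for illustration.

```java
/** Stand-alone sketch of a pluggable three-part score; not the Lucene API. */
public final class ModularScoreSketch {

    /** Hypothetical stand-in for a basic model component. */
    public interface BasicModelPart {
        float score(float tfn);
    }

    /** Hypothetical stand-in for an after-effect component. */
    public interface AfterEffectPart {
        float score(float tfn);
    }

    /** Hypothetical stand-in for a term-frequency normalization component. */
    public interface NormalizationPart {
        float tfn(float freq, float docLen, float avgDocLen);
    }

    private final BasicModelPart basicModel;
    private final AfterEffectPart afterEffect;
    private final NormalizationPart normalization;

    public ModularScoreSketch(BasicModelPart basicModel,
                              AfterEffectPart afterEffect,
                              NormalizationPart normalization) {
        this.basicModel = basicModel;
        this.afterEffect = afterEffect;
        this.normalization = normalization;
    }

    /** Normalizes the term frequency, then multiplies the two component scores (assumed combination). */
    public float score(float freq, float docLen, float avgDocLen) {
        float tfn = normalization.tfn(freq, docLen, avgDocLen);
        return basicModel.score(tfn) * afterEffect.score(tfn);
    }

    public static void main(String[] args) {
        // Toy components, shared by both scorers below.
        AfterEffectPart afterEffect = tfn -> 1f / (tfn + 1f);
        NormalizationPart normalization = (freq, docLen, avgDocLen) -> freq * avgDocLen / docLen;

        // Only the basic model differs; the scorer class itself is not subclassed.
        ModularScoreSketch logModel =
            new ModularScoreSketch(tfn -> (float) Math.log1p(tfn), afterEffect, normalization);
        ModularScoreSketch linearModel =
            new ModularScoreSketch(tfn -> 2f * tfn, afterEffect, normalization);

        System.out.println(logModel.score(2f, 10f, 20f));
        System.out.println(linearModel.score(2f, 10f, 20f));
    }
}
```

The point of the sketch is the shape of the design: a new ranking variant is obtained by passing a different component to the constructor rather than by writing a new subclass.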

Changing DefaultSimilarity

If you are interested in use cases for changing your similarity, see the "Overriding Similarity" thread on the Lucene users' mailing list. In summary, here are a few use cases:

The SweetSpotSimilarity in org.apache.lucene.misc gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.
Changing Length Normalization — By overriding Similarity.computeNorm(org.apache.lucene.index.FieldInvertState state), it is possible to discount how the length of a field contributes to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
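
The arithmetic behind the last two use cases can be sketched in plain Java. The code below does not call Lucene and ignores how norm values are encoded before being stored; the class `LengthNormSketch` and its method names are hypothetical. It only compares the two length-normalization formulas quoted above and shows a tf function that always returns 1.

```java
/** Stand-alone arithmetic sketch; not part of the Lucene API. */
public final class LengthNormSketch {
    private LengthNormSketch() {}

    /** Length norm as quoted for DefaultSimilarity: 1 / (numTerms)^0.5. */
    public static float sqrtLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** The alternative quoted above: 1 / numTerms. */
    public static float linearLengthNorm(int numTerms) {
        return 1.0f / numTerms;
    }

    /** A tf that ignores the frequency: every match contributes the same amount. */
    public static float constantTf(float freq) {
        return 1.0f;
    }

    public static void main(String[] args) {
        int[] fieldLengths = {4, 100, 10000};
        for (int numTerms : fieldLengths) {
            System.out.println(numTerms + " terms: sqrt norm = " + sqrtLengthNorm(numTerms)
                + ", linear norm = " + linearLengthNorm(numTerms));
        }
        System.out.println("tf(1) = " + constantTf(1f) + ", tf(50) = " + constantTf(50f));
    }
}
```

With the linear formula the norm falls in direct proportion to the field length, so a field 25 times longer receives a norm 25 times smaller, whereas the square-root formula shrinks it only by a factor of 5.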

In general, Chris Hostetter sums it up best in saying (from the Lucene users' mailing list):

[One would override the Similarity in] ... any situation where you know more about your data than just that it's "text" is a situation where it *might* make sense to override your Similarity method.
