Package org.apache.lucene.search.similarities
Similarity serves as the base for ranking functions. For searching, users can employ the models
already implemented or create their own by extending one of the classes in this package.
Summary of the Ranking Methods
BM25Similarity is an optimized implementation of the successful Okapi BM25 model.
ClassicSimilarity is the original Lucene scoring function. It is based on the Vector Space Model.
For more information, see TFIDFSimilarity.
SimilarityBase provides a basic implementation of the Similarity contract and exposes a highly
simplified interface, which makes it an ideal starting point for new ranking functions. Lucene
ships the following methods built on SimilarityBase:
- Amati and Rijsbergen's DFR framework;
- Clinchant and Gaussier's Information-based models for IR;
- the implementation of two language models from Zhai and Lafferty's paper;
- Divergence from independence models as described in "IRRA at TREC 2012" (Dinçer).
Because SimilarityBase is not optimized to the same extent as ClassicSimilarity and
BM25Similarity, a difference in performance is to be expected when using the methods listed above.
However, optimizations can always be implemented in subclasses; see below.
Changing Similarity
Chances are the available Similarities are sufficient for all your searching needs. However,
in some applications it may be necessary to customize your Similarity implementation. For instance,
some applications do not need to distinguish between shorter and longer documents and could set
BM25's b parameter to 0.
To change Similarity, one must do so for both indexing and searching, and the change must happen
before either of these actions takes place. Although in theory nothing stops you from changing
mid-stream, the resulting behavior is not well-defined.
To make this change, implement your own Similarity (likely you'll want to simply subclass
SimilarityBase), and then register the new class by calling
IndexWriterConfig.setSimilarity(Similarity) before indexing and
IndexSearcher.setSimilarity(Similarity) before searching.
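The registration steps above can be sketched as follows. This is a minimal illustration that assumes Lucene 8 or later on the classpath; the in-memory ByteBuffersDirectory, the StandardAnalyzer, the field name "body" and the class name are illustrative choices, not something this package prescribes. It uses BM25Similarity with b set to 0 and installs the same instance on both the writer configuration and the searcher.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical example class; only the two setSimilarity calls come from the text above.
public class ChangeSimilarityExample {
  public static void main(String[] args) throws Exception {
    // k1 keeps its default of 1.2; b = 0 disables length normalization.
    Similarity similarity = new BM25Similarity(1.2f, 0f);

    try (Directory dir = new ByteBuffersDirectory()) {
      // Index time: set the Similarity on the config before creating the writer.
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      config.setSimilarity(similarity);
      try (IndexWriter writer = new IndexWriter(dir, config)) {
        Document doc = new Document();
        doc.add(new TextField("body", "lucene ranking functions", Field.Store.NO));
        writer.addDocument(doc);
      }

      // Search time: set the same Similarity before running any query.
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(similarity);
      }
    }
  }
}
```

Sharing one Similarity instance between the two calls is a simple way to keep index-time and search-time settings from drifting apart.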
Tuning BM25Similarity
BM25Similarity has two parameters that may be tuned:
- k1, which calibrates term frequency saturation and must be non-negative. A value of 0 makes
term frequency completely ignored, so documents are scored only on the IDF of the matched terms.
Higher values of k1 increase the impact of term frequency on the final score. The default value
is 1.2.
- b, which controls how much document length should normalize term frequency values and must be
in [0, 1]. A value of 0 disables length normalization completely. The default value is 0.75.
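To make the effect of the two parameters concrete, here is a small sketch using only the Java standard library. It evaluates the textbook BM25 term-frequency factor tf / (tf + k1 * (1 - b + b * docLen / avgDocLen)). The class and method names are hypothetical, and this is not Lucene's internal code, which also applies IDF and boost and works with encoded document lengths.

```java
// Hypothetical illustration of BM25's saturating term-frequency factor.
public class Bm25TfSketch {

  /** Returns tf / (tf + k1 * (1 - b + b * docLen / avgDocLen)). */
  public static double tfFactor(double tf, double docLen, double avgDocLen, double k1, double b) {
    if (k1 < 0 || b < 0 || b > 1) {
      throw new IllegalArgumentException("k1 must be >= 0 and b must be in [0, 1]");
    }
    // Length normalization: 1 when b = 0, docLen / avgDocLen when b = 1.
    double lengthNorm = 1 - b + b * docLen / avgDocLen;
    return tf / (tf + k1 * lengthNorm);
  }

  public static void main(String[] args) {
    // k1 = 0: the factor is the same for tf = 1 and tf = 9.
    System.out.println(tfFactor(1, 100, 100, 0.0, 0.75));
    System.out.println(tfFactor(9, 100, 100, 0.0, 0.75));
    // b = 0: a short and a long document get the same factor.
    System.out.println(tfFactor(3, 50, 100, 1.2, 0.0));
    System.out.println(tfFactor(3, 500, 100, 1.2, 0.0));
    // Defaults: the longer document gets the smaller factor for the same tf.
    System.out.println(tfFactor(3, 50, 100, 1.2, 0.75));
    System.out.println(tfFactor(3, 500, 100, 1.2, 0.75));
  }
}
```

With k1 = 0 the factor no longer depends on tf, and with b = 0 it no longer depends on document length, which matches the parameter descriptions above.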
Extending SimilarityBase
The easiest way to quickly implement a new ranking method is to extend SimilarityBase, which
provides basic implementations for the low-level Similarity methods. Subclasses are only required
to implement the SimilarityBase.score(BasicStats, double, double) and SimilarityBase.toString()
methods.
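A subclass along these lines might look like the sketch below. It assumes Lucene 8 or later, where score takes and returns double. The class name SimpleTfIdfSimilarity and the formula (a plain tf * idf product with no length normalization) are hypothetical and chosen only to keep the example short.

```java
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

/** Hypothetical example: a plain tf * idf score that ignores document length. */
public class SimpleTfIdfSimilarity extends SimilarityBase {

  @Override
  protected double score(BasicStats stats, double freq, double docLen) {
    // idf computed from the collection statistics gathered by SimilarityBase.
    double idf = Math.log(1.0 + (double) stats.getNumberOfDocuments() / stats.getDocFreq());
    // docLen is deliberately unused in this sketch.
    return stats.getBoost() * freq * idf;
  }

  @Override
  public String toString() {
    return "SimpleTfIdf";
  }
}
```

An instance of such a class can then be registered with IndexWriterConfig.setSimilarity(Similarity) and IndexSearcher.setSimilarity(Similarity) like any other Similarity.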
Another option is to extend one of the frameworks based on SimilarityBase. These Similarities are
implemented modularly, e.g. DFRSimilarity delegates computation of the three parts of its formula
to the classes BasicModel, AfterEffect and Normalization. Instead of subclassing the Similarity,
one can simply introduce a new basic model and tell DFRSimilarity to use it.
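As a brief sketch of this modular style, assuming Lucene 8 or later on the classpath, the snippet below assembles a DFRSimilarity from three components that ship in this package. The particular combination and the class name DfrCompositionExample are arbitrary illustrative choices; a custom BasicModel subclass would be passed in the same position as BasicModelG.

```java
import org.apache.lucene.search.similarities.AfterEffectL;
import org.apache.lucene.search.similarities.BasicModelG;
import org.apache.lucene.search.similarities.DFRSimilarity;
import org.apache.lucene.search.similarities.NormalizationH2;
import org.apache.lucene.search.similarities.Similarity;

// Hypothetical example class showing how DFRSimilarity is composed from its parts.
public class DfrCompositionExample {
  public static void main(String[] args) {
    // Basic model, after effect and normalization are chosen independently.
    Similarity dfr =
        new DFRSimilarity(new BasicModelG(), new AfterEffectL(), new NormalizationH2());
    // Prints the textual description of the assembled similarity.
    System.out.println(dfr);
  }
}
```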
Interface Summary
- LMSimilarity.CollectionModel: A strategy for computing the collection language model.

Class Summary
- AfterEffect: This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework.
- AfterEffectB: Model of the information gain based on the ratio of two Bernoulli processes.
- AfterEffectL: Model of the information gain based on Laplace's law of succession.
- Axiomatic: Axiomatic approaches for IR.
- AxiomaticF1EXP: F1EXP is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq.
- AxiomaticF1LOG: F1LOG is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq.
- AxiomaticF2EXP: F2EXP is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq.
- AxiomaticF2LOG: F2LOG is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq.
- AxiomaticF3EXP: F3EXP is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq, gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.
- AxiomaticF3LOG: F3LOG is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq, gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.
- BasicModel: This class acts as the base class for the specific basic model implementations in the DFR framework.
- BasicModelG: Geometric as limiting form of the Bose-Einstein model.
- BasicModelIF: An approximation of the I(ne) model.
- BasicModelIn: The basic tf-idf model of randomness.
- BasicModelIne: Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency.
- BasicStats: Stores all statistics commonly used by ranking methods.
- BM25Similarity: BM25 Similarity.
- BooleanSimilarity: Simple similarity that gives terms a score that is equal to their query boost.
- ClassicSimilarity: Expert: Historical scoring implementation.
- DFISimilarity: Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf).
- DFRSimilarity: Implements the divergence from randomness (DFR) framework introduced in Gianni Amati and Cornelis Joost Van Rijsbergen.
- Distribution: The probabilistic distribution used to model term occurrence in information-based models.
- DistributionLL: Log-logistic distribution.
- DistributionSPL: The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper.
- IBSimilarity: Provides a framework for the family of information-based models, as described in Stéphane Clinchant and Eric Gaussier.
- Independence: Computes the measure of divergence from independence for DFI scoring functions.
- IndependenceChiSquared: Normalized chi-squared measure of distance from independence.
- IndependenceSaturated: Saturated measure of distance from independence.
- IndependenceStandardized: Standardized measure of distance from independence.
- IndriDirichletSimilarity: Bayesian smoothing using Dirichlet priors as implemented in the Indri Search engine (http://www.lemurproject.org/indri.php).
- IndriDirichletSimilarity.IndriCollectionModel: Models p(w|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1.
- Lambda: The lambda (λw) parameter in information-based models.
- LambdaDF: Computes lambda as (docFreq+1) / (numberOfDocuments+1).
- LambdaTTF: Computes lambda as (totalTermFreq+1) / (numberOfDocuments+1).
- LMDirichletSimilarity: Bayesian smoothing using Dirichlet priors.
- LMJelinekMercerSimilarity: Language model based on the Jelinek-Mercer smoothing method.
- LMSimilarity: Abstract superclass for language modeling Similarities.
- LMSimilarity.DefaultCollectionModel: Models p(w|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1.
- LMSimilarity.LMStats: Stores the collection distribution of the current term.
- MultiSimilarity: Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A.
- Normalization: This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework.
- Normalization.NoNormalization: Implementation used when there is no normalization.
- NormalizationH1: Normalization model that assumes a uniform distribution of the term frequency.
- NormalizationH2: Normalization model in which the term frequency is inversely related to the length.
- NormalizationH3: Dirichlet Priors normalization.
- NormalizationZ: Pareto-Zipf Normalization.
- PerFieldSimilarityWrapper: Provides the ability to use a different Similarity for different fields.
- Similarity: Similarity defines the components of Lucene scoring.
- Similarity.SimScorer: Stores the weight for a query across the indexed collection.
- SimilarityBase: A subclass of Similarity that provides a simplified API for its descendants.
- TFIDFSimilarity: Implementation of Similarity with the Vector Space Model.