Similarity serves as the base for ranking functions. For searching, users can employ the models already implemented or create their own by extending one of the classes in this package.
Summary of the Ranking Methods
BM25Similarity is an optimized
implementation of the successful Okapi BM25 model.
SimilarityBase provides a basic
implementation of the Similarity contract and exposes a highly simplified
interface, which makes it an ideal starting point for new ranking functions.
Lucene ships the following methods built on SimilarityBase:
- Amati and Rijsbergen's DFR framework;
- Clinchant and Gaussier's Information-based models for IR;
- The implementation of two language models from Zhai and Lafferty's paper;
- Divergence from independence models as described in "IRRA at TREC 2012" (Dinçer).
Since SimilarityBase is not optimized to the same extent as
BM25Similarity, a difference in performance is to be expected when using the methods listed above. However, optimizations can always be implemented in subclasses; see below.
Chances are the available Similarities are sufficient for all
your searching needs.
However, in some applications it may be necessary to customize your Similarity implementation. For instance, some
applications do not need to distinguish between shorter and longer documents
and could set BM25's b parameter to 0.
To change the Similarity, one must do so for both indexing and
searching, and the change must happen before
either of these actions takes place. Although in theory nothing stops you from changing mid-stream,
the resulting behavior is not well-defined.
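To make this change, implement your own Similarity (most likely
you'll want to simply subclass SimilarityBase), and
then register the new class by calling IndexWriterConfig.setSimilarity(Similarity)
before indexing and IndexSearcher.setSimilarity(Similarity) before searching.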
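Registering the same Similarity on both the indexing and the searching side could look like the sketch below. It assumes Lucene 8 or later on the classpath; the class SimilarityRegistration and its two helper methods are hypothetical names introduced only for illustration, and the BM25 parameters are example values.

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.Similarity;

/** Hypothetical helper: keeps indexing and searching on one Similarity. */
public final class SimilarityRegistration {
  private SimilarityRegistration() {}

  /** Shared instance; k1 = 1.2 and b = 0 (no length normalization) are example values. */
  public static final Similarity SIMILARITY = new BM25Similarity(1.2f, 0f);

  /** Writer configuration with the Similarity set before any document is indexed. */
  public static IndexWriterConfig newWriterConfig() {
    // The no-argument constructor uses Lucene's default analyzer.
    return new IndexWriterConfig().setSimilarity(SIMILARITY);
  }

  /** Searcher with the same Similarity set before any query is run. */
  public static IndexSearcher newSearcher(IndexReader reader) {
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(SIMILARITY);
    return searcher;
  }
}
```

Sharing a single instance is one simple way to make sure the writer and the searcher cannot drift apart.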
BM25Similarity has two parameters that may be tuned:
- k1, which calibrates term frequency saturation and must be positive or zero. A value of 0 causes term frequency to be ignored entirely, so documents are scored only on the IDF of the matched terms. Higher values of k1 increase the impact of term frequency on the final score. The default value is 1.2.
- b, which controls how much document length should normalize term frequency values and must be in [0, 1]. A value of 0 disables length normalization completely. The default value is 0.75.
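The effect of the two parameters can be checked with a small standalone calculation. The sketch below uses the commonly cited Okapi BM25 term-frequency factor, not Lucene's actual code, which may differ in constant factors; the class and method names are hypothetical.

```java
/** Hypothetical standalone sketch of the textbook BM25 term-frequency factor. */
public final class Bm25TermFrequency {
  private Bm25TermFrequency() {}

  /**
   * Computes tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)).
   *
   * @param tf        term frequency in the document
   * @param docLen    length of the document
   * @param avgDocLen average document length in the collection
   * @param k1        term frequency saturation, must be >= 0
   * @param b         length normalization strength, must be in [0, 1]
   */
  public static double tfFactor(double tf, double docLen, double avgDocLen, double k1, double b) {
    if (k1 < 0 || b < 0 || b > 1) {
      throw new IllegalArgumentException("k1 must be >= 0 and b must be in [0, 1]");
    }
    // Equals 1 when b = 0, so document length drops out of the formula.
    double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
    return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
  }
}
```

With k1 = 0 the factor is 1 for any positive tf, so only the IDF part of the score remains. With b = 0 a short and a long document containing the term equally often receive the same factor.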
The easiest way to quickly implement a new ranking method is to extend
SimilarityBase, which provides
basic implementations of the low-level methods. Subclasses are only required to
implement SimilarityBase.score(BasicStats, double, double) and SimilarityBase.toString().
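A minimal subclass might look like the following sketch. It assumes Lucene 8 or later, where score takes double arguments as in the signature above; the class name and the formula (a saturating term frequency multiplied by a logarithmic IDF) are illustrative choices, not a model shipped with Lucene.

```java
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

/** Hypothetical example: saturating tf times a logarithmic idf. */
public class SaturatedTfIdfSimilarity extends SimilarityBase {

  @Override
  protected double score(BasicStats stats, double freq, double docLen) {
    // Rare terms (small docFreq) get a larger weight; the value is always positive.
    double idf = Math.log(1.0 + (double) stats.getNumberOfDocuments() / stats.getDocFreq());
    // Grows with freq but never exceeds 1, so repeated terms saturate.
    double tf = freq / (freq + 1.0);
    // docLen is deliberately unused: this sketch applies no length normalization.
    return stats.getBoost() * tf * idf;
  }

  @Override
  public String toString() {
    return "SaturatedTfIdf";
  }
}
```

The formula is kept non-negative and non-decreasing in freq, which recent Lucene versions expect from scoring functions.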
Another option is to extend one of the frameworks
based on SimilarityBase. These
Similarities are implemented modularly, e.g.
DFRSimilarity delegates
computation of the three parts of its formula to the classes
BasicModel, AfterEffect and Normalization. Instead of
subclassing the Similarity, one can simply introduce a new basic model and tell
DFRSimilarity to use it.
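As a sketch of this modular design, a DFRSimilarity can be assembled from three of the components listed in the class summary. The wrapper class DfrExample is a hypothetical name, and the particular combination is only an example; a custom BasicModel subclass would be passed in the same position as BasicModelG.

```java
import org.apache.lucene.search.similarities.AfterEffectL;
import org.apache.lucene.search.similarities.BasicModelG;
import org.apache.lucene.search.similarities.DFRSimilarity;
import org.apache.lucene.search.similarities.NormalizationH2;

/** Hypothetical helper that assembles a DFR similarity from its three parts. */
public final class DfrExample {
  private DfrExample() {}

  public static DFRSimilarity geometricDfr() {
    return new DFRSimilarity(
        new BasicModelG(),       // basic model of information content
        new AfterEffectL(),      // first normalization (information gain)
        new NormalizationH2());  // term frequency (length) normalization
  }
}
```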
Interface Summary
- LMSimilarity.CollectionModel: A strategy for computing the collection language model.
Class Summary
- AfterEffect: This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework.
- AfterEffectB: Model of the information gain based on the ratio of two Bernoulli processes.
- AfterEffectL: Model of the information gain based on Laplace's law of succession.
- Axiomatic: Axiomatic approaches for IR.
- AxiomaticF1EXP: F1EXP is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq.
- AxiomaticF1LOG: F1LOG is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq.
- AxiomaticF2EXP: F2EXP is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq.
- AxiomaticF2LOG: F2LOG is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq.
- AxiomaticF3EXP: F3EXP is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq, gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.
- AxiomaticF3LOG: F3LOG is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq, gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.
- BasicModel: This class acts as the base class for the specific basic model implementations in the DFR framework.
- BasicModelG: Geometric as limiting form of the Bose-Einstein model.
- BasicModelIF: An approximation of the I(ne) model.
- BasicModelIn: The basic tf-idf model of randomness.
- BasicModelIne: Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency.
- BasicStats: Stores all statistics commonly used by ranking methods.
- BM25Similarity: BM25 Similarity.
- BooleanSimilarity: Simple similarity that gives terms a score that is equal to their query boost.
- ClassicSimilarity: Expert: Historical scoring implementation.
- DFISimilarity: Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf).
- DFRSimilarity: Implements the divergence from randomness (DFR) framework introduced by Gianni Amati and Cornelis Joost Van Rijsbergen.
- Distribution: The probabilistic distribution used to model term occurrence in information-based models.
- DistributionLL: Log-logistic distribution.
- DistributionSPL: The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper.
- IBSimilarity: Provides a framework for the family of information-based models, as described by Stéphane Clinchant and Eric Gaussier.
- Independence: Computes the measure of divergence from independence for DFI scoring functions.
- IndependenceChiSquared: Normalized chi-squared measure of distance from independence.
- IndependenceSaturated: Saturated measure of distance from independence.
- IndependenceStandardized: Standardized measure of distance from independence.
- Lambda: The lambda (λw) parameter in information-based models.
- LambdaDF: Computes lambda as (docFreq+1) / (numberOfDocuments+1).
- LambdaTTF: Computes lambda as (totalTermFreq+1) / (numberOfDocuments+1).
- LMDirichletSimilarity: Bayesian smoothing using Dirichlet priors.
- LMJelinekMercerSimilarity: Language model based on the Jelinek-Mercer smoothing method.
- LMSimilarity: Abstract superclass for language modeling Similarities.
- LMSimilarity.DefaultCollectionModel: Models p(w|C) as the number of occurrences of the term in the collection, divided by the total number of tokens.
- LMSimilarity.LMStats: Stores the collection distribution of the current term.
- MultiSimilarity: Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A.
- Normalization: This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework.
- Normalization.NoNormalization: Implementation used when there is no normalization.
- NormalizationH1: Normalization model that assumes a uniform distribution of the term frequency.
- NormalizationH2: Normalization model in which the term frequency is inversely related to the length.
- NormalizationH3: Dirichlet Priors normalization.
- NormalizationZ: Pareto-Zipf Normalization.
- PerFieldSimilarityWrapper: Provides the ability to use a different Similarity for different fields.
- Similarity: Similarity defines the components of Lucene scoring.
- Similarity.SimScorer: Stores the weight for a query across the indexed collection.
- SimilarityBase: A subclass of Similarity that provides a simplified API for its descendants.
- TFIDFSimilarity: Implementation of Similarity with the Vector Space Model.