Package org.apache.lucene.search.similarities
Similarity serves as the base for ranking functions. For searching, users can employ the
models already implemented or create their own by extending one of the classes in this package.
Summary of the Ranking Methods
BM25Similarity is an optimized implementation of the successful Okapi BM25 model.
ClassicSimilarity is the original Lucene scoring function. It is based on the Vector Space
Model. For more information, see TFIDFSimilarity.
SimilarityBase provides a basic implementation of the Similarity contract and exposes a
highly simplified interface, which makes it an ideal starting point for new ranking functions.
Lucene ships the following methods built on SimilarityBase:
- Amati and Rijsbergen's DFR framework;
- Clinchant and Gaussier's information-based models for IR;
- two language models from Zhai and Lafferty's paper;
- divergence from independence models as described in "IRRA at TREC 2012" (Dinçer).
Since SimilarityBase is not optimized to the same extent as ClassicSimilarity and
BM25Similarity, a difference in performance is to be expected when using the methods listed
above. However, optimizations can always be implemented in subclasses; see below.
Changing Similarity
Chances are the available Similarities are sufficient for all your searching needs. However,
in some applications it may be necessary to customize your Similarity implementation. For
instance, some applications do not need to distinguish between shorter and longer documents
and could set BM25's b parameter to 0.
To switch to a Similarity that encodes the length normalization differently, one must do so
for both indexing and searching, and the change must happen before either of these actions
takes place. Note that all of Lucene's built-in similarities - and more generally all
Similarity sub-classes that don't override
Similarity.computeNorm(org.apache.lucene.index.FieldInvertState) - encode the length
normalization factor the same way, so it is fine to change the similarity at search-time
without recreating the index.
To make this change, implement your own Similarity (likely you'll want to simply subclass
SimilarityBase), and then register the new class by calling
IndexWriterConfig.setSimilarity(Similarity) before indexing and
IndexSearcher.setSimilarity(Similarity) before searching.
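As a sketch of both registration calls together, the following wires a BM25Similarity with
b = 0 (the length-normalization-free variant mentioned above) into indexing and searching;
the class name and in-memory directory are illustrative:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class CustomSimilarityExample {
  public static void main(String[] args) throws Exception {
    // BM25 with b = 0: document length no longer normalizes term frequency.
    Similarity similarity = new BM25Similarity(1.2f, 0f);

    Directory dir = new ByteBuffersDirectory();

    // Register the similarity before indexing...
    IndexWriterConfig config = new IndexWriterConfig();
    config.setSimilarity(similarity);
    try (IndexWriter writer = new IndexWriter(dir, config)) {
      writer.commit();
    }

    // ...and again before searching.
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      searcher.setSimilarity(similarity);
    }
  }
}
```

Using the same Similarity instance for both sides keeps the two phases consistent.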
Tuning BM25Similarity
BM25Similarity has two parameters that may be tuned:
- k1, which calibrates term frequency saturation and must be non-negative. A value of 0
makes term frequency completely ignored, so that documents are scored only on the IDF of
the matched terms. Higher values of k1 increase the impact of term frequency on the final
score. The default value is 1.2.
- b, which controls how much document length should normalize term frequency values and
must be in [0, 1]. A value of 0 disables length normalization completely. The default
value is 0.75.
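Both parameters are set through the two-argument constructor; the particular tuned values
below are just an example, not a recommendation:

```java
import org.apache.lucene.search.similarities.BM25Similarity;

public class BM25Tuning {
  public static void main(String[] args) {
    // No-arg constructor uses the defaults: k1 = 1.2, b = 0.75.
    BM25Similarity defaults = new BM25Similarity();

    // Example tuning: stronger term-frequency saturation (lower k1)
    // and weaker length normalization (lower b).
    BM25Similarity tuned = new BM25Similarity(0.9f, 0.4f);

    System.out.println("defaults: k1=" + defaults.getK1() + " b=" + defaults.getB());
    System.out.println("tuned:    k1=" + tuned.getK1() + " b=" + tuned.getB());
  }
}
```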
Extending SimilarityBase
The easiest way to quickly implement a new ranking method is to extend SimilarityBase,
which provides basic implementations of the low-level mechanics of the Similarity contract.
Subclasses are only required to implement the
SimilarityBase.score(BasicStats, double, double) and SimilarityBase.toString() methods.
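A minimal sketch of such a subclass follows; the scoring formula (a log-smoothed idf
multiplied by the raw term frequency) is purely illustrative, not a method shipped with
Lucene:

```java
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

// Illustrative ranking method: log-smoothed idf times raw term frequency.
public class SimpleIdfTfSimilarity extends SimilarityBase {

  @Override
  protected double score(BasicStats stats, double freq, double docLen) {
    // BasicStats exposes the corpus-level statistics commonly used by
    // ranking methods, e.g. the document count and document frequency.
    double idf = Math.log(
        (stats.getNumberOfDocuments() + 1.0) / (stats.getDocFreq() + 1.0));
    // Scores must stay non-negative; this formula guarantees that since
    // numberOfDocuments >= docFreq.
    return stats.getBoost() * idf * freq;
  }

  @Override
  public String toString() {
    return "SimpleIdfTf";
  }
}
```

An instance of this class can then be registered for indexing and searching exactly like
any built-in Similarity.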
Another option is to extend one of the frameworks based on SimilarityBase. These
Similarities are implemented modularly, e.g. DFRSimilarity delegates computation of the
three parts of its formula to the classes BasicModel, AfterEffect and Normalization.
Instead of subclassing the Similarity, one can simply introduce a new basic model and tell
DFRSimilarity to use it.
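As a sketch, composing DFRSimilarity from its three components looks like the following;
the particular choice of basic model, after-effect, and normalization here is just one of
the available combinations:

```java
import org.apache.lucene.search.similarities.AfterEffectB;
import org.apache.lucene.search.similarities.BasicModelG;
import org.apache.lucene.search.similarities.DFRSimilarity;
import org.apache.lucene.search.similarities.NormalizationH2;

public class DfrComposition {
  public static void main(String[] args) {
    // Geometric Bose-Einstein basic model, Bernoulli after-effect,
    // and H2 term-frequency normalization.
    DFRSimilarity dfr = new DFRSimilarity(
        new BasicModelG(), new AfterEffectB(), new NormalizationH2());
    System.out.println(dfr);
  }
}
```

Swapping in a custom BasicModel subclass changes only that part of the formula, which is
the point of the modular design.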