| Interface | Description | 
|---|---|
| LMSimilarity.CollectionModel | A strategy for computing the collection language model. | 

| Class | Description |
|---|---|
| AfterEffect | This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework. |
| AfterEffectB | Model of the information gain based on the ratio of two Bernoulli processes. |
| AfterEffectL | Model of the information gain based on Laplace's law of succession. |
| Axiomatic | Axiomatic approaches for IR. |
| AxiomaticF1EXP | F1EXP is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq. |
| AxiomaticF1LOG | F1LOG is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq. |
| AxiomaticF2EXP | F2EXP is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq. |
| AxiomaticF2LOG | F2LOG is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq. |
| AxiomaticF3EXP | F3EXP is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq, and gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores. |
| AxiomaticF3LOG | F3LOG is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq, and gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores. |
| BasicModel | This class acts as the base class for the specific basic model implementations in the DFR framework. |
| BasicModelG | Geometric as limiting form of the Bose-Einstein model. |
| BasicModelIF | An approximation of the I(ne) model. |
| BasicModelIn | The basic tf-idf model of randomness. |
| BasicModelIne | Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency. |
| BasicStats | Stores all statistics commonly used by ranking methods. |
| BM25Similarity | BM25 Similarity. |
| BooleanSimilarity | Simple similarity that gives terms a score that is equal to their query boost. |
| ClassicSimilarity | Expert: Historical scoring implementation. |
| DFISimilarity | Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf). |
| DFRSimilarity | Implements the divergence from randomness (DFR) framework introduced by Gianni Amati and Cornelis Joost Van Rijsbergen. |
| Distribution | The probabilistic distribution used to model term occurrence in information-based models. |
| DistributionLL | Log-logistic distribution. |
| DistributionSPL | The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper. |
| IBSimilarity | Provides a framework for the family of information-based models, as described by Stéphane Clinchant and Eric Gaussier. |
| Independence | Computes the measure of divergence from independence for DFI scoring functions. |
| IndependenceChiSquared | Normalized chi-squared measure of distance from independence |
| IndependenceSaturated | Saturated measure of distance from independence |
| IndependenceStandardized | Standardized measure of distance from independence |
| Lambda | The lambda (λw) parameter in information-based models. |
| LambdaDF | Computes lambda as (docFreq+1) / (numberOfDocuments+1). |
| LambdaTTF | Computes lambda as (totalTermFreq+1) / (numberOfDocuments+1). |
| LMDirichletSimilarity | Bayesian smoothing using Dirichlet priors. |
| LMJelinekMercerSimilarity | Language model based on the Jelinek-Mercer smoothing method. |
| LMSimilarity | Abstract superclass for language modeling Similarities. |
| LMSimilarity.DefaultCollectionModel | Models p(w\|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1. |
| LMSimilarity.LMStats | Stores the collection distribution of the current term. |
| MultiSimilarity | Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A. |
| Normalization | This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework. |
| Normalization.NoNormalization | Implementation used when there is no normalization. |
| NormalizationH1 | Normalization model that assumes a uniform distribution of the term frequency. |
| NormalizationH2 | Normalization model in which the term frequency is inversely related to the length. |
| NormalizationH3 | Dirichlet Priors normalization |
| NormalizationZ | Pareto-Zipf Normalization |
| PerFieldSimilarityWrapper | Provides the ability to use a different Similarity for different fields. |
| Similarity | Similarity defines the components of Lucene scoring. |
| Similarity.SimScorer | Stores the weight for a query across the indexed collection. |
| SimilarityBase | A subclass of Similarity that provides a simplified API for its descendants. |
| TFIDFSimilarity | Implementation of Similarity with the Vector Space Model. |

Similarity serves
 as the base for ranking functions. For searching, users can employ the models
 already implemented or create their own by extending one of the classes in this
 package.
 
 BM25Similarity is an optimized
 implementation of the successful Okapi BM25 model.
 
ClassicSimilarity is the original Lucene
 scoring function. It is based on the
 Vector Space Model. For more
 information, see TFIDFSimilarity.
 
 
SimilarityBase provides a basic
 implementation of the Similarity contract and exposes a highly simplified
 interface, which makes it an ideal starting point for new ranking functions.
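 Lucene ships the following methods built on
 SimilarityBase: the DFR framework (DFRSimilarity), the information-based
 models (IBSimilarity), the language models (LMDirichletSimilarity and
 LMJelinekMercerSimilarity), and the divergence from independence models
 (DFISimilarity).
 
 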
Because SimilarityBase is not
 optimized to the same extent as
 ClassicSimilarity and
 BM25Similarity, a difference in
 performance is to be expected when using the methods listed above. However,
 optimizations can always be implemented in subclasses; see
 below.
 
 
 Chances are the available Similarities are sufficient for all
     your searching needs.
     However, in some applications it may be necessary to customize your Similarity implementation. For instance, some
     applications do not need to distinguish between shorter and longer documents
     and could set BM25's b
     parameter to 0.
 
 
To change Similarity, one must do so for both indexing and
     searching, and the changes must happen before
     either of these actions takes place. Although in theory nothing stops you from changing mid-stream,
     the resulting behavior is not well-defined.
 
 
To make this change, implement your own Similarity (likely
     you'll want to simply subclass SimilarityBase), and
     then register the new class by calling
     IndexWriterConfig.setSimilarity(Similarity)
     before indexing and
     IndexSearcher.setSimilarity(Similarity)
     before searching.
 
 
BM25Similarity has
 two parameters that may be tuned:
 
 k1 controls the impact of term frequency. A value of 0 makes term frequency
   completely ignored, so documents are scored only on the IDF of the matched
   terms. Higher values of k1 increase the impact of term frequency on the
   final score. Default value is 1.2.
 
 b controls document length normalization and must be in [0, 1]. A value of 0
   disables length normalization completely. Default value is 0.75.
 
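 The effect of the two parameters can be illustrated with the textbook Okapi BM25
 term score. The sketch below is plain Java and does not use Lucene; the class
 name Bm25Sketch and its score method are hypothetical, and the formula is the
 commonly published one, which may differ in constant factors from what
 BM25Similarity actually computes.

```java
final class Bm25Sketch {
    /**
     * Textbook Okapi BM25 score of one term in one document.
     * This is an illustration only, not Lucene's implementation.
     */
    static double score(double idf, double tf, double docLen, double avgDocLen,
                        double k1, double b) {
        // Length normalization factor: with b = 0 it is always 1,
        // so the document length no longer influences the score.
        double norm = 1.0 - b + b * (docLen / avgDocLen);
        // Saturating term-frequency component: with k1 = 0 it is 1 for any tf > 0,
        // so only the IDF remains.
        double tfPart = (tf * (k1 + 1.0)) / (tf + k1 * norm);
        return idf * tfPart;
    }

    public static void main(String[] args) {
        double idf = 2.0;
        // Default parameters.
        System.out.println(score(idf, 3, 100, 100, 1.2, 0.75));
        // k1 = 0: term frequency is ignored, the score equals the IDF.
        System.out.println(score(idf, 3, 100, 100, 0.0, 0.75));
        // b = 0: a short and a long document receive the same score.
        System.out.println(score(idf, 3, 50, 100, 1.2, 0.0));
        System.out.println(score(idf, 3, 500, 100, 1.2, 0.0));
    }
}
```

 With b greater than 0, the longer of two documents with the same term
 frequency scores lower, which is the behavior that setting b to 0 switches off.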
 The easiest way to quickly implement a new ranking method is to extend
 SimilarityBase, which provides
 basic implementations of the low-level Similarity methods. Subclasses are only
 required to implement the SimilarityBase.score(BasicStats, double, double)
 and SimilarityBase.toString()
 methods.
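 
 As a rough idea of what the body of such a score method might compute, the
 sketch below implements a length-normalized tf-idf score in plain Java, using
 the IDF form ln((N+1)/df) listed for AxiomaticF1LOG. It is an assumption-laden
 illustration, not a Lucene class: TfIdfScoreSketch is a hypothetical name, and
 the collection statistics are passed as plain parameters so the code runs
 without Lucene. In a real subclass they would be read from the BasicStats
 argument.

```java
final class TfIdfScoreSketch {
    /**
     * Hypothetical scoring formula: (freq / docLen) * ln((N + 1) / df).
     *
     * @param numberOfDocuments total number of documents N (assumed >= docFreq)
     * @param docFreq           number of documents containing the term (assumed > 0)
     * @param freq              frequency of the term in the scored document
     * @param docLen            length of the scored document (assumed > 0)
     */
    static double score(long numberOfDocuments, long docFreq, double freq, double docLen) {
        // Since docFreq <= numberOfDocuments, the ratio exceeds 1 and the IDF is positive.
        double idf = Math.log((numberOfDocuments + 1.0) / docFreq);
        // Relative term frequency, so longer documents do not win on length alone.
        double tf = freq / docLen;
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term occurring in 10 of 99 documents, twice in a document of length 4.
        System.out.println(score(99, 10, 2.0, 4.0));
        // The same occurrence pattern for a rarer term scores higher.
        System.out.println(score(99, 2, 2.0, 4.0));
    }
}
```

 The accompanying toString() override only needs to return a short name that
 identifies the ranking method.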
 
 
Another option is to extend one of the frameworks
 based on SimilarityBase. These
 Similarities are implemented modularly, e.g.
 DFRSimilarity delegates
 computation of the three parts of its formula to the classes
 BasicModel,
 AfterEffect and
 Normalization. Instead of
 subclassing the Similarity, one can simply introduce a new basic model and tell
 DFRSimilarity to use it.
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.