| Interface | Description | 
|---|---|
| LMSimilarity.CollectionModel | A strategy for computing the collection language model. |

| Class | Description |
|---|---|
| AfterEffect | This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework. |
| AfterEffect.NoAfterEffect | Implementation used when there is no aftereffect. | 
| AfterEffectB | Model of the information gain based on the ratio of two Bernoulli processes. | 
| AfterEffectL | Model of the information gain based on Laplace's law of succession. | 
| BasicModel | This class acts as the base class for the specific basic model implementations in the DFR framework. |
| BasicModelBE | Limiting form of the Bose-Einstein model. |
| BasicModelD | Implements the approximation of the binomial model with the divergence for DFR. |
| BasicModelG | Geometric as limiting form of the Bose-Einstein model. | 
| BasicModelIF | An approximation of the I(ne) model. | 
| BasicModelIn | The basic tf-idf model of randomness. | 
| BasicModelIne | Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency. |
| BasicModelP | Implements the Poisson approximation for the binomial model for DFR. | 
| BasicStats | Stores all statistics commonly used by ranking methods. |
| BM25Similarity | BM25 Similarity. |
| DefaultSimilarity | Expert: Default scoring implementation which encodes norm values as a single byte before being stored. |
| DFRSimilarity | Implements the divergence from randomness (DFR) framework introduced by Gianni Amati and Cornelis Joost Van Rijsbergen. |
| Distribution | The probabilistic distribution used to model term occurrence in information-based models. |
| DistributionLL | Log-logistic distribution. | 
| DistributionSPL | The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper. |
| IBSimilarity | Provides a framework for the family of information-based models, as described by Stéphane Clinchant and Eric Gaussier. |
| Lambda | The lambda (λw) parameter in information-based models. |
| LambdaDF | Computes lambda as (docFreq + 1) / (numberOfDocuments + 1). |
| LambdaTTF | Computes lambda as (totalTermFreq + 1) / (numberOfDocuments + 1). |
| LMDirichletSimilarity | Bayesian smoothing using Dirichlet priors. | 
| LMJelinekMercerSimilarity | Language model based on the Jelinek-Mercer smoothing method. | 
| LMSimilarity | Abstract superclass for language modeling Similarities. | 
| LMSimilarity.DefaultCollectionModel | Models p(w\|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1. |
| LMSimilarity.LMStats | Stores the collection distribution of the current term. | 
| MultiSimilarity | Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A. |
| Normalization | This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework. |
| Normalization.NoNormalization | Implementation used when there is no normalization. | 
| NormalizationH1 | Normalization model that assumes a uniform distribution of the term frequency. | 
| NormalizationH2 | Normalization model in which the term frequency is inversely related to the length. |
| NormalizationH3 | Dirichlet Priors normalization | 
| NormalizationZ | Pareto-Zipf Normalization | 
| PerFieldSimilarityWrapper | Provides the ability to use a different Similarity for different fields. |
| Similarity | Similarity defines the components of Lucene scoring. | 
| Similarity.SimScorer | |
| Similarity.SimWeight | Stores the weight for a query across the indexed collection. | 
| SimilarityBase | A subclass of Similarity that provides a simplified API for its descendants. |
| TFIDFSimilarity | Implementation of Similarity with the Vector Space Model. |
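
The two Lambda implementations in the table differ only in which term statistic goes into the numerator. As a rough, standalone sketch (plain Java, not Lucene's own classes; the class name `LambdaSketch` and its method names are made up for illustration), the formulas as described in the table work out as follows:

```java
/** Standalone illustration of the lambda formulas described in the table; not Lucene code. */
final class LambdaSketch {
    private LambdaSketch() {}

    /** (docFreq + 1) / (numberOfDocuments + 1), as described for LambdaDF. */
    static double lambdaDF(long docFreq, long numberOfDocuments) {
        return (docFreq + 1.0) / (numberOfDocuments + 1.0);
    }

    /** (totalTermFreq + 1) / (numberOfDocuments + 1), as described for LambdaTTF. */
    static double lambdaTTF(long totalTermFreq, long numberOfDocuments) {
        return (totalTermFreq + 1.0) / (numberOfDocuments + 1.0);
    }

    public static void main(String[] args) {
        // Hypothetical term: found in 9 of 99 documents, occurring 24 times overall.
        System.out.println("lambda (document frequency):   " + lambdaDF(9, 99));
        System.out.println("lambda (total term frequency): " + lambdaTTF(24, 99));
    }
}
```

For the hypothetical term in `main`, the document-frequency variant evaluates to 10/100 and the total-term-frequency variant to 25/100, so a term that repeats within documents gets a larger lambda under LambdaTTF.
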
Similarity serves
as the base for ranking functions. For searching, users can employ the models
already implemented or create their own by extending one of the classes in this
package.
DefaultSimilarity is the original Lucene
scoring function. It is based on a highly optimized 
Vector Space Model. For more
information, see TFIDFSimilarity.
BM25Similarity is an optimized
implementation of the successful Okapi BM25 model.
SimilarityBase provides a basic
implementation of the Similarity contract and exposes a highly simplified
interface, which makes it an ideal starting point for new ranking functions.
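Lucene ships the following methods built on
SimilarityBase: the DFR framework (DFRSimilarity), the information-based
models (IBSimilarity), and two language models (LMDirichletSimilarity and
LMJelinekMercerSimilarity).
Because SimilarityBase is not
optimized to the same extent as
DefaultSimilarity and
BM25Similarity, a difference in
performance is to be expected when using the methods listed above. However,
optimizations can always be implemented in subclasses; see
below.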
Chances are the available Similarities are sufficient for all your searching needs. However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to distinguish between shorter and longer documents (see a "fair" similarity).
To change Similarity, one must do so for both indexing and
searching, and the changes must happen before
either of these actions takes place. Although in theory nothing stops you from changing mid-stream,
the resulting behavior is not well defined.
To make this change, implement your own Similarity (most likely
by subclassing an existing implementation, be it
DefaultSimilarity or a descendant of
SimilarityBase), and
then register the new class by calling
IndexWriterConfig.setSimilarity(Similarity)
before indexing and
IndexSearcher.setSimilarity(Similarity)
before searching.
The easiest way to quickly implement a new ranking method is to extend
SimilarityBase, which provides
basic implementations for the low-level Similarity contract. Subclasses are only required to
implement the SimilarityBase.score(BasicStats, float, float)
and SimilarityBase.toString()
methods.
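
To make the two-method contract concrete, here is a minimal standalone sketch. It deliberately avoids Lucene's classes: `Stats`, `RankingBase` and `RelativeTfIdf` are hypothetical stand-ins that only mirror the shape described above (a score computed from term statistics, a frequency and a document length, plus a `toString()`), and the scoring formula itself is a toy chosen for illustration.

```java
/** Standalone sketch of the "implement score and toString" contract; stand-in types, not Lucene's. */
final class SimpleRankingSketch {
    private SimpleRankingSketch() {}

    /** Hypothetical, cut-down stand-in for the statistics a score method receives. */
    static final class Stats {
        final long numberOfDocuments;
        final long docFreq;

        Stats(long numberOfDocuments, long docFreq) {
            this.numberOfDocuments = numberOfDocuments;
            this.docFreq = docFreq;
        }
    }

    /** Stand-in base class: subclasses supply only score(...) and toString(). */
    abstract static class RankingBase {
        abstract double score(Stats stats, double freq, double docLen);

        @Override
        public abstract String toString();
    }

    /** Toy ranking function: relative term frequency times a smoothed log inverse document frequency. */
    static final class RelativeTfIdf extends RankingBase {
        @Override
        double score(Stats stats, double freq, double docLen) {
            double idf = Math.log((stats.numberOfDocuments + 1.0) / (stats.docFreq + 1.0));
            return (freq / docLen) * idf;
        }

        @Override
        public String toString() {
            return "RelativeTfIdf";
        }
    }

    public static void main(String[] args) {
        RankingBase ranking = new RelativeTfIdf();
        // Hypothetical term: in 9 of 99 documents, 5 occurrences in a 10-token document.
        System.out.println(ranking + " score: " + ranking.score(new Stats(99, 9), 5.0, 10.0));
    }
}
```

In a real subclass the statistics object and the base class would come from Lucene; only the body of the score method and the name returned by `toString()` would be yours.
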
Another option is to extend one of the frameworks
based on SimilarityBase. These
Similarities are implemented modularly, e.g.
DFRSimilarity delegates
computation of the three parts of its formula to the classes
BasicModel,
AfterEffect and
Normalization. Instead of
subclassing the Similarity, one can simply introduce a new basic model and tell
DFRSimilarity to use it.
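
The modular design can be pictured with a small standalone sketch. Everything in it is an assumption made for illustration: the three functional interfaces are hypothetical stand-ins for the basic model, after effect and normalization components, the way they are combined (a product over a normalized term frequency) is a simplification, and the concrete lambdas are toy formulas rather than any of the models shipped with Lucene.

```java
/** Standalone sketch of a modular, DFR-style score; all types here are hypothetical stand-ins. */
final class DfrCompositionSketch {
    private DfrCompositionSketch() {}

    /** Stand-in for a basic model: scores a normalized term frequency. */
    interface BasicModelFn {
        double score(double tfn);
    }

    /** Stand-in for an after effect: scales the basic model's output. */
    interface AfterEffectFn {
        double score(double tfn);
    }

    /** Stand-in for a normalization: adjusts the raw term frequency for document length. */
    interface NormalizationFn {
        double tfn(double tf, double docLen, double avgDocLen);
    }

    /** Combines the three pluggable parts; swapping one part leaves the others untouched. */
    static double score(BasicModelFn basicModel, AfterEffectFn afterEffect,
                        NormalizationFn normalization,
                        double tf, double docLen, double avgDocLen) {
        double tfn = normalization.tfn(tf, docLen, avgDocLen);
        return basicModel.score(tfn) * afterEffect.score(tfn);
    }

    public static void main(String[] args) {
        NormalizationFn lengthScaled = (tf, docLen, avgDocLen) -> tf * avgDocLen / docLen;
        AfterEffectFn damping = tfn -> 1.0 / (tfn + 1.0);
        BasicModelFn linearToy = tfn -> 2.0 * tfn;        // toy basic model
        BasicModelFn logToy = tfn -> Math.log1p(tfn);     // a second toy model, swapped in below

        System.out.println(score(linearToy, damping, lengthScaled, 3.0, 100.0, 100.0));
        System.out.println(score(logToy, damping, lengthScaled, 3.0, 100.0, 100.0));
    }
}
```

The second call in `main` changes only the basic model argument, which is the point of the design: a new model is a new small component, not a new similarity class.
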
If you are interested in use cases for changing your similarity, see the "Overriding Similarity" thread on the Lucene users' mailing list. In summary, here are a few use cases:
The SweetSpotSimilarity in org.apache.lucene.misc gives small increases while the frequency rises by a small amount, and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
Overriding tf: In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.
Changing Length Normalization: By overriding Similarity.computeNorm(FieldInvertState state), it is possible to discount how the length of a field contributes to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
[One would override the Similarity in] ... any situation where you know more about your data than just that it's "text" is a situation where it *might* make sense to override your Similarity method.
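
The tf and length normalization use cases above come down to swapping one small formula for another. The sketch below is plain Java rather than a Lucene subclass; the class `NormSketch` and its method names are hypothetical, and the guard that returns 0 for a non-matching frequency is an assumption added so the function is total.

```java
/** Standalone illustration of the tf and length-norm formulas mentioned above; not Lucene code. */
final class NormSketch {
    private NormSketch() {}

    /** 1 / (numTerms)^0.5: the length normalization the text attributes to DefaultSimilarity. */
    static double sqrtLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** 1 / numTerms: the alternative length normalization mentioned in the text. */
    static double inverseLengthNorm(int numTerms) {
        return 1.0 / numTerms;
    }

    /** Constant tf: every matching frequency counts as 1 (the zero case is an added assumption). */
    static double constantTf(double freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {4, 100, 10_000}) {
            System.out.println(numTerms + " terms: sqrt norm = " + sqrtLengthNorm(numTerms)
                    + ", inverse norm = " + inverseLengthNorm(numTerms));
        }
        System.out.println("tf for 7 occurrences: " + constantTf(7));
    }
}
```

Running `main` shows how much faster the inverse norm shrinks with field length than the square-root norm, which is the difference a custom normalization would introduce.
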
Copyright © 2000-2015 Apache Software Foundation. All Rights Reserved.