Class FeatureField
java.lang.Object
  org.apache.lucene.document.Field
    org.apache.lucene.document.FeatureField
All Implemented Interfaces: IndexableField
public final class FeatureField extends Field
Field that can be used to store static scoring factors into documents. This is mostly inspired by the work of Nick Craswell, Stephen Robertson, Hugo Zaragoza and Michael Taylor: Relevance weighting for query independent evidence. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. August 15-19, 2005, Salvador, Brazil.

Feature values are internally encoded as term frequencies. Putting feature queries as BooleanClause.Occur.SHOULD clauses of a BooleanQuery makes it possible to combine query-dependent scores (e.g. BM25) with query-independent scores using a linear combination. Because feature values are stored as frequencies, search logic can also efficiently skip documents that can't be competitive when total hit counts are not requested. This makes it a compelling option compared to storing such factors in, e.g., a doc-value field.

This field may only store factors that are positively correlated with the final score, like pagerank. For factors that are inversely correlated with the score, like URL length, the inverse of the scoring factor should be stored, i.e. 1/urlLength.

For storage efficiency, this field only considers the top 9 significant bits of a value, which allows values to be stored on 16 bits internally. In practice this limitation means that values are stored with a relative precision of 2^-8 = 0.00390625.
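The precision limit described above can be illustrated with a small stand-alone sketch. It assumes the value is truncated by dropping the low 15 mantissa bits of an IEEE 754 float, which leaves 8 exponent bits plus 8 explicit mantissa bits (9 significant bits counting the implicit leading one). The class and method names are hypothetical and this is not presented as the library's actual encoding.

```java
final class FeaturePrecisionSketch {

    /**
     * Assumption: keep the exponent and the top 8 explicit mantissa bits of a
     * positive float by dropping its low 15 bits. The sign bit is 0 for
     * positive values, so the result fits in 16 bits.
     */
    static int encode(float value) {
        if (!(value >= Float.MIN_NORMAL) || Float.isInfinite(value)) {
            throw new IllegalArgumentException(
                "value must be a positive, finite, normal float, got: " + value);
        }
        return Float.floatToIntBits(value) >>> 15;
    }

    /** Restores a float from the truncated bits; the dropped bits become zero. */
    static float decode(int encoded) {
        return Float.intBitsToFloat(encoded << 15);
    }

    public static void main(String[] args) {
        float original = 3.14159f;
        int encoded = encode(original);
        float decoded = decode(encoded);
        // Truncation only ever rounds down, so the error is non-negative
        // and stays below 2^-8 relative to the original value.
        float relativeError = (original - decoded) / original;
        System.out.println(original + " -> " + encoded + " -> " + decoded
            + " (relative error " + relativeError + ")");
    }
}
```

Values whose mantissa fits in 8 bits, such as 1.0f or 0.75f, survive this round trip unchanged; other values lose less than 2^-8 of their magnitude.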
Given a scoring factor S > 0 and its weight w > 0, there are four ways that S can be turned into a score:
- w * log(a + S), with a ≥ 1. This function usually makes sense because the distribution of scoring factors often follows a power law. This is typically the case for pagerank, for instance. However, the paper suggested that the satu and sigm functions give even better results.
- satu(S) = w * S / (S + k), with k > 0. This function is similar to the one used by BM25Similarity in order to incorporate term frequency into the final score. Ignoring the weight w, it produces values between 0 and 1, and a value of 0.5 is obtained when S and k are equal.
- sigm(S) = w * S^a / (S^a + k^a), with k > 0, a > 0. This function provided even better results than the two above but is also harder to tune because it has 2 parameters. Like with satu, and again ignoring the weight w, values are in the 0..1 range and 0.5 is obtained when S and k are equal.
- w * S. Expert: This function doesn't apply any transformation to an indexed feature value; the indexed value itself, multiplied by the weight, determines the score. The feature value is therefore expected to be encoded in the index in a way that makes sense for scoring.
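The four formulas above can be written out directly. The following is a minimal sketch in plain Java, with hypothetical class and method names, meant only to make the arithmetic concrete; it does not go through the query classes of this API.

```java
final class FeatureScoringSketch {

    /** w * log(a + S), with a >= 1. */
    static double log(double w, double a, double s) {
        return w * Math.log(a + s);
    }

    /** satu(S) = w * S / (S + k), with k > 0. */
    static double satu(double w, double k, double s) {
        return w * s / (s + k);
    }

    /** sigm(S) = w * S^a / (S^a + k^a), with k > 0 and a > 0. */
    static double sigm(double w, double k, double a, double s) {
        double sa = Math.pow(s, a);
        return w * sa / (sa + Math.pow(k, a));
    }

    /** w * S, with no transformation of the feature value. */
    static double linear(double w, double s) {
        return w * s;
    }

    public static void main(String[] args) {
        double w = 2.0;
        double k = 5.0;
        // When S equals k, both satu and sigm return half of the weight.
        System.out.println("satu at pivot: " + satu(w, k, k));
        System.out.println("sigm at pivot: " + sigm(w, k, 3.0, k));
        System.out.println("log:           " + log(w, 1.0, 10.0));
        System.out.println("linear:        " + linear(w, 10.0));
    }
}
```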
The constants in the above formulas typically need training in order to compute optimal values. If you don't know where to start, the newSaturationQuery(String, String) method uses 1f as a weight and tries to guess a sensible value for the pivot parameter of the saturation function based on index statistics, which should perform reasonably well. Here is an example, assuming that documents have a FeatureField called 'features' with values for the 'pagerank' feature.

    Query query = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
        .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
        .build();
    Query boost = FeatureField.newSaturationQuery("features", "pagerank");
    Query boostedQuery = new BooleanQuery.Builder()
        .add(query, Occur.MUST)
        .add(boost, Occur.SHOULD)
        .build();
    TopDocs topDocs = searcher.search(boostedQuery, 10);
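To see what the linear combination in the example amounts to numerically, the sketch below adds a saturation contribution to a query-dependent score. It assumes the final score is the plain sum of the two scoring clauses; the class name, the text scores, and the pagerank values are made up for illustration.

```java
final class CombinedScoreSketch {

    /** Saturation contribution: weight * S / (S + pivot). */
    static double saturation(double weight, double pivot, double featureValue) {
        return weight * featureValue / (featureValue + pivot);
    }

    /** Assumed combination: query-dependent score plus the feature contribution. */
    static double combined(double textScore, double weight, double pivot, double featureValue) {
        return textScore + saturation(weight, pivot, featureValue);
    }

    public static void main(String[] args) {
        double weight = 1.0;
        double pivot = 10.0;
        // Hypothetical documents: A matches the text slightly better,
        // B has a much higher pagerank.
        double scoreA = combined(2.0, weight, pivot, 1.0);
        double scoreB = combined(1.8, weight, pivot, 50.0);
        System.out.println("A: " + scoreA);
        System.out.println("B: " + scoreB);
        System.out.println(scoreB > scoreA ? "B ranks first" : "A ranks first");
    }
}
```

Because the saturation contribution is bounded by the weight, the feature can reorder documents with similar text scores but cannot outweigh a text-score gap larger than the weight.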
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.document.Field
Field.Store
-
Field Summary
-
Fields inherited from class org.apache.lucene.document.Field
fieldsData, name, tokenStream, type
-
Constructor Summary
FeatureField(String fieldName, String featureName, float featureValue)
    Create a feature.
-
Method Summary
static DoubleValuesSource newDoubleValues(String field, String featureName)
    Creates a DoubleValuesSource instance which can be used to read the values of a feature from a FeatureField for documents.
static SortField newFeatureSort(String field, String featureName)
    Creates a SortField for sorting by the value of a feature.
static Query newLinearQuery(String fieldName, String featureName, float weight)
    Return a new Query that will score documents as weight * S, where S is the value of the static feature.
static Query newLogQuery(String fieldName, String featureName, float weight, float scalingFactor)
    Return a new Query that will score documents as weight * Math.log(scalingFactor + S), where S is the value of the static feature.
static Query newSaturationQuery(String fieldName, String featureName)
    Same as newSaturationQuery(String, String, float, float), but 1f is used as a weight and a reasonably good default pivot value is computed based on index statistics. This pivot is approximately equal to the geometric mean of all values that exist in the index.
static Query newSaturationQuery(String fieldName, String featureName, float weight, float pivot)
    Return a new Query that will score documents as weight * S / (S + pivot), where S is the value of the static feature.
static Query newSigmoidQuery(String fieldName, String featureName, float weight, float pivot, float exp)
    Return a new Query that will score documents as weight * S^a / (S^a + pivot^a), where S is the value of the static feature and a is the exp parameter.
void setFeatureValue(float featureValue)
    Update the feature value of this field.
TokenStream tokenStream(Analyzer analyzer, TokenStream reuse)
    Creates the TokenStream used for indexing this field.
-
Methods inherited from class org.apache.lucene.document.Field
binaryValue, fieldType, getCharSequenceValue, invertableType, name, numericValue, readerValue, setBytesValue, setBytesValue, setByteValue, setDoubleValue, setFloatValue, setIntValue, setLongValue, setReaderValue, setShortValue, setStringValue, setTokenStream, storedValue, stringValue, tokenStreamValue, toString
-
Constructor Detail
-
FeatureField
public FeatureField(String fieldName, String featureName, float featureValue)
Create a feature.
- Parameters:
fieldName - The name of the field to store the information into. All features may be stored in the same field.
featureName - The name of the feature, e.g. 'pagerank'. It will be indexed as a term.
featureValue - The value of the feature, must be a positive, finite, normal float.
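The constraint on featureValue ("positive, finite, normal") can be checked before constructing a field. The helper below is a hypothetical stand-alone sketch of such a check using only java.lang.Float; it is not claimed to match the validation the constructor performs internally.

```java
final class FeatureValueCheckSketch {

    /**
     * Returns true for floats that are finite and at least Float.MIN_NORMAL.
     * This rejects NaN, infinities, zero, negative values, and subnormal
     * values such as Float.MIN_VALUE.
     */
    static boolean isValidFeatureValue(float value) {
        return Float.isFinite(value) && value >= Float.MIN_NORMAL;
    }

    public static void main(String[] args) {
        float[] candidates = {
            1.5f, 0.0f, -2.0f, Float.NaN, Float.POSITIVE_INFINITY, Float.MIN_VALUE
        };
        for (float candidate : candidates) {
            System.out.println(candidate + " valid? " + isValidFeatureValue(candidate));
        }
    }
}
```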
-
Method Detail
-
setFeatureValue
public void setFeatureValue(float featureValue)
Update the feature value of this field.
-
tokenStream
public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse)
Description copied from interface: IndexableField
Creates the TokenStream used for indexing this field. If appropriate, implementations should use the given Analyzer to create the TokenStreams.
- Specified by:
tokenStream in interface IndexableField
- Overrides:
tokenStream in class Field
- Parameters:
analyzer - Analyzer that should be used to create the TokenStreams from
reuse - TokenStream for a previous instance of this field name. This allows custom field types (like StringField and NumericField) that do not use the analyzer to still have good performance. Note: the passed-in type may be inappropriate, for example if you mix up different types of Fields for the same field name. So it's the responsibility of the implementation to check.
- Returns:
TokenStream value for indexing the document. Should always return a non-null value if the field is to be indexed
-
newLinearQuery
public static Query newLinearQuery(String fieldName, String featureName, float weight)
Return a new Query that will score documents as weight * S, where S is the value of the static feature.
- Parameters:
fieldName - field that stores features
featureName - name of the feature
weight - weight to give to this feature, must be in (0,64]
- Throws:
IllegalArgumentException - if weight is not in (0,64]
-
newLogQuery
public static Query newLogQuery(String fieldName, String featureName, float weight, float scalingFactor)
Return a new Query that will score documents as weight * Math.log(scalingFactor + S), where S is the value of the static feature.
- Parameters:
fieldName - field that stores features
featureName - name of the feature
weight - weight to give to this feature, must be in (0,64]
scalingFactor - scaling factor applied before taking the logarithm, must be in [1, +Infinity)
- Throws:
IllegalArgumentException - if weight is not in (0,64] or scalingFactor is not in [1, +Infinity)
-
newSaturationQuery
public static Query newSaturationQuery(String fieldName, String featureName, float weight, float pivot)
Return a new Query that will score documents as weight * S / (S + pivot), where S is the value of the static feature.
- Parameters:
fieldName - field that stores features
featureName - name of the feature
weight - weight to give to this feature, must be in (0,64]
pivot - feature value that would give a score contribution equal to weight/2, must be in (0, +Infinity)
- Throws:
IllegalArgumentException - if weight is not in (0,64] or pivot is not in (0, +Infinity)
-
newSaturationQuery
public static Query newSaturationQuery(String fieldName, String featureName)
Same as newSaturationQuery(String, String, float, float), but 1f is used as a weight and a reasonably good default pivot value is computed based on index statistics. This pivot is approximately equal to the geometric mean of all values that exist in the index.
- Parameters:
fieldName - field that stores features
featureName - name of the feature
- Throws:
IllegalArgumentException - if weight is not in (0,64] or pivot is not in (0, +Infinity)
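For intuition about the default pivot, the sketch below computes a geometric mean over a handful of feature values. This is only an illustration of the quantity the description mentions: the method itself derives its pivot from index statistics, and the class name and sample values here are hypothetical.

```java
final class PivotSketch {

    /** Geometric mean of strictly positive values, computed as exp(mean(log(v))). */
    static double geometricMean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one value is required");
        }
        double sumOfLogs = 0.0;
        for (double value : values) {
            if (!(value > 0.0)) {
                throw new IllegalArgumentException("values must be positive, got: " + value);
            }
            sumOfLogs += Math.log(value);
        }
        return Math.exp(sumOfLogs / values.length);
    }

    public static void main(String[] args) {
        // Made-up pagerank-like values spanning several orders of magnitude.
        double[] pageranks = {0.5, 2.0, 8.0, 40.0, 300.0};
        System.out.println("geometric mean pivot: " + geometricMean(pageranks));
    }
}
```

A geometric mean suits power-law distributed features because it is far less sensitive to a few very large values than the arithmetic mean, so roughly half of the documents on a log scale fall on each side of the pivot.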
-
newSigmoidQuery
public static Query newSigmoidQuery(String fieldName, String featureName, float weight, float pivot, float exp)
Return a new Query that will score documents as weight * S^a / (S^a + pivot^a), where S is the value of the static feature and a is the exp parameter.
- Parameters:
fieldName - field that stores features
featureName - name of the feature
weight - weight to give to this feature, must be in (0,64]
pivot - feature value that would give a score contribution equal to weight/2, must be in (0, +Infinity)
exp - exponent, higher values make the function grow slower before 'pivot' and faster after 'pivot', must be in (0, +Infinity)
- Throws:
IllegalArgumentException - if weight is not in (0,64] or either pivot or exp is not in (0, +Infinity)
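The effect of the exponent is easiest to see by evaluating the formula on both sides of the pivot. The sketch below is plain Java with hypothetical names and sample values; it only evaluates the formula given above and does not call the query API.

```java
final class SigmoidExponentSketch {

    /** weight * S^exp / (S^exp + pivot^exp). */
    static double sigmoid(double weight, double pivot, double exp, double s) {
        double sa = Math.pow(s, exp);
        return weight * sa / (sa + Math.pow(pivot, exp));
    }

    public static void main(String[] args) {
        double weight = 1.0;
        double pivot = 1.0;
        for (double exp : new double[] {0.5, 1.0, 2.0, 4.0}) {
            // One row per exponent: a value below, at, and above the pivot.
            System.out.printf("exp=%.1f  S=0.5 -> %.3f  S=1.0 -> %.3f  S=2.0 -> %.3f%n",
                exp,
                sigmoid(weight, pivot, exp, 0.5),
                sigmoid(weight, pivot, exp, 1.0),
                sigmoid(weight, pivot, exp, 2.0));
        }
    }
}
```

With a larger exponent the score is lower for values below the pivot and higher for values above it, while the score at the pivot stays at weight/2.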
-
newFeatureSort
public static SortField newFeatureSort(String field, String featureName)
Creates a SortField for sorting by the value of a feature.
This sort orders documents by descending value of a feature. The value returned in FieldDoc for the hits contains a Float instance with the feature value.
If a document is missing the field, then it is treated as having a value of 0.0f.
- Parameters:
field - field name. Must not be null.
featureName - feature name. Must not be null.
- Returns:
SortField ordering documents by the value of the feature
- Throws:
NullPointerException - if field or featureName is null.
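The ordering rule described above (descending by feature value, missing treated as 0.0f) can be mimicked on an in-memory map. This is a sketch of the semantics only, with hypothetical document ids and names, and does not use SortField.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FeatureSortSketch {

    /** Orders document ids by descending feature value; absent ids count as 0.0f. */
    static List<String> sortByFeatureDescending(Map<String, Float> featureByDoc,
                                                List<String> docIds) {
        List<String> sorted = new ArrayList<>(docIds);
        sorted.sort((a, b) -> Float.compare(
            featureByDoc.getOrDefault(b, 0.0f),
            featureByDoc.getOrDefault(a, 0.0f)));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Float> pagerank = new HashMap<>();
        pagerank.put("doc1", 0.5f);
        pagerank.put("doc3", 2.0f);
        // doc2 has no feature value and therefore sorts as 0.0f.
        List<String> order =
            sortByFeatureDescending(pagerank, List.of("doc1", "doc2", "doc3"));
        System.out.println(order);
    }
}
```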
-
newDoubleValues
public static DoubleValuesSource newDoubleValues(String field, String featureName)
Creates a DoubleValuesSource instance which can be used to read the values of a feature from a FeatureField for documents.
- Parameters:
field - field name. Must not be null.
featureName - feature name. Must not be null.
- Returns:
a DoubleValuesSource which can be used to access the values of the feature for documents
- Throws:
NullPointerException - if field or featureName is null.