public final class FeatureField extends Field
A Field that can be used to store static scoring factors in
documents. This is largely inspired by the work of Nick Craswell,
Stephen Robertson, Hugo Zaragoza and Michael Taylor: "Relevance weighting
for query independent evidence", Proceedings of the 28th annual international
ACM SIGIR conference on Research and development in information retrieval,
August 15-19, 2005, Salvador, Brazil.
Feature values are internally encoded as term frequencies. Putting
feature queries as BooleanClause.Occur.SHOULD clauses of a BooleanQuery
allows query-dependent scores (e.g. BM25) to be combined
with query-independent scores using a linear combination. The fact that
feature values are stored as frequencies also allows search logic to
efficiently skip documents that can't be competitive when total hit counts
are not requested. This makes it a compelling option compared to storing
such factors in, for example, a doc-value field.
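As a rough illustration of the linear combination described above, the sketch below models how a required text clause and an optional feature clause add up for a document that matches both. It does not use the Lucene API: the class and method names are hypothetical, the text scores are illustrative stand-ins for BM25 scores, and the saturation function is only one possible choice for the feature contribution.

```java
/** Hypothetical model of adding a query-independent score to a query-dependent one. */
final class LinearCombinationSketch {

    /** Feature contribution weight * s / (s + pivot), assuming s > 0 and pivot > 0. */
    static double featureScore(double weight, double s, double pivot) {
        return weight * s / (s + pivot);
    }

    /** A document matching both clauses receives the sum of the two scores. */
    static double combinedScore(double textScore, double weight, double s, double pivot) {
        return textScore + featureScore(weight, s, pivot);
    }

    public static void main(String[] args) {
        // Illustrative inputs only: doc A has the better text score,
        // doc B has the better static feature value.
        double docA = combinedScore(2.0, 1.0, 0.1, 1.0);
        double docB = combinedScore(1.9, 1.0, 9.0, 1.0);
        System.out.println("A=" + docA + " B=" + docB + " B ranks first: " + (docB > docA));
    }
}
```

With these inputs the static feature is enough to move document B ahead of document A, even though A has the slightly higher text score.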
This field may only store factors that are positively correlated with the
final score, like pagerank. For factors that are inversely correlated
with the score, like URL length, the inverse of the scoring factor should be
stored, i.e. 1/urlLength.
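The inversion can be sketched in a few lines of plain Java. The class and method names below are hypothetical and not part of the library; the validity check mirrors the constraint stated in the constructor documentation (the value must be a positive, finite, normal float), and the library's own validation may be implemented differently.

```java
/** Hypothetical helper for turning an inversely correlated factor into a storable value. */
final class InverseFeatureSketch {

    /** Maps a URL length in characters (must be > 0) to 1/urlLength. */
    static float inverseUrlLength(int urlLength) {
        if (urlLength <= 0) {
            throw new IllegalArgumentException("urlLength must be > 0, got " + urlLength);
        }
        return 1f / urlLength;
    }

    /** Checks the documented constraint: positive, finite and normal (not subnormal). */
    static boolean isStorable(float value) {
        return Float.isFinite(value) && value >= Float.MIN_NORMAL;
    }

    public static void main(String[] args) {
        float shortUrl = inverseUrlLength(10);
        float longUrl = inverseUrlLength(100);
        // Shorter URLs now map to larger feature values.
        System.out.println(shortUrl + " > " + longUrl + " : " + (shortUrl > longUrl));
        System.out.println("storable: " + isStorable(shortUrl));
    }
}
```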
For storage efficiency, this field only considers the top 9 significant bits of each value, which allows values to be stored on 16 bits internally. In practice this limitation means that values are stored with a relative precision of 2^-8 = 0.00390625.
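To make the precision limit concrete, here is a minimal sketch that assumes the truncation simply drops the 15 low-order bits of the IEEE 754 single-precision representation. For a positive float that leaves the 8 exponent bits plus 8 explicit mantissa bits, which together with the implicit leading bit gives 9 significant bits. The class and method names are hypothetical, and the encoding actually used by the library may differ in detail.

```java
/** Hypothetical model of keeping only the top 9 significant bits of a positive float. */
final class FeaturePrecisionSketch {

    /** Drops the 15 low-order mantissa bits; the result fits in 16 bits for positive floats. */
    static int encode(float value) {
        return Float.floatToIntBits(value) >>> 15;
    }

    /** Restores a float from the truncated bits; the dropped bits come back as zeros. */
    static float decode(int encoded) {
        return Float.intBitsToFloat(encoded << 15);
    }

    public static void main(String[] args) {
        float original = 3.14159f;
        float roundTripped = decode(encode(original));
        float relativeError = (original - roundTripped) / original;
        // The relative error stays below 2^-8 because only low-order mantissa bits are lost.
        System.out.println(original + " -> " + roundTripped + " (relative error " + relativeError + ")");
    }
}
```

Under this assumption, two values that differ by less than the relative precision, such as 1.0 and 1.003, collapse to the same stored value.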
Given a scoring factor S > 0 and its weight w > 0, there
are three ways that S can be turned into a score:
- w * log(a + S), with a ≥ 1. This function
  usually makes sense because the distribution of scoring factors
  often follows a power law. This is typically the case for pagerank, for
  instance. However, the paper suggested that the satu and sigm
  functions give even better results.
- satu(S) = w * S / (S + k), with k > 0. This
  function is similar to the one used by BM25Similarity in order
  to incorporate term frequency into the final score. For w = 1 it produces
  values between 0 and 1, and a value of 0.5 is obtained when S and k are equal.
- sigm(S) = w * S^a / (S^a + k^a),
  with k > 0, a > 0. This function provided even better results
  than the two above but is also harder to tune because it has
  2 parameters. As with satu, for w = 1 values are in the 0..1 range and
  0.5 is obtained when S and k are equal.
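The three functions above translate directly into code. The following is a plain-Java sketch of the formulas only, not of the library's query implementations; the class and method names are hypothetical, and the parameters are assumed to satisfy the ranges given above (S > 0, w > 0, a ≥ 1 for log, k > 0 and a > 0 otherwise).

```java
/** Hypothetical, standalone versions of the three scoring functions. */
final class FeatureFunctionsSketch {

    /** w * log(a + S), with a >= 1. */
    static double log(double w, double a, double s) {
        return w * Math.log(a + s);
    }

    /** satu(S) = w * S / (S + k), with k > 0. */
    static double satu(double w, double k, double s) {
        return w * s / (s + k);
    }

    /** sigm(S) = w * S^a / (S^a + k^a), with k > 0 and a > 0. */
    static double sigm(double w, double k, double a, double s) {
        double sPowA = Math.pow(s, a);
        return w * sPowA / (sPowA + Math.pow(k, a));
    }

    public static void main(String[] args) {
        // With w = 1, both satu and sigm reach half of their maximum when S equals k.
        System.out.println("satu at S = k: " + satu(1.0, 5.0, 5.0));
        System.out.println("sigm at S = k: " + sigm(1.0, 5.0, 2.0, 5.0));
        System.out.println("log example:   " + log(1.0, 1.0, 5.0));
    }
}
```

A larger exponent a makes sigm steeper around S = k, which is one way to see why it is harder to tune than satu.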
The constants in the above formulas typically need training in order to
compute optimal values. If you don't know where to start, the
newSaturationQuery(String, String) method uses 1f as a weight and tries
to guess a sensible value for the pivot parameter of the saturation
function based on index statistics, which should give reasonable results.
Here is an example, assuming that documents have a FeatureField called
'features' with values for the 'pagerank' feature:
Query query = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
    .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
    .build();
Query boost = FeatureField.newSaturationQuery("features", "pagerank");
Query boostedQuery = new BooleanQuery.Builder()
    .add(query, Occur.MUST)
    .add(boost, Occur.SHOULD)
    .build();
TopDocs topDocs = searcher.search(boostedQuery, 10);
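The method summary for newSaturationQuery(String, String) describes the default pivot as approximately the geometric mean of the feature values in the index. The sketch below shows what that means numerically on an in-memory array; it is an assumption-laden illustration with hypothetical names, and it says nothing about how the library derives the value from index statistics.

```java
/** Hypothetical illustration of a geometric-mean pivot for the saturation function. */
final class PivotSketch {

    /** Geometric mean of strictly positive values, computed as exp(mean(log(v))). */
    static double geometricMean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double sumOfLogs = 0.0;
        for (double v : values) {
            if (!(v > 0)) {
                throw new IllegalArgumentException("values must be > 0, got " + v);
            }
            sumOfLogs += Math.log(v);
        }
        return Math.exp(sumOfLogs / values.length);
    }

    /** Saturation score weight * s / (s + pivot). */
    static double satu(double weight, double pivot, double s) {
        return weight * s / (s + pivot);
    }

    public static void main(String[] args) {
        // Illustrative feature values spanning several orders of magnitude.
        double[] pageranks = {1.0, 10.0, 100.0};
        double pivot = geometricMean(pageranks);
        System.out.println("pivot = " + pivot);
        System.out.println("score at the pivot = " + satu(1.0, pivot, pivot));
    }
}
```

Using the geometric rather than the arithmetic mean keeps the pivot from being dominated by a few very large values, which matters when the feature follows a power law.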
Nested classes inherited from class Field: Field.Store
Fields inherited from class Field: fieldsData, name, tokenStream, type
Constructor and Description |
---|
FeatureField(String fieldName, String featureName, float featureValue) Create a feature. |
Modifier and Type | Method and Description |
---|---|
static Query | newLogQuery(String fieldName, String featureName, float weight, float scalingFactor) Return a new Query that will score documents as weight * Math.log(scalingFactor + S) where S is the value of the static feature. |
static Query | newSaturationQuery(String fieldName, String featureName) Same as newSaturationQuery(String, String, float, float) but 1f is used as a weight and a reasonably good default pivot value is computed based on index statistics and is approximately equal to the geometric mean of all values that exist in the index. |
static Query | newSaturationQuery(String fieldName, String featureName, float weight, float pivot) Return a new Query that will score documents as weight * S / (S + pivot) where S is the value of the static feature. |
static Query | newSigmoidQuery(String fieldName, String featureName, float weight, float pivot, float exp) Return a new Query that will score documents as weight * S^a / (S^a + pivot^a) where S is the value of the static feature. |
void | setFeatureValue(float featureValue) Update the feature value of this field. |
TokenStream | tokenStream(Analyzer analyzer, TokenStream reuse) Creates the TokenStream used for indexing this field. |
Methods inherited from class Field: binaryValue, fieldType, getCharSequenceValue, name, numericValue, readerValue, setBytesValue, setBytesValue, setByteValue, setDoubleValue, setFloatValue, setIntValue, setLongValue, setReaderValue, setShortValue, setStringValue, setTokenStream, stringValue, tokenStreamValue, toString
public FeatureField(String fieldName, String featureName, float featureValue)
Parameters:
fieldName - The name of the field to store the information into. All features may be stored in the same field.
featureName - The name of the feature, e.g. 'pagerank'. It will be indexed as a term.
featureValue - The value of the feature, must be a positive, finite, normal float.
public void setFeatureValue(float featureValue)
public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse)
Specified by: tokenStream in interface IndexableField
Overrides: tokenStream in class Field
Parameters:
analyzer - Analyzer that should be used to create the TokenStreams from
reuse - TokenStream for a previous instance of this field name. This allows
custom field types (like StringField and NumericField) that do not use
the analyzer to still have good performance. Note: the passed-in type
may be inappropriate, for example if you mix up different types of Fields
for the same field name. So it's the responsibility of the implementation to
check.
public static Query newLogQuery(String fieldName, String featureName, float weight, float scalingFactor)
Return a new Query that will score documents as
weight * Math.log(scalingFactor + S)
where S is the value of the static feature.
Parameters:
fieldName - field that stores features
featureName - name of the feature
weight - weight to give to this feature, must be in (0,64]
scalingFactor - scaling factor applied before taking the logarithm, must be in [1, +Infinity)
Throws:
IllegalArgumentException - if weight is not in (0,64] or scalingFactor is not in [1, +Infinity)
public static Query newSaturationQuery(String fieldName, String featureName, float weight, float pivot)
Return a new Query that will score documents as
weight * S / (S + pivot)
where S is the value of the static feature.
Parameters:
fieldName - field that stores features
featureName - name of the feature
weight - weight to give to this feature, must be in (0,64]
pivot - feature value that would give a score contribution equal to weight/2, must be in (0, +Infinity)
Throws:
IllegalArgumentException - if weight is not in (0,64] or pivot is not in (0, +Infinity)
public static Query newSaturationQuery(String fieldName, String featureName)
Same as newSaturationQuery(String, String, float, float) but
1f is used as a weight and a reasonably good default pivot value
is computed based on index statistics and is approximately equal to the
geometric mean of all values that exist in the index.
Parameters:
fieldName - field that stores features
featureName - name of the feature
Throws:
IllegalArgumentException - if weight is not in (0,64] or pivot is not in (0, +Infinity)
public static Query newSigmoidQuery(String fieldName, String featureName, float weight, float pivot, float exp)
Return a new Query that will score documents as
weight * S^a / (S^a + pivot^a)
where S is the value of the static feature and a is the exponent exp.
Parameters:
fieldName - field that stores features
featureName - name of the feature
weight - weight to give to this feature, must be in (0,64]
pivot - feature value that would give a score contribution equal to weight/2, must be in (0, +Infinity)
exp - exponent, higher values make the function grow slower before 'pivot' and faster after 'pivot', must be in (0, +Infinity)
Throws:
IllegalArgumentException - if weight is not in (0,64] or either pivot or exp is not in (0, +Infinity)
Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.