public final class FeatureField extends Field

A Field that can be used to store static scoring factors into documents. This is largely inspired by the work of Nick Craswell, Stephen Robertson, Hugo Zaragoza and Michael Taylor: "Relevance weighting for query independent evidence", Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 2005, Salvador, Brazil.
Feature values are internally encoded as term frequencies. Putting feature queries as BooleanClause.Occur.SHOULD clauses of a BooleanQuery makes it possible to combine query-dependent scores (e.g. BM25) with query-independent scores using a linear combination. Because feature values are stored as frequencies, search logic will in the future be able to efficiently skip documents that can't be competitive when total hit counts are not requested. This makes it a compelling option compared to storing such factors in, e.g., a doc-value field.
This field may only store factors that are positively correlated with the final score, such as pagerank. For factors that are inversely correlated with the score, such as URL length, the inverse of the scoring factor should be stored instead, i.e. 1/urlLength.
This field only keeps the top 9 significant bits for storage efficiency, which allows values to be stored on 16 bits internally. In practice this limitation means that values are stored with a relative precision of 2^-8 = 0.00390625.
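The effect of this lossy encoding can be illustrated with a small standalone sketch. It keeps only the top bits of an IEEE 754 float (sign, exponent, and the leading mantissa bits); the `quantize` helper is hypothetical and only approximates the idea, it is not Lucene's actual encoding:

```java
public class FeaturePrecisionDemo {
    /**
     * Keep only the top 16 bits of the float representation
     * (sign + 8 exponent bits + 7 explicit mantissa bits).
     * Hypothetical helper illustrating lossy quantization; not Lucene's API.
     */
    static float quantize(float value) {
        return Float.intBitsToFloat(Float.floatToIntBits(value) & 0xFFFF0000);
    }

    public static void main(String[] args) {
        float[] samples = {0.0017f, 3.14159f, 12345.678f};
        for (float s : samples) {
            float q = quantize(s);
            double relErr = Math.abs(s - q) / s;
            // Truncating the mantissa keeps the relative error below 2^-7,
            // on the same order as the documented 2^-8 precision.
            System.out.println(s + " -> " + q + " (relative error " + relErr + ")");
        }
    }
}
```

Note that the quantized value is always less than or equal to the original in magnitude, since truncation simply drops low-order mantissa bits.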
Given a scoring factor S > 0 and its weight w > 0, there are three ways that S can be turned into a score:

w * log(a + S), with a ≥ 1. This function usually makes sense because the distribution of scoring factors often follows a power law. This is typically the case for pagerank, for instance. However, the paper suggests that the satu and sigm functions below give even better results.

satu(S) = w * S / (S + k), with k > 0. This function is similar to the one used by BM25Similarity to incorporate term frequency into the final score, and produces values between 0 and 1. A value of 0.5 is obtained when S and k are equal.

sigm(S) = w * S^a / (S^a + k^a), with k > 0, a > 0. This function provided even better results than the two above, but is also harder to tune because it has two parameters. As with satu, values are in the 0..1 range and 0.5 is obtained when S and k are equal.
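As a quick sketch of how these three shapes behave, the following standalone snippet (plain Java, independent of Lucene) evaluates each function for sample values; the choices of w, k, a and S here are arbitrary illustrations:

```java
public class FeatureScoreShapes {
    // w * log(a + S), for power-law-distributed factors like pagerank
    static double log(double w, double a, double s)  { return w * Math.log(a + s); }

    // satu(S) = w * S / (S + k), saturating toward w as S grows
    static double satu(double w, double k, double s) { return w * s / (s + k); }

    // sigm(S) = w * S^a / (S^a + k^a), a sigmoid generalization of satu
    static double sigm(double w, double k, double a, double s) {
        return w * Math.pow(s, a) / (Math.pow(s, a) + Math.pow(k, a));
    }

    public static void main(String[] args) {
        double w = 1.0, k = 10.0, s = 10.0;
        // When S == k, both satu and sigm contribute exactly w/2.
        System.out.println(satu(w, k, s));       // 0.5
        System.out.println(sigm(w, k, 2.0, s));  // 0.5
        // With exponent a == 1, sigm degenerates to satu.
        System.out.println(sigm(w, k, 1.0, 42.0) == satu(w, k, 42.0));
    }
}
```

The pivot k thus acts as the "half score" point for both saturating functions, which is what makes it a natural tuning knob.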
The constants in the above formulas typically need training in order to compute optimal values. If you don't know where to start, the newSaturationQuery(String, String) method uses 1f as a weight and tries to guess a sensible value for the pivot parameter of the saturation function based on index statistics, which shouldn't perform too badly. Here is an example, assuming that documents have a FeatureField called 'features' with values for the 'pagerank' feature.
Query query = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
    .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
    .build();
Query boost = FeatureField.newSaturationQuery("features", "pagerank");
Query boostedQuery = new BooleanQuery.Builder()
    .add(query, Occur.MUST)
    .add(boost, Occur.SHOULD)
    .build();
TopDocs topDocs = searcher.search(boostedQuery, 10);
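The pivot guessed from index statistics is approximately the geometric mean of the indexed feature values. A standalone sketch of that statistic (plain Java, not Lucene code; Lucene derives it from index statistics rather than from raw values):

```java
import java.util.stream.DoubleStream;

public class PivotGuess {
    /**
     * Geometric mean computed via the mean of logarithms, which avoids
     * overflow when the product of many values would be huge or tiny.
     * Illustrative only, not Lucene's implementation.
     */
    static double geometricMean(double... values) {
        double logSum = DoubleStream.of(values).map(Math::log).sum();
        return Math.exp(logSum / values.length);
    }

    public static void main(String[] args) {
        // Pagerank-like feature values spanning several orders of magnitude:
        // the geometric mean lands near the "middle" order of magnitude (~0.1),
        // whereas the arithmetic mean would be dominated by the largest value.
        System.out.println(geometricMean(0.001, 0.01, 0.1, 1.0, 10.0));
    }
}
```

A pivot near the geometric mean means roughly half of the documents score above w/2 and half below, which keeps the saturation function discriminative across the whole value range.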
Nested classes/interfaces inherited from class Field:
Field.Store

Fields inherited from class Field:
fieldsData, name, tokenStream, type

Constructor Summary

FeatureField(String fieldName, String featureName, float featureValue)
    Create a feature.
Method Summary

static Query newLogQuery(String fieldName, String featureName, float weight, float scalingFactor)
    Return a new Query that will score documents as weight * Math.log(scalingFactor + S) where S is the value of the static feature.

static Query newSaturationQuery(String fieldName, String featureName)
    Same as newSaturationQuery(String, String, float, float) but 1f is used as a weight and a reasonably good default pivot value is computed based on index statistics and is approximately equal to the geometric mean of all values that exist in the index.

static Query newSaturationQuery(String fieldName, String featureName, float weight, float pivot)
    Return a new Query that will score documents as weight * S / (S + pivot) where S is the value of the static feature.

static Query newSigmoidQuery(String fieldName, String featureName, float weight, float pivot, float exp)
    Return a new Query that will score documents as weight * S^a / (S^a + pivot^a) where S is the value of the static feature.

void setFeatureValue(float featureValue)
    Update the feature value of this field.

TokenStream tokenStream(Analyzer analyzer, TokenStream reuse)
    Creates the TokenStream used for indexing this field.

Methods inherited from class Field:
binaryValue, fieldType, name, numericValue, readerValue, setBytesValue, setBytesValue, setByteValue, setDoubleValue, setFloatValue, setIntValue, setLongValue, setReaderValue, setShortValue, setStringValue, setTokenStream, stringValue, tokenStreamValue, toString
public FeatureField(String fieldName, String featureName, float featureValue)

Create a feature.

Parameters:
    fieldName - The name of the field to store the information into. All features may be stored in the same field.
    featureName - The name of the feature, e.g. 'pagerank'. It will be indexed as a term.
    featureValue - The value of the feature, must be a positive, finite, normal float.

public void setFeatureValue(float featureValue)

Update the feature value of this field.
public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse)

Creates the TokenStream used for indexing this field.

Specified by:
    tokenStream in interface IndexableField
Overrides:
    tokenStream in class Field
Parameters:
    analyzer - Analyzer that should be used to create the TokenStreams from
    reuse - TokenStream for a previous instance of this field name. This allows custom field types (like StringField and NumericField) that do not use the analyzer to still have good performance. Note: the passed-in type may be inappropriate, for example if you mix up different types of Fields for the same field name. So it's the responsibility of the implementation to check.

public static Query newLogQuery(String fieldName, String featureName, float weight, float scalingFactor)
Return a new Query that will score documents as weight * Math.log(scalingFactor + S) where S is the value of the static feature.

Parameters:
    fieldName - field that stores features
    featureName - name of the feature
    weight - weight to give to this feature, must be in (0,64]
    scalingFactor - scaling factor applied before taking the logarithm, must be in [1, +Infinity)
Throws:
    IllegalArgumentException - if weight is not in (0,64] or scalingFactor is not in [1, +Infinity)

public static Query newSaturationQuery(String fieldName, String featureName, float weight, float pivot)
Return a new Query that will score documents as weight * S / (S + pivot) where S is the value of the static feature.

Parameters:
    fieldName - field that stores features
    featureName - name of the feature
    weight - weight to give to this feature, must be in (0,64]
    pivot - feature value that would give a score contribution equal to weight/2, must be in (0, +Infinity)
Throws:
    IllegalArgumentException - if weight is not in (0,64] or pivot is not in (0, +Infinity)

public static Query newSaturationQuery(String fieldName, String featureName)
Same as newSaturationQuery(String, String, float, float) but 1f is used as a weight and a reasonably good default pivot value is computed based on index statistics and is approximately equal to the geometric mean of all values that exist in the index.

Parameters:
    fieldName - field that stores features
    featureName - name of the feature
Throws:
    IllegalArgumentException - if weight is not in (0,64] or pivot is not in (0, +Infinity)

public static Query newSigmoidQuery(String fieldName, String featureName, float weight, float pivot, float exp)
Return a new Query that will score documents as weight * S^a / (S^a + pivot^a) where S is the value of the static feature.

Parameters:
    fieldName - field that stores features
    featureName - name of the feature
    weight - weight to give to this feature, must be in (0,64]
    pivot - feature value that would give a score contribution equal to weight/2, must be in (0, +Infinity)
    exp - exponent, higher values make the function grow slower before 'pivot' and faster after 'pivot', must be in (0, +Infinity)
Throws:
    IllegalArgumentException - if weight is not in (0,64] or either pivot or exp are not in (0, +Infinity)

Copyright © 2000-2018 Apache Software Foundation. All Rights Reserved.