Class FeatureField

  • All Implemented Interfaces:
    IndexableField

    public final class FeatureField
    extends Field
    Field that can be used to store static scoring factors into documents. This is mostly inspired by the work of Nick Craswell, Stephen Robertson, Hugo Zaragoza and Michael Taylor: Relevance weighting for query independent evidence. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. August 15-19, 2005, Salvador, Brazil.

    Feature values are internally encoded as term frequencies. Putting feature queries as BooleanClause.Occur.SHOULD clauses of a BooleanQuery makes it possible to combine query-dependent scores (e.g. BM25) with query-independent scores using a linear combination. The fact that feature values are stored as frequencies also allows search logic to efficiently skip documents that can't be competitive when total hit counts are not requested. This makes it a compelling option compared to storing such factors elsewhere, e.g. in a doc-value field.

    This field may only store factors that are positively correlated with the final score, like pagerank. For factors that are inversely correlated with the score, like URL length, the inverse of the scoring factor should be stored, i.e. 1/urlLength.

    This field only considers the top 9 significant bits for storage efficiency, which allows values to be stored on 16 bits internally. In practice this limitation means that values are stored with a relative precision of 2^-8 = 0.00390625.
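
    The precision figure above can be checked with a small sketch. It assumes, for illustration only, that a value is reduced to its top 9 significant bits by dropping the low 15 bits of its IEEE 754 single-precision representation. The class and method names (FeaturePrecisionSketch, encode, decode, relativeError) are hypothetical and not part of this API.

```java
final class FeaturePrecisionSketch {

  // Assumption: keep the 8 exponent bits and the top 8 mantissa bits
  // (9 significant bits counting the implicit leading 1) and drop the rest.
  // For positive floats the sign bit is 0, so the result fits in 16 bits.
  static int encode(float value) {
    return Float.floatToIntBits(value) >>> 15;
  }

  // Restore a float from the 16 retained bits; the dropped bits become 0.
  static float decode(int encoded) {
    return Float.intBitsToFloat(encoded << 15);
  }

  // Relative error introduced by the truncation, always in [0, 2^-8).
  static float relativeError(float value) {
    float decoded = decode(encode(value));
    return (value - decoded) / value;
  }

  public static void main(String[] args) {
    float value = 3.14159f;
    System.out.println(value + " is stored as " + decode(encode(value)));
    System.out.println("relative error: " + relativeError(value)
        + " (bound: " + (1f / 256) + ")");
  }
}
```

    Because the truncation rounds toward zero, the decoded value never exceeds the original one, and two features that differ by less than about 0.4% may end up with the same stored value.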

    Given a scoring factor S > 0 and its weight w > 0, there are four ways that S can be turned into a score:

    • w * log(a + S), with a ≥ 1. This function usually makes sense because the distribution of scoring factors often follows a power law. This is typically the case for pagerank for instance. However the paper suggested that the satu and sigm functions give even better results.
    • satu(S) = w * S / (S + k), with k > 0. This function is similar to the one used by BM25Similarity in order to incorporate term frequency into the final score. For w = 1 it produces values between 0 and 1, and a value of 0.5 is obtained when S and k are equal.
    • sigm(S) = w * S^a / (S^a + k^a), with k > 0, a > 0. This function provided even better results than the two above but is also harder to tune due to the fact it has 2 parameters. Like with satu, values are in the 0..1 range and 0.5 is obtained when S and k are equal.
    • w * S. Expert: This function doesn't apply any transformation to an indexed feature value, and the indexed value itself, multiplied by weight, determines the score. Thus, there is an expectation that a feature value is encoded in the index in a way that makes sense for scoring.
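
    The four functions listed above can be compared side by side with a plain-Java sketch. It only restates the formulas and does not call into the library; the class FeatureScoringSketch and its method names are hypothetical, and the sample parameter values are arbitrary.

```java
final class FeatureScoringSketch {

  // w * log(a + S), with a >= 1
  static double log(double w, double a, double s) {
    return w * Math.log(a + s);
  }

  // satu(S) = w * S / (S + k), with k > 0
  static double satu(double w, double k, double s) {
    return w * s / (s + k);
  }

  // sigm(S) = w * S^a / (S^a + k^a), with k > 0, a > 0
  static double sigm(double w, double k, double a, double s) {
    double sa = Math.pow(s, a);
    return w * sa / (sa + Math.pow(k, a));
  }

  // w * S, no transformation of the feature value
  static double linear(double w, double s) {
    return w * s;
  }

  public static void main(String[] args) {
    double w = 1.0, k = 10.0, a = 2.0;
    for (double s : new double[] {1, 10, 100, 1000}) {
      System.out.printf("S=%.0f log=%.3f satu=%.3f sigm=%.3f linear=%.1f%n",
          s, log(w, 1.0, s), satu(w, k, s), sigm(w, k, a, s), linear(w, s));
    }
  }
}
```

    With these sample parameters, satu and sigm both return w / 2 at S = k and level off toward w for large S, whereas the log and linear forms keep growing.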

    The constants in the above formulas typically need training in order to compute optimal values. If you don't know where to start, the newSaturationQuery(String, String) method uses 1f as a weight and tries to guess a sensible value for the pivot parameter of the saturation function based on index statistics, which shouldn't perform too badly. Here is an example, assuming that documents have a FeatureField called 'features' with values for the 'pagerank' feature.

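     import org.apache.lucene.document.FeatureField;
     import org.apache.lucene.index.Term;
     import org.apache.lucene.search.BooleanClause.Occur;
     import org.apache.lucene.search.BooleanQuery;
     import org.apache.lucene.search.Query;
     import org.apache.lucene.search.TermQuery;
     import org.apache.lucene.search.TopDocs;

     Query query = new BooleanQuery.Builder()
         .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
         .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
         .build();
     // Query-independent boost; 'searcher' below is an existing IndexSearcher.
     Query boost = FeatureField.newSaturationQuery("features", "pagerank");
     Query boostedQuery = new BooleanQuery.Builder()
         .add(query, Occur.MUST)
         .add(boost, Occur.SHOULD)
         .build();
     TopDocs topDocs = searcher.search(boostedQuery, 10);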
     
    WARNING: This API is experimental and might change in incompatible ways in the next release.
    • Constructor Detail

      • FeatureField

        public FeatureField​(String fieldName,
                            String featureName,
                            float featureValue)
        Create a feature.
        Parameters:
        fieldName - The name of the field to store the information into. All features may be stored in the same field.
        featureName - The name of the feature, e.g. 'pagerank'. It will be indexed as a term.
        featureValue - The value of the feature, must be a positive, finite, normal float.
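        The constraint on featureValue (positive, finite, normal) can be expressed with standard java.lang.Float helpers. The sketch below is an illustrative pre-check a caller might run before constructing a field; isStorable is a hypothetical helper, not a method of this class, and the library's own validation may differ in detail.

```java
final class FeatureValueCheckSketch {

  // True when the value is positive, finite and a normal (non-subnormal) float.
  // NaN and infinities fail isFinite; zero, negatives and subnormals fail the
  // comparison against the smallest positive normal float.
  static boolean isStorable(float value) {
    return Float.isFinite(value) && value >= Float.MIN_NORMAL;
  }

  public static void main(String[] args) {
    float urlLength = 80f;
    // Inversely correlated factor: store the inverse, as described above.
    float inverse = 1f / urlLength;
    System.out.println("1/urlLength storable: " + isStorable(inverse));
    System.out.println("0 storable: " + isStorable(0f));
    System.out.println("subnormal storable: " + isStorable(Float.MIN_VALUE));
  }
}
```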
    • Method Detail

      • setFeatureValue

        public void setFeatureValue​(float featureValue)
        Update the feature value of this field.
      • tokenStream

        public TokenStream tokenStream​(Analyzer analyzer,
                                       TokenStream reuse)
        Description copied from interface: IndexableField
        Creates the TokenStream used for indexing this field. If appropriate, implementations should use the given Analyzer to create the TokenStreams.
        Specified by:
        tokenStream in interface IndexableField
        Overrides:
        tokenStream in class Field
        Parameters:
        analyzer - Analyzer that should be used to create the TokenStreams from
        reuse - TokenStream for a previous instance of this field name. This allows custom field types (like StringField and NumericField) that do not use the analyzer to still have good performance. Note: the passed-in type may be inappropriate, for example if you mix up different types of Fields for the same field name. So it's the responsibility of the implementation to check.
        Returns:
        TokenStream value for indexing the document. Should always return a non-null value if the field is to be indexed
      • newLinearQuery

        public static Query newLinearQuery​(String fieldName,
                                           String featureName,
                                           float weight)
        Return a new Query that will score documents as weight * S where S is the value of the static feature.
        Parameters:
        fieldName - field that stores features
        featureName - name of the feature
        weight - weight to give to this feature, must be in (0,64]
        Throws:
        IllegalArgumentException - if weight is not in (0,64]
      • newLogQuery

        public static Query newLogQuery​(String fieldName,
                                        String featureName,
                                        float weight,
                                        float scalingFactor)
        Return a new Query that will score documents as weight * Math.log(scalingFactor + S) where S is the value of the static feature.
        Parameters:
        fieldName - field that stores features
        featureName - name of the feature
        weight - weight to give to this feature, must be in (0,64]
        scalingFactor - scaling factor applied before taking the logarithm, must be in [1, +Infinity)
        Throws:
        IllegalArgumentException - if weight is not in (0,64] or scalingFactor is not in [1, +Infinity)
      • newSaturationQuery

        public static Query newSaturationQuery​(String fieldName,
                                               String featureName,
                                               float weight,
                                               float pivot)
        Return a new Query that will score documents as weight * S / (S + pivot) where S is the value of the static feature.
        Parameters:
        fieldName - field that stores features
        featureName - name of the feature
        weight - weight to give to this feature, must be in (0,64]
        pivot - feature value that would give a score contribution equal to weight/2, must be in (0, +Infinity)
        Throws:
        IllegalArgumentException - if weight is not in (0,64] or pivot is not in (0, +Infinity)
      • newSaturationQuery

        public static Query newSaturationQuery​(String fieldName,
                                               String featureName)
        Same as newSaturationQuery(String, String, float, float) but 1f is used as a weight and a reasonably good default pivot value is computed based on index statistics and is approximately equal to the geometric mean of all values that exist in the index.
        Parameters:
        fieldName - field that stores features
        featureName - name of the feature
        Throws:
        IllegalArgumentException - if weight is not in (0,64] or pivot is not in (0, +Infinity)
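        To give an intuition for why a geometric mean is a plausible default pivot, the sketch below computes one over a handful of made-up feature values. It does not reproduce how the library derives its estimate from index statistics; PivotGuessSketch and geometricMean are hypothetical names used only for illustration.

```java
final class PivotGuessSketch {

  // Geometric mean computed as exp(mean(log(v))) to avoid overflow
  // from multiplying many values together.
  static double geometricMean(double[] values) {
    if (values.length == 0) {
      throw new IllegalArgumentException("at least one value is required");
    }
    double sumOfLogs = 0;
    for (double v : values) {
      if (v <= 0) {
        throw new IllegalArgumentException("values must be positive: " + v);
      }
      sumOfLogs += Math.log(v);
    }
    return Math.exp(sumOfLogs / values.length);
  }

  public static void main(String[] args) {
    // Made-up, power-law-like feature values.
    double[] pagerank = {0.5, 1, 2, 4, 1000};
    double pivot = geometricMean(pagerank);
    System.out.println("geometric mean pivot: " + pivot);
    // A document whose value equals the pivot scores weight / 2.
    System.out.println("satu at pivot: " + pivot / (pivot + pivot));
  }
}
```

        Unlike an arithmetic mean, the geometric mean is not dominated by a few very large values, so a pivot chosen this way sits near the middle of a skewed distribution on a logarithmic scale.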
      • newSigmoidQuery

        public static Query newSigmoidQuery​(String fieldName,
                                            String featureName,
                                            float weight,
                                            float pivot,
                                            float exp)
        Return a new Query that will score documents as weight * S^a / (S^a + pivot^a) where S is the value of the static feature.
        Parameters:
        fieldName - field that stores features
        featureName - name of the feature
        weight - weight to give to this feature, must be in (0,64]
        pivot - feature value that would give a score contribution equal to weight/2, must be in (0, +Infinity)
        exp - exponent, higher values make the function grow slower before 'pivot' and faster after 'pivot', must be in (0, +Infinity)
        Throws:
        IllegalArgumentException - if weight is not in (0,64] or either pivot or exp is not in (0, +Infinity)
      • newFeatureSort

        public static SortField newFeatureSort​(String field,
                                               String featureName)
        Creates a SortField for sorting by the value of a feature.

        This sort orders documents by descending value of a feature. The value returned in FieldDoc for the hits contains a Float instance with the feature value.

        If a document is missing the field, then it is treated as having a value of 0.0f.

        Parameters:
        field - field name. Must not be null.
        featureName - feature name. Must not be null.
        Returns:
        SortField ordering documents by the value of the feature
        Throws:
        NullPointerException - if field or featureName is null.
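        The ordering described above (descending feature value, missing treated as 0.0f) can be mimicked on an in-memory map. This is only a model of the semantics, not of how the SortField is implemented; FeatureSortSketch, sortByFeature and the sample document ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FeatureSortSketch {

  // Orders document ids by descending feature value; ids without an entry
  // in the map are treated as having a value of 0.0f.
  static List<String> sortByFeature(List<String> docIds, Map<String, Float> values) {
    List<String> sorted = new ArrayList<>(docIds);
    sorted.sort((a, b) -> Float.compare(
        values.getOrDefault(b, 0.0f), values.getOrDefault(a, 0.0f)));
    return sorted;
  }

  public static void main(String[] args) {
    Map<String, Float> pagerank = new HashMap<>();
    pagerank.put("doc-a", 1.0f);
    pagerank.put("doc-c", 5.0f);
    // "doc-b" has no value and therefore sorts last.
    System.out.println(sortByFeature(Arrays.asList("doc-a", "doc-b", "doc-c"), pagerank));
  }
}
```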
      • newDoubleValues

        public static DoubleValuesSource newDoubleValues​(String field,
                                                         String featureName)
        Creates a DoubleValuesSource instance which can be used to read the values of a feature from a FeatureField for documents.
        Parameters:
        field - field name. Must not be null.
        featureName - feature name. Must not be null.
        Returns:
        a DoubleValuesSource which can be used to access the values of the feature for documents
        Throws:
        NullPointerException - if field or featureName is null.