org.apache.lucene.queries
Class CommonTermsQuery

java.lang.Object
  extended by org.apache.lucene.search.Query
      extended by org.apache.lucene.queries.CommonTermsQuery
All Implemented Interfaces:
Cloneable

public class CommonTermsQuery
extends Query

A query that executes high-frequency terms in an optional sub-query to prevent slow queries due to "common" terms like stopwords. This query builds two queries from the added terms: low-frequency terms are added to a required boolean clause and high-frequency terms are added to an optional boolean clause. The optional clause is only executed if the required "low-frequency" clause matches. Scores produced by this query will differ slightly from those of a plain BooleanQuery scorer, mainly due to differences in the number of leaf queries in the required boolean clause. In most cases, high-frequency terms are unlikely to contribute significantly to the document score unless at least one of the low-frequency terms is matched, so this query can improve query execution times significantly where applicable.

CommonTermsQuery has several advantages over stopword filtering at index or query time: a term is "classified" based on its actual document frequency in the index, so slow queries can be prevented even across domains without specialized stopword files.

Note: if the query contains only high-frequency terms, it is rewritten into a plain conjunction query, i.e., all high-frequency terms must match for a document to match.
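
The split described above can be modelled without Lucene. The following is a minimal sketch, not the class's actual implementation: the names TermSplitSketch, Split, split and isHighFrequency are hypothetical, document frequencies are passed in as a plain map instead of being read from an index, and the handling of a fractional threshold (rounding maxTermFrequency * maxDoc up) is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermSplitSketch {

    /** Holds the two term groups produced by the split. */
    public static final class Split {
        public final List<String> lowFreq = new ArrayList<>();
        public final List<String> highFreq = new ArrayList<>();
    }

    /**
     * Decides whether a term counts as high-frequency.
     * A threshold >= 1 is treated as an absolute document count;
     * a threshold in [0..1) is treated as a fraction of maxDoc
     * (rounding up is an assumption of this sketch).
     */
    public static boolean isHighFrequency(int docFreq, int maxDoc, float maxTermFrequency) {
        if (maxTermFrequency >= 1f) {
            return docFreq > maxTermFrequency;
        }
        return docFreq > (int) Math.ceil(maxTermFrequency * (float) maxDoc);
    }

    /** Sorts each term into the low- or high-frequency group by its document frequency. */
    public static Split split(Map<String, Integer> docFreqs, int maxDoc, float maxTermFrequency) {
        Split result = new Split();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            if (isHighFrequency(entry.getValue(), maxDoc, maxTermFrequency)) {
                result.highFreq.add(entry.getKey());
            } else {
                result.lowFreq.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical document frequencies in an index of 1000 documents.
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("the", 950);
        docFreqs.put("lucene", 12);
        docFreqs.put("of", 800);

        // Terms occurring in more than 10% of documents are high-frequency.
        Split s = split(docFreqs, 1000, 0.1f);
        System.out.println("low=" + s.lowFreq + " high=" + s.highFreq);
        // prints: low=[lucene] high=[the, of]
    }
}
```

In this model "lucene" would end up in the required clause and "the" and "of" in the optional clause. No stopword list is involved; the classification depends only on the frequencies observed in the index.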


Field Summary
protected  boolean disableCoord
           
protected  float highFreqBoost
           
protected  BooleanClause.Occur highFreqOccur
           
protected  float lowFreqBoost
           
protected  BooleanClause.Occur lowFreqOccur
           
protected  float maxTermFrequency
           
protected  float minNrShouldMatch
           
protected  List<Term> terms
           
 
Constructor Summary
CommonTermsQuery(BooleanClause.Occur highFreqOccur, BooleanClause.Occur lowFreqOccur, float maxTermFrequency)
          Creates a new CommonTermsQuery
CommonTermsQuery(BooleanClause.Occur highFreqOccur, BooleanClause.Occur lowFreqOccur, float maxTermFrequency, boolean disableCoord)
          Creates a new CommonTermsQuery
 
Method Summary
 void add(Term term)
          Adds a term to the CommonTermsQuery
protected  Query buildQuery(int maxDoc, TermContext[] contextArray, Term[] queryTerms)
           
protected  int calcLowFreqMinimumNumberShouldMatch(int numOptional)
           
 void collectTermContext(IndexReader reader, List<AtomicReaderContext> leaves, TermContext[] contextArray, Term[] queryTerms)
           
 boolean equals(Object obj)
           
 void extractTerms(Set<Term> terms)
           
 float getMinimumNumberShouldMatch()
          Gets the minimum number of the optional BooleanClauses which must be satisfied.
 int hashCode()
           
 boolean isCoordDisabled()
          Returns true iff Similarity.coord(int,int) is disabled in scoring for the high and low frequency query instance.
 Query rewrite(IndexReader reader)
           
 void setMinimumNumberShouldMatch(float min)
          Specifies a minimum number of the optional BooleanClauses which must be satisfied in order to produce a match on the low frequency terms query part.
 String toString(String field)
           
 
Methods inherited from class org.apache.lucene.search.Query
clone, createWeight, getBoost, setBoost, toString
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

terms

protected final List<Term> terms

disableCoord

protected final boolean disableCoord

maxTermFrequency

protected final float maxTermFrequency

lowFreqOccur

protected final BooleanClause.Occur lowFreqOccur

highFreqOccur

protected final BooleanClause.Occur highFreqOccur

lowFreqBoost

protected float lowFreqBoost

highFreqBoost

protected float highFreqBoost

minNrShouldMatch

protected float minNrShouldMatch
Constructor Detail

CommonTermsQuery

public CommonTermsQuery(BooleanClause.Occur highFreqOccur,
                        BooleanClause.Occur lowFreqOccur,
                        float maxTermFrequency)
Creates a new CommonTermsQuery

Parameters:
highFreqOccur - BooleanClause.Occur used for high frequency terms
lowFreqOccur - BooleanClause.Occur used for low frequency terms
maxTermFrequency - a value in [0..1) (or an absolute number >=1) representing the maximum threshold of a term's document frequency to be considered a low frequency term.
Throws:
IllegalArgumentException - if BooleanClause.Occur.MUST_NOT is passed as lowFreqOccur or highFreqOccur

CommonTermsQuery

public CommonTermsQuery(BooleanClause.Occur highFreqOccur,
                        BooleanClause.Occur lowFreqOccur,
                        float maxTermFrequency,
                        boolean disableCoord)
Creates a new CommonTermsQuery

Parameters:
highFreqOccur - BooleanClause.Occur used for high frequency terms
lowFreqOccur - BooleanClause.Occur used for low frequency terms
maxTermFrequency - a value in [0..1) (or an absolute number >=1) representing the maximum threshold of a term's document frequency to be considered a low frequency term.
disableCoord - disables Similarity.coord(int,int) in scoring for the low / high frequency sub-queries
Throws:
IllegalArgumentException - if BooleanClause.Occur.MUST_NOT is passed as lowFreqOccur or highFreqOccur
Method Detail

add

public void add(Term term)
Adds a term to the CommonTermsQuery

Parameters:
term - the term to add

rewrite

public Query rewrite(IndexReader reader)
              throws IOException
Overrides:
rewrite in class Query
Throws:
IOException

calcLowFreqMinimumNumberShouldMatch

protected int calcLowFreqMinimumNumberShouldMatch(int numOptional)

buildQuery

protected Query buildQuery(int maxDoc,
                           TermContext[] contextArray,
                           Term[] queryTerms)

collectTermContext

public void collectTermContext(IndexReader reader,
                               List<AtomicReaderContext> leaves,
                               TermContext[] contextArray,
                               Term[] queryTerms)
                        throws IOException
Throws:
IOException

isCoordDisabled

public boolean isCoordDisabled()
Returns true iff Similarity.coord(int,int) is disabled in scoring for the high and low frequency query instance. The top level query will always disable coords.
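
For context on what disabling coord changes: the coord factor scales a document's score by the fraction of query clauses it matched. The following is a minimal sketch, assuming the overlap / maxOverlap formula used by Lucene's default similarity. The class CoordSketch and its methods are hypothetical and model only this one factor, not the full scoring formula.

```java
public class CoordSketch {

    /** Fraction of query clauses matched, as computed by a default-style coord. */
    public static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /**
     * Applies the coord factor to a summed clause score.
     * With coord disabled, the factor is fixed at 1 and the sum is returned unchanged.
     */
    public static float score(float sumOfClauseScores, int overlap, int maxOverlap,
                              boolean disableCoord) {
        float factor = disableCoord ? 1.0f : coord(overlap, maxOverlap);
        return sumOfClauseScores * factor;
    }

    public static void main(String[] args) {
        // A document matching 2 of 4 clauses with a summed clause score of 2.0.
        System.out.println(score(2.0f, 2, 4, false)); // coord applied
        System.out.println(score(2.0f, 2, 4, true));  // coord disabled
    }
}
```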


setMinimumNumberShouldMatch

public void setMinimumNumberShouldMatch(float min)
Specifies a minimum number of the optional BooleanClauses which must be satisfied in order to produce a match on the low-frequency terms query part. This method accepts a float value in the range [0..1) as a fraction of the actual query terms in the low-frequency clause, or a number >=1 as an absolute number of clauses that need to match.

By default no optional clauses are necessary for a match (unless there are no required clauses). If this method is used, then the specified number of clauses is required.

Parameters:
min - the number of optional clauses that must match
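
The two interpretations of min (fraction versus absolute count) can be illustrated with a small standalone calculation. This is a sketch under stated assumptions, not the body of calcLowFreqMinimumNumberShouldMatch: the class MinShouldMatchSketch and the method resolve are hypothetical, and rounding the fractional case to the nearest integer is an assumption.

```java
public class MinShouldMatchSketch {

    /**
     * Resolves a minimum-should-match setting to a clause count.
     *
     * @param min         0 (nothing required), a fraction in (0..1), or an absolute count >= 1
     * @param numOptional number of optional low-frequency clauses in the query
     */
    public static int resolve(float min, int numOptional) {
        if (min >= 1.0f || min == 0.0f) {
            // Absolute count (or "none required"): use the value directly.
            return (int) min;
        }
        // Fraction of the optional clauses; nearest-integer rounding is assumed here.
        return Math.round(min * numOptional);
    }

    public static void main(String[] args) {
        System.out.println(resolve(0.5f, 4));  // half of 4 optional clauses
        System.out.println(resolve(3.0f, 10)); // absolute count of 3
        System.out.println(resolve(0.0f, 5));  // default: none required
    }
}
```

Note that an absolute value is used as given in this sketch, even if it exceeds the number of optional clauses, in which case the low-frequency part could never match.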

getMinimumNumberShouldMatch

public float getMinimumNumberShouldMatch()
Gets the minimum number of the optional BooleanClauses which must be satisfied.


extractTerms

public void extractTerms(Set<Term> terms)
Overrides:
extractTerms in class Query

toString

public String toString(String field)
Specified by:
toString in class Query

hashCode

public int hashCode()
Overrides:
hashCode in class Query

equals

public boolean equals(Object obj)
Overrides:
equals in class Query


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.