org.apache.lucene.analysis.query
Class QueryAutoStopWordAnalyzer

java.lang.Object
  extended by org.apache.lucene.analysis.Analyzer
      extended by org.apache.lucene.analysis.query.QueryAutoStopWordAnalyzer

public class QueryAutoStopWordAnalyzer
extends org.apache.lucene.analysis.Analyzer

An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.

For very large indexes, the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38-million-document index in which one term appeared in around 50% of documents, causing TermQueries for that term to take 2 seconds.

Use the various "addStopWords" methods in this class to automate the identification and addition of stop words found in an already existing index.
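The identification step can be illustrated without an index. The following is a minimal sketch, not Lucene's implementation: the class name DocFreqStopWords, the method identify, and the in-memory map of document frequencies are all hypothetical stand-ins for what an IndexReader would supply. It assumes, following the wording "terms exceeding", that a term becomes a stop word only when its document frequency is strictly greater than the cutoff.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this class is not part of Lucene.
class DocFreqStopWords {

    /**
     * Returns the terms whose document frequency is strictly greater than
     * maxDocFreq. The map stands in for per-term statistics read from an index.
     */
    public static Set<String> identify(Map<String, Integer> docFreqByTerm, int maxDocFreq) {
        Set<String> stopWords = new TreeSet<String>();
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for a single field.
        Map<String, Integer> docFreqs = new HashMap<String, Integer>();
        docFreqs.put("the", 950);
        docFreqs.put("lucene", 120);
        docFreqs.put("analyzer", 40);

        System.out.println(identify(docFreqs, 400)); // prints [the]
    }
}
```

In this sketch the count of identified terms corresponds to the int that the addStopWords methods return.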


Field Summary
static float defaultMaxDocFreqPercent
           
 
Fields inherited from class org.apache.lucene.analysis.Analyzer
overridesTokenStreamMethod
 
Constructor Summary
QueryAutoStopWordAnalyzer(org.apache.lucene.analysis.Analyzer delegate)
          Deprecated. Use QueryAutoStopWordAnalyzer(Version, Analyzer) instead
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate)
          Initializes this analyzer with the Analyzer object that actually produces the tokens
 
Method Summary
 int addStopWords(org.apache.lucene.index.IndexReader reader)
          Automatically adds stop words for all fields with terms exceeding the defaultMaxDocFreqPercent
 int addStopWords(org.apache.lucene.index.IndexReader reader, float maxPercentDocs)
          Automatically adds stop words for all fields with terms exceeding the maxPercentDocs
 int addStopWords(org.apache.lucene.index.IndexReader reader, int maxDocFreq)
          Automatically adds stop words for all fields with terms exceeding the maxDocFreq
 int addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, float maxPercentDocs)
          Automatically adds stop words for the given field with terms exceeding the maxPercentDocs
 int addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, int maxDocFreq)
          Automatically adds stop words for the given field with terms exceeding the maxDocFreq
 org.apache.lucene.index.Term[] getStopWords()
          Provides information on which stop words have been identified for all fields
 String[] getStopWords(String fieldName)
          Provides information on which stop words have been identified for a field
 org.apache.lucene.analysis.TokenStream reusableTokenStream(String fieldName, Reader reader)
           
 org.apache.lucene.analysis.TokenStream tokenStream(String fieldName, Reader reader)
           
 
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, getOffsetGap, getPositionIncrementGap, getPreviousTokenStream, setOverridesTokenStreamMethod, setPreviousTokenStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

defaultMaxDocFreqPercent

public static final float defaultMaxDocFreqPercent
See Also:
Constant Field Values
Constructor Detail

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(org.apache.lucene.analysis.Analyzer delegate)
Deprecated. Use QueryAutoStopWordAnalyzer(Version, Analyzer) instead

Initializes this analyzer with the Analyzer object that actually produces the tokens

Parameters:
delegate - The choice of Analyzer that is used to produce the token stream which needs filtering

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion,
                                 org.apache.lucene.analysis.Analyzer delegate)
Initializes this analyzer with the Analyzer object that actually produces the tokens

Parameters:
matchVersion - The Lucene Version whose behavior should be matched for compatibility
delegate - The choice of Analyzer that is used to produce the token stream which needs filtering
Method Detail

addStopWords

public int addStopWords(org.apache.lucene.index.IndexReader reader)
                 throws IOException
Automatically adds stop words for all fields with terms exceeding the defaultMaxDocFreqPercent

Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
Returns:
The number of stop words identified.
Throws:
IOException

addStopWords

public int addStopWords(org.apache.lucene.index.IndexReader reader,
                        int maxDocFreq)
                 throws IOException
Automatically adds stop words for all fields with terms exceeding the maxDocFreq

Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word
Returns:
The number of stop words identified.
Throws:
IOException

addStopWords

public int addStopWords(org.apache.lucene.index.IndexReader reader,
                        float maxPercentDocs)
                 throws IOException
Automatically adds stop words for all fields with terms exceeding the maxPercentDocs

Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxPercentDocs - The maximum fraction (between 0.0 and 1.0) of index documents that may contain a term; a term found in more documents than this is considered to be a stop word.
Returns:
The number of stop words identified.
Throws:
IOException
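The relationship between the fractional and the absolute thresholds can be shown with a small worked example. This is a sketch under stated assumptions rather than the library's code: the class PercentToDocFreq and the method toMaxDocFreq are hypothetical, truncation toward zero is an assumed rounding choice, and the range check simply mirrors the documented 0.0 to 1.0 bounds. The index size is taken from the 38-million-document case in the class description.

```java
// Hypothetical illustration only; this class is not part of Lucene.
class PercentToDocFreq {

    /**
     * Converts a fraction of the index (0.0 to 1.0) into an absolute document
     * count. Truncating toward zero is an assumption made for this sketch.
     */
    public static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        if (maxPercentDocs < 0.0f || maxPercentDocs > 1.0f) {
            throw new IllegalArgumentException("maxPercentDocs must be between 0.0 and 1.0");
        }
        return (int) (numDocs * maxPercentDocs);
    }

    public static void main(String[] args) {
        int numDocs = 38000000;                          // index size from the class description
        int maxDocFreq = toMaxDocFreq(numDocs, 0.25f);   // example threshold of 25%
        System.out.println(maxDocFreq);                  // prints 9500000

        int termDocFreq = 19000000;                      // a term in about 50% of documents
        System.out.println(termDocFreq > maxDocFreq);    // prints true
    }
}
```

With these numbers, a term present in half of the documents exceeds the cutoff and would be treated as a stop word.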

addStopWords

public int addStopWords(org.apache.lucene.index.IndexReader reader,
                        String fieldName,
                        float maxPercentDocs)
                 throws IOException
Automatically adds stop words for the given field with terms exceeding the maxPercentDocs

Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName - The field for which stopwords will be added
maxPercentDocs - The maximum fraction (between 0.0 and 1.0) of index documents that may contain a term; a term found in more documents than this is considered to be a stop word.
Returns:
The number of stop words identified.
Throws:
IOException

addStopWords

public int addStopWords(org.apache.lucene.index.IndexReader reader,
                        String fieldName,
                        int maxDocFreq)
                 throws IOException
Automatically adds stop words for the given field with terms exceeding the maxDocFreq

Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName - The field for which stopwords will be added
maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word.
Returns:
The number of stop words identified.
Throws:
IOException

tokenStream

public org.apache.lucene.analysis.TokenStream tokenStream(String fieldName,
                                                          Reader reader)
Specified by:
tokenStream in class org.apache.lucene.analysis.Analyzer
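Because this method has no description, the following sketch illustrates the behavior implied by the class description: tokens produced by the wrapped analyzer pass through unchanged unless stop words have been identified for the requested field. It is a simplified model and not the actual implementation; the class PerFieldStopFilterSketch, its methods, and the use of plain string lists in place of a TokenStream are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this class is not part of Lucene.
class PerFieldStopFilterSketch {

    // Stop words keyed by field name, as a stand-in for the analyzer's state.
    private final Map<String, Set<String>> stopWordsPerField =
            new HashMap<String, Set<String>>();

    public void setStopWords(String fieldName, Set<String> words) {
        stopWordsPerField.put(fieldName, words);
    }

    /**
     * Drops tokens that are stop words for the given field. Fields with no
     * recorded stop words get the delegate's tokens back untouched.
     */
    public List<String> filter(String fieldName, List<String> delegateTokens) {
        Set<String> stopWords = stopWordsPerField.get(fieldName);
        if (stopWords == null) {
            return delegateTokens;
        }
        List<String> kept = new ArrayList<String>();
        for (String token : delegateTokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        PerFieldStopFilterSketch sketch = new PerFieldStopFilterSketch();
        sketch.setStopWords("body", new HashSet<String>(Arrays.asList("the")));

        List<String> tokens = Arrays.asList("the", "quick", "fox");
        System.out.println(sketch.filter("body", tokens));  // prints [quick, fox]
        System.out.println(sketch.filter("title", tokens)); // prints [the, quick, fox]
    }
}
```

Keeping the stop words per field means a term that is very common in one field can still be queried in another field where it is rare.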

reusableTokenStream

public org.apache.lucene.analysis.TokenStream reusableTokenStream(String fieldName,
                                                                  Reader reader)
                                                           throws IOException
Overrides:
reusableTokenStream in class org.apache.lucene.analysis.Analyzer
Throws:
IOException

getStopWords

public String[] getStopWords(String fieldName)
Provides information on which stop words have been identified for a field

Parameters:
fieldName - The field for which stop words identified in "addStopWords" method calls will be returned
Returns:
the stop words identified for a field

getStopWords

public org.apache.lucene.index.Term[] getStopWords()
Provides information on which stop words have been identified for all fields

Returns:
the stop words (as terms)


Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.