org.apache.lucene.analysis.query
Class QueryAutoStopWordAnalyzer

java.lang.Object
  extended by org.apache.lucene.analysis.Analyzer
      extended by org.apache.lucene.analysis.AnalyzerWrapper
          extended by org.apache.lucene.analysis.query.QueryAutoStopWordAnalyzer
All Implemented Interfaces:
Closeable

public final class QueryAutoStopWordAnalyzer
extends AnalyzerWrapper

An Analyzer used primarily at query time that wraps another analyzer and filters out very common words before they are passed into queries.

For very large indexes the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38-million-document index in which one term occurred in around 50% of documents, causing TermQueries for that term to take 2 seconds.
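The selection rule can be illustrated without Lucene. What follows is a minimal sketch, assuming each document is modeled as a set of terms. The class DocFreqSketch and its method commonTerms are hypothetical names used only for illustration and are not part of the Lucene API; the real analyzer obtains document frequencies from an IndexReader rather than scanning documents.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Lucene.
public class DocFreqSketch {

    // Returns the terms that occur in more than maxDocFreq documents.
    public static Set<String> commonTerms(List<Set<String>> docs, int maxDocFreq) {
        // Count, for every term, the number of documents containing it.
        Map<String, Integer> docFreq = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        // Keep only the terms whose document frequency exceeds the threshold.
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }
}
```

A term that appears in exactly maxDocFreq documents is kept as a normal query term in this sketch, matching the "greater than" wording used by the constructors.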


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer
Analyzer.GlobalReuseStrategy, Analyzer.PerFieldReuseStrategy, Analyzer.ReuseStrategy, Analyzer.TokenStreamComponents
 
Field Summary
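static float defaultMaxDocFreqPercent
          Default document frequency percentage above which a term is treated as a stopword when no threshold is supplied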
 
Constructor Summary
QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader)
          Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader, Collection<String> fields, float maxPercentDocs)
          Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader, Collection<String> fields, int maxDocFreq)
          Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader, float maxPercentDocs)
          Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader, int maxDocFreq)
          Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
 
Method Summary
 Term[] getStopWords()
          Provides information on which stop words have been identified for all fields
 String[] getStopWords(String fieldName)
          Provides information on which stop words have been identified for a field
protected  Analyzer getWrappedAnalyzer(String fieldName)
           
protected  Analyzer.TokenStreamComponents wrapComponents(String fieldName, Analyzer.TokenStreamComponents components)
           
 
Methods inherited from class org.apache.lucene.analysis.AnalyzerWrapper
createComponents, getOffsetGap, getPositionIncrementGap, initReader
 
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, tokenStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

defaultMaxDocFreqPercent

public static final float defaultMaxDocFreqPercent
See Also:
Constant Field Values
Constructor Detail

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader)
                          throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent

Parameters:
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
Throws:
IOException - Can be thrown while reading from the IndexReader

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader,
                                 int maxDocFreq)
                          throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq

Parameters:
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxDocFreq - Document frequency above which a term is treated as a stopword
Throws:
IOException - Can be thrown while reading from the IndexReader

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader,
                                 float maxPercentDocs)
                          throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs

Parameters:
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxPercentDocs - The maximum percentage, expressed as a fraction between 0.0 and 1.0, of index documents that can contain a term before it is considered a stop word
Throws:
IOException - Can be thrown while reading from the IndexReader
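
The relationship between maxPercentDocs and maxDocFreq can be sketched as follows. This is an illustration under an assumption: the fraction is multiplied by the number of documents in the index and truncated to an integer. The class ThresholdSketch and the method toMaxDocFreq are hypothetical names and are not part of the Lucene API.

```java
// Hypothetical illustration only; not part of Lucene.
public class ThresholdSketch {

    // Converts a fraction of documents (0.0 to 1.0) into an absolute
    // document frequency threshold, truncating toward zero (an assumption).
    public static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        return (int) (numDocs * maxPercentDocs);
    }
}
```

Under this assumption, an index of 10 documents with maxPercentDocs of 0.5 yields a threshold of 5, so only terms present in 6 or more documents would be treated as stop words.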

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader,
                                 Collection<String> fields,
                                 float maxPercentDocs)
                          throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs

Parameters:
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxPercentDocs - The maximum percentage, expressed as a fraction between 0.0 and 1.0, of index documents that can contain a term before it is considered a stop word
Throws:
IOException - Can be thrown while reading from the IndexReader

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader,
                                 Collection<String> fields,
                                 int maxDocFreq)
                          throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq

Parameters:
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxDocFreq - Document frequency above which a term is treated as a stopword
Throws:
IOException - Can be thrown while reading from the IndexReader
Method Detail

getWrappedAnalyzer

protected Analyzer getWrappedAnalyzer(String fieldName)
Specified by:
getWrappedAnalyzer in class AnalyzerWrapper

wrapComponents

protected Analyzer.TokenStreamComponents wrapComponents(String fieldName,
                                                        Analyzer.TokenStreamComponents components)
Specified by:
wrapComponents in class AnalyzerWrapper

getStopWords

public String[] getStopWords(String fieldName)
Provides information on which stop words have been identified for a field

Parameters:
fieldName - The field whose stop words, as identified when this analyzer was constructed, will be returned
Returns:
the stop words identified for a field

getStopWords

public Term[] getStopWords()
Provides information on which stop words have been identified for all fields

Returns:
the stop words (as terms)
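
The two getStopWords views can be pictured as a per-field lookup and a flattened list of (field, text) pairs. The following is a minimal sketch under that assumption. The class StopWordViewsSketch and its put method are hypothetical, Map.Entry stands in for Term, and returning an empty array for an unknown field is a choice made by this sketch rather than documented behavior.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Lucene.
public class StopWordViewsSketch {

    // Stop words grouped by field name, kept sorted for stable output.
    private final Map<String, Set<String>> stopWordsPerField = new TreeMap<>();

    // Registers stop words for a field (stands in for construction-time analysis).
    public void put(String field, String... words) {
        Set<String> set = stopWordsPerField.get(field);
        if (set == null) {
            set = new TreeSet<>();
            stopWordsPerField.put(field, set);
        }
        Collections.addAll(set, words);
    }

    // Per-field view: the stop words of one field, or an empty array.
    public String[] getStopWords(String fieldName) {
        Set<String> words = stopWordsPerField.get(fieldName);
        return words == null ? new String[0] : words.toArray(new String[0]);
    }

    // All-fields view: one (field, text) pair per stop word.
    public List<Map.Entry<String, String>> getStopWords() {
        List<Map.Entry<String, String>> all = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : stopWordsPerField.entrySet()) {
            for (String word : entry.getValue()) {
                all.add(new SimpleEntry<>(entry.getKey(), word));
            }
        }
        return all;
    }
}
```

The same word can appear once per field in the all-fields view, because a term is identified by its field together with its text.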


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.