Package org.apache.lucene.analysis.query
Class QueryAutoStopWordAnalyzer
java.lang.Object
org.apache.lucene.analysis.Analyzer
org.apache.lucene.analysis.AnalyzerWrapper
org.apache.lucene.analysis.query.QueryAutoStopWordAnalyzer
- All Implemented Interfaces:
Closeable, AutoCloseable
An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.
For very large indexes, the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38 million document index in which a single term appeared in around 50% of documents, causing TermQueries for that term to take 2 seconds.
- Since:
- 3.1
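The wrapping behavior described above can be pictured with a small plain-Java sketch. It does not use Lucene: the class QueryStopFilterSketch, its filter method, and the sample fields and terms are hypothetical names chosen for illustration. It only shows the idea of dropping, per field, the tokens that were previously identified as too common.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: plain Java, not the Lucene implementation.
final class QueryStopFilterSketch {

    // Drops every token listed as a stop word for the given field.
    // Fields without any stop words pass all tokens through unchanged.
    static List<String> filter(String fieldName, List<String> tokens,
                               Map<String, Set<String>> stopWordsPerField) {
        Set<String> stopWords = stopWordsPerField.getOrDefault(fieldName, Set.of());
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stop words: "the" is too common in the "body" field only.
        Map<String, Set<String>> stopWords = Map.of("body", Set.of("the"));
        System.out.println(filter("body", List.of("the", "quick", "fox"), stopWords));  // prints [quick, fox]
        System.out.println(filter("title", List.of("the", "quick", "fox"), stopWords)); // prints [the, quick, fox]
    }
}
```

In this sketch the stop words are keyed by field name, so a term that is very common in one field can still be queried in another.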
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer
Analyzer.ReuseStrategy, Analyzer.TokenStreamComponents
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.Analyzer
GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
-
Constructor Summary
Constructor - Description
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader) - Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, float maxPercentDocs) - Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, int maxDocFreq) - Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, float maxPercentDocs) - Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, int maxDocFreq) - Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
-
Method Summary
Modifier and Type, Method - Description
Term[] getStopWords() - Provides information on which stop words have been identified for all fields
String[] getStopWords(String fieldName) - Provides information on which stop words have been identified for a field
protected Analyzer getWrappedAnalyzer(String fieldName)
protected Analyzer.TokenStreamComponents wrapComponents(String fieldName, Analyzer.TokenStreamComponents components)
Methods inherited from class org.apache.lucene.analysis.AnalyzerWrapper
attributeFactory, createComponents, getOffsetGap, getPositionIncrementGap, initReader, initReaderForNormalization, normalize, wrapReader, wrapReaderForNormalization, wrapTokenStreamForNormalization
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, getReuseStrategy, normalize, tokenStream, tokenStream
-
Field Details
-
defaultMaxDocFreqPercent
public static final float defaultMaxDocFreqPercent
- See Also:
-
-
Constructor Details
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
- Throws:
IOException - Can be thrown while reading from the IndexReader
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, int maxDocFreq) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
- Throws:
IOException - Can be thrown while reading from the IndexReader
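The maxDocFreq rule above can be illustrated with a minimal plain-Java sketch. It does not read a Lucene index: the class DocFreqThresholdSketch, the method selectStopWords, and the sample document frequencies are hypothetical and exist only to show the strict "greater than" comparison.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: plain Java, not the Lucene implementation.
final class DocFreqThresholdSketch {

    // Returns the terms whose document frequency is strictly greater than maxDocFreq.
    static Set<String> selectStopWords(Map<String, Integer> docFreqByTerm, int maxDocFreq) {
        Set<String> stopWords = new TreeSet<>(); // sorted for stable output
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        // Hypothetical document frequencies for the terms of one field.
        Map<String, Integer> docFreqs = Map.of("the", 950, "lucene", 120, "analyzer", 40);
        System.out.println(selectStopWords(docFreqs, 100)); // prints [lucene, the]
    }
}
```

A term whose document frequency equals maxDocFreq exactly is not selected in this sketch, matching the "greater than" wording of the description.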
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, float maxPercentDocs) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxPercentDocs - The maximum percentage, expressed as a fraction between 0.0 and 1.0, of index documents that may contain a term; a term found in more documents than this is considered a stop word
- Throws:
IOException - Can be thrown while reading from the IndexReader
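One way to read maxPercentDocs is as a fraction of the index size that is turned into an absolute document frequency cutoff. The sketch below assumes the cutoff is the document count multiplied by the fraction and truncated to an integer. That conversion, along with the names PercentThresholdSketch and toMaxDocFreq, is an assumption made for illustration and is not stated on this page.

```java
// Illustrative sketch only: plain Java, not the Lucene implementation.
final class PercentThresholdSketch {

    // Converts a fraction of the index (0.0 to 1.0) into an absolute
    // document frequency cutoff. Truncation toward zero is an assumption.
    static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        return (int) (numDocs * maxPercentDocs);
    }

    public static void main(String[] args) {
        // Index size taken from the class description; 0.5f stands for 50% of documents.
        System.out.println(toMaxDocFreq(38_000_000, 0.5f)); // prints 19000000
    }
}
```

Under this assumption, a term would need to appear in more than 19,000,000 documents of a 38 million document index to be treated as a stop word at a setting of 0.5.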
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, float maxPercentDocs) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxPercentDocs - The maximum percentage, expressed as a fraction between 0.0 and 1.0, of index documents that may contain a term; a term found in more documents than this is considered a stop word
- Throws:
IOException - Can be thrown while reading from the IndexReader
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, int maxDocFreq) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
- Throws:
IOException - Can be thrown while reading from the IndexReader
-
-
Method Details
-
getWrappedAnalyzer
protected Analyzer getWrappedAnalyzer(String fieldName)
- Specified by:
getWrappedAnalyzer in class AnalyzerWrapper
-
wrapComponents
protected Analyzer.TokenStreamComponents wrapComponents(String fieldName, Analyzer.TokenStreamComponents components)
- Overrides:
wrapComponents in class AnalyzerWrapper
-
getStopWords
public String[] getStopWords(String fieldName)
Provides information on which stop words have been identified for a field
- Parameters:
fieldName - The field for which stop words identified in "addStopWords" method calls will be returned
- Returns:
the stop words identified for a field
-
getStopWords
public Term[] getStopWords()
Provides information on which stop words have been identified for all fields
- Returns:
the stop words (as terms)
-