public final class QueryAutoStopWordAnalyzer
extends org.apache.lucene.analysis.Analyzer
An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.
For very large indexes the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38 million document index in which one term appeared in around 50% of documents, causing TermQueries for that term to take 2 seconds.
Use the constructors that take an IndexReader, or the deprecated "addStopWords" methods in this class, to automate the identification and addition of stop words found in an already existing index.
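The idea behind automatic stop word identification can be sketched without Lucene. The following is a minimal illustration, not the class's actual implementation: the StopWordSketch class, its findStopWords method, and the sample document frequencies are all hypothetical, and the rule that a term qualifies only when its document frequency is strictly greater than the threshold is an assumption based on the "greater than" wording used in the constructor descriptions.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical, Lucene-free sketch of frequency-based stop word selection. */
final class StopWordSketch {

    /** Returns the terms whose document frequency is strictly greater than maxDocFreq. */
    static Set<String> findStopWords(Map<String, Integer> docFreqByTerm, int maxDocFreq) {
        Set<String> stopWords = new TreeSet<>(); // sorted for stable output
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for an imaginary 100-document index.
        Map<String, Integer> docFreqByTerm = new TreeMap<>();
        docFreqByTerm.put("the", 95);
        docFreqByTerm.put("of", 80);
        docFreqByTerm.put("analyzer", 40);
        docFreqByTerm.put("lucene", 12);

        // "analyzer" sits exactly at the threshold, so it is not selected.
        System.out.println(findStopWords(docFreqByTerm, 40)); // prints [of, the]
    }
}
```

In the real class the frequencies come from an IndexReader rather than a hand-built map, and the resulting words are removed from the wrapped analyzer's token stream.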
Modifier and Type | Field and Description |
---|---|
static float | defaultMaxDocFreqPercent |
Constructor and Description |
---|
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate) Deprecated. Stopwords should be calculated at instantiation using one of the other constructors. |
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader) Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent. |
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, Collection<String> fields, float maxPercentDocs) Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs. |
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, Collection<String> fields, int maxDocFreq) Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq. |
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, float maxPercentDocs) Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs. |
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, int maxDocFreq) Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq. |
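The constructors accept the threshold either as an absolute document count (maxDocFreq) or as a fraction of the index (maxPercentDocs). The sketch below shows one plausible way the two forms relate. It assumes the fraction is multiplied by the number of documents in the index and truncated to an int; that conversion is an assumption, not something this page states. ThresholdSketch and toMaxDocFreq are hypothetical names, and the range check is a choice made for this sketch only.

```java
/** Hypothetical sketch relating the fractional and absolute thresholds. */
final class ThresholdSketch {

    /** Converts a fraction of the index (0.0 to 1.0) into an absolute document count. */
    static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        // Sketch-only guard; the documented range is "between 0.0 and 1.0".
        if (maxPercentDocs < 0.0f || maxPercentDocs > 1.0f) {
            throw new IllegalArgumentException("maxPercentDocs must be between 0.0 and 1.0");
        }
        // Assumed conversion: multiply, then truncate toward zero.
        return (int) (numDocs * maxPercentDocs);
    }

    public static void main(String[] args) {
        // Half of a 38 million document index.
        System.out.println(toMaxDocFreq(38_000_000, 0.5f)); // prints 19000000
    }
}
```

Under this assumption, passing a float is simply a convenience for callers who do not know the index size in advance; the int overloads give exact control over the cutoff.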
Modifier and Type | Method and Description |
---|---|
int | addStopWords(org.apache.lucene.index.IndexReader reader) Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader). |
int | addStopWords(org.apache.lucene.index.IndexReader reader, float maxPercentDocs) Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, float). |
int | addStopWords(org.apache.lucene.index.IndexReader reader, int maxDocFreq) Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, int). |
int | addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, float maxPercentDocs) Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, float). |
int | addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, int maxDocFreq) Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, int). |
org.apache.lucene.index.Term[] | getStopWords() Provides information on which stop words have been identified for all fields. |
String[] | getStopWords(String fieldName) Provides information on which stop words have been identified for a field. |
org.apache.lucene.analysis.TokenStream | reusableTokenStream(String fieldName, Reader reader) |
org.apache.lucene.analysis.TokenStream | tokenStream(String fieldName, Reader reader) |
public static final float defaultMaxDocFreqPercent
@Deprecated public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate)
delegate - The choice of Analyzer that is used to produce the token stream which needs filtering
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent.
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
IOException - Can be thrown while reading from the IndexReader
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, int maxDocFreq) throws IOException
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxDocFreq - Document frequency that terms must exceed in order to be stopwords
IOException - Can be thrown while reading from the IndexReader
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, float maxPercentDocs) throws IOException
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxPercentDocs - The maximum percentage of index documents, expressed as a fraction between 0.0 and 1.0, that may contain a term before it is considered to be a stop word
IOException - Can be thrown while reading from the IndexReader
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, Collection<String> fields, float maxPercentDocs) throws IOException
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxPercentDocs - The maximum percentage of index documents, expressed as a fraction between 0.0 and 1.0, that may contain a term before it is considered to be a stop word
IOException - Can be thrown while reading from the IndexReader
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, Collection<String> fields, int maxDocFreq) throws IOException
matchVersion - Version to be used in StopFilter
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxDocFreq - Document frequency that terms must exceed in order to be stopwords
IOException - Can be thrown while reading from the IndexReader
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader).
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
IOException
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader, int maxDocFreq) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, int).
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word
IOException
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader, float maxPercentDocs) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, float).
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxPercentDocs - The maximum percentage of index documents, expressed as a fraction between 0.0 and 1.0, that may contain a term before it is considered to be a stop word
IOException
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, float maxPercentDocs) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, float).
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName - The field for which stopwords will be added
maxPercentDocs - The maximum percentage of index documents, expressed as a fraction between 0.0 and 1.0, that may contain a term before it is considered to be a stop word
IOException
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, int maxDocFreq) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, int).
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName - The field for which stopwords will be added
maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word
IOException
public org.apache.lucene.analysis.TokenStream tokenStream(String fieldName, Reader reader)
tokenStream in class org.apache.lucene.analysis.Analyzer
public org.apache.lucene.analysis.TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException
reusableTokenStream in class org.apache.lucene.analysis.Analyzer
IOException
public String[] getStopWords(String fieldName)
fieldName - The field for which stop words identified in "addStopWords" method calls will be returned
public org.apache.lucene.index.Term[] getStopWords()
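The two getStopWords variants report the identified stop words either for a single field or for all fields at once. As a rough illustration of that bookkeeping, the sketch below keeps one set of stop words per field and offers both views. It does not use Lucene: PerFieldStopWordsSketch, add, forField, and all are hypothetical names, the "field:text" strings stand in for Term objects, and returning an empty array for an unknown field is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical, Lucene-free sketch of per-field stop word bookkeeping. */
final class PerFieldStopWordsSketch {

    // Field name -> stop words identified for that field (sorted for stable output).
    private final Map<String, Set<String>> stopWordsPerField = new TreeMap<>();

    /** Records a stop word for the given field. */
    void add(String fieldName, String term) {
        stopWordsPerField.computeIfAbsent(fieldName, f -> new TreeSet<>()).add(term);
    }

    /** Stop words for one field; an empty array if none were identified. */
    String[] forField(String fieldName) {
        Set<String> words = stopWordsPerField.get(fieldName);
        return words == null ? new String[0] : words.toArray(new String[0]);
    }

    /** Stop words for all fields, each rendered as "field:text". */
    List<String> all() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : stopWordsPerField.entrySet()) {
            for (String word : entry.getValue()) {
                result.add(entry.getKey() + ":" + word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PerFieldStopWordsSketch sketch = new PerFieldStopWordsSketch();
        sketch.add("title", "the");
        sketch.add("body", "the");
        sketch.add("body", "of");

        System.out.println(String.join(",", sketch.forField("body"))); // prints of,the
        System.out.println(sketch.all()); // prints [body:of, body:the, title:the]
    }
}
```

Keeping the sets per field matters because a word that is very common in one field, such as a body field, may be rare and meaningful in another, such as a title field.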