java.lang.Object
  org.apache.lucene.analysis.Analyzer
    org.apache.lucene.analysis.query.QueryAutoStopWordAnalyzer
public final class QueryAutoStopWordAnalyzer
An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.
For very large indexes, the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38-million-document index in which one term appeared in around 50% of documents, causing TermQueries for that term to take 2 seconds.
Use the various "addStopWords" methods in this class to automate the identification and addition of stop words found in an already existing index.
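The selection rule described above can be illustrated without Lucene at all. The following is a minimal sketch, assuming the document frequencies have already been collected into a plain map; the class name `DocFreqStopWords` and the method `findStopWords` are hypothetical and are not part of the Lucene API. It only shows the thresholding idea: a term becomes a stop word once the number of documents containing it exceeds a limit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not the Lucene implementation.
public class DocFreqStopWords {

    /**
     * Returns every term whose document frequency is strictly greater
     * than maxDocFreq. Terms at or below the limit are kept as normal terms.
     */
    public static Set<String> findStopWords(Map<String, Integer> docFreqs, int maxDocFreq) {
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        // Made-up frequencies for a 1000-document index.
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("the", 950);
        docFreqs.put("index", 400);
        docFreqs.put("lucene", 40);

        // With a limit of 400 documents, only "the" exceeds the threshold.
        System.out.println(findStopWords(docFreqs, 400));
    }
}
```

In the real class the frequencies come from the IndexReader passed to the "addStopWords" methods; the map here is only a stand-in for that lookup.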
Field Summary | |
---|---|
static float | defaultMaxDocFreqPercent |
Constructor Summary | |
---|---|
QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate) | Initializes this analyzer with the Analyzer object that actually produces the tokens |
Method Summary | |
---|---|
int | addStopWords(IndexReader reader): Automatically adds stop words for all fields with terms exceeding the defaultMaxDocFreqPercent |
int | addStopWords(IndexReader reader, float maxPercentDocs): Automatically adds stop words for all fields with terms exceeding the maxPercentDocs |
int | addStopWords(IndexReader reader, int maxDocFreq): Automatically adds stop words for all fields with terms exceeding the maxDocFreq |
int | addStopWords(IndexReader reader, String fieldName, float maxPercentDocs): Automatically adds stop words for the given field with terms exceeding the maxPercentDocs |
int | addStopWords(IndexReader reader, String fieldName, int maxDocFreq): Automatically adds stop words for the given field with terms exceeding the maxDocFreq |
Term[] | getStopWords(): Provides information on which stop words have been identified for all fields |
String[] | getStopWords(String fieldName): Provides information on which stop words have been identified for a field |
TokenStream | reusableTokenStream(String fieldName, Reader reader): Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method. |
TokenStream | tokenStream(String fieldName, Reader reader): Creates a TokenStream which tokenizes all the text in the provided Reader. |
Methods inherited from class org.apache.lucene.analysis.Analyzer |
---|
close, getOffsetGap, getPositionIncrementGap, getPreviousTokenStream, setPreviousTokenStream |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final float defaultMaxDocFreqPercent
Constructor Detail |
---|
public QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate)
Parameters:
delegate - The choice of Analyzer that is used to produce the token stream which needs filtering
Method Detail |
---|
public int addStopWords(IndexReader reader) throws IOException
Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
Throws:
IOException
public int addStopWords(IndexReader reader, int maxDocFreq) throws IOException
Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word
Throws:
IOException
public int addStopWords(IndexReader reader, float maxPercentDocs) throws IOException
Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
Throws:
IOException
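Because maxPercentDocs is expressed as a fraction between 0.0 and 1.0, it has to be turned into an absolute document count before it can be compared with a term's document frequency. The sketch below shows one plausible way to do that arithmetic; the class `PercentThreshold` and the method `toMaxDocFreq` are hypothetical names, and the truncating cast is an assumption rather than a documented guarantee of this class.

```java
// Hypothetical illustration of converting a fractional threshold
// into an absolute document count; not part of the Lucene API.
public class PercentThreshold {

    /**
     * Multiplies the number of documents in the index by the fraction
     * and truncates toward zero to obtain a whole document count.
     */
    public static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        return (int) (numDocs * maxPercentDocs);
    }

    public static void main(String[] args) {
        // The 38 million document index from the class description,
        // with a term allowed in at most half of the documents.
        int maxDocFreq = toMaxDocFreq(38_000_000, 0.5f);
        System.out.println(maxDocFreq); // prints 19000000
    }
}
```

Under this reading, the float overloads behave like the int overloads once the fraction has been scaled by the size of the index.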
public int addStopWords(IndexReader reader, String fieldName, float maxPercentDocs) throws IOException
Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName - The field for which stopwords will be added
maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
Throws:
IOException
public int addStopWords(IndexReader reader, String fieldName, int maxDocFreq) throws IOException
Parameters:
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName - The field for which stopwords will be added
maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word.
Throws:
IOException
public TokenStream tokenStream(String fieldName, Reader reader)
Specified by:
tokenStream in class Analyzer
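The wrapping behaviour can be pictured as a per-field filter applied to whatever the delegate analyzer produces. The following is a simplified sketch in plain Java, assuming tokens are strings and the delegate simply lower-cases and splits on whitespace; `FilteringSketch`, `delegateTokens`, and `tokens` are hypothetical names, and real Lucene code works with TokenStream objects rather than lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of wrapping a delegate and filtering
// per-field stop words; not the Lucene implementation.
public class FilteringSketch {

    // Stop words are tracked separately for each field name.
    private final Map<String, Set<String>> stopWordsPerField = new HashMap<>();

    public void addStopWords(String fieldName, Collection<String> words) {
        stopWordsPerField.computeIfAbsent(fieldName, k -> new HashSet<>()).addAll(words);
    }

    /** Stand-in for the delegate analyzer: lower-cases and splits on whitespace. */
    static List<String> delegateTokens(String text) {
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(normalized.split("\\s+"));
    }

    /** Runs the delegate, then drops tokens registered as stop words for the field. */
    public List<String> tokens(String fieldName, String text) {
        Set<String> stops = stopWordsPerField.getOrDefault(fieldName, Collections.emptySet());
        List<String> kept = new ArrayList<>();
        for (String token : delegateTokens(text)) {
            if (!stops.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        FilteringSketch analyzer = new FilteringSketch();
        analyzer.addStopWords("body", Arrays.asList("the"));

        System.out.println(analyzer.tokens("body", "The quick fox"));  // prints [quick, fox]
        System.out.println(analyzer.tokens("title", "The quick fox")); // prints [the, quick, fox]
    }
}
```

Fields with no registered stop words pass through unchanged, which mirrors the idea that stop words are identified field by field.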
public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException
Overrides:
reusableTokenStream in class Analyzer
Throws:
IOException
public String[] getStopWords(String fieldName)
Parameters:
fieldName - The field for which stop words identified in "addStopWords" method calls will be returned
public Term[] getStopWords()