org.apache.lucene.analysis.standard
Class StandardAnalyzer
java.lang.Object
org.apache.lucene.analysis.Analyzer
org.apache.lucene.analysis.util.StopwordAnalyzerBase
org.apache.lucene.analysis.standard.StandardAnalyzer
- All Implemented Interfaces:
- Closeable
public final class StandardAnalyzer
- extends StopwordAnalyzerBase
Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
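The filter chain described above can be pictured with a small plain-Java sketch. It does not use any Lucene classes: the regex split is only a crude stand-in for StandardTokenizer (which follows Unicode text segmentation rules), the stop list is a shortened, hypothetical subset of the real one, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; none of these names are Lucene classes.
class AnalysisChainSketch {
    // Hypothetical, shortened stop list; the real STOP_WORDS_SET is larger.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        // Assumption: split on runs of non-letter, non-digit characters
        // as a rough approximation of the tokenizer step.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty first element if text starts with a delimiter
            }
            String lower = raw.toLowerCase(Locale.ROOT); // role of LowerCaseFilter
            if (!STOP_WORDS.contains(lower)) {           // role of StopFilter
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick Brown Fox and the Lazy Dog"));
    }
}
```

Lowercasing runs before the stop-word check, so "The" and "the" are both removed by a lowercase stop list.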
You must specify the required Version
compatibility when creating StandardAnalyzer:
- As of 3.4, Hiragana and Han characters are no longer wrongly split
from their combining characters. If you pass an earlier version number,
you get the old, broken behavior for backwards compatibility.
- As of 3.1, StandardTokenizer implements Unicode text segmentation,
and StopFilter correctly handles Unicode 4.0 supplementary characters
in stopwords. ClassicTokenizer and ClassicAnalyzer are the pre-3.1
implementations of StandardTokenizer and StandardAnalyzer.
- As of 2.9, StopFilter preserves position increments
- As of 2.4, tokens incorrectly identified as acronyms
are corrected (see LUCENE-1068)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DEFAULT_MAX_TOKEN_LENGTH
public static final int DEFAULT_MAX_TOKEN_LENGTH
- Default maximum allowed token length
- See Also:
- Constant Field Values
STOP_WORDS_SET
public static final CharArraySet STOP_WORDS_SET
- An unmodifiable set containing some common English words that are usually not
useful for searching.
StandardAnalyzer
public StandardAnalyzer(Version matchVersion,
CharArraySet stopWords)
- Builds an analyzer with the given stop words.
- Parameters:
matchVersion
- Lucene version to match; see above
stopWords
- stop words
StandardAnalyzer
public StandardAnalyzer(Version matchVersion)
- Builds an analyzer with the default stop words (STOP_WORDS_SET).
- Parameters:
matchVersion
- Lucene version to match; see above
StandardAnalyzer
public StandardAnalyzer(Version matchVersion,
Reader stopwords)
throws IOException
- Builds an analyzer with the stop words from the given reader.
- Parameters:
matchVersion
- Lucene version to match; see above
stopwords
- Reader to read stop words from
- Throws:
IOException
- See Also:
WordlistLoader.getWordSet(Reader, Version)
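As a rough picture of what reading stop words from a Reader involves, here is a plain-Java sketch. It assumes a one-word-per-line input with surrounding whitespace trimmed; skipping blank lines is a choice made for this sketch, not something stated above. The class and method names are hypothetical and no Lucene code is used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Plain-Java illustration only; not the WordlistLoader implementation.
class StopWordReaderSketch {
    // Assumption: one stop word per line; surrounding whitespace is trimmed.
    static Set<String> readStopWords(Reader reader) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) { // sketch-specific choice: ignore blank lines
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // prints [the, and, of]
        System.out.println(readStopWords(new StringReader("the\n  and \n\nof\n")));
    }
}
```

Like the constructor above, the helper propagates IOException to the caller rather than swallowing it.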
setMaxTokenLength
public void setMaxTokenLength(int length)
- Sets the maximum allowed token length. A token that exceeds
this length is discarded. The setting only takes effect the
next time tokenStream is called.
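The discard rule can be illustrated with a short plain-Java sketch. It operates on an already-tokenized list, whereas the real limit is applied during tokenization; the names and the limit of 6 are hypothetical and chosen only for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; not how StandardTokenizer is implemented.
class MaxTokenLengthSketch {
    // Keeps tokens whose length is at most maxTokenLength; longer tokens are discarded.
    static List<String> dropLongTokens(List<String> tokens, int maxTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lucene", "is", "fast", "supercalifragilistic");
        // prints [lucene, is, fast]
        System.out.println(dropLongTokens(tokens, 6));
    }
}
```

A token exactly at the limit ("lucene", six characters) is kept; only tokens longer than the limit are dropped.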
getMaxTokenLength
public int getMaxTokenLength()
- See Also:
setMaxTokenLength(int)
createComponents
protected Analyzer.TokenStreamComponents createComponents(String fieldName,
Reader reader)
- Specified by:
createComponents in class Analyzer
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.