Class Analyzer
- java.lang.Object
  - org.apache.lucene.analysis.Analyzer

All Implemented Interfaces:
- Closeable, AutoCloseable

Direct Known Subclasses:
- AnalyzerWrapper, StopwordAnalyzerBase
public abstract class Analyzer extends Object implements Closeable

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

In order to define what analysis is done, subclasses must define their Analyzer.TokenStreamComponents in createComponents(String). The components are then reused in each call to tokenStream(String, Reader).

Simple example (FooTokenizer, FooFilter and BarFilter are placeholder analysis classes):

```java
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new FooTokenizer();
    TokenStream filter = new FooFilter(source);
    filter = new BarFilter(filter);
    return new TokenStreamComponents(source, filter);
  }

  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    // Assuming FooFilter is about normalization and BarFilter is about
    // stemming, only FooFilter should be applied
    return new FooFilter(in);
  }
};
```

For more examples, see the Analysis package documentation.

For some concrete implementations bundled with Lucene, look in the analysis modules (a quick usage sketch follows this list):

- Common: Analyzers for indexing content in different languages and domains.
- ICU: Exposes functionality from ICU to Apache Lucene.
- Kuromoji: Morphological analyzer for Japanese text.
- Morfologik: Dictionary-driven lemmatization for the Polish language.
- Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
- Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
- Stempel: Algorithmic Stemmer for the Polish Language.
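
As a quick sketch of using one of these bundled implementations directly, with no subclassing (assuming lucene-analyzers-common is on the classpath; EnglishAnalyzer is one choice from the Common module):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

public class BundledAnalyzerExample {
  public static void main(String[] args) {
    // EnglishAnalyzer ships in the Common analysis module and bundles
    // a tokenizer, stop-word removal and stemming for English text.
    try (Analyzer analyzer = new EnglishAnalyzer()) {
      System.out.println(analyzer.getReuseStrategy());
    }
  }
}
```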
Since:
- 3.1
Nested Class Summary

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static class | Analyzer.ReuseStrategy | Strategy defining how TokenStreamComponents are reused per call to tokenStream(String, java.io.Reader). |
| static class | Analyzer.TokenStreamComponents | This class encapsulates the outer components of a token stream. |

Field Summary

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static Analyzer.ReuseStrategy | GLOBAL_REUSE_STRATEGY | A predefined Analyzer.ReuseStrategy that reuses the same components for every field. |
| static Analyzer.ReuseStrategy | PER_FIELD_REUSE_STRATEGY | A predefined Analyzer.ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name. |

Constructor Summary

| Constructor | Description |
| --- | --- |
| Analyzer() | Create a new Analyzer, reusing the same set of components per-thread across calls to tokenStream(String, Reader). |
| Analyzer(Analyzer.ReuseStrategy reuseStrategy) | Expert: create a new Analyzer with a custom Analyzer.ReuseStrategy. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| protected AttributeFactory | attributeFactory(String fieldName) | Return the AttributeFactory to be used for analysis and normalization on the given field name. |
| void | close() | Frees persistent resources used by this Analyzer. |
| protected abstract Analyzer.TokenStreamComponents | createComponents(String fieldName) | Creates a new Analyzer.TokenStreamComponents instance for this analyzer. |
| int | getOffsetGap(String fieldName) | Just like getPositionIncrementGap(java.lang.String), except for Token offsets instead. |
| int | getPositionIncrementGap(String fieldName) | Invoked before indexing an IndexableField instance if terms have already been added to that field. |
| Analyzer.ReuseStrategy | getReuseStrategy() | Returns the used Analyzer.ReuseStrategy. |
| Version | getVersion() | Return the version of Lucene this analyzer will mimic the behavior of for analysis. |
| protected Reader | initReader(String fieldName, Reader reader) | Override this if you want to add a CharFilter chain. |
| protected Reader | initReaderForNormalization(String fieldName, Reader reader) | Wrap the given Reader with CharFilters that make sense for normalization. |
| BytesRef | normalize(String fieldName, String text) | Normalize a string down to the representation that it would have in the index. |
| protected TokenStream | normalize(String fieldName, TokenStream in) | Wrap the given TokenStream in order to apply normalization filters. |
| void | setVersion(Version v) | Set the version of Lucene this analyzer should mimic the behavior of for analysis. |
| TokenStream | tokenStream(String fieldName, Reader reader) | Returns a TokenStream suitable for fieldName, tokenizing the contents of reader. |
| TokenStream | tokenStream(String fieldName, String text) | Returns a TokenStream suitable for fieldName, tokenizing the contents of text. |
Field Detail

GLOBAL_REUSE_STRATEGY

public static final Analyzer.ReuseStrategy GLOBAL_REUSE_STRATEGY

A predefined Analyzer.ReuseStrategy that reuses the same components for every field.
PER_FIELD_REUSE_STRATEGY

public static final Analyzer.ReuseStrategy PER_FIELD_REUSE_STRATEGY

A predefined Analyzer.ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name.

Constructor Detail

Analyzer

public Analyzer()

Create a new Analyzer, reusing the same set of components per-thread across calls to tokenStream(String, Reader).
Analyzer

public Analyzer(Analyzer.ReuseStrategy reuseStrategy)

Expert: create a new Analyzer with a custom Analyzer.ReuseStrategy.

NOTE: if you just want to reuse on a per-field basis, it's easier to use a subclass of AnalyzerWrapper such as PerFieldAnalyzerWrapper instead.
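A minimal sketch of this expert constructor with the predefined PER_FIELD_REUSE_STRATEGY (WhitespaceTokenizer is from the Common module; the override body is illustrative):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class PerFieldReuseExample {
  static Analyzer newPerFieldAnalyzer() {
    return new Analyzer(Analyzer.PER_FIELD_REUSE_STRATEGY) {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // With PER_FIELD_REUSE_STRATEGY, one set of components is cached
        // per field name rather than one shared set for all fields.
        Tokenizer source = new WhitespaceTokenizer();
        return new TokenStreamComponents(source);
      }
    };
  }
}
```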
Method Detail

createComponents

protected abstract Analyzer.TokenStreamComponents createComponents(String fieldName)

Creates a new Analyzer.TokenStreamComponents instance for this analyzer.

Parameters:
- fieldName - the name of the field whose content is passed to the Analyzer.TokenStreamComponents sink as a reader

Returns:
- the Analyzer.TokenStreamComponents for this analyzer
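
As a concrete sketch of an override, chaining a StandardTokenizer into a LowerCaseFilter (import paths vary somewhat across Lucene versions; in recent versions LowerCaseFilter lives in org.apache.lucene.analysis rather than the core subpackage):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class LowercasingAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // The Tokenizer is the source of the chain; each TokenFilter wraps
    // the previous stream, and the outermost filter becomes the sink.
    Tokenizer source = new StandardTokenizer();
    TokenStream sink = new LowerCaseFilter(source);
    return new TokenStreamComponents(source, sink);
  }
}
```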
 
normalize

protected TokenStream normalize(String fieldName, TokenStream in)

Wrap the given TokenStream in order to apply normalization filters. The default implementation returns the TokenStream as-is. This is used by normalize(String, String).
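
Continuing the sketch above, an override that applies only the case-folding filter on the normalization path:

```java
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
  // Only normalizing filters belong here; tokenization, stop words
  // and stemming are deliberately not applied.
  return new LowerCaseFilter(in);
}
```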
tokenStream

public final TokenStream tokenStream(String fieldName, Reader reader)

Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.

This method uses createComponents(String) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

NOTE: If your data is available as a String, use tokenStream(String, String), which reuses a StringReader-like instance internally.

Parameters:
- fieldName - the name of the field the created TokenStream is used for
- reader - the reader the stream's source reads from

Returns:
- TokenStream for iterating the analyzed content of reader

Throws:
- AlreadyClosedException - if the Analyzer is closed

See Also:
- tokenStream(String, String)
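
A sketch of that consume workflow: reset, then incrementToken in a loop, then end and close, per the TokenStream contract (the field name "body" and the sample text are arbitrary):

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConsumeExample {
  static void printTokens(Analyzer analyzer) throws IOException {
    try (TokenStream ts =
        analyzer.tokenStream("body", new StringReader("Some Sample Text"))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                  // mandatory before the first incrementToken()
      while (ts.incrementToken()) {
        System.out.println(term);  // one analyzed term per iteration
      }
      ts.end();                    // records end-of-stream state (final offset)
    }                              // try-with-resources calls close()
  }
}
```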
 
tokenStream

public final TokenStream tokenStream(String fieldName, String text)

Returns a TokenStream suitable for fieldName, tokenizing the contents of text.

This method uses createComponents(String) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

Parameters:
- fieldName - the name of the field the created TokenStream is used for
- text - the String the stream's source reads from

Returns:
- TokenStream for iterating the analyzed content of text

Throws:
- AlreadyClosedException - if the Analyzer is closed

See Also:
- tokenStream(String, Reader)
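
Usage mirrors the Reader overload, minus the Reader construction; a fragment (inside a method that declares IOException; the field name "title" is arbitrary):

```java
try (TokenStream ts = analyzer.tokenStream("title", "Quick Brown Fox")) {
  ts.reset();
  while (ts.incrementToken()) {
    // inspect attributes here, e.g. CharTermAttribute
  }
  ts.end();
}
```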
 
normalize

public final BytesRef normalize(String fieldName, String text)

Normalize a string down to the representation that it would have in the index.

This is typically used by query parsers in order to generate a query on a given term, without tokenizing or stemming, which are undesirable if the string to analyze is a partial word (e.g. in case of a wildcard or fuzzy query). This method uses initReaderForNormalization(String, Reader) in order to apply necessary character-level normalization and then normalize(String, TokenStream) in order to apply the normalizing token filters.
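
For example, a query parser expanding a hypothetical wildcard query "Wi*" might normalize only the literal prefix; a sketch, with the field name "title" arbitrary:

```java
import org.apache.lucene.util.BytesRef;

// "Wi" is the literal prefix of the wildcard query "Wi*". Tokenizing or
// stemming a partial word would be wrong, so only normalization
// (e.g. lowercasing) is applied.
BytesRef prefix = analyzer.normalize("title", "Wi");
```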
initReader

protected Reader initReader(String fieldName, Reader reader)

Override this if you want to add a CharFilter chain.

The default implementation returns reader unchanged.

Parameters:
- fieldName - IndexableField name being indexed
- reader - original Reader

Returns:
- reader, optionally decorated with CharFilter(s)
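
A sketch of an override that strips HTML markup before tokenization, using HTMLStripCharFilter from org.apache.lucene.analysis.charfilter in the Common module:

```java
@Override
protected Reader initReader(String fieldName, Reader reader) {
  // CharFilters decorate the Reader and run before the Tokenizer sees
  // any characters; here, HTML tags are removed from the input.
  return new HTMLStripCharFilter(reader);
}
```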
 
initReaderForNormalization

protected Reader initReaderForNormalization(String fieldName, Reader reader)

Wrap the given Reader with CharFilters that make sense for normalization. This is typically a subset of the CharFilters that are applied in initReader(String, Reader). This is used by normalize(String, String).
attributeFactory

protected AttributeFactory attributeFactory(String fieldName)

Return the AttributeFactory to be used for analysis and normalization on the given field name. The default implementation returns TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY.
getPositionIncrementGap

public int getPositionIncrementGap(String fieldName)

Invoked before indexing an IndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IndexableField instances using the same field name. The default position increment gap is 0. With a 0 position increment gap and the typical default token position increment of 1, all terms in a field, including across IndexableField instances, are in successive positions, allowing exact PhraseQuery matches, for instance, across IndexableField instance boundaries.

Parameters:
- fieldName - IndexableField name being indexed

Returns:
- position increment gap, added to the next token emitted from tokenStream(String, Reader). This value must be >= 0.
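
For instance, an override that keeps phrase matches from crossing value boundaries in a multi-valued field (the gap of 100 is an arbitrary illustrative choice):

```java
@Override
public int getPositionIncrementGap(String fieldName) {
  // Any gap larger than the query's slop prevents PhraseQuery from
  // matching across two IndexableField instances of the same field.
  return 100;
}
```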
 
getOffsetGap

public int getOffsetGap(String fieldName)

Just like getPositionIncrementGap(java.lang.String), except for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing.

Parameters:
- fieldName - the field just indexed

Returns:
- offset gap, added to the next token emitted from tokenStream(String, Reader). This value must be >= 0.
 
getReuseStrategy

public final Analyzer.ReuseStrategy getReuseStrategy()

Returns the used Analyzer.ReuseStrategy.
setVersion

public void setVersion(Version v)

Set the version of Lucene this analyzer should mimic the behavior of for analysis.
getVersion

public Version getVersion()

Return the version of Lucene this analyzer will mimic the behavior of for analysis.
close

public void close()

Frees persistent resources used by this Analyzer.

Specified by:
- close in interface AutoCloseable
- close in interface Closeable
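
Because Analyzer implements Closeable, try-with-resources is the idiomatic way to ensure these resources are freed; a sketch, where MyAnalyzer stands in for any concrete subclass:

```java
try (Analyzer analyzer = new MyAnalyzer()) { // hypothetical concrete subclass
  // ... build token streams, index documents, etc. ...
} // close() runs automatically here
```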
 
 