Class Analyzer

java.lang.Object
org.apache.lucene.analysis.Analyzer
All Implemented Interfaces:
Closeable, AutoCloseable
Direct Known Subclasses:
AnalyzerWrapper, StopwordAnalyzerBase

public abstract class Analyzer extends Object implements Closeable
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

In order to define what analysis is done, subclasses must define their TokenStreamComponents in createComponents(String). The components are then reused in each call to tokenStream(String, Reader).

Simple example:

 Analyzer analyzer = new Analyzer() {
   @Override
   protected TokenStreamComponents createComponents(String fieldName) {
     Tokenizer source = new FooTokenizer();
     TokenStream filter = new FooFilter(source);
     filter = new BarFilter(filter);
     return new TokenStreamComponents(source, filter);
   }
   @Override
   protected TokenStream normalize(String fieldName, TokenStream in) {
     // Assuming FooFilter is about normalization and BarFilter is about
     // stemming, only FooFilter should be applied
     return new FooFilter(in);
   }
 };
 
For more examples, see the Analysis package documentation.

For some concrete implementations bundled with Lucene, look in the analysis modules:

  • Common: Analyzers for indexing content in different languages and domains.
  • ICU: Exposes functionality from ICU to Apache Lucene.
  • Kuromoji: Morphological analyzer for Japanese text.
  • Morfologik: Dictionary-driven lemmatization for the Polish language.
  • Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
  • Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
  • Stempel: Algorithmic Stemmer for the Polish Language.
Since:
3.1
  • Method Details

    • createComponents

      protected abstract Analyzer.TokenStreamComponents createComponents(String fieldName)
      Creates a new Analyzer.TokenStreamComponents instance for this analyzer.
      Parameters:
fieldName - the name of the field whose content is passed to the Analyzer.TokenStreamComponents sink as a reader
      Returns:
      the Analyzer.TokenStreamComponents for this analyzer.
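      For illustration, a minimal concrete override might look like the sketch below. It assumes a recent Lucene release in which StandardTokenizer and LowerCaseFilter are available as shown; note that no Reader is passed to the Tokenizer, because the framework supplies one later through Analyzer.TokenStreamComponents.setReader(Reader).

       import org.apache.lucene.analysis.Analyzer;
       import org.apache.lucene.analysis.LowerCaseFilter;
       import org.apache.lucene.analysis.TokenStream;
       import org.apache.lucene.analysis.Tokenizer;
       import org.apache.lucene.analysis.standard.StandardTokenizer;

       Analyzer analyzer = new Analyzer() {
         @Override
         protected TokenStreamComponents createComponents(String fieldName) {
           // Tokenize on Unicode word boundaries, then lowercase each token.
           Tokenizer source = new StandardTokenizer();
           TokenStream result = new LowerCaseFilter(source);
           return new TokenStreamComponents(source, result);
         }
       };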
    • normalize

      protected TokenStream normalize(String fieldName, TokenStream in)
      Wrap the given TokenStream in order to apply normalization filters. The default implementation returns the TokenStream as-is. This is used by normalize(String, String).
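      A sketch of such an override inside an Analyzer subclass, assuming the analysis chain uses Lucene's LowerCaseFilter; only filters that are safe for partial words belong here:

       @Override
       protected TokenStream normalize(String fieldName, TokenStream in) {
         // Case folding is safe for partial terms; stemming is deliberately skipped.
         return new LowerCaseFilter(in);
       }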
    • tokenStream

      public final TokenStream tokenStream(String fieldName, Reader reader)
      Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.

      This method uses createComponents(String) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

      NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

      NOTE: If your data is available as a String, use tokenStream(String, String) which reuses a StringReader-like instance internally.

      Parameters:
      fieldName - the name of the field the created TokenStream is used for
      reader - the reader the streams source reads from
      Returns:
      TokenStream for iterating the analyzed content of reader
      Throws:
      AlreadyClosedException - if the Analyzer is closed.
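      A sketch of that workflow, assuming an Analyzer instance named analyzer and an illustrative field name "body" (StringReader is java.io.StringReader; CharTermAttribute lives in org.apache.lucene.analysis.tokenattributes):

       try (TokenStream stream = analyzer.tokenStream("body", new StringReader("some content"))) {
         CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
         stream.reset();                    // must be called before incrementToken()
         while (stream.incrementToken()) {
           System.out.println(term.toString());
         }
         stream.end();                      // finalizes state for the last token
       }                                    // close() is handled by try-with-resources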
    • tokenStream

      public final TokenStream tokenStream(String fieldName, String text)
      Returns a TokenStream suitable for fieldName, tokenizing the contents of text.

      This method uses createComponents(String) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

      NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

      Parameters:
      fieldName - the name of the field the created TokenStream is used for
      text - the String the streams source reads from
      Returns:
TokenStream for iterating the analyzed content of text
      Throws:
      AlreadyClosedException - if the Analyzer is closed.
    • normalize

      public final BytesRef normalize(String fieldName, String text)
      Normalize a string down to the representation that it would have in the index.

This is typically used by query parsers in order to generate a query on a given term, without tokenizing or stemming, which are undesirable if the string to analyze is a partial word (e.g., for a wildcard or fuzzy query).

      This method uses initReaderForNormalization(String, Reader) in order to apply necessary character-level normalization and then normalize(String, TokenStream) in order to apply the normalizing token filters.
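      For example, a query parser preparing a case-insensitive wildcard query might normalize the user-supplied term as below; the field name "title" is illustrative, and the result shown assumes the normalization chain applies case folding:

       BytesRef normalized = analyzer.normalize("title", "Wi-Fi");
       String term = normalized.utf8ToString();   // "wi-fi" under a lowercasing chain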

    • initReader

      protected Reader initReader(String fieldName, Reader reader)
      Override this if you want to add a CharFilter chain.

      The default implementation returns reader unchanged.

      Parameters:
      fieldName - IndexableField name being indexed
      reader - original Reader
      Returns:
      reader, optionally decorated with CharFilter(s)
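      As a sketch, an override that strips HTML markup before tokenization (HTMLStripCharFilter ships in the analysis-common module, package org.apache.lucene.analysis.charfilter):

       @Override
       protected Reader initReader(String fieldName, Reader reader) {
         // Decorate the incoming Reader; the tokenizer then sees markup-free text.
         return new HTMLStripCharFilter(reader);
       }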
    • initReaderForNormalization

      protected Reader initReaderForNormalization(String fieldName, Reader reader)
      Wrap the given Reader with CharFilters that make sense for normalization. This is typically a subset of the CharFilters that are applied in initReader(String, Reader). This is used by normalize(String, String).
    • attributeFactory

      protected AttributeFactory attributeFactory(String fieldName)
Return the AttributeFactory to be used for analysis and normalization on the given field name. The default implementation returns TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY.
    • getPositionIncrementGap

      public int getPositionIncrementGap(String fieldName)
Invoked before indexing an IndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IndexableField instances using the same field name. The default position increment gap is 0. With a 0 position increment gap and the typical default token position increment of 1, all terms in a field occupy successive positions, even across IndexableField instances, allowing, for instance, exact PhraseQuery matches across IndexableField instance boundaries.
      Parameters:
      fieldName - IndexableField name being indexed.
      Returns:
      position increment gap, added to the next token emitted from tokenStream(String,Reader). This value must be >= 0.
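      For example, a subclass can return a large gap so that a PhraseQuery never matches across the boundary between two IndexableField instances sharing a field name; the value 100 is illustrative:

       @Override
       public int getPositionIncrementGap(String fieldName) {
         return 100;   // larger than any realistic phrase slop
       }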
    • getOffsetGap

      public int getOffsetGap(String fieldName)
Just like getPositionIncrementGap(java.lang.String), except it applies to token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing.
      Parameters:
      fieldName - the field just indexed
      Returns:
      offset gap, added to the next token emitted from tokenStream(String,Reader). This value must be >= 0.
    • getReuseStrategy

      public final Analyzer.ReuseStrategy getReuseStrategy()
      Returns the used Analyzer.ReuseStrategy.
    • close

      public void close()
Frees persistent resources used by this Analyzer.
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable