Class Analyzer

  • All Implemented Interfaces:
    Closeable, AutoCloseable
    Direct Known Subclasses:
    AnalyzerWrapper, StopwordAnalyzerBase

    public abstract class Analyzer
    extends Object
    implements Closeable
    An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

    To define what analysis is done, subclasses must define their TokenStreamComponents in createComponents(String). The components are then reused on each call to tokenStream(String, Reader).

    Simple example:

     Analyzer analyzer = new Analyzer() {
       @Override
       protected TokenStreamComponents createComponents(String fieldName) {
         // No Reader is in scope here; the input is set on the components
         // when tokenStream(...) is called.
         Tokenizer source = new FooTokenizer();
         TokenStream filter = new FooFilter(source);
         filter = new BarFilter(filter);
         return new TokenStreamComponents(source, filter);
       }
       @Override
       protected TokenStream normalize(TokenStream in) {
         // Assuming FooFilter is about normalization and BarFilter is about
         // stemming, only FooFilter should be applied
         return new FooFilter(in);
       }
     };
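
    The reuse described above (components built once in createComponents, then handed fresh input on every tokenStream call) can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: ReuseDemo, MiniAnalyzer, MiniComponents and the createCalls counter are hypothetical names invented for illustration and are not part of Lucene. Lucene's real caching is governed by the analyzer's reuse strategy and is more involved than this.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified, hypothetical model of component reuse. None of these are Lucene classes.
class ReuseDemo {

  /** Stand-in for TokenStreamComponents: built once, then re-pointed at new input. */
  static final class MiniComponents {
    private Reader input;

    void setReader(Reader reader) {
      this.input = reader;
    }

    /** Reads the current input fully, splits on whitespace, and lower-cases each token. */
    List<String> tokens() {
      StringBuilder text = new StringBuilder();
      char[] buf = new char[256];
      try {
        int n;
        while ((n = input.read(buf)) != -1) {
          text.append(buf, 0, n);
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      List<String> out = new ArrayList<>();
      for (String t : text.toString().trim().split("\\s+")) {
        if (!t.isEmpty()) {
          out.add(t.toLowerCase(Locale.ROOT));
        }
      }
      return out;
    }
  }

  /** Stand-in for Analyzer: subclasses supply the components, the base class reuses them. */
  abstract static class MiniAnalyzer {
    private final ThreadLocal<MiniComponents> cached = new ThreadLocal<>();
    int createCalls = 0; // demo-only counter (not thread-safe) to make reuse observable

    protected abstract MiniComponents createComponents(String fieldName);

    final MiniComponents tokenStream(String fieldName, Reader reader) {
      MiniComponents components = cached.get();
      if (components == null) {
        // First call on this thread: build the analysis chain.
        components = createComponents(fieldName);
        createCalls++;
        cached.set(components);
      }
      // Every call, including the first: only the input is swapped.
      components.setReader(reader);
      return components;
    }
  }

  public static void main(String[] args) {
    MiniAnalyzer analyzer = new MiniAnalyzer() {
      @Override
      protected MiniComponents createComponents(String fieldName) {
        return new MiniComponents();
      }
    };
    System.out.println(analyzer.tokenStream("title", new StringReader("Hello World")).tokens());
    System.out.println(analyzer.tokenStream("body", new StringReader("Reused Components")).tokens());
    System.out.println("createComponents calls: " + analyzer.createCalls);
  }
}
```

    In this model, two tokenStream calls on the same thread share one set of components, which is why createComponents receives no Reader: the input arrives later, when the components are pointed at it.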
     
    For more examples, see the Analysis package documentation.

    For some concrete implementations bundled with Lucene, look in the analysis modules:

    • Common: Analyzers for indexing content in different languages and domains.
    • ICU: Exposes functionality from ICU to Apache Lucene.
    • Kuromoji: Morphological analyzer for Japanese text.
    • Morfologik: Dictionary-driven lemmatization for the Polish language.
    • Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
    • Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
    • Stempel: Algorithmic stemmer for the Polish language.
    Since:
    3.1