org.apache.lucene.analysis
Class Analyzer

java.lang.Object
  extended by org.apache.lucene.analysis.Analyzer
All Implemented Interfaces:
Closeable
Direct Known Subclasses:
AnalyzerWrapper

public abstract class Analyzer
extends Object
implements Closeable

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

In order to define what analysis is done, subclasses must define their TokenStreamComponents in createComponents(String, Reader). The components are then reused in each call to tokenStream(String, Reader).

Simple example:

 Analyzer analyzer = new Analyzer() {
   @Override
   protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
     Tokenizer source = new FooTokenizer(reader);
     TokenStream filter = new FooFilter(source);
     filter = new BarFilter(filter);
     return new TokenStreamComponents(source, filter);
   }
 };
 
For more examples, see the Analysis package documentation.

For some concrete implementations bundled with Lucene, look in the analysis modules.


Nested Class Summary
static class Analyzer.GlobalReuseStrategy
          Deprecated. This implementation class will be hidden in Lucene 5.0. Use GLOBAL_REUSE_STRATEGY instead!
static class Analyzer.PerFieldReuseStrategy
          Deprecated. This implementation class will be hidden in Lucene 5.0. Use PER_FIELD_REUSE_STRATEGY instead!
static class Analyzer.ReuseStrategy
          Strategy defining how TokenStreamComponents are reused per call to tokenStream(String, java.io.Reader).
static class Analyzer.TokenStreamComponents
          This class encapsulates the outer components of a token stream.
 
Field Summary
static Analyzer.ReuseStrategy GLOBAL_REUSE_STRATEGY
          A predefined Analyzer.ReuseStrategy that reuses the same components for every field.
static Analyzer.ReuseStrategy PER_FIELD_REUSE_STRATEGY
          A predefined Analyzer.ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents keyed by field name.
 
Constructor Summary
Analyzer()
          Create a new Analyzer, reusing the same set of components per-thread across calls to tokenStream(String, Reader).
Analyzer(Analyzer.ReuseStrategy reuseStrategy)
          Expert: create a new Analyzer with a custom Analyzer.ReuseStrategy.
 
Method Summary
 void close()
          Frees persistent resources used by this Analyzer.
protected abstract  Analyzer.TokenStreamComponents createComponents(String fieldName, Reader reader)
          Creates a new Analyzer.TokenStreamComponents instance for this analyzer.
 int getOffsetGap(String fieldName)
          Like getPositionIncrementGap(java.lang.String), but for Token offsets.
 int getPositionIncrementGap(String fieldName)
          Invoked before indexing an IndexableField instance if terms have already been added to that field.
 Analyzer.ReuseStrategy getReuseStrategy()
          Returns the Analyzer.ReuseStrategy in use.
protected  Reader initReader(String fieldName, Reader reader)
          Override this if you want to add a CharFilter chain.
 TokenStream tokenStream(String fieldName, Reader reader)
          Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.
 TokenStream tokenStream(String fieldName, String text)
          Returns a TokenStream suitable for fieldName, tokenizing the contents of text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

GLOBAL_REUSE_STRATEGY

public static final Analyzer.ReuseStrategy GLOBAL_REUSE_STRATEGY
A predefined Analyzer.ReuseStrategy that reuses the same components for every field.


PER_FIELD_REUSE_STRATEGY

public static final Analyzer.ReuseStrategy PER_FIELD_REUSE_STRATEGY
A predefined Analyzer.ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents keyed by field name.
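
The difference between the two predefined strategies can be illustrated with a small plain-Java model. This is a sketch, not Lucene's implementation: Components, Reuse, GlobalReuse, PerFieldReuse and ReuseDemo are hypothetical stand-ins invented for illustration, and the per-thread storage mentioned in the constructor documentation is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Analyzer.TokenStreamComponents; not a Lucene class.
final class Components {
    final String createdFor;
    Components(String createdFor) { this.createdFor = createdFor; }
}

// Hypothetical stand-in for the get/set contract of a reuse strategy.
interface Reuse {
    Components get(String fieldName);
    void set(String fieldName, Components components);
}

// Models the idea of GLOBAL_REUSE_STRATEGY: one cached instance, whatever the field.
final class GlobalReuse implements Reuse {
    private Components cached;
    public Components get(String fieldName) { return cached; }
    public void set(String fieldName, Components components) { cached = components; }
}

// Models the idea of PER_FIELD_REUSE_STRATEGY: one cached instance per field name.
final class PerFieldReuse implements Reuse {
    private final Map<String, Components> cache = new HashMap<>();
    public Components get(String fieldName) { return cache.get(fieldName); }
    public void set(String fieldName, Components components) { cache.put(fieldName, components); }
}

final class ReuseDemo {
    // Returns the cached components for the field, creating and storing them on a miss.
    static Components obtain(Reuse strategy, String fieldName) {
        Components components = strategy.get(fieldName);
        if (components == null) {
            components = new Components(fieldName);
            strategy.set(fieldName, components);
        }
        return components;
    }

    public static void main(String[] args) {
        Reuse global = new GlobalReuse();
        // Same instance for both fields: prints true
        System.out.println(obtain(global, "title") == obtain(global, "body"));

        Reuse perField = new PerFieldReuse();
        // Distinct instances for distinct fields: prints false
        System.out.println(obtain(perField, "title") == obtain(perField, "body"));
    }
}
```

In this model the global strategy hands the components first built for "title" back for "body" as well, which is only appropriate when every field is analyzed the same way.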

Constructor Detail

Analyzer

public Analyzer()
Create a new Analyzer, reusing the same set of components per-thread across calls to tokenStream(String, Reader).
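
As a rough illustration of what per-thread reuse means, the following plain-Java sketch builds its "components" once per thread and re-points them at new input on later calls. It is a model only: PerThreadReuseSketch and creationsFor are hypothetical names, a StringBuilder stands in for the components, and nothing here is taken from Lucene's source.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical plain-Java model of per-thread reuse; none of these are Lucene classes.
final class PerThreadReuseSketch {
    // How many times components were built, summed over all threads.
    final AtomicInteger created = new AtomicInteger();

    // One slot per thread; a StringBuilder stands in for the components.
    private final ThreadLocal<StringBuilder> stored = new ThreadLocal<>();

    // Analogue of tokenStream(...): build once per thread, then re-point at new input.
    StringBuilder tokenStream(String text) {
        StringBuilder components = stored.get();
        if (components == null) {
            components = new StringBuilder();   // analogue of createComponents(...)
            created.incrementAndGet();
            stored.set(components);
        }
        components.setLength(0);                // analogue of setReader(...)
        components.append(text);
        return components;
    }

    // Runs `threads` workers that each call tokenStream `callsPerThread` times
    // and returns how many component sets were built in total.
    static int creationsFor(int threads, int callsPerThread) {
        PerThreadReuseSketch analyzer = new PerThreadReuseSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int c = 0; c < callsPerThread; c++) {
                    analyzer.tokenStream("text " + c);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return analyzer.created.get();
    }

    public static void main(String[] args) {
        System.out.println(creationsFor(3, 5)); // prints 3: one build per thread
    }
}
```

Because the same object comes back on every call within a thread, the second call in this model overwrites what the first call returned.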


Analyzer

public Analyzer(Analyzer.ReuseStrategy reuseStrategy)
Expert: create a new Analyzer with a custom Analyzer.ReuseStrategy.

NOTE: if you just want to reuse on a per-field basis, it's easier to use a subclass of AnalyzerWrapper such as PerFieldAnalyzerWrapper instead.

Method Detail

createComponents

protected abstract Analyzer.TokenStreamComponents createComponents(String fieldName,
                                                                   Reader reader)
Creates a new Analyzer.TokenStreamComponents instance for this analyzer.

Parameters:
fieldName - the name of the field whose content is passed to the Analyzer.TokenStreamComponents sink as a reader
reader - the reader passed to the Tokenizer constructor
Returns:
the Analyzer.TokenStreamComponents for this analyzer.

tokenStream

public final TokenStream tokenStream(String fieldName,
                                     Reader reader)
                              throws IOException
Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.

This method uses createComponents(String, Reader) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

NOTE: If your data is available as a String, use tokenStream(String, String), which reuses a StringReader-like instance internally.

Parameters:
fieldName - the name of the field the created TokenStream is used for
reader - the reader the stream's source reads from
Returns:
TokenStream for iterating the analyzed content of reader
Throws:
AlreadyClosedException - if the Analyzer is closed.
IOException - if an I/O error occurs.
See Also:
tokenStream(String, String)
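
The consumer workflow referred to in the NOTE above is, in outline: reset(), then incrementToken() until it returns false, then end(), then close(). The plain-Java sketch below mirrors that call order on a toy whitespace splitter. MiniTokenStream and ConsumeSketch are hypothetical stand-ins that only borrow the method names; the exception thrown when reset() is skipped is a choice made for this model, and the authoritative contract is the one in the TokenStream documentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that borrows the method names of the TokenStream workflow.
final class MiniTokenStream {
    private final String text;
    private int pos;
    private boolean resetCalled;
    private String term;   // plays the role of a term attribute

    MiniTokenStream(String text) { this.text = text; }

    // Must be called before the first incrementToken().
    void reset() { pos = 0; resetCalled = true; }

    // Advances to the next whitespace-separated token; false when input is exhausted.
    boolean incrementToken() {
        if (!resetCalled) {
            throw new IllegalStateException("reset() must be called before incrementToken()");
        }
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos == text.length()) return false;
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
        term = text.substring(start, pos);
        return true;
    }

    String term() { return term; }

    // End-of-stream hook; nothing to do in this toy model.
    void end() { }

    // Releases the stream; a later use would need reset() again.
    void close() { resetCalled = false; }
}

final class ConsumeSketch {
    // Drives the stream through reset / incrementToken / end / close.
    static List<String> consume(MiniTokenStream stream) {
        List<String> terms = new ArrayList<>();
        try {
            stream.reset();
            while (stream.incrementToken()) {
                terms.add(stream.term());
            }
            stream.end();
        } finally {
            stream.close();   // always release, even if consumption fails
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new MiniTokenStream("quick brown  fox"))); // prints [quick, brown, fox]
    }
}
```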

tokenStream

public final TokenStream tokenStream(String fieldName,
                                     String text)
                              throws IOException
Returns a TokenStream suitable for fieldName, tokenizing the contents of text.

This method uses createComponents(String, Reader) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

Parameters:
fieldName - the name of the field the created TokenStream is used for
text - the String the stream's source reads from
Returns:
TokenStream for iterating the analyzed content of text
Throws:
AlreadyClosedException - if the Analyzer is closed.
IOException - if an I/O error occurs (may rarely happen for strings).
See Also:
tokenStream(String, Reader)

initReader

protected Reader initReader(String fieldName,
                            Reader reader)
Override this if you want to add a CharFilter chain.

The default implementation returns reader unchanged.

Parameters:
fieldName - IndexableField name being indexed
reader - original Reader
Returns:
reader, optionally decorated with CharFilter(s)
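
Decorating the reader can be sketched without Lucene on the classpath, since a CharFilter is itself a Reader. In the sketch below, java.io.FilterReader stands in for CharFilter, and DashToSpaceReader, InitReaderSketch, defaultInitReader, customInitReader and drain are hypothetical names used only for illustration. A real CharFilter additionally maps offsets back to the original text, which this stand-in does not need because it never changes the character count.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Stand-in for a CharFilter: a Reader decorator that rewrites characters on the fly.
final class DashToSpaceReader extends FilterReader {
    DashToSpaceReader(Reader in) { super(in); }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == '-' ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of input, in which case the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '-') buf[i] = ' ';
        }
        return n;
    }
}

final class InitReaderSketch {
    // Analogue of the default behaviour: hand the reader back untouched.
    static Reader defaultInitReader(String fieldName, Reader reader) {
        return reader;
    }

    // Analogue of an override: wrap the reader; further wrappers could be chained here.
    static Reader customInitReader(String fieldName, Reader reader) {
        return new DashToSpaceReader(reader);
    }

    // Reads everything from the reader into a String.
    static String drain(Reader reader) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[64];
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Reader decorated = customInitReader("body", new StringReader("state-of-the-art"));
        System.out.println(drain(decorated)); // prints: state of the art
    }
}
```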

getPositionIncrementGap

public int getPositionIncrementGap(String fieldName)
Invoked before indexing an IndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IndexableField instances using the same field name. The default position increment gap is 0. With a position increment gap of 0 and the typical default token position increment of 1, all terms in a field, including across IndexableField instances, are in successive positions, allowing exact PhraseQuery matches, for instance, across IndexableField instance boundaries.

Parameters:
fieldName - IndexableField name being indexed.
Returns:
position increment gap, added to the next token emitted from tokenStream(String,Reader). This value must be >= 0.
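
The effect of the gap on term positions is simple arithmetic, worked through in the plain-Java sketch below. It is a simplified model rather than Lucene's indexing code: PositionGapSketch and positions are hypothetical names, tokens are split on whitespace, every token is assumed to have a position increment of 1, and every value is assumed to produce at least one token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how a position increment gap shifts term positions.
final class PositionGapSketch {
    // Assigns a position to each whitespace-separated token of each value.
    // Tokens within a value are one position apart; `gap` extra positions are
    // added before the first token of every value after the first.
    static List<Integer> positions(int gap, String... values) {
        List<Integer> out = new ArrayList<>();
        int next = 0;   // position the next token receives
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                next += gap;   // boundary between two values of the same field
            }
            for (String token : values[i].trim().split("\\s+")) {
                if (token.isEmpty()) continue;
                out.add(next++);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two values for the same field name: "quick brown" and "fox jumps".
        System.out.println(positions(0, "quick brown", "fox jumps"));   // prints [0, 1, 2, 3]
        System.out.println(positions(100, "quick brown", "fox jumps")); // prints [0, 1, 102, 103]
    }
}
```

With a gap of 0, "brown" (position 1) and "fox" (position 2) are adjacent, so an exact phrase "brown fox" can match across the boundary. With a gap of 100 they are 101 positions apart, so it cannot.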

getOffsetGap

public int getOffsetGap(String fieldName)
Like getPositionIncrementGap(java.lang.String), but for Token offsets. By default this returns 1. This method is only called if the field produced at least one token for indexing.

Parameters:
fieldName - the field just indexed
Returns:
offset gap, added to the next token emitted from tokenStream(String,Reader). This value must be >= 0.
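
The offset gap can be worked through the same way. The plain-Java sketch below is a simplified model, not Lucene's indexing code: OffsetGapSketch and offsets are hypothetical names, tokens are split on single spaces, each value's end offset is taken to be its length, and every value is assumed to produce at least one token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how an offset gap shifts token offsets across values.
final class OffsetGapSketch {
    // Returns {start, end} character offsets for every space-separated token.
    // Offsets of later values are shifted by the lengths of the earlier values
    // plus `offsetGap` per boundary.
    static List<int[]> offsets(int offsetGap, String... values) {
        List<int[]> out = new ArrayList<>();
        int base = 0;   // offset at which the current value starts
        for (String value : values) {
            int i = 0;
            while (i < value.length()) {
                while (i < value.length() && value.charAt(i) == ' ') i++;
                int start = i;
                while (i < value.length() && value.charAt(i) != ' ') i++;
                if (i > start) {
                    out.add(new int[] {base + start, base + i});
                }
            }
            base += value.length() + offsetGap;
        }
        return out;
    }

    public static void main(String[] args) {
        // With the default gap of 1, the second value starts at 11 + 1 = 12.
        for (int[] span : offsets(1, "quick brown", "fox jumps")) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

With the default gap of 1, the offsets in this model come out as if the two values had been joined by a single separator character: "fox" spans 12 to 15 and "jumps" spans 16 to 21.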

getReuseStrategy

public final Analyzer.ReuseStrategy getReuseStrategy()
Returns the Analyzer.ReuseStrategy in use.


close

public void close()
Frees persistent resources used by this Analyzer.

Specified by:
close in interface Closeable


Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.