public abstract class Analyzer extends Object implements Closeable
In order to define what analysis is done, subclasses must define their
TokenStreamComponents in createComponents(String).
The components are then reused in each call to tokenStream(String, Reader).
Simple example:
```java
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new FooTokenizer();
    TokenStream filter = new FooFilter(source);
    filter = new BarFilter(filter);
    return new TokenStreamComponents(source, filter);
  }

  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    // Assuming FooFilter is about normalization and BarFilter is about
    // stemming, only FooFilter should be applied
    return new FooFilter(in);
  }
};
```
For more examples, see the Analysis package documentation.
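The source/filter chain in the example above is a decorator pattern: the Tokenizer produces tokens, and each TokenFilter wraps the stream it consumes. A self-contained plain-Java sketch of the same pattern (`SimpleTokenStream`, `WhitespaceSource`, and `LowerCaseWrapper` are illustrative stand-ins, not Lucene classes):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Locale;

// Illustrative stand-ins for the Tokenizer/TokenFilter chain pattern;
// these are NOT the real Lucene classes.
interface SimpleTokenStream {
    boolean incrementToken();
    String term();
}

// Plays the Tokenizer role: produces tokens from raw text.
class WhitespaceSource implements SimpleTokenStream {
    private final Iterator<String> it;
    private String current;
    WhitespaceSource(String input) {
        it = Arrays.asList(input.trim().split("\\s+")).iterator();
    }
    public boolean incrementToken() {
        if (!it.hasNext()) return false;
        current = it.next();
        return true;
    }
    public String term() { return current; }
}

// Plays the TokenFilter role: decorates another stream.
class LowerCaseWrapper implements SimpleTokenStream {
    private final SimpleTokenStream input;
    LowerCaseWrapper(SimpleTokenStream input) { this.input = input; }
    public boolean incrementToken() { return input.incrementToken(); }
    public String term() { return input.term().toLowerCase(Locale.ROOT); }
}

class ChainDemo {
    public static void main(String[] args) {
        SimpleTokenStream ts = new LowerCaseWrapper(new WhitespaceSource("Quick Brown FOX"));
        while (ts.incrementToken()) System.out.println(ts.term());
    }
}
```

Chaining `new LowerCaseWrapper(new WhitespaceSource(...))` mirrors `new BarFilter(new FooFilter(source))` above: the outermost wrapper is the sink that consumers read from.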
For some concrete implementations bundled with Lucene, look in the analysis modules.
| Modifier and Type | Class and Description |
|---|---|
| static class | Analyzer.ReuseStrategy: Strategy defining how TokenStreamComponents are reused per call to tokenStream(String, java.io.Reader). |
| static class | Analyzer.TokenStreamComponents: This class encapsulates the outer components of a token stream. |
| Modifier and Type | Field and Description |
|---|---|
| static Analyzer.ReuseStrategy | GLOBAL_REUSE_STRATEGY: A predefined Analyzer.ReuseStrategy that reuses the same components for every field. |
| static Analyzer.ReuseStrategy | PER_FIELD_REUSE_STRATEGY: A predefined Analyzer.ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name. |
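Conceptually, the two predefined strategies differ only in the key under which cached components are looked up. A stdlib-only sketch of that difference (class names here are illustrative, not Lucene's implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for cached analysis components (not Lucene's TokenStreamComponents).
class CachedComponents { }

// GLOBAL_REUSE_STRATEGY analog: one cached instance shared by every field.
class GlobalReuseSketch {
    private CachedComponents cached;
    CachedComponents get(String fieldName) {
        if (cached == null) cached = new CachedComponents();
        return cached;
    }
}

// PER_FIELD_REUSE_STRATEGY analog: one cached instance per field name.
class PerFieldReuseSketch {
    private final Map<String, CachedComponents> cache = new HashMap<>();
    CachedComponents get(String fieldName) {
        return cache.computeIfAbsent(fieldName, f -> new CachedComponents());
    }
}
```

In the real class the cache is additionally held per-thread (see the `Analyzer()` constructor), so neither strategy by itself makes components shareable across threads.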
| Constructor and Description |
|---|
| Analyzer(): Create a new Analyzer, reusing the same set of components per-thread across calls to tokenStream(String, Reader). |
| Analyzer(Analyzer.ReuseStrategy reuseStrategy): Expert: create a new Analyzer with a custom Analyzer.ReuseStrategy. |
| Modifier and Type | Method and Description |
|---|---|
| protected AttributeFactory | attributeFactory(String fieldName): Return the AttributeFactory to be used for analysis and normalization on the given field name. |
| void | close(): Frees persistent resources used by this Analyzer. |
| protected abstract Analyzer.TokenStreamComponents | createComponents(String fieldName): Creates a new Analyzer.TokenStreamComponents instance for this analyzer. |
| int | getOffsetGap(String fieldName): Just like getPositionIncrementGap(java.lang.String), except for Token offsets instead. |
| int | getPositionIncrementGap(String fieldName): Invoked before indexing an IndexableField instance if terms have already been added to that field. |
| Analyzer.ReuseStrategy | getReuseStrategy(): Returns the used Analyzer.ReuseStrategy. |
| Version | getVersion(): Return the version of Lucene this analyzer will mimic the behavior of for analysis. |
| protected Reader | initReader(String fieldName, Reader reader): Override this if you want to add a CharFilter chain. |
| protected Reader | initReaderForNormalization(String fieldName, Reader reader): Wrap the given Reader with CharFilters that make sense for normalization. |
| BytesRef | normalize(String fieldName, String text): Normalize a string down to the representation that it would have in the index. |
| protected TokenStream | normalize(String fieldName, TokenStream in): Wrap the given TokenStream in order to apply normalization filters. |
| void | setVersion(Version v): Set the version of Lucene this analyzer should mimic the behavior of for analysis. |
| TokenStream | tokenStream(String fieldName, Reader reader): Returns a TokenStream suitable for fieldName, tokenizing the contents of reader. |
| TokenStream | tokenStream(String fieldName, String text): Returns a TokenStream suitable for fieldName, tokenizing the contents of text. |
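To see why getPositionIncrementGap matters for multi-valued fields, consider the positions assigned to tokens across several values of one field. A sketch of the arithmetic, assuming each token advances the position by 1 (the common default); this is not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

class PositionGapSketch {
    // Assign a position to every token across several values of one field,
    // adding `gap` extra increments between consecutive values
    // (the role getPositionIncrementGap plays during indexing).
    static List<Integer> positions(List<List<String>> fieldValues, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (int v = 0; v < fieldValues.size(); v++) {
            if (v > 0) pos += gap;
            for (String token : fieldValues.get(v)) out.add(++pos);
        }
        return out;
    }
}
```

With a gap of 0 the last token of one value and the first token of the next are adjacent, so a phrase query could match across the value boundary; a large gap (e.g. 100) prevents that.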
public static final Analyzer.ReuseStrategy GLOBAL_REUSE_STRATEGY
A predefined Analyzer.ReuseStrategy that reuses the same components for every field.

public static final Analyzer.ReuseStrategy PER_FIELD_REUSE_STRATEGY
A predefined Analyzer.ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name.

public Analyzer()
Create a new Analyzer, reusing the same set of components per-thread across calls to tokenStream(String, Reader).

public Analyzer(Analyzer.ReuseStrategy reuseStrategy)
Expert: create a new Analyzer with a custom Analyzer.ReuseStrategy.
NOTE: if you just want to reuse on a per-field basis, it's easier to use a subclass of AnalyzerWrapper such as PerFieldAnalyzerWrapper instead.
protected abstract Analyzer.TokenStreamComponents createComponents(String fieldName)
Creates a new Analyzer.TokenStreamComponents instance for this analyzer.
Parameters: fieldName - the name of the fields content passed to the Analyzer.TokenStreamComponents sink as a reader
Returns: the Analyzer.TokenStreamComponents for this analyzer

protected TokenStream normalize(String fieldName, TokenStream in)
Wrap the given TokenStream in order to apply normalization filters. The default implementation returns the TokenStream as-is. This is used by normalize(String, String).

public final TokenStream tokenStream(String fieldName, Reader reader)
Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.
This method uses createComponents(String) to obtain an
instance of Analyzer.TokenStreamComponents. It returns the sink of the
components and stores the components internally. Subsequent calls to this
method will reuse the previously stored components after resetting them
through Analyzer.TokenStreamComponents.setReader(Reader).
NOTE: After calling this method, the consumer must follow the
workflow described in TokenStream to properly consume its contents.
See the Analysis package documentation for
some examples demonstrating this.
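The workflow referenced above is: call reset() once, call incrementToken() until it returns false, then call end() and close(). A minimal stdlib sketch of that contract (illustrative only; the real TokenStream enforces more through its attribute-based API):

```java
// Illustrative stand-in enforcing the reset/incrementToken/end/close order;
// NOT the real org.apache.lucene.analysis.TokenStream.
class WorkflowSketchStream {
    private final String[] tokens;
    private int pos = -1;
    private boolean wasReset;

    WorkflowSketchStream(String... tokens) { this.tokens = tokens; }

    void reset() { wasReset = true; pos = -1; }

    boolean incrementToken() {
        if (!wasReset) throw new IllegalStateException("reset() must be called first");
        return ++pos < tokens.length;
    }

    String term() { return tokens[pos]; }

    void end() { /* a real stream records final offset state here */ }
    void close() { /* a real stream releases resources here */ }
}
```

A consumer calls `reset()`, loops on `incrementToken()` reading the term each time, then calls `end()` and finally `close()`; skipping `reset()` is an error.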
NOTE: If your data is available as a String, use
tokenStream(String, String) which reuses a StringReader-like
instance internally.
Parameters: fieldName - the name of the field the created TokenStream is used for; reader - the reader the stream's source reads from
Throws: AlreadyClosedException - if the Analyzer is closed
See Also: tokenStream(String, String)

public final TokenStream tokenStream(String fieldName, String text)
Returns a TokenStream suitable for fieldName, tokenizing the contents of text.
This method uses createComponents(String) to obtain an
instance of Analyzer.TokenStreamComponents. It returns the sink of the
components and stores the components internally. Subsequent calls to this
method will reuse the previously stored components after resetting them
through Analyzer.TokenStreamComponents.setReader(Reader).
NOTE: After calling this method, the consumer must follow the
workflow described in TokenStream to properly consume its contents.
See the Analysis package documentation for
some examples demonstrating this.
Parameters: fieldName - the name of the field the created TokenStream is used for; text - the String the stream's source reads from
Throws: AlreadyClosedException - if the Analyzer is closed
See Also: tokenStream(String, Reader)

public final BytesRef normalize(String fieldName, String text)
Normalize a string down to the representation that it would have in the index.
This is typically used by query parsers in order to generate a query on a given term, without tokenizing or stemming, which are undesirable if the string to analyze is a partial word (e.g. in case of a wildcard or fuzzy query).
This method uses initReaderForNormalization(String, Reader) in
order to apply necessary character-level normalization and then
normalize(String, TokenStream) in order to apply the normalizing
token filters.
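As a concrete illustration of why only normalizing filters are applied here: for a wildcard fragment, case folding is safe, but stemming would mangle the partial word and break the match. A stdlib sketch (`normalizeForWildcard` is a hypothetical helper, not a Lucene API):

```java
import java.util.Locale;

class QueryTermNormalizer {
    // Character-level normalization only (the role of normalize here):
    // safe for partial words such as the "Runni" in the wildcard "Runni*".
    static String normalizeForWildcard(String partialTerm) {
        return partialTerm.toLowerCase(Locale.ROOT);
    }
}
```

Stemming `"Runni"` before expanding the wildcard would stop it from matching `"running"` in the index, which is exactly why the stemming `BarFilter` is omitted from `normalize` in the class-level example.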
protected Reader initReader(String fieldName, Reader reader)
Override this if you want to add a CharFilter chain. The default implementation returns reader unchanged.
Parameters: fieldName - IndexableField name being indexed; reader - original Reader

protected Reader initReaderForNormalization(String fieldName, Reader reader)
Wrap the given Reader with CharFilters that make sense for normalization. This is typically a subset of the CharFilters that are applied in initReader(String, Reader). This is used by normalize(String, String).

protected AttributeFactory attributeFactory(String fieldName)
Return the AttributeFactory to be used for analysis and normalization on the given field name. The default implementation returns TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY.

public int getPositionIncrementGap(String fieldName)
Invoked before indexing an IndexableField instance if terms have already been added to that field.
Parameters: fieldName - IndexableField name being indexed
Returns: position increment gap, added to the next token emitted from tokenStream(String, Reader). This value must be >= 0.

public int getOffsetGap(String fieldName)
Just like getPositionIncrementGap(java.lang.String), except for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing.
Parameters: fieldName - the field just indexed
Returns: offset gap, added to the next token emitted from tokenStream(String, Reader). This value must be >= 0.

public final Analyzer.ReuseStrategy getReuseStrategy()
Returns the used Analyzer.ReuseStrategy.

public void setVersion(Version v)
Set the version of Lucene this analyzer should mimic the behavior of for analysis.

public Version getVersion()
Return the version of Lucene this analyzer will mimic the behavior of for analysis.

public void close()
Frees persistent resources used by this Analyzer.
Specified by: close in interface Closeable
Specified by: close in interface AutoCloseable

Copyright © 2000-2017 Apache Software Foundation. All Rights Reserved.