org.apache.lucene.analysis.ngram
Class NGramTokenFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.ngram.NGramTokenFilter
- All Implemented Interfaces:
- Closeable
public final class NGramTokenFilter
- extends TokenFilter
Tokenizes the input into n-grams of the given size(s).
You must specify the required Version compatibility when creating an NGramTokenFilter. As of Lucene 4.4, this token filter:
- handles supplementary characters correctly,
- emits all n-grams for the same token at the same position,
- does not modify offsets,
- sorts n-grams by their offset in the original token first, then by increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc", "c").
You can make this filter use the old behavior by providing a version < Version.LUCENE_44 in the constructor, but this is not recommended because it will lead to broken TokenStreams that will cause highlighting bugs.
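The ordering described in the list above (start offset first, then increasing length) can be illustrated with a small standalone sketch. This is not Lucene's implementation: the class name NGramOrderSketch and the method ngrams are hypothetical, and the sketch only models the order of the emitted terms, counting length in code points so that a supplementary character counts as a single character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class NGramOrderSketch {

    // Returns every n-gram of `token` whose length (in code points) lies in
    // [minGram, maxGram], ordered by start offset, then by increasing length.
    static List<String> ngrams(String token, int minGram, int maxGram) {
        int[] codePoints = token.codePoints().toArray();
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < codePoints.length; start++) {
            for (int len = minGram;
                 len <= maxGram && start + len <= codePoints.length;
                 len++) {
                // Build the gram from whole code points so surrogate pairs are never split.
                grams.add(new String(codePoints, start, len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [a, ab, abc, b, bc, c]
        System.out.println(ngrams("abc", 1, 3));
    }
}
```

With minGram 1 and maxGram 3, the token "abc" yields the same sequence as the example in the list above.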
If you were using this TokenFilter to perform partial highlighting, this will no longer work because this filter does not update offsets. Instead, modify your analysis chain to use NGramTokenizer, and potentially override NGramTokenizer.isTokenChar(int) to perform pre-tokenization.
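The idea behind the recommendation above is that a tokenizer sees the original text and can therefore give each n-gram its own offsets, while a character predicate decides where runs of token characters begin and end. The following is a minimal standalone sketch of that idea, not Lucene's NGramTokenizer: the class PreTokenizedNGramSketch, the nested Gram type, and the IntPredicate parameter (standing in for an overridden isTokenChar(int)) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration only; not part of Lucene.
final class PreTokenizedNGramSketch {

    // An n-gram plus its [start, end) offsets in the original text.
    static final class Gram {
        final String text;
        final int start;
        final int end;

        Gram(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    // Splits `text` into runs of characters accepted by `isTokenChar`, then
    // emits the n-grams of each run with offsets into the original text.
    static List<Gram> ngrams(String text, int minGram, int maxGram, IntPredicate isTokenChar) {
        List<Gram> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!isTokenChar.test(cp)) {
                i += Character.charCount(cp); // skip separator characters
                continue;
            }
            // Find the end of the current run of token characters.
            int runEnd = i;
            while (runEnd < text.length() && isTokenChar.test(text.codePointAt(runEnd))) {
                runEnd += Character.charCount(text.codePointAt(runEnd));
            }
            // Emit grams that start at each code point of the run.
            for (int s = i; s < runEnd; s += Character.charCount(text.codePointAt(s))) {
                int e = s;
                for (int len = 1; len <= maxGram && e < runEnd; len++) {
                    e += Character.charCount(text.codePointAt(e));
                    if (len >= minGram) {
                        out.add(new Gram(text.substring(s, e), s, e));
                    }
                }
            }
            i = runEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [ab[0,2), cd[3,5)]
        System.out.println(ngrams("ab cd", 2, 2, Character::isLetterOrDigit));
    }
}
```

Because every gram carries offsets into the original text, a highlighter could mark only the matching part of a word, which is what the filter no longer supports.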
Method Summary
- boolean incrementToken(): Advances to the next token in the stream; returns false at end of stream (EOS).
- void reset()
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
DEFAULT_MIN_NGRAM_SIZE
public static final int DEFAULT_MIN_NGRAM_SIZE
- See Also:
- Constant Field Values
DEFAULT_MAX_NGRAM_SIZE
public static final int DEFAULT_MAX_NGRAM_SIZE
- See Also:
- Constant Field Values
NGramTokenFilter
public NGramTokenFilter(Version version,
TokenStream input,
int minGram,
int maxGram)
- Creates an NGramTokenFilter with the given minimum and maximum n-gram sizes.
- Parameters:
version - Lucene version to enable correct position increments. See above for details.
input - TokenStream holding the input to be tokenized
minGram - the smallest n-gram to generate
maxGram - the largest n-gram to generate
NGramTokenFilter
public NGramTokenFilter(Version version,
TokenStream input)
- Creates an NGramTokenFilter with the default minimum and maximum n-gram sizes (DEFAULT_MIN_NGRAM_SIZE and DEFAULT_MAX_NGRAM_SIZE).
- Parameters:
version - Lucene version to enable correct position increments. See above for details.
input - TokenStream holding the input to be tokenized
incrementToken
public final boolean incrementToken()
throws IOException
- Advances to the next token in the stream. Returns true if a token is available, or false at end of stream (EOS).
- Specified by: incrementToken in class TokenStream
- Throws: IOException
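Because incrementToken() returns a boolean rather than a token, a consumer typically calls reset() once and then loops until the method returns false, reading the current token's state after each successful call. The sketch below shows that loop against a hypothetical stand-in class, SketchStream, which is not a Lucene type; its public term field is an assumption used here in place of Lucene's attribute mechanism.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class IncrementTokenSketch {

    // Minimal stand-in for a token stream with a boolean incrementToken().
    static final class SketchStream {
        private final List<String> tokens;
        private Iterator<String> it;
        String term; // current token; meaningful only after incrementToken() returned true

        SketchStream(List<String> tokens) {
            this.tokens = tokens;
        }

        // Rewinds the stream; the consumer calls this before the first incrementToken().
        void reset() {
            it = tokens.iterator();
            term = null;
        }

        // Advances to the next token; returns false once the stream is exhausted.
        boolean incrementToken() {
            if (it == null) {
                throw new IllegalStateException("reset() must be called first");
            }
            if (!it.hasNext()) {
                return false;
            }
            term = it.next();
            return true;
        }
    }

    // Consumes the whole stream and returns the terms in the order seen.
    static List<String> drain(SketchStream stream) {
        List<String> seen = new ArrayList<>();
        stream.reset();
        while (stream.incrementToken()) {
            seen.add(stream.term);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Prints [a, ab, b]
        System.out.println(drain(new SketchStream(Arrays.asList("a", "ab", "b"))));
    }
}
```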
reset
public void reset()
throws IOException
- Overrides: reset in class TokenFilter
- Throws: IOException
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.