Package org.apache.lucene.analysis.ngram
Class NGramTokenFilter
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.TokenFilter
        org.apache.lucene.analysis.ngram.NGramTokenFilter
All Implemented Interfaces:
Closeable, AutoCloseable, Unwrappable<TokenStream>
Tokenizes the input into n-grams of the given size(s). As of Lucene 4.4, this token filter:
- handles supplementary characters correctly,
- emits all n-grams for the same token at the same position,
- does not modify offsets,
- sorts n-grams by their offset in the original token first, then increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc", "c").
If you were using this TokenFilter to perform partial highlighting, this won't work anymore since this filter doesn't update offsets. You should modify your analysis chain to use NGramTokenizer, and potentially override NGramTokenizer.isTokenChar(int) to perform pre-tokenization.
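The emission order described above (by offset first, then by increasing length) can be reproduced with a small standalone sketch. `NGramOrderSketch` is an illustrative helper, not part of Lucene, and it works on `char`s rather than code points, so unlike the real filter it does not handle supplementary characters:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramOrderSketch {
    // Emits the n-grams of `term` in the same order NGramTokenFilter does:
    // sorted by start offset first, then by increasing gram length.
    static List<String> ngrams(String term, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                out.add(term.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Matches the documented example: "abc" -> a, ab, abc, b, bc, c
        System.out.println(ngrams("abc", 1, 3));
    }
}
```

In the real filter all of these grams are emitted at the same position and with the offsets of the original token, which is why offset-based highlighting no longer works.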
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructors:
NGramTokenFilter(TokenStream input, int gramSize)
Creates an NGramTokenFilter that produces n-grams of the indicated size.
NGramTokenFilter(TokenStream input, int minGram, int maxGram, boolean preserveOriginal)
Creates an NGramTokenFilter that, for a given input term, produces all contained n-grams with lengths >= minGram and <= maxGram.
-
Method Summary
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
DEFAULT_PRESERVE_ORIGINAL
public static final boolean DEFAULT_PRESERVE_ORIGINAL
-
Constructor Details
-
NGramTokenFilter
Creates an NGramTokenFilter that, for a given input term, produces all contained n-grams with lengths >= minGram and <= maxGram. Will optionally preserve the original term when its length is outside of the defined range.
Note: Care must be taken when choosing minGram and maxGram; depending on the input token size, this filter potentially produces a huge number of terms.
- Parameters:
input - TokenStream holding the input to be tokenized
minGram - the minimum length of the generated n-grams
maxGram - the maximum length of the generated n-grams
preserveOriginal - Whether or not to keep the original term when it is shorter than minGram or longer than maxGram
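The note about term blow-up can be made concrete. For each gram length g in [minGram, maxGram], a term of length len yields max(0, len - g + 1) grams, so the total grows roughly linearly in both the token length and the gram range. `GramCount` below is an illustrative helper (not a Lucene API) that computes this total, ignoring any extra term added by preserveOriginal:

```java
public class GramCount {
    // Number of n-grams emitted for a term of length `len`:
    // for each gram length g in [minGram, maxGram] there are
    // max(0, len - g + 1) starting offsets.
    static int count(int len, int minGram, int maxGram) {
        int total = 0;
        for (int g = minGram; g <= maxGram; g++) {
            total += Math.max(0, len - g + 1);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(count(3, 1, 3));   // 6 grams for "abc", as in the class description
        System.out.println(count(50, 2, 10)); // 405 grams for a single 50-char token
    }
}
```

A wide gram range applied to long tokens can therefore multiply the number of indexed terms by an order of magnitude or more.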
-
NGramTokenFilter
Creates an NGramTokenFilter that produces n-grams of the indicated size.
- Parameters:
input - TokenStream holding the input to be tokenized
gramSize - the size of n-grams to generate
-
Method Details
-
incrementToken
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
reset
- Overrides:
reset in class TokenFilter
- Throws:
IOException
-
end
- Overrides:
end in class TokenFilter
- Throws:
IOException
-