Package org.apache.lucene.analysis.ngram
Class NGramTokenFilter
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.TokenFilter
        - org.apache.lucene.analysis.ngram.NGramTokenFilter

All Implemented Interfaces:
- Closeable, AutoCloseable, Unwrappable<TokenStream>
public final class NGramTokenFilter extends TokenFilter
Tokenizes the input into n-grams of the given size(s). As of Lucene 4.4, this token filter:
- handles supplementary characters correctly,
- emits all n-grams for the same token at the same position,
- does not modify offsets,
- sorts n-grams by their offset in the original token first, then by increasing length (so "abc" yields "a", "ab", "abc", "b", "bc", "c").

If you were using this TokenFilter to perform partial highlighting, that will no longer work, since this filter does not update offsets. You should modify your analysis chain to use NGramTokenizer, and potentially override NGramTokenizer.isTokenChar(int) to perform pre-tokenization.
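The gram ordering described above (start offset first, then increasing length) can be sketched in plain Java. This is an illustration of the documented output order, not Lucene's actual implementation — in particular, the real filter works on Unicode code points (hence the supplementary-character handling noted above), while this sketch uses `char` indices for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not Lucene code): enumerates the n-grams of a single term in the
// order the class documentation describes -- by start offset in the original
// token first, then by increasing gram length.
public class NGramOrder {
    static List<String> ngrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                grams.add(term.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // "abc" with minGram=1, maxGram=3 -> [a, ab, abc, b, bc, c]
        System.out.println(ngrams("abc", 1, 3));
    }
}
```

Note how all grams starting at offset 0 ("a", "ab", "abc") come before any gram starting at offset 1, matching the example in the description.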
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
- static boolean DEFAULT_PRESERVE_ORIGINAL
-
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors Constructor Description NGramTokenFilter(TokenStream input, int gramSize)
Creates an NGramTokenFilter that produces n-grams of the indicated size.NGramTokenFilter(TokenStream input, int minGram, int maxGram, boolean preserveOriginal)
Creates an NGramTokenFilter that, for a given input term, produces all contained n-grams with lengths >= minGram and <= maxGram.
-
Method Summary
- void end()
- boolean incrementToken()
- void reset()
-
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, unwrap
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
DEFAULT_PRESERVE_ORIGINAL
public static final boolean DEFAULT_PRESERVE_ORIGINAL
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
NGramTokenFilter
public NGramTokenFilter(TokenStream input, int minGram, int maxGram, boolean preserveOriginal)
Creates an NGramTokenFilter that, for a given input term, produces all contained n-grams with lengths >= minGram and <= maxGram. Will optionally preserve the original term when its length is outside of the defined range. Note: care must be taken when choosing minGram and maxGram; depending on the input token size, this filter potentially produces a huge number of terms.
- Parameters:
  - input - TokenStream holding the input to be tokenized
  - minGram - the minimum length of the generated n-grams
  - maxGram - the maximum length of the generated n-grams
  - preserveOriginal - whether to keep the original term when it is shorter than minGram or longer than maxGram
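The preserveOriginal contract can be sketched as follows. This is an illustrative plain-Java model of the documented behavior, not Lucene's implementation; in particular, where exactly the preserved original appears relative to the grams is an implementation detail of the real filter, and this sketch simply appends it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not Lucene code) of the preserveOriginal behavior described above:
// a term whose length falls outside [minGram, maxGram] produces no grams of
// its own length range beyond what fits, but is passed through as an extra
// token when preserveOriginal is true.
public class PreserveOriginalSketch {
    static List<String> filter(String term, int minGram, int maxGram, boolean preserveOriginal) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                out.add(term.substring(start, start + len));
            }
        }
        if (preserveOriginal && (term.length() < minGram || term.length() > maxGram)) {
            out.add(term);  // keep the out-of-range original term
        }
        return out;
    }

    public static void main(String[] args) {
        // A term shorter than minGram yields no grams at all...
        System.out.println(filter("a", 2, 3, false)); // []
        // ...unless preserveOriginal keeps it.
        System.out.println(filter("a", 2, 3, true));  // [a]
    }
}
```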
-
NGramTokenFilter
public NGramTokenFilter(TokenStream input, int gramSize)
Creates an NGramTokenFilter that produces n-grams of the indicated size.
- Parameters:
  - input - TokenStream holding the input to be tokenized
  - gramSize - the size of n-grams to generate
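With a single gramSize, the output is a simple sliding window over the term (equivalent to minGram == maxGram == gramSize). A plain-Java sketch of that window, for illustration only:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not Lucene code): fixed-size grams, the behavior this
// convenience constructor configures (minGram == maxGram == gramSize).
public class FixedGramSketch {
    static List<String> grams(String term, int gramSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + gramSize <= term.length(); start++) {
            out.add(term.substring(start, start + gramSize));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(grams("abcd", 2)); // [ab, bc, cd]
    }
}
```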
-
-
Method Detail
-
incrementToken
public final boolean incrementToken() throws IOException
- Specified by: incrementToken in class TokenStream
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides: reset in class TokenFilter
- Throws:
IOException
-
end
public void end() throws IOException
- Overrides: end in class TokenFilter
- Throws:
IOException