Class NGramTokenFilter

  • All Implemented Interfaces:
    Closeable, AutoCloseable, Unwrappable<TokenStream>

    public final class NGramTokenFilter
    extends TokenFilter
    Tokenizes the input into n-grams of the given size(s). As of Lucene 4.4, this token filter:
    • handles supplementary characters correctly,
    • emits all n-grams for the same token at the same position,
    • does not modify offsets,
    • sorts n-grams by their offset in the original token first, then increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc", "c").

    If you were using this TokenFilter to perform partial highlighting, this won't work anymore since this filter doesn't update offsets. You should modify your analysis chain to use NGramTokenizer, and potentially override NGramTokenizer.isTokenChar(int) to perform pre-tokenization.
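    The snippet below is a minimal sketch (not part of the original Javadoc) demonstrating the emission order described above; the tokenizer choice and input text are illustrative assumptions, and Lucene's core analysis classes are assumed to be on the classpath.

        import java.io.IOException;
        import java.io.StringReader;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.core.KeywordTokenizer;
        import org.apache.lucene.analysis.ngram.NGramTokenFilter;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        public class NGramOrderDemo {
            public static void main(String[] args) throws IOException {
                // Treat the entire input as a single token "abc"
                KeywordTokenizer source = new KeywordTokenizer();
                source.setReader(new StringReader("abc"));
                // minGram = 1, maxGram = 3, do not preserve the original term
                TokenStream stream = new NGramTokenFilter(source, 1, 3, false);
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.println(term); // prints: a, ab, abc, b, bc, c
                }
                stream.end();
                stream.close();
            }
        }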

    • Field Detail

      • DEFAULT_PRESERVE_ORIGINAL

        public static final boolean DEFAULT_PRESERVE_ORIGINAL
        See Also:
        Constant Field Values
    • Constructor Detail

      • NGramTokenFilter

        public NGramTokenFilter(TokenStream input,
                                int minGram,
                                int maxGram,
                                boolean preserveOriginal)
        Creates an NGramTokenFilter that, for a given input term, produces all contained n-grams with lengths >= minGram and <= maxGram. If preserveOriginal is true, the original term is also kept when its length falls outside the defined range.

        Note: Care must be taken when choosing minGram and maxGram; depending on the input token size, this filter potentially produces a huge number of terms.

        Parameters:
        input - TokenStream holding the input to be tokenized
        minGram - the minimum length of the generated n-grams
        maxGram - the maximum length of the generated n-grams
        preserveOriginal - whether to keep the original term when it is shorter than minGram or longer than maxGram
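        A hedged sketch (not part of the original Javadoc) contrasting preserveOriginal = true and false for a term shorter than minGram; the tokenizer choice and input text are illustrative assumptions.

        import java.io.IOException;
        import java.io.StringReader;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.core.WhitespaceTokenizer;
        import org.apache.lucene.analysis.ngram.NGramTokenFilter;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        public class PreserveOriginalDemo {
            static void emit(boolean preserveOriginal) throws IOException {
                WhitespaceTokenizer source = new WhitespaceTokenizer();
                source.setReader(new StringReader("a abc"));
                // minGram = 2, maxGram = 3: the term "a" falls below the range
                TokenStream stream = new NGramTokenFilter(source, 2, 3, preserveOriginal);
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.print(term + " ");
                }
                stream.end();
                stream.close();
                System.out.println();
            }

            public static void main(String[] args) throws IOException {
                emit(false); // ab abc bc    ("a" is dropped: shorter than minGram)
                emit(true);  // a ab abc bc  ("a" is kept as-is)
            }
        }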
      • NGramTokenFilter

        public NGramTokenFilter(TokenStream input,
                                int gramSize)
        Creates an NGramTokenFilter that produces n-grams of the indicated size.
        Parameters:
        input - TokenStream holding the input to be tokenized
        gramSize - the size of n-grams to generate
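
        A minimal sketch (not part of the original Javadoc) wiring this convenience constructor into a custom Analyzer; the class name TrigramAnalyzer and the choice of WhitespaceTokenizer are illustrative assumptions.

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.Tokenizer;
        import org.apache.lucene.analysis.core.WhitespaceTokenizer;
        import org.apache.lucene.analysis.ngram.NGramTokenFilter;

        public class TrigramAnalyzer extends Analyzer {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new WhitespaceTokenizer();
                // Fixed-size grams: minGram == maxGram == 3
                TokenStream grams = new NGramTokenFilter(source, 3);
                return new TokenStreamComponents(source, grams);
            }
        }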