Package org.apache.lucene.analysis.ngram
Class NGramTokenFilter
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.TokenFilter
        org.apache.lucene.analysis.ngram.NGramTokenFilter
All Implemented Interfaces:
Closeable, AutoCloseable, Unwrappable<TokenStream>
Tokenizes the input into n-grams of the given size(s). As of Lucene 4.4, this token filter:
- handles supplementary characters correctly,
- emits all n-grams for the same token at the same position,
- does not modify offsets,
- sorts n-grams by their offset in the original token first, then increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc", "c").
If you were using this TokenFilter to perform partial highlighting, this won't work anymore since this filter doesn't update offsets. You should modify your analysis chain to use NGramTokenizer, and potentially override NGramTokenizer.isTokenChar(int) to perform pre-tokenization.
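The emission order described above (by offset first, then by increasing length) can be reproduced with a small standalone sketch. `NGramOrderSketch` is an illustrative helper, not part of Lucene, and it works on `char`s rather than code points, so unlike the real filter it does not handle supplementary characters:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramOrderSketch {
    // Emits the n-grams of `term` in the same order NGramTokenFilter does:
    // sorted by start offset first, then by increasing gram length.
    static List<String> ngrams(String term, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                out.add(term.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Matches the documented example: "abc" -> a, ab, abc, b, bc, c
        System.out.println(ngrams("abc", 1, 3));
    }
}
```

In the real filter all of these grams are emitted at the same position and with the offsets of the original token, which is why offset-based highlighting no longer works.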
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructors:
NGramTokenFilter(TokenStream input, int gramSize)
Creates an NGramTokenFilter that produces n-grams of the indicated size.
NGramTokenFilter(TokenStream input, int minGram, int maxGram, boolean preserveOriginal)
Creates an NGramTokenFilter that, for a given input term, produces all contained n-grams with lengths >= minGram and <= maxGram.
-
Method Summary
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
DEFAULT_PRESERVE_ORIGINAL
public static final boolean DEFAULT_PRESERVE_ORIGINAL
-
Constructor Details
-
NGramTokenFilter
Creates an NGramTokenFilter that, for a given input term, produces all contained n-grams with lengths >= minGram and <= maxGram. Will optionally preserve the original term when its length is outside of the defined range.
Note: Care must be taken when choosing minGram and maxGram; depending on the input token size, this filter potentially produces a huge number of terms.
- Parameters:
input - TokenStream holding the input to be tokenized
minGram - the minimum length of the generated n-grams
maxGram - the maximum length of the generated n-grams
preserveOriginal - Whether or not to keep the original term when it is shorter than minGram or longer than maxGram
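The note about term blow-up can be made concrete. For each gram length g in [minGram, maxGram], a term of length len yields max(0, len - g + 1) grams, so the total grows roughly linearly in both the token length and the gram range. `GramCount` below is an illustrative helper (not a Lucene API) that computes this total, ignoring any extra term added by preserveOriginal:

```java
public class GramCount {
    // Number of n-grams emitted for a term of length `len`:
    // for each gram length g in [minGram, maxGram] there are
    // max(0, len - g + 1) starting offsets.
    static int count(int len, int minGram, int maxGram) {
        int total = 0;
        for (int g = minGram; g <= maxGram; g++) {
            total += Math.max(0, len - g + 1);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(count(3, 1, 3));   // 6 grams for "abc", as in the class description
        System.out.println(count(50, 2, 10)); // 405 grams for a single 50-char token
    }
}
```

A wide gram range applied to long tokens can therefore multiply the number of indexed terms by an order of magnitude or more.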
-
NGramTokenFilter
Creates an NGramTokenFilter that produces n-grams of the indicated size.
- Parameters:
input - TokenStream holding the input to be tokenized
gramSize - the size of n-grams to generate
-
Method Details
-
incrementToken
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
reset
- Overrides:
reset in class TokenFilter
- Throws:
IOException
-
end
- Overrides:
end in class TokenFilter
- Throws:
IOException
-