org.apache.lucene.analysis.ngram.NGramTokenizer

All Implemented Interfaces: Closeable, AutoCloseable

Direct Known Subclasses: EdgeNGramTokenizer

public class NGramTokenizer extends Tokenizer

Tokenizes the input into n-grams of the given size(s).

Unlike NGramTokenFilter, this class sets offsets so that the characters between startOffset and endOffset in the original stream are the same as the term characters.

For example, with minGram=2 and maxGram=3, "abcde" would be tokenized as:

ngram tokens example
Term	ab	abc	bc	bcd	cd	cde	de
Position increment	1	1	1	1	1	1	1
Position length	1	1	1	1	1	1	1
Offsets	[0,2[	[0,3[	[1,3[	[1,4[	[2,4[	[2,5[	[3,5[
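
The table above can be reproduced with a small plain-Java sketch. This is not Lucene's implementation: the class name NGramOffsetsSketch and its Gram holder are hypothetical, the whole input is held in memory, and the input is assumed to contain no surrogate pairs, so char indices and offsets coincide.

```java
import java.util.ArrayList;
import java.util.List;

class NGramOffsetsSketch {
    /** Hypothetical holder: one gram plus its half-open [start, end[ offsets. */
    static final class Gram {
        final String term;
        final int start;
        final int end;

        Gram(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Emits grams by increasing start offset, then by increasing length. */
    static List<Gram> ngrams(String input, int minGram, int maxGram) {
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= input.length(); len++) {
                // The term is exactly the slice of the input between the two offsets.
                out.add(new Gram(input.substring(start, start + len), start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Gram g : ngrams("abcde", 2, 3)) {
            System.out.println(g.term + " [" + g.start + "," + g.end + "[");
        }
    }
}
```

Each printed line pairs a term with the offsets of the slice it was taken from, which is the property the offsets row of the table describes.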

This tokenizer changed significantly in Lucene 4.4 in order to:

- tokenize in a streaming fashion, to support streams larger than 1024 chars (the limit of the previous version),
- count grams based on Unicode code points instead of Java chars (and never split in the middle of a surrogate pair),
- give the ability to pre-tokenize the stream before computing n-grams.
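
To illustrate what counting grams by code points rather than by Java chars means, here is a minimal plain-Java sketch. It is an illustration only, not Lucene's code: the class name CodePointGramsSketch is hypothetical, and unlike the tokenizer it reads the whole input into an array instead of streaming.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointGramsSketch {
    /** Builds grams whose size is measured in code points, so surrogate pairs stay intact. */
    static List<String> ngrams(String input, int minGram, int maxGram) {
        int[] codePoints = input.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int start = 0; start < codePoints.length; start++) {
            for (int len = minGram; len <= maxGram && start + len <= codePoints.length; len++) {
                // String(int[], int, int) re-encodes the code points, pairs included.
                out.add(new String(codePoints, start, len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as a surrogate pair), 'b': 3 code points, 4 chars.
        String input = "a\uD83D\uDE00b";
        System.out.println(input.length());
        for (String gram : ngrams(input, 2, 2)) {
            System.out.println(gram + " (chars: " + gram.length() + ")");
        }
    }
}
```

With minGram=maxGram=2, each gram holds two code points but three chars, because the supplementary character is never split.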

Additionally, this class doesn't trim trailing whitespace and emits tokens in a different order: tokens are now emitted by increasing start offset, whereas they used to be emitted by increasing length (which made it impossible to support large input streams).
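
The two emission orders can be contrasted with a short plain-Java sketch. The class EmissionOrderSketch and its method names are hypothetical, and the code only illustrates the orders described here; it is not taken from either version of the tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

class EmissionOrderSketch {
    /** Current order: increasing start offset, then increasing length. */
    static List<String> byStartOffset(String s, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= s.length(); len++) {
                out.add(s.substring(start, start + len));
            }
        }
        return out;
    }

    /** Earlier order: all grams of one length before any gram of the next length. */
    static List<String> byLength(String s, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int len = minGram; len <= maxGram; len++) {
            // Every pass rescans the input from the beginning, so all of it must be buffered.
            for (int start = 0; start + len <= s.length(); start++) {
                out.add(s.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(byStartOffset("abcde", 2, 3)); // [ab, abc, bc, bcd, cd, cde, de]
        System.out.println(byLength("abcde", 2, 3));      // [ab, bc, cd, de, abc, bcd, cde]
    }
}
```

Ordering by start offset only ever needs the next maxGram code points in view, which is what makes a streaming implementation possible.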

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary

Fields

Modifier and Type	Field	Description
static final int	DEFAULT_MAX_NGRAM_SIZE
static final int	DEFAULT_MIN_NGRAM_SIZE

Fields inherited from class org.apache.lucene.analysis.Tokenizer
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary

Constructors

Constructor	Description
NGramTokenizer()	Creates NGramTokenizer with default min and max n-grams.
NGramTokenizer(int minGram, int maxGram)	Creates NGramTokenizer with given min and max n-grams.
NGramTokenizer(AttributeFactory factory, int minGram, int maxGram)	Creates NGramTokenizer with given min and max n-grams.
Method Summary

Modifier and Type	Method	Description
final void	end()
final boolean	incrementToken()
protected boolean	isTokenChar(int chr)	Only collect characters which satisfy this condition.
final void	reset()

Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Field Details
- DEFAULT_MIN_NGRAM_SIZE
  
  public static final int DEFAULT_MIN_NGRAM_SIZE
  See Also:
  
  Constant Field Values
- DEFAULT_MAX_NGRAM_SIZE
  
  public static final int DEFAULT_MAX_NGRAM_SIZE
  See Also:
  
  Constant Field Values
Constructor Details
- NGramTokenizer
  
  public NGramTokenizer(int minGram, int maxGram)
  
  Creates NGramTokenizer with given min and max n-grams.
  
  Parameters:
  
  minGram - the smallest n-gram to generate
  
  maxGram - the largest n-gram to generate
- NGramTokenizer
  
  public NGramTokenizer(AttributeFactory factory, int minGram, int maxGram)
  
  Creates NGramTokenizer with given min and max n-grams.
  
  Parameters:
  
  factory - AttributeFactory to use
  
  minGram - the smallest n-gram to generate
  
  maxGram - the largest n-gram to generate
- NGramTokenizer
  
  public NGramTokenizer()
  
  Creates NGramTokenizer with default min and max n-grams.
Method Details
- incrementToken
  
  public final boolean incrementToken() throws IOException
  
  Specified by:
  
  incrementToken in class TokenStream
  
  Throws:
  
  IOException
- isTokenChar
  
  protected boolean isTokenChar(int chr)
  
  Only collect characters which satisfy this condition.
- end
  
  public final void end() throws IOException
  
  Overrides:
  
  end in class TokenStream
  
  Throws:
  
  IOException
- reset
  
  public final void reset() throws IOException
  
  Overrides:
  
  reset in class Tokenizer
  
  Throws:
  
  IOException
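
As a rough illustration of how an isTokenChar condition can act as a pre-tokenization step, the following plain-Java sketch keeps only grams made entirely of accepted characters. It is a sketch under stated assumptions, not Lucene's code: PreTokenizedGramsSketch is a hypothetical name, the condition is passed as an IntPredicate rather than overridden in a subclass, and the input is read fully into memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class PreTokenizedGramsSketch {
    /** Emits only grams in which every code point satisfies isTokenChar. */
    static List<String> ngrams(String input, int minGram, int maxGram, IntPredicate isTokenChar) {
        int[] codePoints = input.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int start = 0; start < codePoints.length; start++) {
            for (int len = 1; len <= maxGram && start + len <= codePoints.length; len++) {
                if (!isTokenChar.test(codePoints[start + len - 1])) {
                    break; // a longer gram from this start would span a rejected character
                }
                if (len >= minGram) {
                    out.add(new String(codePoints, start, len));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Only letters are accepted, so no gram crosses the space.
        System.out.println(ngrams("ab cd", 2, 3, Character::isLetter)); // [ab, cd]
    }
}
```

With the real class, the equivalent customization would be a subclass that overrides the protected isTokenChar(int chr) method.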
