NGramTokenizer (Lucene 8.4.1 API)

java.lang.Object
- org.apache.lucene.util.AttributeSource
  - org.apache.lucene.analysis.TokenStream
    - org.apache.lucene.analysis.Tokenizer
      - org.apache.lucene.analysis.ngram.NGramTokenizer

All Implemented Interfaces:

Closeable, AutoCloseable

Direct Known Subclasses:

EdgeNGramTokenizer
```
public class NGramTokenizer
extends Tokenizer
```
Tokenizes the input into n-grams of the given size(s).
Unlike NGramTokenFilter, this class sets offsets so that the characters between startOffset and endOffset in the original stream are the same as the term chars.
For example, with minGram=2 and maxGram=3, "abcde" would be tokenized as:

Term	ab	abc	bc	bcd	cd	cde	de
Position increment	1	1	1	1	1	1	1
Position length	1	1	1	1	1	1	1
Offsets	[0,2[	[0,3[	[1,3[	[1,4[	[2,4[	[2,5[	[3,5[
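The terms and offsets in the example can be reproduced with a small plain-Java model. This is only a sketch of the emission order, not Lucene's implementation: the class name `NGramOrderSketch` and its `ngrams` method are hypothetical, and it indexes by Java chars, so it assumes input without surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java model of the emission order; it does not use Lucene.
class NGramOrderSketch {

    // Returns "term [start,end[" strings, ordered by increasing start offset,
    // then by increasing length. Assumes the text contains no surrogate pairs.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                int end = start + len;
                out.add(text.substring(start, end) + " [" + start + "," + end + "[");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints one gram per line, in the same order as the Term row above.
        ngrams("abcde", 2, 3).forEach(System.out::println);
    }
}
```

Each printed offset pair delimits exactly the chars of its term in the input, which is the property the description attributes to this tokenizer.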

This tokenizer changed a lot in Lucene 4.4 in order to:
- tokenize in a streaming fashion to support streams which are larger than 1024 chars (the limit of the previous version),
- count grams based on Unicode code points instead of Java chars (and never split in the middle of surrogate pairs),
- give the ability to pre-tokenize the stream before computing n-grams.

Additionally, this class doesn't trim trailing whitespace and emits tokens in a different order: tokens are now emitted by increasing start offsets, whereas they used to be emitted by increasing lengths (which prevented support for large input streams).
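To illustrate what counting grams by code points means, here is a minimal plain-Java sketch. It does not use Lucene, and the class name `CodePointGramSketch` and its `grams` method are hypothetical. It builds grams of a fixed number of code points, so a supplementary character (one code point stored as two Java chars) is never split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java model of code-point-based gram counting; not Lucene code.
class CodePointGramSketch {

    // Returns all grams made of exactly n code points, by increasing start position.
    static List<String> grams(String text, int n) {
        int[] codePoints = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int start = 0; start + n <= codePoints.length; start++) {
            // String(int[], int, int) re-encodes code points, keeping surrogate pairs whole.
            out.add(new String(codePoints, start, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (one code point, two Java chars), 'b'
        String text = "a\uD83D\uDE00b";
        for (String gram : grams(text, 2)) {
            System.out.println(gram + " chars=" + gram.length()
                    + " codePoints=" + gram.codePointCount(0, gram.length()));
        }
    }
}
```

Both grams contain two code points but three Java chars; a char-based `substring(0, 2)` on the same input would instead cut the surrogate pair in half.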

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.State

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`DEFAULT_MAX_NGRAM_SIZE`
`static int`	`DEFAULT_MIN_NGRAM_SIZE`
- Fields inherited from class org.apache.lucene.analysis.Tokenizer
  input
- Fields inherited from class org.apache.lucene.analysis.TokenStream
  DEFAULT_TOKEN_ATTRIBUTE_FACTORY

Constructor Summary

Constructors
Constructor and Description
`NGramTokenizer()` Creates NGramTokenizer with default min and max n-grams.
`NGramTokenizer(AttributeFactory factory, int minGram, int maxGram)` Creates NGramTokenizer with given min and max n-grams.
`NGramTokenizer(int minGram, int maxGram)` Creates NGramTokenizer with given min and max n-grams.

Method Summary

Modifier and Type	Method and Description
`void`	`end()`
`boolean`	`incrementToken()`
`protected boolean`	`isTokenChar(int chr)` Only collect characters which satisfy this condition.
`void`	`reset()`

Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - DEFAULT_MIN_NGRAM_SIZE
```
public static final int DEFAULT_MIN_NGRAM_SIZE
```
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_NGRAM_SIZE
```
public static final int DEFAULT_MAX_NGRAM_SIZE
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - NGramTokenizer
```
public NGramTokenizer(int minGram,
                      int maxGram)
```
    Creates NGramTokenizer with given min and max n-grams.
    
    Parameters:
    
    minGram - the smallest n-gram to generate
    
    maxGram - the largest n-gram to generate
  - NGramTokenizer
```
public NGramTokenizer(AttributeFactory factory,
                      int minGram,
                      int maxGram)
```
    Creates NGramTokenizer with given min and max n-grams.
    
    Parameters:
    
    factory - AttributeFactory to use
    
    minGram - the smallest n-gram to generate
    
    maxGram - the largest n-gram to generate
  - NGramTokenizer
```
public NGramTokenizer()
```
    Creates NGramTokenizer with default min and max n-grams.
- Method Detail
  - incrementToken
```
public final boolean incrementToken()
                             throws IOException
```
    Specified by:
    
    incrementToken in class TokenStream
    
    Throws:
    
    IOException
  - isTokenChar
```
protected boolean isTokenChar(int chr)
```
    Only collect characters which satisfy this condition.
  - end
```
public final void end()
               throws IOException
```
    Overrides:
    
    end in class TokenStream
    
    Throws:
    
    IOException
  - reset
```
public final void reset()
                 throws IOException
```
    Overrides:
    
    reset in class Tokenizer
    
    Throws:
    
    IOException
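
The effect of the `isTokenChar` condition on the generated grams can be modelled in plain Java. The sketch below does not use Lucene: the class name `PreTokenizeSketch`, the `grams` method and the `IntPredicate` parameter are hypothetical stand-ins for a subclass that overrides `isTokenChar(int chr)`. It assumes that a gram is never emitted if it would contain a character rejected by the condition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical plain-Java model of pre-tokenization; not Lucene code.
class PreTokenizeSketch {

    // Emits grams of minGram..maxGram code points, by increasing start position,
    // skipping any gram that would include a code point rejected by isTokenChar.
    static List<String> grams(String text, int minGram, int maxGram, IntPredicate isTokenChar) {
        int[] codePoints = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int start = 0; start < codePoints.length; start++) {
            for (int len = 1; len <= maxGram && start + len <= codePoints.length; len++) {
                if (!isTokenChar.test(codePoints[start + len - 1])) {
                    break; // longer grams from this start would also cross the rejected char
                }
                if (len >= minGram) {
                    out.add(new String(codePoints, start, len));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Only letters and digits count as token chars, so no gram spans the space.
        System.out.println(grams("ab cd", 2, 3, Character::isLetterOrDigit));
        // Accepting every char keeps the space inside grams.
        System.out.println(grams("ab cd", 2, 3, chr -> true));
    }
}
```

With a letters-and-digits condition, "ab cd" yields only `ab` and `cd`; with a condition that accepts every character, grams such as `b c` that span the space are produced as well.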
