NGramTokenizer (Lucene 4.10.3 API)

java.lang.Object
- org.apache.lucene.util.AttributeSource
  - org.apache.lucene.analysis.TokenStream
    - org.apache.lucene.analysis.Tokenizer
      - org.apache.lucene.analysis.ngram.NGramTokenizer

All Implemented Interfaces:

Closeable, AutoCloseable

Direct Known Subclasses:

EdgeNGramTokenizer
```
public class NGramTokenizer
extends Tokenizer
```
Tokenizes the input into n-grams of the given size(s).
Unlike NGramTokenFilter, this class sets offsets so that the characters between startOffset and endOffset in the original stream are the same as the term chars.
For example, with minGram=2 and maxGram=3, "abcde" is tokenized as:

Term	ab	abc	bc	bcd	cd	cde	de
Position increment	1	1	1	1	1	1	1
Position length	1	1	1	1	1	1	1
Offsets	[0,2[	[0,3[	[1,3[	[1,4[	[2,4[	[2,5[	[3,5[
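The table above can be reproduced with a few lines of plain Java. This is only a minimal sketch of the term/offset relationship, not the Lucene implementation: the class `NGramOffsetsSketch` and its method are hypothetical names, and the sketch assumes it is acceptable to index by Java `char` rather than by code point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates terms and offsets only.
public class NGramOffsetsSketch {

    /**
     * Returns one entry per n-gram, formatted as "term [start,end[",
     * ordered by increasing start offset and then by increasing length.
     * Assumption: positions are Java char indices, which is a simplification.
     */
    public static List<String> ngramsWithOffsets(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = minGram; n <= maxGram && start + n <= text.length(); n++) {
                int end = start + n;
                // The term is exactly the slice of the input between the two offsets.
                out.add(text.substring(start, end) + " [" + start + "," + end + "[");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String entry : ngramsWithOffsets("abcde", 2, 3)) {
            System.out.println(entry);
        }
    }
}
```

Each printed entry pairs a term with its half-open offset range, so `text.substring(start, end)` always equals the term, which is the property described for this tokenizer.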

This tokenizer changed a lot in Lucene 4.4 in order to:
- give the ability to pre-tokenize the stream (see isTokenChar(int)) before computing n-grams.
Additionally, this class doesn't trim trailing whitespace and emits tokens in a different order: tokens are now emitted by increasing start offsets, whereas they used to be emitted by increasing lengths (which prevented support for large input streams).
Although highly discouraged, it is still possible to get the old behavior through Lucene43NGramTokenizer.
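To make the change in emission order concrete, the following sketch contrasts the two orderings on plain strings. It is an illustration under stated assumptions, not code from either tokenizer: `NGramOrderSketch`, `byStartOffset`, and `byLength` are hypothetical names, and whitespace trimming in the old tokenizer is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: compares the two emission orders.
public class NGramOrderSketch {

    /** Order described for Lucene 4.4 and later: increasing start offset. */
    public static List<String> byStartOffset(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            // Only the next maxGram characters are needed at each step.
            for (int n = minGram; n <= maxGram && start + n <= text.length(); n++) {
                out.add(text.substring(start, start + n));
            }
        }
        return out;
    }

    /** Order described for the older behavior: increasing gram length. */
    public static List<String> byLength(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Every length requires another pass over the entire input.
            for (int start = 0; start + n <= text.length(); start++) {
                out.add(text.substring(start, start + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(byStartOffset("abcde", 2, 3));
        System.out.println(byLength("abcde", 2, 3));
    }
}
```

The length-first loop revisits the whole input once per gram size, which suggests why that order is awkward for large streams, whereas the offset-first loop only ever looks `maxGram` characters ahead of the current position.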

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.State

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`DEFAULT_MAX_NGRAM_SIZE`
`static int`	`DEFAULT_MIN_NGRAM_SIZE`
- Fields inherited from class org.apache.lucene.analysis.Tokenizer
  input
- Fields inherited from class org.apache.lucene.analysis.TokenStream
  DEFAULT_TOKEN_ATTRIBUTE_FACTORY
- Fields inherited from class org.apache.lucene.util.AttributeSource
  DEFAULT_ATTRIBUTE_FACTORY

Constructor Summary

Constructors
Constructor and Description
`NGramTokenizer(AttributeFactory factory, Reader input, int minGram, int maxGram)` Creates NGramTokenizer with given min and max n-grams.
`NGramTokenizer(Reader input, int minGram, int maxGram)` Creates NGramTokenizer with given min and max n-grams.
`NGramTokenizer(Version version, AttributeFactory factory, Reader input, int minGram, int maxGram)` Deprecated. For `Version.LUCENE_4_3_0` and before, use `Lucene43NGramTokenizer`, otherwise use `NGramTokenizer(AttributeFactory, Reader, int, int)`
`NGramTokenizer(Version version, Reader input)` Creates NGramTokenizer with default min and max n-grams.
`NGramTokenizer(Version version, Reader input, int minGram, int maxGram)` Deprecated. For `Version.LUCENE_4_3_0` and before, use `Lucene43NGramTokenizer`, otherwise use `NGramTokenizer(Reader, int, int)`

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`end()`
`boolean`	`incrementToken()`
`protected boolean`	`isTokenChar(int chr)` Only collect characters which satisfy this condition.
`void`	`reset()`

Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - DEFAULT_MIN_NGRAM_SIZE
```
public static final int DEFAULT_MIN_NGRAM_SIZE
```
    See Also:
    Constant Field Values
  - DEFAULT_MAX_NGRAM_SIZE
```
public static final int DEFAULT_MAX_NGRAM_SIZE
```
    See Also:
    Constant Field Values
- Constructor Detail
  - NGramTokenizer
```
public NGramTokenizer(Reader input,
              int minGram,
              int maxGram)
```
    Creates NGramTokenizer with given min and max n-grams.
    
    Parameters:
    input - Reader holding the input to be tokenized
    minGram - the smallest n-gram to generate
    maxGram - the largest n-gram to generate
  - NGramTokenizer
```
@Deprecated
public NGramTokenizer(Version version,
                         Reader input,
                         int minGram,
                         int maxGram)
```
    Deprecated. For Version.LUCENE_4_3_0 and before, use Lucene43NGramTokenizer, otherwise use NGramTokenizer(Reader, int, int)
  - NGramTokenizer
```
public NGramTokenizer(AttributeFactory factory,
              Reader input,
              int minGram,
              int maxGram)
```
    Creates NGramTokenizer with given min and max n-grams.
    
    Parameters:
    factory - AttributeFactory to use
    input - Reader holding the input to be tokenized
    minGram - the smallest n-gram to generate
    maxGram - the largest n-gram to generate
  - NGramTokenizer
```
@Deprecated
public NGramTokenizer(Version version,
                         AttributeFactory factory,
                         Reader input,
                         int minGram,
                         int maxGram)
```
    Deprecated. For Version.LUCENE_4_3_0 and before, use Lucene43NGramTokenizer, otherwise use NGramTokenizer(AttributeFactory, Reader, int, int)
  - NGramTokenizer
```
public NGramTokenizer(Version version,
              Reader input)
```
    Creates NGramTokenizer with default min and max n-grams.
    
    Parameters:
    input - Reader holding the input to be tokenized
- Method Detail
  - incrementToken
```
public final boolean incrementToken()
                             throws IOException
```
    Specified by:
    
    incrementToken in class TokenStream
    
    Throws:
    
    IOException
  - isTokenChar
```
protected boolean isTokenChar(int chr)
```
    Only collect characters which satisfy this condition.
  - end
```
public final void end()
               throws IOException
```
    Overrides:
    
    end in class TokenStream
    
    Throws:
    
    IOException
  - reset
```
public final void reset()
                 throws IOException
```
    Overrides:
    
    reset in class Tokenizer
    
    Throws:
    
    IOException
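
The effect of isTokenChar(int) on the generated n-grams can be sketched without Lucene. In the real class one would presumably override the protected method in a subclass; the example below instead takes the condition as an `IntPredicate` and assumes that n-grams never span a rejected character. The class `PreTokenizedNGramSketch` and its method are hypothetical, and the sketch indexes by Java `char` for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of Lucene: mimics pre-tokenization by a character test.
public class PreTokenizedNGramSketch {

    /**
     * Splits the text into runs of characters accepted by isTokenChar and
     * returns the n-grams of each run, ordered by increasing start offset.
     */
    public static List<String> ngrams(String text, int minGram, int maxGram, IntPredicate isTokenChar) {
        List<String> out = new ArrayList<>();
        int runStart = 0;
        while (runStart < text.length()) {
            // Skip characters rejected by the predicate.
            if (!isTokenChar.test(text.charAt(runStart))) {
                runStart++;
                continue;
            }
            // Find the end of the current run of accepted characters.
            int runEnd = runStart;
            while (runEnd < text.length() && isTokenChar.test(text.charAt(runEnd))) {
                runEnd++;
            }
            // Emit n-grams that lie entirely inside this run.
            for (int start = runStart; start < runEnd; start++) {
                for (int n = minGram; n <= maxGram && start + n <= runEnd; n++) {
                    out.add(text.substring(start, start + n));
                }
            }
            runStart = runEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        // Only letters are collected, so no n-gram crosses the space.
        System.out.println(ngrams("ab cd", 2, 3, Character::isLetter));
        // Accepting every character lets n-grams include the space.
        System.out.println(ngrams("ab cd", 2, 3, c -> true));
    }
}
```

With a letters-only condition the space acts as a boundary, while a condition that accepts every character produces n-grams containing whitespace.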
