org.apache.lucene.analysis.ngram
Class NGramTokenizer
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.Tokenizer
        org.apache.lucene.analysis.ngram.NGramTokenizer
- All Implemented Interfaces:
- Closeable
- Direct Known Subclasses:
- EdgeNGramTokenizer
public class NGramTokenizer extends Tokenizer
Tokenizes the input into n-grams of the given size(s).
Unlike NGramTokenFilter, this class sets offsets so that the characters between startOffset and endOffset in the original stream are the same as the term chars.
For example, "abcde" would be tokenized as (minGram=2, maxGram=3):
Term               | ab    | abc   | bc    | bcd   | cd    | cde   | de
Position increment | 1     | 1     | 1     | 1     | 1     | 1     | 1
Position length    | 1     | 1     | 1     | 1     | 1     | 1     | 1
Offsets            | [0,2[ | [0,3[ | [1,3[ | [1,4[ | [2,4[ | [2,5[ | [3,5[
This tokenizer changed a lot in Lucene 4.4 in order to:
- tokenize in a streaming fashion to support streams larger than 1024 chars (the limit of the previous version),
- count grams based on unicode code points instead of java chars (and never split in the middle of surrogate pairs),
- give the ability to pre-tokenize the stream before computing n-grams.
Additionally, this class no longer trims trailing whitespace, and it emits tokens in a different order: tokens are now emitted by increasing start offset, whereas they used to be emitted by increasing length (an ordering that prevented support for large input streams).
Although highly discouraged, it is still possible to use the old behavior through Lucene43NGramTokenizer.
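To make the table above concrete, here is a minimal consumption sketch; the Version.LUCENE_44 constant and the demo class name are illustrative assumptions, not part of this class's API:

    import java.io.StringReader;

    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class NGramTokenizerDemo {
      public static void main(String[] args) throws Exception {
        NGramTokenizer tokenizer =
            new NGramTokenizer(Version.LUCENE_44, new StringReader("abcde"), 2, 3);
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);

        tokenizer.reset();                        // required before consuming
        while (tokenizer.incrementToken()) {
          // Emitted by increasing start offset: ab, abc, bc, bcd, cd, cde, de
          System.out.println(term.toString() + " ["
              + offset.startOffset() + "," + offset.endOffset() + "[");
        }
        tokenizer.end();                          // records the final offset
        tokenizer.close();
      }
    }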
Fields inherited from class org.apache.lucene.analysis.Tokenizer:
input

Methods inherited from class org.apache.lucene.util.AttributeSource:
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
DEFAULT_MIN_NGRAM_SIZE
public static final int DEFAULT_MIN_NGRAM_SIZE
DEFAULT_MAX_NGRAM_SIZE
public static final int DEFAULT_MAX_NGRAM_SIZE
NGramTokenizer
public NGramTokenizer(Version version,
Reader input,
int minGram,
int maxGram)
- Creates NGramTokenizer with given min and max n-grams.
- Parameters:
- version - the lucene compatibility version
- input - Reader holding the input to be tokenized
- minGram - the smallest n-gram to generate
- maxGram - the largest n-gram to generate
NGramTokenizer
public NGramTokenizer(Version version,
AttributeSource.AttributeFactory factory,
Reader input,
int minGram,
int maxGram)
- Creates NGramTokenizer with given min and max n-grams.
- Parameters:
- version - the lucene compatibility version
- factory - AttributeSource.AttributeFactory to use
- input - Reader holding the input to be tokenized
- minGram - the smallest n-gram to generate
- maxGram - the largest n-gram to generate
NGramTokenizer
public NGramTokenizer(Version version,
Reader input)
- Creates NGramTokenizer with default min and max n-grams.
- Parameters:
- version - the lucene compatibility version
- input - Reader holding the input to be tokenized
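A short sketch of how this constructor relates to the four-argument form, assuming it simply falls back to the DEFAULT_* constants above (imports as in the earlier sketch, plus org.apache.lucene.analysis.Tokenizer):

    // The two-argument form uses the default gram sizes:
    Tokenizer defaults =
        new NGramTokenizer(Version.LUCENE_44, new StringReader("abcde"));
    // which behaves like the explicit form:
    Tokenizer explicit =
        new NGramTokenizer(Version.LUCENE_44, new StringReader("abcde"),
            NGramTokenizer.DEFAULT_MIN_NGRAM_SIZE,
            NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE);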
incrementToken
public final boolean incrementToken()
throws IOException
- Specified by:
- incrementToken in class TokenStream
- Throws:
IOException
isTokenChar
protected boolean isTokenChar(int chr)
- Only collect characters which satisfy this condition.
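The pre-tokenization hook mentioned in the class overview works by overriding this method. A minimal sketch, assuming a letter-only predicate (the anonymous subclass is illustrative, with imports as in the earlier sketch):

    // Sketch: only letters contribute to grams, so no n-gram spans a space,
    // digit or punctuation character.
    Tokenizer letterGrams =
        new NGramTokenizer(Version.LUCENE_44, new StringReader("ab cd"), 2, 2) {
          @Override
          protected boolean isTokenChar(int chr) {
            return Character.isLetter(chr);  // non-letters act as boundaries
          }
        };
    // With minGram=maxGram=2 this yields "ab" and "cd", never a gram
    // containing the space.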
end
public final void end()
throws IOException
- Overrides:
- end in class TokenStream
- Throws:
IOException
reset
public final void reset()
throws IOException
- Overrides:
- reset in class TokenStream
- Throws:
IOException
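reset() must be called before the first incrementToken() on every new input, and end() after the last one. A minimal reuse sketch, assuming the Lucene 4.x Tokenizer.setReader(Reader) idiom and the tokenizer from the earlier sketch:

    tokenizer.setReader(new StringReader("first input"));
    tokenizer.reset();                     // before the first incrementToken()
    while (tokenizer.incrementToken()) { /* read attributes */ }
    tokenizer.end();                       // after the last incrementToken()

    tokenizer.setReader(new StringReader("second input"));  // reuse the instance
    tokenizer.reset();
    while (tokenizer.incrementToken()) { /* read attributes */ }
    tokenizer.end();

    tokenizer.close();                     // release resources once fully done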
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.