org.apache.lucene.analysis.ngram
Class NGramTokenizer
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.Tokenizer
        org.apache.lucene.analysis.ngram.NGramTokenizer
- All Implemented Interfaces:
- Closeable
- Direct Known Subclasses:
- EdgeNGramTokenizer
public class NGramTokenizer extends Tokenizer
Tokenizes the input into n-grams of the given size(s).
Unlike NGramTokenFilter, this class sets offsets so that the characters between startOffset and endOffset in the original stream are the same as the term chars.
For example, "abcde" would be tokenized as (minGram=2, maxGram=3):
Term               | ab    | abc   | bc    | bcd   | cd    | cde   | de
Position increment | 1     | 1     | 1     | 1     | 1     | 1     | 1
Position length    | 1     | 1     | 1     | 1     | 1     | 1     | 1
Offsets            | [0,2[ | [0,3[ | [1,3[ | [1,4[ | [2,4[ | [2,5[ | [3,5[
This tokenizer changed a lot in Lucene 4.4 in order to:
- tokenize in a streaming fashion to support streams larger than 1024 chars (the limit of the previous version),
- count grams based on unicode code points instead of java chars (and never split in the middle of surrogate pairs),
- give the ability to pre-tokenize the stream before computing n-grams.
Additionally, this class no longer trims trailing whitespace, and it emits tokens in a different order: tokens are now emitted by increasing start offset, whereas they used to be emitted by increasing length (an ordering that prevented support for large input streams).
Although highly discouraged, it is still possible to use the old behavior through Lucene43NGramTokenizer.
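To make the table above concrete, here is a minimal consumption sketch; the Version.LUCENE_44 constant and the demo class name are illustrative assumptions, not part of this class's API:

    import java.io.StringReader;

    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class NGramTokenizerDemo {
      public static void main(String[] args) throws Exception {
        NGramTokenizer tokenizer =
            new NGramTokenizer(Version.LUCENE_44, new StringReader("abcde"), 2, 3);
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);

        tokenizer.reset();                        // required before consuming
        while (tokenizer.incrementToken()) {
          // Emitted by increasing start offset: ab, abc, bc, bcd, cd, cde, de
          System.out.println(term.toString() + " ["
              + offset.startOffset() + "," + offset.endOffset() + "[");
        }
        tokenizer.end();                          // records the final offset
        tokenizer.close();
      }
    }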
Fields inherited from class org.apache.lucene.analysis.Tokenizer:
input

Methods inherited from class org.apache.lucene.util.AttributeSource:
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
DEFAULT_MIN_NGRAM_SIZE
public static final int DEFAULT_MIN_NGRAM_SIZE
DEFAULT_MAX_NGRAM_SIZE
public static final int DEFAULT_MAX_NGRAM_SIZE
NGramTokenizer
public NGramTokenizer(Version version,
Reader input,
int minGram,
int maxGram)
- Creates NGramTokenizer with given min and max n-grams.
- Parameters:
- version - the lucene compatibility version
- input - Reader holding the input to be tokenized
- minGram - the smallest n-gram to generate
- maxGram - the largest n-gram to generate
NGramTokenizer
public NGramTokenizer(Version version,
AttributeSource.AttributeFactory factory,
Reader input,
int minGram,
int maxGram)
- Creates NGramTokenizer with given min and max n-grams.
- Parameters:
- version - the lucene compatibility version
- factory - AttributeSource.AttributeFactory to use
- input - Reader holding the input to be tokenized
- minGram - the smallest n-gram to generate
- maxGram - the largest n-gram to generate
NGramTokenizer
public NGramTokenizer(Version version,
Reader input)
- Creates NGramTokenizer with default min and max n-grams.
- Parameters:
- version - the lucene compatibility version
- input - Reader holding the input to be tokenized
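A short sketch of how this constructor relates to the four-argument form, assuming it simply falls back to the DEFAULT_* constants above (imports as in the earlier sketch, plus org.apache.lucene.analysis.Tokenizer):

    // The two-argument form uses the default gram sizes:
    Tokenizer defaults =
        new NGramTokenizer(Version.LUCENE_44, new StringReader("abcde"));
    // which behaves like the explicit form:
    Tokenizer explicit =
        new NGramTokenizer(Version.LUCENE_44, new StringReader("abcde"),
            NGramTokenizer.DEFAULT_MIN_NGRAM_SIZE,
            NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE);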
incrementToken
public final boolean incrementToken()
throws IOException
- Specified by:
- incrementToken in class TokenStream
- Throws:
IOException
isTokenChar
protected boolean isTokenChar(int chr)
- Only collect characters which satisfy this condition.
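The pre-tokenization hook mentioned in the class overview works by overriding this method. A minimal sketch, assuming a letter-only predicate (the anonymous subclass is illustrative, with imports as in the earlier sketch):

    // Sketch: only letters contribute to grams, so no n-gram spans a space,
    // digit or punctuation character.
    Tokenizer letterGrams =
        new NGramTokenizer(Version.LUCENE_44, new StringReader("ab cd"), 2, 2) {
          @Override
          protected boolean isTokenChar(int chr) {
            return Character.isLetter(chr);  // non-letters act as boundaries
          }
        };
    // With minGram=maxGram=2 this yields "ab" and "cd", never a gram
    // containing the space.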
end
public final void end()
throws IOException
- Overrides:
- end in class TokenStream
- Throws:
IOException
reset
public final void reset()
throws IOException
- Overrides:
- reset in class TokenStream
- Throws:
IOException
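reset() must be called before the first incrementToken() on every new input, and end() after the last one. A minimal reuse sketch, assuming the Lucene 4.x Tokenizer.setReader(Reader) idiom and the tokenizer from the earlier sketch:

    tokenizer.setReader(new StringReader("first input"));
    tokenizer.reset();                     // before the first incrementToken()
    while (tokenizer.incrementToken()) { /* read attributes */ }
    tokenizer.end();                       // after the last incrementToken()

    tokenizer.setReader(new StringReader("second input"));  // reuse the instance
    tokenizer.reset();
    while (tokenizer.incrementToken()) { /* read attributes */ }
    tokenizer.end();

    tokenizer.close();                     // release resources once fully done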
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.