public class LetterTokenizer extends CharTokenizer
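A LetterTokenizer is a tokenizer that divides text at non-letters: it defines tokens as maximal strings of adjacent letters, as determined by Character.isLetter(int).
Note: this works well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.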
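The note about languages without spaces can be illustrated with a minimal plain-Java sketch. It does not use the Lucene classes; `LetterRunSketch` and `letterRuns` are hypothetical names, and the only assumption is that a token is a maximal run of code points for which `Character.isLetter(int)` returns true.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterRunSketch {
    /** Splits text into maximal runs of letters, as judged by Character.isLetter(int). */
    public static List<String> letterRuns(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Walk by code point so supplementary characters are not cut in half.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A non-letter ends the current run.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Apostrophes, hyphens, digits and spaces all act as separators.
        System.out.println(letterRuns("Don't split e-mail, 2 words!"));
        // Four CJK ideographs with no spaces between them (Unicode escapes).
        System.out.println(letterRuns("\u4e2d\u6587\u5206\u8bcd"));
    }
}
```

In this sketch the English sentence breaks into short word-like pieces, while the run of ideographs comes back as a single token because every character in it is a letter. That is the weakness the note describes.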
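Nested classes/interfaces inherited from class AttributeSource: AttributeSource.State
Fields inherited from class CharTokenizer: DEFAULT_MAX_WORD_LEN
Fields inherited from class TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY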
Constructor | Description |
---|---|
LetterTokenizer() | Construct a new LetterTokenizer. |
LetterTokenizer(AttributeFactory factory) | Construct a new LetterTokenizer using a given AttributeFactory. |
LetterTokenizer(AttributeFactory factory, int maxTokenLen) | Construct a new LetterTokenizer using a given AttributeFactory. |
Modifier and Type | Method and Description |
---|---|
protected boolean | isTokenChar(int c) Collects only characters which satisfy Character.isLetter(int). |
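As a rough illustration of the predicate in the method summary, the following standalone sketch applies `Character.isLetter(int)` to a few code points. `TokenCharSketch` is a hypothetical class that only mirrors the documented rule; it is not the Lucene source.

```java
public class TokenCharSketch {
    /** Mirrors the documented rule: a code point is a token character if it is a letter. */
    public static boolean isTokenChar(int c) {
        return Character.isLetter(c);
    }

    public static void main(String[] args) {
        System.out.println(isTokenChar('a'));     // ASCII letter
        System.out.println(isTokenChar(0x00E9));  // Latin small letter e with acute
        System.out.println(isTokenChar('7'));     // digit
        System.out.println(isTokenChar('_'));     // connector punctuation
        System.out.println(isTokenChar(0x1D400)); // supplementary code point: MATHEMATICAL BOLD CAPITAL A
    }
}
```

Taking an `int` code point rather than a `char` lets the predicate classify characters outside the Basic Multilingual Plane, which a single `char` cannot represent.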
Methods inherited from class CharTokenizer: end, fromSeparatorCharPredicate, fromSeparatorCharPredicate, fromTokenCharPredicate, fromTokenCharPredicate, incrementToken, reset
Methods inherited from class Tokenizer: close, correctOffset, setReader
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
public LetterTokenizer()
public LetterTokenizer(AttributeFactory factory)
Construct a new LetterTokenizer using a given AttributeFactory.
Parameters:
factory - the attribute factory to use for this Tokenizer
public LetterTokenizer(AttributeFactory factory, int maxTokenLen)
Construct a new LetterTokenizer using a given AttributeFactory.
Parameters:
factory - the attribute factory to use for this Tokenizer
maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024)
Throws:
IllegalArgumentException - if maxTokenLen is invalid.
protected boolean isTokenChar(int c)
Collects only characters which satisfy Character.isLetter(int).
Specified by:
isTokenChar in class CharTokenizer
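The maxTokenLen constraint on the three-argument constructor can be sketched in plain Java as follows. This is not the Lucene implementation: `MaxTokenLenSketch`, `checkMaxTokenLen` and `chunk` are hypothetical names, the bounds are read literally from the parameter description (the library may treat the upper boundary differently), and the chunking of over-long letter runs is an assumption about tokenizer behaviour that this page does not state.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxTokenLenSketch {
    // Value quoted in the maxTokenLen parameter description.
    static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

    /** Rejects lengths outside the documented range (hypothetical helper). */
    public static void checkMaxTokenLen(int maxTokenLen) {
        if (maxTokenLen <= 0 || maxTokenLen >= MAX_TOKEN_LENGTH_LIMIT) {
            throw new IllegalArgumentException(
                "maxTokenLen must be greater than 0 and less than "
                    + MAX_TOKEN_LENGTH_LIMIT + ", got " + maxTokenLen);
        }
    }

    /**
     * Assumption: a letter run longer than maxTokenLen is emitted in pieces of
     * at most maxTokenLen chars. For simplicity this counts UTF-16 chars and
     * expects input without supplementary characters.
     */
    public static List<String> chunk(String letters, int maxTokenLen) {
        checkMaxTokenLen(maxTokenLen);
        List<String> out = new ArrayList<>();
        for (int start = 0; start < letters.length(); start += maxTokenLen) {
            int end = Math.min(letters.length(), start + maxTokenLen);
            out.add(letters.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(chunk("tokenization", 5));
        try {
            checkMaxTokenLen(0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under these assumptions, a limit of 5 would turn the twelve-letter run "tokenization" into three pieces, and a limit of 0 is rejected before any text is processed.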
Copyright © 2000-2024 Apache Software Foundation. All Rights Reserved.