Package org.apache.lucene.analysis.core
Class LetterTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.util.CharTokenizer
org.apache.lucene.analysis.core.LetterTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines
tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter()
predicate.
Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
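The splitting rule described above can be sketched with plain JDK calls. This is a minimal illustration only, not Lucene's implementation: the class LetterSplitSketch and the method letterRuns are hypothetical names, and the sketch returns a list of strings instead of populating token attributes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: mimics the "maximal runs of letters" rule.
class LetterSplitSketch {

    // Returns every maximal run of code points for which Character.isLetter(int) is true.
    static List<String> letterRuns(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);      // still inside a run of letters
            } else if (current.length() > 0) {
                tokens.add(current.toString());   // a non-letter ends the current run
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());       // flush the final run
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Digits and punctuation act as separators.
        System.out.println(letterRuns("Hello, world! It's 2024 and C3PO beeps."));
        // prints [Hello, world, It, s, and, C, PO, beeps]

        // A six-character Japanese sentence written without spaces: every
        // character is a letter, so the whole sentence comes back as one token.
        System.out.println(letterRuns("\u6771\u4eac\u90fd\u306b\u4f4f\u3080").size());
        // prints 1
    }
}
```

The second call illustrates the note above: where words are not separated by non-letters, the rule has nothing to split on.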
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.util.CharTokenizer
DEFAULT_MAX_WORD_LEN
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor and Description
LetterTokenizer()
Construct a new LetterTokenizer.
LetterTokenizer(AttributeFactory factory)
Construct a new LetterTokenizer using a given AttributeFactory.
LetterTokenizer(AttributeFactory factory, int maxTokenLen)
Construct a new LetterTokenizer using a given AttributeFactory.
-
Method Summary
Modifier and Type, Method, Description
protected boolean isTokenChar(int c)
Collects only characters which satisfy Character.isLetter(int).
Methods inherited from class org.apache.lucene.analysis.util.CharTokenizer
end, fromSeparatorCharPredicate, fromSeparatorCharPredicate, fromTokenCharPredicate, fromTokenCharPredicate, incrementToken, reset
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
LetterTokenizer
public LetterTokenizer()
Construct a new LetterTokenizer.
-
LetterTokenizer
public LetterTokenizer(AttributeFactory factory)
Construct a new LetterTokenizer using a given AttributeFactory.
Parameters:
factory - the attribute factory to use for this Tokenizer
-
LetterTokenizer
public LetterTokenizer(AttributeFactory factory, int maxTokenLen)
Construct a new LetterTokenizer using a given AttributeFactory.
Parameters:
factory - the attribute factory to use for this Tokenizer
maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024)
Throws:
IllegalArgumentException - if maxTokenLen is invalid.
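To make the effect of maxTokenLen concrete, here is a small JDK-only sketch. It is not Lucene code: MaxTokenLenSketch and letterRuns are hypothetical names. It also assumes that a run of letters longer than maxTokenLen is cut into consecutive pieces rather than dropped, and it checks only the documented lower bound, leaving out the upper limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Lucene code: letter runs capped at maxTokenLen chars.
class MaxTokenLenSketch {

    static List<String> letterRuns(String text, int maxTokenLen) {
        if (maxTokenLen <= 0) {
            // Mirrors the documented lower bound; the upper limit check is omitted.
            throw new IllegalArgumentException(
                    "maxTokenLen must be greater than 0, got " + maxTokenLen);
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
                if (current.length() >= maxTokenLen) {
                    // Assumption: a full buffer is emitted and the run continues as a new token.
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else if (current.length() > 0) {
                tokens.add(current.toString());   // a non-letter ends the current run
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());       // flush the final run
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(letterRuns("Tokenizer test", 5));
        // prints [Token, izer, test]
    }
}
```

With a limit of 5, the nine-letter word is emitted as two tokens, while the shorter word passes through unchanged.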
-
-
Method Details
-
isTokenChar
protected boolean isTokenChar(int c)
Collects only characters which satisfy Character.isLetter(int).
Specified by:
isTokenChar in class CharTokenizer
-
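Because isTokenChar takes an int, the test applies to whole Unicode code points rather than to 16-bit char values. The sketch below uses only the JDK to show which code points Character.isLetter(int) accepts. The class name IsTokenCharSketch and the constant IS_TOKEN_CHAR are hypothetical and stand in for the method, which is protected and cannot be called from outside.

```java
import java.util.function.IntPredicate;

// Hypothetical stand-in for the protected method: the same JDK test as a predicate.
class IsTokenCharSketch {

    // Resolves to Character.isLetter(int), the code point overload.
    static final IntPredicate IS_TOKEN_CHAR = Character::isLetter;

    public static void main(String[] args) {
        System.out.println(IS_TOKEN_CHAR.test('a'));      // prints true
        System.out.println(IS_TOKEN_CHAR.test(0x00E9));   // LATIN SMALL LETTER E WITH ACUTE: prints true
        System.out.println(IS_TOKEN_CHAR.test(0x4E2D));   // a CJK ideograph: prints true
        System.out.println(IS_TOKEN_CHAR.test(0x1D400));  // MATHEMATICAL BOLD CAPITAL A, outside the BMP: prints true
        System.out.println(IS_TOKEN_CHAR.test('3'));      // digit: prints false
        System.out.println(IS_TOKEN_CHAR.test('_'));      // connector punctuation: prints false
        System.out.println(IS_TOKEN_CHAR.test('\''));     // apostrophe: prints false
    }
}
```

Digits, underscores, and apostrophes therefore end a token, which is why an input such as "It's" yields two tokens.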