Package org.apache.lucene.analysis.core
Class UnicodeWhitespaceTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.util.CharTokenizer
org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer
All Implemented Interfaces:
Closeable, AutoCloseable
A UnicodeWhitespaceTokenizer is a tokenizer that divides text at whitespace, where whitespace is
defined by Unicode's WHITESPACE property. Each maximal run of adjacent non-whitespace characters forms a token.
For the Unicode version, see UnicodeProps.
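As a rough illustration of the splitting rule described above, the following sketch uses only the JDK and does not call Lucene. The class name WhiteSpaceSplitSketch and its tokens method are hypothetical. The sketch assumes that the JDK regex property \p{IsWhite_Space} is a close stand-in for the table referenced by UnicodeProps; the two may track different Unicode versions. It also ignores offsets, attributes, and the maximum token length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of Lucene.
final class WhiteSpaceSplitSketch {
    // One or more code points that carry the Unicode White_Space property
    // (assumption: the JDK's table approximates the one used by UnicodeProps).
    private static final Pattern WHITE_SPACE_RUN = Pattern.compile("\\p{IsWhite_Space}+");

    // Returns the runs of non-whitespace characters, in order of appearance.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : WHITE_SPACE_RUN.split(text)) {
            // split() yields a leading empty string when the text starts with whitespace.
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // U+00A0 (no-break space) and U+2003 (em space) both have White_Space.
        System.out.println(tokens("alpha\u00A0beta\u2003gamma  delta"));
        // prints [alpha, beta, gamma, delta]
    }
}
```

In this sketch the no-break space separates tokens because it carries the White_Space property, even though it is not an ASCII space.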
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary
Fields inherited from class org.apache.lucene.analysis.util.CharTokenizer
DEFAULT_MAX_WORD_LEN
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary
Constructor and Description
UnicodeWhitespaceTokenizer()
Construct a new UnicodeWhitespaceTokenizer.
UnicodeWhitespaceTokenizer(AttributeFactory factory)
Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory.
UnicodeWhitespaceTokenizer(AttributeFactory factory, int maxTokenLen)
Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory.
Method Summary
Modifier and Type, Method, and Description
protected boolean isTokenChar(int c)
Collects only characters which do not satisfy Unicode's WHITESPACE property.
Methods inherited from class org.apache.lucene.analysis.util.CharTokenizer
end, fromSeparatorCharPredicate, fromSeparatorCharPredicate, fromTokenCharPredicate, fromTokenCharPredicate, incrementToken, reset
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Constructor Details
UnicodeWhitespaceTokenizer
public UnicodeWhitespaceTokenizer()
Construct a new UnicodeWhitespaceTokenizer.
UnicodeWhitespaceTokenizer
public UnicodeWhitespaceTokenizer(AttributeFactory factory)
Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory.
Parameters:
factory - the attribute factory to use for this Tokenizer
UnicodeWhitespaceTokenizer
public UnicodeWhitespaceTokenizer(AttributeFactory factory, int maxTokenLen)
Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory.
Parameters:
factory - the attribute factory to use for this Tokenizer
maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024)
Throws:
IllegalArgumentException - if maxTokenLen is invalid.
Method Details
isTokenChar
protected boolean isTokenChar(int c)
Collects only characters which do not satisfy Unicode's WHITESPACE property.
Specified by:
isTokenChar in class CharTokenizer
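To make the token-character test more concrete, here is a minimal JDK-only sketch of such a predicate. It is not Lucene's implementation: the class TokenCharSketch and both static methods are hypothetical, and the sketch assumes that White_Space can be approximated as the Zs, Zl, and Zp general categories plus U+0009 to U+000D and U+0085. The exact membership depends on the Unicode version of the running JDK.

```java
// Hypothetical stand-alone sketch; not part of Lucene.
final class TokenCharSketch {
    // Approximates Unicode's White_Space property (assumption: separator
    // categories plus the control characters U+0009..U+000D and U+0085).
    static boolean isUnicodeWhiteSpace(int c) {
        int type = Character.getType(c);
        return type == Character.SPACE_SEPARATOR
            || type == Character.LINE_SEPARATOR
            || type == Character.PARAGRAPH_SEPARATOR
            || (c >= 0x09 && c <= 0x0D)
            || c == 0x85;
    }

    // A code point belongs to a token when it is not whitespace.
    static boolean isTokenChar(int c) {
        return !isUnicodeWhiteSpace(c);
    }

    public static void main(String[] args) {
        int noBreakSpace = 0x00A0;
        // The two definitions disagree on the no-break space.
        System.out.println(isUnicodeWhiteSpace(noBreakSpace));      // prints true
        System.out.println(Character.isWhitespace(noBreakSpace));   // prints false
        System.out.println(isTokenChar('a'));                       // prints true
    }
}
```

The comparison with Character.isWhitespace is included because that JDK method uses a different definition of whitespace, which excludes non-breaking spaces such as U+00A0.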