Package org.apache.lucene.analysis.core
Class WhitespaceTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.util.CharTokenizer
org.apache.lucene.analysis.core.WhitespaceTokenizer
All Implemented Interfaces:
Closeable, AutoCloseable
A tokenizer that divides text at whitespace characters as defined by Character.isWhitespace(int). Note: that definition explicitly excludes the non-breaking space. Adjacent sequences of non-whitespace characters form tokens.
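The splitting rule described above can be illustrated without Lucene. The following is a minimal sketch using only the JDK; the class name WhitespaceSplitSketch and its split method are hypothetical and are not part of Lucene. It mirrors the documented rule (maximal runs of code points for which Character.isWhitespace(int) is false) and ignores offsets, attributes, and the maximum token length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the documented rule; not Lucene's implementation.
public class WhitespaceSplitSketch {

    /** Returns maximal runs of code points that are not whitespace per Character.isWhitespace(int). */
    public static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                // A whitespace code point closes the current token, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Runs of spaces and tabs separate tokens: [quick, brown, fox]
        System.out.println(split("quick  brown\tfox"));
        // U+00A0 (non-breaking space) is not whitespace here, so "10\u00A0km" stays one token.
        System.out.println(split("10\u00A0km away").size()); // 2
    }
}
```

Because the non-breaking space is not treated as a separator, text copied from HTML or word processors may yield longer tokens than expected.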
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.util.CharTokenizer
DEFAULT_MAX_WORD_LEN
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor and Description
WhitespaceTokenizer()
- Construct a new WhitespaceTokenizer.
WhitespaceTokenizer(int maxTokenLen)
- Construct a new WhitespaceTokenizer using a given max token length.
WhitespaceTokenizer(AttributeFactory factory)
- Construct a new WhitespaceTokenizer using a given AttributeFactory.
WhitespaceTokenizer(AttributeFactory factory, int maxTokenLen)
- Construct a new WhitespaceTokenizer using a given AttributeFactory.
-
Method Summary
Modifier and Type, Method, and Description
protected boolean isTokenChar(int c)
- Collects only characters which do not satisfy Character.isWhitespace(int).
Methods inherited from class org.apache.lucene.analysis.util.CharTokenizer
end, fromSeparatorCharPredicate, fromSeparatorCharPredicate, fromTokenCharPredicate, fromTokenCharPredicate, incrementToken, reset
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
WhitespaceTokenizer
public WhitespaceTokenizer()
Construct a new WhitespaceTokenizer.
-
WhitespaceTokenizer
public WhitespaceTokenizer(AttributeFactory factory)
Construct a new WhitespaceTokenizer using a given AttributeFactory.
Parameters:
factory - the attribute factory to use for this Tokenizer
-
WhitespaceTokenizer
public WhitespaceTokenizer(int maxTokenLen)
Construct a new WhitespaceTokenizer using a given max token length.
Parameters:
maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024).
Throws:
IllegalArgumentException - if maxTokenLen is invalid.
-
WhitespaceTokenizer
public WhitespaceTokenizer(AttributeFactory factory, int maxTokenLen)
Construct a new WhitespaceTokenizer using a given AttributeFactory.
Parameters:
factory - the attribute factory to use for this Tokenizer
maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024).
Throws:
IllegalArgumentException - if maxTokenLen is invalid.
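As a usage illustration for the constructors above, here is a minimal sketch that builds a tokenizer with an explicit maximum token length and collects its tokens. It assumes the Lucene analysis module is on the classpath; the class name WhitespaceTokenizerUsage, the tokenize helper, and the value 255 are hypothetical choices for this example. The consume sequence (setReader, reset, incrementToken, end, close) uses methods listed as inherited on this page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class for illustration; requires Lucene on the classpath.
public class WhitespaceTokenizerUsage {

    /** Tokenizes text with a WhitespaceTokenizer limited to maxTokenLen characters per token. */
    public static List<String> tokenize(String text, int maxTokenLen) throws IOException {
        List<String> tokens = new ArrayList<>();
        // The tokenizer is Closeable, so try-with-resources releases it.
        try (WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(maxTokenLen)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();                      // must be called before incrementToken()
            while (tokenizer.incrementToken()) {
                tokens.add(term.toString());        // copy the term; the attribute is reused
            }
            tokenizer.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("hello wide world", 255)); // [hello, wide, world]
    }
}
```

An invalid maxTokenLen (for example 0) is rejected at construction time with an IllegalArgumentException, so the check happens before any text is read.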
-
-
Method Details
-
isTokenChar
protected boolean isTokenChar(int c)
Collects only characters which do not satisfy Character.isWhitespace(int).
Specified by:
isTokenChar in class CharTokenizer
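To make the predicate concrete, the sketch below evaluates the documented rule for a few code points using only the JDK. The class TokenCharSketch and its static isTokenChar are hypothetical stand-ins that mirror the description above; they are not Lucene's source.

```java
// Hypothetical stand-in that mirrors the documented rule; not Lucene's source.
public class TokenCharSketch {

    /** True when the code point would be kept as part of a token. */
    public static boolean isTokenChar(int c) {
        return !Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        // Letter, space, tab, non-breaking space, ideographic space, emoji (supplementary plane).
        int[] samples = {'a', ' ', '\t', 0x00A0, 0x3000, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %b%n", cp, isTokenChar(cp));
        }
    }
}
```

The parameter is an int code point rather than a char, so supplementary characters such as U+1F600 are classified as a whole instead of as two surrogates.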
-