Class ICUTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.icu.segmentation.ICUTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
Breaks text into words according to UAX #29: Unicode Text Segmentation
(http://www.unicode.org/reports/tr29/).
Words are broken across script boundaries, then segmented according to the BreakIterator and
typing provided by the ICUTokenizerConfig.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
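The two-stage behaviour described above (split at script boundaries, then segment each run with a BreakIterator) can be sketched with JDK classes alone. This is an illustration only and not ICUTokenizer's implementation: it assumes java.text.BreakIterator and Character.UnicodeScript as stand-ins for ICU4J and ICUTokenizerConfig, and the class and method names (ScriptAwareWordSketch, scriptRuns, words) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: JDK stand-ins, not the Lucene/ICU4J implementation.
final class ScriptAwareWordSketch {

    /** Stage 1: split text into runs that each contain a single script. */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        Character.UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // COMMON (spaces, punctuation) and INHERITED (combining marks)
            // stay attached to whatever run is currently open.
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    runs.add(text.substring(start, i));
                    start = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    /** Stage 2: segment each run with a word BreakIterator, keeping word-like pieces. */
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String run : scriptRuns(text)) {
            BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
            it.setText(run);
            int begin = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; begin = end, end = it.next()) {
                String piece = run.substring(begin, end);
                // Drop pieces made only of whitespace or punctuation.
                if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    out.add(piece);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Latin "abc" immediately followed by three Cyrillic letters, no space between.
        System.out.println(words("abc\u0430\u0431\u0432"));
    }
}
```

In this sketch the script split is what separates the Latin and Cyrillic letters in the sample string, since no whitespace divides them. The real tokenizer additionally assigns token types and uses the rule sets supplied by its ICUTokenizerConfig, which the sketch does not attempt to model.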
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor - Description
ICUTokenizer() - Construct a new ICUTokenizer that breaks text into words from the given Reader.
ICUTokenizer(ICUTokenizerConfig config) - Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
ICUTokenizer(AttributeFactory factory, ICUTokenizerConfig config) - Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
-
Method Summary
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
ICUTokenizer
public ICUTokenizer()
Construct a new ICUTokenizer that breaks text into words from the given Reader.
The default script-specific handling is used.
The default attribute factory is used.
-
ICUTokenizer
public ICUTokenizer(ICUTokenizerConfig config)
Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
The default attribute factory is used.
- Parameters:
config - Tailored BreakIterator configuration
-
ICUTokenizer
public ICUTokenizer(AttributeFactory factory, ICUTokenizerConfig config)
Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
- Parameters:
factory - AttributeFactory to use
config - Tailored BreakIterator configuration
-
-
Method Details
-
incrementToken
public boolean incrementToken() throws IOException
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides:
reset in class Tokenizer
- Throws:
IOException
-
end
public void end() throws IOException
- Overrides:
end in class TokenStream
- Throws:
IOException
-