org.apache.lucene.analysis.icu.segmentation.ICUTokenizer

All Implemented Interfaces: Closeable, AutoCloseable

public final class ICUTokenizer extends Tokenizer

Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).
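As a rough illustration of rule-based word segmentation, the sketch below uses the JDK's own java.text.BreakIterator rather than the ICU BreakIterator that this class relies on. The class name WordBreakSketch and the helper words are hypothetical, and the JDK rules are only an approximation of what ICUTokenizer produces; segments without a letter or digit (punctuation, whitespace) are skipped here by a simple filter, which is an assumption of this sketch and not the tokenizer's actual typing logic.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in: JDK word breaking, not ICUTokenizer itself.
class WordBreakSketch {

    /** Returns the segments between word boundaries that contain a letter or digit. */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Drop segments made up only of punctuation or whitespace.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```

The loop walks consecutive boundary pairs, which is the same general shape a tokenizer uses to turn boundary positions into tokens with start and end offsets.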

Text is first split at script boundaries, then each run is segmented according to the BreakIterator and token typing provided by the ICUTokenizerConfig.
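The idea of splitting at script boundaries can be sketched with the JDK's Character.UnicodeScript. This is a simplified illustration, not the tokenizer's implementation: the class ScriptRunSketch and the method scriptRuns are hypothetical names, and the handling of COMMON and INHERITED characters (attaching them to the current run) is an assumption made to keep the example short.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits text into runs of a single Unicode script.
class ScriptRunSketch {

    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        UnicodeScript current = null; // script of the run being built, if known
        int runStart = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            // Spaces, digits, punctuation and combining marks carry no script of
            // their own, so they stay in whatever run is currently open.
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    runs.add(text.substring(runStart, i)); // close the previous run
                    runStart = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            runs.add(text.substring(runStart));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin letters followed by two Han characters.
        System.out.println(scriptRuns("hello\u4e16\u754c"));
    }
}
```

In a design like this, each run could then be handed to a script-specific break iterator, which is the role the ICUTokenizerConfig plays for this class.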

See Also:

ICUTokenizerConfig

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary

Constructors

ICUTokenizer() - Construct a new ICUTokenizer that breaks text from its input Reader (supplied via setReader) into words.

ICUTokenizer(ICUTokenizerConfig config) - Construct a new ICUTokenizer that breaks text from its input Reader into words, using a tailored BreakIterator configuration.

ICUTokenizer(AttributeFactory factory, ICUTokenizerConfig config) - Construct a new ICUTokenizer that breaks text from its input Reader into words, using a tailored BreakIterator configuration and the given AttributeFactory.
Method Summary

void end()

boolean incrementToken()

void reset()

Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Constructor Details
- ICUTokenizer
  
  public ICUTokenizer()
  
  Construct a new ICUTokenizer that breaks text from its input Reader (supplied via setReader) into words.
  The default script-specific handling is used.
  The default attribute factory is used.
  See Also:
  
  DefaultICUTokenizerConfig
- ICUTokenizer
  
  public ICUTokenizer(ICUTokenizerConfig config)
  
  Construct a new ICUTokenizer that breaks text from its input Reader (supplied via setReader) into words, using a tailored BreakIterator configuration.
  The default attribute factory is used.
  
  Parameters:
  
  config - Tailored BreakIterator configuration
- ICUTokenizer
  
  public ICUTokenizer(AttributeFactory factory, ICUTokenizerConfig config)
  
  Construct a new ICUTokenizer that breaks text from its input Reader (supplied via setReader) into words, using a tailored BreakIterator configuration.
  
  Parameters:
  
  factory - AttributeFactory to use
  
  config - Tailored BreakIterator configuration
Method Details
- incrementToken
  
  public boolean incrementToken() throws IOException
  
  Specified by:
  
  incrementToken in class TokenStream
  
  Throws:
  
  IOException
- reset
  
  public void reset() throws IOException
  
  Overrides:
  
  reset in class Tokenizer
  
  Throws:
  
  IOException
- end
  
  public void end() throws IOException
  
  Overrides:
  
  end in class TokenStream
  
  Throws:
  
  IOException
