Class ICUTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable

public final class ICUTokenizer extends Tokenizer
Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

Text is first split at script boundaries; each single-script run is then segmented into words by the BreakIterator, and the resulting tokens are typed, as provided by the ICUTokenizerConfig.
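
The two-stage idea described above (split at script boundaries, then run a word BreakIterator over each run) can be illustrated with a minimal sketch. This is not ICUTokenizer's implementation: it assumes the JDK's java.text.BreakIterator and Character.UnicodeScript as stand-ins for the ICU classes, and the class and method names (ScriptThenWordSketch, scriptRuns, words, tokenize) are hypothetical. Treating COMMON and INHERITED characters as part of the current run is likewise an assumption made for this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; uses JDK classes, not ICU or Lucene.
final class ScriptThenWordSketch {

    // Stage 1: split text into runs that share one Unicode script.
    // COMMON/INHERITED code points (spaces, digits, combining marks)
    // stay attached to the run they appear in (assumption of this sketch).
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        Character.UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    runs.add(text.substring(start, i)); // script changed: close the run
                    start = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    // Stage 2: segment one run with the JDK word BreakIterator and keep
    // only segments that contain at least one letter or digit.
    static List<String> words(String run) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(run);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = run.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    // Both stages combined.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String run : scriptRuns(text)) {
            tokens.addAll(words(run));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Latin letters directly followed by Cyrillic letters, then digits.
        System.out.println(tokenize("hello\u043c\u0438\u0440 123"));
    }
}
```

In this sketch the Latin and Cyrillic letters in the sample input end up in separate tokens even though no space separates them, because the script split happens before word segmentation. A word BreakIterator on its own would normally keep adjacent letters together.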

WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Constructor Details

    • ICUTokenizer

      public ICUTokenizer()
      Construct a new ICUTokenizer that breaks text into words. No Reader is passed to the constructor; input is supplied afterwards through setReader(Reader).

      The default script-specific handling is used.

      The default attribute factory is used.

    • ICUTokenizer

      public ICUTokenizer(ICUTokenizerConfig config)
      Construct a new ICUTokenizer that breaks text into words using a tailored BreakIterator configuration. No Reader is passed to the constructor; input is supplied afterwards through setReader(Reader).

      The default attribute factory is used.

      Parameters:
      config - Tailored BreakIterator configuration
    • ICUTokenizer

      public ICUTokenizer(AttributeFactory factory, ICUTokenizerConfig config)
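      Construct a new ICUTokenizer that breaks text into words using the given AttributeFactory and a tailored BreakIterator configuration. No Reader is passed to the constructor; input is supplied afterwards through setReader(Reader).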
      Parameters:
      factory - AttributeFactory to use
      config - Tailored BreakIterator configuration
  • Method Details