Class ICUTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class ICUTokenizer
    extends Tokenizer
    Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/)

    Words are first split at script boundaries. Each single-script run is then segmented according to the BreakIterator and token typing provided by the ICUTokenizerConfig.

    See Also:
    ICUTokenizerConfig
    WARNING: This API is experimental and might change in incompatible ways in the next release.
    • Constructor Detail

      • ICUTokenizer

        public ICUTokenizer()
        Construct a new ICUTokenizer that breaks text into words. The constructor takes no Reader; the input is supplied afterwards through setReader(Reader), inherited from Tokenizer.

        The default script-specific handling is used.

        The default attribute factory is used.

        See Also:
        DefaultICUTokenizerConfig
      • ICUTokenizer

        public ICUTokenizer(ICUTokenizerConfig config)
        Construct a new ICUTokenizer that breaks text into words, using a tailored BreakIterator configuration.

        The default attribute factory is used.

        Parameters:
        config - Tailored BreakIterator configuration
      • ICUTokenizer

        public ICUTokenizer(AttributeFactory factory,
                            ICUTokenizerConfig config)
        Construct a new ICUTokenizer that breaks text into words, using the given attribute factory and a tailored BreakIterator configuration.

        Parameters:
        factory - AttributeFactory to use
        config - Tailored BreakIterator configuration
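
    A minimal usage sketch for the no-argument constructor follows. It assumes the Lucene ICU analysis module is on the classpath and that ICUTokenizer lives in org.apache.lucene.analysis.icu.segmentation, which this page does not state. The class name ICUTokenizerUsageSketch and its tokenize helper are hypothetical. The call order (setReader, reset, incrementToken, end, close) is the usual TokenStream consumer workflow.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ICUTokenizerUsageSketch {

    /** Collects the term text of every token that ICUTokenizer emits for the input. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        // Default script-specific handling and default attribute factory.
        try (ICUTokenizer tokenizer = new ICUTokenizer()) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader(text)); // input is set after construction
            tokenizer.reset();                           // required before incrementToken()
            while (tokenizer.incrementToken()) {
                terms.add(term.toString());
            }
            tokenizer.end();                             // finalizes end-of-stream state
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world"));
    }
}
```

    The tokenizer does not change case, so any lowercasing or other normalization would have to come from a filter placed after it.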