Class UnicodeWhitespaceTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class UnicodeWhitespaceTokenizer
    extends CharTokenizer
    A UnicodeWhitespaceTokenizer is a tokenizer that divides text at whitespace, as defined by Unicode's WHITESPACE property. Each maximal run of adjacent non-whitespace characters forms one token.

    For the Unicode version used, see: UnicodeProps
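
    The splitting rule described above can be sketched with the JDK alone. This is a minimal illustration, not Lucene's implementation: the class UnicodeWhitespaceSketch and its methods are hypothetical, and the regex property \p{IsWhite_Space} stands in for the table in UnicodeProps, so results may differ where the JDK and Lucene track different Unicode versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the tokenizer's splitting rule; not part of Lucene.
final class UnicodeWhitespaceSketch {
    // Matches one code point that has the Unicode White_Space property
    // (as implemented by java.util.regex, assumed close to UnicodeProps).
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    // True for code points that belong inside a token.
    static boolean isTokenChar(int c) {
        return !WHITE_SPACE.matcher(new String(Character.toChars(c))).matches();
    }

    // Collects maximal runs of token characters, iterating by code point.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+00A0 (no-break space) and U+3000 (ideographic space) both split here.
        System.out.println(tokenize("foo\u00A0bar  baz\u3000qux")); // prints [foo, bar, baz, qux]
    }
}
```

    Note that java.lang.Character.isWhitespace returns false for U+00A0, so a splitter built on that method would keep "foo" and "bar" together in the input above.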

    • Constructor Detail

      • UnicodeWhitespaceTokenizer

        public UnicodeWhitespaceTokenizer()
        Construct a new UnicodeWhitespaceTokenizer.
      • UnicodeWhitespaceTokenizer

        public UnicodeWhitespaceTokenizer(AttributeFactory factory)
        Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory.
        Parameters:
        factory - the attribute factory to use for this Tokenizer
      • UnicodeWhitespaceTokenizer

        public UnicodeWhitespaceTokenizer(AttributeFactory factory,
                                          int maxTokenLen)
        Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory and maximum token length.
        Parameters:
        factory - the attribute factory to use for this Tokenizer
        maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024)
        Throws:
        IllegalArgumentException - if maxTokenLen is invalid.
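
        The maxTokenLen contract can be illustrated with a small JDK-only sketch. The class MaxTokenLenSketch and its methods are hypothetical. Two points are assumptions rather than statements from this page: the sketch accepts the limit itself (the text above says "less than"), and it emits an over-long run as consecutive chunks instead of dropping it. Both should be checked against the CharTokenizer source before relying on them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the maxTokenLen contract; not part of Lucene.
final class MaxTokenLenSketch {
    // Value taken from the parameter description: 1024*1024.
    static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

    // Rejects non-positive values and values above the limit.
    // Assumption: a value equal to the limit is accepted.
    static int checkMaxTokenLen(int maxTokenLen) {
        if (maxTokenLen <= 0 || maxTokenLen > MAX_TOKEN_LENGTH_LIMIT) {
            throw new IllegalArgumentException(
                "maxTokenLen must be greater than 0 and at most "
                    + MAX_TOKEN_LENGTH_LIMIT + ", passed: " + maxTokenLen);
        }
        return maxTokenLen;
    }

    // Assumption: a run longer than maxTokenLen becomes several tokens.
    // Lengths are counted in UTF-16 chars; surrogate pairs are ignored for brevity.
    static List<String> chunk(String run, int maxTokenLen) {
        checkMaxTokenLen(maxTokenLen);
        List<String> out = new ArrayList<>();
        for (int start = 0; start < run.length(); start += maxTokenLen) {
            int end = Math.min(run.length(), start + maxTokenLen);
            out.add(run.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(chunk("abcdefgh", 3)); // prints [abc, def, gh]
    }
}
```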
    • Method Detail

      • isTokenChar

        protected boolean isTokenChar(int c)
        Returns true for characters that do not have Unicode's WHITESPACE property, so only those characters are collected into tokens.
        Specified by:
        isTokenChar in class CharTokenizer