Class UnicodeWhitespaceTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable

public final class UnicodeWhitespaceTokenizer extends CharTokenizer
A UnicodeWhitespaceTokenizer is a tokenizer that divides text at whitespace, as defined by Unicode's WHITESPACE property. Each maximal run of adjacent non-whitespace characters forms a token.

For the Unicode version in use, see UnicodeProps.
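The splitting rule described above can be approximated with the JDK alone. The following is only a sketch of the behavior, not the Lucene implementation: the class name UnicodeWhitespaceSketch and the method tokenize are hypothetical, the JDK's \p{IsWhite_Space} regex property stands in for UnicodeProps.WHITESPACE (the two may be built from different Unicode versions), and the maximum token length is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; it does not use or depend on Lucene.
public class UnicodeWhitespaceSketch {

    // The JDK regex engine exposes the Unicode White_Space binary property
    // as \p{IsWhite_Space}. Assumption: this is close enough to the property
    // table the tokenizer consults to illustrate the splitting rule.
    private static final Pattern WHITESPACE_RUN =
            Pattern.compile("\\p{IsWhite_Space}+");

    // Returns each maximal run of non-whitespace characters as one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : WHITESPACE_RUN.split(text)) {
            // split() yields an empty leading piece when the text starts
            // with whitespace; a token is never empty, so skip it.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+00A0 (no-break space) and U+2003 (em space) both carry the
        // White_Space property, so they separate tokens here.
        System.out.println(tokenize("foo\u00A0bar  baz\u2003qux"));
        // prints [foo, bar, baz, qux]
    }
}
```

Note that U+00A0 is not whitespace according to Character.isWhitespace, so a splitter built on that method would keep "foo" and "bar" joined as a single token.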

  • Constructor Details

    • UnicodeWhitespaceTokenizer

      public UnicodeWhitespaceTokenizer()
      Construct a new UnicodeWhitespaceTokenizer.
    • UnicodeWhitespaceTokenizer

      public UnicodeWhitespaceTokenizer(AttributeFactory factory)
      Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory.
      Parameters:
      factory - the attribute factory to use for this Tokenizer
    • UnicodeWhitespaceTokenizer

      public UnicodeWhitespaceTokenizer(AttributeFactory factory, int maxTokenLen)
      Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory and maximum token length.
      Parameters:
      factory - the attribute factory to use for this Tokenizer
      maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024)
      Throws:
      IllegalArgumentException - if maxTokenLen is invalid.
  • Method Details

    • isTokenChar

      protected boolean isTokenChar(int c)
      Returns true for characters that do not satisfy Unicode's WHITESPACE property, so only those characters are collected into tokens.
      Specified by:
      isTokenChar in class CharTokenizer
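As a rough illustration of the predicate that isTokenChar describes, the sketch below tests a single code point against the JDK's White_Space regex property. The class IsTokenCharSketch and its static method are hypothetical stand-ins written for this example. The actual method is a protected instance method that consults UnicodeProps.WHITESPACE, which is assumed here to agree with the JDK property for the code points shown.

```java
// Hypothetical illustration class; it does not use or depend on Lucene.
public class IsTokenCharSketch {

    // Stand-in for isTokenChar(int c): true when the code point does NOT
    // have the Unicode White_Space property, false when it does.
    static boolean isTokenChar(int c) {
        // Build a one-code-point string so supplementary characters work too.
        String s = new String(Character.toChars(c));
        return !s.matches("\\p{IsWhite_Space}");
    }

    public static void main(String[] args) {
        System.out.println(isTokenChar('a'));     // true: part of a token
        System.out.println(isTokenChar(' '));     // false: separates tokens
        System.out.println(isTokenChar(0x00A0));  // false: no-break space
        // For comparison, the JDK's own whitespace test excludes U+00A0:
        System.out.println(Character.isWhitespace(0x00A0)); // false
    }
}
```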