Class LetterTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable

public class LetterTokenizer extends CharTokenizer
A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as determined by the java.lang.Character.isLetter() predicate.

Note: this works reasonably well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.
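
The splitting rule described above can be illustrated without any Lucene dependency. The following is a minimal standalone sketch, not the library's implementation: the class name LetterSplitSketch and its tokenize method are hypothetical, and the sketch ignores attributes, offsets, and the maximum token length. It only shows how a "maximal run of letters" rule behaves when driven by Character.isLetter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
public class LetterSplitSketch {

    // Returns the maximal runs of adjacent letters in the input,
    // where "letter" means Character.isLetter(codePoint) is true.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetter(codePoint)) {
                // Still inside a run of letters: extend the current token.
                current.appendCodePoint(codePoint);
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            // Flush a token that runs to the end of the input.
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Apostrophes, hyphens, digits, and spaces all act as separators.
        System.out.println(tokenize("don't stop-words2go"));
        // prints [don, t, stop, words, go]

        // Three CJK ideographs with no separators form a single token,
        // which is the limitation mentioned in the note above.
        System.out.println(tokenize("\u6771\u4eac\u90fd").size());
        // prints 1
    }
}
```

Because every CJK ideograph counts as a letter, an unspaced sentence comes out as one long token under this rule, which is why a letter-based split is a poor fit for such text.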

  • Constructor Details

    • LetterTokenizer

      public LetterTokenizer()
      Construct a new LetterTokenizer.
    • LetterTokenizer

      public LetterTokenizer(AttributeFactory factory)
      Construct a new LetterTokenizer using a given AttributeFactory.
      Parameters:
      factory - the attribute factory to use for this Tokenizer
    • LetterTokenizer

      public LetterTokenizer(AttributeFactory factory, int maxTokenLen)
      Construct a new LetterTokenizer using a given AttributeFactory and maximum token length.
      Parameters:
      factory - the attribute factory to use for this Tokenizer
      maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024)
      Throws:
      IllegalArgumentException - if maxTokenLen is invalid.
  • Method Details