Class LetterTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable
    Direct Known Subclasses:
    LowerCaseTokenizer

    public class LetterTokenizer
    extends CharTokenizer
    A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate.
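
    The rule described above can be illustrated with a minimal, JDK-only sketch. It is not the Lucene implementation: the class name LetterSplitSketch and the method tokenize are hypothetical, and the sketch only mirrors the stated rule (maximal runs of characters for which Character.isLetter returns true).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "divide text at non-letters"; not Lucene code.
class LetterSplitSketch {

    // Returns the maximal runs of adjacent letters found in the input.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                // Letters extend the current token.
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any non-letter ends the current token, if one is open.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Digits, punctuation, and the apostrophe all act as separators.
        System.out.println(tokenize("Hello, world! It's 2024."));
        // prints [Hello, world, It, s]
    }
}
```

    Note that digits are not letters, so "2024" produces no token, and the apostrophe splits "It's" into two tokens.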

    Note: this works reasonably well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.
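
    The limitation in the note can be seen by counting letter runs directly. The following JDK-only sketch (the class ScriptBoundarySketch and its method are hypothetical, not part of Lucene) applies the same Character.isLetter rule to an English sentence and to the Chinese sentence 我喜欢搜索 ("I like searching"), written with Unicode escapes so the source stays ASCII.

```java
// Hypothetical illustration; not Lucene code.
class ScriptBoundarySketch {

    // Counts maximal runs of letters, testing each code point with Character.isLetter.
    static int countLetterRuns(String text) {
        int runs = 0;
        boolean inRun = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean letter = Character.isLetter(cp);
            if (letter && !inRun) {
                runs++; // a new run starts here
            }
            inRun = letter;
            i += Character.charCount(cp);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Spaces separate the English words, so each word is its own run.
        System.out.println(countLetterRuns("I like search engines")); // prints 4
        // CJK ideographs are letters and there are no spaces: one single run.
        System.out.println(countLetterRuns("\u6211\u559C\u6B22\u641C\u7D22")); // prints 1
    }
}
```

    Under this rule the whole unspaced sentence would come out as one token, which is usually too coarse for search.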

    • Constructor Detail

      • LetterTokenizer

        public LetterTokenizer()
        Construct a new LetterTokenizer.
      • LetterTokenizer

        public LetterTokenizer(AttributeFactory factory)
        Construct a new LetterTokenizer using a given AttributeFactory.
        Parameters:
        factory - the attribute factory to use for this Tokenizer
      • LetterTokenizer

        public LetterTokenizer(AttributeFactory factory,
                               int maxTokenLen)
        Construct a new LetterTokenizer using a given AttributeFactory and maximum token length.
        Parameters:
        factory - the attribute factory to use for this Tokenizer
        maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024)
        Throws:
        IllegalArgumentException - if maxTokenLen is invalid.
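
        The maxTokenLen constraint can be sketched without Lucene on the classpath. Everything below is a hypothetical illustration (MaxTokenLenSketch, checkMaxTokenLen, and tokenize are made-up names). Two points are assumptions rather than documented facts: the sketch accepts a value equal to the limit, and it splits a letter run longer than maxTokenLen into consecutive chunks instead of dropping or truncating it. Length is counted in Java chars, a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the maxTokenLen parameter; not Lucene code.
class MaxTokenLenSketch {

    // Value taken from the parameter description above (1024*1024).
    static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

    // Rejects values outside the range; accepting the limit itself is an assumption.
    static int checkMaxTokenLen(int maxTokenLen) {
        if (maxTokenLen <= 0 || maxTokenLen > MAX_TOKEN_LENGTH_LIMIT) {
            throw new IllegalArgumentException(
                "maxTokenLen must be greater than 0 and at most "
                    + MAX_TOKEN_LENGTH_LIMIT + ", got: " + maxTokenLen);
        }
        return maxTokenLen;
    }

    // Splits at non-letters; assumes over-long runs are emitted in chunks.
    static List<String> tokenize(String text, int maxTokenLen) {
        checkMaxTokenLen(maxTokenLen);
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
                if (current.length() >= maxTokenLen) {
                    // The token reached the maximum length: emit it and start a new one.
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abcdefgh ij", 3)); // prints [abc, def, gh, ij]
    }
}
```

        With real Lucene classes, the equivalent check happens inside the constructor, so an invalid maxTokenLen fails at construction time rather than during tokenization.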