Class SimplePatternTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class SimplePatternTokenizer
    extends Tokenizer
    This tokenizer uses a Lucene RegExp or (expert usage) a pre-built determinized Automaton to locate tokens. The regexp syntax is more limited than that of PatternTokenizer, but the tokenization is quite a bit faster. The provided regex should match valid token characters, not token separator characters (unlike String.split, whose pattern matches the separators). The matching is greedy: the longest match at a given start point will be the next token. Empty string tokens are never produced.
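    The matching semantics can be sketched in plain Java with java.util.regex (this is only an illustration of the behavior described above, not Lucene's automaton-based implementation; the class name and helper method here are hypothetical):

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of SimplePatternTokenizer's matching semantics: the pattern
    // describes token characters (not separators), each match is greedy
    // (longest match at a given start point), and empty-string tokens are
    // never emitted.
    public class GreedyTokenSketch {
        static List<String> tokenize(String regexp, String input) {
            List<String> tokens = new ArrayList<>();
            Matcher m = Pattern.compile(regexp).matcher(input);
            while (m.find()) {
                if (!m.group().isEmpty()) { // skip empty matches
                    tokens.add(m.group());
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // "[a-z]+" matches runs of letters; punctuation and spaces
            // act as separators simply by not matching.
            System.out.println(tokenize("[a-z]+", "foo--bar, baz"));
            // → [foo, bar, baz]
        }
    }
    ```

    Note that the pattern is inverted relative to String.split: here "[a-z]+" selects the tokens themselves, whereas split would be given a pattern for the separators (e.g. "[^a-z]+").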
    WARNING: This API is experimental and might change in incompatible ways in the next release.