Fast, general-purpose grammar-based tokenizer.
StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer from the analysis module, it does not emit URLs and email addresses as single tokens; it splits them into tokens according to the UAX#29 word break rules.
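The idea of splitting text at word boundaries can be tried without Lucene on the classpath. The sketch below is a stand-in, not StandardTokenizer itself: it assumes the JDK's java.text.BreakIterator as the boundary detector, and it keeps only segments that contain a letter or digit. The class name WordBreakSketch and the method tokenize are hypothetical names chosen for illustration. The JDK's boundary rules are a separate implementation and may differ from StandardTokenizer's JFlex grammar in edge cases.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    // Splits text at the word boundaries reported by the JDK's BreakIterator
    // and keeps only segments containing at least one letter or digit, so
    // pure punctuation and whitespace segments are dropped.
    // Assumption: this approximates, but is not, Lucene's StandardTokenizer.
    static List<String> tokenize(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!"));
        System.out.println(tokenize("Mail user@example.com today"));
    }
}
```

Running main on an input that contains an email address shows where the boundary detector breaks it up. To see Lucene's actual behavior, feed the same string to StandardTokenizer and to UAX29URLEmailTokenizer and compare the emitted terms.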
Class Summary

Class                  Description
StandardAnalyzer
StandardFilter         Deprecated. StandardFilter is a no-op and can be removed from code.
StandardTokenizer      A grammar-based tokenizer constructed with JFlex.
StandardTokenizerImpl  This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.