Package org.apache.lucene.analysis.standard
Fast, general-purpose grammar-based tokenizers.
ClassicTokenizer: this class was formerly (prior to Lucene 3.1) named StandardTokenizer. Its tokenization rules are not based on the Unicode Text Segmentation algorithm. ClassicAnalyzer includes ClassicTokenizer, LowerCaseFilter and StopFilter.

UAX29URLEmailTokenizer: implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29, except that URLs and email addresses are also tokenized according to the relevant RFCs. UAX29URLEmailAnalyzer includes UAX29URLEmailTokenizer, LowerCaseFilter and StopFilter.
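As a minimal usage sketch (assuming a recent Lucene release in which UAX29URLEmailAnalyzer has a no-argument constructor; the field name "body" and the sample text are made up for illustration), the analyzer's token stream can be consumed like this:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class UrlEmailAnalyzerSketch {
    public static void main(String[] args) throws Exception {
      // UAX29URLEmailAnalyzer = UAX29URLEmailTokenizer + LowerCaseFilter + StopFilter.
      try (Analyzer analyzer = new UAX29URLEmailAnalyzer();
           TokenStream ts = analyzer.tokenStream("body",
               "Contact admin@example.com or see https://example.com/docs")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // URLs and email addresses survive as single, lowercased tokens,
          // e.g. "contact", "admin@example.com", "see", "https://example.com/docs"
          System.out.println(term.toString());
        }
        ts.end();
      }
    }
  }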
This package also contains StandardAnalyzer and StandardTokenizer, which are not listed here because they have moved to Lucene Core. The factories for those components (used, for example, by Solr) remain part of this module.
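Outside Solr, the same factories can be wired by name through CustomAnalyzer. The sketch below assumes the usual SPI names "classic" (for ClassicTokenizerFactory and ClassicFilterFactory) and "lowercase" (for LowerCaseFilterFactory):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.custom.CustomAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class FactoryWiringSketch {
    public static void main(String[] args) throws Exception {
      // ClassicTokenizerFactory -> ClassicFilterFactory -> LowerCaseFilterFactory,
      // looked up by their SPI names (assumed to be "classic" and "lowercase").
      Analyzer analyzer = CustomAnalyzer.builder()
          .withTokenizer("classic")
          .addTokenFilter("classic")
          .addTokenFilter("lowercase")
          .build();

      try (TokenStream ts = analyzer.tokenStream("body", "Lucene's I.B.M. example")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term.toString()); // e.g. "lucene", "ibm", "example"
        }
        ts.end();
      }
      analyzer.close();
    }
  }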
Class Summary

ClassicAnalyzer: Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
ClassicFilter: Normalizes tokens extracted with ClassicTokenizer.
ClassicFilterFactory: Factory for ClassicFilter.
ClassicTokenizer: A grammar-based tokenizer constructed with JFlex.
ClassicTokenizerFactory: Factory for ClassicTokenizer.
StandardTokenizerFactory: Factory for StandardTokenizer.
UAX29URLEmailAnalyzer: Filters UAX29URLEmailTokenizer with LowerCaseFilter and StopFilter, using a list of English stop words.
UAX29URLEmailTokenizer: Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs.
UAX29URLEmailTokenizerFactory: Factory for UAX29URLEmailTokenizer.
UAX29URLEmailTokenizerImpl: Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs.
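To make the composition described above concrete, the following sketch builds the ClassicAnalyzer chain by hand. It assumes a recent Lucene version in which LowerCaseFilter, StopFilter and CharArraySet live in core, and uses ClassicAnalyzer.STOP_WORDS_SET as the default English stop set:

  import java.io.StringReader;

  import org.apache.lucene.analysis.CharArraySet;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.StopFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.ClassicAnalyzer;
  import org.apache.lucene.analysis.standard.ClassicFilter;
  import org.apache.lucene.analysis.standard.ClassicTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class ClassicChainSketch {
    public static void main(String[] args) throws Exception {
      // Hand-built equivalent of ClassicAnalyzer:
      // ClassicTokenizer -> ClassicFilter -> LowerCaseFilter -> StopFilter.
      ClassicTokenizer tokenizer = new ClassicTokenizer();
      tokenizer.setReader(new StringReader("The I.B.M. engineers shipped the release"));

      CharArraySet stopWords = ClassicAnalyzer.STOP_WORDS_SET; // default English stop words
      try (TokenStream ts =
          new StopFilter(new LowerCaseFilter(new ClassicFilter(tokenizer)), stopWords)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // ClassicFilter strips dots from acronyms, so "I.B.M." becomes "ibm";
          // expected output: "ibm", "engineers", "shipped", "release"
          System.out.println(term.toString());
        }
        ts.end();
      }
    }
  }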