Interface | Description
---|---
StandardTokenizerInterface | Internal interface for supporting versioned grammars.
Class | Description
---|---
ClassicAnalyzer | Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
ClassicFilter | Normalizes tokens extracted with ClassicTokenizer.
ClassicTokenizer | A grammar-based tokenizer constructed with JFlex. This should be a good tokenizer for most European-language documents: it splits words at punctuation characters, removing punctuation.
StandardAnalyzer | Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
StandardFilter | Normalizes tokens extracted with StandardTokenizer.
StandardTokenizer | A grammar-based tokenizer constructed with JFlex.
StandardTokenizerImpl | Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Tokens produced are of the following types: `<ALPHANUM>` (a sequence of alphabetic and numeric characters), `<NUM>` (a number), `<SOUTHEAST_ASIAN>` (a sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer), `<IDEOGRAPHIC>` (a single CJKV ideographic character), and `<HIRAGANA>` (a single hiragana character).
UAX29URLEmailAnalyzer | Filters UAX29URLEmailTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
UAX29URLEmailTokenizer | Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
UAX29URLEmailTokenizerImpl | Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
Standards-based analyzers implemented with JFlex.

The org.apache.lucene.analysis.standard package contains three fast grammar-based tokenizers constructed with JFlex:
StandardTokenizer: as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, URLs and email addresses are not tokenized as single tokens, but are instead split up into tokens according to the UAX#29 word break rules. StandardAnalyzer includes StandardTokenizer, StandardFilter, LowerCaseFilter and StopFilter. When the Version specified in the constructor is lower than 3.1, the ClassicTokenizer implementation is invoked.
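For concreteness, here is a minimal sketch of running StandardAnalyzer over a string and printing the resulting tokens. It assumes a Lucene 3.x-era API (Version-based constructor, Analyzer.tokenStream(String, Reader), and the CharTermAttribute consumer idiom); the field name "body" and the sample text are arbitrary.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StandardAnalyzerSketch {
  public static void main(String[] args) throws IOException {
    // LUCENE_31 (or later) selects the UAX#29-based grammar; a version
    // below 3.1 would make the analyzer fall back to ClassicTokenizer.
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
    TokenStream ts = analyzer.tokenStream("body",
        new StringReader("The quick brown fox jumped over 2 lazy dogs."));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Prints lowercased tokens with stop words such as "the" removed,
      // e.g. quick, brown, fox, jumped, over, 2, lazy, dogs
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
```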
ClassicTokenizer: this class was formerly (prior to Lucene 3.1) named StandardTokenizer. (Its tokenization rules are not based on the Unicode Text Segmentation algorithm.) ClassicAnalyzer includes ClassicTokenizer, StandardFilter, LowerCaseFilter and StopFilter.
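To make the "includes" wiring concrete, the chain inside such an analyzer can be approximated by stacking the filters by hand. This is a sketch under the same Lucene 3.x API assumptions as above; StandardAnalyzer.STOP_WORDS_SET is the library's default English stop set.

```java
import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public class ClassicChainSketch {
  /** Roughly the chain described above: tokenize, normalize, lowercase, stop. */
  static TokenStream classicChain(Reader reader) {
    Version v = Version.LUCENE_36; // a 3.x match version; pick your release
    TokenStream chain = new ClassicTokenizer(v, reader);
    chain = new StandardFilter(v, chain);  // token normalization
    chain = new LowerCaseFilter(v, chain); // case folding
    return new StopFilter(v, chain, StandardAnalyzer.STOP_WORDS_SET);
  }
}
```

In practice you would use ClassicAnalyzer itself; the hand-built chain is only meant to show the order of the components.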
UAX29URLEmailTokenizer: implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. UAX29URLEmailAnalyzer includes UAX29URLEmailTokenizer, StandardFilter, LowerCaseFilter and StopFilter.
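The practical difference is easiest to see side by side. The following sketch (same Lucene 3.x-era assumptions as the sketches above; UAX29URLEmailAnalyzer requires a release that ships it) tokenizes a string containing an email address and a URL with both analyzers.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class UrlEmailComparison {
  static void dump(Analyzer analyzer, String text) throws IOException {
    TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.print("[" + term + "] ");
    }
    ts.end();
    ts.close();
    System.out.println();
  }

  public static void main(String[] args) throws IOException {
    String text = "Mail jane.doe@example.com or see http://example.com/docs";
    // StandardAnalyzer breaks the address and the URL at '@', ':' and '/'
    // per the UAX#29 word break rules:
    dump(new StandardAnalyzer(Version.LUCENE_36), text);
    // UAX29URLEmailAnalyzer emits each of them as a single token:
    dump(new UAX29URLEmailAnalyzer(Version.LUCENE_36), text);
  }
}
```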