org.apache.lucene.analysis.standard

This package contains three fast grammar-based tokenizers constructed with JFlex.
Interface Summary | |
---|---|
StandardTokenizerInterface | |
Class Summary | |
---|---|
ClassicAnalyzer | Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
ClassicFilter | Normalizes tokens extracted with ClassicTokenizer. |
ClassicTokenizer | A grammar-based tokenizer constructed with JFlex. |
StandardAnalyzer | Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
StandardFilter | Normalizes tokens extracted with StandardTokenizer. |
StandardTokenizer | A grammar-based tokenizer constructed with JFlex. |
StandardTokenizerImpl | This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Tokens produced are of the following types: <ALPHANUM>: a sequence of alphabetic and numeric characters; <NUM>: a number; <SOUTHEAST_ASIAN>: a sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer; <IDEOGRAPHIC>: a single CJKV ideographic character; <HIRAGANA>: a single hiragana character. |
UAX29URLEmailTokenizer | This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. |
UAX29URLEmailTokenizerImpl | This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. |
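
The analyzer rows above each describe a chain: a tokenizer whose output passes through a normalizing filter, a lower-casing filter and a stop-word filter. The following is a minimal sketch of that chain shape using only the JDK. It is a conceptual stand-in, not Lucene's implementation: the class name `AnalyzerChainSketch`, the splitting rule and the stop list are all assumptions made for illustration, and the normalizing-filter stage is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerChainSketch {
    // Hypothetical stop list for illustration; not Lucene's built-in English stop set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Tokenizer stand-in: split on runs of characters that are neither letters nor digits.
    // This is an assumption, far simpler than the JFlex grammars in this package.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter chain stand-in: lower-case each token, then drop stop words.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick Brown Fox and the Lazy Dog"));
    }
}
```

Lower-casing runs before stop-word removal in this sketch so that "The" and "the" are both matched by a lower-case stop list.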
The org.apache.lucene.analysis.standard package contains three fast grammar-based tokenizers constructed with JFlex:

- StandardTokenizer: as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, it does not keep URLs and email addresses as single tokens; instead it splits them into tokens according to the UAX#29 word break rules. StandardAnalyzer includes StandardTokenizer, StandardFilter, LowerCaseFilter and StopFilter. When the Version specified in the constructor is lower than 3.1, the ClassicTokenizer implementation is invoked.
- ClassicTokenizer: this class was formerly (prior to Lucene 3.1) named StandardTokenizer. (Its tokenization rules are not based on the Unicode Text Segmentation algorithm.) ClassicAnalyzer includes ClassicTokenizer, ClassicFilter, LowerCaseFilter and StopFilter.
- UAX29URLEmailTokenizer: implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
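
The difference between splitting an email address and keeping it as a single token can be illustrated with a small JDK-only sketch. Everything here is an assumption made for illustration: the class name `EmailTokenSketch` and both regular expressions are hypothetical, the word pattern is much cruder than the UAX#29 word break rules, and the email pattern is much looser than the relevant RFCs. The exact token boundaries produced by the real tokenizers may therefore differ from what this sketch produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailTokenSketch {
    // Simplified word pattern: a run of letters and digits. NOT the UAX#29 rules.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Same word pattern, but a simplified email pattern is tried first at each position.
    // The email pattern is an illustrative assumption, not an RFC-conformant grammar.
    static final Pattern EMAIL_OR_WORD = Pattern.compile(
            "[\\p{L}\\p{N}._%+-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+"
            + "|[\\p{L}\\p{N}]+");

    // Collect every match of the given pattern, in order of appearance.
    static List<String> tokens(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Mail bob@example.com today";
        // prints [Mail, bob, example, com, today]
        System.out.println(tokens(WORD, text));
        // prints [Mail, bob@example.com, today]
        System.out.println(tokens(EMAIL_OR_WORD, text));
    }
}
```

Trying the more specific pattern before the general word pattern is what lets the address survive as one token in the second call.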