org.apache.lucene.analysis.ar
Class ArabicLetterTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.CharTokenizer
org.apache.lucene.analysis.LetterTokenizer
org.apache.lucene.analysis.ar.ArabicLetterTokenizer
- All Implemented Interfaces: Closeable
public class ArabicLetterTokenizer extends LetterTokenizer
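Tokenizer that breaks text into runs of letters and diacritics.
The standard LetterTokenizer accepts only characters in the Letter category, so it ends a token at every diacritic and splits vocalized words into fragments.
Similar handling is needed for Indic scripts, Hebrew, Thaana, and other scripts that write diacritics as combining marks.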
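The difference described above can be illustrated without Lucene. The following is a minimal sketch using only java.lang.Character; the class DiacriticRunsSketch and its runs method are hypothetical names for illustration and are not part of the Lucene API. It splits a fully vocalized Arabic word twice, once accepting letters only and once accepting letters plus nonspacing marks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
public class DiacriticRunsSketch {

    /**
     * Splits text into maximal runs of accepted characters.
     * With allowMarks == false only letters are accepted;
     * with allowMarks == true nonspacing marks are accepted as well.
     */
    public static List<String> runs(String text, boolean allowMarks) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = Character.isLetter(c)
                    || (allowMarks && Character.getType(c) == Character.NON_SPACING_MARK);
            if (keep) {
                current.append(c);
            } else if (current.length() > 0) {
                // A rejected character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // kaf, fatha, teh, fatha, beh, fatha: three letters, each followed by a diacritic
        String word = "\u0643\u064E\u062A\u064E\u0628\u064E";
        System.out.println(runs(word, false).size()); // prints 3: split at each fatha
        System.out.println(runs(word, true).size());  // prints 1: the word stays whole
    }
}
```

With letters only, each fatha (U+064E, category Mn) ends the current token, so the word is returned as three single-letter fragments with the diacritics dropped.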
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Method Summary
protected boolean isTokenChar(char c)
Allows for Letter category or NonspacingMark category
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
ArabicLetterTokenizer
public ArabicLetterTokenizer(Reader in)
ArabicLetterTokenizer
public ArabicLetterTokenizer(AttributeSource source, Reader in)
ArabicLetterTokenizer
public ArabicLetterTokenizer(AttributeSource.AttributeFactory factory, Reader in)
isTokenChar
protected boolean isTokenChar(char c)
- Accepts characters in the Unicode Letter category or the NonspacingMark category
- Overrides: isTokenChar in class LetterTokenizer
- See Also: LetterTokenizer.isTokenChar(char)
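The character rule documented for isTokenChar can be approximated with standard JDK calls. This is a sketch under the assumption that "Letter category" corresponds to Character.isLetter and "NonspacingMark category" to Character.NON_SPACING_MARK; the class TokenCharSketch is a hypothetical stand-in and does not reproduce the Lucene source.

```java
// Hypothetical stand-in for the documented rule; not the Lucene implementation.
public class TokenCharSketch {

    /** Returns true for a letter or a nonspacing mark (Unicode category Mn). */
    public static boolean isTokenChar(char c) {
        return Character.isLetter(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        System.out.println(isTokenChar('\u0628')); // ARABIC LETTER BEH: true
        System.out.println(isTokenChar('\u064E')); // ARABIC FATHA (Mn): true
        System.out.println(isTokenChar('\u05B7')); // HEBREW POINT PATAH (Mn): true
        System.out.println(isTokenChar(' '));      // space: false
        System.out.println(isTokenChar('1'));      // decimal digit: false
    }
}
```

Digits, whitespace, and punctuation are still rejected, so they continue to act as token boundaries.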
Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.