Deprecated. Use StandardTokenizer instead.

@Deprecated public class ArabicLetterTokenizer extends LetterTokenizer
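The standard LetterTokenizer fails on diacritics: it accepts only characters in the Letter category, so a combining diacritic ends the current token. This tokenizer also accepts characters in the NonspacingMark category. Similar handling is necessary for Indic scripts, Hebrew, Thaana, etc.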
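The effect of the two acceptance rules can be sketched without Lucene, using only `java.lang.Character`. This is a minimal illustration, not the library's implementation: the class `DiacriticTokenDemo` and its methods `isLetterOnly`, `isLetterOrMark`, and `split` are hypothetical names introduced here, and `split` is a simplified stand-in for a character-based tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class DiacriticTokenDemo {

    // Letter-only rule (assumed here to model a plain letter tokenizer).
    static boolean isLetterOnly(int c) {
        return Character.isLetter(c);
    }

    // Letter or nonspacing mark (Unicode category Mn).
    static boolean isLetterOrMark(int c) {
        return Character.isLetter(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    // Hypothetical helper: split text into maximal runs of accepted code points.
    static List<String> split(String text, IntPredicate accept) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (accept.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A rejected code point closes the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // KAF + FATHA + TEH + FATHA + BEH + FATHA (one vocalized word)
        String word = "\u0643\u064E\u062A\u064E\u0628\u064E";
        // Letter-only: each FATHA breaks the word, giving 3 tokens.
        System.out.println(split(word, DiacriticTokenDemo::isLetterOnly).size());
        // Letter or mark: the word stays whole, giving 1 token.
        System.out.println(split(word, DiacriticTokenDemo::isLetterOrMark).size());
    }
}
```

With the letter-only rule, every diacritic acts as a delimiter and the vocalized word is cut into single-letter fragments. Accepting category Mn keeps the marks attached to their base letters.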
You must specify the required Version compatibility when creating
ArabicLetterTokenizer:
- CharTokenizer uses an int-based API to normalize and detect token characters. See isTokenChar(int) and CharTokenizer.normalize(int) for details.

Inherited nested classes: AttributeSource.State

Inherited fields: DEFAULT_TOKEN_ATTRIBUTE_FACTORY, DEFAULT_ATTRIBUTE_FACTORY

| Constructor and Description |
|---|
| ArabicLetterTokenizer(Version matchVersion, AttributeFactory factory, Reader in)<br>Deprecated. Construct a new ArabicLetterTokenizer using a given AttributeFactory. |
| ArabicLetterTokenizer(Version matchVersion, Reader in)<br>Deprecated. Construct a new ArabicLetterTokenizer. |
| Modifier and Type | Method and Description |
|---|---|
| protected boolean | isTokenChar(int c)<br>Deprecated. Allows for Letter category or NonspacingMark category |
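The `int` parameter of isTokenChar carries a full Unicode code point rather than a single UTF-16 `char`. The sketch below illustrates why that matters for characters outside the Basic Multilingual Plane. It uses only `java.lang.Character`; the class `CodePointDemo` and the methods `acceptsCodePoint` and `acceptsChar` are hypothetical names for illustration, and the `char` variant exists only for contrast.

```java
public class CodePointDemo {

    // Letter-or-nonspacing-mark rule applied to a full code point.
    static boolean acceptsCodePoint(int c) {
        return Character.isLetter(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    // Same rule applied to one UTF-16 code unit, for contrast only.
    static boolean acceptsChar(char c) {
        return Character.isLetter(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I, a supplementary-plane letter.
        String s = new String(Character.toChars(0x10400));

        // The letter occupies two UTF-16 code units (a surrogate pair).
        System.out.println(s.length());                          // prints 2
        // As a code point it is recognized as a letter.
        System.out.println(acceptsCodePoint(s.codePointAt(0)));  // prints true
        // Its high surrogate alone is neither a letter nor a mark.
        System.out.println(acceptsChar(s.charAt(0)));            // prints false
    }
}
```

A `char`-based check would reject both halves of the surrogate pair and drop the letter, while the code point check accepts it.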
Methods inherited from class CharTokenizer: end, incrementToken, normalize, reset

Methods inherited from class Tokenizer: close, correctOffset, setReader

Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

public ArabicLetterTokenizer(Version matchVersion, Reader in)

Parameters:
matchVersion - Lucene version to match; see above
in - the input to split up into tokens

public ArabicLetterTokenizer(Version matchVersion, AttributeFactory factory, Reader in)

Construct a new ArabicLetterTokenizer using a given AttributeFactory.

Parameters:
matchVersion - Lucene version to match; see above
factory - the attribute factory to use for this Tokenizer
in - the input to split up into tokens

protected boolean isTokenChar(int c)

Overrides: isTokenChar in class LetterTokenizer
See Also: LetterTokenizer.isTokenChar(int)

Copyright © 2000-2015 Apache Software Foundation. All Rights Reserved.