StandardTokenizer instead.

@Deprecated public class ArabicLetterTokenizer extends LetterTokenizer
The problem with the standard LetterTokenizer is that it fails on diacritics: they are not letters, so it splits a token at each one. Similar handling is necessary for Indic scripts, Hebrew, Thaana, etc.
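To illustrate why a letter-only test fails on diacritics, here is a minimal sketch using only `java.lang.Character`. The class name `DiacriticCategoryDemo` and its helper methods are hypothetical and not part of Lucene; they only show how the JDK classifies an Arabic letter versus an Arabic diacritic.

```java
public class DiacriticCategoryDemo {
    // Hypothetical helper: the only test a letter-only tokenizer would apply.
    public static boolean isLetterOnly(int c) {
        return Character.isLetter(c);
    }

    // Hypothetical helper: true for Unicode general category Mn (nonspacing mark).
    public static boolean isNonSpacingMark(int c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        int beh = 0x0628;   // ARABIC LETTER BEH
        int fatha = 0x064E; // ARABIC FATHA, a diacritic

        System.out.println(isLetterOnly(beh));       // prints true
        System.out.println(isLetterOnly(fatha));     // prints false
        System.out.println(isNonSpacingMark(fatha)); // prints true
    }
}
```

Because the diacritic is a nonspacing mark rather than a letter, a tokenizer that accepts only letters ends the current token whenever it meets one.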
You must specify the required Version compatibility when creating ArabicLetterTokenizer:

- CharTokenizer uses an int-based API to normalize and detect token characters. See isTokenChar(int) and CharTokenizer.normalize(int) for details.

AttributeSource.State
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
DEFAULT_ATTRIBUTE_FACTORY
Constructor and Description |
---|
ArabicLetterTokenizer(Version matchVersion, AttributeFactory factory, Reader in) Deprecated. Construct a new ArabicLetterTokenizer using a given AttributeFactory. |
ArabicLetterTokenizer(Version matchVersion, Reader in) Deprecated. Construct a new ArabicLetterTokenizer. |
Modifier and Type | Method and Description |
---|---|
protected boolean | isTokenChar(int c) Deprecated. Allows for Letter category or NonspacingMark category. |

Methods inherited from class CharTokenizer: end, incrementToken, normalize, reset

Methods inherited from class Tokenizer: close, correctOffset, setReader

Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

public ArabicLetterTokenizer(Version matchVersion, Reader in)

- matchVersion - Lucene version to match (see above)
- in - the input to split up into tokens

public ArabicLetterTokenizer(Version matchVersion, AttributeFactory factory, Reader in)

Construct a new ArabicLetterTokenizer using a given AttributeFactory.

- matchVersion - Lucene version to match (see above)
- factory - the attribute factory to use for this Tokenizer
- in - the input to split up into tokens

protected boolean isTokenChar(int c)

Overrides: isTokenChar in class LetterTokenizer

See also: LetterTokenizer.isTokenChar(int)
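
The rule summarized for `isTokenChar(int)` (Letter category or NonspacingMark category) can be sketched without Lucene as a predicate over `int` code points. This is an illustration under stated assumptions, not the library's source: the class `LetterAndMarkSplitDemo`, the `isLetterOrMark` predicate, and the `split` helper are all hypothetical names, and `split` is a simplified stand-in for what a character-based tokenizer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class LetterAndMarkSplitDemo {
    // Assumed restatement of the documented rule: letters or nonspacing marks
    // count as token characters.
    public static boolean isLetterOrMark(int c) {
        return Character.isLetter(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    // Hypothetical helper: splits text into maximal runs of code points
    // accepted by the predicate, iterating by code point rather than by char.
    public static List<String> split(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            if (isTokenChar.test(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(c);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // KAF, FATHA, TEH, FATHA, BEH, FATHA: one vocalized Arabic word.
        String word = "\u0643\u064E\u062A\u064E\u0628\u064E";

        // Letter-only predicate: every diacritic ends the current token.
        System.out.println(split(word, Character::isLetter).size()); // prints 3

        // Letter-or-mark predicate: the word stays in one piece.
        System.out.println(split(word, LetterAndMarkSplitDemo::isLetterOrMark).size()); // prints 1
    }
}
```

With the letter-only predicate the vocalized word breaks into three single-letter tokens, while the letter-or-mark predicate keeps it whole, which is the behavior the method summary describes.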
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.