public abstract class CharTokenizer extends Tokenizer
CharTokenizer uses an int based API to normalize and detect token codepoints. See isTokenChar(int) and normalize(int) for details.
A new CharTokenizer API was introduced with Lucene 3.1. This API moved from UTF-16 code units to UTF-32 codepoints to eventually add support for supplementary characters. The old char based API has been deprecated and should be replaced with the int based methods isTokenChar(int) and normalize(int).
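The move from UTF-16 code units to UTF-32 codepoints matters because supplementary characters occupy two Java chars. A minimal standalone sketch of the difference, using only the JDK (the class name is illustrative, not part of Lucene):

```java
public class CodepointDemo {
    public static void main(String[] args) {
        // "a𝔸b": U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A) is a
        // supplementary character encoded as two UTF-16 code units.
        String s = "a\uD835\uDD38b";
        System.out.println(s.length());                      // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 codepoints

        // Iterating int codepoints classifies the supplementary letter
        // correctly; a char based loop would see two surrogate halves,
        // neither of which counts as a letter.
        StringBuilder classes = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            classes.append(Character.isLetter(cp) ? 'L' : '?');
            i += Character.charCount(cp);
        }
        System.out.println(classes); // LLL
    }
}
```

A char based isTokenChar would be handed each surrogate half separately and misclassify the character, which is exactly what the int based API avoids.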
As of Lucene 3.1 each CharTokenizer constructor expects a Version argument. Based on the given Version, either the new API or a backwards compatibility layer is used at runtime. For Version < 3.1 the backwards compatibility layer ensures correct behavior even for indexes built with previous versions of Lucene. If a Version >= 3.1 is used, CharTokenizer requires the new API to be implemented by the instantiated class. The old char based API then no longer needs to be implemented, even when backwards compatibility must be preserved: CharTokenizer subclasses implementing the new API are fully backwards compatible if instantiated with Version < 3.1.
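Subclasses only supply the two int based hooks. The following is a self-contained sketch of that detect/normalize contract using plain JDK types (LetterLowerCaseSketch and its static tokenize helper are illustrative names, not Lucene API); a real subclass would instead extend CharTokenizer and override isTokenChar(int) and normalize(int), letting the base class drive the loop:

```java
import java.util.ArrayList;
import java.util.List;

public class LetterLowerCaseSketch {
    // analogous to isTokenChar(int): include only letter codepoints
    static boolean isTokenChar(int c) {
        return Character.isLetter(c);
    }

    // analogous to normalize(int): lowercase each accepted codepoint
    static int normalize(int c) {
        return Character.toLowerCase(c);
    }

    // the driving loop CharTokenizer itself provides: runs of token
    // chars form tokens, everything else is a separator
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isTokenChar(cp)) {
                current.appendCodePoint(normalize(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // letters are kept and lowercased; punctuation and spaces split
        System.out.println(tokenize("Hello, Lucene World!"));
    }
}
```

Because the loop advances by Character.charCount(cp), supplementary characters are handled as single token characters rather than as surrogate pairs.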
|Constructor and Description|
|CharTokenizer(Version matchVersion, Reader input) Creates a new CharTokenizer instance|

|Modifier and Type|Method and Description|
|protected abstract boolean|isTokenChar(int c) Returns true iff a codepoint should be included in a token.|
|protected int|normalize(int c) Called on each token character to normalize it before it is added to the token.|
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
Parameters:
matchVersion - Lucene version to match
input - the input to split up into tokens
protected abstract boolean isTokenChar(int c)
protected int normalize(int c)
public final boolean incrementToken() throws IOException
public final void end() throws IOException
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.