Class IndicTokenizer

  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.lucene.analysis.CharTokenizer
                  extended by
All Implemented Interfaces:

public final class IndicTokenizer
extends CharTokenizer

Simple Tokenizer for text in Indian Languages.

Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
Field Summary
Fields inherited from class org.apache.lucene.analysis.Tokenizer
Constructor Summary
IndicTokenizer(Version matchVersion, AttributeSource.AttributeFactory factory, Reader input)
IndicTokenizer(Version matchVersion, AttributeSource source, Reader input)
IndicTokenizer(Version matchVersion, Reader input)
Method Summary
protected  boolean isTokenChar(int c)
          Returns true iff a codepoint should be included in a token.
Methods inherited from class org.apache.lucene.analysis.CharTokenizer
end, incrementToken, isTokenChar, normalize, normalize, reset
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
Methods inherited from class org.apache.lucene.analysis.TokenStream
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Constructor Detail


public IndicTokenizer(Version matchVersion,
                      AttributeSource.AttributeFactory factory,
                      Reader input)


public IndicTokenizer(Version matchVersion,
                      AttributeSource source,
                      Reader input)


public IndicTokenizer(Version matchVersion,
                      Reader input)
Method Detail


protected boolean isTokenChar(int c)
Description copied from class: CharTokenizer
Returns true iff a codepoint should be included in a token. This tokenizer generates as tokens adjacent sequences of codepoints which satisfy this predicate. Codepoints for which this is false are used to define token boundaries and are not included in tokens.
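
The boundary rule described above can be sketched in plain Java without any Lucene dependency. This is only an illustration: the class CodePointTokenizerSketch, its tokenize helper, and the letterOrMark predicate are hypothetical names, and the predicate is an assumed rule (letters plus combining marks), since this page does not state the rule IndicTokenizer actually applies. The sketch also leaves out the attributes, offsets, and normalization that a real Tokenizer handles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class CodePointTokenizerSketch {

    // Hypothetical predicate, for illustration only: accept letters and
    // combining marks. The real IndicTokenizer rule is not shown on this page.
    static boolean letterOrMark(int c) {
        int type = Character.getType(c);
        return Character.isLetter(c)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    // Collects maximal runs of adjacent codepoints that satisfy the predicate.
    // Codepoints that fail it end the current token and are discarded.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // read a full codepoint, not a UTF-16 unit
            i += Character.charCount(cp);      // advance by 1 or 2 code units
            if (isTokenChar.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());    // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Devanagari words separated by ", " and followed by "!",
        // written with Unicode escapes to avoid source-encoding issues.
        String text = "\u0928\u092E\u0938\u094D\u0924\u0947, "
                + "\u0926\u0941\u0928\u093F\u092F\u093E!";
        for (String token : tokenize(text, CodePointTokenizerSketch::letterOrMark)) {
            System.out.println(token);
        }
    }
}
```

Accepting combining marks in the assumed predicate keeps vowel signs and the virama attached to the consonants they modify. A letters-only predicate would split such words at every mark.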

As of Lucene 3.1, the char-based API (CharTokenizer.isTokenChar(char) and CharTokenizer.normalize(char)) has been deprecated in favor of a Unicode 4.0 compatible int-based API that supports codepoints instead of UTF-16 code units. Subclasses of CharTokenizer must not override the char-based methods if a Version >= 3.1 is passed to the constructor.
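
The difference between UTF-16 code units and codepoints can be seen with standard java.lang.Character calls alone. In the following sketch the class CodeUnitVsCodePointSketch and its two helper methods are hypothetical names chosen for illustration. It uses U+1D400 (MATHEMATICAL BOLD CAPITAL A), a letter outside the Basic Multilingual Plane that Java stores as a surrogate pair.

```java
final class CodeUnitVsCodePointSketch {

    // Char-based check: tests each UTF-16 code unit on its own.
    static boolean allLettersByChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return false;   // a lone surrogate is not a letter
            }
        }
        return true;
    }

    // Int-based check: tests each complete codepoint.
    static boolean allLettersByCodePoint(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isLetter(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1D400));
        System.out.println(s.length());                      // number of UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // number of codepoints
        System.out.println(allLettersByChar(s));
        System.out.println(allLettersByCodePoint(s));
    }
}
```

The char-based check rejects the string because each surrogate half is classified on its own, while the int-based check sees a single letter.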

NOTE: This method will be marked abstract in Lucene 4.0.

isTokenChar in class CharTokenizer

Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.