org.apache.lucene.analysis.ru
Class RussianLetterTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.lucene.analysis.CharTokenizer
                  extended by org.apache.lucene.analysis.ru.RussianLetterTokenizer

public class RussianLetterTokenizer
extends CharTokenizer

A RussianLetterTokenizer is a Tokenizer that extends the behavior of LetterTokenizer by additionally looking up letters in a given "Russian charset". Note that the class itself derives from CharTokenizer, not from LetterTokenizer.

The problem with LetterTokenizer is that it relies on the Character.isLetter(char) method, which cannot detect letters in encodings such as CP1252 and KOI8 (the well-known problems with the 0xD7 and 0xF7 characters).
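
The following is a minimal sketch of one plausible reading of the 0xD7/0xF7 problem, not code from Lucene. It assumes the bytes were decoded as ISO-8859-1 (which agrees with CP1252 at these two positions), so they arrive as the symbols U+00D7 and U+00F7, not as letters. The class name IsLetterDemo and the helper decodeLatin1 are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class IsLetterDemo {
    // Hypothetical helper: decodes a single byte as ISO-8859-1.
    // ISO-8859-1 and CP1252 map 0xD7 and 0xF7 to the same characters.
    public static char decodeLatin1(int b) {
        return new String(new byte[] {(byte) b}, StandardCharsets.ISO_8859_1).charAt(0);
    }

    public static void main(String[] args) {
        char times = decodeLatin1(0xD7);  // U+00D7 MULTIPLICATION SIGN
        char divide = decodeLatin1(0xF7); // U+00F7 DIVISION SIGN

        // Both are math symbols, so Character.isLetter rejects them.
        System.out.println(Character.isLetter(times));    // false
        System.out.println(Character.isLetter(divide));   // false

        // A correctly decoded Cyrillic letter (U+0427, capital Che) is accepted.
        System.out.println(Character.isLetter('\u0427')); // true
    }
}
```

If text arrives correctly decoded to Unicode, Character.isLetter(char) already recognizes Cyrillic letters, which is consistent with the charset-taking constructor being deprecated in favor of RussianLetterTokenizer(Reader).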

Version:
$Id: RussianLetterTokenizer.java 806961 2009-08-23 12:39:28Z rmuir $

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
RussianLetterTokenizer(AttributeSource.AttributeFactory factory, Reader in)
           
RussianLetterTokenizer(AttributeSource source, Reader in)
           
RussianLetterTokenizer(Reader in)
           
RussianLetterTokenizer(Reader in, char[] charset)
          Deprecated. Use RussianLetterTokenizer(Reader) instead.
 
Method Summary
protected  boolean isTokenChar(char c)
          Collects only characters which satisfy Character.isLetter(char).
 
Methods inherited from class org.apache.lucene.analysis.CharTokenizer
end, incrementToken, next, next, normalize, reset
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
getOnlyUseNewAPI, reset, setOnlyUseNewAPI
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RussianLetterTokenizer

public RussianLetterTokenizer(Reader in,
                              char[] charset)
Deprecated. Use RussianLetterTokenizer(Reader) instead.


RussianLetterTokenizer

public RussianLetterTokenizer(Reader in)

RussianLetterTokenizer

public RussianLetterTokenizer(AttributeSource source,
                              Reader in)

RussianLetterTokenizer

public RussianLetterTokenizer(AttributeSource.AttributeFactory factory,
                              Reader in)

Method Detail

isTokenChar

protected boolean isTokenChar(char c)
Collects only characters which satisfy Character.isLetter(char).

Specified by:
isTokenChar in class CharTokenizer
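
To illustrate how an isTokenChar-style predicate shapes the token stream, here is a plain-Java sketch that groups maximal runs of accepted characters into tokens. It is an approximation under stated assumptions, not the Lucene implementation: the class LetterSplitSketch and its tokenize method are hypothetical, and details of CharTokenizer such as normalize, offsets, and any maximum token length are omitted.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitSketch {
    // Hypothetical stand-in for isTokenChar(char): accepts letters only.
    public static boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    // Groups maximal runs of accepted characters into tokens;
    // every rejected character acts as a separator.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush the last token if the text ends inside a run of letters.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The escapes spell a three-letter Cyrillic word (U+043C U+0438 U+0440).
        System.out.println(tokenize("Lucene 2.9: \u043c\u0438\u0440!"));
    }
}
```

With a letters-only predicate, digits and punctuation separate tokens, so the sample input yields the two tokens "Lucene" and "мир".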


Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.