org.apache.lucene.analysis
Class LowerCaseTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.lucene.analysis.CharTokenizer
                  extended by org.apache.lucene.analysis.LetterTokenizer
                      extended by org.apache.lucene.analysis.LowerCaseTokenizer
All Implemented Interfaces:
Closeable

public final class LowerCaseTokenizer
extends LetterTokenizer

LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts the resulting tokens to lower case. While it is functionally equivalent to the combination of LetterTokenizer and LowerCaseFilter, doing both tasks in a single pass is faster, hence this (redundant) implementation.

Note: this works reasonably well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.
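The behaviour described above can be illustrated with a small stand-alone sketch. It is not Lucene's implementation and does not use the Lucene API: the class LowerCaseSplitSketch and its tokenize method are hypothetical names, and the sketch assumes that "letter" means whatever Character.isLetter(int) accepts. It splits at non-letters and lower-cases in the same pass, and it ignores offsets, attributes and any token length limits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of org.apache.lucene.analysis.
final class LowerCaseSplitSketch {

    // Splits text at non-letters and lower-cases each letter in one pass.
    // Assumption: letters are the code points accepted by Character.isLetter(int).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                // Normalize while collecting, instead of in a second filter stage.
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, it, s]
        System.out.println(tokenize("Hello, World! It's 2011."));
        // Japanese text written without spaces stays a single token.
        System.out.println(tokenize("\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8").size());
    }
}
```

The second call shows why the note above matters: with no non-letter characters between words, the whole run of text is returned as one token.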

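You must specify the required Version compatibility when creating LowerCaseTokenizer:
As of 3.1, CharTokenizer uses an int based API to normalize and detect token characters. See CharTokenizer.isTokenChar(int) and CharTokenizer.normalize(int) for details.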


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
LowerCaseTokenizer(AttributeSource.AttributeFactory factory, Reader in)
          Deprecated. Use LowerCaseTokenizer(Version, AttributeSource.AttributeFactory, Reader) instead. This will be removed in Lucene 4.0.
LowerCaseTokenizer(AttributeSource source, Reader in)
          Deprecated. Use LowerCaseTokenizer(Version, AttributeSource, Reader) instead. This will be removed in Lucene 4.0.
LowerCaseTokenizer(Reader in)
          Deprecated. Use LowerCaseTokenizer(Version, Reader) instead. This will be removed in Lucene 4.0.
LowerCaseTokenizer(Version matchVersion, AttributeSource.AttributeFactory factory, Reader in)
          Construct a new LowerCaseTokenizer using a given AttributeSource.AttributeFactory.
LowerCaseTokenizer(Version matchVersion, AttributeSource source, Reader in)
          Construct a new LowerCaseTokenizer using a given AttributeSource.
LowerCaseTokenizer(Version matchVersion, Reader in)
          Construct a new LowerCaseTokenizer.
 
Method Summary
protected  int normalize(int c)
          Converts the character to lower case using Character.toLowerCase(int).
 
Methods inherited from class org.apache.lucene.analysis.LetterTokenizer
isTokenChar
 
Methods inherited from class org.apache.lucene.analysis.CharTokenizer
end, incrementToken, isTokenChar, normalize, reset
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
reset
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

LowerCaseTokenizer

public LowerCaseTokenizer(Version matchVersion,
                          Reader in)
Construct a new LowerCaseTokenizer.

Parameters:
matchVersion - Lucene version to match; see above
in - the input to split up into tokens

LowerCaseTokenizer

public LowerCaseTokenizer(Version matchVersion,
                          AttributeSource source,
                          Reader in)
Construct a new LowerCaseTokenizer using a given AttributeSource.

Parameters:
matchVersion - Lucene version to match; see above
source - the attribute source to use for this Tokenizer
in - the input to split up into tokens

LowerCaseTokenizer

public LowerCaseTokenizer(Version matchVersion,
                          AttributeSource.AttributeFactory factory,
                          Reader in)
Construct a new LowerCaseTokenizer using a given AttributeSource.AttributeFactory.

Parameters:
matchVersion - Lucene version to match; see above
factory - the attribute factory to use for this Tokenizer
in - the input to split up into tokens

LowerCaseTokenizer

@Deprecated
public LowerCaseTokenizer(Reader in)
Deprecated. Use LowerCaseTokenizer(Version, Reader) instead. This will be removed in Lucene 4.0.

Construct a new LowerCaseTokenizer.


LowerCaseTokenizer

@Deprecated
public LowerCaseTokenizer(AttributeSource source,
                          Reader in)
Deprecated. Use LowerCaseTokenizer(Version, AttributeSource, Reader) instead. This will be removed in Lucene 4.0.

Construct a new LowerCaseTokenizer using a given AttributeSource.


LowerCaseTokenizer

@Deprecated
public LowerCaseTokenizer(AttributeSource.AttributeFactory factory,
                          Reader in)
Deprecated. Use LowerCaseTokenizer(Version, AttributeSource.AttributeFactory, Reader) instead. This will be removed in Lucene 4.0.

Construct a new LowerCaseTokenizer using a given AttributeSource.AttributeFactory.

Method Detail

normalize

protected int normalize(int c)
Converts the character to lower case using Character.toLowerCase(int).

Overrides:
normalize in class CharTokenizer
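
Because normalize takes an int rather than a char, it can handle code points outside the Basic Multilingual Plane. The following stand-alone sketch shows what a call to Character.toLowerCase(int) does for such input. NormalizeSketch and normalizeAll are hypothetical names used for illustration; only Character.toLowerCase(int) is taken from the description above.

```java
// Hypothetical illustration only; not part of org.apache.lucene.analysis.
final class NormalizeSketch {

    // Stand-in for the normalize(int) hook: lower-cases one code point.
    static int normalize(int c) {
        return Character.toLowerCase(c);
    }

    // Applies normalize to every code point of a string, not to every char,
    // so surrogate pairs are treated as a single character.
    static String normalizeAll(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(normalize(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints lucene
        System.out.println(normalizeAll("LuCeNe"));
        // U+10400 (DESERET CAPITAL LETTER LONG I) maps to U+10428; prints 10428
        System.out.println(Integer.toHexString(normalize(0x10400)));
    }
}
```

A char-based conversion would see the two surrogate halves of U+10400 separately and leave them unchanged, which is the case the int-based signature avoids.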


Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.