org.apache.lucene.analysis.standard
Class ClassicTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.standard.ClassicTokenizer
- All Implemented Interfaces:
- Closeable
public final class ClassicTokenizer
- extends Tokenizer
A grammar-based tokenizer constructed with JFlex.
This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a
dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case
the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
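The three rules above can be approximated with a few ordered regular expressions. The following is a minimal, stdlib-only sketch for illustration. It does not use Lucene and is not the JFlex grammar. The class name `ClassicRulesSketch`, the method `tokenize`, and the ASCII-only patterns are assumptions made for this example, so results on other inputs may differ from what ClassicTokenizer produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
final class ClassicRulesSketch {

    // Alternatives are tried in order at each position; everything else is
    // treated as a separator and dropped. Simplified, ASCII-only patterns.
    private static final Pattern TOKEN = Pattern.compile(
        // email address kept as one token
        "[A-Za-z0-9._]+@[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+"
        // hostname: a dot followed by more letters/digits stays inside the token
        + "|[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+"
        // hyphenated token containing a digit: treated as a product number, not split
        + "|(?=[A-Za-z0-9-]*[0-9])[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+"
        // plain word
        + "|[A-Za-z0-9]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Mail, bob@example.com, about, XJ-9000, on, www.apache.org, well, known]
        System.out.println(tokenize(
            "Mail bob@example.com about XJ-9000, on www.apache.org. well-known"));
    }
}
```

In this sketch the trailing dot after `www.apache.org` is dropped because it is followed by whitespace, `XJ-9000` survives intact because it contains a digit, and `well-known` is split at the hyphen.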
Many applications have specific tokenizer needs. If this tokenizer does
not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.
ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1.
As of 3.1, StandardTokenizer implements Unicode text segmentation, as specified by UAX#29.
Fields inherited from class org.apache.lucene.analysis.Tokenizer:
input
Methods inherited from class org.apache.lucene.util.AttributeSource:
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
ALPHANUM
public static final int ALPHANUM
- See Also:
- Constant Field Values
APOSTROPHE
public static final int APOSTROPHE
- See Also:
- Constant Field Values
ACRONYM
public static final int ACRONYM
- See Also:
- Constant Field Values
COMPANY
public static final int COMPANY
- See Also:
- Constant Field Values
EMAIL
public static final int EMAIL
- See Also:
- Constant Field Values
HOST
public static final int HOST
- See Also:
- Constant Field Values
NUM
public static final int NUM
- See Also:
- Constant Field Values
CJ
public static final int CJ
- See Also:
- Constant Field Values
ACRONYM_DEP
public static final int ACRONYM_DEP
- See Also:
- Constant Field Values
TOKEN_TYPES
public static final String[] TOKEN_TYPES
- String token types that correspond to the token type int constants.
ClassicTokenizer
public ClassicTokenizer(Version matchVersion,
Reader input)
- Creates a new instance of the ClassicTokenizer. Attaches the input to the newly created JFlex scanner.
- Parameters:
input - The input reader
See http://issues.apache.org/jira/browse/LUCENE-1068
ClassicTokenizer
public ClassicTokenizer(Version matchVersion,
AttributeSource source,
Reader input)
- Creates a new ClassicTokenizer with a given AttributeSource.
ClassicTokenizer
public ClassicTokenizer(Version matchVersion,
AttributeSource.AttributeFactory factory,
Reader input)
- Creates a new ClassicTokenizer with a given AttributeSource.AttributeFactory.
setMaxTokenLength
public void setMaxTokenLength(int length)
- Sets the maximum allowed token length. Any token longer
than this is skipped.
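Note that an over-long token is skipped rather than truncated. The following stdlib-only sketch illustrates that distinction on a plain list of strings. The class `MaxTokenLengthSketch` and the helper `skipOverlong` are hypothetical names for this example and are not part of Lucene; the tokenizer itself applies the limit while scanning its input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class MaxTokenLengthSketch {

    // Keeps tokens whose length is at most maxTokenLength and drops
    // longer ones entirely. Nothing is cut down to fit.
    static List<String> skipOverlong(List<String> tokens, int maxTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("short", "supercalifragilistic", "ok");
        // prints [short, ok]
        System.out.println(skipOverlong(tokens, 5));
    }
}
```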
getMaxTokenLength
public int getMaxTokenLength()
- See Also:
setMaxTokenLength(int)
incrementToken
public final boolean incrementToken()
throws IOException
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
end
public final void end()
- Overrides:
end in class TokenStream
reset
public void reset()
throws IOException
- Overrides:
reset in class TokenStream
- Throws:
IOException
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.