Class ClassicTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.classic.ClassicTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
A grammar-based tokenizer constructed with JFlex.
This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
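The rules above can be observed by running the tokenizer over a short string. The following is a minimal sketch, assuming the Lucene analysis module that provides `org.apache.lucene.analysis.classic` is on the classpath. The class name `ClassicTokenizerDemo`, the helper `tokenize`, and the sample text are illustrative and not part of the API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.classic.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ClassicTokenizerDemo {

    // Hypothetical helper: returns the term text of every token produced for the input.
    static List<String> tokenize(String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (ClassicTokenizer tokenizer = new ClassicTokenizer()) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();                    // must be called before incrementToken()
            while (tokenizer.incrementToken()) {
                terms.add(term.toString());
            }
            tokenizer.end();                      // finalizes end-of-stream state
        }                                         // try-with-resources calls close()
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // Sample text (an assumption) mixing punctuation, a hyphenated
        // product-style token, an email address and a hostname.
        String sample = "Mail support@example.com about the XL-5000 at www.example.com today.";
        for (String term : tokenize(sample)) {
            System.out.println(term);
        }
    }
}
```

Printing the terms for your own sample documents is a quick way to check whether the hyphen and dot handling suits your data before committing to this tokenizer.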
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, StandardTokenizer implements Unicode text segmentation, as specified by UAX#29.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Modifier and Type        Field          Description
static final int         ACRONYM
static final int         ACRONYM_DEP
static final int         ALPHANUM
static final int         APOSTROPHE
static final int         CJ
static final int         COMPANY
static final int         EMAIL
static final int         HOST
static final int         NUM
static final String[]    TOKEN_TYPES    String token types that correspond to token type int constants
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor                                   Description
ClassicTokenizer()                            Creates a new instance of the ClassicTokenizer.
ClassicTokenizer(AttributeFactory factory)    Creates a new ClassicTokenizer with a given AttributeFactory
-
Method Summary
Modifier and Type    Method                           Description
void                 close()
final void           end()
int                  getMaxTokenLength()
final boolean        incrementToken()
void                 reset()
void                 setMaxTokenLength(int length)    Set the max allowed token length.
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
ALPHANUM
public static final int ALPHANUM
-
APOSTROPHE
public static final int APOSTROPHE
-
ACRONYM
public static final int ACRONYM
-
COMPANY
public static final int COMPANY
-
EMAIL
public static final int EMAIL
-
HOST
public static final int HOST
-
NUM
public static final int NUM
-
CJ
public static final int CJ
-
ACRONYM_DEP
public static final int ACRONYM_DEP
-
TOKEN_TYPES
public static final String[] TOKEN_TYPES
String token types that correspond to token type int constants
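As a hedged illustration of how the int constants relate to `TOKEN_TYPES`, the sketch below reads the type string of each token through `TypeAttribute` and compares it with the array entry at the matching constant. The class name `TokenTypeDemo` and the helper `tokenTypes` are hypothetical, and the sample inputs are assumptions chosen for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.classic.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class TokenTypeDemo {

    // Hypothetical helper: returns the type string reported for every token.
    static List<String> tokenTypes(String text) throws IOException {
        List<String> types = new ArrayList<>();
        try (ClassicTokenizer tokenizer = new ClassicTokenizer()) {
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                types.add(type.type());
            }
            tokenizer.end();
        }
        return types;
    }

    public static void main(String[] args) throws IOException {
        // The int constants index into TOKEN_TYPES.
        String emailType = ClassicTokenizer.TOKEN_TYPES[ClassicTokenizer.EMAIL];
        String wordType = ClassicTokenizer.TOKEN_TYPES[ClassicTokenizer.ALPHANUM];

        System.out.println(tokenTypes("test@example.com").contains(emailType));
        System.out.println(tokenTypes("hello").contains(wordType));
    }
}
```

Looking types up through the constants avoids hard-coding the type strings in downstream filters.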
-
-
Constructor Details
-
ClassicTokenizer
public ClassicTokenizer()
Creates a new instance of the ClassicTokenizer. Attaches the input to the newly created JFlex scanner.
See http://issues.apache.org/jira/browse/LUCENE-1068
-
ClassicTokenizer
public ClassicTokenizer(AttributeFactory factory)
Creates a new ClassicTokenizer with a given AttributeFactory
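A minimal sketch of the factory-taking constructor, assuming the inherited `DEFAULT_TOKEN_ATTRIBUTE_FACTORY` field listed in the Field Summary is an acceptable factory for your use. The class name `FactoryConstructorDemo` and the method `withDefaultFactory` are hypothetical.

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.classic.ClassicTokenizer;
import org.apache.lucene.util.AttributeFactory;

public class FactoryConstructorDemo {

    // Hypothetical helper: builds a tokenizer with an explicitly chosen AttributeFactory.
    static ClassicTokenizer withDefaultFactory() {
        AttributeFactory factory = TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY;
        return new ClassicTokenizer(factory);
    }

    public static void main(String[] args) throws Exception {
        try (ClassicTokenizer tokenizer = withDefaultFactory()) {
            // getAttributeFactory() is inherited from AttributeSource.
            System.out.println(tokenizer.getAttributeFactory() == TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY);
        }
    }
}
```

Passing a factory explicitly is mainly useful when an application supplies its own attribute implementations. Otherwise the no-argument constructor is the simpler choice.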
-
-
Method Details
-
setMaxTokenLength
public void setMaxTokenLength(int length)
Set the max allowed token length. Any token longer than this is skipped.
-
getMaxTokenLength
public int getMaxTokenLength()
- See Also:
setMaxTokenLength(int)
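To illustrate the skipping behavior described for `setMaxTokenLength`, here is a small sketch that lowers the limit and collects the surviving terms. The class name `MaxTokenLengthDemo`, the helper `tokenizeWithMax`, and the sample input are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.classic.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MaxTokenLengthDemo {

    // Hypothetical helper: tokenizes text after setting the max allowed token length.
    static List<String> tokenizeWithMax(String text, int maxLength) throws IOException {
        List<String> terms = new ArrayList<>();
        try (ClassicTokenizer tokenizer = new ClassicTokenizer()) {
            tokenizer.setMaxTokenLength(maxLength);
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                terms.add(term.toString());
            }
            tokenizer.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // With a limit of 5, "toolongword" (11 characters) is skipped, not truncated.
        System.out.println(tokenizeWithMax("a toolongword ok", 5)); // prints [a, ok]
    }
}
```

Because over-long tokens are dropped rather than shortened, a limit that is too low silently removes terms from the index.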
-
incrementToken
public final boolean incrementToken() throws IOException
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
end
public final void end() throws IOException
- Overrides:
end in class TokenStream
- Throws:
IOException
-
close
public void close() throws IOException
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class Tokenizer
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides:
reset in class Tokenizer
- Throws:
IOException
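The methods above are normally called in a fixed order: `setReader`, `reset`, repeated `incrementToken`, `end`, then `close`. The sketch below reuses one tokenizer instance across several inputs by repeating that cycle. It assumes the usual Lucene consumer workflow. The class name `ReuseDemo` and the helper `tokenizeAll` are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.classic.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ReuseDemo {

    // Hypothetical helper: tokenizes each text with the same tokenizer instance.
    static List<List<String>> tokenizeAll(List<String> texts) throws IOException {
        List<List<String>> result = new ArrayList<>();
        ClassicTokenizer tokenizer = new ClassicTokenizer();
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        for (String text : texts) {
            List<String> terms = new ArrayList<>();
            tokenizer.setReader(new StringReader(text)); // supply the next input
            try {
                tokenizer.reset();                       // prepare the scanner for the new reader
                while (tokenizer.incrementToken()) {
                    terms.add(term.toString());
                }
                tokenizer.end();
            } finally {
                tokenizer.close();                       // required before the next setReader()
            }
            result.add(terms);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenizeAll(List.of("first input", "second input")));
    }
}
```

Closing after each input releases the reader while keeping the tokenizer and its attributes available for the next document.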
-