Class StandardTokenizerImpl
java.lang.Object
org.apache.lucene.analysis.standard.StandardTokenizerImpl
This class implements Word Break rules from the Unicode Text Segmentation
algorithm, as specified in
Unicode Standard Annex #29.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
- <EMOJI>: A sequence of Emoji characters
-
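As a rough illustration of the script-based token types listed above, the following sketch labels single code points using java.lang.Character.UnicodeScript. It is only an approximation under stated assumptions: the class name TokenTypeSketch and the helper labelFor are hypothetical, and the real scanner applies the UAX #29 word break rules to whole sequences (so "a7" would be one <ALPHANUM> token) rather than classifying characters one at a time.

```java
import java.lang.Character.UnicodeScript;

// Toy approximation only: StandardTokenizerImpl is a generated UAX #29
// state machine and does not use per-character script lookups like this.
final class TokenTypeSketch {

    /** Returns a token-type label for one code point, or null if not modelled (hypothetical helper). */
    static String labelFor(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        switch (script) {
            case HIRAGANA: return "<HIRAGANA>";
            case KATAKANA: return "<KATAKANA>";
            case HAN:      return "<IDEOGRAPHIC>";
            case HANGUL:   return "<HANGUL>";
            case THAI:
            case LAO:
            case MYANMAR:
            case KHMER:    return "<SOUTHEAST_ASIAN>";
            default:
                if (Character.isDigit(codePoint))  return "<NUM>";
                if (Character.isLetter(codePoint)) return "<ALPHANUM>";
                return null; // punctuation, whitespace, emoji, etc. are not modelled here
        }
    }

    public static void main(String[] args) {
        // Latin letter, digit, hiragana, katakana, Han, Hangul, Thai
        int[] samples = {'a', '7', 0x3042, 0x30AB, 0x4E2D, 0xD55C, 0x0E01};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X -> %s", cp, labelFor(cp)));
        }
    }
}
```

The sketch deliberately leaves out <EMOJI> and sequence handling, which depend on the full segmentation rules.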
Field Summary
Modifier and Type / Field / Description
- static final int EMOJI_TYPE: Emoji token type
- static final int HANGUL_TYPE: Hangul token type
- static final int HIRAGANA_TYPE: Hiragana token type
- static final int IDEOGRAPHIC_TYPE: Ideographic token type
- static final int KATAKANA_TYPE: Katakana token type
- static final int NUMERIC_TYPE: Numbers
- static final int SOUTH_EAST_ASIAN_TYPE: Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
- static final int WORD_TYPE: Alphanumeric sequences
- static final int YYEOF: This character denotes the end of file.
- static final int YYINITIAL: Lexical States.
-
Constructor Summary
- StandardTokenizerImpl(Reader in): Creates a new scanner
-
Method Summary
Modifier and Type / Method / Description
- int getNextToken(): Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
- final void getText(CharTermAttribute t): Fills CharTermAttribute with the current token text.
- final void setBufferSize(int numChars): Sets the scanner buffer size in chars
- final boolean yyatEOF(): Returns whether the scanner has reached the end of the reader it reads from.
- final void yybegin(int newState): Enters a new lexical state.
- final int yychar(): Character count processed so far
- final char yycharat(int position): Returns the character at the given position from the matched text.
- final void yyclose(): Closes the input reader.
- final int yylength(): How many characters were matched.
- void yypushback(int number): Pushes the specified amount of characters back into the input stream.
- final void yyreset(Reader reader): Resets the scanner to read from a new input stream.
- final int yystate(): Returns the current lexical state.
- final String yytext(): Returns the text matched by the current regular expression.
-
Field Details
-
YYEOF
public static final int YYEOF
This character denotes the end of file.
-
YYINITIAL
public static final int YYINITIAL
Lexical States.
-
WORD_TYPE
public static final int WORD_TYPE
Alphanumeric sequences
-
NUMERIC_TYPE
public static final int NUMERIC_TYPE
Numbers
-
SOUTH_EAST_ASIAN_TYPE
public static final int SOUTH_EAST_ASIAN_TYPE
Characters in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
-
IDEOGRAPHIC_TYPE
public static final int IDEOGRAPHIC_TYPE
Ideographic token type
-
HIRAGANA_TYPE
public static final int HIRAGANA_TYPE
Hiragana token type
-
KATAKANA_TYPE
public static final int KATAKANA_TYPE
Katakana token type
-
HANGUL_TYPE
public static final int HANGUL_TYPE
Hangul token type
-
EMOJI_TYPE
public static final int EMOJI_TYPE
Emoji token type
-
-
Constructor Details
-
StandardTokenizerImpl
public StandardTokenizerImpl(Reader in)
Creates a new scanner.
Parameters:
in - the java.io.Reader to read input from.
-
-
Method Details
-
yychar
public final int yychar()
Character count processed so far
-
getText
public final void getText(CharTermAttribute t)
Fills CharTermAttribute with the current token text.
-
setBufferSize
public final void setBufferSize(int numChars)
Sets the scanner buffer size in chars
-
yyclose
public final void yyclose() throws IOException
Closes the input reader.
Throws:
IOException - if the reader could not be closed.
-
yyreset
public final void yyreset(Reader reader)
Resets the scanner to read from a new input stream. Does not close the old reader.
All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). Lexical state is set to ZZ_INITIAL.
The internal scan buffer is resized down to its initial length, if it has grown.
Parameters:
reader - The new input stream.
-
yyatEOF
public final boolean yyatEOF()
Returns whether the scanner has reached the end of the reader it reads from.
Returns:
whether the scanner has reached EOF.
-
yystate
public final int yystate()
Returns the current lexical state.
Returns:
the current lexical state.
-
yybegin
public final void yybegin(int newState)
Enters a new lexical state.
Parameters:
newState - the new lexical state
-
yytext
public final String yytext()
Returns the text matched by the current regular expression.
Returns:
the matched text.
-
yycharat
public final char yycharat(int position)
Returns the character at the given position from the matched text. It is equivalent to yytext().charAt(position), but faster.
Parameters:
position - the position of the character to fetch. A value from 0 to yylength()-1.
Returns:
the character at position.
-
yylength
public final int yylength()
How many characters were matched.
Returns:
the length of the matched text region.
-
yypushback
public void yypushback(int number)
Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
Parameters:
number - the number of characters to be read again. This number must not be greater than yylength().
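The relationship between yytext(), yylength(), yycharat() and yypushback() can be sketched with a toy scanner that tracks the matched region as two indices into a buffer. This is a simplified model under stated assumptions, not the generated Lucene code: the class PushbackSketch, its fields, and its whitespace-delimited next() method are hypothetical stand-ins for the real scanning method.

```java
// Toy model of the matched-region bookkeeping. All names other than the
// yy* method names are hypothetical; this is not Lucene's generated scanner.
final class PushbackSketch {
    private final char[] buffer;
    private int startRead = 0;  // start of the current match
    private int markedPos = 0;  // end of the current match (exclusive)

    PushbackSketch(String input) {
        this.buffer = input.toCharArray();
    }

    /** Matches the next run of non-space characters; returns false at end of input. */
    boolean next() {
        int pos = markedPos;
        while (pos < buffer.length && buffer[pos] == ' ') pos++; // skip separators
        if (pos == buffer.length) return false;
        startRead = pos;
        while (pos < buffer.length && buffer[pos] != ' ') pos++;
        markedPos = pos;
        return true;
    }

    int yylength() {
        return markedPos - startRead;
    }

    String yytext() {
        return new String(buffer, startRead, markedPos - startRead);
    }

    /** Reads straight from the buffer, so no String is allocated. */
    char yycharat(int position) {
        return buffer[startRead + position];
    }

    /** Shrinks the match; the dropped characters are scanned again by the next call to next(). */
    void yypushback(int number) {
        if (number > yylength()) {
            throw new IllegalArgumentException("pushback larger than matched text");
        }
        markedPos -= number;
    }

    public static void main(String[] args) {
        PushbackSketch s = new PushbackSketch("foobar baz");
        s.next();
        System.out.println(s.yytext()); // foobar
        s.yypushback(3);
        System.out.println(s.yytext()); // foo
        s.next();
        System.out.println(s.yytext()); // bar
    }
}
```

In this model, pushing back three characters of "foobar" leaves "foo" as the current match, and the following scan starts at "bar".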
-
getNextToken
public int getNextToken() throws IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
Returns:
the next token.
Throws:
IOException - if any I/O-Error occurs.
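A caller typically invokes getNextToken() in a loop until it returns YYEOF, reading the matched text after each call. The sketch below shows that loop against a toy stand-in scanner so that it runs without Lucene on the classpath. The class ScanLoopSketch, its constant values, and its letter-or-digit matching rule are assumptions made for illustration; only the method and constant names mirror the API documented here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Toy stand-in scanner; not the generated StandardTokenizerImpl.
final class ScanLoopSketch {
    // Hypothetical values for this sketch only.
    static final int YYEOF = -1;
    static final int WORD_TYPE = 0;
    static final int NUMERIC_TYPE = 1;

    private static final int NOTHING = -2; // marks an empty lookahead slot
    private final Reader in;
    private final StringBuilder text = new StringBuilder();
    private int pending = NOTHING;

    ScanLoopSketch(Reader in) {
        this.in = in;
    }

    private int read() throws IOException {
        if (pending != NOTHING) {
            int c = pending;
            pending = NOTHING;
            return c;
        }
        return in.read();
    }

    /** Returns the type of the next run of letters/digits, or YYEOF at end of input. */
    int getNextToken() throws IOException {
        int c = read();
        while (c != -1 && !Character.isLetterOrDigit(c)) c = read(); // skip non-token chars
        if (c == -1) return YYEOF;
        text.setLength(0);
        boolean allDigits = true;
        while (c != -1 && Character.isLetterOrDigit(c)) {
            allDigits &= Character.isDigit(c);
            text.append((char) c);
            c = read();
        }
        pending = c; // keep the lookahead character for the next call
        return allDigits ? NUMERIC_TYPE : WORD_TYPE;
    }

    String yytext() {
        return text.toString();
    }

    /** The consumer loop: scan until YYEOF, collecting "TYPE:text" entries. */
    static List<String> scanAll(String input) {
        ScanLoopSketch scanner = new ScanLoopSketch(new StringReader(input));
        List<String> out = new ArrayList<>();
        try {
            int type;
            while ((type = scanner.getNextToken()) != YYEOF) {
                out.add((type == NUMERIC_TYPE ? "NUM" : "ALPHANUM") + ":" + scanner.yytext());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

With the real class, the loop body would usually call getText(CharTermAttribute) instead of yytext() to avoid creating a String per token.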
-