java.lang.Object
- org.apache.lucene.analysis.standard.StandardTokenizerImpl

```
public final class StandardTokenizerImpl
extends Object
```
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
- <EMOJI>: A sequence of Emoji characters
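
As a rough illustration of word-boundary segmentation, the sketch below uses the JDK's `java.text.BreakIterator`, not this class. `BreakIterator` applies its own boundary rules, so its output may differ from the tokens listed above. The class name `WordBreakSketch` and the letter-or-digit filter are assumptions made for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustration only: segments text with the JDK BreakIterator, not with StandardTokenizerImpl. */
final class WordBreakSketch {

    /** Returns the segments between word boundaries that contain at least one letter or digit. */
    static List<String> words(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // Skip segments made up only of whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

Unlike this sketch, the scanner documented here also reports a type for each token, which callers can use to tell numbers, ideographs, and so on apart.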

Field Summary

Fields
Modifier and Type	Field	Description
`static int`	`EMOJI_TYPE`	Emoji token type
`static int`	`HANGUL_TYPE`	Hangul token type
`static int`	`HIRAGANA_TYPE`	Hiragana token type
`static int`	`IDEOGRAPHIC_TYPE`	Ideographic token type
`static int`	`KATAKANA_TYPE`	Katakana token type
`static int`	`NUMERIC_TYPE`	Numbers
`static int`	`SOUTH_EAST_ASIAN_TYPE`	Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
`static int`	`WORD_TYPE`	Alphanumeric sequences
`static int`	`YYEOF`	This character denotes the end of file.
`static int`	`YYINITIAL`	Lexical States.

Constructor Summary

Constructors
Constructor	Description
`StandardTokenizerImpl(Reader in)`	Creates a new scanner

Method Summary

Modifier and Type	Method	Description
`int`	`getNextToken()`	Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
`void`	`getText(CharTermAttribute t)`	Fills CharTermAttribute with the current token text.
`void`	`setBufferSize(int numChars)`	Sets the scanner buffer size in chars
`boolean`	`yyatEOF()`	Returns whether the scanner has reached the end of the reader it reads from.
`void`	`yybegin(int newState)`	Enters a new lexical state.
`int`	`yychar()`	Character count processed so far
`char`	`yycharat(int position)`	Returns the character at the given position from the matched text.
`void`	`yyclose()`	Closes the input reader.
`int`	`yylength()`	How many characters were matched.
`void`	`yypushback(int number)`	Pushes the specified amount of characters back into the input stream.
`void`	`yyreset(Reader reader)`	Resets the scanner to read from a new input stream.
`int`	`yystate()`	Returns the current lexical state.
`String`	`yytext()`	Returns the text matched by the current regular expression.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - YYEOF
```
public static final int YYEOF
```
    This character denotes the end of file.
    
    See Also:
    
    Constant Field Values
  - YYINITIAL
```
public static final int YYINITIAL
```
    Lexical States.
    
    See Also:
    
    Constant Field Values
  - WORD_TYPE
```
public static final int WORD_TYPE
```
    Alphanumeric sequences
    
    See Also:
    
    Constant Field Values
  - NUMERIC_TYPE
```
public static final int NUMERIC_TYPE
```
    Numbers
    
    See Also:
    
    Constant Field Values
  - SOUTH_EAST_ASIAN_TYPE
```
public static final int SOUTH_EAST_ASIAN_TYPE
```
    Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.
    See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
    
    See Also:
    
    Constant Field Values
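    The "kept together as a single token" behaviour can be approximated with the JDK alone. The sketch below groups consecutive characters whose `Character.UnicodeScript` is Thai, Lao, Myanmar, or Khmer into one run. Treating those four scripts as a proxy for `Line_Break = Complex_Context` is an assumption of the example (the real property is defined per code point and covers more scripts), and `ComplexContextSketch` is a hypothetical name, not part of Lucene.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Illustration only: approximates Complex_Context runs by Unicode script. */
final class ComplexContextSketch {
    // Assumption: these four scripts stand in for Line_Break = Complex_Context.
    private static final Set<UnicodeScript> SCRIPTS = EnumSet.of(
        UnicodeScript.THAI, UnicodeScript.LAO, UnicodeScript.MYANMAR, UnicodeScript.KHMER);

    /** Returns each maximal run of characters that belong to one of the scripts above. */
    static List<String> runs(String text) {
        List<String> result = new ArrayList<>();
        int runStart = -1; // -1 means "not currently inside a run"
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            boolean inSet = SCRIPTS.contains(UnicodeScript.of(codePoint));
            if (inSet && runStart < 0) {
                runStart = i; // a run begins here
            } else if (!inSet && runStart >= 0) {
                result.add(text.substring(runStart, i)); // the run ended just before i
                runStart = -1;
            }
            i += Character.charCount(codePoint);
        }
        if (runStart >= 0) {
            result.add(text.substring(runStart)); // run reaches the end of the input
        }
        return result;
    }
}
```

    No attempt is made to find word boundaries inside a run, which mirrors the reason given above for emitting the whole sequence as one token.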
  - IDEOGRAPHIC_TYPE
```
public static final int IDEOGRAPHIC_TYPE
```
    Ideographic token type
    
    See Also:
    
    Constant Field Values
  - HIRAGANA_TYPE
```
public static final int HIRAGANA_TYPE
```
    Hiragana token type
    
    See Also:
    
    Constant Field Values
  - KATAKANA_TYPE
```
public static final int KATAKANA_TYPE
```
    Katakana token type
    
    See Also:
    
    Constant Field Values
  - HANGUL_TYPE
```
public static final int HANGUL_TYPE
```
    Hangul token type
    
    See Also:
    
    Constant Field Values
  - EMOJI_TYPE
```
public static final int EMOJI_TYPE
```
    Emoji token type
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - StandardTokenizerImpl
```
public StandardTokenizerImpl(Reader in)
```
    Creates a new scanner
    
    Parameters:
    
    in - the java.io.Reader to read input from.
- Method Detail
  - yychar
```
public final int yychar()
```
    Character count processed so far
  - getText
```
public final void getText(CharTermAttribute t)
```
    Fills CharTermAttribute with the current token text.
  - setBufferSize
```
public final void setBufferSize(int numChars)
```
    Sets the scanner buffer size in chars
  - yyclose
```
public final void yyclose()
                   throws IOException
```
    Closes the input reader.
    
    Throws:
    
    IOException - if the reader could not be closed.
  - yyreset
```
public final void yyreset(Reader reader)
```
    Resets the scanner to read from a new input stream.
    Does not close the old reader.
    All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL.
    Internal scan buffer is resized down to its initial length, if it has grown.
    
    Parameters:
    
    reader - The new input stream.
  - yyatEOF
```
public final boolean yyatEOF()
```
    Returns whether the scanner has reached the end of the reader it reads from.
    
    Returns:
    
    whether the scanner has reached EOF.
  - yystate
```
public final int yystate()
```
    Returns the current lexical state.
    
    Returns:
    
    the current lexical state.
  - yybegin
```
public final void yybegin(int newState)
```
    Enters a new lexical state.
    
    Parameters:
    
    newState - the new lexical state
  - yytext
```
public final String yytext()
```
    Returns the text matched by the current regular expression.
    
    Returns:
    
    the matched text.
  - yycharat
```
public final char yycharat(int position)
```
    Returns the character at the given position from the matched text.
    It is equivalent to yytext().charAt(position), but faster.
    
    Parameters:
    
    position - the position of the character to fetch. A value from 0 to yylength()-1.
    
    Returns:
    
    the character at position.
  - yylength
```
public final int yylength()
```
    How many characters were matched.
    
    Returns:
    
    the length of the matched text region.
  - yypushback
```
public void yypushback(int number)
```
    Pushes the specified amount of characters back into the input stream.
    They will be read again by the next call of the scanning method.
    
    Parameters:
    
    number - the number of characters to be read again. This number must not be greater than yylength().
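    To make the relationship between `yytext()`, `yycharat()`, `yylength()` and `yypushback()` concrete, here is a toy model of a matched region over a character buffer. It is a sketch under stated assumptions, not the generated scanner: the class `MatchWindowSketch`, its fields, and the `match(int)` helper are hypothetical and exist only for the example.

```java
/** Hypothetical toy model of a scanner's matched region; not the generated scanner code. */
final class MatchWindowSketch {
    private final char[] buffer;
    private int startRead; // index of the first matched char
    private int markedPos; // index just past the last matched char

    MatchWindowSketch(String input) {
        this.buffer = input.toCharArray();
    }

    /** Pretends a rule matched the next {@code length} chars, starting where the last match ended. */
    void match(int length) {
        if (length < 0 || markedPos + length > buffer.length) {
            throw new IllegalArgumentException("match runs past the end of the input");
        }
        startRead = markedPos;
        markedPos += length;
    }

    int yylength() {
        return markedPos - startRead;
    }

    String yytext() {
        return new String(buffer, startRead, yylength());
    }

    /** Same result as yytext().charAt(position), without building a String. */
    char yycharat(int position) {
        return buffer[startRead + position];
    }

    /** Shrinks the match from the right so the chars are seen again by the next match. */
    void yypushback(int number) {
        if (number < 0 || number > yylength()) {
            throw new IllegalArgumentException("number must be between 0 and yylength()");
        }
        markedPos -= number;
    }
}
```

    In this model, matching six characters of `"foobar baz"` and then pushing back three leaves `yytext()` as `"foo"`, and the next three-character match yields `"bar"`.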
  - getNextToken
```
public int getNextToken()
                 throws IOException
```
    Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
    
    Returns:
    
    the next token.
    
    Throws:
    
    IOException - if any I/O error occurs.
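    A typical way to consume a scanner of this shape is to call `getNextToken()` until it returns `YYEOF`. So that the sketch compiles without Lucene on the classpath, it runs against a hypothetical `Scanner` interface that mirrors only the `getNextToken()` and `yytext()` signatures documented here. The value `-1` for `YYEOF`, the token type `0`, and the whitespace-splitting fake are assumptions of the example; real code would use `StandardTokenizerImpl.YYEOF` and would likely read the text through `getText(CharTermAttribute)`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustration only: a scan loop written against a stand-in interface. */
final class ScanLoopSketch {

    /** Hypothetical stand-in exposing two of the methods documented above. */
    interface Scanner {
        int YYEOF = -1; // assumed value for this sketch
        int getNextToken() throws IOException;
        String yytext();
    }

    /** Drains the scanner, recording "type:text" for every token until YYEOF is returned. */
    static List<String> drain(Scanner scanner) {
        List<String> tokens = new ArrayList<>();
        try {
            for (int type = scanner.getNextToken(); type != Scanner.YYEOF; type = scanner.getNextToken()) {
                tokens.add(type + ":" + scanner.yytext());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    /** Fake scanner that splits on whitespace and reports every token as type 0. */
    static Scanner whitespaceScanner(String input) {
        List<String> parts = new ArrayList<>();
        for (String part : input.split("\\s+")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        Iterator<String> remaining = parts.iterator();
        return new Scanner() {
            private String current = "";

            @Override
            public int getNextToken() {
                if (!remaining.hasNext()) {
                    return YYEOF;
                }
                current = remaining.next();
                return 0;
            }

            @Override
            public String yytext() {
                return current;
            }
        };
    }
}
```

    Wrapping the `IOException` in an `UncheckedIOException` is a design choice made for this example; callers that can handle the checked exception may prefer to let it propagate.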
