java.lang.Object

org.apache.lucene.analysis.standard.StandardTokenizerImpl

public final class StandardTokenizerImpl extends Object

This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
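For readers unfamiliar with word-boundary segmentation, the idea can be tried with the JDK alone. The sketch below uses java.text.BreakIterator as a stand-in; it is not the scanner documented here, its boundary rules may differ from this class in detail, and the class name WordBreakSketch and the helper words are illustrative assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakSketch {
    /** Splits text at the JDK's word boundaries and keeps segments that contain a letter or digit. */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Drop segments made only of whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

The filter step mirrors the fact that a tokenizer emits only word-like segments, while the boundary iterator also reports the gaps between them.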

Tokens produced are of the following types:

<ALPHANUM>: A sequence of alphabetic and numeric characters
<NUM>: A number
<SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
<IDEOGRAPHIC>: A single CJKV ideographic character
<HIRAGANA>: A single hiragana character
<KATAKANA>: A sequence of katakana characters
<HANGUL>: A sequence of Hangul characters
<EMOJI>: A sequence of Emoji characters
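As a rough illustration of how these categories relate to Unicode scripts, the sketch below labels a single code point using java.lang.Character.UnicodeScript. It is only an approximation under stated assumptions: the real scanner classifies whole sequences with UAX #29 rules rather than single characters, emoji are left out, and the names TokenTypeSketch and roughType are hypothetical.

```java
import java.lang.Character.UnicodeScript;

final class TokenTypeSketch {
    /** Returns an approximate token-type label for one code point, or null if none applies. */
    static String roughType(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        switch (script) {
            case HAN:
                return "<IDEOGRAPHIC>";
            case HIRAGANA:
                return "<HIRAGANA>";
            case KATAKANA:
                return "<KATAKANA>";
            case HANGUL:
                return "<HANGUL>";
            case THAI:
            case LAO:
            case MYANMAR:
            case KHMER:
                return "<SOUTHEAST_ASIAN>";
            default:
                // Assumption: a lone digit maps to <NUM>, any other letter to <ALPHANUM>.
                if (Character.isDigit(codePoint)) return "<NUM>";
                if (Character.isLetter(codePoint)) return "<ALPHANUM>";
                return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(roughType('a'));
        System.out.println(roughType(0x4E2D)); // a CJK ideograph
    }
}
```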

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final int | EMOJI_TYPE | Emoji token type |
| static final int | HANGUL_TYPE | Hangul token type |
| static final int | HIRAGANA_TYPE | Hiragana token type |
| static final int | IDEOGRAPHIC_TYPE | Ideographic token type |
| static final int | KATAKANA_TYPE | Katakana token type |
| static final int | NUMERIC_TYPE | Numbers |
| static final int | SOUTH_EAST_ASIAN_TYPE | Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). |
| static final int | WORD_TYPE | Alphanumeric sequences |
| static final int | YYEOF | This character denotes the end of file. |
| static final int | YYINITIAL | Lexical States. |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| StandardTokenizerImpl(Reader in) | Creates a new scanner |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int | getNextToken() | Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. |
| final void | getText(CharTermAttribute t) | Fills CharTermAttribute with the current token text. |
| final void | setBufferSize(int numChars) | Sets the scanner buffer size in chars |
| final boolean | yyatEOF() | Returns whether the scanner has reached the end of the reader it reads from. |
| final void | yybegin(int newState) | Enters a new lexical state. |
| final int | yychar() | Character count processed so far |
| final char | yycharat(int position) | Returns the character at the given position from the matched text. |
| final void | yyclose() | Closes the input reader. |
| final int | yylength() | How many characters were matched. |
| void | yypushback(int number) | Pushes the specified amount of characters back into the input stream. |
| final void | yyreset(Reader reader) | Resets the scanner to read from a new input stream. |
| final int | yystate() | Returns the current lexical state. |
| final String | yytext() | Returns the text matched by the current regular expression. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- YYEOF
  
  public static final int YYEOF
  
  This character denotes the end of file.
  See Also:
  
  Constant Field Values
- YYINITIAL
  
  public static final int YYINITIAL
  
  Lexical States.
  See Also:
  
  Constant Field Values
- WORD_TYPE
  
  public static final int WORD_TYPE
  
  Alphanumeric sequences
  See Also:
  
  Constant Field Values
- NUMERIC_TYPE
  
  public static final int NUMERIC_TYPE
  
  Numbers
  See Also:
  
  Constant Field Values
- SOUTH_EAST_ASIAN_TYPE
  
  public static final int SOUTH_EAST_ASIAN_TYPE
  
  Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.
  See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
  See Also:
  
  Constant Field Values
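The "kept together as a single token" behaviour described for SOUTH_EAST_ASIAN_TYPE can be pictured with a small JDK-only sketch. It assumes that membership in the Thai, Lao, Myanmar, or Khmer script is an acceptable stand-in for the Line_Break = Complex_Context property, which is a simplification; ComplexContextRunSketch and leadingRun are hypothetical names.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

final class ComplexContextRunSketch {
    // Assumption: script membership approximates Line_Break = Complex_Context.
    private static final Set<UnicodeScript> SCRIPTS = EnumSet.of(
            UnicodeScript.THAI, UnicodeScript.LAO, UnicodeScript.MYANMAR, UnicodeScript.KHMER);

    /** Returns the longest prefix of text whose code points all belong to SCRIPTS. */
    static String leadingRun(String text) {
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!SCRIPTS.contains(UnicodeScript.of(cp))) {
                break; // first code point outside the scripts ends the run
            }
            i += Character.charCount(cp);
        }
        return text.substring(0, i);
    }

    public static void main(String[] args) {
        // Seven Thai characters followed by Latin text; the run stops at the space.
        String run = leadingRun("\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22 abc");
        System.out.println(run.length());
    }
}
```

No attempt is made to find word boundaries inside the run, which is the point of the field's description: the whole run is treated as one unit.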
- IDEOGRAPHIC_TYPE
  
  public static final int IDEOGRAPHIC_TYPE
  
  Ideographic token type
  See Also:
  
  Constant Field Values
- HIRAGANA_TYPE
  
  public static final int HIRAGANA_TYPE
  
  Hiragana token type
  See Also:
  
  Constant Field Values
- KATAKANA_TYPE
  
  public static final int KATAKANA_TYPE
  
  Katakana token type
  See Also:
  
  Constant Field Values
- HANGUL_TYPE
  
  public static final int HANGUL_TYPE
  
  Hangul token type
  See Also:
  
  Constant Field Values
- EMOJI_TYPE
  
  public static final int EMOJI_TYPE
  
  Emoji token type
  See Also:
  
  Constant Field Values
Constructor Details
- StandardTokenizerImpl
  
  public StandardTokenizerImpl(Reader in)
  
  Creates a new scanner
  
  Parameters:
  
  in - the java.io.Reader to read input from.
Method Details
- yychar
  
  public final int yychar()
  
  Character count processed so far
- getText
  
  public final void getText(CharTermAttribute t)
  
  Fills CharTermAttribute with the current token text.
- setBufferSize
  
  public final void setBufferSize(int numChars)
  
  Sets the scanner buffer size in chars
- yyclose
  
  public final void yyclose() throws IOException
  
  Closes the input reader.
  
  Throws:
  
  IOException - if the reader could not be closed.
- yyreset
  
  public final void yyreset(Reader reader)
  
  Resets the scanner to read from a new input stream.
  Does not close the old reader.
  All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL.
  Internal scan buffer is resized down to its initial length, if it has grown.
  
  Parameters:
  
  reader - The new input stream.
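The yyreset contract above has three observable effects: buffered input is dropped, the lexical state returns to its initial value, and the old reader is left open for the caller to close. The toy class below models just those effects; it is a hypothetical stand-in written for illustration (ResetSketch, fill, buffered, and the INITIAL constant are all assumptions), not the generated scanner.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ResetSketch {
    static final int INITIAL = 0; // hypothetical stand-in for YYINITIAL

    private Reader in;
    private StringBuilder buffer = new StringBuilder();
    private int state = INITIAL;

    ResetSketch(Reader in) {
        this.in = in;
    }

    void begin(int newState) { state = newState; }

    int state() { return state; }

    String buffered() { return buffer.toString(); }

    /** Reads one char into the buffer; returns false at end of input. */
    boolean fill() {
        try {
            int c = in.read();
            if (c < 0) return false;
            buffer.append((char) c);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Switches readers: buffer discarded, state back to INITIAL, old reader NOT closed. */
    void reset(Reader reader) {
        in = reader;
        buffer = new StringBuilder();
        state = INITIAL;
    }

    public static void main(String[] args) throws IOException {
        Reader first = new StringReader("one");
        ResetSketch scanner = new ResetSketch(first);
        scanner.fill();
        scanner.reset(new StringReader("two"));
        first.close(); // closing the old reader is the caller's job
        scanner.fill();
        System.out.println(scanner.buffered());
    }
}
```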
- yyatEOF
  
  public final boolean yyatEOF()
  
  Returns whether the scanner has reached the end of the reader it reads from.
  
  Returns:
  
  whether the scanner has reached EOF.
- yystate
  
  public final int yystate()
  
  Returns the current lexical state.
  
  Returns:
  
  the current lexical state.
- yybegin
  
  public final void yybegin(int newState)
  
  Enters a new lexical state.
  
  Parameters:
  
  newState - the new lexical state
- yytext
  
  public final String yytext()
  
  Returns the text matched by the current regular expression.
  
  Returns:
  
  the matched text.
- yycharat
  
  public final char yycharat(int position)
  
  Returns the character at the given position from the matched text.
  It is equivalent to yytext().charAt(position), but faster.
  
  Parameters:
  
  position - the position of the character to fetch. A value from 0 to yylength()-1.
  
  Returns:
  
  the character at position.
- yylength
  
  public final int yylength()
  
  How many characters were matched.
  
  Returns:
  
  the length of the matched text region.
- yypushback
  
  public void yypushback(int number)
  
  Pushes the specified amount of characters back into the input stream.
  They will be read again by the next call of the scanning method.
  
  Parameters:
  
  number - the number of characters to be read again. This number must not be greater than yylength().
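The interplay of yypushback with yytext and yylength may be easier to see in a toy model. The sketch below tracks a match as a start and end offset into a string and shrinks the end on pushback. PushbackSketch and its methods are hypothetical, and throwing IllegalArgumentException for an oversized argument is an assumption of this sketch, not documented behaviour of the scanner.

```java
final class PushbackSketch {
    private final String input;
    private int start;  // start offset of the current match
    private int marked; // end offset of the current match (exclusive)

    PushbackSketch(String input) {
        this.input = input;
    }

    /** Marks the next len chars (or fewer at end of input) as the current match. */
    void match(int len) {
        start = marked;
        marked = Math.min(input.length(), start + len);
    }

    /** Counterpart of yytext(): the currently matched text. */
    String text() {
        return input.substring(start, marked);
    }

    /** Counterpart of yylength(): the length of the current match. */
    int length() {
        return marked - start;
    }

    /** Gives the last number chars of the match back to the input. */
    void pushback(int number) {
        if (number < 0 || number > length()) {
            throw new IllegalArgumentException("pushback exceeds match length");
        }
        marked -= number; // the next match() starts at the new end offset
    }

    public static void main(String[] args) {
        PushbackSketch s = new PushbackSketch("foobar!");
        s.match(6);
        s.pushback(3);
        System.out.println(s.text());
        s.match(4);
        System.out.println(s.text());
    }
}
```

After the pushback, the matched text is shorter and the pushed-back characters are the first ones consumed by the following match.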
- getNextToken
  
  public int getNextToken() throws IOException
  
  Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
  
  Returns:
  
  the next token.
  
  Throws:
  
  IOException - if any I/O-Error occurs.
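A typical consumer calls getNextToken repeatedly until it returns YYEOF. The loop shape is sketched below against a hypothetical TokenSource interface that mirrors the two method names, so that the example compiles without Lucene on the classpath. The interface, the toy implementation, and the YYEOF value of -1 are assumptions of this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ScanLoopSketch {
    /** Hypothetical stand-in exposing the two scanner methods used below. */
    interface TokenSource {
        int getNextToken() throws IOException;
        String yytext();
    }

    static final int YYEOF = -1; // assumed end-of-input sentinel

    /** Collects "type:text" for every token until YYEOF is returned. */
    static List<String> drain(TokenSource scanner) {
        List<String> out = new ArrayList<>();
        try {
            for (int type = scanner.getNextToken(); type != YYEOF; type = scanner.getNextToken()) {
                out.add(type + ":" + scanner.yytext());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    /** Toy source over pre-split words; every token gets the made-up type 0. */
    static TokenSource over(String... words) {
        return new TokenSource() {
            private int next = 0;
            private String current = "";

            @Override
            public int getNextToken() {
                if (next >= words.length) return YYEOF;
                current = words[next++];
                return 0;
            }

            @Override
            public String yytext() {
                return current;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(drain(over("quick", "fox")));
    }
}
```

With the real class, the returned int would be compared against the token type constants listed under Field Details, and getText(CharTermAttribute) would typically replace yytext() to avoid allocating a String per token.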
