Class StandardTokenizerImpl

java.lang.Object
org.apache.lucene.analysis.standard.StandardTokenizerImpl

public final class StandardTokenizerImpl extends Object
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Tokens produced are of the following types:

  • <ALPHANUM>: A sequence of alphabetic and numeric characters
  • <NUM>: A number
  • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
  • <IDEOGRAPHIC>: A single CJKV ideographic character
  • <HIRAGANA>: A single hiragana character
  • <KATAKANA>: A sequence of katakana characters
  • <HANGUL>: A sequence of Hangul characters
  • <EMOJI>: A sequence of Emoji characters
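A rough feel for how these token types relate to Unicode scripts can be had with the JDK alone. The sketch below is not the grammar this class implements: it labels a single code point using `Character.UnicodeScript`, whereas the real scanner applies UAX #29 word-break rules to whole sequences. The class name `ScriptTypeSketch` and the helper `roughType` are hypothetical names introduced only for illustration.

```java
import java.lang.Character.UnicodeScript;

final class ScriptTypeSketch {

    // Hypothetical helper: maps ONE code point to a label resembling the
    // token types listed above. This is an approximation for illustration,
    // not the logic of StandardTokenizerImpl.
    static String roughType(int codePoint) {
        if (Character.isDigit(codePoint)) {
            return "<NUM>";
        }
        UnicodeScript script = UnicodeScript.of(codePoint);
        switch (script) {
            case HAN:
                return "<IDEOGRAPHIC>";
            case HIRAGANA:
                return "<HIRAGANA>";
            case KATAKANA:
                return "<KATAKANA>";
            case HANGUL:
                return "<HANGUL>";
            case THAI:
            case LAO:
            case MYANMAR:
            case KHMER:
                return "<SOUTHEAST_ASIAN>";
            default:
                return Character.isLetter(codePoint) ? "<ALPHANUM>" : "(none)";
        }
    }

    public static void main(String[] args) {
        // Latin letter, digit, Han, Hiragana, Katakana, Hangul, Thai
        String[] samples = {"a", "7", "\u4e2d", "\u3042", "\u30a2", "\ud55c", "\u0e01"};
        for (String s : samples) {
            System.out.println(s + " -> " + roughType(s.codePointAt(0)));
        }
    }
}
```

Emoji are left out on purpose: they are identified by Unicode properties rather than by a script, so a script lookup alone cannot detect them.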
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    EMOJI_TYPE
    Emoji token type
    static final int
    HANGUL_TYPE
    Hangul token type
    static final int
    HIRAGANA_TYPE
    Hiragana token type
    static final int
    IDEOGRAPHIC_TYPE
    Ideographic token type
    static final int
    KATAKANA_TYPE
    Katakana token type
    static final int
    NUMERIC_TYPE
    Numbers
    static final int
    SOUTH_EAST_ASIAN_TYPE
    Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
    static final int
    WORD_TYPE
    Alphanumeric sequences
    static final int
    YYEOF
    This character denotes the end of file.
    static final int
    YYINITIAL
    Lexical States.
  • Constructor Summary

    Constructors
    Constructor
    Description
    StandardTokenizerImpl(Reader in)
    Creates a new scanner
  • Method Summary

    Modifier and Type
    Method
    Description
    int
    getNextToken()
    Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
    final void
    getText(CharTermAttribute t)
    Fills CharTermAttribute with the current token text.
    final void
    setBufferSize(int numChars)
    Sets the scanner buffer size in chars
    final boolean
    yyatEOF()
    Returns whether the scanner has reached the end of the reader it reads from.
    final void
    yybegin(int newState)
    Enters a new lexical state.
    final int
    yychar()
    Character count processed so far
    final char
    yycharat(int position)
    Returns the character at the given position from the matched text.
    final void
    yyclose()
    Closes the input reader.
    final int
    yylength()
    How many characters were matched.
    void
    yypushback(int number)
    Pushes the specified amount of characters back into the input stream.
    final void
    yyreset(Reader reader)
    Resets the scanner to read from a new input stream.
    final int
    yystate()
    Returns the current lexical state.
    final String
    yytext()
    Returns the text matched by the current regular expression.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • YYEOF

      public static final int YYEOF
      This character denotes the end of file.
    • YYINITIAL

      public static final int YYINITIAL
      Lexical States.
    • WORD_TYPE

      public static final int WORD_TYPE
      Alphanumeric sequences
    • NUMERIC_TYPE

      public static final int NUMERIC_TYPE
      Numbers
    • SOUTH_EAST_ASIAN_TYPE

      public static final int SOUTH_EAST_ASIAN_TYPE
      Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.

      See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA

    • IDEOGRAPHIC_TYPE

      public static final int IDEOGRAPHIC_TYPE
      Ideographic token type
    • HIRAGANA_TYPE

      public static final int HIRAGANA_TYPE
      Hiragana token type
    • KATAKANA_TYPE

      public static final int KATAKANA_TYPE
      Katakana token type
    • HANGUL_TYPE

      public static final int HANGUL_TYPE
      Hangul token type
    • EMOJI_TYPE

      public static final int EMOJI_TYPE
      Emoji token type
  • Constructor Details

    • StandardTokenizerImpl

      public StandardTokenizerImpl(Reader in)
      Creates a new scanner
      Parameters:
      in - the java.io.Reader to read input from.
  • Method Details

    • yychar

      public final int yychar()
      Character count processed so far
    • getText

      public final void getText(CharTermAttribute t)
      Fills CharTermAttribute with the current token text.
    • setBufferSize

      public final void setBufferSize(int numChars)
      Sets the scanner buffer size in chars
    • yyclose

      public final void yyclose() throws IOException
      Closes the input reader.
      Throws:
      IOException - if the reader could not be closed.
    • yyreset

      public final void yyreset(Reader reader)
      Resets the scanner to read from a new input stream.

      Does not close the old reader.

      All internal variables are reset. The old input stream cannot be reused, because the internal buffer is discarded and lost. Lexical state is set to YYINITIAL.

      Internal scan buffer is resized down to its initial length, if it has grown.

      Parameters:
      reader - The new input stream.
    • yyatEOF

      public final boolean yyatEOF()
      Returns whether the scanner has reached the end of the reader it reads from.
      Returns:
      whether the scanner has reached EOF.
    • yystate

      public final int yystate()
      Returns the current lexical state.
      Returns:
      the current lexical state.
    • yybegin

      public final void yybegin(int newState)
      Enters a new lexical state.
      Parameters:
      newState - the new lexical state
    • yytext

      public final String yytext()
      Returns the text matched by the current regular expression.
      Returns:
      the matched text.
    • yycharat

      public final char yycharat(int position)
      Returns the character at the given position from the matched text.

      It is equivalent to yytext().charAt(position), but faster.

      Parameters:
      position - the position of the character to fetch. A value from 0 to yylength()-1.
      Returns:
      the character at position.
    • yylength

      public final int yylength()
      How many characters were matched.
      Returns:
      the length of the matched text region.
    • yypushback

      public void yypushback(int number)
      Pushes the specified amount of characters back into the input stream.

      They will be read again by the next call of the scanning method.

      Parameters:
      number - the number of characters to be read again. This number must not be greater than yylength().
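How yytext(), yycharat(), yylength() and yypushback() relate to one another can be sketched with a toy match window over a character buffer. This is an illustrative model under stated assumptions, not the generated scanner code: the class `MatchWindowSketch`, its fields, and the `match(int)` method (which stands in for a regular-expression match) are all hypothetical.

```java
final class MatchWindowSketch {
    private final char[] buffer;
    private int start; // index of the first character of the current match
    private int end;   // index one past the last character of the current match

    MatchWindowSketch(String input) {
        this.buffer = input.toCharArray();
    }

    // Hypothetical stand-in for "the next regular expression matched n chars".
    void match(int n) {
        start = end;
        end = Math.min(buffer.length, start + n);
    }

    String yytext() {
        return new String(buffer, start, end - start);
    }

    // Same result as yytext().charAt(position), without building a String.
    char yycharat(int position) {
        return buffer[start + position];
    }

    int yylength() {
        return end - start;
    }

    // Shrinks the match from the right; the returned characters are seen
    // again by the next match. Must not exceed yylength().
    void yypushback(int number) {
        if (number < 0 || number > yylength()) {
            throw new IllegalArgumentException("number must be in 0..yylength()");
        }
        end -= number;
    }

    public static void main(String[] args) {
        MatchWindowSketch w = new MatchWindowSketch("foobar");
        w.match(4);
        System.out.println(w.yytext());   // foob
        w.yypushback(1);
        System.out.println(w.yytext());   // foo
        System.out.println(w.yylength()); // 3
        w.match(3);
        System.out.println(w.yytext());   // bar
    }
}
```

In this model a pushback only moves the end of the match window, which is why the pushed-back character `b` starts the following match.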
    • getNextToken

      public int getNextToken() throws IOException
      Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
      Returns:
      the type of the next token, or YYEOF when the end of input is reached.
      Throws:
      IOException - if any I/O error occurs.
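
The calling pattern implied by getNextToken() and YYEOF is a loop that pulls token types until the end-of-file value comes back, reading the matched text after each call. The sketch below shows that pattern with a self-contained toy scanner so it runs without Lucene on the classpath. `ToyScanner`, its splitting rule (runs of letters and digits), and the constant values used here are assumptions made for illustration; they are not this class's behavior or its actual constant values.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ScanLoopSketch {
    // Hypothetical values chosen for this sketch only.
    static final int YYEOF = -1;
    static final int WORD_TYPE = 0;
    static final int NUMERIC_TYPE = 1;

    // Toy scanner: splits input into runs of letters/digits.
    static final class ToyScanner {
        private static final int EMPTY = -2;
        private final Reader in;
        private final StringBuilder text = new StringBuilder();
        private int pending = EMPTY; // one-character lookahead

        ToyScanner(Reader in) {
            this.in = in;
        }

        private int read() throws IOException {
            if (pending != EMPTY) {
                int c = pending;
                pending = EMPTY;
                return c;
            }
            return in.read();
        }

        int getNextToken() throws IOException {
            int c = read();
            while (c != -1 && !Character.isLetterOrDigit(c)) {
                c = read(); // skip separators
            }
            if (c == -1) {
                return YYEOF;
            }
            text.setLength(0);
            boolean allDigits = true;
            while (c != -1 && Character.isLetterOrDigit(c)) {
                allDigits &= Character.isDigit(c);
                text.append((char) c);
                c = read();
            }
            pending = c; // keep the character that ended the token
            return allDigits ? NUMERIC_TYPE : WORD_TYPE;
        }

        String yytext() {
            return text.toString();
        }
    }

    // The loop itself: call getNextToken() until it returns YYEOF.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        ToyScanner scanner = new ToyScanner(new StringReader(input));
        try {
            int type;
            while ((type = scanner.getNextToken()) != YYEOF) {
                out.add(type + ":" + scanner.yytext());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc 42 x9")); // [0:abc, 1:42, 0:x9]
    }
}
```

With the real class, the same loop shape would compare the return value against StandardTokenizerImpl.YYEOF and could use getText(CharTermAttribute) instead of yytext() to avoid allocating a String per token.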