Class StandardTokenizerImpl


  • public final class StandardTokenizerImpl
    extends Object
    This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

    Tokens produced are of the following types:

    • <ALPHANUM>: A sequence of alphabetic and numeric characters
    • <NUM>: A number
    • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
    • <IDEOGRAPHIC>: A single CJKV ideographic character
    • <HIRAGANA>: A single hiragana character
    • <KATAKANA>: A sequence of katakana characters
    • <HANGUL>: A sequence of Hangul characters
    • <EMOJI>: A sequence of Emoji characters
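To make the type labels above more concrete, the following sketch maps a single code point to the label it would most plausibly fall under, using only the JDK's `Character.UnicodeScript`. It is a rough per-character approximation written for illustration, not the scanner's actual rules: the real class matches whole sequences using UAX #29 word-break properties, and the `TokenTypeSketch` class and `roughType` method are hypothetical names. Emoji are left out for brevity.

```java
import java.lang.Character.UnicodeScript;

/** Rough per-character approximation of the token type labels; NOT the scanner's real rules. */
class TokenTypeSketch {

    /** Returns the label a code point most plausibly belongs to, or null for non-token characters. */
    static String roughType(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        switch (script) {
            case HAN:      return "<IDEOGRAPHIC>";
            case HIRAGANA: return "<HIRAGANA>";
            case KATAKANA: return "<KATAKANA>";
            case HANGUL:   return "<HANGUL>";
            case THAI:
            case LAO:
            case MYANMAR:
            case KHMER:    return "<SOUTHEAST_ASIAN>";
            default:
                break; // fall through to the generic checks below
        }
        if (Character.isDigit(codePoint)) {
            return "<NUM>";
        }
        if (Character.isLetter(codePoint)) {
            return "<ALPHANUM>";
        }
        return null; // punctuation, whitespace, and anything else this sketch ignores
    }

    public static void main(String[] args) {
        // Latin letter, digit, hiragana, katakana, Han, Hangul, Thai (written as escapes)
        String sample = "a7\u3042\u30AB\u6F22\uD55C\u0E01";
        sample.codePoints().forEach(cp ->
            System.out.println(new String(Character.toChars(cp)) + " -> " + roughType(cp)));
    }
}
```

The real scanner decides types per matched sequence, so a run such as `abc123` is one alphanumeric token there, whereas this sketch only looks at one character at a time.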
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int EMOJI_TYPE
      Emoji token type
      static int HANGUL_TYPE
      Hangul token type
      static int HIRAGANA_TYPE
      Hiragana token type
      static int IDEOGRAPHIC_TYPE
      Ideographic token type
      static int KATAKANA_TYPE
      Katakana token type
      static int NUMERIC_TYPE
      Numbers
      static int SOUTH_EAST_ASIAN_TYPE
      Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
      static int WORD_TYPE
      Alphanumeric sequences
      static int YYEOF
      This character denotes the end of file.
      static int YYINITIAL
      Lexical States.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      int getNextToken()
      Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
      void getText​(CharTermAttribute t)
      Fills CharTermAttribute with the current token text.
      void setBufferSize​(int numChars)
      Sets the scanner buffer size, in chars.
      boolean yyatEOF()
      Returns whether the scanner has reached the end of the reader it reads from.
      void yybegin​(int newState)
      Enters a new lexical state.
      int yychar()
      Character count processed so far
      char yycharat​(int position)
      Returns the character at the given position from the matched text.
      void yyclose()
      Closes the input reader.
      int yylength()
      How many characters were matched.
      void yypushback​(int number)
      Pushes the specified amount of characters back into the input stream.
      void yyreset​(Reader reader)
      Resets the scanner to read from a new input stream.
      int yystate()
      Returns the current lexical state.
      String yytext()
      Returns the text matched by the current regular expression.
    • Field Detail

      • YYEOF

        public static final int YYEOF
        This character denotes the end of file.
        See Also:
        Constant Field Values
      • SOUTH_EAST_ASIAN_TYPE

        public static final int SOUTH_EAST_ASIAN_TYPE
        Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.

        See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA

        See Also:
        Constant Field Values
      • IDEOGRAPHIC_TYPE

        public static final int IDEOGRAPHIC_TYPE
        Ideographic token type
        See Also:
        Constant Field Values
      • HIRAGANA_TYPE

        public static final int HIRAGANA_TYPE
        Hiragana token type
        See Also:
        Constant Field Values
      • KATAKANA_TYPE

        public static final int KATAKANA_TYPE
        Katakana token type
        See Also:
        Constant Field Values
    • Constructor Detail

      • StandardTokenizerImpl

        public StandardTokenizerImpl​(Reader in)
        Creates a new scanner
        Parameters:
        in - the java.io.Reader to read input from.
    • Method Detail

      • yychar

        public final int yychar()
        Character count processed so far
      • getText

        public final void getText​(CharTermAttribute t)
        Fills CharTermAttribute with the current token text.
      • setBufferSize

        public final void setBufferSize​(int numChars)
        Sets the scanner buffer size, in chars.
      • yyclose

        public final void yyclose()
                           throws IOException
        Closes the input reader.
        Throws:
        IOException - if the reader could not be closed.
      • yyreset

        public final void yyreset​(Reader reader)
        Resets the scanner to read from a new input stream.

        Does not close the old reader.

        All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL.

        Internal scan buffer is resized down to its initial length, if it has grown.

        Parameters:
        reader - The new input stream.
      • yyatEOF

        public final boolean yyatEOF()
        Returns whether the scanner has reached the end of the reader it reads from.
        Returns:
        whether the scanner has reached EOF.
      • yystate

        public final int yystate()
        Returns the current lexical state.
        Returns:
        the current lexical state.
      • yybegin

        public final void yybegin​(int newState)
        Enters a new lexical state.
        Parameters:
        newState - the new lexical state
      • yytext

        public final String yytext()
        Returns the text matched by the current regular expression.
        Returns:
        the matched text.
      • yycharat

        public final char yycharat​(int position)
        Returns the character at the given position from the matched text.

        It is equivalent to yytext().charAt(position), but faster.

        Parameters:
        position - the position of the character to fetch. A value from 0 to yylength()-1.
        Returns:
        the character at position.
      • yylength

        public final int yylength()
        How many characters were matched.
        Returns:
        the length of the matched text region.
      • yypushback

        public void yypushback​(int number)
        Pushes the specified amount of characters back into the input stream.

        They will be read again by the next call of the scanning method.

        Parameters:
        number - the number of characters to be read again. This number must not be greater than yylength().
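The pushback contract described above can be pictured with a toy model of the match bookkeeping. This is a minimal sketch under stated assumptions, not JFlex-generated code: `PushbackSketch`, its `match` method, and the `startRead`/`markedPos` field names are hypothetical, and a plain `String` stands in for the scanner's internal buffer.

```java
/** Toy model of match/pushback bookkeeping; NOT the generated scanner. */
class PushbackSketch {
    private final String input;
    private int startRead = 0; // start of the current match
    private int markedPos = 0; // end of the current match; the next scan starts here

    PushbackSketch(String input) {
        this.input = input;
    }

    /** Hypothetical stand-in for one scan step: matches the next n characters. */
    void match(int n) {
        startRead = markedPos;
        markedPos = Math.min(input.length(), markedPos + n);
    }

    String yytext() {
        return input.substring(startRead, markedPos);
    }

    int yylength() {
        return markedPos - startRead;
    }

    /** Shrinks the current match from the right so the characters are scanned again. */
    void yypushback(int number) {
        if (number < 0 || number > yylength()) {
            throw new IllegalArgumentException("pushback must be between 0 and yylength()");
        }
        markedPos -= number;
    }

    public static void main(String[] args) {
        PushbackSketch scanner = new PushbackSketch("foobar");
        scanner.match(4);
        System.out.println(scanner.yytext()); // the first four characters
        scanner.yypushback(1);
        System.out.println(scanner.yytext()); // match shortened by one character
        scanner.match(3);
        System.out.println(scanner.yytext()); // includes the pushed-back character
    }
}
```

In this model, pushing back more than `yylength()` characters is rejected, mirroring the documented limit on `number`.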
      • getNextToken

        public int getNextToken()
                         throws IOException
        Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
        Returns:
        the type of the next token, or YYEOF once the end of input is reached.
        Throws:
        IOException - if any I/O error occurs.
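A caller typically drives a scanner of this shape in a loop: call getNextToken() until it reports end of input, and read the matched text after each call. The sketch below shows that loop against a hypothetical stand-in scanner so that it runs with the JDK alone. `ScanLoopSketch`, `WhitespaceScanner`, and the `WORD` constant are invented for illustration, the stand-in simply splits on whitespace rather than applying UAX #29, and `YYEOF = -1` is an assumption based on typical JFlex-generated scanners (see the Constant Field Values page for the real value).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ScanLoopSketch {
    static final int YYEOF = -1; // assumed end-of-input sentinel
    static final int WORD = 0;   // hypothetical token type for this sketch

    /** Hypothetical stand-in scanner: splits on whitespace, NOT on UAX #29 boundaries. */
    static final class WhitespaceScanner {
        private final Reader in;
        private final StringBuilder text = new StringBuilder();

        WhitespaceScanner(Reader in) {
            this.in = in;
        }

        /** Reads up to the next whitespace-delimited token, or returns YYEOF at end of input. */
        int getNextToken() throws IOException {
            text.setLength(0);
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (text.length() > 0) {
                        return WORD; // token finished by whitespace
                    }
                } else {
                    text.append((char) c);
                }
            }
            return text.length() > 0 ? WORD : YYEOF;
        }

        String yytext() {
            return text.toString();
        }
    }

    /** Collects the text of every token until the scanner reports YYEOF. */
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        try (Reader reader = new StringReader(input)) {
            WhitespaceScanner scanner = new WhitespaceScanner(reader);
            while (scanner.getNextToken() != YYEOF) {
                out.add(scanner.yytext());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a b  c"));
    }
}
```

With the real class, the same loop shape would apply, with getText(CharTermAttribute) as the way to copy the matched text without allocating a String.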