org.apache.lucene.analysis.standard
Class UAX29URLEmailTokenizerImpl

java.lang.Object
  extended by org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl
All Implemented Interfaces:
StandardTokenizerInterface

public final class UAX29URLEmailTokenizerImpl
extends Object
implements StandardTokenizerInterface

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
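
To give a feel for word-boundary segmentation, the following is a minimal sketch that uses the JDK's java.text.BreakIterator rather than this class. It is an illustration only: BreakIterator's rules are similar in spirit but are not guaranteed to match UAX #29 exactly, and it has no special handling for URLs or email addresses. The class name WordBreakSketch and the method words are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: uses the JDK's BreakIterator, not the Lucene scanner.
class WordBreakSketch {

    // Returns the segments between word boundaries that contain at least one
    // letter or digit, skipping whitespace-only and punctuation-only segments.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello world"));
    }
}
```

The filter on letters and digits is a design choice of this sketch; the scanner documented here instead assigns each token one of the type constants listed in the Field Summary.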

Tokens produced are of the following types:


Field Summary
static int EMAIL_TYPE
           
static int HANGUL_TYPE
           
static int HIRAGANA_TYPE
           
static int IDEOGRAPHIC_TYPE
           
static int KATAKANA_TYPE
           
static int NUMERIC_TYPE
          Numbers
static int SOUTH_EAST_ASIAN_TYPE
          Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
static int URL_TYPE
           
static int WORD_TYPE
          Alphanumeric sequences
static int YYEOF
          This character denotes the end of file
static int YYINITIAL
          lexical states
 
Constructor Summary
UAX29URLEmailTokenizerImpl(InputStream in)
          Creates a new scanner.
UAX29URLEmailTokenizerImpl(Reader in)
          Creates a new scanner. There is also a java.io.InputStream version of this constructor.
 
Method Summary
 int getNextToken()
          Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
 void getText(CharTermAttribute t)
          Fills CharTermAttribute with the current token text.
 void yybegin(int newState)
          Enters a new lexical state.
 int yychar()
          Returns the current position.
 char yycharat(int pos)
          Returns the character at position pos from the matched text.
 void yyclose()
          Closes the input stream.
 int yylength()
          Returns the length of the matched text region.
 void yypushback(int number)
          Pushes the specified amount of characters back into the input stream.
 void yyreset(Reader reader)
          Resets the scanner to read from a new input stream.
 int yystate()
          Returns the current lexical state.
 String yytext()
          Returns the text matched by the current regular expression.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

YYEOF

public static final int YYEOF
This character denotes the end of file

See Also:
Constant Field Values

YYINITIAL

public static final int YYINITIAL
lexical states

See Also:
Constant Field Values

WORD_TYPE

public static final int WORD_TYPE
Alphanumeric sequences

See Also:
Constant Field Values

NUMERIC_TYPE

public static final int NUMERIC_TYPE
Numbers

See Also:
Constant Field Values

SOUTH_EAST_ASIAN_TYPE

public static final int SOUTH_EAST_ASIAN_TYPE
Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.

See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA

See Also:
Constant Field Values

IDEOGRAPHIC_TYPE

public static final int IDEOGRAPHIC_TYPE
See Also:
Constant Field Values

HIRAGANA_TYPE

public static final int HIRAGANA_TYPE
See Also:
Constant Field Values

KATAKANA_TYPE

public static final int KATAKANA_TYPE
See Also:
Constant Field Values

HANGUL_TYPE

public static final int HANGUL_TYPE
See Also:
Constant Field Values

EMAIL_TYPE

public static final int EMAIL_TYPE
See Also:
Constant Field Values

URL_TYPE

public static final int URL_TYPE
See Also:
Constant Field Values
Constructor Detail

UAX29URLEmailTokenizerImpl

public UAX29URLEmailTokenizerImpl(Reader in)
Creates a new scanner. There is also a java.io.InputStream version of this constructor.

Parameters:
in - the java.io.Reader to read input from.

UAX29URLEmailTokenizerImpl

public UAX29URLEmailTokenizerImpl(InputStream in)
Creates a new scanner. There is also a java.io.Reader version of this constructor.

Parameters:
in - the java.io.InputStream to read input from.
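
This page does not say how the InputStream constructor decodes bytes into characters. Assuming explicit control over the character encoding is wanted, one option is to build a Reader first and pass that to the Reader constructor. The sketch below uses only JDK classes; ReaderFromStreamSketch, utf8Reader, and readAll are hypothetical names, and readAll merely stands in for whatever consumes the Reader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderFromStreamSketch {

    // Wraps a byte stream in a Reader that decodes it as UTF-8.
    static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // Drains a Reader into a String; a stand-in for the real consumer.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9 user@example.com".getBytes(StandardCharsets.UTF_8);
        // This Reader could be handed to the Reader constructor documented above.
        Reader reader = utf8Reader(new ByteArrayInputStream(bytes));
        System.out.println(readAll(reader));
    }
}
```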
Method Detail

yychar

public final int yychar()
Description copied from interface: StandardTokenizerInterface
Returns the current position.

Specified by:
yychar in interface StandardTokenizerInterface

getText

public final void getText(CharTermAttribute t)
Fills CharTermAttribute with the current token text.

Specified by:
getText in interface StandardTokenizerInterface

yyclose

public final void yyclose()
                   throws IOException
Closes the input stream.

Throws:
IOException

yyreset

public final void yyreset(Reader reader)
Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL. The internal scan buffer is resized down to its initial length if it has grown.

Specified by:
yyreset in interface StandardTokenizerInterface
Parameters:
reader - the new input stream

yystate

public final int yystate()
Returns the current lexical state.


yybegin

public final void yybegin(int newState)
Enters a new lexical state.

Parameters:
newState - the new lexical state

yytext

public final String yytext()
Returns the text matched by the current regular expression.


yycharat

public final char yycharat(int pos)
Returns the character at position pos from the matched text. It is equivalent to yytext().charAt(pos), but faster.

Parameters:
pos - the position of the character to fetch. A value from 0 to yylength()-1.
Returns:
the character at position pos

yylength

public final int yylength()
Returns the length of the matched text region.

Specified by:
yylength in interface StandardTokenizerInterface

yypushback

public void yypushback(int number)
Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.

Parameters:
number - the number of characters to be read again. This number must not be greater than yylength()!
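
The interaction between yypushback, yylength, and yytext can be modeled with two indices into the input. The following is a simplified sketch of that bookkeeping, not the generated scanner's code: PushbackSketch and its match method are hypothetical, and throwing IllegalArgumentException for an oversized pushback is a choice made for this sketch only.

```java
// Toy model of pushback bookkeeping; not the generated scanner itself.
class PushbackSketch {
    private final String input;
    private int startRead = 0;  // start of the current match
    private int markedPos = 0;  // end of the current match; the next scan starts here

    PushbackSketch(String input) {
        this.input = input;
    }

    // Hypothetical scanning step: matches up to `count` characters from the
    // current position, standing in for a real regular-expression match.
    void match(int count) {
        startRead = markedPos;
        markedPos = Math.min(input.length(), startRead + count);
    }

    int yylength() {
        return markedPos - startRead;
    }

    String yytext() {
        return input.substring(startRead, markedPos);
    }

    // Shrinks the current match by `number` characters so that they are
    // read again by the next scanning step.
    void yypushback(int number) {
        if (number > yylength()) {
            throw new IllegalArgumentException("pushback larger than yylength()");
        }
        markedPos -= number;
    }

    public static void main(String[] args) {
        PushbackSketch s = new PushbackSketch("foo.bar");
        s.match(4);                       // current match is "foo."
        s.yypushback(1);                  // give the '.' back
        System.out.println(s.yytext());   // current match is now "foo"
        s.match(4);                       // the '.' is read again
        System.out.println(s.yytext());
    }
}
```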

getNextToken

public int getNextToken()
                 throws IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.

Specified by:
getNextToken in interface StandardTokenizerInterface
Returns:
the next token
Throws:
IOException - if any I/O-Error occurs
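
A typical way to drive a scanner of this shape is to call getNextToken() repeatedly until it returns YYEOF, reading the token text and position after each call. The sketch below shows only that loop pattern against a toy stand-in, so that it runs without Lucene on the classpath. ToyScanner, its regular expression, and the constant values used here are all hypothetical; the real values are listed under Constant Field Values, and the real grammar is far richer. The sketch also assumes that yychar() reports the start offset of the current match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScanLoopSketch {
    // Placeholder values for this sketch only; see "Constant Field Values".
    static final int YYEOF = -1;
    static final int WORD_TYPE = 0;
    static final int NUMERIC_TYPE = 1;

    // Toy stand-in exposing the same method names as the documented scanner.
    static final class ToyScanner {
        private static final Pattern TOKEN =
                Pattern.compile("(\\d+)|(\\p{L}[\\p{L}\\d]*)");
        private final Matcher matcher;
        private String text = "";
        private int start = 0;

        ToyScanner(String input) {
            matcher = TOKEN.matcher(input);
        }

        // Returns the type of the next token, or YYEOF when input is exhausted.
        int getNextToken() {
            if (!matcher.find()) {
                return YYEOF;
            }
            text = matcher.group();
            start = matcher.start();
            return matcher.group(1) != null ? NUMERIC_TYPE : WORD_TYPE;
        }

        String yytext() {
            return text;
        }

        int yychar() {
            return start;
        }
    }

    // Collects "type:offset:text" for every token until YYEOF is returned.
    static List<String> scan(String input) {
        ToyScanner scanner = new ToyScanner(input);
        List<String> out = new ArrayList<>();
        for (int type = scanner.getNextToken(); type != YYEOF; type = scanner.getNextToken()) {
            out.add(type + ":" + scanner.yychar() + ":" + scanner.yytext());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("abc 42 x9"));
    }
}
```

With the real class, the same loop would also need to handle the IOException declared by getNextToken().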


Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.