UAX29URLEmailTokenizerImpl (Lucene 3.5.0 API)

org.apache.lucene.analysis.standard
Class UAX29URLEmailTokenizerImpl

java.lang.Object
  org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl

All Implemented Interfaces: StandardTokenizerInterface

public final class UAX29URLEmailTokenizerImpl
extends Object
implements StandardTokenizerInterface

This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

Tokens produced are of the following types:

<ALPHANUM>: A sequence of alphabetic and numeric characters
<NUM>: A number
<URL>: A URL
<EMAIL>: An email address
<SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
<IDEOGRAPHIC>: A single CJKV ideographic character
<HIRAGANA>: A single hiragana character

Field Summary
`static int`	`EMAIL_TYPE`
`static int`	`HANGUL_TYPE`
`static int`	`HIRAGANA_TYPE`
`static int`	`IDEOGRAPHIC_TYPE`
`static int`	`KATAKANA_TYPE`
`static int`	`NUMERIC_TYPE` Numbers
`static int`	`SOUTH_EAST_ASIAN_TYPE` Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
`static int`	`URL_TYPE`
`static int`	`WORD_TYPE` Alphanumeric sequences
`static int`	`YYEOF` This character denotes the end of file
`static int`	`YYINITIAL` lexical states

Constructor Summary
`UAX29URLEmailTokenizerImpl(InputStream in)` Creates a new scanner.
`UAX29URLEmailTokenizerImpl(Reader in)` Creates a new scanner. There is also a java.io.InputStream version of this constructor.

Method Summary
`int`	`getNextToken()` Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
`void`	`getText(CharTermAttribute t)` Fills CharTermAttribute with the current token text.
`void`	`yybegin(int newState)` Enters a new lexical state
`int`	`yychar()` Returns the current position.
`char`	`yycharat(int pos)` Returns the character at position `pos` from the matched text.
`void`	`yyclose()` Closes the input stream.
`int`	`yylength()` Returns the length of the matched text region.
`void`	`yypushback(int number)` Pushes the specified number of characters back into the input stream.
`void`	`yyreset(Reader reader)` Resets the scanner to read from a new input stream.
`int`	`yystate()` Returns the current lexical state.
`String`	`yytext()` Returns the text matched by the current regular expression.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

YYEOF

public static final int YYEOF

This character denotes the end of file

See Also: Constant Field Values

YYINITIAL

public static final int YYINITIAL

lexical states

See Also: Constant Field Values

WORD_TYPE

public static final int WORD_TYPE

Alphanumeric sequences

See Also: Constant Field Values

NUMERIC_TYPE

public static final int NUMERIC_TYPE

Numbers

See Also: Constant Field Values

SOUTH_EAST_ASIAN_TYPE

public static final int SOUTH_EAST_ASIAN_TYPE

Characters in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.

See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA

See Also: Constant Field Values

IDEOGRAPHIC_TYPE

public static final int IDEOGRAPHIC_TYPE

See Also: Constant Field Values

HIRAGANA_TYPE

public static final int HIRAGANA_TYPE

See Also: Constant Field Values

KATAKANA_TYPE

public static final int KATAKANA_TYPE

See Also: Constant Field Values

HANGUL_TYPE

public static final int HANGUL_TYPE

See Also: Constant Field Values

EMAIL_TYPE

public static final int EMAIL_TYPE

See Also: Constant Field Values

URL_TYPE

public static final int URL_TYPE

See Also: Constant Field Values

Constructor Detail

UAX29URLEmailTokenizerImpl

public UAX29URLEmailTokenizerImpl(Reader in)

Creates a new scanner. There is also a java.io.InputStream version of this constructor.

Parameters: in - the java.io.Reader to read input from.

UAX29URLEmailTokenizerImpl

public UAX29URLEmailTokenizerImpl(InputStream in)

Creates a new scanner. There is also a java.io.Reader version of this constructor.

Parameters: in - the java.io.InputStream to read input from.

Method Detail

yychar

public final int yychar()

Description copied from interface: StandardTokenizerInterface

Returns the current position.

Specified by: yychar in interface StandardTokenizerInterface

getText

public final void getText(CharTermAttribute t)

Fills CharTermAttribute with the current token text.

Specified by: getText in interface StandardTokenizerInterface

yyclose

public final void yyclose()
                   throws IOException

Closes the input stream.

Throws: IOException

yyreset

public final void yyreset(Reader reader)

Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, and the old input stream cannot be reused because its internal buffer is discarded. The lexical state is set to YYINITIAL. The internal scan buffer is shrunk back to its initial length if it has grown.

Specified by: yyreset in interface StandardTokenizerInterface

Parameters: reader - the new input stream

yystate

public final int yystate()

Returns the current lexical state.

yybegin

public final void yybegin(int newState)

Enters a new lexical state

Parameters: newState - the new lexical state

yytext

public final String yytext()

Returns the text matched by the current regular expression.

yycharat

public final char yycharat(int pos)

Returns the character at position pos from the matched text. It is equivalent to yytext().charAt(pos), but faster.

Parameters: pos - the position of the character to fetch. A value from 0 to yylength()-1.
Returns: the character at position pos

yylength

public final int yylength()

Returns the length of the matched text region.

Specified by: yylength in interface StandardTokenizerInterface

yypushback

public void yypushback(int number)

Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.

Parameters: number - the number of characters to be read again. This number must not be greater than yylength().
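
The pushback contract can be illustrated with a small sketch. The `ToyMatch` class below is a hypothetical stand-in over an in-memory string, not the generated scanner: its field names, its fixed-length `match` method, and the exception it throws are assumptions made for illustration. It shows only the documented behaviour: pushing back n characters shortens the current match by n, and those characters are seen again by the next match.

```java
class PushbackSketch {
    /** Hypothetical model of a scanner's current match over a fixed buffer. */
    static final class ToyMatch {
        private final String buffer;
        private int start; // index of the first character of the current match
        private int end;   // index one past the last character of the current match

        ToyMatch(String buffer) {
            this.buffer = buffer;
        }

        /** Matches up to 'len' characters, starting where the previous match ended. */
        void match(int len) {
            start = end;
            end = Math.min(buffer.length(), start + len);
        }

        String yytext() {
            return buffer.substring(start, end);
        }

        int yylength() {
            return end - start;
        }

        /** Returns 'number' characters of the current match to the unread input. */
        void yypushback(int number) {
            // Mirrors the documented limit: number must not exceed yylength().
            if (number < 0 || number > yylength()) {
                throw new IllegalArgumentException("pushback exceeds match length");
            }
            end -= number;
        }
    }
}
```

After `match(4)` on the buffer "abcdef" and `yypushback(2)`, the current text in this sketch is "ab", and the following `match(3)` starts again at "c".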

getNextToken

public int getNextToken()
                 throws IOException

Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.

Specified by: getNextToken in interface StandardTokenizerInterface

Returns: the next token
Throws: IOException - if any I/O error occurs
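
A scanner of this kind is usually driven in a loop that calls getNextToken() until it returns YYEOF, reading the matched text with yytext() after each call. The sketch below shows that calling pattern against a hypothetical stand-in, `ToyScanner`, which simply splits on whitespace so the example runs without Lucene on the classpath. It is not the Lucene implementation, and the values -1 for YYEOF and 0 for the word type are assumptions; code written against the real class should refer to its constants by name.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ScanLoopSketch {
    /** Hypothetical stand-in mirroring the getNextToken()/yytext() contract. */
    static final class ToyScanner {
        static final int YYEOF = -1;    // assumed value; use the real constant in practice
        static final int WORD_TYPE = 0; // assumed type code, for illustration only

        private final Reader in;
        private final StringBuilder text = new StringBuilder();

        ToyScanner(Reader in) {
            this.in = in;
        }

        /** Reads the next whitespace-delimited token; returns YYEOF at end of input. */
        int getNextToken() throws IOException {
            text.setLength(0);
            int c;
            // Skip leading whitespace.
            while ((c = in.read()) != -1 && Character.isWhitespace(c)) { }
            if (c == -1) {
                return YYEOF;
            }
            // Accumulate characters until whitespace or end of input.
            do {
                text.append((char) c);
            } while ((c = in.read()) != -1 && !Character.isWhitespace(c));
            return WORD_TYPE;
        }

        String yytext() {
            return text.toString();
        }
    }

    /** Drains the scanner, collecting the text of every token until YYEOF. */
    static List<String> tokens(String input) {
        ToyScanner scanner = new ToyScanner(new StringReader(input));
        List<String> out = new ArrayList<>();
        try {
            while (scanner.getNextToken() != ToyScanner.YYEOF) {
                out.add(scanner.yytext());
            }
        } catch (IOException e) {
            // Wrapped only to keep the sketch easy to call.
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

With the real class, the same loop would branch on the returned type (for example URL_TYPE or EMAIL_TYPE) rather than treating every token alike.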
