UAX29URLEmailTokenizer (Lucene 3.5.0 API)

org.apache.lucene.analysis.standard
Class UAX29URLEmailTokenizer

java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.Tokenizer
              org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer

All Implemented Interfaces: Closeable

public final class UAX29URLEmailTokenizer
extends Tokenizer

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

Tokens produced are of the following types:

<ALPHANUM>: A sequence of alphabetic and numeric characters
<NUM>: A number
<URL>: A URL
<EMAIL>: An email address
<SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
<IDEOGRAPHIC>: A single CJKV ideographic character
<HIRAGANA>: A single hiragana character

You must specify the required Version compatibility when creating UAX29URLEmailTokenizer:

As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you pass an earlier version, the tokenizer reproduces the old, incorrect splitting exactly, for backwards compatibility.
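As an illustration of the Version requirement above, the following is a minimal sketch of constructing the tokenizer with the non-deprecated `UAX29URLEmailTokenizer(Version, Reader)` constructor and reading each token's text and type. The class name `UrlEmailTokenizerSketch`, the helper `tokenize`, and the sample input are hypothetical; the sketch assumes the Lucene 3.5 core jar is on the classpath and that `Version.LUCENE_35` is the desired compatibility level.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class, not part of Lucene.
public class UrlEmailTokenizerSketch {

    // Returns one "term|type" string per token found in the text.
    static List<String> tokenize(String text) throws IOException {
        // Assumption: Version.LUCENE_35 selects the current (3.4+) splitting behavior.
        UAX29URLEmailTokenizer tokenizer =
                new UAX29URLEmailTokenizer(Version.LUCENE_35, new StringReader(text));

        // Attributes expose the text and the type of the current token.
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

        List<String> tokens = new ArrayList<String>();
        try {
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                tokens.add(term.toString() + "|" + type.type());
            }
            tokenizer.end(); // called once incrementToken() has returned false
        } finally {
            tokenizer.close();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        for (String token : tokenize("write to admin@example.com")) {
            System.out.println(token);
        }
    }
}
```

Passing an earlier `Version` constant to the same constructor would select the older splitting behavior described above.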

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
`AttributeSource.AttributeFactory, AttributeSource.State`

Field Summary
`static int`	`ALPHANUM`
`static int`	`EMAIL`
`static String`	`EMAIL_TYPE` Deprecated. use `TOKEN_TYPES` instead
`static int`	`HANGUL`
`static String`	`HANGUL_TYPE` Deprecated. use `TOKEN_TYPES` instead
`static int`	`HIRAGANA`
`static String`	`HIRAGANA_TYPE` Deprecated. use `TOKEN_TYPES` instead
`static int`	`IDEOGRAPHIC`
`static String`	`IDEOGRAPHIC_TYPE` Deprecated. use `TOKEN_TYPES` instead
`static int`	`KATAKANA`
`static String`	`KATAKANA_TYPE` Deprecated. use `TOKEN_TYPES` instead
`static int`	`NUM`
`static String`	`NUMERIC_TYPE` Deprecated. use `TOKEN_TYPES` instead
`static String`	`SOUTH_EAST_ASIAN_TYPE` Deprecated. use `TOKEN_TYPES` instead
`static int`	`SOUTHEAST_ASIAN`
`static String[]`	`TOKEN_TYPES` String token types that correspond to token type int constants
`static int`	`URL`
`static String`	`URL_TYPE` Deprecated. use `TOKEN_TYPES` instead
`static String`	`WORD_TYPE` Deprecated. use `TOKEN_TYPES` instead

Fields inherited from class org.apache.lucene.analysis.Tokenizer
`input`

Constructor Summary
`UAX29URLEmailTokenizer(AttributeSource.AttributeFactory factory, Reader input)` Deprecated. use `UAX29URLEmailTokenizer(Version, AttributeSource.AttributeFactory, Reader)` instead.
`UAX29URLEmailTokenizer(AttributeSource source, Reader input)` Deprecated. use `UAX29URLEmailTokenizer(Version, AttributeSource, Reader)` instead.
`UAX29URLEmailTokenizer(InputStream input)` Deprecated. use `UAX29URLEmailTokenizer(Version, Reader)` instead.
`UAX29URLEmailTokenizer(Reader input)` Deprecated. use `UAX29URLEmailTokenizer(Version, Reader)` instead.
`UAX29URLEmailTokenizer(Version matchVersion, AttributeSource.AttributeFactory factory, Reader input)` Creates a new UAX29URLEmailTokenizer with a given `AttributeSource.AttributeFactory`
`UAX29URLEmailTokenizer(Version matchVersion, AttributeSource source, Reader input)` Creates a new UAX29URLEmailTokenizer with a given `AttributeSource`.
`UAX29URLEmailTokenizer(Version matchVersion, Reader input)` Creates a new instance of the UAX29URLEmailTokenizer.

Method Summary
`void`	`end()` This method is called by the consumer after the last token has been consumed, after `TokenStream.incrementToken()` returned `false` (using the new `TokenStream` API).
`int`	`getMaxTokenLength()`
`boolean`	`incrementToken()` Consumers (e.g., `IndexWriter`) use this method to advance the stream to the next token.
`void`	`reset(Reader reader)` Expert: Reset the tokenizer to a new reader.
`void`	`setMaxTokenLength(int length)` Set the max allowed token length.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
`close, correctOffset`

Methods inherited from class org.apache.lucene.analysis.TokenStream
`reset`

Methods inherited from class org.apache.lucene.util.AttributeSource
`addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Field Detail

ALPHANUM

public static final int ALPHANUM

See Also: Constant Field Values

NUM

public static final int NUM

See Also: Constant Field Values

SOUTHEAST_ASIAN

public static final int SOUTHEAST_ASIAN

See Also: Constant Field Values

IDEOGRAPHIC

public static final int IDEOGRAPHIC

See Also: Constant Field Values

HIRAGANA

public static final int HIRAGANA

See Also: Constant Field Values

KATAKANA

public static final int KATAKANA

See Also: Constant Field Values

HANGUL

public static final int HANGUL

See Also: Constant Field Values

URL

public static final int URL

See Also: Constant Field Values

EMAIL

public static final int EMAIL

See Also: Constant Field Values

TOKEN_TYPES

public static final String[] TOKEN_TYPES

String token types that correspond to token type int constants
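
The relationship between the int constants and `TOKEN_TYPES` can be sketched as a simple array lookup. This is an illustrative sketch only: the class `TokenTypeLookupSketch` and the helper `typeName` are hypothetical, and it assumes each int constant is a valid index into `TOKEN_TYPES`, as the field description suggests.

```java
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;

// Hypothetical helper class, not part of Lucene.
public class TokenTypeLookupSketch {

    // Assumption: the int constants index directly into TOKEN_TYPES.
    static String typeName(int typeConstant) {
        return UAX29URLEmailTokenizer.TOKEN_TYPES[typeConstant];
    }

    public static void main(String[] args) {
        System.out.println(typeName(UAX29URLEmailTokenizer.URL));
        System.out.println(typeName(UAX29URLEmailTokenizer.EMAIL));
    }
}
```

Looking names up this way avoids the deprecated `*_TYPE` string fields.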