Package org.apache.lucene.analysis.email
Class UAX29URLEmailTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.email.UAX29URLEmailTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified
in Unicode Standard Annex #29. URLs and email
addresses are also tokenized according to the relevant RFCs.
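The description above can be illustrated with a minimal sketch of the usual consume loop (reset, incrementToken, end, close). It assumes a Lucene release that ships this class in the org.apache.lucene.analysis.email package is on the classpath; the class name UrlEmailTokenizerDemo and the sample text are hypothetical and not part of Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.email.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical demo class; not part of Lucene.
final class UrlEmailTokenizerDemo {

  /** Returns one "term -> type" string per token found in the text. */
  static List<String> tokenize(String text) {
    List<String> out = new ArrayList<>();
    // try-with-resources calls close() after the stream has been consumed.
    try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

      tokenizer.reset();                   // must be called before incrementToken()
      while (tokenizer.incrementToken()) { // returns false once the input is exhausted
        out.add(term.toString() + " -> " + type.type());
      }
      tokenizer.end();                     // finalizes end-of-stream state
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }

  public static void main(String[] args) {
    String sample = "Mail admin@example.com or see https://lucene.apache.org/ for docs";
    for (String line : tokenize(sample)) {
      System.out.println(line);
    }
  }
}
```

With this input, the email address and the URL should each come out as a single token, typed with the corresponding entries of TOKEN_TYPES, while the remaining words are emitted as ordinary alphanumeric tokens.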
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Modifier and Type       Field                    Description
static final int        ALPHANUM                 Alpha/numeric token type
static final int        EMAIL                    Email token type
static final int        EMOJI                    Emoji token type.
static final int        HANGUL                   Hangul token type
static final int        HIRAGANA                 Hiragana token type
static final int        IDEOGRAPHIC              Ideographic token type
static final int        KATAKANA                 Katakana token type
static final int        MAX_TOKEN_LENGTH_LIMIT   Absolute maximum sized token
static final int        NUM                      Numeric token type
static final int        SOUTHEAST_ASIAN          Southeast Asian token type
static final String[]   TOKEN_TYPES              String token types that correspond to token type int constants
static final int        URL                      URL token type
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor                                        Description
UAX29URLEmailTokenizer()                           Creates a new instance of the UAX29URLEmailTokenizer.
UAX29URLEmailTokenizer(AttributeFactory factory)   Creates a new UAX29URLEmailTokenizer with a given AttributeFactory
-
Method Summary
Modifier and Type   Method                          Description
void                close()
final void          end()
int                 getMaxTokenLength()
final boolean       incrementToken()
void                reset()
void                setMaxTokenLength(int length)   Set the max allowed token length.
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
ALPHANUM
public static final int ALPHANUM
Alpha/numeric token type
-
NUM
public static final int NUM
Numeric token type
-
SOUTHEAST_ASIAN
public static final int SOUTHEAST_ASIAN
Southeast Asian token type
-
IDEOGRAPHIC
public static final int IDEOGRAPHIC
Ideographic token type
-
HIRAGANA
public static final int HIRAGANA
Hiragana token type
-
KATAKANA
public static final int KATAKANA
Katakana token type
-
HANGUL
public static final int HANGUL
Hangul token type
-
URL
public static final int URL
URL token type
-
EMAIL
public static final int EMAIL
Email token type
-
EMOJI
public static final int EMOJI
Emoji token type.
-
TOKEN_TYPES
public static final String[] TOKEN_TYPES
String token types that correspond to token type int constants
-
MAX_TOKEN_LENGTH_LIMIT
public static final int MAX_TOKEN_LENGTH_LIMIT
Absolute maximum sized token
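To show how the int constants relate to TOKEN_TYPES, the following sketch looks up the type string for a constant and keeps only tokens carrying that type. It assumes Lucene is on the classpath and that the tokenizer reports its token type through TypeAttribute; the class TokenTypeLookupDemo and the method extractOfType are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.email.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical demo class; not part of Lucene.
final class TokenTypeLookupDemo {

  /** Collects the terms whose type matches the given int constant, e.g. UAX29URLEmailTokenizer.EMAIL. */
  static List<String> extractOfType(String text, int typeConstant) {
    // The int constant is used as an index into the String[] of type names.
    String wanted = UAX29URLEmailTokenizer.TOKEN_TYPES[typeConstant];
    List<String> matches = new ArrayList<>();
    try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        if (wanted.equals(type.type())) {
          matches.add(term.toString());
        }
      }
      tokenizer.end();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return matches;
  }

  public static void main(String[] args) {
    String sample = "ping bob@example.org or browse https://example.com/docs";
    System.out.println(extractOfType(sample, UAX29URLEmailTokenizer.EMAIL));
    System.out.println(extractOfType(sample, UAX29URLEmailTokenizer.URL));
  }
}
```

Comparing against TOKEN_TYPES[constant] rather than a hard-coded string keeps the check tied to whatever names the library defines.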
-
-
Constructor Details
-
UAX29URLEmailTokenizer
public UAX29URLEmailTokenizer()
Creates a new instance of the UAX29URLEmailTokenizer. Attaches the input to the newly created JFlex scanner.
-
UAX29URLEmailTokenizer
public UAX29URLEmailTokenizer(AttributeFactory factory)
Creates a new UAX29URLEmailTokenizer with a given AttributeFactory
-
-
Method Details
-
setMaxTokenLength
public void setMaxTokenLength(int length)
Set the max allowed token length. Tokens longer than this will be chopped up at this token length and emitted as multiple tokens. If you need to skip such large tokens, you could increase this max length, and then use LengthFilter to remove long tokens. The default is UAX29URLEmailAnalyzer.DEFAULT_MAX_TOKEN_LENGTH.
- Throws:
IllegalArgumentException - if the given length is outside of the range [1, 1048576].
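The chopping rule and the range check described above can be modeled on plain strings. This is only an illustrative sketch of the documented behavior, not Lucene's JFlex-based implementation: the class MaxTokenLengthSketch, the method chop, and the constant LIMIT are hypothetical, and lengths are counted in Java chars as a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Lucene code: models the documented chopping rule on plain strings.
final class MaxTokenLengthSketch {

  // Mirrors the upper bound of the documented range [1, 1048576].
  static final int LIMIT = 1_048_576;

  /** Splits a token into consecutive pieces of at most maxTokenLength chars. */
  static List<String> chop(String token, int maxTokenLength) {
    if (maxTokenLength < 1 || maxTokenLength > LIMIT) {
      throw new IllegalArgumentException(
          "maxTokenLength must be in [1, " + LIMIT + "], got " + maxTokenLength);
    }
    List<String> pieces = new ArrayList<>();
    for (int start = 0; start < token.length(); start += maxTokenLength) {
      int end = Math.min(start + maxTokenLength, token.length());
      pieces.add(token.substring(start, end)); // each piece would be emitted as its own token
    }
    return pieces;
  }

  public static void main(String[] args) {
    System.out.println(chop("abcdefgh", 3)); // prints [abc, def, gh]
  }
}
```

In this model an over-long token is never dropped, only split, which is why the text above suggests raising the limit and filtering by length afterwards when long tokens should be skipped entirely.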
-
getMaxTokenLength
public int getMaxTokenLength()
Returns the max allowed token length (see setMaxTokenLength).
-
incrementToken
public final boolean incrementToken() throws IOException
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
end
public final void end() throws IOException
- Overrides:
end in class TokenStream
- Throws:
IOException
-
close
public void close() throws IOException
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class Tokenizer
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides:
reset in class Tokenizer
- Throws:
IOException
-