public class KoreanNumberFilter extends TokenFilter
A TokenFilter that normalizes Korean numbers to regular Arabic decimal numbers in half-width characters.
Korean numbers are often written using a combination of Hangul and Arabic numbers with various kinds of punctuation. For example, 3.2천 means 3200. This filter performs this kind of normalization, which allows a search for 3200 to match 3.2천 in text. The normalized numbers can also be used for other purposes, such as building range facets.
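To make the arithmetic behind an example such as 3.2천 concrete, the following is a minimal, self-contained sketch of this kind of normalization. It is not the filter's actual implementation: the class name HangulNumberSketch, its helper methods, and the set of supported characters are assumptions chosen for illustration. The sketch handles only half-width digits, a decimal point, Hangul digits, and the units 십, 백, 천, 만, 억, 조 and 경, and it assumes well-formed input.

```java
import java.math.BigDecimal;

// Illustrative sketch only; NOT the code used by KoreanNumberFilter.
final class HangulNumberSketch {
    // Hangul digits 0-9; the index of a character is its numeric value.
    private static final String HANGUL_DIGITS = "영일이삼사오육칠팔구";

    // Returns the power of ten for a unit character, or -1 if c is not a unit.
    static int unitExponent(char c) {
        switch (c) {
            case '십': return 1;   // ten
            case '백': return 2;   // hundred
            case '천': return 3;   // thousand
            case '만': return 4;   // ten thousand
            case '억': return 8;   // hundred million
            case '조': return 12;  // trillion
            case '경': return 16;  // ten quadrillion
            default:   return -1;
        }
    }

    // Value of the digits collected so far, or zero if there are none.
    private static BigDecimal value(StringBuilder digits) {
        return digits.length() == 0 ? BigDecimal.ZERO : new BigDecimal(digits.toString());
    }

    static String normalize(String text) {
        BigDecimal total = BigDecimal.ZERO;          // completed large groups (만 and above)
        BigDecimal section = BigDecimal.ZERO;        // tens, hundreds, thousands of the current group
        StringBuilder pending = new StringBuilder(); // digits not yet scaled by a unit
        for (char c : text.toCharArray()) {
            int hangulDigit = HANGUL_DIGITS.indexOf(c);
            int exponent = unitExponent(c);
            if ((c >= '0' && c <= '9') || c == '.') {
                pending.append(c);
            } else if (hangulDigit >= 0) {
                pending.append((char) ('0' + hangulDigit));
            } else if (exponent >= 1 && exponent <= 3) {
                // A medium unit scales the pending digits; a bare unit counts as 1.
                BigDecimal factor = pending.length() == 0 ? BigDecimal.ONE : value(pending);
                section = section.add(factor.movePointRight(exponent));
                pending.setLength(0);
            } else if (exponent >= 4) {
                // A large unit scales everything collected since the previous large unit.
                boolean bareUnit = pending.length() == 0 && section.signum() == 0;
                BigDecimal group = bareUnit ? BigDecimal.ONE : section.add(value(pending));
                total = total.add(group.movePointRight(exponent));
                section = BigDecimal.ZERO;
                pending.setLength(0);
            } else {
                throw new IllegalArgumentException("Unsupported character: " + c);
            }
        }
        BigDecimal result = total.add(section).add(value(pending));
        // Drop trailing zeros and avoid scientific notation in the output.
        return result.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("3.2천"));  // 3200
        System.out.println(normalize("경일"));   // 10000000000000001
    }
}
```

The sketch uses BigDecimal rather than double so that decimal inputs such as 3.2 and very large values such as 10000000000000001 are represented exactly.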
Note that this filter uses a token composition scheme and relies on punctuation tokens being present in the token stream. Please make sure your KoreanTokenizer has discardPunctuation set to false. If punctuation characters, such as ． (U+FF0E FULLWIDTH FULL STOP), are removed from the token stream, this filter would find input tokens 3 and 2천 and output 3 and 2000 instead of 3200, which is likely not the intended result. If you want to remove punctuation characters from your index that are not part of normalized numbers, add a StopFilter with the punctuation you wish to remove after KoreanNumberFilter in your analyzer chain.
Below are some examples of normalizations this filter supports. The input is untokenized text and the result is the single term attribute emitted for the input.
Tokens preceded by a token with PositionIncrementAttribute of zero are left untouched and emitted as-is.
This filter does not use any part-of-speech information for its normalization. The motivation for this is to also support n-grammed token streams in the future.
This filter may in some cases normalize tokens that are not numbers in their context. For example, 전중경일 is a name and means Tanaka Kyōichi, but 경일 (Kyōichi) out of context can, strictly speaking, also represent the number 10000000000000001. This filter respects the KeywordAttribute, which can be used to prevent specific normalizations from happening.
Modifier and Type | Class and Description |
---|---|
static class | KoreanNumberFilter.NumberBuffer: Buffer that holds a Korean number string and a position index used as a parsed-to marker |
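Nested classes inherited from class AttributeSource: AttributeSource.State
Fields inherited from class TokenFilter: input
Fields inherited from class TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY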
Constructor and Description |
---|
KoreanNumberFilter(TokenStream input) |
Modifier and Type | Method and Description |
---|---|
boolean | incrementToken() |
boolean | isArabicNumeral(char c): Arabic numeral predicate. |
boolean | isNumeral(char c): Numeral predicate |
boolean | isNumeral(String input): Numeral predicate |
boolean | isNumeralPunctuation(char c): Numeral punctuation predicate |
boolean | isNumeralPunctuation(String input): Numeral punctuation predicate |
String | normalizeNumber(String number): Normalizes a Korean number |
BigDecimal | parseLargeHangulNumeral(KoreanNumberFilter.NumberBuffer buffer): Parse large Hangul numerals (ten thousands or larger) |
BigDecimal | parseMediumHangulNumeral(KoreanNumberFilter.NumberBuffer buffer): Parse medium Hangul numerals (tens, hundreds or thousands) |
void | reset() |
Methods inherited from class TokenFilter: close, end
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
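The character-level predicates in the method summary can be pictured with a small standalone sketch. The method names below mirror isArabicNumeral(char) and isNumeralPunctuation(char), but the bodies are assumptions made for illustration, not the filter's actual code: in particular, the set of punctuation characters treated as numeral punctuation and the hypothetical helper toHalfWidth are not taken from this page.

```java
// Illustrative sketch only; the real predicates are instance methods of KoreanNumberFilter.
final class NumeralCharSketch {
    // Half-width digits '0'..'9'.
    static boolean isHalfWidthArabicNumeral(char c) {
        return c >= '0' && c <= '9';
    }

    // Full-width digits U+FF10..U+FF19.
    static boolean isFullWidthArabicNumeral(char c) {
        return c >= '\uFF10' && c <= '\uFF19';
    }

    static boolean isArabicNumeral(char c) {
        return isHalfWidthArabicNumeral(c) || isFullWidthArabicNumeral(c);
    }

    // Assumed set: half-width and full-width full stop and comma.
    static boolean isNumeralPunctuation(char c) {
        return c == '.' || c == ',' || c == '\uFF0E' || c == '\uFF0C';
    }

    // Hypothetical helper: folds full-width digits and numeral punctuation to half-width.
    static String toHalfWidth(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (isFullWidthArabicNumeral(c)) {
                out.append((char) (c - '\uFF10' + '0'));
            } else if (c == '\uFF0E') {
                out.append('.');
            } else if (c == '\uFF0C') {
                out.append(',');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Folding to half-width before parsing is one way to arrive at output "in half-width characters", as the class description puts it; whether the filter folds characters at this point or later is not stated here.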
public KoreanNumberFilter(TokenStream input)

public final boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException

public void reset() throws IOException
Overrides: reset in class TokenFilter
Throws: IOException

public String normalizeNumber(String number)
Parameters: number - number to normalize

public BigDecimal parseLargeHangulNumeral(KoreanNumberFilter.NumberBuffer buffer)
Parameters: buffer - buffer to parse

public BigDecimal parseMediumHangulNumeral(KoreanNumberFilter.NumberBuffer buffer)
Parameters: buffer - buffer to parse

public boolean isNumeral(String input)
Parameters: input - string to test

public boolean isNumeral(char c)
Parameters: c - character to test

public boolean isNumeralPunctuation(String input)
Parameters: input - string to test

public boolean isNumeralPunctuation(char c)
Parameters: c - character to test

public boolean isArabicNumeral(char c)
Parameters: c - character to test

Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.