org.apache.lucene.analysis.ko.KoreanNumberFilter

All Implemented Interfaces: Closeable, AutoCloseable, Unwrappable<TokenStream>

public class KoreanNumberFilter extends TokenFilter

A TokenFilter that normalizes Korean numbers to regular Arabic decimal numbers in half-width characters.

Korean numbers are often written using a combination of Hangul and Arabic numerals with various kinds of punctuation. For example, ３．２천 means 3200. This filter performs this normalization, so a search for 3200 matches ３．２천 in text. The normalized numbers can also be used for other purposes, such as building range facets.

Note that this filter uses a token composition scheme and relies on punctuation tokens being present in the token stream. Make sure your KoreanTokenizer has discardPunctuation set to false. If punctuation characters, such as ． (U+FF0E FULLWIDTH FULL STOP), are removed from the token stream, this filter would see the input tokens ３ and ２천 and output 3 and 2000 instead of 3200, which is likely not the intended result. If you want to remove punctuation characters that are not part of normalized numbers from your index, add a StopFilter with the punctuation you wish to remove after KoreanNumberFilter in your analyzer chain.
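The chain described above could be wired up roughly as follows. This is a minimal sketch, not a recommended configuration: the class name KoreanNumberAnalyzer and the contents of the punctuation set are illustrative assumptions, and it assumes a Lucene version in which KoreanTokenizer offers the five-argument constructor with a discardPunctuation flag.

```java
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanNumberFilter;
import org.apache.lucene.analysis.ko.KoreanTokenizer;

// Hypothetical analyzer name, used only for this illustration.
public class KoreanNumberAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Keep punctuation tokens so the filter can compose numbers such as ３．２천.
        Tokenizer tokenizer = new KoreanTokenizer(
            TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY,
            null,                               // no user dictionary
            KoreanTokenizer.DEFAULT_DECOMPOUND,
            false,                              // outputUnknownUnigrams
            false);                             // discardPunctuation
        TokenStream stream = new KoreanNumberFilter(tokenizer);

        // Assumed punctuation list: drop leftover punctuation only AFTER numbers were normalized.
        CharArraySet punctuation =
            new CharArraySet(Arrays.asList(".", ",", "．", "，"), false);
        stream = new StopFilter(stream, punctuation);
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```

Placing the StopFilter after KoreanNumberFilter matters: punctuation that is part of a number has already been absorbed into the normalized term by then, so only stray punctuation tokens are removed.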

Below are some examples of normalizations this filter supports. The input is untokenized text and the result is the single term attribute emitted for the input.

영영칠 becomes 7
일영영영 becomes 1000
삼천2백2십삼 becomes 3223
조육백만오천일 becomes 1000006005001
３.２천 becomes 3200
１.２만３４５.６７ becomes 12345.67
4,647.100 becomes 4647.1
15,7 becomes 157 (be aware of this weakness)
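To make the examples above concrete, here is a simplified, standalone sketch of how such a normalization can be computed. It is not the filter's actual implementation: the class KoreanNumberSketch and all of its methods are hypothetical, it works on a plain string rather than a token stream, and it only assumes the behavior visible in the examples (thousands separators are dropped, digit runs are multiplied by Hangul units, and trailing zeros are stripped).

```java
import java.math.BigDecimal;

// Hypothetical illustration class; not part of Lucene.
public class KoreanNumberSketch {
    private final String s; // input with thousands separators removed
    private int pos;        // parsed-to marker

    private KoreanNumberSketch(String s) { this.s = s; }

    // Value of a half-width, full-width, or Hangul digit; -1 otherwise.
    private static int digit(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= '０' && c <= '９') return c - '０';
        return "영일이삼사오육칠팔구".indexOf(c);
    }

    // Consumes one unit character from `units` and returns its power of ten, or null.
    private BigDecimal unit(String units, int[] exponents) {
        if (pos >= s.length()) return null;
        int i = units.indexOf(s.charAt(pos));
        if (i < 0) return null;
        pos++;
        return BigDecimal.TEN.pow(exponents[i]);
    }

    // Consumes a run of digits with at most one decimal point, or returns null.
    private BigDecimal basic() {
        StringBuilder sb = new StringBuilder();
        while (pos < s.length()) {
            char c = s.charAt(pos);
            int d = digit(c);
            if (d >= 0) {
                sb.append(d);
            } else if ((c == '.' || c == '．') && sb.length() > 0 && sb.indexOf(".") < 0) {
                sb.append('.');
            } else {
                break;
            }
            pos++;
        }
        return sb.length() == 0 ? null : new BigDecimal(sb.toString());
    }

    // An optional number followed by an optional unit; null if both are absent.
    private static BigDecimal combine(BigDecimal number, BigDecimal unit) {
        if (number == null) return unit;
        return unit == null ? number : number.multiply(unit);
    }

    // Sum of (digits x 십/백/천) pairs, e.g. 삼천2백 -> 3200.
    private BigDecimal medium() {
        BigDecimal sum = null;
        while (true) {
            BigDecimal number = basic();
            BigDecimal pair = combine(number, unit("십백천", new int[] {1, 2, 3}));
            if (pair == null) return sum;
            sum = (sum == null) ? pair : sum.add(pair);
        }
    }

    // Returns the normalized number, or the input unchanged if it cannot be parsed.
    public static String normalize(String input) {
        KoreanNumberSketch p =
            new KoreanNumberSketch(input.replace(",", "").replace("，", ""));
        BigDecimal sum = null;
        while (true) {
            BigDecimal number = p.medium();
            BigDecimal pair = combine(number, p.unit("만억조경", new int[] {4, 8, 12, 16}));
            if (pair == null) break;
            sum = (sum == null) ? pair : sum.add(pair);
        }
        if (sum == null || p.pos < p.s.length()) return input; // not a number: no-op
        return sum.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("삼천2백2십삼")); // prints 3223
        System.out.println(normalize("15,7"));         // prints 157
    }
}
```

The 15,7 case shows where the weakness comes from in this sketch: the comma is treated as a thousands separator and dropped before parsing, whether or not it sits at a thousands boundary.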

Tokens preceded by a token with a PositionIncrementAttribute of zero are left untouched and emitted as-is.

This filter does not use any part-of-speech information for its normalization. The motivation is to also support n-grammed token streams in the future.

This filter may in some cases normalize tokens that are not numbers in their context. For example, 전중경일 is a name (Tanaka Kyōichi), but 경일 (Kyōichi) out of context can, strictly speaking, also represent the number 10000000000000001. This filter respects the KeywordAttribute, which can be used to prevent specific normalizations from happening.
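One way to set the KeywordAttribute is to place a SetKeywordMarkerFilter (from the analysis-common module) in front of this filter. The sketch below is illustrative only: the class ProtectedNumberChain and the method withProtectedTerms are hypothetical, and the protected term 경일 assumes the tokenizer actually emits it as a separate token, which depends on the dictionary and input.

```java
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanNumberFilter;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;

// Hypothetical helper class, used only for this illustration.
public final class ProtectedNumberChain {
    private ProtectedNumberChain() {}

    // Wraps `tokens` so that terms in the protected set are marked as keywords
    // before they reach KoreanNumberFilter and are therefore not normalized.
    public static TokenStream withProtectedTerms(TokenStream tokens) {
        // Assumed example term; replace with the terms your tokenizer really produces.
        CharArraySet protectedTerms = new CharArraySet(Arrays.asList("경일"), false);
        TokenStream marked = new SetKeywordMarkerFilter(tokens, protectedTerms);
        return new KoreanNumberFilter(marked);
    }
}
```

The marker filter has to come before KoreanNumberFilter in the chain, because the keyword flag is checked at the moment a token is considered for normalization.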

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

KoreanNumberFilter.NumberBuffer

Buffer that holds a Korean number string and a position index used as a parsed-to marker

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary

Fields inherited from class org.apache.lucene.analysis.TokenFilter
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary

Constructors

Constructor

Description

KoreanNumberFilter(TokenStream input)
Method Summary

Modifier and Type

Method

Description

final boolean

incrementToken()

boolean

isArabicNumeral(char c)

Arabic numeral predicate.

boolean

isNumeral(char c)

Numeral predicate

boolean

isNumeral(String input)

Numeral predicate

boolean

isNumeralPunctuation(char c)

Numeral punctuation predicate

boolean

isNumeralPunctuation(String input)

Numeral punctuation predicate

String

normalizeNumber(String number)

Normalizes a Korean number

BigDecimal

parseLargeHangulNumeral(KoreanNumberFilter.NumberBuffer buffer)

Parse large Hangul numerals (ten thousands or larger)

BigDecimal

parseMediumHangulNumeral(KoreanNumberFilter.NumberBuffer buffer)

Parse medium Hangul numerals (tens, hundreds or thousands)

void

reset()

Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Constructor Details
- KoreanNumberFilter
  
  public KoreanNumberFilter(TokenStream input)
Method Details
- incrementToken
  
  public final boolean incrementToken() throws IOException
  
  Specified by:
  
  incrementToken in class TokenStream
  
  Throws:
  
  IOException
- reset
  
  public void reset() throws IOException
  
  Overrides:
  
  reset in class TokenFilter
  
  Throws:
  
  IOException
- normalizeNumber
  
  public String normalizeNumber(String number)
  
  Normalizes a Korean number
  
  Parameters:
  
  number - number to normalize
  
  Returns:
  
  normalized number, or the input number unchanged on error (no-op)
- parseLargeHangulNumeral
  
  public BigDecimal parseLargeHangulNumeral(KoreanNumberFilter.NumberBuffer buffer)
  
  Parse large Hangul numerals (ten thousands or larger)
  
  Parameters:
  
  buffer - buffer to parse
  
  Returns:
  
  parsed number, or null on error or end of input
- parseMediumHangulNumeral
  
  public BigDecimal parseMediumHangulNumeral(KoreanNumberFilter.NumberBuffer buffer)
  
  Parse medium Hangul numerals (tens, hundreds or thousands)
  
  Parameters:
  
  buffer - buffer to parse
  
  Returns:
  
  parsed number or null on error
- isNumeral
  
  public boolean isNumeral(String input)
  
  Numeral predicate
  
  Parameters:
  
  input - string to test
  
  Returns:
  
  true if and only if input is a numeral
- isNumeral
  
  public boolean isNumeral(char c)
  
  Numeral predicate
  
  Parameters:
  
  c - character to test
  
  Returns:
  
  true if and only if c is a numeral
- isNumeralPunctuation
  
  public boolean isNumeralPunctuation(String input)
  
  Numeral punctuation predicate
  
  Parameters:
  
  input - string to test
  
  Returns:
  
  true if and only if input is a numeral punctuation string
- isNumeralPunctuation
  
  public boolean isNumeralPunctuation(char c)
  
  Numeral punctuation predicate
  
  Parameters:
  
  c - character to test
  
  Returns:
  
  true if and only if c is a numeral punctuation character
- isArabicNumeral
  
  public boolean isArabicNumeral(char c)
  
  Arabic numeral predicate. Both half-width and full-width characters are supported
  
  Parameters:
  
  c - character to test
  
  Returns:
  
  true if and only if c is an Arabic numeral
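
A predicate that accepts both half-width and full-width Arabic numerals, as described for isArabicNumeral, can be sketched with two range checks. This is a standalone illustration with a hypothetical class name, not the filter's source code.

```java
// Hypothetical illustration class; not part of Lucene.
public class NumeralPredicates {
    // True for half-width digits 0-9 (U+0030..U+0039)
    // and full-width digits ０-９ (U+FF10..U+FF19).
    public static boolean isArabicNumeral(char c) {
        return (c >= '0' && c <= '9') || (c >= '０' && c <= '９');
    }

    public static void main(String[] args) {
        System.out.println(isArabicNumeral('7'));  // prints true
        System.out.println(isArabicNumeral('７')); // prints true
        System.out.println(isArabicNumeral('칠')); // prints false
    }
}
```

Hangul digits such as 칠 are numerals but not Arabic numerals, which is why a separate isNumeral predicate exists alongside this one.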
