Class JapaneseNumberFilter
- All Implemented Interfaces:
Closeable, AutoCloseable, Unwrappable<TokenStream>
A TokenFilter that normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters.
Japanese numbers are often written using a combination of kanji and Arabic numerals with various kinds of punctuation. For example, 3.2千 means 3200. This filter performs this kind of normalization, which allows a search for 3200 to match 3.2千 in text. The normalized numbers can also be used for other purposes, such as building range facets.
Note that this filter uses a token composition scheme and relies on punctuation tokens being present in the token stream. Make sure your JapaneseTokenizer has discardPunctuation set to false. If punctuation characters such as ． (U+FF0E FULLWIDTH FULL STOP) are removed from the token stream, this filter sees the input tokens 3 and 2千 and outputs 3 and 2000 instead of 3200, which is likely not the intended result. To remove punctuation characters that are not part of normalized numbers from your index, add a StopFilter with the punctuation you wish to remove after JapaneseNumberFilter in your analyzer chain.
Below are some examples of normalizations this filter supports. The input is untokenized text and the result is the single term attribute emitted for the input.
- 〇〇七 becomes 7
- 一〇〇〇 becomes 1000
- 三千2百2十三 becomes 3223
- 兆六百万五千一 becomes 1000006005001
- 3.2千 becomes 3200
- 1.2万345.67 becomes 12345.67
- 4,647.100 becomes 4647.1
- 15,7 becomes 157 (be aware of this weakness)
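The normalizations listed above can be approximated with a small recursive-descent parser. The following is a minimal sketch under stated assumptions, not the filter's actual implementation: the class name KanjiNumberSketch and all of its members are hypothetical, it works on a plain string rather than on a token stream, and it only knows the units 十, 百, 千, 万, 億, 兆 and 京. Kanji are written as Unicode escapes so that the file compiles regardless of the source encoding.

```java
import java.math.BigDecimal;

/** Toy recursive-descent normalizer; a sketch, not Lucene's implementation. */
final class KanjiNumberSketch {
    // Kanji digits 0..9 in order; the index of a character is its value.
    private static final String KANJI_DIGITS =
        "\u3007\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d";
    // Medium units: ju (10), hyaku (100), sen (1000).
    private static final String MEDIUM_UNITS = "\u5341\u767e\u5343";
    private static final int[] MEDIUM_EXPONENTS = {1, 2, 3};
    // Large units: man (10^4), oku (10^8), cho (10^12), kei (10^16).
    private static final String LARGE_UNITS = "\u4e07\u5104\u5146\u4eac";
    private static final int[] LARGE_EXPONENTS = {4, 8, 12, 16};

    private final String text;
    private int pos; // index of the next unparsed character

    private KanjiNumberSketch(String text) {
        this.text = text;
    }

    /** Returns the normalized number, or the input unchanged if it cannot be parsed. */
    static String normalize(String input) {
        KanjiNumberSketch parser = new KanjiNumberSketch(input);
        BigDecimal total = null;
        try {
            while (parser.pos < input.length()) {
                BigDecimal group = parser.largeGroup();
                if (group == null) {
                    return input; // not a number: no-op
                }
                total = (total == null) ? group : total.add(group);
            }
        } catch (NumberFormatException e) {
            return input; // malformed digits such as "1.2.3": no-op
        }
        return (total == null) ? input : total.stripTrailingZeros().toPlainString();
    }

    // A medium number optionally multiplied by a large unit.
    private BigDecimal largeGroup() {
        return combine(mediumNumber(), unit(LARGE_UNITS, LARGE_EXPONENTS));
    }

    // Sum of pairs, each a basic number optionally multiplied by a medium unit.
    private BigDecimal mediumNumber() {
        BigDecimal sum = null;
        while (pos < text.length()) {
            BigDecimal pair = combine(basicNumber(), unit(MEDIUM_UNITS, MEDIUM_EXPONENTS));
            if (pair == null) {
                break;
            }
            sum = (sum == null) ? pair : sum.add(pair);
        }
        return sum;
    }

    // Multiplies value by unit; a missing side is treated as absent, not as zero.
    private static BigDecimal combine(BigDecimal value, BigDecimal unit) {
        if (value == null) {
            return unit;
        }
        return (unit == null) ? value : value.multiply(unit);
    }

    // Reads a run of digits; commas are skipped and decimal points are kept.
    private BigDecimal basicNumber() {
        StringBuilder digits = new StringBuilder();
        for (; pos < text.length(); pos++) {
            char c = text.charAt(pos);
            int value = digitValue(c);
            if (value >= 0) {
                digits.append(value);
            } else if (c == '.' || c == '\uff0e') {
                digits.append('.');
            } else if (c != ',' && c != '\uff0c') {
                break;
            }
        }
        return (digits.length() == 0) ? null : new BigDecimal(digits.toString());
    }

    // Consumes one unit character, if present, and returns its power of ten.
    private BigDecimal unit(String units, int[] exponents) {
        int index = (pos < text.length()) ? units.indexOf(text.charAt(pos)) : -1;
        if (index < 0) {
            return null;
        }
        pos++;
        return BigDecimal.TEN.pow(exponents[index]);
    }

    // Value of a half-width, full-width or kanji digit, or -1 for anything else.
    private static int digitValue(char c) {
        if (c >= '0' && c <= '9') {
            return c - '0';
        }
        if (c >= '\uff10' && c <= '\uff19') {
            return c - '\uff10';
        }
        return KANJI_DIGITS.indexOf(c);
    }
}
```

With this sketch, KanjiNumberSketch.normalize("3.2千") returns "3200" and KanjiNumberSketch.normalize("15,7") returns "157", because commas are skipped without checking that they separate groups of three digits. Input that cannot be parsed is returned unchanged.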
Tokens preceded by a token with a PositionIncrementAttribute of zero are left untouched and emitted as-is.
This filter does not use any part-of-speech information for its normalization. The motivation is to also support n-grammed token streams in the future.
This filter may in some cases normalize tokens that are not numbers in their context. For example, 田中京一 is a name and means Tanaka Kyōichi, but 京一 (Kyōichi) out of context can, strictly speaking, also represent the number 10000000000000001. This filter respects the KeywordAttribute, which can be used to prevent specific normalizations from happening.
Also note that token attributes such as PartOfSpeechAttribute, ReadingAttribute, InflectionAttribute and BaseFormAttribute are left unchanged. They inherit the values of the last token used to compose the normalized number and can therefore be wrong. For example, for 10万 (100000), ReadingAttribute is set to マン. This is a known issue and is subject to a future improvement.
Japanese formal numbers (daiji), accounting numbers and decimal fractions are currently not supported.
-
Nested Class Summary
Modifier and Type, Class, Description:
static class: Buffer that holds a Japanese number string and a position index used as a parsed-to marker
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
-
Method Summary
Modifier and Type, Method, Description:
final boolean incrementToken()
boolean isArabicNumeral(char c): Arabic numeral predicate.
boolean isNumeral(char c): Numeral predicate
boolean isNumeral(String input): Numeral predicate
boolean isNumeralPunctuation(char c): Numeral punctuation predicate
boolean isNumeralPunctuation(String input): Numeral punctuation predicate
normalizeNumber(String number): Normalizes a Japanese number
parseLargeKanjiNumeral: Parse large kanji numerals (ten thousands or larger)
parseMediumKanjiNumeral: Parse medium kanji numerals (tens, hundreds or thousands)
void reset()
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
JapaneseNumberFilter
-
-
Method Details
-
incrementToken
- Specified by:
incrementToken
in class TokenStream
- Throws:
IOException
-
reset
- Overrides:
reset
in class TokenFilter
- Throws:
IOException
-
normalizeNumber
Normalizes a Japanese number
- Parameters:
number - number to normalize
- Returns:
normalized number, or the input number unchanged on error (no-op)
-
parseLargeKanjiNumeral
Parse large kanji numerals (ten thousands or larger)
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error or end of input
-
parseMediumKanjiNumeral
Parse medium kanji numerals (tens, hundreds or thousands)
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error
-
isNumeral
Numeral predicate
- Parameters:
input - string to test
- Returns:
true if and only if input is a numeral
-
isNumeral
public boolean isNumeral(char c)
Numeral predicate
- Parameters:
c - character to test
- Returns:
true if and only if c is a numeral
-
isNumeralPunctuation
Numeral punctuation predicate
- Parameters:
input - string to test
- Returns:
true if and only if input is a numeral punctuation string
-
isNumeralPunctuation
public boolean isNumeralPunctuation(char c)
Numeral punctuation predicate
- Parameters:
c - character to test
- Returns:
true if and only if c is a numeral punctuation character
-
isArabicNumeral
public boolean isArabicNumeral(char c)
Arabic numeral predicate. Both half-width and full-width characters are supported
- Parameters:
c - character to test
- Returns:
true if and only if c is an Arabic numeral
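As a rough illustration of what supporting both half-width and full-width characters means here, the stand-alone sketch below checks the two digit ranges, ASCII 0 to 9 and U+FF10 to U+FF19. The class name ArabicNumeralSketch and the helper arabicNumeralValue are hypothetical, and the code is an assumption based on the description above rather than the Lucene source.

```java
/** Stand-alone sketch of a half-width and full-width digit check; not the Lucene source. */
final class ArabicNumeralSketch {
    private ArabicNumeralSketch() {}

    /** True for ASCII digits and for full-width digits U+FF10..U+FF19. */
    static boolean isArabicNumeral(char c) {
        return (c >= '0' && c <= '9') || (c >= '\uff10' && c <= '\uff19');
    }

    /** Numeric value 0..9 of an Arabic numeral, or -1 for any other character. */
    static int arabicNumeralValue(char c) {
        if (c >= '0' && c <= '9') {
            return c - '0';
        }
        if (c >= '\uff10' && c <= '\uff19') {
            return c - '\uff10';
        }
        return -1;
    }
}
```

Under these assumptions, a kanji digit such as 七 is not an Arabic numeral, so a separate check is needed for kanji numerals.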
-