Class ICUNormalizer2Filter

  • All Implemented Interfaces:
    Closeable, AutoCloseable, Unwrappable<TokenStream>
    Direct Known Subclasses:
    ICUFoldingFilter

    public class ICUNormalizer2Filter
    extends TokenFilter
    Normalizes token text using ICU's Normalizer2.

    With this filter, you can normalize text in the following ways:

    • NFKC Normalization, Case Folding, and removing Ignorables (the default)
    • Using a standard Normalization mode (NFC, NFD, NFKC, NFKD)
    • Applying rules from a custom normalization mapping
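
The four standard modes can be illustrated without ICU. The sketch below uses the JDK's java.text.Normalizer, not this filter or ICU's Normalizer2, so it only shows what the forms do to a string. The class NormalizationForms and its normalize helper are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

final class NormalizationForms {
    /** Applies the named standard form ("NFC", "NFD", "NFKC" or "NFKD") to the text. */
    static String normalize(String text, String form) {
        return Normalizer.normalize(text, Normalizer.Form.valueOf(form));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // e with acute accent as a single code point
        String decomposed = "e\u0301"; // e followed by a combining acute accent
        String ligature = "\uFB01";    // Latin small ligature fi

        // Canonical forms convert between composed and decomposed spellings.
        System.out.println(normalize(decomposed, "NFC").equals(composed));
        System.out.println(normalize(composed, "NFD").equals(decomposed));

        // Canonical forms leave compatibility characters alone;
        // the compatibility forms (NFKC, NFKD) expand them.
        System.out.println(normalize(ligature, "NFC").equals(ligature));
        System.out.println(normalize(ligature, "NFKC").equals("fi"));
    }
}
```

The canonical forms (NFC, NFD) preserve distinctions such as the ligature, while the compatibility forms (NFKC, NFKD) collapse them, which is usually what search wants.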

    With the defaults, this filter offers a simple, language-independent way to standardize Unicode text for search:

    • Its case folding can serve as a replacement for LowerCaseFilter. For example, it handles cases such as the Greek sigma, so that "Μάϊος" and "ΜΆΪΟΣ" match correctly.
    • The normalization standardizes different forms of the same character in Unicode. For example, CJK full-width numbers are standardized to their ASCII forms.
    • Ignorables such as Zero-Width Joiner and Variation Selectors are removed. These are typically modifier characters that affect display.
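
The three default behaviors listed above can be approximated with the JDK alone. This is a rough sketch, not the NFKC_Casefold algorithm the filter actually uses: it replaces real case folding with lowercasing plus an explicit final-sigma mapping, and it strips only a small, hand-picked set of ignorable characters. The class DefaultFoldingSketch and the method approximateFold are hypothetical names.

```java
import java.text.Normalizer;
import java.util.Locale;

final class DefaultFoldingSketch {
    /** Approximates the default behavior: NFKC, case folding, ignorable removal. */
    static String approximateFold(String text) {
        // 1. Compatibility normalization, e.g. full-width digits become ASCII digits.
        String s = Normalizer.normalize(text, Normalizer.Form.NFKC);

        // 2. Lowercasing as a stand-in for case folding (an assumption; the
        //    two differ for some characters).
        s = s.toLowerCase(Locale.ROOT);

        // Case folding maps final sigma to regular sigma; lowercasing does
        // not, so the mapping is applied explicitly here.
        s = s.replace('\u03C2', '\u03C3');

        // 3. Remove a few ignorables: zero-width non-joiner, zero-width joiner,
        //    and variation selectors 1 to 16. The real set is much larger.
        return s.replaceAll("[\\u200C\\u200D\\uFE00-\\uFE0F]", "");
    }

    public static void main(String[] args) {
        String mixedCase = "\u039C\u03AC\u03CA\u03BF\u03C2"; // Μάϊος
        String upperCase = "\u039C\u0386\u03AA\u039F\u03A3"; // ΜΆΪΟΣ
        System.out.println(approximateFold(mixedCase).equals(approximateFold(upperCase)));
        System.out.println(approximateFold("\uFF11\uFF12\uFF13")); // full-width 123
        System.out.println(approximateFold("a\u200Db"));           // a, ZWJ, b
    }
}
```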
    See Also:
    Normalizer2, FilteredNormalizer2
    • Constructor Detail

      • ICUNormalizer2Filter

        public ICUNormalizer2Filter(TokenStream input)
        Create a new ICUNormalizer2Filter that applies NFKC normalization, Case Folding, and removal of Default Ignorables (NFKC_Casefold)
      • ICUNormalizer2Filter

        public ICUNormalizer2Filter(TokenStream input,
                                    com.ibm.icu.text.Normalizer2 normalizer)
        Create a new ICUNormalizer2Filter with the specified Normalizer2
        Parameters:
        input - token stream to normalize
        normalizer - normalizer to use