Class SoraniNormalizer

java.lang.Object
org.apache.lucene.analysis.ckb.SoraniNormalizer

public class SoraniNormalizer extends Object
Normalizes the Unicode representation of Sorani text.

Normalization consists of:

  • Alternate forms of 'y' (064A, 0649) are converted to 06CC (FARSI YEH)
  • Alternate form of 'k' (0643) is converted to 06A9 (KEHEH)
  • Alternate forms of vowel 'e' (0647+200C, word-final 0647, 0629) are converted to 06D5 (AE)
  • Alternate (joining) form of 'h' (06BE) is converted to 0647 (HEH)
  • Alternate forms of 'rr' (0692, word-initial 0631) are converted to 0695 (REH WITH SMALL V BELOW)
  • Harakat, tatweel, and formatting characters such as directional controls are removed.
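
The rules above can be sketched as a single pass over the buffer. This is only an illustration under stated assumptions, not Lucene's source: the class name SoraniRulesSketch and its constant names are hypothetical, "harakat" is taken to mean the range 064B to 0652, and "formatting characters" is taken to mean the Unicode general category Cf.

```java
/** Illustrative sketch of the rules listed above; not Lucene's source code. */
final class SoraniRulesSketch {
    static final char YEH = '\u064A';
    static final char DOTLESS_YEH = '\u0649';
    static final char FARSI_YEH = '\u06CC';
    static final char KAF = '\u0643';
    static final char KEHEH = '\u06A9';
    static final char HEH = '\u0647';
    static final char AE = '\u06D5';
    static final char ZWNJ = '\u200C';
    static final char TEH_MARBUTA = '\u0629';
    static final char HEH_DOACHASHMEE = '\u06BE';
    static final char REH = '\u0631';
    static final char RREH = '\u0695';
    static final char RREH_ABOVE = '\u0692';
    static final char TATWEEL = '\u0640';
    static final char FATHATAN = '\u064B'; // first mark in the assumed harakat range
    static final char SUKUN = '\u0652';    // last mark in the assumed harakat range

    /** Rewrites s in place and returns the number of characters kept. */
    static int normalize(char[] s, int len) {
        int out = 0; // write index; never overtakes the read index i
        for (int i = 0; i < len; i++) {
            char c = s[i];
            switch (c) {
                case YEH:
                case DOTLESS_YEH:
                    s[out++] = FARSI_YEH;
                    break;
                case KAF:
                    s[out++] = KEHEH;
                    break;
                case ZWNJ:
                    // HEH followed by ZWNJ spells the vowel: rewrite the HEH, drop the ZWNJ.
                    if (out > 0 && s[out - 1] == HEH) {
                        s[out - 1] = AE;
                    }
                    break;
                case HEH:
                    // Only a HEH in the last position counts as word-final.
                    s[out++] = (i == len - 1) ? AE : HEH;
                    break;
                case TEH_MARBUTA:
                    s[out++] = AE;
                    break;
                case HEH_DOACHASHMEE:
                    s[out++] = HEH;
                    break;
                case REH:
                    // Word-initial means nothing has been kept before this character.
                    if (out == 0) {
                        s[out++] = RREH;
                    } else {
                        s[out++] = REH;
                    }
                    break;
                case RREH_ABOVE:
                    s[out++] = RREH;
                    break;
                default:
                    boolean removable = c == TATWEEL
                        || (c >= FATHATAN && c <= SUKUN)
                        || Character.getType(c) == Character.FORMAT;
                    if (!removable) {
                        s[out++] = c; // everything else is copied unchanged
                    }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        char[] buf = "\u0643\u064A\u0640".toCharArray(); // KAF, YEH, TATWEEL
        int n = normalize(buf, buf.length);
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < n; i++) {
            hex.append(String.format("%04X ", (int) buf[i]));
        }
        System.out.println(hex.toString().trim());
    }
}
```

The sketch treats the whole buffer as one word, so "word-initial" and "word-final" refer to the first and last kept positions of the buffer.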

  • Constructor Details

    • SoraniNormalizer

      public SoraniNormalizer()
  • Method Details

    • normalize

      public int normalize(char[] s, int len)
      Normalizes an input buffer of Sorani text in place.
      Parameters:
      s - input buffer; its contents are modified
      len - number of valid characters in the input buffer
      Returns:
      length of the text in the buffer after normalization
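
The calling convention can be sketched as follows: the buffer is rewritten in place, and only the first characters up to the returned length remain meaningful. The names NormalizeCallSketch, BufferNormalizer, apply, and stripTatweel are hypothetical and exist only for this illustration; stripTatweel is a toy stand-in so the example runs without Lucene on the classpath. With Lucene available, a method reference such as new SoraniNormalizer()::normalize should fit the same shape, since its signature matches.

```java
/** Illustrative sketch of how a normalize(char[], int) style method is called. */
final class NormalizeCallSketch {

    /** Hypothetical stand-in for any normalizer with the documented signature. */
    interface BufferNormalizer {
        int normalize(char[] s, int len);
    }

    /** Copies the text into a scratch buffer, normalizes it, and keeps the valid prefix. */
    static String apply(BufferNormalizer normalizer, String text) {
        char[] buffer = text.toCharArray();
        int newLen = normalizer.normalize(buffer, buffer.length);
        // Characters at index newLen and beyond are leftovers and must be ignored.
        return new String(buffer, 0, newLen);
    }

    /** Toy normalizer that only removes TATWEEL (0640); not the Sorani rules. */
    static int stripTatweel(char[] s, int len) {
        int out = 0;
        for (int i = 0; i < len; i++) {
            if (s[i] != '\u0640') {
                s[out++] = s[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String result = apply(NormalizeCallSketch::stripTatweel, "a\u0640b");
        System.out.println(result.length());
    }
}
```

Because the returned length can be smaller than len, callers should carry it forward rather than reuse the original length.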