org.apache.lucene.analysis.ckb
Class SoraniNormalizer
java.lang.Object
org.apache.lucene.analysis.ckb.SoraniNormalizer
public class SoraniNormalizer
extends Object
Normalizes the Unicode representation of Sorani text.
Normalization consists of:
- Alternate forms of 'y' (064A, 0649) are converted to 06CC (FARSI YEH)
- Alternate form of 'k' (0643) is converted to 06A9 (KEHEH)
- Alternate forms of vowel 'e' (0647+200C, word-final 0647, 0629) are converted to 06D5 (AE)
- Alternate (joining) form of 'h' (06BE) is converted to 0647
- Alternate forms of 'rr' (0692, word-initial 0631) are converted to 0695 (REH WITH SMALL V BELOW)
- Harakat, tatweel, and formatting characters such as directional controls are removed.
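The rules above can be sketched in plain Java as a single in-place pass over a character buffer. This is an illustrative approximation, not the Lucene source: the class name SoraniRulesSketch is hypothetical, "word-initial" and "word-final" are assumed to mean the first and last positions of the buffer (one token per buffer), and harakat are assumed to be the range 064B to 0652.

```java
class SoraniRulesSketch {
    static final char ZWNJ = '\u200C';

    /** Normalizes the first len chars of s in place and returns the new length. */
    static int normalize(char[] s, int len) {
        int out = 0; // write position; never ahead of the read position i
        for (int i = 0; i < len; i++) {
            char c = s[i];
            boolean first = (i == 0);
            boolean last = (i == len - 1);
            // Drop tatweel (0640), harakat (assumed 064B..0652) and format
            // characters such as ZWNJ and directional controls.
            if (c == '\u0640' || (c >= '\u064B' && c <= '\u0652')
                    || Character.getType(c) == Character.FORMAT) {
                continue;
            }
            switch (c) {
                case '\u064A': // YEH
                case '\u0649': // ALEF MAKSURA
                    c = '\u06CC'; // FARSI YEH
                    break;
                case '\u0643': // KAF
                    c = '\u06A9'; // KEHEH
                    break;
                case '\u0629': // TEH MARBUTA
                    c = '\u06D5'; // AE
                    break;
                case '\u06BE': // HEH DOACHASHMEE
                    c = '\u0647'; // HEH
                    break;
                case '\u0692': // REH WITH SMALL V
                    c = '\u0695'; // REH WITH SMALL V BELOW
                    break;
                case '\u0631': // REH, only when word-initial
                    if (first) c = '\u0695';
                    break;
                case '\u0647': // HEH, when word-final or followed by ZWNJ
                    if (last || s[i + 1] == ZWNJ) c = '\u06D5';
                    break;
                default:
                    break;
            }
            s[out++] = c;
        }
        return out;
    }

    /** Convenience wrapper for experimenting with String input. */
    static String normalize(String token) {
        char[] buf = token.toCharArray();
        return new String(buf, 0, normalize(buf, buf.length));
    }
}
```

Because the write position never overtakes the read position, removed characters simply shrink the result and no second buffer is needed.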
Method Summary
int normalize(char[] s, int len)
Normalize an input buffer of Sorani text
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
SoraniNormalizer
public SoraniNormalizer()
normalize
public int normalize(char[] s, int len)
- Normalize an input buffer of Sorani text
- Parameters:
s - input buffer
len - length of input buffer
- Returns:
- length of input buffer after normalization
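The calling convention documented here (modify the buffer in place, return the new length) can be illustrated with a small sketch. The names NormalizeCallSketch, BufferNormalizer, and stripTatweel are hypothetical stand-ins so the example runs without Lucene; with Lucene on the classpath, the method reference new SoraniNormalizer()::normalize should fit the same interface, since it has the signature shown above.

```java
class NormalizeCallSketch {
    /** Hypothetical interface mirroring the documented normalize(char[], int) signature. */
    interface BufferNormalizer {
        int normalize(char[] s, int len);
    }

    /** Copies the token into a buffer, normalizes it, and keeps only the valid prefix. */
    static String apply(BufferNormalizer normalizer, String token) {
        char[] buf = token.toCharArray();
        int newLen = normalizer.normalize(buf, buf.length);
        // Characters at index >= newLen are stale leftovers and must be ignored.
        return new String(buf, 0, newLen);
    }

    /** Stand-in normalizer for the demo: removes tatweel (0640) only. */
    static int stripTatweel(char[] s, int len) {
        int out = 0;
        for (int i = 0; i < len; i++) {
            if (s[i] != '\u0640') {
                s[out++] = s[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String result = apply(NormalizeCallSketch::stripTatweel, "\u0628\u0640\u0628");
        System.out.println(result.length()); // the tatweel has been removed
    }
}
```

The point of the sketch is that the returned length, not the array length, defines the normalized text.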
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.