Class | Description |
---|---|
CompoundWordTokenFilterBase | Base class for decomposition token filters. |
DictionaryCompoundWordTokenFilter | A TokenFilter that decomposes compound words found in many Germanic languages. |
DictionaryCompoundWordTokenFilterFactory | Factory for DictionaryCompoundWordTokenFilter. |
HyphenationCompoundWordTokenFilter | A TokenFilter that decomposes compound words found in many Germanic languages. |
HyphenationCompoundWordTokenFilterFactory | Factory for HyphenationCompoundWordTokenFilter. |
Input token stream |
---|
Rindfleischüberwachungsgesetz Drahtschere abba |
Output token stream |
---|
(Rindfleischüberwachungsgesetz,0,29) |
(Rind,0,4,posIncr=0) |
(fleisch,4,11,posIncr=0) |
(überwachung,11,22,posIncr=0) |
(gesetz,23,29,posIncr=0) |
(Drahtschere,30,41) |
(Draht,30,35,posIncr=0) |
(schere,35,41,posIncr=0) |
(abba,42,46) |
HyphenationCompoundWordTokenFilter uses hyphenation grammars to find
potential subwords that are worth checking against the dictionary. It can also
be used without a dictionary, but it then produces many "nonword" tokens.
The quality of the output tokens depends directly on the quality of the
grammar file you use. For languages such as German, the available grammar files are quite good.
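The idea of checking hyphenation-derived candidates against a dictionary can be sketched in plain Java. This is a minimal illustration and not Lucene's implementation: the class `HyphenationCandidateSketch`, the method `candidates`, and the hard-coded break points for "Drahtschere" (standing in for what a grammar file might yield) are all assumptions made for the example. Passing `null` as the dictionary shows why the dictionary-free mode emits many "nonword" tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; this is not Lucene's implementation.
public class HyphenationCandidateSketch {

    /**
     * Builds candidate subwords from every pair of break points and keeps
     * those found in the dictionary. A null dictionary keeps every candidate.
     * The dictionary is assumed to hold lowercase entries, and breakPoints is
     * assumed to be sorted and to include 0 and word.length().
     */
    public static List<String> candidates(String word, int[] breakPoints,
                                          Set<String> lowercaseDict,
                                          int minSubwordSize, int maxSubwordSize) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                int length = breakPoints[j] - breakPoints[i];
                if (length > maxSubwordSize) {
                    break; // later break points only make the candidate longer
                }
                if (length < minSubwordSize) {
                    continue;
                }
                String part = word.substring(breakPoints[i], breakPoints[j]);
                if (lowercaseDict == null
                        || lowercaseDict.contains(part.toLowerCase(Locale.ROOT))) {
                    result.add(part);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical break points for "Draht-sche-re", chosen for illustration.
        int[] points = {0, 5, 9, 11};
        Set<String> dict = Set.of("draht", "schere");

        // With a dictionary: [Draht, schere]
        System.out.println(candidates("Drahtschere", points, dict, 2, 15));
        // Without a dictionary: [Draht, Drahtsche, Drahtschere, sche, schere, re]
        System.out.println(candidates("Drahtschere", points, null, 2, 15));
    }
}
```

Only substrings that start and end on a break point are ever looked up, which is why this approach needs far fewer dictionary checks than trying every substring.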
DictionaryCompoundWordTokenFilter uses a dictionary-only approach to
find subwords in a compound word. It is much slower than the filter that
uses hyphenation grammars. Because its design is much simpler, it is a
good starting point for checking whether your dictionary is good enough.
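A dictionary-only decomposition can be pictured as an exhaustive substring scan. The following plain-Java sketch is not Lucene's code: the class `DictionaryDecompositionSketch`, the method `decompose`, and the size limits 5, 2 and 15 are assumptions chosen for illustration. Offsets are relative to the compound itself, not to the whole token stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; this is not Lucene's implementation.
public class DictionaryDecompositionSketch {

    /**
     * Returns every dictionary entry found inside the word, formatted as
     * "(text,start,end)" with offsets relative to the word (end exclusive).
     * The dictionary is assumed to hold lowercase entries.
     */
    public static List<String> decompose(String word, Set<String> lowercaseDict,
                                         int minWordSize, int minSubwordSize,
                                         int maxSubwordSize) {
        List<String> subwords = new ArrayList<>();
        if (word.length() < minWordSize) {
            return subwords; // too short to be treated as a compound
        }
        // Try every start position and every allowed length. This exhaustive
        // scan is what makes a dictionary-only approach comparatively slow.
        for (int start = 0; start + minSubwordSize <= word.length(); start++) {
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= word.length(); len++) {
                String part = word.substring(start, start + len);
                if (lowercaseDict.contains(part.toLowerCase(Locale.ROOT))) {
                    subwords.add("(" + part + "," + start + "," + (start + len) + ")");
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("draht", "schere");
        // Sizes 5, 2 and 15 are assumed values chosen for this sketch.
        System.out.println(decompose("Drahtschere", dict, 5, 2, 15)); // [(Draht,0,5), (schere,5,11)]
        System.out.println(decompose("abba", dict, 5, 2, 15));        // []
    }
}
```

Because every lookup is independent of any grammar, a missing or misspelled dictionary entry shows up directly as a missing subword, which makes this kind of scan convenient for checking a dictionary.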
Token filter | Output quality | Performance |
---|---|---|
HyphenationCompoundWordTokenFilter | good if grammar file is good – acceptable otherwise | fast |
DictionaryCompoundWordTokenFilter | good | slow |

    // German compounds: hyphenation grammar plus a dictionary.
    public void testHyphenationCompoundWordsDE() throws Exception {
      String[] dict = { "Rind", "Fleisch", "Draht", "Schere", "Gesetz",
          "Aufgabe", "Überwachung" };

      Reader reader = new FileReader("de_DR.xml");

      HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
          .getHyphenationTree(reader);

      HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter(
          new WhitespaceTokenizer(new StringReader(
              "Rindfleischüberwachungsgesetz Drahtschere abba")), hyphenator,
          dict, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
          CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE,
          CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false);

      CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
      while (tf.incrementToken()) {
        System.out.println(t);
      }
    }

    // German compounds: hyphenation grammar only, no dictionary.
    public void testHyphenationCompoundWordsWithoutDictionaryDE() throws Exception {
      Reader reader = new FileReader("de_DR.xml");

      HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
          .getHyphenationTree(reader);

      HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter(
          new WhitespaceTokenizer(new StringReader(
              "Rindfleischüberwachungsgesetz Drahtschere abba")), hyphenator);

      CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
      while (tf.incrementToken()) {
        System.out.println(t);
      }
    }

    // Swedish compounds: dictionary-only decomposition.
    public void testDumbCompoundWordsSE() throws Exception {
      String[] dict = { "Bil", "Dörr", "Motor", "Tak", "Borr", "Slag", "Hammar",
          "Pelar", "Glas", "Ögon", "Fodral", "Bas", "Fiol", "Makare", "Gesäll",
          "Sko", "Vind", "Rute", "Torkare", "Blad" };

      DictionaryCompoundWordTokenFilter tf = new DictionaryCompoundWordTokenFilter(
          new WhitespaceTokenizer(
              new StringReader(
                  "Bildörr Bilmotor Biltak Slagborr Hammarborr Pelarborr Glasögonfodral Basfiolsfodral Basfiolsfodralmakaregesäll Skomakare Vindrutetorkare Vindrutetorkarblad abba")),
          dict);

      CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
      while (tf.incrementToken()) {
        System.out.println(t);
      }
    }

Copyright © 2000-2024 Apache Software Foundation. All Rights Reserved.