org.apache.lucene.index.memory
Class SynonymMap

java.lang.Object
  extended by org.apache.lucene.index.memory.SynonymMap

public class SynonymMap
extends Object

Loads the WordNet prolog file wn_s.pl into a thread-safe main-memory hash map that can be used for fast high-frequency lookups of synonyms for any given (lowercase) word string.

The synonym relation is symmetric: if B is a synonym for A (A -> B), then A is also a synonym for B (B -> A). It is not necessarily transitive: A -> B and B -> C do not imply A -> C.
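The two properties above can be illustrated with a small stand-alone sketch. It does not use SynonymMap or its internals; the class SynonymRelationSketch and its methods are hypothetical names chosen for illustration, and the word pairs are taken from the example output further down this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a synonym table; not the SynonymMap implementation.
class SynonymRelationSketch {
    private final Map<String, SortedSet<String>> table = new HashMap<>();

    // Records the pair in both directions, so A -> B implies B -> A.
    public void addPair(String a, String b) {
        table.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        table.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    // Returns only the direct synonyms, sorted ascending.
    // No transitive closure is computed.
    public List<String> synonyms(String word) {
        SortedSet<String> direct = table.get(word);
        return direct == null ? new ArrayList<String>() : new ArrayList<String>(direct);
    }

    public static void main(String[] args) {
        SynonymRelationSketch s = new SynonymRelationSketch();
        s.addPair("woods", "forest");
        s.addPair("forest", "timber");
        System.out.println(s.synonyms("woods"));  // [forest]
        System.out.println(s.synonyms("forest")); // [timber, woods]
        System.out.println(s.synonyms("timber")); // [forest]
    }
}
```

Here "woods" and "timber" are both synonyms of "forest" but not of each other, which matches the "woods" and "forest" entries in the example output.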

Loading typically takes about 1.5 seconds, so it should be done only once per (server) program execution, for example using a singleton pattern. Once loaded, a synonym lookup via getSynonyms(String) takes constant time O(1). A loaded default synonym map consumes about 10 MB of main memory. An instance is immutable, hence thread-safe.
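One common way to load a value exactly once is the initialization-on-demand holder idiom, sketched below. This is only an illustration under stated assumptions: SynonymsHolderSketch, LOADS and load() are hypothetical names, and load() returns a small stand-in map rather than a real SynonymMap so that the sketch runs without Lucene or wn_s.pl on the classpath.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

final class SynonymsHolderSketch {
    // Counts how often the (stand-in) expensive load actually runs.
    static final AtomicInteger LOADS = new AtomicInteger();

    private SynonymsHolderSketch() {}

    // The JVM initializes this nested class lazily, exactly once, and in a
    // thread-safe way, the first time get() touches Holder.INSTANCE.
    private static final class Holder {
        static final Map<String, String> INSTANCE = load();
    }

    // Hypothetical stand-in for the expensive step, which in real code would
    // be something like: new SynonymMap(new FileInputStream(...)).
    private static Map<String, String> load() {
        LOADS.incrementAndGet();
        Map<String, String> m = new HashMap<>();
        m.put("woods", "forest");
        return Collections.unmodifiableMap(m);
    }

    static Map<String, String> get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        Map<String, String> first = get();
        Map<String, String> second = get();
        System.out.println(first == second); // true
        System.out.println(LOADS.get());     // 1
    }
}
```

Because the SynonymMap constructor declares IOException, a real load() would need to catch it and rethrow an unchecked exception, since a static initializer cannot throw checked exceptions.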

This implementation borrows some ideas from the Lucene Syns2Index demo that Dave Spencer originally contributed to Lucene. Dave's approach involved a persistent Lucene index, which is suitable for occasional lookups or very large synonym tables but is considered unsuitable for high-frequency lookups of medium-size synonym tables.

Example Usage:

 String[] words = { "hard", "woods", "forest", "wolfish", "xxxx" };
 // The constructor reads the stream and closes it.
 SynonymMap map = new SynonymMap(new FileInputStream("samples/fulltext/wn_s.pl"));
 for (String word : words) {
     String[] synonyms = map.getSynonyms(word);
     System.out.println(word + ":" + java.util.Arrays.asList(synonyms));
 }
 
 Example output:
 hard:[arduous, backbreaking, difficult, fermented, firmly, grueling, gruelling, heavily, heavy, intemperately, knockout, laborious, punishing, severe, severely, strong, toilsome, tough]
 woods:[forest, wood]
 forest:[afforest, timber, timberland, wood, woodland, woods]
 wolfish:[edacious, esurient, rapacious, ravening, ravenous, voracious, wolflike]
 xxxx:[]
 

See Also:
prologdb man page, Dave's synonym demo site

Constructor Summary
SynonymMap(InputStream input)
          Constructs an instance, loading WordNet synonym data from the given input stream.
 
Method Summary
protected  String analyze(String word)
          Analyzes/transforms the given word on input stream loading.
 String[] getSynonyms(String word)
          Returns the synonym set for the given word, sorted ascending.
 String toString()
          Returns a String representation of the index data for debugging purposes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

SynonymMap

public SynonymMap(InputStream input)
           throws IOException
Constructs an instance, loading WordNet synonym data from the given input stream. The stream is closed when loading finishes. The words in the stream must be in UTF-8 or a compatible subset (for example ASCII, MacRoman, etc.).

Parameters:
input - the stream to read from (null indicates an empty synonym map)
Throws:
IOException - if an error occurred while reading the stream.
Method Detail

getSynonyms

public String[] getSynonyms(String word)
Returns the synonym set for the given word, sorted ascending.

Parameters:
word - the word to look up (must be in lowercase).
Returns:
the synonyms; a set of zero or more words, sorted ascending, each word containing lowercase characters that satisfy Character.isLetter().
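
Since the word passed in must already be lowercase, callers that handle mixed-case input may want to normalize it first. The sketch below is a hedged illustration: LookupSketch and synonymsIgnoringCase are hypothetical names, the Function parameter stands in for a method reference such as map::getSynonyms, and the use of Locale.ROOT is an assumption made here to keep lowercasing independent of the default locale.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.function.Function;

class LookupSketch {
    // 'lookup' stands in for a call such as map::getSynonyms.
    static String[] synonymsIgnoringCase(Function<String, String[]> lookup, String word) {
        // Lowercase before the lookup, because the table holds lowercase keys.
        return lookup.apply(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Hypothetical stand-in table with a single lowercase entry.
        Function<String, String[]> lookup = w ->
                w.equals("woods") ? new String[] { "forest", "wood" } : new String[0];

        System.out.println(Arrays.toString(synonymsIgnoringCase(lookup, "Woods"))); // [forest, wood]
        System.out.println(Arrays.toString(lookup.apply("Woods")));                 // []
    }
}
```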

toString

public String toString()
Returns a String representation of the index data for debugging purposes.

Overrides:
toString in class Object
Returns:
a String representation

analyze

protected String analyze(String word)
Analyzes/transforms the given word on input stream loading. This default implementation simply lowercases the word. Override this method with a custom stemming algorithm or similar, if desired.

Parameters:
word - the word to analyze
Returns:
the same word, or a different word (or null to indicate that the word should be ignored)
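
The contract above (return the word, a transformed word, or null to skip it) can be sketched as a stand-alone method. AnalyzeSketch is a hypothetical class, not a SynonymMap subclass, and the plural stripping is a deliberately naive toy rather than a real stemming algorithm; the length thresholds are arbitrary choices for illustration.

```java
import java.util.Locale;

class AnalyzeSketch {
    // Mirrors the shape of analyze(String): returns the transformed word,
    // or null to signal that the word should be ignored.
    static String analyze(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (lower.length() < 2) {
            return null; // ignore empty and one-letter entries (arbitrary choice)
        }
        // Toy plural stripping, NOT a real stemmer: "woods" -> "wood".
        if (lower.length() > 3 && lower.endsWith("s") && !lower.endsWith("ss")) {
            return lower.substring(0, lower.length() - 1);
        }
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Woods"));  // wood
        System.out.println(analyze("Glass"));  // glass
        System.out.println(analyze("a"));      // null
    }
}
```

If an override transforms words during loading in this way, lookups would presumably need to apply the same transformation to the query word before calling getSynonyms(String).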


Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.