Class WordlistLoader

java.lang.Object
org.apache.lucene.analysis.WordlistLoader

public class WordlistLoader extends Object
Loader for text files that represent a list of stopwords.
NOTE: This API is for internal purposes only and might change in incompatible ways in the next release.
  • Method Details

    • getWordSet

      public static CharArraySet getWordSet(Reader reader, CharArraySet result) throws IOException
      Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
      Parameters:
      reader - Reader containing the wordlist
      result - the CharArraySet to fill with the reader's words
      Returns:
      the given CharArraySet with the reader's words
      Throws:
      IOException
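      The reading loop described above can be sketched with JDK classes only. This is a minimal illustration, not Lucene's implementation: the class WordSetSketch and the method readWords are hypothetical names, a plain Set<String> stands in for CharArraySet, and closing the reader when done is an assumption of the sketch. Blank lines get no special treatment here because the description does not mention any.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the documented behavior; not Lucene code.
final class WordSetSketch {

    // Adds one trimmed entry per line to the caller's set and returns that same set.
    static Set<String> readWords(Reader reader, Set<String> result) throws IOException {
        // Assumption: the sketch closes the reader once all lines are consumed.
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                result.add(line.trim()); // drop leading and trailing whitespace
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Set<String> words = readWords(new StringReader("  the \nand\nof\t\n"), new HashSet<>());
        System.out.println(words);
    }
}
```

      Passing the target set in lets a caller merge several word lists into one collection.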
    • getWordSet

      public static CharArraySet getWordSet(Reader reader) throws IOException
      Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
      Parameters:
      reader - Reader containing the wordlist
      Returns:
      An unmodifiable CharArraySet with the reader's words
      Throws:
      IOException
    • getWordSet

      public static CharArraySet getWordSet(InputStream stream) throws IOException
      Reads lines from an InputStream with UTF-8 charset and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the stream should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
      Parameters:
      stream - InputStream containing the wordlist
      Returns:
      An unmodifiable CharArraySet with the stream's words
      Throws:
      IOException
    • getWordSet

      public static CharArraySet getWordSet(InputStream stream, Charset charset) throws IOException
      Reads lines from an InputStream with the given charset and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the stream should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
      Parameters:
      stream - InputStream containing the wordlist
      charset - Charset of the wordlist
      Returns:
      An unmodifiable CharArraySet with the stream's words
      Throws:
      IOException
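      The two InputStream overloads differ only in how bytes are decoded. The sketch below shows one way that relationship could look using JDK classes: the UTF-8 variant delegates to the charset variant, and the result is wrapped so it cannot be modified. StreamWordSetSketch and readWords are hypothetical names, and Set<String> is an assumed stand-in for CharArraySet.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the documented behavior; not Lucene code.
final class StreamWordSetSketch {

    // UTF-8 convenience variant: delegates to the charset-aware variant.
    static Set<String> readWords(InputStream stream) throws IOException {
        return readWords(stream, StandardCharsets.UTF_8);
    }

    // Decodes the stream with the given charset, one trimmed word per line.
    static Set<String> readWords(InputStream stream, Charset charset) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(stream, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                words.add(line.trim());
            }
        }
        // Callers receive a read-only view, mirroring the "unmodifiable" return value.
        return Collections.unmodifiableSet(words);
    }
}
```

      Decoding with the wrong charset silently produces different strings for non-ASCII words, so the charset passed in should match the file's actual encoding.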
    • getWordSet

      public static CharArraySet getWordSet(Reader reader, String comment, CharArraySet result) throws IOException
      Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
      Parameters:
      reader - Reader containing the wordlist
      comment - The string representing a comment.
      result - the CharArraySet to fill with the reader's words
      Returns:
      the given CharArraySet with the reader's words
      Throws:
      IOException
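      A minimal sketch of the comment-skipping variant follows. It assumes that a comment line is one that begins with the comment string, which the description above does not spell out. CommentWordSetSketch and readWords are hypothetical names, and Set<String> stands in for CharArraySet.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Set;

// Hypothetical stand-in for the documented behavior; not Lucene code.
final class CommentWordSetSketch {

    // Adds every non-comment line, trimmed, to the caller's set and returns that set.
    static Set<String> readWords(Reader reader, String comment, Set<String> result)
            throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Assumption: a line is a comment when it starts with the comment string.
                if (!line.startsWith(comment)) {
                    result.add(line.trim());
                }
            }
        }
        return result;
    }
}
```

      With the comment string "#", an input of "# stopwords", "the", " and " and "#of" would leave only "the" and "and" in the set.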
    • getWordSet

      public static CharArraySet getWordSet(Reader reader, String comment) throws IOException
      Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
      Parameters:
      reader - Reader containing the wordlist
      comment - The string representing a comment.
      Returns:
      An unmodifiable CharArraySet with the reader's words
      Throws:
      IOException
    • getWordSet

      public static CharArraySet getWordSet(InputStream stream, String comment) throws IOException
      Reads lines from an InputStream with UTF-8 charset and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the stream should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
      Parameters:
      stream - InputStream in UTF-8 encoding containing the wordlist
      comment - The string representing a comment.
      Returns:
      An unmodifiable CharArraySet with the stream's words
      Throws:
      IOException
    • getWordSet

      public static CharArraySet getWordSet(InputStream stream, Charset charset, String comment) throws IOException
      Reads lines from an InputStream with the given charset and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the stream should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
      Parameters:
      stream - InputStream containing the wordlist
      charset - Charset of the wordlist
      comment - The string representing a comment.
      Returns:
      An unmodifiable CharArraySet with the stream's words
      Throws:
      IOException
    • getSnowballWordSet

      public static CharArraySet getSnowballWordSet(Reader reader, CharArraySet result) throws IOException
      Reads stopwords from a stopword list in Snowball format.

      The Snowball format is as follows:

      • Lines may contain multiple words separated by whitespace.
      • The comment character is the vertical line (|).
      • Lines may contain trailing comments.
      Parameters:
      reader - Reader containing a Snowball stopword list
      result - the CharArraySet to fill with the reader's words
      Returns:
      the given CharArraySet with the reader's words
      Throws:
      IOException
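      The three format rules above can be sketched as a small parser: cut each line at the first vertical line, then split what remains on whitespace. This is an illustration built on JDK classes, not Lucene's implementation. SnowballListSketch and readWords are hypothetical names, and Set<String> stands in for CharArraySet.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Set;

// Hypothetical stand-in for the documented format rules; not Lucene code.
final class SnowballListSketch {

    // Adds every word found outside comments to the caller's set and returns that set.
    static Set<String> readWords(Reader reader, Set<String> result) throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Everything from the first '|' to the end of the line is a comment.
                int bar = line.indexOf('|');
                if (bar >= 0) {
                    line = line.substring(0, bar);
                }
                // A line may carry several words separated by whitespace.
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        result.add(word);
                    }
                }
            }
        }
        return result;
    }
}
```

      Under these rules a line such as "i me my | pronouns" contributes the three words "i", "me" and "my", while a line holding only a comment contributes nothing.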
    • getSnowballWordSet

      public static CharArraySet getSnowballWordSet(Reader reader) throws IOException
      Reads stopwords from a stopword list in Snowball format.

      The Snowball format is as follows:

      • Lines may contain multiple words separated by whitespace.
      • The comment character is the vertical line (|).
      • Lines may contain trailing comments.
      Parameters:
      reader - Reader containing a Snowball stopword list
      Returns:
      An unmodifiable CharArraySet with the reader's words
      Throws:
      IOException
    • getSnowballWordSet

      public static CharArraySet getSnowballWordSet(InputStream stream) throws IOException
      Reads stopwords from a stopword list in Snowball format.

      The Snowball format is as follows:

      • Lines may contain multiple words separated by whitespace.
      • The comment character is the vertical line (|).
      • Lines may contain trailing comments.
      Parameters:
      stream - InputStream in UTF-8 encoding containing a Snowball stopword list
      Returns:
      An unmodifiable CharArraySet with the stream's words
      Throws:
      IOException
    • getSnowballWordSet

      public static CharArraySet getSnowballWordSet(InputStream stream, Charset charset) throws IOException
      Reads stopwords from a stopword list in Snowball format.

      The Snowball format is as follows:

      • Lines may contain multiple words separated by whitespace.
      • The comment character is the vertical line (|).
      • Lines may contain trailing comments.
      Parameters:
      stream - InputStream containing a Snowball stopword list
      charset - Charset of the stopword list
      Returns:
      An unmodifiable CharArraySet with the stream's words
      Throws:
      IOException
    • getStemDict

      public static CharArrayMap<String> getStemDict(Reader reader, CharArrayMap<String> result) throws IOException
      Reads a stem dictionary. Each line contains:
      word\tstem
      (i.e. two tab-separated words)
      Returns:
      stem dictionary that overrules the stemming algorithm
      Throws:
      IOException - If there is a low-level I/O error.
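      The word\tstem line format can be parsed as sketched below. StemDictSketch and readStemDict are hypothetical names, and a Map<String, String> stands in for CharArrayMap<String>. Skipping lines that contain no tab is a choice made for this sketch; the description above does not say how malformed lines are handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Map;

// Hypothetical stand-in for the documented line format; not Lucene code.
final class StemDictSketch {

    // Fills the caller's map with word -> stem pairs and returns that map.
    static Map<String, String> readStemDict(Reader reader, Map<String, String> result)
            throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Split at the first tab only: the left part is the word, the right the stem.
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) { // sketch choice: ignore lines without a tab
                    result.put(parts[0], parts[1]);
                }
            }
        }
        return result;
    }
}
```

      A dictionary built this way maps, for example, "mice" to "mouse" when the input contains the line "mice\tmouse".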
    • getLines

      public static List<String> getLines(InputStream stream, Charset charset) throws IOException
      Reads the given stream using the given character encoding and returns the (non-comment) lines containing data.

      A comment line is any line that starts with the character "#".

      Returns:
      a list of non-blank non-comment lines with whitespace trimmed
      Throws:
      IOException - If there is a low-level I/O error.
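      The filtering described for getLines can be sketched as follows. LinesSketch and readLines are hypothetical names. Testing for "#" before trimming is an assumption that follows the wording "starts with" literally.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the documented behavior; not Lucene code.
final class LinesSketch {

    // Returns the trimmed lines that are neither blank nor comments, in input order.
    static List<String> readLines(InputStream stream, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(stream, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Assumption: the '#' test applies to the line as read, before trimming.
                if (line.startsWith("#")) {
                    continue;
                }
                line = line.trim();
                if (!line.isEmpty()) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }
}
```

      Unlike the word-set methods, this returns a List, so order and duplicates in the input are preserved.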