org.apache.lucene.analysis.miscellaneous
Class HyphenatedWordsFilter

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.TokenFilter
              extended by org.apache.lucene.analysis.miscellaneous.HyphenatedWordsFilter
All Implemented Interfaces:
Closeable

public final class HyphenatedWordsFilter
extends TokenFilter

When the plain text is extracted from documents, we will often have many words hyphenated and broken into two lines. This is often the case with documents where narrow text columns are used, such as newsletters. In order to increase search efficiency, this filter puts hyphenated words broken into two lines back together. This filter should be used on indexing time only. Example field definition in schema.xml:

 <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"/>
      <filter class="solr.HyphenatedWordsFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
 </fieldtype>
 


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
 
Constructor Summary
HyphenatedWordsFilter(TokenStream in)
          Creates a new HyphenatedWordsFilter
 
Method Summary
 boolean incrementToken()
          
 void reset()
          
 
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HyphenatedWordsFilter

public HyphenatedWordsFilter(TokenStream in)
Creates a new HyphenatedWordsFilter

Parameters:
in - TokenStream that will be filtered
Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException

Specified by:
incrementToken in class TokenStream
Throws:
IOException

reset

public void reset()
           throws IOException

Overrides:
reset in class TokenFilter
Throws:
IOException


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.