org.apache.lucene.analysis.pattern
Class PatternTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.lucene.analysis.pattern.PatternTokenizer
All Implemented Interfaces:
Closeable

public final class PatternTokenizer
extends Tokenizer

This tokenizer uses regular-expression matching to break the input stream into distinct tokens. It takes two arguments: "pattern", the regular expression to apply, and "group", which controls how matches are turned into tokens.

group=-1 (the default) is equivalent to "split": the pattern is treated as a delimiter, and the tokens are the same as the output of String.split(java.lang.String), except that empty tokens are omitted.
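
As a rough illustration of split mode, the following plain-JDK sketch mimics the documented behaviour using Pattern.split plus a filter for empty strings. It does not call PatternTokenizer itself; the class name SplitModeSketch and the helper splitTokens are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Lucene.
class SplitModeSketch {

    // Approximates group = -1: split the input on the pattern,
    // then drop any empty pieces.
    static List<String> splitTokens(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : pattern.split(input)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern comma = Pattern.compile(",");
        // The doubled and trailing commas would produce empty pieces,
        // which are filtered out.
        System.out.println(splitTokens(comma, "a,,b,c,")); // prints [a, b, c]
    }
}
```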

Using group >= 0 selects the matching group as the token. For example, if you have:

  pattern = \'([^\']+)\'
  group = 0
  input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but group=1, the output will be bbb and ccc (without the ' marks).
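
The group selection in this example can be reproduced with java.util.regex alone. The sketch below is an approximation of the described behaviour, not the tokenizer's actual implementation; GroupModeSketch and groupTokens are hypothetical names. The pattern is written as '([^']+)', which matches the same text as the escaped form shown in the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Lucene.
class GroupModeSketch {

    // Approximates group >= 0: emit the chosen capture group of every
    // match, skipping groups that did not participate or are empty.
    static List<String> groupTokens(Pattern pattern, int group, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            String token = matcher.group(group);
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern quoted = Pattern.compile("'([^']+)'");
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(groupTokens(quoted, 0, input)); // prints ['bbb', 'ccc']
        System.out.println(groupTokens(quoted, 1, input)); // prints [bbb, ccc]
    }
}
```

Group 0 is the whole match, so the quote marks are kept; group 1 is the first parenthesised sub-expression, so they are dropped.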

NOTE: This Tokenizer never outputs zero-length tokens.

See Also:
Pattern

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
PatternTokenizer(AttributeSource.AttributeFactory factory, Reader input, Pattern pattern, int group)
          Creates a new PatternTokenizer that uses the given AttributeFactory and returns tokens from the given group (-1 for split functionality).
PatternTokenizer(Reader input, Pattern pattern, int group)
          Creates a new PatternTokenizer that returns tokens from the given group (-1 for split functionality).
 
Method Summary
 void end()
           
 boolean incrementToken()
           
 void reset()
           
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

PatternTokenizer

public PatternTokenizer(Reader input,
                        Pattern pattern,
                        int group)
Creates a new PatternTokenizer that returns tokens from the given group (-1 for split functionality).


PatternTokenizer

public PatternTokenizer(AttributeSource.AttributeFactory factory,
                        Reader input,
                        Pattern pattern,
                        int group)
Creates a new PatternTokenizer that uses the given AttributeFactory and returns tokens from the given group (-1 for split functionality).

Method Detail

incrementToken

public boolean incrementToken()
Specified by:
incrementToken in class TokenStream

end

public void end()
         throws IOException
Overrides:
end in class TokenStream
Throws:
IOException

reset

public void reset()
           throws IOException
Overrides:
reset in class Tokenizer
Throws:
IOException


Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.