org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizer

All Implemented Interfaces: Closeable, AutoCloseable

public final class SimplePatternSplitTokenizer extends Tokenizer

This tokenizer uses a Lucene RegExp or (expert usage) a pre-built determinized Automaton to locate tokens. The regexp syntax is more limited than that of PatternTokenizer, but the tokenization is quite a bit faster. It works just like SimplePatternTokenizer, except that the pattern should match the token separators rather than the tokens themselves, as with String.split. Empty string tokens are never produced.
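The split behavior described above (the pattern matches separators, and empty tokens are dropped) can be illustrated with a minimal JDK-only sketch. It uses java.util.regex purely as a stand-in: this is not the Lucene RegExp syntax and not how the tokenizer is implemented, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitSemanticsSketch {
    /** Splits text on a separator regex and discards empty pieces. */
    public static List<String> splitDroppingEmpty(String text, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : Pattern.compile(separatorRegex).split(text)) {
            // A separator at the very start yields a leading empty piece; skip it.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The leading comma and the doubled comma produce no empty tokens.
        System.out.println(splitDroppingEmpty(",alpha, beta,,gamma", "[ ,]+"));
        // prints [alpha, beta, gamma]
    }
}
```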

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| SimplePatternSplitTokenizer(String regexp) | See RegExp for the accepted syntax. |
| SimplePatternSplitTokenizer(AttributeFactory factory, String regexp, int determinizeWorkLimit) | See RegExp for the accepted syntax. |
| SimplePatternSplitTokenizer(AttributeFactory factory, Automaton dfa) | Runs a pre-built automaton. |
| SimplePatternSplitTokenizer(Automaton dfa) | Runs a pre-built automaton. |

Method Summary

| Modifier and Type | Method |
| --- | --- |
| void | end() |
| boolean | incrementToken() |
| void | reset() |

Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Constructor Details
- SimplePatternSplitTokenizer
  
  public SimplePatternSplitTokenizer(String regexp)
  
  See RegExp for the accepted syntax.
- SimplePatternSplitTokenizer
  
  public SimplePatternSplitTokenizer(Automaton dfa)
  
  Runs a pre-built automaton.
- SimplePatternSplitTokenizer
  
  public SimplePatternSplitTokenizer(AttributeFactory factory, String regexp, int determinizeWorkLimit)
  
  See RegExp for the accepted syntax.
- SimplePatternSplitTokenizer
  
  public SimplePatternSplitTokenizer(AttributeFactory factory, Automaton dfa)
  
  Runs a pre-built automaton.
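
For the Automaton constructors, the class description calls for a determinized automaton. The following is a minimal sketch of one way to build such an automaton from a separator pattern. It assumes a Lucene version in which Operations.determinize(Automaton, int) and Operations.DEFAULT_DETERMINIZE_WORK_LIMIT are available, and the class name PrebuiltAutomatonSketch is hypothetical.

```java
import org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizer;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

public class PrebuiltAutomatonSketch {
    /** Builds a tokenizer that splits on runs of whitespace. */
    public static SimplePatternSplitTokenizer whitespaceSplitter() {
        // Separator pattern in Lucene RegExp syntax: one or more spaces, tabs or line breaks.
        Automaton automaton = new RegExp("[ \t\r\n]+").toAutomaton();
        // Determinize explicitly, since toAutomaton() may not return a deterministic automaton.
        Automaton dfa =
            Operations.determinize(automaton, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
        return new SimplePatternSplitTokenizer(dfa);
    }
}
```

Building the automaton once and reusing it may be worthwhile when many tokenizer instances share the same pattern.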
Method Details
- incrementToken
  
  public boolean incrementToken() throws IOException
  
  Specified by:
  
  incrementToken in class TokenStream
  
  Throws:
  
  IOException
- end
  
  public void end() throws IOException
  
  Overrides:
  
  end in class TokenStream
  
  Throws:
  
  IOException
- reset
  
  public void reset() throws IOException
  
  Overrides:
  
  reset in class Tokenizer
  
  Throws:
  
  IOException
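
The three methods above fit into the usual TokenStream consumer workflow: set a reader, call reset(), loop on incrementToken(), then call end() and close(). A minimal sketch follows, assuming the Lucene analysis module is on the classpath; the class name SplitTokenizerUsageSketch, the helper method, and the sample input are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SplitTokenizerUsageSketch {
    /** Collects the tokens found between matches of the separator pattern. */
    public static List<String> tokenize(String text, String separatorRegexp) throws IOException {
        List<String> tokens = new ArrayList<>();
        // try-with-resources calls close() once the stream has been consumed.
        try (SimplePatternSplitTokenizer tokenizer =
                 new SimplePatternSplitTokenizer(separatorRegexp)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();                  // must be called before the first incrementToken()
            while (tokenizer.incrementToken()) {
                tokens.add(term.toString());    // copy the term; the attribute is reused per token
            }
            tokenizer.end();                    // finalizes end-of-stream state such as offsets
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Splits on runs of spaces and commas.
        System.out.println(tokenize("foo, bar,baz", "[ ,]+"));
    }
}
```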
