org.apache.lucene.analysis.util.SegmentingTokenizerBase

All Implemented Interfaces: Closeable, AutoCloseable

Direct Known Subclasses: ThaiTokenizer

public abstract class SegmentingTokenizerBase extends Tokenizer

Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.

This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
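The division of labour described above (the base class finds sentences, the subclass decomposes each one into words) can be sketched with only `java.text.BreakIterator`. This is an illustrative stand-in, not Lucene's implementation: `SentenceDriverSketch`, `WordSplitterSketch`, `drain()` and `currentWord()` are hypothetical names, and only `setNextSentence` and `incrementWord` mirror the method names on this page. A real subclass would populate token attributes rather than return strings.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the base class: owns the buffer and drives the two callbacks. */
abstract class SentenceDriverSketch {
  protected final char[] buffer;
  private final BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);

  SentenceDriverSketch(String text) {
    this.buffer = text.toCharArray();
    sentences.setText(text);
  }

  /** Called once per sentence with buffer-relative bounds. */
  protected abstract void setNextSentence(int sentenceStart, int sentenceEnd);

  /** Called repeatedly until it returns false, then the next sentence is supplied. */
  protected abstract boolean incrementWord();

  /** Hypothetical accessor; stands in for the attributes a real tokenizer would fill. */
  protected abstract String currentWord();

  final List<String> drain() {
    List<String> out = new ArrayList<>();
    int start = sentences.first();
    for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
      setNextSentence(start, end);
      while (incrementWord()) {
        out.add(currentWord());
      }
    }
    return out;
  }
}

/** Hypothetical subclass: splits each sentence into words with a second BreakIterator. */
final class WordSplitterSketch extends SentenceDriverSketch {
  private final BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
  private int sentenceStart;
  private int wordStart; // relative to sentenceStart
  private int wordEnd;   // relative to sentenceStart

  WordSplitterSketch(String text) {
    super(text);
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    this.sentenceStart = sentenceStart;
    words.setText(new String(buffer, sentenceStart, sentenceEnd - sentenceStart));
    words.first();
  }

  @Override
  protected boolean incrementWord() {
    int start = words.current();
    int end = words.next();
    // Skip segments that do not start with a letter or digit (spaces, punctuation).
    while (end != BreakIterator.DONE && !Character.isLetterOrDigit(buffer[sentenceStart + start])) {
      start = end;
      end = words.next();
    }
    if (end == BreakIterator.DONE) {
      return false; // sentence exhausted
    }
    wordStart = start;
    wordEnd = end;
    return true;
  }

  @Override
  protected String currentWord() {
    return new String(buffer, sentenceStart + wordStart, wordEnd - wordStart);
  }
}
```

Calling `new WordSplitterSketch("Hi there. Go.").drain()` walks two sentences and collects the words of each in turn. Because the subclass sees one whole sentence at a time, it could apply sentence-level logic (for example, emitting a marker token at the end of each sentence) before moving on.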

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary

Fields

Modifier and Type

Field

Description

protected final char[]

buffer

protected static final int

BUFFERMAX

protected int

offset

Accumulated offset of the previous buffers for this reader, used when computing offsets for offsetAtt.

Fields inherited from class org.apache.lucene.analysis.Tokenizer
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary

Constructors

Constructor

Description

SegmentingTokenizerBase(BreakIterator iterator)

Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator)

Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.
Method Summary

Modifier and Type

Method

Description

final void

end()

final boolean

incrementToken()

protected abstract boolean

incrementWord()

Returns true if another word is available.

protected boolean

isSafeEnd(char ch)

Returns true if ch marks an unambiguous break position for sentence tokenization.

void

reset()

protected abstract void

setNextSentence(int sentenceStart, int sentenceEnd)

Provides the next input sentence for analysis.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Field Details
- BUFFERMAX
  
  protected static final int BUFFERMAX
  See Also:
  
  Constant Field Values
- buffer
  
  protected final char[] buffer
- offset
  
  protected int offset
  
  Accumulated offset of the previous buffers for this reader, used when computing offsets for offsetAtt.
Constructor Details
- SegmentingTokenizerBase
  
  public SegmentingTokenizerBase(BreakIterator iterator)
  
  Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.
  Never share a BreakIterator across different TokenStreams. Always pass a newly created or cloned instance to this constructor.
- SegmentingTokenizerBase
  
  public SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator)
  
  Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.
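The constructor note about not sharing BreakIterators follows from the fact that a `java.text.BreakIterator` holds mutable state (its text and current position). One way to honour it, sketched here with hypothetical names (`BreakIteratorPerStreamSketch`, `freshSentenceIterator`), is to keep a prototype that is never iterated and hand each tokenizer its own clone.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BreakIteratorPerStreamSketch {
  // Prototype that is never given text or advanced; it exists only to be cloned.
  private static final BreakIterator PROTOTYPE = BreakIterator.getSentenceInstance(Locale.ROOT);

  /** Returns an independent iterator suitable for passing to a single tokenizer instance. */
  static BreakIterator freshSentenceIterator() {
    return (BreakIterator) PROTOTYPE.clone();
  }
}
```

Each clone tracks its own text and position, so advancing one does not disturb another. A subclass constructor could then pass `freshSentenceIterator()` to `super(...)`.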
Method Details
- incrementToken
  
  public final boolean incrementToken() throws IOException
  
  Specified by:
  
  incrementToken in class TokenStream
  
  Throws:
  
  IOException
- reset
  
  public void reset() throws IOException
  
  Overrides:
  
  reset in class Tokenizer
  
  Throws:
  
  IOException
- end
  
  public final void end() throws IOException
  
  Overrides:
  
  end in class TokenStream
  
  Throws:
  
  IOException
- isSafeEnd
  
  protected boolean isSafeEnd(char ch)
  
  Returns true if ch marks an unambiguous break position for sentence tokenization.
- setNextSentence
  
  protected abstract void setNextSentence(int sentenceStart, int sentenceEnd)
  
  Provides the next input sentence for analysis.
- incrementWord
  
  protected abstract boolean incrementWord()
  
  Returns true if another word is available.
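For `isSafeEnd`, the page does not list which characters count as unambiguous break positions. The sketch below assumes they are line and paragraph separators, and assumes the predicate is used to pick a cut point when a buffer fills so that no sentence is split across two buffers. Both points are assumptions to check against the Lucene source, and `SafeEndSketch` and `findSafeEnd` are hypothetical names.

```java
final class SafeEndSketch {
  /** Assumed set of unambiguous break characters: CR, LF, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR. */
  static boolean isSafeEnd(char ch) {
    switch (ch) {
      case 0x000D:
      case 0x000A:
      case 0x0085:
      case 0x2028:
      case 0x2029:
        return true;
      default:
        return false;
    }
  }

  /**
   * Scans backward through the first {@code length} chars of {@code buf} and returns the index
   * just past the last safe end, or -1 if the buffer contains none.
   */
  static int findSafeEnd(char[] buf, int length) {
    for (int i = length - 1; i >= 0; i--) {
      if (isSafeEnd(buf[i])) {
        return i + 1;
      }
    }
    return -1;
  }
}
```

A subclass that overrides `isSafeEnd` would change where such a cut may fall, for example to accept additional characters that never occur mid-sentence in its target language.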
