public abstract class SegmentingTokenizerBase extends Tokenizer
Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.
Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
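The two-level segmentation described above can be illustrated with the JDK alone. The following is a minimal sketch, not the class's actual implementation: `SentenceThenWordSketch` and `segment` are hypothetical names, the text is held in a `String` rather than a reusable buffer, and the letter-or-digit filter is an assumption about what counts as a word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Stand-alone illustration (hypothetical); not part of Lucene. */
final class SentenceThenWordSketch {

  /** Returns one list of words per sentence found in {@code text}. */
  static List<List<String>> segment(String text) {
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
    List<List<String>> result = new ArrayList<>();

    // Outer pass: walk sentence boundaries over the whole text.
    sentences.setText(text);
    int sentenceStart = sentences.first();
    for (int sentenceEnd = sentences.next();
         sentenceEnd != BreakIterator.DONE;
         sentenceStart = sentenceEnd, sentenceEnd = sentences.next()) {
      String sentence = text.substring(sentenceStart, sentenceEnd);
      List<String> sentenceWords = new ArrayList<>();

      // Inner pass: walk word boundaries inside the current sentence only.
      words.setText(sentence);
      int wordStart = words.first();
      for (int wordEnd = words.next();
           wordEnd != BreakIterator.DONE;
           wordStart = wordEnd, wordEnd = words.next()) {
        String candidate = sentence.substring(wordStart, wordEnd);
        // The word iterator also reports whitespace and punctuation spans;
        // this sketch keeps only spans that start with a letter or digit.
        if (Character.isLetterOrDigit(candidate.codePointAt(0))) {
          sentenceWords.add(candidate);
        }
      }
      result.add(sentenceWords);
    }
    return result;
  }
}
```

Grouping the words by sentence is one simple way to keep the sentence boundary visible to downstream code; a real subclass would more likely record it in an attribute or a position increment.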
Nested classes/interfaces inherited from class AttributeSource: AttributeSource.State
Modifier and Type | Field and Description |
---|---|
protected char[] | buffer |
protected static int | BUFFERMAX |
protected int | offset: accumulated offset of previous buffers for this reader, for offsetAtt |
Fields inherited from class TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor and Description |
---|
SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator): Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory. |
SegmentingTokenizerBase(BreakIterator iterator): Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation. |
Modifier and Type | Method and Description |
---|---|
void | end() |
boolean | incrementToken() |
protected abstract boolean | incrementWord(): Returns true if another word is available. |
protected boolean | isSafeEnd(char ch): For sentence tokenization, these are the unambiguous break positions. |
void | reset() |
protected abstract void | setNextSentence(int sentenceStart, int sentenceEnd): Provides the next input sentence for analysis. |
Methods inherited from class Tokenizer: close, correctOffset, setReader
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
protected static final int BUFFERMAX
protected final char[] buffer
protected int offset
public SegmentingTokenizerBase(BreakIterator iterator)
Note that you should never share BreakIterators across different TokenStreams; a newly created or cloned one should always be provided to this constructor.
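A BreakIterator carries its own text and position, which is why sharing one between streams is unsafe. As a small sketch of the "cloned one" option, assuming a prototype iterator kept around only for copying (`BreakIteratorPerStreamSketch` and `freshCopy` are hypothetical names):

```java
import java.text.BreakIterator;

/** Stand-alone illustration (hypothetical); not part of Lucene. */
final class BreakIteratorPerStreamSketch {

  /**
   * Returns an independent copy of the prototype. Each copy has its own text
   * and position, so it can be handed to exactly one tokenizer instance.
   */
  static BreakIterator freshCopy(BreakIterator prototype) {
    // BreakIterator.clone() is public and returns Object, hence the cast.
    return (BreakIterator) prototype.clone();
  }
}
```

A caller would then pass `freshCopy(prototype)` to the constructor each time a new tokenizer is created, and never the prototype itself.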
public SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator)
public final boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException
public void reset() throws IOException
Overrides: reset in class Tokenizer
Throws: IOException
public final void end() throws IOException
Overrides: end in class TokenStream
Throws: IOException
protected boolean isSafeEnd(char ch)
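This page says only that safe ends are the unambiguous break positions for sentence tokenization. The sketch below assumes, without confirmation from this page, that such positions are line and paragraph terminators, and shows how a buffer-filling loop could use the check to avoid cutting a sentence in half when the buffer is full. `SafeEndSketch` and `findSafeEnd` are hypothetical names.

```java
/** Stand-alone illustration (hypothetical); not part of Lucene. */
final class SafeEndSketch {

  /** Assumed set of unambiguous break characters: line and paragraph terminators. */
  static boolean isSafeEnd(char ch) {
    switch (ch) {
      case 0x000A: // line feed
      case 0x000D: // carriage return
      case 0x0085: // next line
      case 0x2028: // line separator
      case 0x2029: // paragraph separator
        return true;
      default:
        return false;
    }
  }

  /**
   * Scans backwards through buffer[0, length) and returns the index just past
   * the last safe end, or -1 if the buffer contains none. Text after that
   * index could be carried over to the next buffer instead of being split.
   */
  static int findSafeEnd(char[] buffer, int length) {
    for (int i = length - 1; i >= 0; i--) {
      if (isSafeEnd(buffer[i])) {
        return i + 1;
      }
    }
    return -1;
  }
}
```

Because the method is protected and not final, a subclass can presumably override it to accept additional characters when its sentence rules allow.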
protected abstract void setNextSentence(int sentenceStart, int sentenceEnd)
protected abstract boolean incrementWord()
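The contract between the two abstract hooks can be modelled without Lucene on the classpath. In this minimal sketch, `WordHooksSketch` is a hypothetical stand-in that does not extend any Lucene class: it receives sentence offsets into a shared `char[]` buffer through `setNextSentence`, then yields one word per `incrementWord` call until it returns false. The `term()` and `termStart()` accessors are illustrative substitutes for the attributes a real subclass would populate.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical stand-in for the two abstract hooks; not a Lucene class. */
final class WordHooksSketch {
  private final char[] buffer; // plays the role of the protected buffer field
  private final BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
  private int sentenceStart;   // offset of the current sentence within buffer
  private String term;         // most recent word
  private int termStart;       // offset of that word within buffer

  WordHooksSketch(char[] buffer) {
    this.buffer = buffer;
  }

  /** Called once per sentence with offsets into the buffer. */
  void setNextSentence(int sentenceStart, int sentenceEnd) {
    this.sentenceStart = sentenceStart;
    // setText also resets the iterator to the start of the sentence.
    words.setText(new String(buffer, sentenceStart, sentenceEnd - sentenceStart));
  }

  /** Advances to the next word of the current sentence; false when exhausted. */
  boolean incrementWord() {
    int start = words.current();
    for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
      // Skip whitespace and punctuation spans (an assumption of this sketch).
      if (Character.isLetterOrDigit(buffer[sentenceStart + start])) {
        termStart = sentenceStart + start;
        term = new String(buffer, termStart, end - start);
        return true;
      }
    }
    return false;
  }

  String term() {
    return term;
  }

  int termStart() {
    return termStart;
  }
}
```

A driver would call `setNextSentence` for each sentence boundary pair and then loop on `incrementWord`, which mirrors the division of labour the method summaries describe.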
Copyright © 2000-2024 Apache Software Foundation. All Rights Reserved.