public abstract class SegmentingTokenizerBase extends Tokenizer
Breaks text into sentences with a BreakIterator and
allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.
Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
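The two-level idea described above (sentences first, then words within each sentence) can be sketched with the JDK alone. This is only an illustration, not the Lucene implementation: the class name `SentenceThenWordSketch` and the method `segment` are hypothetical, and filtering on `Character.isLetterOrDigit` is an assumed, simplistic rule for deciding what counts as a word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: not part of Lucene.
public class SentenceThenWordSketch {

    /** Splits text into sentences, then splits each sentence into word tokens. */
    public static List<List<String>> segment(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        sentences.setText(text);

        List<List<String>> result = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            // The word iterator only ever sees one sentence at a time.
            String sentence = text.substring(start, end);
            words.setText(sentence);

            List<String> tokens = new ArrayList<>();
            int wordStart = words.first();
            for (int wordEnd = words.next(); wordEnd != BreakIterator.DONE; wordStart = wordEnd, wordEnd = words.next()) {
                String candidate = sentence.substring(wordStart, wordEnd);
                // Assumed rule: keep spans that begin with a letter or digit,
                // which drops whitespace and punctuation spans.
                if (Character.isLetterOrDigit(candidate.codePointAt(0))) {
                    tokens.add(candidate);
                }
            }
            result.add(tokens);
        }
        return result;
    }
}
```

A subclass that wants to mark sentence boundaries could use the outer loop position as the signal, for example by emitting an extra token or a larger position increment at the first word of each inner list.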
Nested classes inherited from class AttributeSource: AttributeSource.State

| Modifier and Type | Field and Description |
|---|---|
| protected char[] | buffer |
| protected static int | BUFFERMAX |
| protected int | offset: accumulated offset of previous buffers for this reader, for offsetAtt |
Fields inherited from class TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY

| Constructor and Description |
|---|
| SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator): Construct a new SegmentingTokenizerBase, also supplying the AttributeFactory. |
| SegmentingTokenizerBase(BreakIterator iterator): Construct a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation. |
| Modifier and Type | Method and Description |
|---|---|
| void | end() |
| boolean | incrementToken() |
| protected abstract boolean | incrementWord(): Returns true if another word is available |
| protected boolean | isSafeEnd(char ch): For sentence tokenization, these are the unambiguous break positions. |
| void | reset() |
| protected abstract void | setNextSentence(int sentenceStart, int sentenceEnd): Provides the next input sentence for analysis |
Methods inherited from class Tokenizer: close, correctOffset, setReader

Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

protected static final int BUFFERMAX
protected final char[] buffer
protected int offset
public SegmentingTokenizerBase(BreakIterator iterator)
Note that you should never share BreakIterators across different TokenStreams; instead, always provide a newly created or cloned one to this constructor.
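A BreakIterator carries mutable state (its current text and position), which is why each stream needs its own instance. One way to follow the advice above is to keep a prototype and hand out clones, as in this minimal sketch. The class `BreakIteratorPerStreamSketch` and its methods are hypothetical names used only for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: not part of Lucene.
public class BreakIteratorPerStreamSketch {

    // Prototype that is never iterated directly, only cloned.
    private static final BreakIterator PROTOTYPE = BreakIterator.getSentenceInstance(Locale.ROOT);

    /** Returns an independent iterator that is safe to give to a single consumer. */
    public static BreakIterator newIterator() {
        return (BreakIterator) PROTOTYPE.clone();
    }

    /** Counts the sentences the given iterator finds in text. */
    public static int countSentences(BreakIterator iterator, String text) {
        iterator.setText(text);
        int count = 0;
        iterator.first();
        while (iterator.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }
}
```

Each call to `newIterator()` yields an object whose text and position can change without affecting any other clone, so the result could be passed to a separate tokenizer instance.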
public SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator)
public final boolean incrementToken()
throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException
public void reset()
throws IOException
Overrides: reset in class Tokenizer
Throws: IOException
public final void end()
throws IOException
Overrides: end in class TokenStream
Throws: IOException
protected boolean isSafeEnd(char ch)
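To illustrate what an "unambiguous break position" might be used for, the sketch below scans a partially filled buffer backwards for the last character that definitely ends a sentence, so that only complete sentences are handed on. This is an assumption-laden illustration, not the library's code: the class `SafeEndSketch`, the method `lastSafeEnd`, and the choice of line and paragraph separators as the safe characters are all assumptions made here.

```java
// Hypothetical helper: not part of Lucene.
public class SafeEndSketch {

    /** Assumption: line and paragraph separators always end a sentence. */
    public static boolean isSafeEnd(char ch) {
        switch (ch) {
            case '\r':
            case '\n':
            case 0x0085: // NEXT LINE
            case 0x2028: // LINE SEPARATOR
            case 0x2029: // PARAGRAPH SEPARATOR
                return true;
            default:
                return false;
        }
    }

    /**
     * Returns the index just past the last safe end in buffer[0..length),
     * or -1 if the range contains no safe end.
     */
    public static int lastSafeEnd(char[] buffer, int length) {
        for (int i = length - 1; i >= 0; i--) {
            if (isSafeEnd(buffer[i])) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

A subclass could override `isSafeEnd` to add characters that are unambiguous in its own language or domain.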
protected abstract void setNextSentence(int sentenceStart,
int sentenceEnd)
protected abstract boolean incrementWord()
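The contract between the two abstract methods can be mimicked in a standalone class: `setNextSentence` receives a range of the shared buffer, and `incrementWord` is then called repeatedly until it returns false. The sketch below does not extend the real class and uses only the JDK. The class name `SentenceWordCursorSketch`, the accessors `currentWord` and `currentStartOffset`, and the letter-or-digit filter are hypothetical choices made for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical stand-in for a subclass; it does not extend any Lucene class.
public class SentenceWordCursorSketch {
    private final char[] buffer;
    private final BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
    private String sentence = "";
    private int sentenceStart;
    private int wordStart;
    private int wordEnd;

    public SentenceWordCursorSketch(char[] buffer) {
        this.buffer = buffer;
    }

    /** Positions the word iterator on buffer[sentenceStart..sentenceEnd). */
    public void setNextSentence(int sentenceStart, int sentenceEnd) {
        this.sentenceStart = sentenceStart;
        this.sentence = new String(buffer, sentenceStart, sentenceEnd - sentenceStart);
        words.setText(sentence);
        wordEnd = words.first();
    }

    /** Advances to the next word in the current sentence; returns false when none remain. */
    public boolean incrementWord() {
        while (true) {
            int start = wordEnd;
            int end = words.next();
            if (end == BreakIterator.DONE) {
                return false;
            }
            wordEnd = end;
            // Assumed rule: skip whitespace and punctuation spans.
            if (Character.isLetterOrDigit(sentence.codePointAt(start))) {
                wordStart = start;
                return true;
            }
        }
    }

    /** Text of the word found by the last successful incrementWord() call. */
    public String currentWord() {
        return sentence.substring(wordStart, wordEnd);
    }

    /** Start offset of the current word, relative to the whole buffer. */
    public int currentStartOffset() {
        return sentenceStart + wordStart;
    }
}
```

In a real subclass, the point where this sketch records `wordStart` and `wordEnd` is where token attributes such as the term text and offsets would be set.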
Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.