Package org.apache.lucene.analysis.util
Class SegmentingTokenizerBase
- java.lang.Object
-
- org.apache.lucene.util.AttributeSource
-
- org.apache.lucene.analysis.TokenStream
-
- org.apache.lucene.analysis.Tokenizer
-
- org.apache.lucene.analysis.util.SegmentingTokenizerBase
-
- All Implemented Interfaces:
Closeable, AutoCloseable
- Direct Known Subclasses:
ThaiTokenizer
public abstract class SegmentingTokenizerBase extends Tokenizer
Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.
Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
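The sentence-then-word decomposition described above can be illustrated without Lucene. What follows is a minimal standalone sketch, not this class's implementation: it relies only on java.text.BreakIterator, and the names TwoLevelSegmentationSketch and segment are hypothetical. It also assumes that a "word" is any span starting with a letter or digit, which is a simplification chosen for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; it is not part of Lucene.
public class TwoLevelSegmentationSketch {

    /** Splits text into sentences, then splits each sentence into words. */
    public static List<List<String>> segment(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        sentences.setText(text);

        List<List<String>> result = new ArrayList<>();
        int sentenceStart = sentences.first();
        for (int sentenceEnd = sentences.next();
             sentenceEnd != BreakIterator.DONE;
             sentenceStart = sentenceEnd, sentenceEnd = sentences.next()) {

            // The word iterator only ever sees one sentence at a time.
            String sentence = text.substring(sentenceStart, sentenceEnd);
            words.setText(sentence);

            List<String> tokens = new ArrayList<>();
            int wordStart = words.first();
            for (int wordEnd = words.next();
                 wordEnd != BreakIterator.DONE;
                 wordStart = wordEnd, wordEnd = words.next()) {
                // Assumption for this sketch: skip spans that start with
                // whitespace or punctuation.
                if (Character.isLetterOrDigit(sentence.codePointAt(wordStart))) {
                    tokens.add(sentence.substring(wordStart, wordEnd));
                }
            }
            result.add(tokens);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(segment("Hello world. How are you?"));
    }
}
```

Because the result keeps one inner list per sentence, a caller can see where each sentence boundary falls, which is the kind of information a subclass might expose through a custom attribute or position increment.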
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected char[] | buffer |
protected static int | BUFFERMAX |
protected int | offset | accumulated offset of previous buffers for this reader, for offsetAtt
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors
Constructor | Description
SegmentingTokenizerBase(BreakIterator iterator) | Construct a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.
SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator) | Construct a new SegmentingTokenizerBase, also supplying the AttributeFactory.
-
Method Summary
Modifier and Type | Method | Description
void | end() |
boolean | incrementToken() |
protected abstract boolean | incrementWord() | Returns true if another word is available
protected boolean | isSafeEnd(char ch) | For sentence tokenization, these are the unambiguous break positions.
void | reset() |
protected abstract void | setNextSentence(int sentenceStart, int sentenceEnd) | Provides the next input sentence for analysis
-
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
BUFFERMAX
protected static final int BUFFERMAX
- See Also:
- Constant Field Values
-
buffer
protected final char[] buffer
-
offset
protected int offset
accumulated offset of previous buffers for this reader, for offsetAtt
-
-
Constructor Detail
-
SegmentingTokenizerBase
public SegmentingTokenizerBase(BreakIterator iterator)
Construct a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.
Note that you should never share BreakIterators across different TokenStreams; instead, a newly created or cloned one should always be provided to this constructor.
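One way to follow the advice about not sharing BreakIterators is to keep a single prototype and hand out a clone for each stream. The sketch below is standalone and uses only java.text.BreakIterator; the class BreakIteratorPerStreamSketch and its methods newIterator and countSentences are hypothetical names introduced for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical illustration class; it is not part of Lucene.
public class BreakIteratorPerStreamSketch {

    // A prototype that is only ever cloned, never used for iteration itself.
    private static final BreakIterator PROTOTYPE =
            BreakIterator.getSentenceInstance(Locale.ROOT);

    /** Returns a private copy, suitable for handing to one consumer. */
    public static BreakIterator newIterator() {
        return (BreakIterator) PROTOTYPE.clone();
    }

    /** Counts sentences using the caller's own iterator instance. */
    public static int countSentences(BreakIterator iterator, String text) {
        iterator.setText(text);
        int count = 0;
        iterator.first();
        while (iterator.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }
}
```

A BreakIterator carries its text and current position as mutable state, so two consumers iterating through the same instance would interfere with each other. Cloning gives each consumer independent state while still sharing the rule setup cost.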
-
SegmentingTokenizerBase
public SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator)
Construct a new SegmentingTokenizerBase, also supplying the AttributeFactory.
-
-
Method Detail
-
incrementToken
public final boolean incrementToken() throws IOException
- Specified by:
incrementToken
in class TokenStream
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides:
reset
in class Tokenizer
- Throws:
IOException
-
end
public final void end() throws IOException
- Overrides:
end
in class TokenStream
- Throws:
IOException
-
isSafeEnd
protected boolean isSafeEnd(char ch)
For sentence tokenization, these are the unambiguous break positions.
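This page does not list which characters count as unambiguous break positions. As a sketch only, the example below assumes they are line and paragraph separators (CR, LF, NEL, U+2028, U+2029) and shows how such a predicate could be used to find a point in a full buffer where cutting is unlikely to split a sentence. The class SafeEndSketch and the helper lastSafeEnd are hypothetical, and the character set is an assumption rather than documented behavior.

```java
// Hypothetical illustration class; it is not part of Lucene.
public class SafeEndSketch {

    /** Assumed set of unambiguous break characters: line and paragraph separators. */
    static boolean isSafeEnd(char ch) {
        switch (ch) {
            case '\r':
            case '\n':
            case 0x0085: // NEXT LINE
            case 0x2028: // LINE SEPARATOR
            case 0x2029: // PARAGRAPH SEPARATOR
                return true;
            default:
                return false;
        }
    }

    /**
     * Scans backward through the first {@code length} chars of {@code buffer}
     * and returns the index just past the last safe break character,
     * or -1 if the buffer contains none.
     */
    static int lastSafeEnd(char[] buffer, int length) {
        for (int i = length - 1; i >= 0; i--) {
            if (isSafeEnd(buffer[i])) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

A subclass that overrides isSafeEnd could widen or narrow this set for its language, for example by treating a full stop as safe where abbreviations are not a concern.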
-
setNextSentence
protected abstract void setNextSentence(int sentenceStart, int sentenceEnd)
Provides the next input sentence for analysis
-
incrementWord
protected abstract boolean incrementWord()
Returns true if another word is available
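The two abstract methods suggest a pull-style protocol: the base class announces a sentence, then asks repeatedly for the next word until none remain. The standalone sketch below mimics that protocol without depending on Lucene. It assumes that sentenceStart and sentenceEnd are offsets into the buffer field, which this page does not state explicitly, and the class SentenceCallbackSketch, its currentWord field, and the wordsOf driver are all hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; it does not extend the real Lucene base class.
public class SentenceCallbackSketch {
    private final char[] buffer;
    private final BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
    private String sentence = "";
    private int wordStart;
    private String currentWord;

    public SentenceCallbackSketch(String text) {
        this.buffer = text.toCharArray();
    }

    /** Receives the next sentence as offsets into the buffer (assumed semantics). */
    void setNextSentence(int sentenceStart, int sentenceEnd) {
        sentence = new String(buffer, sentenceStart, sentenceEnd - sentenceStart);
        words.setText(sentence);
        wordStart = words.first();
    }

    /** Advances to the next word of the current sentence; false when exhausted. */
    boolean incrementWord() {
        for (int end = words.next(); end != BreakIterator.DONE; end = words.next()) {
            int start = wordStart;
            wordStart = end;
            // Assumption for this sketch: skip whitespace and punctuation spans.
            if (Character.isLetterOrDigit(sentence.codePointAt(start))) {
                currentWord = sentence.substring(start, end);
                return true;
            }
        }
        return false;
    }

    /** Driver playing the role of the base class for a single sentence. */
    public static List<String> wordsOf(String text, int sentenceStart, int sentenceEnd) {
        SentenceCallbackSketch segmenter = new SentenceCallbackSketch(text);
        segmenter.setNextSentence(sentenceStart, sentenceEnd);
        List<String> out = new ArrayList<>();
        while (segmenter.incrementWord()) {
            out.add(segmenter.currentWord);
        }
        return out;
    }
}
```

In a real subclass, incrementWord would presumably populate token attributes such as the term text and offsets instead of a plain field; that part is omitted here because it depends on Lucene APIs not shown on this page.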
-
-