Package org.apache.lucene.analysis.util
Class SegmentingTokenizerBase
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.util.SegmentingTokenizerBase
- All Implemented Interfaces:
Closeable, AutoCloseable
- Direct Known Subclasses:
ThaiTokenizer
Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.
Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
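The first stage described above, splitting text into sentences with a BreakIterator, can be sketched with the JDK alone. This is a minimal illustration and not the class's actual implementation: the class name SentenceSpans and the method sentenceSpans are hypothetical, and Locale.ROOT is an assumed choice of locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: lists sentence spans in a string.
final class SentenceSpans {

  /** Returns the [start, end) offsets of each sentence found by the JDK sentence iterator. */
  static List<int[]> sentenceSpans(String text) {
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    sentences.setText(text);
    List<int[]> spans = new ArrayList<>();
    int start = sentences.first();
    // next() returns BreakIterator.DONE once the last boundary has been passed.
    for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
      spans.add(new int[] {start, end});
      start = end;
    }
    return spans;
  }

  public static void main(String[] args) {
    String text = "Hello world. How are you?";
    for (int[] span : sentenceSpans(text)) {
      System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
    }
  }
}
```

Each span is the kind of (start, end) pair a subclass would then decompose into words.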
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Modifier and Type | Field | Description
protected final char[] | buffer |
protected static final int | BUFFERMAX |
protected int | offset | accumulated offset of previous buffers for this reader, for offsetAtt
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor | Description
SegmentingTokenizerBase(BreakIterator iterator) | Construct a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.
SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator) | Construct a new SegmentingTokenizerBase, also supplying the AttributeFactory.
-
Method Summary
Modifier and Type | Method | Description
final void | end() |
final boolean | incrementToken() |
protected abstract boolean | incrementWord() | Returns true if another word is available
protected boolean | isSafeEnd(char ch) | For sentence tokenization, these are the unambiguous break positions.
void | reset() |
protected abstract void | setNextSentence(int sentenceStart, int sentenceEnd) | Provides the next input sentence for analysis
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
BUFFERMAX
protected static final int BUFFERMAX
-
buffer
protected final char[] buffer
-
offset
protected int offset
accumulated offset of previous buffers for this reader, for offsetAtt
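To illustrate what an accumulated offset of previous buffers means, the sketch below reads a string through a small fixed-size buffer and records where each buffer load starts in the overall stream. It is a simplified assumption about the bookkeeping, not the class's actual refill logic; ChunkOffsets, chunkStartOffsets, and absoluteOffset are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Lucene code.
final class ChunkOffsets {

  /** Reads text through a buffer of bufferSize chars and returns the stream offset of each load. */
  static List<Integer> chunkStartOffsets(String text, int bufferSize) {
    char[] buffer = new char[bufferSize];
    List<Integer> starts = new ArrayList<>();
    int offset = 0; // total chars consumed by all previous buffer loads
    try (Reader reader = new StringReader(text)) {
      int read;
      while ((read = reader.read(buffer, 0, buffer.length)) > 0) {
        starts.add(offset);
        offset += read;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return starts;
  }

  /** Maps an index inside the current buffer to an offset in the whole stream. */
  static int absoluteOffset(int offset, int indexInBuffer) {
    return offset + indexInBuffer;
  }
}
```

With this bookkeeping, a token found at index i of the current buffer would be reported at stream position offset + i.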
-
-
Constructor Details
-
SegmentingTokenizerBase
Construct a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation. Note that you should never share BreakIterators across different TokenStreams; instead, a newly created or cloned one should always be provided to this constructor.
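One way to follow the advice about not sharing BreakIterators is to keep a private prototype and hand out a clone for every stream. The sketch below uses only the JDK; the class BreakIteratorPerStream and the method newSentenceIterator are hypothetical names, and Locale.ROOT is an assumed locale.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical factory, not part of Lucene.
final class BreakIteratorPerStream {

  // The prototype is never used for iteration itself; it is only cloned.
  private static final BreakIterator PROTOTYPE =
      BreakIterator.getSentenceInstance(Locale.ROOT);

  /** Returns an independent iterator with its own text and position state. */
  static BreakIterator newSentenceIterator() {
    return (BreakIterator) PROTOTYPE.clone();
  }
}
```

Each value returned by newSentenceIterator() could then be passed to one tokenizer instance, so that advancing one iterator never moves another.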
-
SegmentingTokenizerBase
Construct a new SegmentingTokenizerBase, also supplying the AttributeFactory.
-
-
Method Details
-
incrementToken
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
reset
- Overrides:
reset in class Tokenizer
- Throws:
IOException
-
end
- Overrides:
end in class TokenStream
- Throws:
IOException
-
isSafeEnd
protected boolean isSafeEnd(char ch)
For sentence tokenization, these are the unambiguous break positions.
-
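The page does not list which characters count as unambiguous break positions. The sketch below assumes, purely for illustration, that hard line and paragraph terminators are treated as safe, and shows how such a predicate could be used to find the last safe cut point in a partially filled buffer. SafeEnds and lastSafeEnd are hypothetical names, and the character set is an assumption rather than the documented behavior.

```java
// Hypothetical illustration, not Lucene code.
final class SafeEnds {

  /** Assumed set of safe break characters: CR, LF, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR. */
  static boolean isSafeEnd(char ch) {
    switch (ch) {
      case 0x000D:
      case 0x000A:
      case 0x0085:
      case 0x2028:
      case 0x2029:
        return true;
      default:
        return false;
    }
  }

  /**
   * Returns the index just past the last safe break character within the first
   * length chars of buffer, or -1 if there is none.
   */
  static int lastSafeEnd(char[] buffer, int length) {
    for (int i = length - 1; i >= 0; i--) {
      if (isSafeEnd(buffer[i])) {
        return i + 1;
      }
    }
    return -1;
  }
}
```

A subclass that overrides isSafeEnd would change where a buffer may be cut without splitting a sentence.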
setNextSentence
protected abstract void setNextSentence(int sentenceStart, int sentenceEnd)
Provides the next input sentence for analysis
-
incrementWord
protected abstract boolean incrementWord()
Returns true if another word is available
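The interplay between setNextSentence and incrementWord can be sketched with a JDK-only stand-in. SentenceDriver below is a hypothetical class that only mimics the calling pattern implied by the two method descriptions (hand over a sentence, then ask for words until none remain); it is not Lucene code, and drain, currentWord, and WordSplitter are invented for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the base class; it only imitates the callback pattern.
abstract class SentenceDriver {
  protected final char[] buffer;

  SentenceDriver(String text) {
    this.buffer = text.toCharArray();
  }

  protected abstract void setNextSentence(int sentenceStart, int sentenceEnd);

  protected abstract boolean incrementWord();

  protected abstract String currentWord();

  /** Feeds each sentence to the subclass, then pulls words until it reports none left. */
  final List<String> drain() {
    List<String> words = new ArrayList<>();
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    sentences.setText(new String(buffer));
    int start = sentences.first();
    for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
      setNextSentence(start, end);
      while (incrementWord()) {
        words.add(currentWord());
      }
      start = end;
    }
    return words;
  }
}

// Example subclass: splits each sentence with the JDK word iterator.
final class WordSplitter extends SentenceDriver {
  private final BreakIterator wordBreaker = BreakIterator.getWordInstance(Locale.ROOT);
  private int sentenceStart;
  private String word;

  WordSplitter(String text) {
    super(text);
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    this.sentenceStart = sentenceStart;
    wordBreaker.setText(new String(buffer, sentenceStart, sentenceEnd - sentenceStart));
  }

  @Override
  protected boolean incrementWord() {
    int start = wordBreaker.current();
    for (int end = wordBreaker.next(); end != BreakIterator.DONE; end = wordBreaker.next()) {
      // Skip segments that are whitespace or punctuation.
      if (Character.isLetterOrDigit(buffer[sentenceStart + start])) {
        word = new String(buffer, sentenceStart + start, end - start);
        return true;
      }
      start = end;
    }
    return false;
  }

  @Override
  protected String currentWord() {
    return word;
  }
}
```

A real subclass would set token attributes inside incrementWord instead of returning a string, but the control flow is the same: one setNextSentence call per sentence, followed by repeated incrementWord calls.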
-