Class SegmentingTokenizerBase

  • All Implemented Interfaces:
    Closeable, AutoCloseable
    Direct Known Subclasses:
    ThaiTokenizer

    public abstract class SegmentingTokenizerBase
    extends Tokenizer
    Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.

    This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

    Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
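
    The two-level approach described above (sentences first, then words within each sentence) can be sketched with the JDK alone. This is a minimal illustration, not this class's implementation: the class name SentenceThenWordSketch and the method segment are hypothetical, and the sketch returns plain strings instead of producing tokens.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceThenWordSketch {

  /** Splits text into sentences, then splits each sentence into words. */
  static List<List<String>> segment(String text) {
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
    sentences.setText(text);

    List<List<String>> result = new ArrayList<>();
    int start = sentences.first();
    for (int end = sentences.next();
        end != BreakIterator.DONE;
        start = end, end = sentences.next()) {
      // The word iterator only ever sees one sentence at a time.
      String sentence = text.substring(start, end);
      words.setText(sentence);

      List<String> tokens = new ArrayList<>();
      int wordStart = words.first();
      for (int wordEnd = words.next();
          wordEnd != BreakIterator.DONE;
          wordStart = wordEnd, wordEnd = words.next()) {
        String candidate = sentence.substring(wordStart, wordEnd);
        // Skip segments that are only whitespace or punctuation.
        if (Character.isLetterOrDigit(candidate.codePointAt(0))) {
          tokens.add(candidate);
        }
      }
      result.add(tokens);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(segment("Hello world. This is a test."));
  }
}
```

    Because the outer loop knows where each sentence ends, it is also the natural place to emit a boundary marker between sentences.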

    WARNING: This API is experimental and might change in incompatible ways in the next release.
    • Field Detail

      • buffer

        protected final char[] buffer
      • offset

        protected int offset
        Accumulated offset of the previous buffers for this reader, used for offsetAtt.
    • Constructor Detail

      • SegmentingTokenizerBase

        public SegmentingTokenizerBase(BreakIterator iterator)
        Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

        Note that you should never share BreakIterators across different TokenStreams; instead, always pass a newly created or cloned one to this constructor.
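
        One way to follow this rule is to keep a prototype BreakIterator that is only ever cloned, and hand each stream its own copy. The sketch below uses only java.text.BreakIterator; the class name BreakIteratorPerStreamSketch and the helper newSentenceIterator are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BreakIteratorPerStreamSketch {

  // Prototype instance: only ever cloned, never given text or handed to a tokenizer.
  private static final BreakIterator SENTENCE_PROTOTYPE =
      BreakIterator.getSentenceInstance(Locale.ROOT);

  /** Returns a fresh iterator with its own state, intended for a single token stream. */
  static BreakIterator newSentenceIterator() {
    return (BreakIterator) SENTENCE_PROTOTYPE.clone();
  }

  public static void main(String[] args) {
    BreakIterator first = newSentenceIterator();
    BreakIterator second = newSentenceIterator();
    first.setText("One. Two.");
    second.setText("Only one");
    // Advancing one clone does not move the other.
    first.next();
    System.out.println(second.current());
  }
}
```

        A subclass constructor could then pass newSentenceIterator() to super(...), so that no two streams ever share iterator state.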

      • SegmentingTokenizerBase

        public SegmentingTokenizerBase(AttributeFactory factory,
                                       BreakIterator iterator)
        Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.
    • Method Detail

      • isSafeEnd

        protected boolean isSafeEnd(char ch)
        For sentence tokenization: returns true if ch is an unambiguous break position.
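
        As an illustration of what an unambiguous break position might look like, the sketch below assumes that line and paragraph terminators qualify, and shows how such a predicate could be used to find a place to cut a partially filled buffer. Both the character set and the helper lastSafeEnd are assumptions made for this example, not a statement of this class's behavior.

```java
final class SafeEndSketch {

  /** Assumed set of unambiguous break characters: line and paragraph terminators. */
  static boolean isSafeEnd(char ch) {
    switch (ch) {
      case 0x000D: // carriage return
      case 0x000A: // line feed
      case 0x0085: // next line (NEL)
      case 0x2028: // line separator
      case 0x2029: // paragraph separator
        return true;
      default:
        return false;
    }
  }

  /** Returns the index of the last safe end in buffer[0, length), or -1 if there is none. */
  static int lastSafeEnd(char[] buffer, int length) {
    for (int i = length - 1; i >= 0; i--) {
      if (isSafeEnd(buffer[i])) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    char[] buffer = "first line\nsecond li".toCharArray();
    System.out.println(lastSafeEnd(buffer, buffer.length));
  }
}
```

        Cutting at such a position avoids handing the sentence BreakIterator a chunk that stops in the middle of a sentence.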
      • setNextSentence

        protected abstract void setNextSentence(int sentenceStart,
                                                int sentenceEnd)
        Provides the next input sentence for analysis. The sentence spans positions sentenceStart (inclusive) to sentenceEnd (exclusive) of buffer.
      • incrementWord

        protected abstract boolean incrementWord()
        Returns true if another word is available.
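
        The calling pattern implied by setNextSentence and incrementWord can be sketched without any library dependency. In the example below, Base is a hypothetical stand-in that has only a buffer, the two hooks, and a simplified driver; it is not this class. The accessor currentWord is also an assumption made so the sketch can return strings; a real subclass would presumably publish each word through token attributes instead.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordHookSketch {

  /** Hypothetical stand-in for the base class: a buffer, the two hooks, and a driver. */
  abstract static class Base {
    protected final char[] buffer;

    Base(String text) {
      this.buffer = text.toCharArray();
    }

    protected abstract void setNextSentence(int sentenceStart, int sentenceEnd);

    protected abstract boolean incrementWord();

    /** Hypothetical accessor for the word found by the last successful incrementWord(). */
    protected abstract String currentWord();

    /** Simplified driver: announce one sentence, then drain its words. */
    final List<String> drain() {
      BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
      sentences.setText(new String(buffer));
      List<String> out = new ArrayList<>();
      int start = sentences.first();
      for (int end = sentences.next();
          end != BreakIterator.DONE;
          start = end, end = sentences.next()) {
        setNextSentence(start, end);
        while (incrementWord()) {
          out.add(currentWord());
        }
      }
      return out;
    }
  }

  /** Example subclass: a word is a maximal run of letters or digits. */
  static final class LetterRunWords extends Base {
    private int pos;
    private int end;
    private int wordStart;
    private int wordEnd;

    LetterRunWords(String text) {
      super(text);
    }

    @Override
    protected void setNextSentence(int sentenceStart, int sentenceEnd) {
      // Both values are offsets into buffer; the end is exclusive.
      pos = sentenceStart;
      end = sentenceEnd;
    }

    @Override
    protected boolean incrementWord() {
      while (pos < end && !Character.isLetterOrDigit(buffer[pos])) {
        pos++;
      }
      if (pos >= end) {
        return false; // sentence exhausted
      }
      wordStart = pos;
      while (pos < end && Character.isLetterOrDigit(buffer[pos])) {
        pos++;
      }
      wordEnd = pos;
      return true;
    }

    @Override
    protected String currentWord() {
      return new String(buffer, wordStart, wordEnd - wordStart);
    }
  }

  public static void main(String[] args) {
    System.out.println(new LetterRunWords("Hello world. This is a test.").drain());
  }
}
```

        The subclass never scans past sentenceEnd, which is what lets a word segmenter rely on sentence context.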