Class SegmentingTokenizerBase

All Implemented Interfaces:
Closeable, AutoCloseable
Direct Known Subclasses:
ThaiTokenizer

public abstract class SegmentingTokenizerBase extends Tokenizer
Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.

This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
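
To illustrate the idea of marking sentence boundaries with a position increment, here is a minimal standalone sketch. It does not use any Lucene classes: the `SentenceGapSketch` and `Token` types and the `SENTENCE_GAP` value are hypothetical names invented for this example, and words are split on whitespace only, so punctuation stays attached to the terms.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceGapSketch {

    // Hypothetical token type: a term plus its position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Illustrative gap added before the first word of each later sentence.
    static final int SENTENCE_GAP = 10;

    static List<Token> tokenize(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);

        List<Token> out = new ArrayList<>();
        boolean emittedAny = false;
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            boolean firstWord = true;
            // Naive word splitting: whitespace only.
            for (String word : text.substring(start, end).trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // Widen the gap only at a sentence start that follows earlier tokens.
                int increment = (firstWord && emittedAny) ? 1 + SENTENCE_GAP : 1;
                out.add(new Token(word, increment));
                firstWord = false;
                emittedAny = true;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Hello world. Bye now.")) {
            System.out.println(t.term + " +" + t.positionIncrement);
        }
    }
}
```

A downstream consumer could treat any increment larger than 1 as a sentence boundary, for example to stop phrase matches from spanning two sentences.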

WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Field Details

    • BUFFERMAX

      protected static final int BUFFERMAX
    • buffer

      protected final char[] buffer
    • offset

      protected int offset
      Accumulated offset of previous buffers for this reader, used for offsetAtt.
  • Constructor Details

    • SegmentingTokenizerBase

      public SegmentingTokenizerBase(BreakIterator iterator)
      Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

      Note that you should never share BreakIterators across different TokenStreams; always pass a newly created or cloned one to this constructor.

    • SegmentingTokenizerBase

      public SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator)
      Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.
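
The note about not sharing BreakIterators can be illustrated with plain `java.text.BreakIterator`, which keeps its text and current position as mutable state. The sketch below is only an assumption about how a caller might obtain one iterator per stream; `BreakIteratorPerStream` and `freshCopy` are hypothetical names, not part of this API.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class BreakIteratorPerStream {

    // Hypothetical helper: returns an independent copy of a prototype iterator.
    static BreakIterator freshCopy(BreakIterator prototype) {
        return (BreakIterator) prototype.clone();
    }

    public static void main(String[] args) {
        BreakIterator prototype = BreakIterator.getSentenceInstance(Locale.ROOT);

        // One copy per consumer, so neither disturbs the other's text or position.
        BreakIterator first = freshCopy(prototype);
        BreakIterator second = freshCopy(prototype);

        first.setText("One. Two.");
        second.setText("A different piece of text.");

        first.next(); // advances only the first copy
        System.out.println("first position:  " + first.current());
        System.out.println("second position: " + second.current());
    }
}
```

Each copy returned by `freshCopy` would then be handed to exactly one tokenizer instance.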
  • Method Details

    • incrementToken

      public final boolean incrementToken() throws IOException
      Specified by:
      incrementToken in class TokenStream
      Throws:
      IOException
    • reset

      public void reset() throws IOException
      Overrides:
      reset in class Tokenizer
      Throws:
      IOException
    • end

      public final void end() throws IOException
      Overrides:
      end in class TokenStream
      Throws:
      IOException
    • isSafeEnd

      protected boolean isSafeEnd(char ch)
      Returns true if ch marks an unambiguous break position for sentence tokenization.
    • setNextSentence

      protected abstract void setNextSentence(int sentenceStart, int sentenceEnd)
      Provides the next input sentence for analysis.
    • incrementWord

      protected abstract boolean incrementWord()
      Returns true if another word is available.
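
The contract between setNextSentence and incrementWord can be sketched with a simplified, standalone model. This is not the Lucene implementation: `MiniSegmenter`, `WhitespaceSegmenter`, `allWords`, and `currentWord` are hypothetical stand-ins, and the driver loop is only an assumption about how a base class might call the two hooks (hand over a sentence, then pull words until none remain).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SegmenterContractSketch {

    // Simplified stand-in for the base class: owns the sentence loop and
    // delegates word extraction to the two abstract hooks.
    abstract static class MiniSegmenter {
        protected final char[] buffer;
        protected String currentWord; // stand-in for a term attribute
        private final BreakIterator sentences;

        MiniSegmenter(String text, BreakIterator sentences) {
            this.buffer = text.toCharArray();
            this.sentences = sentences;
            sentences.setText(text);
        }

        protected abstract void setNextSentence(int sentenceStart, int sentenceEnd);

        protected abstract boolean incrementWord();

        List<String> allWords() {
            List<String> words = new ArrayList<>();
            int start = sentences.first();
            for (int end = sentences.next(); end != BreakIterator.DONE;
                    start = end, end = sentences.next()) {
                setNextSentence(start, end);   // hand over one sentence
                while (incrementWord()) {      // drain its words
                    words.add(currentWord);
                }
            }
            return words;
        }
    }

    // Example subclass: splits each sentence on whitespace.
    static final class WhitespaceSegmenter extends MiniSegmenter {
        private int pos;
        private int end;

        WhitespaceSegmenter(String text) {
            super(text, BreakIterator.getSentenceInstance(Locale.ROOT));
        }

        @Override
        protected void setNextSentence(int sentenceStart, int sentenceEnd) {
            pos = sentenceStart;
            end = sentenceEnd;
        }

        @Override
        protected boolean incrementWord() {
            while (pos < end && Character.isWhitespace(buffer[pos])) {
                pos++; // skip leading whitespace
            }
            if (pos >= end) {
                return false; // sentence exhausted
            }
            int wordStart = pos;
            while (pos < end && !Character.isWhitespace(buffer[pos])) {
                pos++;
            }
            currentWord = new String(buffer, wordStart, pos - wordStart);
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(new WhitespaceSegmenter("Hello world. Bye now.").allWords());
    }
}
```

In a real subclass, incrementWord would presumably populate token attributes (term, offsets) rather than a plain field; the sketch uses a field only to stay self-contained.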