java.lang.Object
- org.apache.lucene.util.AttributeSource
- - org.apache.lucene.analysis.TokenStream
  - - org.apache.lucene.analysis.Tokenizer
    - - org.apache.lucene.analysis.util.SegmentingTokenizerBase

All Implemented Interfaces:

Closeable, AutoCloseable

Direct Known Subclasses:

ThaiTokenizer
```
public abstract class SegmentingTokenizerBase
extends Tokenizer
```
Breaks text into sentences with a BreakIterator and lets subclasses decompose each sentence into words.
Subclasses that need sentence context for tokenization, such as CJK segmenters, can build on this class.
It also suits subclasses that mark sentence boundaries (with a custom attribute, an extra token, a position increment, etc.) for downstream processing.
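
A minimal sketch of the sentence-then-word idea described above, using only `java.text.BreakIterator` so that it runs without Lucene on the classpath. It is not this class's implementation: `SentenceGapSketch`, `SENTENCE_GAP`, and the `term:positionIncrement` output format are hypothetical names chosen for illustration. The sketch marks each sentence boundary by giving the first word of the following sentence a larger position increment, which is one of the marking strategies mentioned above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceGapSketch {
    /** Hypothetical extra position gap added at each sentence boundary. */
    static final int SENTENCE_GAP = 10;

    /** Returns entries of the form "term:positionIncrement". */
    static List<String> tokenize(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        List<String> out = new ArrayList<>();

        sentences.setText(text);
        int sStart = sentences.first();
        for (int sEnd = sentences.next(); sEnd != BreakIterator.DONE;
                sStart = sEnd, sEnd = sentences.next()) {
            // Outer level: one sentence at a time.
            String sentence = text.substring(sStart, sEnd);
            words.setText(sentence);
            boolean firstWord = true;

            int wStart = words.first();
            for (int wEnd = words.next(); wEnd != BreakIterator.DONE;
                    wStart = wEnd, wEnd = words.next()) {
                // Inner level: skip whitespace and punctuation segments.
                if (!Character.isLetterOrDigit(sentence.codePointAt(wStart))) {
                    continue;
                }
                // Widen the gap only when a previous sentence produced tokens.
                int posInc = (firstWord && !out.isEmpty()) ? 1 + SENTENCE_GAP : 1;
                out.add(sentence.substring(wStart, wEnd) + ":" + posInc);
                firstWord = false;
            }
        }
        return out;
    }
}
```

For example, `SentenceGapSketch.tokenize("Hello world. Second one here.")` yields `[Hello:1, world:1, Second:11, one:1, here:1]`.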

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.State

Field Summary

Fields
Modifier and Type	Field	Description
`protected char[]`	`buffer`
`protected static int`	`BUFFERMAX`
`protected int`	`offset`	Accumulated offset of previous buffers for this reader, used for offsetAtt.

Fields inherited from class org.apache.lucene.analysis.Tokenizer
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY

Constructor Summary

Constructors
Constructor	Description
`SegmentingTokenizerBase(BreakIterator iterator)`	Constructs a new SegmentingTokenizerBase that uses the provided BreakIterator for sentence segmentation.
`SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator)`	Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.

Method Summary

Modifier and Type	Method	Description
`void`	`end()`
`boolean`	`incrementToken()`
`protected abstract boolean`	`incrementWord()`	Returns true if another word is available.
`protected boolean`	`isSafeEnd(char ch)`	For sentence tokenization, these are the unambiguous break positions.
`void`	`reset()`
`protected abstract void`	`setNextSentence(int sentenceStart, int sentenceEnd)`	Provides the next input sentence for analysis.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - BUFFERMAX
```
protected static final int BUFFERMAX
```
    See Also:
    
    Constant Field Values
  - buffer
```
protected final char[] buffer
```
  - offset
```
protected int offset
```
    Accumulated offset of previous buffers for this reader, used for offsetAtt.
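
The relationship between `buffer`, `BUFFERMAX`, and `offset` can be sketched as follows. This is an illustration under assumptions, not Lucene's code: the input is consumed in fixed-size fills, and `offset` accumulates the lengths of earlier fills so that an index within the current buffer maps to an absolute position in the input. `BufferOffsetSketch`, `positionsOf`, and `fill` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class BufferOffsetSketch {
    /** Returns the absolute positions of target, reading text in fills of bufferMax chars. */
    static List<Integer> positionsOf(String text, char target, int bufferMax) {
        char[] buffer = new char[bufferMax];
        int offset = 0; // total length of all previous fills
        List<Integer> positions = new ArrayList<>();
        try (Reader input = new StringReader(text)) {
            int length;
            while ((length = fill(input, buffer)) > 0) {
                for (int i = 0; i < length; i++) {
                    if (buffer[i] == target) {
                        // Index in the current buffer plus accumulated offset.
                        positions.add(offset + i);
                    }
                }
                offset += length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return positions;
    }

    /** Reads until the buffer is full or the reader is exhausted; returns chars read. */
    static int fill(Reader input, char[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = input.read(buffer, total, buffer.length - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

With a buffer of 4 chars, `positionsOf("a-bc-def-g", '-', 4)` returns `[1, 4, 8]`: the second and third hits are found at index 0 of later fills and shifted by the accumulated offset.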
- Constructor Detail
  - SegmentingTokenizerBase
```
public SegmentingTokenizerBase(BreakIterator iterator)
```
    Constructs a new SegmentingTokenizerBase that uses the provided BreakIterator for sentence segmentation.
    A BreakIterator holds mutable iteration state, so never share one across different TokenStreams; always pass a newly created or cloned instance to this constructor.
  - SegmentingTokenizerBase
```
public SegmentingTokenizerBase(AttributeFactory factory,
                               BreakIterator iterator)
```
    Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.
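
One way to avoid sharing a BreakIterator between streams is to keep a prototype that is never used for iteration and give each tokenizer its own clone. A minimal sketch with hypothetical names (`BreakIteratorPerStream`, `newIterator`); the returned instance is what would be passed to either constructor.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorPerStream {
    /** Prototype that is only ever cloned, never iterated. */
    private static final BreakIterator PROTOTYPE =
            BreakIterator.getSentenceInstance(Locale.ROOT);

    /** Returns an independent copy with its own text and position. */
    static BreakIterator newIterator() {
        return (BreakIterator) PROTOTYPE.clone();
    }
}
```

Each clone carries its own text and scan position, so calling `setText` on one copy leaves the others untouched.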
- Method Detail
  - incrementToken
```
public final boolean incrementToken()
                             throws IOException
```
    Specified by:
    
    incrementToken in class TokenStream
    
    Throws:
    
    IOException
  - reset
```
public void reset()
           throws IOException
```
    Overrides:
    
    reset in class Tokenizer
    
    Throws:
    
    IOException
  - end
```
public final void end()
               throws IOException
```
    Overrides:
    
    end in class TokenStream
    
    Throws:
    
    IOException
  - isSafeEnd
```
protected boolean isSafeEnd(char ch)
```
    For sentence tokenization, these are the unambiguous break positions.
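
This page does not list which characters count as safe. The sketch below assumes they are hard line and paragraph separators, and shows how such a predicate could be used to pick a cut point in a full buffer so that a sentence is not split across two fills. Both the character set and the `findSafeEnd` helper are assumptions made for illustration.

```java
class SafeEndSketch {
    /** Assumed unambiguous break characters: CR, LF, NEL, LS, PS. */
    static boolean isSafeEnd(char ch) {
        switch (ch) {
            case 0x000D: // carriage return
            case 0x000A: // line feed
            case 0x0085: // next line
            case 0x2028: // line separator
            case 0x2029: // paragraph separator
                return true;
            default:
                return false;
        }
    }

    /**
     * Returns the index just past the last safe break in buffer[0, length),
     * or -1 if the range contains none.
     */
    static int findSafeEnd(char[] buffer, int length) {
        for (int i = length - 1; i >= 0; i--) {
            if (isSafeEnd(buffer[i])) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

A subclass that overrides `isSafeEnd` would change which positions such a scan accepts.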
  - setNextSentence
```
protected abstract void setNextSentence(int sentenceStart,
                                        int sentenceEnd)
```
    Provides the next input sentence for analysis.
  - incrementWord
```
protected abstract boolean incrementWord()
```
    Returns true if another word is available.
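
The contract between the two abstract methods can be sketched without Lucene: the base class hands over one sentence range at a time, and the subclass then yields words from that range until it returns false. `WordContractSketch`, its fields, and the `tokenize` driver are hypothetical; a real subclass would set term and offset attributes instead of collecting strings.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordContractSketch {
    private final char[] buffer;
    private final BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
    private int sentenceStart;
    int wordStart; // start of the current word within buffer
    int wordEnd;   // end (exclusive) of the current word within buffer

    WordContractSketch(String text) {
        this.buffer = text.toCharArray();
    }

    /** Receives the next sentence range and positions the word iterator on it. */
    void setNextSentence(int sentenceStart, int sentenceEnd) {
        this.sentenceStart = sentenceStart;
        words.setText(new String(buffer, sentenceStart, sentenceEnd - sentenceStart));
    }

    /** Advances to the next word in the current sentence; false when none remain. */
    boolean incrementWord() {
        int start = words.current();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            int cp = Character.codePointAt(buffer, sentenceStart + start);
            if (Character.isLetterOrDigit(cp)) {
                wordStart = sentenceStart + start;
                wordEnd = sentenceStart + end;
                return true;
            }
        }
        return false;
    }

    /** Stand-in for the base class loop: sentences outside, words inside. */
    static List<String> tokenize(String text) {
        WordContractSketch tokenizer = new WordContractSketch(text);
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);
        List<String> terms = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            tokenizer.setNextSentence(start, end);
            while (tokenizer.incrementWord()) {
                terms.add(new String(tokenizer.buffer, tokenizer.wordStart,
                        tokenizer.wordEnd - tokenizer.wordStart));
            }
        }
        return terms;
    }
}
```

Calling `WordContractSketch.tokenize("Hi there. Bye now.")` returns `[Hi, there, Bye, now]`.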
