ICUTokenizer (Lucene 4.1.0 API)

java.lang.Object
- org.apache.lucene.util.AttributeSource
  - org.apache.lucene.analysis.TokenStream
    - org.apache.lucene.analysis.Tokenizer
      - org.apache.lucene.analysis.icu.segmentation.ICUTokenizer

All Implemented Interfaces:
Closeable
```
public final class ICUTokenizer
extends Tokenizer
```
Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).
Words are first broken across script boundaries, then segmented according to the BreakIterator and token typing provided by the ICUTokenizerConfig.

See Also:
ICUTokenizerConfig
WARNING: This API is experimental and might change in incompatible ways in the next release.
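
The two-stage process described above (split at script boundaries, then run a word BreakIterator inside each run) can be illustrated with a minimal sketch. This is not ICUTokenizer's implementation: it assumes the JDK's `java.text.BreakIterator` and `Character.UnicodeScript` as stand-ins for ICU, and the class name `ScriptAwareWordSketch`, its methods, and the rule that COMMON and INHERITED characters stay with the current run are all illustrative assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; uses JDK classes, not ICU or Lucene.
public class ScriptAwareWordSketch {

    /** Splits text into single-script runs; COMMON and INHERITED characters stay with the current run. */
    public static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        Character.UnicodeScript current = null;
        int runStart = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    // Script changed: close the previous run here.
                    runs.add(text.substring(runStart, i));
                    runStart = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            runs.add(text.substring(runStart));
        }
        return runs;
    }

    /** Applies the JDK word BreakIterator inside each run, keeping pieces that contain a letter or digit. */
    public static List<String> words(String text) {
        List<String> words = new ArrayList<>();
        for (String run : scriptRuns(text)) {
            BreakIterator breaker = BreakIterator.getWordInstance(Locale.ROOT);
            breaker.setText(run);
            int start = breaker.first();
            for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
                String piece = run.substring(start, end);
                if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    words.add(piece);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Latin "abc" followed directly by three Cyrillic letters, written as escapes to stay encoding-safe.
        System.out.println(words("abc\u043c\u0438\u0440 hello"));
    }
}
```

Splitting by script first means that adjacent Latin and Cyrillic letters end up in separate tokens even with no whitespace between them, which a single word-break pass over the whole string would not necessarily guarantee.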

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.AttributeFactory, AttributeSource.State

Field Summary
- Fields inherited from class org.apache.lucene.analysis.Tokenizer
  input

Constructor Summary

Constructors
Constructor and Description
`ICUTokenizer(Reader input)` Construct a new ICUTokenizer that breaks text into words from the given Reader.
`ICUTokenizer(Reader input, ICUTokenizerConfig config)` Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.

Method Summary

Methods
Modifier and Type, Method and Description
`void end()`
`boolean incrementToken()`
`void reset()`
- Methods inherited from class org.apache.lucene.analysis.Tokenizer
  close, correctOffset, setReader
- Methods inherited from class org.apache.lucene.util.AttributeSource
  addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
- Methods inherited from class java.lang.Object
  clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ICUTokenizer
```
public ICUTokenizer(Reader input)
```
    Construct a new ICUTokenizer that breaks text into words from the given Reader.
    The default script-specific handling is used.
    
    Parameters:
    input - Reader containing text to tokenize.
    See Also:
    DefaultICUTokenizerConfig
  - ICUTokenizer
```
public ICUTokenizer(Reader input,
            ICUTokenizerConfig config)
```
    Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
    
    Parameters:
    input - Reader containing text to tokenize.
    config - Tailored BreakIterator configuration
- Method Detail
  - incrementToken
```
public boolean incrementToken()
                       throws IOException
```
    Specified by:
    incrementToken in class TokenStream
    Throws:
    IOException
  - reset
```
public void reset()
           throws IOException
```
    Overrides:
    reset in class TokenStream
    Throws:
    IOException
  - end
```
public void end()
```
    Overrides:
    end in class TokenStream
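
The three methods above are meant to be called in a fixed order by the consumer: `reset()` once, `incrementToken()` until it returns false, then `end()`, and finally the inherited `close()`. The sketch below illustrates that order only. It deliberately avoids the Lucene and ICU jars so that it runs on a plain JDK: `SketchTokenizer` is a hypothetical stand-in that mirrors the method names, its `term()` accessor is an assumption standing in for a term attribute, and its word breaking uses the JDK's `java.text.BreakIterator` rather than ICU.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizerLifecycleSketch {

    /** Hypothetical stand-in, NOT a Lucene class; it only mirrors the method names documented above. */
    public static final class SketchTokenizer implements Closeable {
        private final Reader input;
        private String text;
        private BreakIterator breaker;
        private int start;
        private String term;
        private int finalOffset = -1;

        public SketchTokenizer(Reader input) {
            this.input = input;
        }

        /** Reads the whole input and positions the iterator before the first token. */
        public void reset() throws IOException {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int n = input.read(buf); n != -1; n = input.read(buf)) {
                sb.append(buf, 0, n);
            }
            text = sb.toString();
            breaker = BreakIterator.getWordInstance(Locale.ROOT);
            breaker.setText(text);
            start = breaker.first();
        }

        /** Advances to the next word; returns false once the input is exhausted. */
        public boolean incrementToken() throws IOException {
            if (breaker == null) {
                throw new IllegalStateException("call reset() first");
            }
            for (int end = breaker.next(); end != BreakIterator.DONE; end = breaker.next()) {
                String piece = text.substring(start, end);
                start = end;
                // Skip whitespace and punctuation segments.
                if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    term = piece;
                    return true;
                }
            }
            return false;
        }

        /** End-of-stream bookkeeping; here it only records the final offset. */
        public void end() {
            finalOffset = text == null ? 0 : text.length();
        }

        public String term() {
            return term;
        }

        public int finalOffset() {
            return finalOffset;
        }

        @Override
        public void close() throws IOException {
            input.close();
        }
    }

    /** Consumer workflow: reset, incrementToken until false, end, then close. */
    public static List<String> consume(SketchTokenizer tokenizer) {
        List<String> terms = new ArrayList<>();
        try (SketchTokenizer t = tokenizer) { // close() runs automatically at the end
            t.reset();
            while (t.incrementToken()) {
                terms.add(t.term());
            }
            t.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new SketchTokenizer(new StringReader("one two, three"))));
    }
}
```

With the real class, the same loop would presumably be written against an `ICUTokenizer` instance, reading each term through an attribute obtained from the inherited `addAttribute` method rather than through a `term()` accessor.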

Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.