StandardTokenizer (Lucene 4.10.2 API)

java.lang.Object
- org.apache.lucene.util.AttributeSource
  - org.apache.lucene.analysis.TokenStream
    - org.apache.lucene.analysis.Tokenizer
      - org.apache.lucene.analysis.standard.StandardTokenizer

All Implemented Interfaces:

Closeable, AutoCloseable
```
public final class StandardTokenizer
extends Tokenizer
```
A grammar-based tokenizer constructed with JFlex.
As of Lucene version 3.1, this class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
The Version you specify when creating StandardTokenizer determines its backwards-compatible behavior (note that the constructors taking a Version are marked deprecated in this release):
- As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you specify an earlier version, you get the old, incorrect splitting for backwards compatibility.
- As of 3.1, StandardTokenizer implements Unicode text segmentation. If you specify an earlier version, you get the exact behavior of ClassicTokenizer for backwards compatibility.
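
To give a feel for the Unicode word-break segmentation described above, here is a minimal sketch that uses only the JDK's java.text.BreakIterator rather than Lucene. It is an illustration under stated assumptions, not StandardTokenizer's implementation: the class name WordBreakSketch, the method words, and the rule of keeping only segments that start with a letter or digit are hypothetical, and the boundaries BreakIterator reports may differ from those of Lucene's JFlex grammar.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    // Splits text at the JDK's word boundaries and keeps only segments that
    // start with a letter or digit (a hypothetical stand-in for "is a token").
    public static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Whitespace and punctuation segments are discarded.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("The quick brown fox"));
    }
}
```

If the JDK's boundaries do not suit an application, the advice above still applies: a custom grammar-based tokenizer gives full control over the rules.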

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.State

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`ACRONYM` Deprecated. (3.1)
`static int`	`ACRONYM_DEP` Deprecated. (3.1)
`static int`	`ALPHANUM`
`static int`	`APOSTROPHE` Deprecated. (3.1)
`static int`	`CJ` Deprecated. (3.1)
`static int`	`COMPANY` Deprecated. (3.1)
`static int`	`EMAIL`
`static int`	`HANGUL`
`static int`	`HIRAGANA`
`static int`	`HOST` Deprecated. (3.1)
`static int`	`IDEOGRAPHIC`
`static int`	`KATAKANA`
`static int`	`NUM`
`static int`	`SOUTHEAST_ASIAN`
`static String[]`	`TOKEN_TYPES` String token types that correspond to token type int constants

Fields inherited from class org.apache.lucene.analysis.Tokenizer
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY

Fields inherited from class org.apache.lucene.util.AttributeSource
DEFAULT_ATTRIBUTE_FACTORY

Constructor Summary

Constructors
Constructor and Description
`StandardTokenizer(AttributeFactory factory, Reader input)` Creates a new StandardTokenizer with a given `AttributeFactory`
`StandardTokenizer(Reader input)` Creates a new instance of the `StandardTokenizer`.
`StandardTokenizer(Version matchVersion, AttributeFactory factory, Reader input)` Deprecated. Use `StandardTokenizer(AttributeFactory, Reader)`
`StandardTokenizer(Version matchVersion, Reader input)` Deprecated. Use `StandardTokenizer(Reader)`

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`close()`
`void`	`end()`
`int`	`getMaxTokenLength()`
`boolean`	`incrementToken()`
`void`	`reset()`
`void`	`setMaxTokenLength(int length)` Set the max allowed token length.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - ALPHANUM
```
public static final int ALPHANUM
```
    See Also:
    Constant Field Values
  - APOSTROPHE
```
@Deprecated
public static final int APOSTROPHE
```
    Deprecated. (3.1)
    
    See Also:
    Constant Field Values
  - ACRONYM
```
@Deprecated
public static final int ACRONYM
```
    Deprecated. (3.1)
    
    See Also:
    Constant Field Values
  - COMPANY
```
@Deprecated
public static final int COMPANY
```
    Deprecated. (3.1)
    
    See Also:
    Constant Field Values
  - EMAIL
```
public static final int EMAIL
```
    See Also:
    Constant Field Values
  - HOST
```
@Deprecated
public static final int HOST
```
    Deprecated. (3.1)
    
    See Also:
    Constant Field Values
  - NUM
```
public static final int NUM
```
    See Also:
    Constant Field Values
  - CJ
```
@Deprecated
public static final int CJ
```
    Deprecated. (3.1)
    
    See Also:
    Constant Field Values
  - ACRONYM_DEP
```
@Deprecated
public static final int ACRONYM_DEP
```
    Deprecated. (3.1)
    
    See Also:
    Constant Field Values
  - SOUTHEAST_ASIAN
```
public static final int SOUTHEAST_ASIAN
```
    See Also:
    Constant Field Values
  - IDEOGRAPHIC
```
public static final int IDEOGRAPHIC
```
    See Also:
    Constant Field Values
  - HIRAGANA
```
public static final int HIRAGANA
```
    See Also:
    Constant Field Values
  - KATAKANA
```
public static final int KATAKANA
```
    See Also:
    Constant Field Values
  - HANGUL
```
public static final int HANGUL
```
    See Also:
    Constant Field Values
  - TOKEN_TYPES
```
public static final String[] TOKEN_TYPES
```
    String token types that correspond to token type int constants
- Constructor Detail
  - StandardTokenizer
```
public StandardTokenizer(Reader input)
```
    Creates a new instance of the StandardTokenizer. Attaches the input to the newly created JFlex scanner.
    
    Parameters:
    input - The input reader. See http://issues.apache.org/jira/browse/LUCENE-1068
  - StandardTokenizer
```
@Deprecated
public StandardTokenizer(Version matchVersion,
                         Reader input)
```
    Deprecated. Use StandardTokenizer(Reader)
  - StandardTokenizer
```
public StandardTokenizer(AttributeFactory factory,
                         Reader input)
```
    Creates a new StandardTokenizer with a given AttributeFactory
  - StandardTokenizer
```
@Deprecated
public StandardTokenizer(Version matchVersion,
                         AttributeFactory factory,
                         Reader input)
```
    Deprecated. Use StandardTokenizer(AttributeFactory, Reader)
- Method Detail
  - setMaxTokenLength
```
public void setMaxTokenLength(int length)
```
    Sets the maximum allowed token length. Any token longer than this is skipped.
  - getMaxTokenLength
```
public int getMaxTokenLength()
```
    See Also:
    setMaxTokenLength(int)
  - incrementToken
```
public final boolean incrementToken()
                             throws IOException
```
    Specified by:
    incrementToken in class TokenStream
    Throws:
    IOException
  - end
```
public final void end()
               throws IOException
```
    Overrides:
    end in class TokenStream
    Throws:
    IOException
  - close
```
public void close()
           throws IOException
```
    Specified by:
    close in interface Closeable
    Specified by:
    close in interface AutoCloseable
    Overrides:
    close in class Tokenizer
    Throws:
    IOException
  - reset
```
public void reset()
           throws IOException
```
    Overrides:
    reset in class Tokenizer
    Throws:
    IOException
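
The skipping rule documented for setMaxTokenLength above can be sketched in plain Java. This is an illustration only, not Lucene's code: the class MaxTokenLengthSketch, its filter method, and the pre-split token list are assumptions made for the example, whereas StandardTokenizer applies the limit while scanning its input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MaxTokenLengthSketch {

    // Returns the tokens whose length does not exceed maxTokenLength.
    // Longer tokens are skipped entirely; they are not truncated.
    public static List<String> filter(List<String> tokens, int maxTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical input; with a limit of 4, "tokenizer" is skipped.
        List<String> tokens = Arrays.asList("a", "tokenizer", "word");
        System.out.println(filter(tokens, 4));
    }
}
```

The point of the sketch is that an over-long token disappears from the output altogether, so downstream consumers never see a shortened form of it.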
