org.apache.lucene.analysis.classic.ClassicTokenizer

All Implemented Interfaces: Closeable, AutoCloseable

public final class ClassicTokenizer extends Tokenizer

A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
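
The rules above can be observed by running the tokenizer over a sample string. The following is a minimal sketch, assuming Lucene 9.x with the analysis-common module on the classpath. The class name `ClassicTokenizerDemo`, the helper `tokenize`, and the sample text are illustrative and not part of Lucene. It follows the usual TokenStream consumer workflow: `setReader`, `reset`, `incrementToken` until it returns false, `end`, then `close`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.classic.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ClassicTokenizerDemo {

    // Hypothetical helper (not a Lucene API): returns each token as "term [type]".
    static List<String> tokenize(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        // try-with-resources calls close() once the stream has been consumed.
        try (ClassicTokenizer tokenizer = new ClassicTokenizer()) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();                      // must precede incrementToken()
            while (tokenizer.incrementToken()) {
                tokens.add(term.toString() + " [" + type.type() + "]");
            }
            tokenizer.end();                        // finalizes end-of-stream state
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative input containing an email address, a hyphenated
        // token with a number, and a hostname.
        String sample = "Mail jane@example.com about part XL-5000 at www.example.org.";
        tokenize(sample).forEach(System.out::println);
    }
}
```

Printing the type next to each term makes it easy to see which grammar rule matched a given piece of text, which helps when deciding whether this tokenizer suits an application.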

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, StandardTokenizer implements Unicode text segmentation, as specified by UAX#29.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final int | ACRONYM | |
| static final int | ACRONYM_DEP | |
| static final int | ALPHANUM | |
| static final int | APOSTROPHE | |
| static final int | CJ | |
| static final int | COMPANY | |
| static final int | EMAIL | |
| static final int | HOST | |
| static final int | NUM | |
| static final String[] | TOKEN_TYPES | String token types that correspond to token type int constants |

Fields inherited from class org.apache.lucene.analysis.Tokenizer
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| ClassicTokenizer() | Creates a new instance of the ClassicTokenizer. |
| ClassicTokenizer(AttributeFactory factory) | Creates a new ClassicTokenizer with a given AttributeFactory |
Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | close() | |
| final void | end() | |
| int | getMaxTokenLength() | |
| final boolean | incrementToken() | |
| void | reset() | |
| void | setMaxTokenLength(int length) | Set the max allowed token length. |

Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader, setReaderTestPoint

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Field Details
- ALPHANUM
  
  public static final int ALPHANUM
  See Also:
  
  Constant Field Values
- APOSTROPHE
  
  public static final int APOSTROPHE
  See Also:
  
  Constant Field Values
- ACRONYM
  
  public static final int ACRONYM
  See Also:
  
  Constant Field Values
- COMPANY
  
  public static final int COMPANY
  See Also:
  
  Constant Field Values
- EMAIL
  
  public static final int EMAIL
  See Also:
  
  Constant Field Values
- HOST
  
  public static final int HOST
  See Also:
  
  Constant Field Values
- NUM
  
  public static final int NUM
  See Also:
  
  Constant Field Values
- CJ
  
  public static final int CJ
  See Also:
  
  Constant Field Values
- ACRONYM_DEP
  
  public static final int ACRONYM_DEP
  See Also:
  
  Constant Field Values
- TOKEN_TYPES
  
  public static final String[] TOKEN_TYPES
  
  String token types that correspond to token type int constants
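
As a small illustration of how the int constants and `TOKEN_TYPES` relate, the sketch below indexes the array with each constant. It assumes Lucene 9.x on the classpath and that every constant is a valid index into `TOKEN_TYPES`, as the field description suggests. The class `TokenTypeNames` and the helper `typeName` are hypothetical names used only for this example.

```java
import org.apache.lucene.analysis.classic.ClassicTokenizer;

public class TokenTypeNames {

    // Hypothetical helper (not a Lucene API): maps an int constant to its type string.
    static String typeName(int tokenType) {
        return ClassicTokenizer.TOKEN_TYPES[tokenType];
    }

    public static void main(String[] args) {
        int[] constants = {
            ClassicTokenizer.ALPHANUM, ClassicTokenizer.APOSTROPHE,
            ClassicTokenizer.ACRONYM, ClassicTokenizer.COMPANY,
            ClassicTokenizer.EMAIL, ClassicTokenizer.HOST,
            ClassicTokenizer.NUM, ClassicTokenizer.CJ,
            ClassicTokenizer.ACRONYM_DEP
        };
        // Print each constant's numeric value next to its string form.
        for (int c : constants) {
            System.out.println(c + " -> " + typeName(c));
        }
    }
}
```

The same strings are what a `TypeAttribute` reports for tokens produced by this tokenizer, so downstream filters can compare against `TOKEN_TYPES` entries instead of hard-coded literals.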
Constructor Details
- ClassicTokenizer
  
  public ClassicTokenizer()
  
  Creates a new instance of the ClassicTokenizer. Attaches the input to the newly created JFlex scanner.
  See http://issues.apache.org/jira/browse/LUCENE-1068
- ClassicTokenizer
  
  public ClassicTokenizer(AttributeFactory factory)
  
  Creates a new ClassicTokenizer with a given AttributeFactory
Method Details
- setMaxTokenLength
  
  public void setMaxTokenLength(int length)
  
  Set the max allowed token length. Any token longer than this is skipped.
- getMaxTokenLength
  
  public int getMaxTokenLength()
  See Also:
  
  setMaxTokenLength(int)
- incrementToken
  
  public final boolean incrementToken() throws IOException
  
  Specified by:
  
  incrementToken in class TokenStream
  
  Throws:
  
  IOException
- end
  
  public final void end() throws IOException
  
  Overrides:
  
  end in class TokenStream
  
  Throws:
  
  IOException
- close
  
  public void close() throws IOException
  
  Specified by:
  
  close in interface AutoCloseable
  
  Specified by:
  
  close in interface Closeable
  
  Overrides:
  
  close in class Tokenizer
  
  Throws:
  
  IOException
- reset
  
  public void reset() throws IOException
  
  Overrides:
  
  reset in class Tokenizer
  
  Throws:
  
  IOException
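
To illustrate the effect of `setMaxTokenLength` described above, here is a minimal sketch, again assuming Lucene 9.x on the classpath. The class `MaxTokenLengthDemo`, the helper `terms`, and the sample input are hypothetical and chosen only for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.classic.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MaxTokenLengthDemo {

    // Hypothetical helper (not a Lucene API): tokenizes text with a custom length limit.
    static List<String> terms(String text, int maxTokenLength) throws IOException {
        List<String> result = new ArrayList<>();
        try (ClassicTokenizer tokenizer = new ClassicTokenizer()) {
            tokenizer.setMaxTokenLength(maxTokenLength);
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                result.add(term.toString());
            }
            tokenizer.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // "extraordinarily" has 15 characters, so a limit of 5 should skip it
        // while the shorter words pass through.
        System.out.println(terms("short extraordinarily tiny", 5));
    }
}
```

Because over-long tokens are skipped rather than truncated, a low limit silently removes text from the stream. It is worth checking `getMaxTokenLength()` when tokens appear to be missing.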
