LetterTokenizer (Lucene 7.2.0 API)

java.lang.Object
- org.apache.lucene.util.AttributeSource
  - org.apache.lucene.analysis.TokenStream
    - org.apache.lucene.analysis.Tokenizer
      - org.apache.lucene.analysis.util.CharTokenizer
        - org.apache.lucene.analysis.core.LetterTokenizer

All Implemented Interfaces:

Closeable, AutoCloseable

Direct Known Subclasses:

LowerCaseTokenizer
```
public class LetterTokenizer
extends CharTokenizer
```
A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as determined by the java.lang.Character.isLetter(int) predicate.
Note: this works reasonably well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.
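
To make the splitting rule concrete, here is a minimal standalone sketch in plain Java. It does not use Lucene: the class `LetterRunSketch` and the method `letterRuns` are hypothetical names introduced for illustration, and the loop only approximates the described behavior (maximal runs of code points accepted by `Character.isLetter(int)`). It ignores offsets, attributes, and the maximum token length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Lucene API.
final class LetterRunSketch {

    /** Splits text into maximal runs of code points for which Character.isLetter(int) is true. */
    static List<String> letterRuns(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // read a full code point, not a single char
            i += Character.charCount(cp);      // advance by 1 or 2 UTF-16 units
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);   // extend the current letter run
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // a non-letter ends the run
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());    // flush the trailing run
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Don, t, split, e, mail, me]
        System.out.println(letterRuns("Don't split e-mail2me"));
    }
}
```

Apostrophes, hyphens, and digits all act as separators here. An unspaced run of CJK text, by contrast, consists entirely of letters and comes back as a single token, which is the limitation the note above describes.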

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.State

Field Summary
- Fields inherited from class org.apache.lucene.analysis.util.CharTokenizer
  DEFAULT_MAX_WORD_LEN
- Fields inherited from class org.apache.lucene.analysis.Tokenizer
  input
- Fields inherited from class org.apache.lucene.analysis.TokenStream
  DEFAULT_TOKEN_ATTRIBUTE_FACTORY

Constructor Summary

Constructors
Constructor and Description
`LetterTokenizer()` Construct a new LetterTokenizer.
`LetterTokenizer(AttributeFactory factory)` Construct a new LetterTokenizer using a given `AttributeFactory`.
`LetterTokenizer(AttributeFactory factory, int maxTokenLen)` Construct a new LetterTokenizer using a given `AttributeFactory` and a maximum token length.

Method Summary

Modifier and Type, Method and Description
`protected boolean isTokenChar(int c)` Collects only characters which satisfy `Character.isLetter(int)`.
- Methods inherited from class org.apache.lucene.analysis.util.CharTokenizer
  end, fromSeparatorCharPredicate, fromSeparatorCharPredicate, fromSeparatorCharPredicate, fromSeparatorCharPredicate, fromTokenCharPredicate, fromTokenCharPredicate, fromTokenCharPredicate, fromTokenCharPredicate, incrementToken, normalize, reset
- Methods inherited from class org.apache.lucene.analysis.Tokenizer
  close, correctOffset, setReader
- Methods inherited from class org.apache.lucene.util.AttributeSource
  addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
- Methods inherited from class java.lang.Object
  clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - LetterTokenizer
```
public LetterTokenizer()
```
    Construct a new LetterTokenizer.
  - LetterTokenizer
```
public LetterTokenizer(AttributeFactory factory)
```
    Construct a new LetterTokenizer using a given AttributeFactory.
    
    Parameters:
    
    factory - the attribute factory to use for this Tokenizer
  - LetterTokenizer
```
public LetterTokenizer(AttributeFactory factory,
                       int maxTokenLen)
```
    Construct a new LetterTokenizer using a given AttributeFactory and a maximum token length.
    
    Parameters:
    
    factory - the attribute factory to use for this Tokenizer
    
    maxTokenLen - maximum token length the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024)
    
    Throws:
    
    IllegalArgumentException - if maxTokenLen is invalid.
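
The range check described for `maxTokenLen` can be sketched in plain Java as follows. This is an illustration of the documented contract, not Lucene's source: the class `MaxTokenLenSketch` and the method `checkMaxTokenLen` are hypothetical, the constant simply repeats the value quoted in the parameter description, and accepting a value exactly equal to the limit is an assumption of this sketch.

```java
// Hypothetical illustration only; not part of the Lucene API.
final class MaxTokenLenSketch {

    // Value quoted in the parameter description (1024*1024).
    static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

    /** Returns maxTokenLen if it is in range, otherwise throws IllegalArgumentException. */
    static int checkMaxTokenLen(int maxTokenLen) {
        // Assumption: the limit itself is treated as valid. The documented wording
        // says "less than", so the exact boundary behavior is not confirmed here.
        if (maxTokenLen <= 0 || maxTokenLen > MAX_TOKEN_LENGTH_LIMIT) {
            throw new IllegalArgumentException(
                "maxTokenLen must be greater than 0 and at most "
                    + MAX_TOKEN_LENGTH_LIMIT + ", got: " + maxTokenLen);
        }
        return maxTokenLen;
    }

    public static void main(String[] args) {
        System.out.println(checkMaxTokenLen(255)); // in range, returned unchanged
        try {
            checkMaxTokenLen(0);                   // out of range
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing fast in the constructor means an invalid length surfaces when the tokenizer is built rather than later, during tokenization.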
- Method Detail
  - isTokenChar
```
protected boolean isTokenChar(int c)
```
    Collects only characters which satisfy Character.isLetter(int).
    
    Specified by:
    
    isTokenChar in class CharTokenizer
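
The `int` parameter suggests the check works on Unicode code points rather than UTF-16 `char` values. The sketch below mirrors the predicate in plain Java so its effect on a few sample characters is easy to see. `IsTokenCharSketch` is a hypothetical standalone class, not a `CharTokenizer` subclass, and the sample characters are chosen only for illustration.

```java
// Hypothetical illustration only; not part of the Lucene API.
final class IsTokenCharSketch {

    /** Mirrors the documented rule: a code point is a token character if it is a letter. */
    static boolean isTokenChar(int c) {
        return Character.isLetter(c);
    }

    public static void main(String[] args) {
        int[] samples = {
            'a',      // Latin letter
            0x00E9,   // e with acute accent
            0x4E2D,   // CJK ideograph
            0x1D400,  // supplementary-plane letter (needs two chars in UTF-16)
            '7',      // digit
            '_',      // underscore
            ' '       // space
        };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %b%n", cp, isTokenChar(cp));
        }
    }
}
```

Letters from any script pass the check, while digits, underscores, and whitespace do not. This is why identifiers such as `foo_bar2` are split into several tokens.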

Copyright © 2000-2017 Apache Software Foundation. All Rights Reserved.