UnicodeWhitespaceTokenizer (Lucene 8.11.3 API)

java.lang.Object
- org.apache.lucene.util.AttributeSource
  - org.apache.lucene.analysis.TokenStream
    - org.apache.lucene.analysis.Tokenizer
      - org.apache.lucene.analysis.util.CharTokenizer
        - org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer

All Implemented Interfaces:

Closeable, AutoCloseable
```
public final class UnicodeWhitespaceTokenizer
extends CharTokenizer
```
A UnicodeWhitespaceTokenizer is a tokenizer that divides text at whitespace, as defined by Unicode's WHITESPACE property. Each maximal run of adjacent non-whitespace characters forms one token.
For the Unicode version in use, see: UnicodeProps
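
The splitting rule can be illustrated without Lucene on the classpath. The following is a minimal sketch, not the tokenizer's implementation: it assumes that `java.util.regex`'s `\p{IsWhite_Space}` binary property is a close stand-in for the WHITESPACE property referenced above, and the class name `WhiteSpaceSplitSketch` and its `tokenize` method are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Lucene.
final class WhiteSpaceSplitSketch {
    // \p{IsWhite_Space} is java.util.regex's name for the Unicode White_Space binary property.
    // Assumption: the JDK's Unicode version may differ from the one Lucene bundles.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}+");

    // Returns the maximal runs of non-whitespace characters in text.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : WHITE_SPACE.split(text)) {
            // split() yields an empty first element when text starts with whitespace.
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+00A0 (no-break space) and U+3000 (ideographic space) both carry White_Space.
        System.out.println(tokenize("foo\u00A0bar\u3000baz qux")); // prints [foo, bar, baz, qux]
    }
}
```

Note that `Character.isWhitespace` returns false for U+00A0, so a splitter built on that method would keep `foo` and `bar` together in the input above.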

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.State

Field Summary
- Fields inherited from class org.apache.lucene.analysis.util.CharTokenizer
  DEFAULT_MAX_WORD_LEN
- Fields inherited from class org.apache.lucene.analysis.Tokenizer
  input
- Fields inherited from class org.apache.lucene.analysis.TokenStream
  DEFAULT_TOKEN_ATTRIBUTE_FACTORY

Constructor Summary

Constructors
Constructor and Description
`UnicodeWhitespaceTokenizer()` Construct a new UnicodeWhitespaceTokenizer.
`UnicodeWhitespaceTokenizer(AttributeFactory factory)` Construct a new UnicodeWhitespaceTokenizer using a given `AttributeFactory`.
`UnicodeWhitespaceTokenizer(AttributeFactory factory, int maxTokenLen)` Construct a new UnicodeWhitespaceTokenizer using a given `AttributeFactory` and a maximum token length.

Method Summary

Modifier and Type	Method and Description
`protected boolean`	`isTokenChar(int c)` Collects only characters which do not satisfy Unicode's WHITESPACE property.

Methods inherited from class org.apache.lucene.analysis.util.CharTokenizer
end, fromSeparatorCharPredicate, fromSeparatorCharPredicate, fromTokenCharPredicate, fromTokenCharPredicate, incrementToken, reset

Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - UnicodeWhitespaceTokenizer
```
public UnicodeWhitespaceTokenizer()
```
    Construct a new UnicodeWhitespaceTokenizer.
  - UnicodeWhitespaceTokenizer
```
public UnicodeWhitespaceTokenizer(AttributeFactory factory)
```
    Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory.
    
    Parameters:
    
    factory - the attribute factory to use for this Tokenizer
  - UnicodeWhitespaceTokenizer
```
public UnicodeWhitespaceTokenizer(AttributeFactory factory,
                                  int maxTokenLen)
```
    Construct a new UnicodeWhitespaceTokenizer using a given AttributeFactory and a maximum token length.
    
    Parameters:
    
    factory - the attribute factory to use for this Tokenizer
    
    maxTokenLen - maximum length of any token the tokenizer will emit. Must be greater than 0 and less than MAX_TOKEN_LENGTH_LIMIT (1024*1024).
    
    Throws:
    
    IllegalArgumentException - if maxTokenLen is invalid.
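
    The constraint on maxTokenLen can be sketched in plain Java. This is an illustration under stated assumptions, not Lucene's code: the class `MaxTokenLenSketch` and both of its methods are hypothetical, the range check follows the wording of the parameter description (the real check may treat the limit value itself differently), and the splitting of an over-long run into consecutive pieces is an assumption about how a length cap is typically enforced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene.
final class MaxTokenLenSketch {
    // Value quoted in the maxTokenLen parameter description.
    static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

    // Rejects values outside the documented range.
    // Assumption: "less than" is read strictly here.
    static int checkMaxTokenLen(int maxTokenLen) {
        if (maxTokenLen <= 0 || maxTokenLen >= MAX_TOKEN_LENGTH_LIMIT) {
            throw new IllegalArgumentException(
                "maxTokenLen must be greater than 0 and less than "
                    + MAX_TOKEN_LENGTH_LIMIT + ", passed: " + maxTokenLen);
        }
        return maxTokenLen;
    }

    // Assumption: a run longer than maxTokenLen is emitted as consecutive
    // pieces of at most maxTokenLen UTF-16 chars each.
    static List<String> splitLongToken(String token, int maxTokenLen) {
        checkMaxTokenLen(maxTokenLen);
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < token.length(); start += maxTokenLen) {
            int end = Math.min(token.length(), start + maxTokenLen);
            parts.add(token.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitLongToken("abcdefg", 3)); // prints [abc, def, g]
    }
}
```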
- Method Detail
  - isTokenChar
```
protected boolean isTokenChar(int c)
```
    Collects only characters which do not satisfy Unicode's WHITESPACE property.
    
    Specified by:
    
    isTokenChar in class CharTokenizer
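
    A predicate of this shape, and the way a character tokenizer might use it, can be sketched as follows. This is a stand-alone approximation rather than the class's source: `IsTokenCharSketch` and its methods are hypothetical, and the hard-coded code point list reflects the White_Space property in recent Unicode versions, which is an assumption about the version referenced through UnicodeProps.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not part of Lucene.
final class IsTokenCharSketch {
    // Code points with White_Space=Yes in recent Unicode versions (assumption).
    static boolean isUnicodeWhitespace(int c) {
        return (c >= 0x0009 && c <= 0x000D)
            || c == 0x0020 || c == 0x0085 || c == 0x00A0 || c == 0x1680
            || (c >= 0x2000 && c <= 0x200A)
            || c == 0x2028 || c == 0x2029 || c == 0x202F
            || c == 0x205F || c == 0x3000;
    }

    // A character belongs to a token exactly when it is not whitespace.
    static boolean isTokenChar(int c) {
        return !isUnicodeWhitespace(c);
    }

    // Walks the text by code point and collects runs of token characters.
    static List<String> collect(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+2003 is EM SPACE; two ASCII spaces in a row produce no empty token.
        System.out.println(collect("a\u2003b  c")); // prints [a, b, c]
    }
}
```

    U+200B (zero width space) does not carry the White_Space property, so this sketch treats it as a token character.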

Copyright © 2000-2024 Apache Software Foundation. All Rights Reserved.