Package org.apache.lucene.tests.analysis
Class MockTokenizer
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.Tokenizer
        - org.apache.lucene.tests.analysis.MockTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable
public class MockTokenizer extends Tokenizer

Tokenizer for testing. This tokenizer is a replacement for WHITESPACE, SIMPLE, and KEYWORD tokenizers. If you are writing a component such as a TokenFilter, it is a great idea to test it wrapping this tokenizer instead, for extra checks. This tokenizer has the following behavior:
- An internal state-machine is used for checking consumer consistency. These checks can be disabled with setEnableChecks(boolean).
- For convenience, it optionally lowercases the terms that it outputs.
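As a sketch of the intended usage, a test might build a MockTokenizer over the WHITESPACE automaton and consume it with the standard workflow (reset, incrementToken loop, end, close) that the internal state machine verifies. The class and helper names below are illustrative, not part of the library; in a real TokenFilter test you would wrap the tokenizer with the filter under test before consuming:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class MockTokenizerExample {
  // Collects the terms a stream produces, following the consumer workflow
  // that MockTokenizer's state machine checks:
  // reset() -> incrementToken() loop -> end() -> close().
  static List<String> tokens(TokenStream stream) throws IOException {
    List<String> result = new ArrayList<>();
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      result.add(termAtt.toString());
    }
    stream.end();
    stream.close();
    return result;
  }

  public static void main(String[] args) throws IOException {
    // WHITESPACE automaton, lowercasing enabled.
    MockTokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, true);
    tokenizer.setReader(new StringReader("Hello Lucene"));
    System.out.println(tokens(tokenizer)); // [hello, lucene]
  }
}
```

If the consumer skips a required step (for example, forgetting reset() or end()), the checks fail the test, which is exactly the extra coverage this tokenizer is meant to provide.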
Field Summary

Fields:
- static int DEFAULT_MAX_TOKEN_LENGTH: Limit the default token length to a size that doesn't cause random analyzer failures on unpredictable data like the enwiki data set.
- static CharacterRunAutomaton KEYWORD: Acts similar to KeywordTokenizer.
- static CharacterRunAutomaton SIMPLE: Acts like LetterTokenizer.
- static CharacterRunAutomaton WHITESPACE: Acts similar to WhitespaceTokenizer.

Fields inherited from class org.apache.lucene.analysis.TokenStream:
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary

Constructors:
- MockTokenizer()
- MockTokenizer(AttributeFactory factory)
- MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase)
- MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
- MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase)
- MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
-
Method Summary

All Methods, Instance Methods, Concrete Methods:
- void close()
- void end()
- boolean incrementToken()
- protected boolean isTokenChar(int c)
- protected int normalize(int c)
- protected int readChar()
- protected int readCodePoint()
- void reset()
- void setEnableChecks(boolean enableChecks): Toggle consumer workflow checking: if your test consumes token streams normally, you should leave this enabled.
- protected void setReaderTestPoint()
-
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Field Detail
-
WHITESPACE
public static final CharacterRunAutomaton WHITESPACE
Acts similar to WhitespaceTokenizer.
-
KEYWORD
public static final CharacterRunAutomaton KEYWORD
Acts similar to KeywordTokenizer. TODO: Keyword returns an "empty" token for an empty reader...
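The keyword behavior is easy to observe directly: with the KEYWORD automaton, the entire input (up to the maximum token length) is emitted as a single token. A minimal sketch, assuming the input fits within the default maximum token length:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class KeywordExample {
  public static void main(String[] args) throws IOException {
    // KEYWORD automaton, no lowercasing: the whole input is one token.
    MockTokenizer tokenizer = new MockTokenizer(MockTokenizer.KEYWORD, false);
    tokenizer.setReader(new StringReader("Hello World"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // prints "Hello World" once
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```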
-
SIMPLE
public static final CharacterRunAutomaton SIMPLE
Acts like LetterTokenizer.
-
DEFAULT_MAX_TOKEN_LENGTH
public static final int DEFAULT_MAX_TOKEN_LENGTH
Limit the default token length to a size that doesn't cause random analyzer failures on unpredictable data like the enwiki data set. This value defaults to CharTokenizer.DEFAULT_MAX_WORD_LEN (255).
- See Also: https://issues.apache.org/jira/browse/LUCENE-10541, Constant Field Values
-
-
Constructor Detail
-
MockTokenizer
public MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
-
MockTokenizer
public MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
-
MockTokenizer
public MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase)
-
MockTokenizer
public MockTokenizer()
-
MockTokenizer
public MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase)
-
MockTokenizer
public MockTokenizer(AttributeFactory factory)
-
-
Method Detail
-
incrementToken
public final boolean incrementToken() throws IOException
- Specified by: incrementToken in class TokenStream
- Throws: IOException
-
readCodePoint
protected int readCodePoint() throws IOException
- Throws: IOException
-
readChar
protected int readChar() throws IOException
- Throws: IOException
-
isTokenChar
protected boolean isTokenChar(int c)
-
normalize
protected int normalize(int c)
-
reset
public void reset() throws IOException
- Overrides: reset in class Tokenizer
- Throws: IOException
-
close
public void close() throws IOException
- Specified by: close in interface AutoCloseable
- Specified by: close in interface Closeable
- Overrides: close in class Tokenizer
- Throws: IOException
-
setReaderTestPoint
protected void setReaderTestPoint()
- Overrides: setReaderTestPoint in class Tokenizer
-
end
public void end() throws IOException
- Overrides: end in class TokenStream
- Throws: IOException
-
setEnableChecks
public void setEnableChecks(boolean enableChecks)
Toggle consumer workflow checking: if your test consumes token streams normally, you should leave this enabled.
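A sketch of when disabling the checks is useful: a test that deliberately stops consuming early and closes the stream without calling end(), which the state machine would otherwise flag. The class name below is illustrative:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class PartialConsumeExample {
  public static void main(String[] args) throws IOException {
    MockTokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, true);
    // This consumer stops early and never calls end(), so the built-in
    // workflow checks must be switched off for the test to pass.
    tokenizer.setEnableChecks(false);
    tokenizer.setReader(new StringReader("one two three"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    tokenizer.incrementToken(); // read only the first token
    System.out.println(term.toString()); // one
    tokenizer.close(); // close() without end(): flagged when checks are on
  }
}
```

For ordinary tests that consume the stream fully and correctly, leave the checks enabled; they are the point of using MockTokenizer.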