Package org.apache.lucene.analysis.ko
Class KoreanTokenizer
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.Tokenizer
        - org.apache.lucene.analysis.ko.KoreanTokenizer
- All Implemented Interfaces: Closeable, AutoCloseable
public final class KoreanTokenizer extends Tokenizer
Tokenizer for Korean that uses morphological analysis.
This tokenizer sets a number of additional attributes:
- PartOfSpeechAttribute containing part-of-speech.
- ReadingAttribute containing reading.
This tokenizer uses a rolling Viterbi search to find the least cost segmentation (path) of the incoming characters.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
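The least-cost segmentation idea can be illustrated with a small, self-contained sketch. This is not Lucene's implementation: the class ToyViterbiSegmenter, its word-cost map, and the per-character unknown cost are all hypothetical, the search runs over the whole string at once instead of rolling over a stream, and it ignores the connection costs between adjacent tokens that the real tokenizer takes from ConnectionCosts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy least-cost segmenter (hypothetical, for illustration only; not Lucene code). */
final class ToyViterbiSegmenter {
    private final Map<String, Integer> wordCosts; // assumed dictionary: surface form -> cost
    private final int unknownCharCost;            // assumed cost of one out-of-dictionary character
    private final int maxWordLength;

    ToyViterbiSegmenter(Map<String, Integer> wordCosts, int unknownCharCost) {
        this.wordCosts = new HashMap<>(wordCosts);
        this.unknownCharCost = unknownCharCost;
        int max = 1;
        for (String word : wordCosts.keySet()) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** Returns the segmentation of text whose summed token cost is minimal. */
    List<String> segment(String text) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i] = minimal cost of segmenting text[0, i)
        int[] back = new int[n + 1];   // back[i] = start offset of the last token on that path
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - maxWordLength); start < end; start++) {
                if (best[start] == Long.MAX_VALUE) {
                    continue; // no path reaches this start offset
                }
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    if (end - start != 1) {
                        continue; // unknown spans are only allowed as single characters
                    }
                    cost = unknownCharCost;
                }
                long total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Walk the back pointers from the end of the text to recover the tokens.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> costs = new HashMap<>();
        costs.put("a", 4);
        costs.put("b", 4);
        costs.put("c", 4);
        costs.put("ab", 5);
        costs.put("bc", 3);
        costs.put("abc", 9);
        System.out.println(new ToyViterbiSegmenter(costs, 10).segment("abcx")); // prints [a, bc, x]
    }
}
```

With these made-up costs, "a" + "bc" (cost 7) beats both "ab" + "c" and "abc" (cost 9 each), and the out-of-dictionary "x" falls back to a single-character token.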
Nested Class Summary
Nested Classes (modifier and type, class, description)
- static class KoreanTokenizer.DecompoundMode
  Decompound mode: this determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT and POS.Type.PREANALYSIS tokens.
- static class KoreanTokenizer.Type
  Token type reflecting the original source of this token.
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary
Fields (modifier and type, field, description)
- static KoreanTokenizer.DecompoundMode DEFAULT_DECOMPOUND
  Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary
Constructors (constructor, description)
- KoreanTokenizer()
  Creates a new KoreanTokenizer with default parameters.
- KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
  Create a new KoreanTokenizer supplying a custom system dictionary and unknown dictionary.
- KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)
  Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
- KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
  Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
Method Summary
Methods (modifier and type, method, description)
- void close()
- void end()
- boolean incrementToken()
- void reset()
- void setGraphvizFormatter(GraphvizFormatter dotOut)
  Expert: set this to produce graphviz (dot) output of the Viterbi lattice.
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Field Detail
DEFAULT_DECOMPOUND
public static final KoreanTokenizer.DecompoundMode DEFAULT_DECOMPOUND
Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).
Constructor Detail
KoreanTokenizer
public KoreanTokenizer()
Creates a new KoreanTokenizer with default parameters. Uses the default AttributeFactory.
KoreanTokenizer
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)
Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
- Parameters:
  factory - the AttributeFactory to use
  userDictionary - Optional: if non-null, user dictionary.
  mode - Decompound mode.
  outputUnknownUnigrams - if true outputs unigrams for unknown words.
KoreanTokenizer
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
- Parameters:
  factory - the AttributeFactory to use
  userDictionary - Optional: if non-null, user dictionary.
  mode - Decompound mode.
  outputUnknownUnigrams - if true outputs unigrams for unknown words.
  discardPunctuation - true if punctuation tokens should be dropped from the output.
KoreanTokenizer
public KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Create a new KoreanTokenizer supplying a custom system dictionary and unknown dictionary. This constructor provides an entry point for users that want to construct custom language models that can be used as input to DictionaryBuilder.
- Parameters:
  factory - the AttributeFactory to use
  systemDictionary - a custom known token dictionary
  unkDictionary - a custom unknown token dictionary
  connectionCosts - custom token transition costs
  userDictionary - Optional: if non-null, user dictionary.
  mode - Decompound mode.
  outputUnknownUnigrams - if true outputs unigrams for unknown words.
  discardPunctuation - true if punctuation tokens should be dropped from the output.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
Method Detail
setGraphvizFormatter
public void setGraphvizFormatter(GraphvizFormatter dotOut)
Expert: set this to produce graphviz (dot) output of the Viterbi lattice.
close
public void close() throws IOException
- Specified by: close in interface AutoCloseable
- Specified by: close in interface Closeable
- Overrides: close in class Tokenizer
- Throws: IOException
reset
public void reset() throws IOException
- Overrides: reset in class Tokenizer
- Throws: IOException
end
public void end() throws IOException
- Overrides: end in class TokenStream
- Throws: IOException
incrementToken
public boolean incrementToken() throws IOException
- Specified by: incrementToken in class TokenStream
- Throws: IOException
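The lifecycle methods listed here are normally called in the standard TokenStream consumer order: setReader, reset, incrementToken in a loop, end, then close. The sketch below shows that order using the five-argument constructor. It assumes the Lucene Korean (nori) analysis module and its bundled dictionaries are on the classpath; the class name KoreanTokenizerDemo and the helper tokenize are hypothetical, and the choice of DecompoundMode.MIXED is only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.ko.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Hypothetical demo class; not part of Lucene. */
final class KoreanTokenizerDemo {

    /** Returns each token as "surface/leftPOS", e.g. for debugging output. */
    static List<String> tokenize(String text) throws IOException {
        List<String> out = new ArrayList<>();
        // try-with-resources calls close() once the stream has been consumed.
        try (KoreanTokenizer tokenizer = new KoreanTokenizer(
                TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, // default attribute factory
                null,                                        // no user dictionary
                KoreanTokenizer.DecompoundMode.MIXED,        // illustrative choice of mode
                false,                                       // outputUnknownUnigrams
                true)) {                                     // discardPunctuation
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);

            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();                   // must precede the first incrementToken()
            while (tokenizer.incrementToken()) { // advances to the next token, false at end of input
                out.add(term.toString() + "/" + pos.getLeftPOS());
            }
            tokenizer.end();                     // finalizes end-of-stream state such as the final offset
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        for (String token : tokenize("한국어 형태소 분석")) {
            System.out.println(token);
        }
    }
}
```

The exact tokens and part-of-speech tags depend on the dictionary shipped with the Lucene version in use, so no expected output is shown.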