Class WikipediaTokenizer
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.Tokenizer
        - org.apache.lucene.analysis.wikipedia.WikipediaTokenizer

All Implemented Interfaces: Closeable, AutoCloseable
public final class WikipediaTokenizer extends Tokenizer
Extension of StandardTokenizer that is aware of Wikipedia syntax. It is based on the Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but it may not be complete.
WARNING: This API is experimental and might change in incompatible ways in the next release.
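As a rough illustration of how the class might be consumed, the sketch below runs the no-argument constructor over a short string of wiki markup and collects each term with its token type. It assumes the Lucene analysis-common module is on the classpath; `WikipediaTokenizerDemo` and its `tokenize` helper are hypothetical names introduced for this example, and the sample markup is arbitrary.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class WikipediaTokenizerDemo {

    /** Hypothetical helper (not part of Lucene): returns each token as "term/type". */
    static List<String> tokenize(String wikiText) throws IOException {
        List<String> result = new ArrayList<>();
        try (WikipediaTokenizer tokenizer = new WikipediaTokenizer()) {
            // Attributes are views onto the current token's state.
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

            tokenizer.setReader(new StringReader(wikiText));
            tokenizer.reset(); // required before the first incrementToken()
            while (tokenizer.incrementToken()) {
                result.add(term.toString() + "/" + type.type());
            }
            tokenizer.end(); // signals that the stream has been fully consumed
        } // close() is invoked here by try-with-resources
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Arbitrary sample markup: bold text followed by an internal link.
        for (String token : tokenize("'''Lucene''' is a [[search engine]] library")) {
            System.out.println(token);
        }
    }
}
```

The type printed after the slash can be compared against the String constants of this class, such as INTERNAL_LINK or BOLD, to decide how to treat each token.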
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
| Modifier and Type | Field | Description |
|---|---|---|
| static int | ACRONYM_ID | |
| static int | ALPHANUM_ID | |
| static int | APOSTROPHE_ID | |
| static String | BOLD | |
| static int | BOLD_ID | |
| static String | BOLD_ITALICS | |
| static int | BOLD_ITALICS_ID | |
| static int | BOTH | Output both the untokenized token and the splits |
| static String | CATEGORY | |
| static int | CATEGORY_ID | |
| static String | CITATION | |
| static int | CITATION_ID | |
| static int | CJ_ID | |
| static int | COMPANY_ID | |
| static int | EMAIL_ID | |
| static String | EXTERNAL_LINK | |
| static int | EXTERNAL_LINK_ID | |
| static String | EXTERNAL_LINK_URL | |
| static int | EXTERNAL_LINK_URL_ID | |
| static String | HEADING | |
| static int | HEADING_ID | |
| static int | HOST_ID | |
| static String | INTERNAL_LINK | |
| static int | INTERNAL_LINK_ID | |
| static String | ITALICS | |
| static int | ITALICS_ID | |
| static int | NUM_ID | |
| static String | SUB_HEADING | |
| static int | SUB_HEADING_ID | |
| static String[] | TOKEN_TYPES | String token types that correspond to token type int constants |
| static int | TOKENS_ONLY | Only output tokens |
| static int | UNTOKENIZED_ONLY | Only output untokenized tokens, which are tokens that would normally be split into several tokens |
| static int | UNTOKENIZED_TOKEN_FLAG | This flag is used to indicate that the produced "Token" would, if TOKENS_ONLY was used, produce multiple tokens. |

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
| Constructor | Description |
|---|---|
| WikipediaTokenizer() | Creates a new instance of the WikipediaTokenizer. |
| WikipediaTokenizer(int tokenOutput, Set<String> untokenizedTypes) | Creates a new instance of the WikipediaTokenizer. |
| WikipediaTokenizer(AttributeFactory factory, int tokenOutput, Set<String> untokenizedTypes) | Creates a new instance of the WikipediaTokenizer. |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| void | close() | |
| void | end() | |
| boolean | incrementToken() | |
| void | reset() | |
-
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
INTERNAL_LINK
public static final String INTERNAL_LINK
- See Also:
- Constant Field Values
-
EXTERNAL_LINK
public static final String EXTERNAL_LINK
- See Also:
- Constant Field Values
-
EXTERNAL_LINK_URL
public static final String EXTERNAL_LINK_URL
- See Also:
- Constant Field Values
-
CITATION
public static final String CITATION
- See Also:
- Constant Field Values
-
CATEGORY
public static final String CATEGORY
- See Also:
- Constant Field Values
-
BOLD
public static final String BOLD
- See Also:
- Constant Field Values
-
ITALICS
public static final String ITALICS
- See Also:
- Constant Field Values
-
BOLD_ITALICS
public static final String BOLD_ITALICS
- See Also:
- Constant Field Values
-
HEADING
public static final String HEADING
- See Also:
- Constant Field Values
-
SUB_HEADING
public static final String SUB_HEADING
- See Also:
- Constant Field Values
-
ALPHANUM_ID
public static final int ALPHANUM_ID
- See Also:
- Constant Field Values
-
APOSTROPHE_ID
public static final int APOSTROPHE_ID
- See Also:
- Constant Field Values
-
ACRONYM_ID
public static final int ACRONYM_ID
- See Also:
- Constant Field Values
-
COMPANY_ID
public static final int COMPANY_ID
- See Also:
- Constant Field Values
-
EMAIL_ID
public static final int EMAIL_ID
- See Also:
- Constant Field Values
-
HOST_ID
public static final int HOST_ID
- See Also:
- Constant Field Values
-
NUM_ID
public static final int NUM_ID
- See Also:
- Constant Field Values
-
CJ_ID
public static final int CJ_ID
- See Also:
- Constant Field Values
-
INTERNAL_LINK_ID
public static final int INTERNAL_LINK_ID
- See Also:
- Constant Field Values
-
EXTERNAL_LINK_ID
public static final int EXTERNAL_LINK_ID
- See Also:
- Constant Field Values
-
CITATION_ID
public static final int CITATION_ID
- See Also:
- Constant Field Values
-
CATEGORY_ID
public static final int CATEGORY_ID
- See Also:
- Constant Field Values
-
BOLD_ID
public static final int BOLD_ID
- See Also:
- Constant Field Values
-
ITALICS_ID
public static final int ITALICS_ID
- See Also:
- Constant Field Values
-
BOLD_ITALICS_ID
public static final int BOLD_ITALICS_ID
- See Also:
- Constant Field Values
-
HEADING_ID
public static final int HEADING_ID
- See Also:
- Constant Field Values
-
SUB_HEADING_ID
public static final int SUB_HEADING_ID
- See Also:
- Constant Field Values
-
EXTERNAL_LINK_URL_ID
public static final int EXTERNAL_LINK_URL_ID
- See Also:
- Constant Field Values
-
TOKEN_TYPES
public static final String[] TOKEN_TYPES
String token types that correspond to token type int constants
-
TOKENS_ONLY
public static final int TOKENS_ONLY
Only output tokens.
- See Also:
- Constant Field Values
-
UNTOKENIZED_ONLY
public static final int UNTOKENIZED_ONLY
Only output untokenized tokens, which are tokens that would normally be split into several tokens.
- See Also:
- Constant Field Values
-
BOTH
public static final int BOTH
Output both the untokenized token and the splits.
- See Also:
- Constant Field Values
-
UNTOKENIZED_TOKEN_FLAG
public static final int UNTOKENIZED_TOKEN_FLAG
This flag is used to indicate that the produced "Token" would, if TOKENS_ONLY was used, produce multiple tokens.
- See Also:
- Constant Field Values
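To make the interaction between BOTH and UNTOKENIZED_TOKEN_FLAG more concrete, here is a minimal sketch that reads the flag through Lucene's FlagsAttribute. It assumes the Lucene analysis-common module is on the classpath; `UntokenizedFlagDemo` and `tokenizeBoth` are hypothetical names, and the choice of CATEGORY as the only untokenized type is arbitrary.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class UntokenizedFlagDemo {

    /** Hypothetical helper: lists terms, appending '*' to those carrying the flag. */
    static List<String> tokenizeBoth(String wikiText) throws IOException {
        // Assumption for this sketch: only category links are kept whole.
        Set<String> untokenizedTypes = new HashSet<>();
        untokenizedTypes.add(WikipediaTokenizer.CATEGORY);

        List<String> result = new ArrayList<>();
        try (WikipediaTokenizer tokenizer =
                 new WikipediaTokenizer(WikipediaTokenizer.BOTH, untokenizedTypes)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            FlagsAttribute flags = tokenizer.addAttribute(FlagsAttribute.class);

            tokenizer.setReader(new StringReader(wikiText));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // Bitwise test: is UNTOKENIZED_TOKEN_FLAG set on this token?
                boolean untokenized =
                    (flags.getFlags() & WikipediaTokenizer.UNTOKENIZED_TOKEN_FLAG) != 0;
                result.add(untokenized ? term.toString() + "*" : term.toString());
            }
            tokenizer.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenizeBoth("[[Category:a b c d]]"));
    }
}
```

In BOTH mode a consumer can use this check to tell the whole-phrase token apart from the individual splits that follow it.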
-
-
Constructor Detail
-
WikipediaTokenizer
public WikipediaTokenizer()
Creates a new instance of the WikipediaTokenizer. Attaches the input to a newly created JFlex scanner.
-
WikipediaTokenizer
public WikipediaTokenizer(int tokenOutput, Set<String> untokenizedTypes)
Creates a new instance of the WikipediaTokenizer. Attaches the input to the newly created JFlex scanner.
- Parameters:
tokenOutput - One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH
-
WikipediaTokenizer
public WikipediaTokenizer(AttributeFactory factory, int tokenOutput, Set<String> untokenizedTypes)
Creates a new instance of the WikipediaTokenizer. Attaches the input to the newly created JFlex scanner. Uses the given AttributeFactory.
- Parameters:
tokenOutput - One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH
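The three-argument constructor could be exercised along the following lines. This is only a sketch under stated assumptions: the Lucene analysis-common module is on the classpath, the inherited TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY is passed as the factory, and `UntokenizedOnlyDemo` and `tokenizeUntokenizedOnly` are hypothetical names. CATEGORY is again an arbitrary choice of untokenized type.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class UntokenizedOnlyDemo {

    /** Hypothetical helper: returns the terms produced in UNTOKENIZED_ONLY mode. */
    static List<String> tokenizeUntokenizedOnly(String wikiText) throws IOException {
        // Assumption for this sketch: category links are the only untokenized type.
        Set<String> untokenizedTypes = Collections.singleton(WikipediaTokenizer.CATEGORY);

        List<String> result = new ArrayList<>();
        try (WikipediaTokenizer tokenizer = new WikipediaTokenizer(
                 TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, // factory inherited from TokenStream
                 WikipediaTokenizer.UNTOKENIZED_ONLY,
                 untokenizedTypes)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

            tokenizer.setReader(new StringReader(wikiText));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                result.add(term.toString());
            }
            tokenizer.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenizeUntokenizedOnly("[[Category:a b c d]] [[link here]]"));
    }
}
```

Types that are not listed in untokenizedTypes should still be split into individual tokens, so the set controls which markup constructs are kept as whole phrases.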
-
-
Method Detail
-
incrementToken
public final boolean incrementToken() throws IOException
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
close
public void close() throws IOException
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class Tokenizer
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides:
reset in class Tokenizer
- Throws:
IOException
-
end
public void end() throws IOException
- Overrides:
end in class TokenStream
- Throws:
IOException
-
-