java.lang.Object
  org.apache.lucene.analysis.Analyzer
    org.apache.lucene.analysis.miscellaneous.PatternAnalyzer
public final class PatternAnalyzer
Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a Reader, that can flexibly separate text into terms via a regular expression Pattern (with behaviour identical to String.split(String)), and that combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, StopFilter into a single efficient multi-purpose class.
If you are unsure what exactly a regular expression should look like, consider prototyping by trying various expressions on some test texts via String.split(String). Once you are satisfied, give that regex to PatternAnalyzer. Also see the Java Regular Expression Tutorial.
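The prototyping step suggested above needs nothing beyond the JDK. The following is a minimal sketch of it; the class name `SplitPrototype`, its `split` helper, and the second sample text are illustrative assumptions, not part of Lucene.

```java
import java.util.Arrays;

public class SplitPrototype {
    /** Hypothetical helper: splits the text with String.split(String). */
    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // Candidate 1: split on runs of whitespace.
        String text = "James is running round in the woods";
        System.out.println(Arrays.toString(split(text, "\\s+")));
        // prints [James, is, running, round, in, the, woods]

        // Candidate 2: split on runs of non-word characters.
        System.out.println(Arrays.toString(split("Hello, world! It's 2009.", "\\W+")));
        // prints [Hello, world, It, s, 2009]
    }
}
```

Once a candidate expression yields the terms you expect, compile it with Pattern.compile and pass the resulting Pattern to the PatternAnalyzer constructor.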
This class can be considerably faster than the "normal" Lucene tokenizers.
It can also serve as a building block in a compound Lucene TokenFilter chain, as in this stemming example:

    PatternAnalyzer pat = ...
    TokenStream tokenStream = new SnowballFilter(
        pat.tokenStream("content", "James is running round in the woods"),
        "English");
Field Summary | |
---|---|
static PatternAnalyzer | DEFAULT_ANALYZER: A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader. |
static PatternAnalyzer | EXTENDED_ANALYZER: A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. |
static Pattern | NON_WORD_PATTERN: "\\W+"; Divides text at non-letters (NOT Character.isLetter(c)) |
static Pattern | WHITESPACE_PATTERN: "\\s+"; Divides text at whitespaces (Character.isWhitespace(c)) |
Constructor Summary | |
---|---|
PatternAnalyzer(Version matchVersion, Pattern pattern, boolean toLowerCase, Set<?> stopWords): Constructs a new instance with the given parameters. |
Method Summary | |
---|---|
boolean | equals(Object other): Indicates whether some other object is "equal to" this one. |
int | hashCode(): Returns a hash code value for the object. |
TokenStream | tokenStream(String fieldName, Reader reader): Creates a token stream that tokenizes all the text in the given Reader. This implementation forwards to tokenStream(String, String) and is less efficient than tokenStream(String, String). |
TokenStream | tokenStream(String fieldName, String text): Creates a token stream that tokenizes the given string into token terms (aka words). |
Methods inherited from class org.apache.lucene.analysis.Analyzer |
---|
close, getOffsetGap, getPositionIncrementGap, getPreviousTokenStream, reusableTokenStream, setPreviousTokenStream |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final Pattern NON_WORD_PATTERN
"\\W+"; Divides text at non-letters (NOT Character.isLetter(c))
public static final Pattern WHITESPACE_PATTERN
"\\s+"; Divides text at whitespaces (Character.isWhitespace(c))
public static final PatternAnalyzer DEFAULT_ANALYZER
public static final PatternAnalyzer EXTENDED_ANALYZER
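To get a feel for how the two regex strings differ, the sketch below compiles them with plain java.util.regex and splits the same sample text with each. It does not use Lucene: `PatternComparison`, its `split` helper, and the sample text are assumptions made for illustration. Note that in java.util.regex, `\W` by default treats ASCII digits and the underscore as word characters, so this sketch shows regex behaviour only and may not match the analyzer's handling of such characters.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PatternComparison {
    // Same regex strings as the documented fields, compiled locally so the
    // sketch does not depend on Lucene.
    static final Pattern NON_WORD = Pattern.compile("\\W+");
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Hypothetical helper: splits the text around matches of the pattern. */
    static String[] split(Pattern pattern, String text) {
        return pattern.split(text);
    }

    public static void main(String[] args) {
        String text = "Don't panic: it's 42.";
        // Whitespace splitting keeps punctuation attached to the terms.
        System.out.println(Arrays.toString(split(WHITESPACE, text)));
        // prints [Don't, panic:, it's, 42.]

        // Non-word splitting also breaks at apostrophes, colons and periods.
        System.out.println(Arrays.toString(split(NON_WORD, text)));
        // prints [Don, t, panic, it, s, 42]
    }
}
```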
Constructor Detail |
---|
public PatternAnalyzer(Version matchVersion, Pattern pattern, boolean toLowerCase, Set<?> stopWords)
matchVersion - If >= Version.LUCENE_29, StopFilter.enablePositionIncrement is set to true
pattern - a regular expression delimiting tokens
toLowerCase - if true, returns tokens after applying String.toLowerCase()
stopWords - if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via StopFilter.makeStopSet(Version, String[]) and/or WordlistLoader, as in WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt")), or other stop word lists.
Method Detail |
---|
public TokenStream tokenStream(String fieldName, String text)
fieldName - the name of the field to tokenize (currently ignored).
text - the string to tokenize
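The way the constructor parameters (pattern, toLowerCase, stopWords) shape the resulting terms can be approximated in plain Java. This is a sketch under stated assumptions, not the Lucene implementation: `TokenizeSketch` and `tokenize` are hypothetical names, skipping empty strings is an assumption, and Locale.ROOT is used for lower-casing only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class TokenizeSketch {
    /**
     * Hypothetical helper (not part of Lucene): splits the text at the
     * pattern, optionally lower-cases each term, then drops stop words.
     */
    static List<String> tokenize(String text, Pattern pattern,
                                 boolean toLowerCase, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String term : pattern.split(text)) {
            if (term.isEmpty()) {
                continue; // assumption: ignore the empty string a leading delimiter produces
            }
            if (toLowerCase) {
                term = term.toLowerCase(Locale.ROOT);
            }
            // Stop words are checked after lower-casing, as the stopWords
            // parameter description states.
            if (stopWords != null && stopWords.contains(term)) {
                continue;
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "in", "the"));
        List<String> terms = tokenize("James is running round in the woods",
                Pattern.compile("\\s+"), true, stopWords);
        System.out.println(terms); // prints [james, running, round, woods]
    }
}
```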
public TokenStream tokenStream(String fieldName, Reader reader)
Creates a token stream that tokenizes all the text in the given Reader. This implementation forwards to tokenStream(String, String) and is less efficient than tokenStream(String, String).
tokenStream in class Analyzer
fieldName - the name of the field to tokenize (currently ignored).
reader - the reader delivering the text
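One plausible reason the Reader variant is less efficient is that the whole Reader has to be drained into a String before it can be forwarded to tokenStream(String, String). The sketch below illustrates that extra copying step with the JDK only; it is an assumption about the general approach, not the actual Lucene code, and `ReaderToStringSketch` and `readFully` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderToStringSketch {
    /** Hypothetical helper: copies everything the reader delivers into a String. */
    static String readFully(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            // read() returns -1 once the end of the stream is reached.
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        try (Reader reader = new StringReader("James is running round in the woods")) {
            // The resulting String is what would be handed to the String-based variant.
            String text = readFully(reader);
            System.out.println(text);
        }
    }
}
```

If the text is already available as a String, calling tokenStream(String, String) directly avoids this intermediate copy.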
public boolean equals(Object other)
equals in class Object
other - the reference object with which to compare.
public int hashCode()
hashCode in class Object