public final class PatternAnalyzer extends ReusableAnalyzerBase

Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a Reader, that can flexibly separate text into terms via a regular expression Pattern (with behaviour identical to String.split(String)), and that combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter into a single efficient multi-purpose class.
If you are unsure how exactly a regular expression should look, consider prototyping by simply trying various expressions on some test texts via String.split(String). Once you are satisfied, give that regex to PatternAnalyzer. Also see the Java Regular Expression Tutorial.
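For example, a quick prototyping round might look like the following sketch; the sample text and the "\\W+" expression are only illustrative:

    String sample = "James is running round in the woods";   // illustrative test text
    String regex = "\\W+";                                    // candidate expression, same as NON_WORD_PATTERN
    for (String term : sample.split(regex)) {
        System.out.println(term);                             // inspect the resulting terms
    }
    // once satisfied, compile the very same expression for PatternAnalyzer:
    java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(regex);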
This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene TokenFilter chain, for example as in this stemming example:

    PatternAnalyzer pat = ...
    TokenStream tokenStream = new SnowballFilter(
        pat.tokenStream("content", "James is running round in the woods"),
        "English");
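One way to consume the resulting stream is sketched below; it assumes the TokenStream attribute API of this Lucene generation (CharTermAttribute, incrementToken()) and is not part of the original example:

    // import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    CharTermAttribute termAtt = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.println(termAtt.toString());   // prints each stemmed term
    }
    tokenStream.end();
    tokenStream.close();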
Nested classes inherited from class ReusableAnalyzerBase: ReusableAnalyzerBase.TokenStreamComponents
Modifier and Type | Field and Description
---|---
static PatternAnalyzer | DEFAULT_ANALYZER: A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
static PatternAnalyzer | EXTENDED_ANALYZER: A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader.
static Pattern | NON_WORD_PATTERN: "\\W+"; Divides text at non-letters (NOT Character.isLetter(c)).
static Pattern | WHITESPACE_PATTERN: "\\s+"; Divides text at whitespaces (Character.isWhitespace(c)).
Constructor and Description
---
PatternAnalyzer(Version matchVersion, Pattern pattern, boolean toLowerCase, Set<?> stopWords): Constructs a new instance with the given parameters.
Modifier and Type | Method and Description
---|---
ReusableAnalyzerBase.TokenStreamComponents | createComponents(String fieldName, Reader reader): Creates a token stream that tokenizes all the text in the given Reader; this implementation forwards to tokenStream(String, Reader, String) and is therefore less efficient than calling that method directly.
ReusableAnalyzerBase.TokenStreamComponents | createComponents(String fieldName, Reader reader, String text): Creates a token stream that tokenizes the given string into token terms (aka words).
boolean | equals(Object other): Indicates whether some other object is "equal to" this one.
int | hashCode(): Returns a hash code value for the object.
Methods inherited from class ReusableAnalyzerBase: initReader, reusableTokenStream, tokenStream
Methods inherited from class Analyzer: close, getOffsetGap, getPositionIncrementGap, getPreviousTokenStream, setPreviousTokenStream
public static final Pattern NON_WORD_PATTERN
"\\W+"; Divides text at non-letters (NOT Character.isLetter(c))

public static final Pattern WHITESPACE_PATTERN
"\\s+"; Divides text at whitespaces (Character.isWhitespace(c))

public static final PatternAnalyzer DEFAULT_ANALYZER

public static final PatternAnalyzer EXTENDED_ANALYZER
public PatternAnalyzer(Version matchVersion, Pattern pattern, boolean toLowerCase, Set<?> stopWords)

Parameters:
matchVersion - currently does nothing
pattern - a regular expression delimiting tokens
toLowerCase - if true, returns tokens after applying String.toLowerCase()
stopWords - if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via StopFilter.makeStopSet(Version, String[]) and/or WordlistLoader, as in WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt")), or other stop word lists.
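A hedged sketch of the stop-word variant described above; the concrete stop words and the Version constant are only illustrative, and the location of StopFilter.makeStopSet is assumed for this Lucene generation:

    Set<?> stopWords = StopFilter.makeStopSet(Version.LUCENE_36, "the", "in", "is");   // illustrative stop words
    PatternAnalyzer analyzer = new PatternAnalyzer(
        Version.LUCENE_36,                    // assumed version constant; currently does nothing
        PatternAnalyzer.NON_WORD_PATTERN,     // split at non-letters, like String.split("\\W+")
        true,                                 // lower-case before stop-word filtering
        stopWords);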
public ReusableAnalyzerBase.TokenStreamComponents createComponents(String fieldName, Reader reader, String text)

Parameters:
fieldName - the name of the field to tokenize (currently ignored)
reader - reader (e.g. a CharFilter) of the original text; can be null
text - the string to tokenize

public ReusableAnalyzerBase.TokenStreamComponents createComponents(String fieldName, Reader reader)
Creates a token stream that tokenizes all the text in the given Reader; this implementation forwards to tokenStream(String, Reader, String) and is therefore less efficient than calling that method directly.

Specified by:
createComponents in class ReusableAnalyzerBase

Parameters:
fieldName - the name of the field to tokenize (currently ignored)
reader - the reader delivering the text

public boolean equals(Object other)