java.lang.Object
  org.apache.lucene.util.AttributeImpl
      org.apache.lucene.analysis.Token
public class Token
A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.
The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC display, etc.
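As a rough illustration of how start and end offsets let an application map terms back onto the source text, the sketch below wraps matched spans in brackets. It does not use Lucene: `OffsetHighlighter` and `Span` are hypothetical stand-ins, and the spans are assumed to be sorted, non-overlapping, and half-open (end is one past the last character, as `endOffset()` is described on this page).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class OffsetHighlighter {
    /** A half-open [start, end) span into the source text, mirroring startOffset()/endOffset(). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Wraps each span of the source text in [ ]; spans must be sorted and non-overlapping. */
    static String highlight(String source, List<Span> spans) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Span s : spans) {
            out.append(source, cursor, s.start);                        // text before the match
            out.append('[').append(source, s.start, s.end).append(']'); // the matched term
            cursor = s.end;
        }
        out.append(source, cursor, source.length());                    // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        List<Span> hits = new ArrayList<Span>();
        hits.add(new Span(4, 9));   // "quick"
        hits.add(new Span(16, 19)); // "fox"
        System.out.println(highlight("The quick brown fox", hits)); // prints "The [quick] brown [fox]"
    }
}
```

Because the offsets refer to the original field text, the highlighting works even if a filter has altered the term text itself (for example by stemming).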
The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might be implemented with type "eos". The default token type is "word".
A Token can optionally have metadata (a.k.a. Payload) in the form of a variable length byte array. Use TermPositions.getPayloadLength() and TermPositions.getPayload(byte[], int) to retrieve the payloads from the index.
NOTE: As of 2.9, Token implements all Attribute interfaces that are part of core Lucene and can be found in the tokenattributes subpackage. With the new TokenStream API it is no longer necessary to use Token, but it can still be used as a convenience class that implements all Attributes, which is especially useful when switching from the old to the new TokenStream API.
NOTE: As of 2.3, Token stores the term text internally as a malleable char[] termBuffer instead of String termText. The indexing code and core tokenizers have been changed to re-use a single Token instance, changing its buffer and other fields in-place as the Token is processed. This provides substantially better indexing performance as it saves the GC cost of new'ing a Token and String for every term. The APIs that accept String termText are still available but a warning about the associated performance cost has been added (below). The termText() method has been deprecated.
Tokenizers and TokenFilters should try to re-use a Token instance when possible for best performance, by implementing the TokenStream.incrementToken() API.
Failing that, to create a new Token you should first use one of the constructors that starts with null text. To load the token from a char[] use setTermBuffer(char[], int, int). To load from a String use setTermBuffer(String) or setTermBuffer(String, int, int). Alternatively you can get the Token's termBuffer by calling either termBuffer(), if you know that your text is shorter than the capacity of the termBuffer, or resizeTermBuffer(int), if there is any possibility that you may need to grow the buffer. Fill in the characters of your term into this buffer, with String.getChars(int, int, char[], int) if loading from a string, or with System.arraycopy(Object, int, Object, int, int), and finally call setTermLength(int) to set the length of the term text. See LUCENE-969 for details.
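The resize, fill, set-length sequence described above can be sketched without Lucene on the classpath. `TermBufferSketch` below is a hypothetical stand-in that mimics only the term-buffer methods named on this page; its initial capacity and growth policy are assumptions, not Token's actual behavior.

```java
import java.util.Arrays;

// Hypothetical stand-in for the term-buffer part of Token; not the Lucene class.
final class TermBufferSketch {
    private char[] termBuffer = new char[10]; // initial capacity is an arbitrary assumption
    private int termLength = 0;

    char[] termBuffer() {
        return termBuffer;
    }

    /** Grows the buffer to at least newSize, preserving the existing content. */
    char[] resizeTermBuffer(int newSize) {
        if (termBuffer.length < newSize) {
            // Doubling is an assumed growth policy chosen for the sketch.
            termBuffer = Arrays.copyOf(termBuffer, Math.max(newSize, termBuffer.length * 2));
        }
        return termBuffer;
    }

    /** Records how many characters of the buffer are valid. */
    void setTermLength(int length) {
        if (length < 0 || length > termBuffer.length) {
            throw new IllegalArgumentException("length out of range for buffer capacity");
        }
        termLength = length;
    }

    String term() {
        return new String(termBuffer, 0, termLength);
    }

    /** Loads from a String: resize, copy with String.getChars, then set the length. */
    void load(String text) {
        char[] buf = resizeTermBuffer(text.length());
        text.getChars(0, text.length(), buf, 0);
        setTermLength(text.length());
    }

    /** Loads from a char[] slice: resize, copy with System.arraycopy, then set the length. */
    void load(char[] src, int offset, int length) {
        char[] buf = resizeTermBuffer(length);
        System.arraycopy(src, offset, buf, 0, length);
        setTermLength(length);
    }

    public static void main(String[] args) {
        TermBufferSketch t = new TermBufferSketch();
        t.load("tokenization-example");          // longer than the initial buffer, so it grows
        System.out.println(t.term());
        t.load("abcdef".toCharArray(), 2, 3);    // reuses the already-grown buffer
        System.out.println(t.term());            // prints "cde"
    }
}
```

The second load leaves stale characters beyond index 2 in the buffer; only the recorded length makes them invisible, which is why forgetting the final set-length call yields wrong term text.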
Typical Token reuse patterns:
Copying text from a String (type is reset to DEFAULT_TYPE if not specified):
    return reusableToken.reinit(string, startOffset, endOffset[, type]);
Copying a range of text from a String (type is reset to DEFAULT_TYPE if not specified):
    return reusableToken.reinit(string, 0, string.length(), startOffset, endOffset[, type]);
Copying text from a char[] buffer (type is reset to DEFAULT_TYPE if not specified):
    return reusableToken.reinit(buffer, 0, buffer.length, startOffset, endOffset[, type]);
Copying a range of text from a char[] buffer (type is reset to DEFAULT_TYPE if not specified):
    return reusableToken.reinit(buffer, start, end - start, startOffset, endOffset[, type]);
Copying from one Token to another (type is reset to DEFAULT_TYPE if not specified):
    return reusableToken.reinit(source.termBuffer(), 0, source.termLength(), source.startOffset(), source.endOffset()[, source.type()]);
Because TokenStreams can be chained, one cannot assume that the Token's current type is correct.
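To make the reuse idea concrete, here is a minimal sketch of a reinit-style method that overwrites one instance in place and returns it. `ReusableTokenSketch` is a hypothetical stand-in written for this example, not the Lucene class; it only models the term text, the two offsets and the type, and its space-splitting `describe` helper is likewise an assumption made for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Token; models only what is needed to show in-place reuse.
final class ReusableTokenSketch {
    static final String DEFAULT_TYPE = "word";

    private char[] buffer = new char[16];
    private int length;
    private int startOffset;
    private int endOffset;
    private String type = DEFAULT_TYPE;

    /** Overwrites every field in place and returns this, so no new object is allocated. */
    ReusableTokenSketch reinit(String text, int start, int end, String newType) {
        if (buffer.length < text.length()) {
            buffer = new char[text.length()];
        }
        text.getChars(0, text.length(), buffer, 0);
        length = text.length();
        startOffset = start;
        endOffset = end;
        type = newType;
        return this;
    }

    /** Same as above, but resets the type to DEFAULT_TYPE. */
    ReusableTokenSketch reinit(String text, int start, int end) {
        return reinit(text, start, end, DEFAULT_TYPE);
    }

    String type() {
        return type;
    }

    @Override
    public String toString() {
        return "(" + new String(buffer, 0, length) + "," + startOffset + "," + endOffset + "," + type + ")";
    }

    /** Splits on single spaces, reusing one token instance for every term. */
    static List<String> describe(String text) {
        ReusableTokenSketch reusable = new ReusableTokenSketch();
        List<String> out = new ArrayList<String>();
        int start = 0;
        while (start <= text.length()) {
            int end = text.indexOf(' ', start);
            if (end < 0) {
                end = text.length();
            }
            if (end > start) {
                // Snapshot the token as a String before the instance is overwritten again.
                out.add(reusable.reinit(text.substring(start, end), start, end).toString());
            }
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("a bc")); // prints "[(a,0,1,word), (bc,2,4,word)]"
    }
}
```

Note that the type is overwritten on every call: a token previously reinitialized with type "eos" reports "word" again after the three-argument form, which matches the "type is reset" remark in the patterns above.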
See Also: Payload, Serialized Form
Field Summary | |
---|---|
static String |
DEFAULT_TYPE
|
Constructor Summary | |
---|---|
Token()
Constructs a Token with null text. |
|
Token(char[] startTermBuffer,
int termBufferOffset,
int termBufferLength,
int start,
int end)
Constructs a Token with the given term buffer (offset & length), start and end offsets |
|
Token(int start,
int end)
Constructs a Token with null text and start & end offsets. |
|
Token(int start,
int end,
int flags)
Constructs a Token with null text and start & end offsets plus flags. |
|
Token(int start,
int end,
String typ)
Constructs a Token with null text and start & end offsets plus the Token type. |
|
Token(String text,
int start,
int end)
Constructs a Token with the given term text, and start & end offsets. |
|
Token(String text,
int start,
int end,
int flags)
Constructs a Token with the given text, start and end offsets, & flags. |
|
Token(String text,
int start,
int end,
String typ)
Constructs a Token with the given text, start and end offsets, & type. |
Method Summary | |
---|---|
void |
clear()
Resets the term text, payload, flags, and positionIncrement, startOffset, endOffset and token type to default. |
Object |
clone()
Shallow clone. |
Token |
clone(char[] newTermBuffer,
int newTermOffset,
int newTermLength,
int newStartOffset,
int newEndOffset)
Makes a clone, but replaces the term buffer & start/end offset in the process. |
void |
copyTo(AttributeImpl target)
Copies the values from this Attribute into the passed-in target attribute. |
int |
endOffset()
Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text. |
boolean |
equals(Object obj)
All values used for computation of AttributeImpl.hashCode() should be checked here for equality. |
int |
getFlags()
EXPERIMENTAL: While we think this is here to stay, we may want to change it to be a long. |
Payload |
getPayload()
Returns this Token's payload. |
int |
getPositionIncrement()
Returns the position increment of this Token. |
int |
hashCode()
Subclasses must implement this method and should compute a hashCode similar to this: |
Token |
reinit(char[] newTermBuffer,
int newTermOffset,
int newTermLength,
int newStartOffset,
int newEndOffset)
Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE. |
Token |
reinit(char[] newTermBuffer,
int newTermOffset,
int newTermLength,
int newStartOffset,
int newEndOffset,
String newType)
Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String). |
Token |
reinit(String newTerm,
int newStartOffset,
int newEndOffset)
Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE. |
Token |
reinit(String newTerm,
int newTermOffset,
int newTermLength,
int newStartOffset,
int newEndOffset)
Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE. |
Token |
reinit(String newTerm,
int newTermOffset,
int newTermLength,
int newStartOffset,
int newEndOffset,
String newType)
Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String). |
Token |
reinit(String newTerm,
int newStartOffset,
int newEndOffset,
String newType)
Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String). |
void |
reinit(Token prototype)
Copy the prototype token's fields into this one. |
void |
reinit(Token prototype,
char[] newTermBuffer,
int offset,
int length)
Copy the prototype token's fields into this one, with a different term. |
void |
reinit(Token prototype,
String newTerm)
Copy the prototype token's fields into this one, with a different term. |
char[] |
resizeTermBuffer(int newSize)
Grows the termBuffer to at least size newSize, preserving the existing content. |
void |
setEndOffset(int offset)
Set the ending offset. |
void |
setFlags(int flags)
|
void |
setOffset(int startOffset,
int endOffset)
Set the starting and ending offset. |
void |
setPayload(Payload payload)
Sets this Token's payload. |
void |
setPositionIncrement(int positionIncrement)
Set the position increment. |
void |
setStartOffset(int offset)
Set the starting offset. |
void |
setTermBuffer(char[] buffer,
int offset,
int length)
Copies the contents of buffer, starting at offset for length characters, into the termBuffer array. |
void |
setTermBuffer(String buffer)
Copies the contents of buffer into the termBuffer array. |
void |
setTermBuffer(String buffer,
int offset,
int length)
Copies the contents of buffer, starting at offset and continuing for length characters, into the termBuffer array. |
void |
setTermLength(int length)
Set number of valid characters (length of the term) in the termBuffer array. |
void |
setTermText(String text)
Deprecated. Use setTermBuffer(char[], int, int) or setTermBuffer(String) or setTermBuffer(String, int, int). |
void |
setType(String type)
Set the lexical type. |
int |
startOffset()
Returns this Token's starting offset, the position of the first character corresponding to this token in the source text. |
String |
term()
Returns the Token's term text. |
char[] |
termBuffer()
Returns the internal termBuffer character array which you can then directly alter. |
int |
termLength()
Return number of valid characters (length of the term) in the termBuffer array. |
String |
termText()
Deprecated. This method now has a performance penalty because the text is stored internally in a char[]. If possible, use termBuffer() and termLength() directly instead. If you really need a String, use term(). |
String |
toString()
The default implementation of this method accesses all declared fields of this object and prints the values in the following syntax: |
String |
type()
Returns this Token's lexical type. |
Methods inherited from class java.lang.Object |
---|
finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_TYPE
Constructor Detail |
---|
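public Token()
Constructs a Token with null text.
public Token(int start, int end)
Constructs a Token with null text and start & end offsets.
Parameters:
start - start offset in the source text
end - end offset in the source text
public Token(int start, int end, String typ)
Constructs a Token with null text and start & end offsets plus the Token type.
Parameters:
start - start offset in the source text
end - end offset in the source text
typ - the lexical type of this Token
public Token(int start, int end, int flags)
Constructs a Token with null text and start & end offsets plus flags.
Parameters:
start - start offset in the source text
end - end offset in the source text
flags - the bits to set for this token
public Token(String text, int start, int end)
Constructs a Token with the given term text, and start & end offsets.
Parameters:
text - term text
start - start offset
end - end offset
public Token(String text, int start, int end, String typ)
Constructs a Token with the given text, start and end offsets, & type.
Parameters:
text - term text
start - start offset
end - end offset
typ - token type
public Token(String text, int start, int end, int flags)
Constructs a Token with the given text, start and end offsets, & flags.
Parameters:
text - term text
start - start offset
end - end offset
flags - token type bits
public Token(char[] startTermBuffer, int termBufferOffset, int termBufferLength, int start, int end)
Constructs a Token with the given term buffer (offset & length), start and end offsets.
Parameters:
startTermBuffer - buffer containing the term text
termBufferOffset - index in the buffer of the first character of the term
termBufferLength - number of valid characters in the buffer
start - start offset
end - end offset
Method Detail |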
---|
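public void setPositionIncrement(int positionIncrement)
Set the position increment. This determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.
The default value is one.
Some common uses for this are:
Set it to zero to put multiple terms in the same position, for example when a word has several stems or synonyms.
Set it to values greater than one to inhibit exact phrase matches, for example across removed stop words.
Specified by: setPositionIncrement in interface PositionIncrementAttribute
Parameters:
positionIncrement - the distance from the prior term
See Also: TermPositions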
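The effect of position increments is easiest to see by turning a sequence of increments into absolute positions. The sketch below is plain arithmetic, not Lucene code: `PositionSketch` is a hypothetical helper, and starting the running position at -1 (so that a first increment of one gives position 0) is an assumption made for the example.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene.
final class PositionSketch {
    /** Converts per-token position increments into absolute positions. */
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1; // assumed starting point: a first increment of 1 yields position 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i]; // each token sits 'increment' steps after the prior one
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // Increment 0 stacks a token on the previous position; increment 2 leaves a gap.
        int[] increments = {1, 1, 0, 2};
        System.out.println(Arrays.toString(positions(increments))); // prints "[0, 1, 1, 3]"
    }
}
```

In this example the third token shares position 1 with the second (as a synonym would), and the gap before the last token means a phrase query for the second and fourth terms as adjacent words would not match.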
public int getPositionIncrement()
Returns the position increment of this Token.
Specified by: getPositionIncrement in interface PositionIncrementAttribute
See Also: setPositionIncrement(int)
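public void setTermText(String text)
Deprecated. Use setTermBuffer(char[], int, int) or setTermBuffer(String) or setTermBuffer(String, int, int).
public final String termText()
Deprecated. This method now has a performance penalty because the text is stored internally in a char[]. If possible, use termBuffer() and termLength() directly instead. If you really need a String, use term().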
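public final String term()
Returns the Token's term text. This method has a performance penalty because the text is stored internally in a char[]. If possible, use termBuffer() and termLength() directly instead. If you really need a String, use this method, which is nothing more than a convenience call to new String(token.termBuffer(), 0, token.termLength()).
Specified by: term in interface TermAttribute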
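public final void setTermBuffer(char[] buffer, int offset, int length)
Copies the contents of buffer, starting at offset for length characters, into the termBuffer array.
Specified by: setTermBuffer in interface TermAttribute
Parameters:
buffer - the buffer to copy
offset - the index in the buffer of the first character to copy
length - the number of characters to copy
public final void setTermBuffer(String buffer)
Copies the contents of buffer into the termBuffer array.
Specified by: setTermBuffer in interface TermAttribute
Parameters:
buffer - the buffer to copy
public final void setTermBuffer(String buffer, int offset, int length)
Copies the contents of buffer, starting at offset and continuing for length characters, into the termBuffer array.
Specified by: setTermBuffer in interface TermAttribute
Parameters:
buffer - the buffer to copy
offset - the index in the buffer of the first character to copy
length - the number of characters to copy
public final char[] termBuffer()
Returns the internal termBuffer character array which you can then directly alter. If the array is too small for your token, use resizeTermBuffer(int) to increase it. After altering the buffer be sure to call setTermLength(int) to record the number of valid characters that were placed into the termBuffer.
Specified by: termBuffer in interface TermAttribute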
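public char[] resizeTermBuffer(int newSize)
Grows the termBuffer to at least size newSize, preserving the existing content. If the next operation is to change the contents of the term buffer, use setTermBuffer(char[], int, int), setTermBuffer(String), or setTermBuffer(String, int, int) to optimally combine the resize with the setting of the termBuffer.
Specified by: resizeTermBuffer in interface TermAttribute
Parameters:
newSize - minimum size of the new termBuffer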
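public final int termLength()
Return number of valid characters (length of the term) in the termBuffer array.
Specified by: termLength in interface TermAttribute
public final void setTermLength(int length)
Set number of valid characters (length of the term) in the termBuffer array. To grow the size of the array, use resizeTermBuffer(int) first.
Specified by: setTermLength in interface TermAttribute
Parameters:
length - the truncated length
public final int startOffset()
Returns this Token's starting offset, the position of the first character corresponding to this token in the source text.
Specified by: startOffset in interface OffsetAttribute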
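public void setStartOffset(int offset)
Set the starting offset.
See Also: startOffset()
public final int endOffset()
Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.
Specified by: endOffset in interface OffsetAttribute
public void setEndOffset(int offset)
Set the ending offset.
See Also: endOffset()
public void setOffset(int startOffset, int endOffset)
Set the starting and ending offset.
Specified by: setOffset in interface OffsetAttribute
See Also: startOffset() and endOffset()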
public final String type()
Returns this Token's lexical type.
Specified by: type in interface TypeAttribute
public final void setType(String type)
Set the lexical type.
Specified by: setType in interface TypeAttribute
See Also: type()
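public int getFlags()
EXPERIMENTAL: While we think this is here to stay, we may want to change it to be a long.
The flags are completely distinct from type(), although they do share similar purposes. The flags can be used to encode information about the token for use by other TokenFilters.
Specified by: getFlags in interface FlagsAttribute
public void setFlags(int flags)
Specified by: setFlags in interface FlagsAttribute
See Also: getFlags()
public Payload getPayload()
Returns this Token's payload.
Specified by: getPayload in interface PayloadAttribute
public void setPayload(Payload payload)
Sets this Token's payload.
Specified by: setPayload in interface PayloadAttribute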
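public String toString()
Description copied from class: AttributeImpl
The default implementation of this method accesses all declared fields of this object and prints the values in the following syntax:
    public String toString() {
        return "start=" + startOffset + ",end=" + endOffset;
    }
This method may be overridden by subclasses.
Overrides: toString in class AttributeImpl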
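public void clear()
Resets the term text, payload, flags, and positionIncrement, startOffset, endOffset and token type to default.
Specified by: clear in class AttributeImpl
public Object clone()
Description copied from class: AttributeImpl
Shallow clone.
Overrides: clone in class AttributeImpl
public Token clone(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset)
Makes a clone, but replaces the term buffer & start/end offset in the process.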
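public boolean equals(Object obj)
Description copied from class: AttributeImpl
All values used for computation of AttributeImpl.hashCode() should be checked here for equality. See also Object.equals(Object).
Specified by: equals in class AttributeImpl
public int hashCode()
Description copied from class: AttributeImpl
Subclasses must implement this method and should compute a hashCode similar to this:
    public int hashCode() {
        int code = startOffset;
        code = code * 31 + endOffset;
        return code;
    }
See also AttributeImpl.equals(Object).
Specified by: hashCode in class AttributeImpl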
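The requirement that equals checks every value used by hashCode can be illustrated with a small self-contained class. `OffsetPair` is a hypothetical example class, not a Lucene type; it reuses the hashCode formula quoted above and pairs it with an equals over the same two fields.

```java
// Hypothetical example class; not a Lucene type.
final class OffsetPair {
    private final int startOffset;
    private final int endOffset;

    OffsetPair(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    /** Compares exactly the fields that hashCode() folds into the hash. */
    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof OffsetPair)) {
            return false;
        }
        OffsetPair other = (OffsetPair) obj;
        return startOffset == other.startOffset && endOffset == other.endOffset;
    }

    /** Same formula as the documented sample: start * 31 + end. */
    @Override
    public int hashCode() {
        int code = startOffset;
        code = code * 31 + endOffset;
        return code;
    }

    public static void main(String[] args) {
        OffsetPair a = new OffsetPair(3, 7);
        OffsetPair b = new OffsetPair(3, 7);
        System.out.println(a.equals(b));   // prints "true"
        System.out.println(a.hashCode());  // prints "100" (3 * 31 + 7)
    }
}
```

Keeping the two methods over the same field set is what lets equal objects land in the same hash bucket; adding a field to one method but not the other breaks that contract.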
public Token reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType)
Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String).
public Token reinit(char[] newTermBuffer, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset)
Shorthand for calling clear(), setTermBuffer(char[], int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE.
public Token reinit(String newTerm, int newStartOffset, int newEndOffset, String newType)
Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String).
public Token reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset, String newType)
Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String).
public Token reinit(String newTerm, int newStartOffset, int newEndOffset)
Shorthand for calling clear(), setTermBuffer(String), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE.
public Token reinit(String newTerm, int newTermOffset, int newTermLength, int newStartOffset, int newEndOffset)
Shorthand for calling clear(), setTermBuffer(String, int, int), setStartOffset(int), setEndOffset(int), and setType(java.lang.String) with Token.DEFAULT_TYPE.
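public void reinit(Token prototype)
Copy the prototype token's fields into this one.
Parameters:
prototype - the Token whose fields are copied
public void reinit(Token prototype, String newTerm)
Copy the prototype token's fields into this one, with a different term.
Parameters:
prototype - the Token whose fields are copied
newTerm - the replacement term text
public void reinit(Token prototype, char[] newTermBuffer, int offset, int length)
Copy the prototype token's fields into this one, with a different term.
Parameters:
prototype - the Token whose fields are copied
newTermBuffer - buffer containing the replacement term text
offset - the index in the buffer of the first character of the term
length - the number of characters in the term
public void copyTo(AttributeImpl target)
Description copied from class: AttributeImpl
Copies the values from this Attribute into the passed-in target attribute.
Specified by: copyTo in class AttributeImpl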