See: Description
| Interface | Description |
|---|---|
| BoundaryScanner |
Finds fragment boundaries: pluggable into
BaseFragmentsBuilder |
| FragListBuilder |
FragListBuilder is an interface for FieldFragList builder classes.
|
| FragmentsBuilder |
FragmentsBuilder is an interface for fragments (snippets) builder classes. |
| Class | Description |
|---|---|
| BaseFragListBuilder |
A abstract implementation of
FragListBuilder. |
| BaseFragmentsBuilder |
Base FragmentsBuilder implementation that supports colored pre/post
tags and multivalued fields.
|
| BreakIteratorBoundaryScanner |
A
BoundaryScanner implementation that uses BreakIterator to find
boundaries in the text. |
| FastVectorHighlighter |
Another highlighter implementation.
|
| FieldFragList |
FieldFragList has a list of "frag info" that is used by FragmentsBuilder class
to create fragments (snippets).
|
| FieldFragList.WeightedFragInfo |
List of term offsets + weight for a frag info
|
| FieldFragList.WeightedFragInfo.SubInfo |
Represents the list of term offsets for some text
|
| FieldPhraseList |
FieldPhraseList has a list of WeightedPhraseInfo that is used by FragListBuilder
to create a FieldFragList object.
|
| FieldPhraseList.WeightedPhraseInfo |
Represents the list of term offsets and boost for some text
|
| FieldPhraseList.WeightedPhraseInfo.Toffs |
Term offsets (start + end)
|
| FieldQuery |
FieldQuery breaks down query object into terms/phrases and keeps
them in a QueryPhraseMap structure.
|
| FieldQuery.QueryPhraseMap |
Internal structure of a query for highlighting: represents
a nested query structure
|
| FieldTermStack |
FieldTermStack is a stack that keeps query terms in the specified field
of the document to be highlighted. |
| FieldTermStack.TermInfo |
Single term with its position/offsets in the document and IDF weight.
|
| ScoreOrderFragmentsBuilder |
An implementation of FragmentsBuilder that outputs score-order fragments.
|
| ScoreOrderFragmentsBuilder.ScoreComparator |
Comparator for
FieldFragList.WeightedFragInfo by boost, breaking ties
by offset. |
| SimpleBoundaryScanner |
Simple boundary scanner implementation that divides fragments
based on a set of separator characters.
|
| SimpleFieldFragList |
A simple implementation of
FieldFragList. |
| SimpleFragListBuilder |
A simple implementation of
FragListBuilder. |
| SimpleFragmentsBuilder |
A simple implementation of FragmentsBuilder.
|
| SingleFragListBuilder |
An implementation class of
FragListBuilder that generates one FieldFragList.WeightedFragInfo object. |
| WeightedFieldFragList |
A weighted implementation of
FieldFragList. |
| WeightedFragListBuilder |
A weighted implementation of
FragListBuilder. |
To explain the algorithm, let's use the following sample text (to be highlighted) and user query:
| Sample Text | Lucene is a search engine library. |
| User Query | Lucene^2 OR "search library"~1 |
The user query is a BooleanQuery that consists of TermQuery("Lucene") with boost of 2 and PhraseQuery("search library") with slop of 1.
For your convenience, here is the offsets and positions info of the sample text.
+--------+-----------------------------------+ | | 1111111111222222222233333| | offset|01234567890123456789012345678901234| +--------+-----------------------------------+ |document|Lucene is a search engine library. | +--------*-----------------------------------+ |position|0 1 2 3 4 5 | +--------*-----------------------------------+
In Step 1, Fast Vector Highlighter generates FieldQuery.QueryPhraseMap from the user query.
QueryPhraseMap consists of the following members:
public class QueryPhraseMap {
boolean terminal;
int slop; // valid if terminal == true and phraseHighlight == true
float boost; // valid if terminal == true
Map<String, QueryPhraseMap> subMap;
}
QueryPhraseMap has subMap. The key of the subMap is a term
text in the user query and the value is a subsequent QueryPhraseMap.
If the query is a term (not phrase), then the subsequent QueryPhraseMap
is marked as terminal. If the query is a phrase, then the subsequent QueryPhraseMap
is not a terminal and it has the next term text in the phrase.
From the sample user query, the following QueryPhraseMap
will be generated:
QueryPhraseMap +--------+-+ +-------+-+ |"Lucene"|o+->|boost=2|*| * : terminal +--------+-+ +-------+-+ +--------+-+ +---------+-+ +-------+------+-+ |"search"|o+->|"library"|o+->|boost=1|slop=1|*| +--------+-+ +---------+-+ +-------+------+-+
In Step 2, Fast Vector Highlighter generates FieldTermStack. Fast Vector Highlighter uses term vector data
(must be stored FieldType.setStoreTermVectorOffsets(boolean) and FieldType.setStoreTermVectorPositions(boolean))
to generate it. FieldTermStack keeps the terms in the user query.
Therefore, in this sample case, Fast Vector Highlighter generates the following FieldTermStack:
FieldTermStack +------------------+ |"Lucene"(0,6,0) | +------------------+ |"search"(12,18,3) | +------------------+ |"library"(26,33,5)| +------------------+ where : "termText"(startOffset,endOffset,position)
In Step 3, Fast Vector Highlighter generates FieldPhraseList
by reference to QueryPhraseMap and FieldTermStack.
FieldPhraseList +----------------+-----------------+---+ |"Lucene" |[(0,6)] |w=2| +----------------+-----------------+---+ |"search library"|[(12,18),(26,33)]|w=1| +----------------+-----------------+---+
The type of each entry is WeightedPhraseInfo that consists of
an array of terms offsets and weight.
In Step 4, Fast Vector Highlighter creates FieldFragList by reference to
FieldPhraseList. In this sample case, the following
FieldFragList will be generated:
FieldFragList +---------------------------------+ |"Lucene"[(0,6)] | |"search library"[(12,18),(26,33)]| |totalBoost=3 | +---------------------------------+
The calculation for each FieldFragList.WeightedFragInfo.totalBoost (weight)
depends on the implementation of FieldFragList.add( ... ):
public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
float totalBoost = 0;
List<SubInfo> subInfos = new ArrayList<SubInfo>();
for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
totalBoost += phraseInfo.getBoost();
}
getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
}
The used implementation of FieldFragList is noted in BaseFragListBuilder.createFieldFragList( ... ):
public FieldFragList createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ){
return createFieldFragList( fieldPhraseList, new SimpleFieldFragList( fragCharSize ), fragCharSize );
}
Currently there are basically to approaches available:
SimpleFragListBuilder using SimpleFieldFragList: sum-of-boosts-approach. The totalBoost is calculated by summarizing the query-boosts per term. Per default a term is boosted by 1.0WeightedFragListBuilder using WeightedFieldFragList: sum-of-distinct-weights-approach. The totalBoost is calculated by summarizing the IDF-weights of distinct terms.Comparison of the two approaches:
| Terms in fragment | sum-of-distinct-weights | sum-of-boosts |
|---|---|---|
| das alte testament | 5.339621 | 3.0 |
| das alte testament | 5.339621 | 3.0 |
| das testament alte | 5.339621 | 3.0 |
| das alte testament | 5.339621 | 3.0 |
| das testament | 2.9455688 | 2.0 |
| das alte | 2.4759595 | 2.0 |
| das das das das | 1.5015357 | 4.0 |
| das das das | 1.3003681 | 3.0 |
| das das | 1.061746 | 2.0 |
| alte | 1.0 | 1.0 |
| alte | 1.0 | 1.0 |
| das | 0.7507678 | 1.0 |
| das | 0.7507678 | 1.0 |
| das | 0.7507678 | 1.0 |
| das | 0.7507678 | 1.0 |
| das | 0.7507678 | 1.0 |
In Step 5, by using FieldFragList and the field stored data,
Fast Vector Highlighter creates highlighted snippets!
Copyright © 2000-2018 Apache Software Foundation. All Rights Reserved.