org.apache.lucene.util
Class WAH8DocIdSet

java.lang.Object
  extended by org.apache.lucene.search.DocIdSet
      extended by org.apache.lucene.util.WAH8DocIdSet

public final class WAH8DocIdSet
extends DocIdSet

DocIdSet implementation based on word-aligned hybrid encoding on words of 8 bits.

This implementation doesn't support random access but has a fast DocIdSetIterator that can advance in logarithmic time thanks to an index.
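As an illustration of how a sampled index can make advancing logarithmic, here is a minimal sketch. It is not Lucene's implementation: it uses a plain sorted int array as a stand-in for the compressed data, and the class name SampledIndexSketch and its members are hypothetical. The idea is to binary-search a small array of samples, then scan at most about one interval of entries.

```java
/** Illustrative sketch only; class and member names are hypothetical, not Lucene API. */
final class SampledIndexSketch {
  private final int[] docs;     // sorted doc IDs, standing in for the compressed data
  private final int interval;   // one sample is kept every `interval` entries (assumed >= 1)
  private final int[] samples;  // samples[k] == docs[k * interval]

  SampledIndexSketch(int[] sortedDocs, int interval) {
    this.docs = sortedDocs;
    this.interval = interval;
    int n = (sortedDocs.length + interval - 1) / interval;
    this.samples = new int[n];
    for (int k = 0; k < n; k++) {
      samples[k] = sortedDocs[k * interval];
    }
  }

  /** Returns the smallest doc ID >= target, or Integer.MAX_VALUE if there is none. */
  int advance(int target) {
    // Binary search for the last sample that is <= target.
    int lo = 0, hi = samples.length - 1, block = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (samples[mid] <= target) {
        block = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    // Linear scan from the start of that block; it ends within the block
    // or at the first entry of the next one.
    for (int i = block * interval; i < docs.length; i++) {
      if (docs[i] >= target) {
        return docs[i];
      }
    }
    return Integer.MAX_VALUE;
  }
}
```

A smaller interval means a larger index but a shorter scan after the binary search. That is the kind of trade-off an index interval parameter controls.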

The compression scheme is simplistic and should work well with sparse and very dense doc ID sets while being only slightly larger than a FixedBitSet for incompressible sets (overhead below 2% in the worst case), in spite of the index.

Format: The format is byte-aligned. An 8-bit word is either clean, meaning composed only of zeros or ones, or dirty, meaning that it contains between 1 and 7 bits set. The idea is to encode sequences of clean words using run-length encoding and to leave sequences of dirty words as-is.

Token     Clean length+   Dirty length+   Dirty words
1 byte    0-n bytes       0-n bytes       0-n bytes

This format cannot encode sequences of fewer than 2 clean words and 0 dirty words. The reason is that if you find a single clean word, you should rather encode it as a dirty word. This takes the same space as starting a new sequence (since you need one byte for the token) but will be lighter to decode. There is however an exception for the first sequence. Since the first sequence may start directly with a dirty word, the clean length is encoded directly, without subtracting 2.

There is an additional restriction on the format: the sequence of dirty words is not allowed to contain two consecutive clean words. This restriction exists to make sure no space is wasted and to make sure iterators can read the next doc ID by reading at most 2 dirty words.
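The clean/dirty distinction described above can be sketched as follows. This is a simplified illustration, not Lucene's encoder: it omits the token layout, the first-sequence exception and the index, and the class name Wah8FormatSketch and its methods are hypothetical. It also assumes that a run consists of identical clean words (all 0x00 or all 0xFF). The sketch only finds the runs of at least 2 clean words that would be eligible for run-length encoding; every other word, including an isolated clean word, would be stored verbatim as a dirty word.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names are hypothetical, not Lucene API. */
final class Wah8FormatSketch {

  /** A word is clean when all 8 bits are equal, that is 0x00 or 0xFF. */
  static boolean isClean(byte word) {
    return word == 0x00 || word == (byte) 0xFF;
  }

  /**
   * Returns {start, length} pairs for every maximal run of at least 2
   * identical clean words. Words outside these runs would be kept as-is.
   */
  static List<int[]> cleanRuns(byte[] words) {
    List<int[]> runs = new ArrayList<>();
    int i = 0;
    while (i < words.length) {
      if (!isClean(words[i])) {
        i++; // dirty word: left untouched
        continue;
      }
      int start = i;
      byte value = words[i];
      while (i < words.length && words[i] == value) {
        i++;
      }
      int length = i - start;
      if (length >= 2) {
        runs.add(new int[] {start, length});
      }
      // A run of length 1 is skipped: a single clean word is treated as dirty.
    }
    return runs;
  }
}
```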

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
static class WAH8DocIdSet.Builder
          A builder for WAH8DocIdSets.
 
Field Summary
static int DEFAULT_INDEX_INTERVAL
          Default index interval.
 
Method Summary
 int cardinality()
          Return the number of documents in this DocIdSet in constant time.
static WAH8DocIdSet intersect(Collection<WAH8DocIdSet> docIdSets)
          Same as intersect(Collection, int) with the default index interval.
static WAH8DocIdSet intersect(Collection<WAH8DocIdSet> docIdSets, int indexInterval)
          Compute the intersection of the provided sets.
 boolean isCacheable()
          This method is a hint for CachingWrapperFilter on whether this DocIdSet should be cached without copying it into a BitSet.
 org.apache.lucene.util.WAH8DocIdSet.Iterator iterator()
          Provides a DocIdSetIterator to access the set.
 long ramBytesUsed()
          Return the memory usage of this class in bytes.
static WAH8DocIdSet union(Collection<WAH8DocIdSet> docIdSets)
          Same as union(Collection, int) with the default index interval.
static WAH8DocIdSet union(Collection<WAH8DocIdSet> docIdSets, int indexInterval)
          Compute the union of the provided sets.
 
Methods inherited from class org.apache.lucene.search.DocIdSet
bits
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_INDEX_INTERVAL

public static final int DEFAULT_INDEX_INTERVAL
Default index interval.

See Also:
Constant Field Values
Method Detail

intersect

public static WAH8DocIdSet intersect(Collection<WAH8DocIdSet> docIdSets)
Same as intersect(Collection, int) with the default index interval.


intersect

public static WAH8DocIdSet intersect(Collection<WAH8DocIdSet> docIdSets,
                                     int indexInterval)
Compute the intersection of the provided sets. This method is much faster than computing the intersection manually since it operates directly at the byte level.
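To see why working at the byte level is faster than intersecting doc ID by doc ID, consider this sketch on uncompressed bitmaps. It is an illustration only: the real method works on the compressed representation, the class name ByteLevelIntersectSketch is hypothetical, and the bit layout (doc ID d stored in byte d >>> 3, bit d & 7) is an assumption of the sketch. A single AND of two bytes intersects 8 doc IDs at once.

```java
/** Illustrative sketch only; names and bit layout are assumptions, not Lucene API. */
final class ByteLevelIntersectSketch {

  /** Intersects two uncompressed bitmaps one 8-bit word at a time. */
  static byte[] intersect(byte[] a, byte[] b) {
    // Doc IDs beyond the shorter bitmap cannot be in both sets.
    int n = Math.min(a.length, b.length);
    byte[] out = new byte[n];
    for (int i = 0; i < n; i++) {
      out[i] = (byte) (a[i] & b[i]);
    }
    return out;
  }

  /** Tests membership, assuming doc ID d lives in byte d >>> 3 at bit d & 7. */
  static boolean contains(byte[] words, int docId) {
    int w = docId >>> 3;
    return w < words.length && (words[w] & (1 << (docId & 7))) != 0;
  }
}
```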


union

public static WAH8DocIdSet union(Collection<WAH8DocIdSet> docIdSets)
Same as union(Collection, int) with the default index interval.


union

public static WAH8DocIdSet union(Collection<WAH8DocIdSet> docIdSets,
                                 int indexInterval)
Compute the union of the provided sets. This method is much faster than computing the union manually since it operates directly at the byte level.
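The same byte-level idea applies to unions of several sets. The sketch below is an illustration on uncompressed bitmaps, not the compressed representation this class uses, and the class name ByteLevelUnionSketch is hypothetical. Unlike an intersection, the result must be as long as the longest input, because a doc ID present in any one set belongs to the union.

```java
import java.util.Collection;

/** Illustrative sketch only; names are hypothetical, not Lucene API. */
final class ByteLevelUnionSketch {

  /** ORs any number of uncompressed bitmaps together, one 8-bit word at a time. */
  static byte[] union(Collection<byte[]> bitmaps) {
    int longest = 0;
    for (byte[] b : bitmaps) {
      longest = Math.max(longest, b.length);
    }
    byte[] out = new byte[longest];
    for (byte[] b : bitmaps) {
      for (int i = 0; i < b.length; i++) {
        out[i] |= b[i]; // one OR merges 8 doc IDs
      }
    }
    return out;
  }
}
```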


isCacheable

public boolean isCacheable()
Description copied from class: DocIdSet
This method is a hint for CachingWrapperFilter on whether this DocIdSet should be cached without copying it into a BitSet. The default is to return false. If you have your own DocIdSet implementation that iterates efficiently and quickly without doing disk I/O, override this method and return true.

Overrides:
isCacheable in class DocIdSet

iterator

public org.apache.lucene.util.WAH8DocIdSet.Iterator iterator()
Description copied from class: DocIdSet
Provides a DocIdSetIterator to access the set. This implementation can return null if there are no docs that match.

Specified by:
iterator in class DocIdSet

cardinality

public int cardinality()
Return the number of documents in this DocIdSet in constant time.


ramBytesUsed

public long ramBytesUsed()
Return the memory usage of this class in bytes.



Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.