-
Notifications
You must be signed in to change notification settings - Fork 15
Constructing compressed bitvectors
This tutorial shows how to construct and use different compressed bit vectors. Depending on the density and distribution of the one bits in a bit vector, different bit vectors types provide better compression.
- R. Pagh: Low redundancy in dictionaries with O(1) worst case lookup time, Technical Report: ftp://ftp.cs.au.dk/BRICS/Reports/RS/98/28/BRICS-RS-98-28.pdf, Section 2.
- R. Raman, V. Raman and S. Rao: Succinct Indexable Dictionaries with Applications to representations of k-ary trees and multi-sets, SODA 2002
- G. Navarro and E. Providel: Fast, Small, Simple Rank/Select on Bitmaps., SEA 2012
The rrr_vector
can be parameterized with the block_size
template parameter. The block_size
specifies the size of the chunks for which the rrr_vector
calculates the class C[]
and offset O[]
. For more information we refer to the theory papers above.
We suggest using block sizes of either 15,31,63,127 or 255. Larger block sizes provide better compression at the cost of decreased query performance.
Example: A 64GB text can be compressed to 5.2GB using block_size=255
when constructing a wavelet tree over the text using rrr_vector
.
The rrr_vector
further provides a sample_rate
constructor parameter which specified at which intervals the C[]
array will be sampled to provide faster access
,rank
and select
support. The default sample_rate
is 32.
The following code snippet how to create a rrr_vector
from an uncompressed bit_vector
representation.
#include <stdint.h>
#include <sdsl/bit_vectors.hpp>
#include <sdsl/util.hpp>
using namespace sdsl;
int main(int argc,char** argv) {
uint64_t size = 123456789;
// create bv of size bits
bit_vector bv(size);
// set sparse bits
uint64_t num_ones = size * 0.1;
for(uint64_t i=0;i<num_ones;i++) {
bv[rand()%size] = 1;
}
// create rrr_vectors
rrr_vector<63> rrr_vec63(bv,32);
rrr_vector<31> rrr_vec31(bv);
rrr_vector<15> rrr_vec15(bv,32);
rrr_vector<127> rrr_vec127(bv,32);
// bv not needed anymore after this
// cout one bits
size_t count = 0;
for(rrr_vector<>::size_type i=0;i<rrr_vec15.size();i++) {
if(rrr_vec63[i] == 1) count++;
}
std::cout << "bv uses " << util::get_size_in_mega_bytes(bv) << " MB." << std::endl;
std::cout << "rrr_vec15 uses " << util::get_size_in_mega_bytes(rrr_vec15) << " MB." << std::endl;
std::cout << "rrr_vec31 uses " << util::get_size_in_mega_bytes(rrr_vec31) << " MB." << std::endl;
std::cout << "rrr_vec63 uses " << util::get_size_in_mega_bytes(rrr_vec63) << " MB." << std::endl;
std::cout << "rrr_vec127 uses " << util::get_size_in_mega_bytes(rrr_vec127) << " MB." << std::endl;
return EXIT_SUCCESS;
}
We construct the compressed rrr_vector
from an uncompressed bit_vector
. The block_size
is passed as a template parameter:
rrr_vector<63> rrr_vec63(bv,32);
rrr_vector<31> rrr_vec31(bv);
rrr_vector<15> rrr_vec15(bv,32);
rrr_vector<127> rrr_vec127(bv,32);
The individual bits of the compressed bit vector can again be accessed via the []
operator:
if(rrr_vec63[i] == 1) count++;
Note that rrr_vector
cannot be modified after construction.
A bit vector which compresses very sparse populated bit vectors by representing the positions of 1 by the Elias-Fano representation for non-decreasing sequences.
- P. Elias: Efficient storage and retrieval by content and address of static files, Journal of the ACM, 1974
- R. Fano: On the number of bits required to implement an associative memory. Memorandum 61. Computer Structures Group, Project MAC, MIT, 1971
- D. Okanohara, K. Sadakane: Practical Entropy-Compressed Rank/Select Dictionary, Proceedings of ALENEX 2007.