Skip to content

Constructing compressed bitvectors

simongog edited this page May 19, 2012 · 11 revisions

Introduction

This tutorial shows how to construct and use different compressed bit vectors. Depending on the density and distribution of the one bits in a bit vector, different bit vectors types provide better compression.

Overview

RRR Vectors

Theory

Parameters

The rrr_vector can be parameterized with the block_size template parameter. The block_size specifies the size of the chunks for which the rrr_vector calculates the class C[] and offset O[]. For more information we refer to the theory papers above.

We suggest using block sizes of either 15,31,63,127 or 255. Larger block sizes provide better compression at the cost of decreased query performance.

Example: A 64GB text can be compressed to 5.2GB using block_size=255 when constructing a wavelet tree over the text using rrr_vector.

The rrr_vector further provides a sample_rate constructor parameter which specified at which intervals the C[] array will be sampled to provide faster access,rank and select support. The default sample_rate is 32.

Code

The following code snippet how to create a rrr_vector from an uncompressed bit_vector representation.

#include <stdint.h>
#include <sdsl/bit_vectors.hpp>
#include <sdsl/util.hpp>

using namespace sdsl;

int main(int argc,char** argv) {
    uint64_t size = 123456789;
    // create bv of size bits
    bit_vector bv(size);

    // set sparse bits
    uint64_t num_ones = size * 0.1;
    for(uint64_t i=0;i<num_ones;i++) {
        bv[rand()%size] = 1;
    }

    // create rrr_vectors
    rrr_vector<63> rrr_vec63(bv,32);
    rrr_vector<31> rrr_vec31(bv);
    rrr_vector<15> rrr_vec15(bv,32);
    rrr_vector<127> rrr_vec127(bv,32);
    // bv not needed anymore after this

    // cout one bits
    size_t count = 0;
    for(rrr_vector<>::size_type i=0;i<rrr_vec15.size();i++) {
        if(rrr_vec63[i] == 1) count++;
    }

    std::cout << "bv uses " << util::get_size_in_mega_bytes(bv) << " MB." << std::endl;
    std::cout << "rrr_vec15 uses " << util::get_size_in_mega_bytes(rrr_vec15) << " MB." << std::endl;
    std::cout << "rrr_vec31 uses " << util::get_size_in_mega_bytes(rrr_vec31) << " MB." << std::endl;
    std::cout << "rrr_vec63 uses " << util::get_size_in_mega_bytes(rrr_vec63) << " MB." << std::endl;
    std::cout << "rrr_vec127 uses " << util::get_size_in_mega_bytes(rrr_vec127) << " MB." << std::endl;
    return EXIT_SUCCESS;
}

Explanation

We construct the compressed rrr_vector from an uncompressed bit_vector. The block_size is passed as a template parameter:

rrr_vector<63> rrr_vec63(bv,32);
rrr_vector<31> rrr_vec31(bv);
rrr_vector<15> rrr_vec15(bv,32);
rrr_vector<127> rrr_vec127(bv,32);

The individual bits of the compressed bit vector can again be accessed via the [] operator:

if(rrr_vec63[i] == 1) count++;

Note that rrr_vector cannot be modified after construction.

SD-Vectors

A bit vector which compresses very sparse populated bit vectors by representing the positions of 1 by the Elias-Fano representation for non-decreasing sequences.

Theory

  • P. Elias: Efficient storage and retrieval by content and address of static files, Journal of the ACM, 1974
  • R. Fano: On the number of bits required to implement an associative memory. Memorandum 61. Computer Structures Group, Project MAC, MIT, 1971
  • D. Okanohara, K. Sadakane: Practical Entropy-Compressed Rank/Select Dictionary, Proceedings of ALENEX 2007.

Parameters

Code

Explanation