Skip to content
tb38 edited this page Aug 30, 2012 · 15 revisions

Getting started with the succinct data structure library (sdsl)

Since the sdsl is a generic C++ library you should be familiar with C++ and it is advantageously to be also familiar with the C++ standard template library (STL). In fact the structure of the sdsl is very similar to the structure of the STL. Let us explain that by and example.

Using bit-compressed integer vectors

Suppose you want to store 1000 integers in a vector and each integers is in the range from 0 to 7. Using the STL you would take a vector<char> to solve this task space-efficient. Whoever this is still a waste of space, since one integer can be encoded in (\log(8)=3) bits and the vector<char> uses 8 bits for each element. So we waste (5/8=62.5%) space.

By using the sdsl int_vector<> we can easily use bit compression, i.e. we do not use a predefined fixed number of bits (like 8, 16, 32 or 64), but we use the width of the largest integer in the vector. The constructor of the int_vector<> therefore takes a third argument which specifies that width:

sdsl::int_vector<> bc_vec(1000, 0, 3);

Note that the first two arguments -- the length of the vector and the default value of each element -- are the same as in the STL implementation. The i-th element of the vector can be read and written by operator[]. This enables us to easily replace STL integer vectors by sdsl integer vectors, since we only have to replace the definition and the remaining code is not touched. The bit compression of the vector gains space but can also decrease the access time for retrieving an element. But usually, in case of random access, the time is not twice as slow as using a more space greedy STL array. We can easily get the same speed as the STL implementation for a fixed number of bits w (w in {8,16,32,64}) by setting the template parameter in the vector to w. I.e.

sdsl::int_vector<32> bc_vec(1000, 0);

generates a vector which consists of 1000 32-bit integers initialised with zero. And this specialised version of the data structure provides the same speed as a usual array on 32-bit integers.

Using bitvectors

The most important special case of the sdsl::int_vector<> is a fixed bit-width of 1. Since then we have a bit vector and the vector is also called sdsl::bit_vector. Here is a first example for its usage:

#include <sdsl/int_vector.hpp> // for the bit_vector class
#include <sdsl/util.hpp> // for counting the set bits in a bit_vector 
#include <iostream> // for output
#include <cstdlib>  // for srand() and rand()

 using namespace std; // use the std namespace 
 using namespace sdsl;// and in addition the sdsl namespace

 typedef bit_vector::size_type size_type;
 
 int main(){
        size_type n = 10000000;
        bit_vector bv(n, 0);
        cout << "A bit vector of length "<< bv.size() << " was created." << endl;
        cout << util::get_size_in_bytes( bv ) << " bytes are occupied by the bit_vector" << endl;
        srand(13); // initialize random number generator
        for(size_type i=0; i<n; ++i){
                size_type x =  rand()%n;
                bv[ x ] = 1;
        }
        cout << n << " calls of rand() generated " << endl;

        size_type m = util::get_one_bits( bv );
        cout << m << " different numbers modulo "<< n << endl;
        
        cout << "b[0..9] = ";
        for(size_type i=0; i<10; ++i)
                cout << bv[i];
        cout << endl;
}

The example shows several things:

  • the usage of the constructor of the bit_vector class. Note that we can omit the second argument here since the default value is 0 and therefore all bits are initialised to 0.
  • the usage of the size() method to determine how many elements (in this case bits) can be stored in the vector.
  • how to determine the space of the object in bytes. Note that this can be done with the util::get_size_in_bytes method for every instance of a sdsl class.
  • how to assign a single bit of the vector: very intuitive via operator[].
  • how to read a single bit of the vector: also very intuitive via operator[].

Using a rank data structure on a bitvector

Ok, so far everything was straightforward. Actually in the previous example we could also use a the STL bitset. Now we will add a more advanced operation on the bit vector: rank(i). This operation returns the number of set bits in the prefix bv[0..i-1] of a bit vector bv. The amazing thing: It can be done in constant time (after linear preprocessing) and only little extra space on top of bv. Here is a first example:

#include <sdsl/int_vector.hpp> // for the bit_vector class
#include <sdsl/util.hpp> // for counting the set bits in a bit_vector 
#include <sdsl/rank_support.hpp> // for rank data structures
#include <iostream> // for output
#include <cstdlib>  // for srand() and rand()
#include <iomanip>  // for setw(), formatting the output

using namespace std; // use the std namespace 
using namespace sdsl;// and in addition the sdsl namespace

typedef bit_vector::size_type size_type;

int main(){
        size_type n = 100000000;
        bit_vector bv(n, 0);
        cout << "A bit vector of length "<< bv.size() << " was created." << endl;
        cout << util::get_size_in_bytes( bv ) << " bytes are occupied by the bit_vector" << endl;
        srand(13);
        for(size_type i=0; i<n; ++i){
                bv[ rand()%n ] = 1;
        }
        cout << n << " calls of rand() generated " << endl; 

        rank_support_v<> rs( &bv ); // initialize rank data structure with pointer to the bit vector
        // now a call to rs.rank(i) or shorthand rs(i) returns the number of 1 in the prefix of bv[0..i-1]
        // in constant time
        size_type m = rs.rank( bv.size() );
        cout << m << " different numbers modulo " << n << endl; 

        // it is now also possible to count the number of 1-bit 
        // in every interval [lb, rb]. Lets do it for 10 random intervals:
        cout << "interval I              |size of I    |set bits in I| \% of set bits" << endl;
        for(size_type i=0; i<10; ++i){
                size_type lb = rand()%n;
                size_type rb  = lb + rand()%(n-lb);
                size_type cnt = rs(rb+1)-rs(lb); 
                cout << "[" << setw(10) << lb << "," << setw(10) << rb << "]  ";
                cout << setw(13) << rb-lb+1 << " " << setw(13) << cnt << " ";
                cout << setw(14) << setprecision(6) << fixed << 100.0*cnt/(double)(rb-lb+1);
                cout << endl;
        }
}

This example shows

  • the usage of a rank_support class in general. A rank_support class can answer rank queries over a bit_vector efficiently. For example the class rank_support_v answers a query in constant time. We will get to know further rank_support classes later in the tutorial.
  • how to construct a rank_support object. The constructor of a rank_support class takes expects a pointer to the bit vector for which we would like to answer rank queries. After a linear time pre-computation the data rank_support object is initialised and can then be used. The initialisation is lightning fast, since it uses bit-parallelism on 64-bit words.
  • how to do a rank query with a rank_support object.

Measure the performance of our solution

#include <sdsl/int_vector.hpp> // for the bit_vector class
#include <sdsl/util.hpp> // for counting the set bits in a bit_vector 
#include <sdsl/rank_support.hpp> // for the rank data structure
#include <sdsl/testutils.hpp> // for the stop_watch class
#include <iostream> // for output
#include <cstdlib>  // for srand() and rand()
#include <iomanip>  // for setw(), formatting the output
using namespace std; // use the std namespace 
using namespace sdsl;// and in addition the sdsl namespace

typedef bit_vector::size_type size_type;

int main(){
        size_type n = 100000000;
        stop_watch sw;
        sw.start();
        bit_vector bv(n, 0);
        sw.stop();
        cout << "A bit vector bv of length "<< bv.size() << " was created." << endl;
        cout << util::get_size_in_mega_bytes( bv );
        cout << " MB are occupied by the bit_vector" << endl;
        cout << "Took " << sw.get_real_time() << " ms" << endl << endl;
        sw.start();
        srand(13);
        for(size_type i=0; i<n; ++i){
                bv[ rand()%n ] = 1;
        }
        sw.stop();
        cout << n << " calls to rand()" << endl; 
        cout << "Took " << sw.get_real_time() << " ms" << endl << endl;

        sw.start();
        rank_support_v<> rs( &bv ); 
        sw.stop();
        cout << "A rank_support object for bv was created." << endl;
        cout << util::get_size_in_mega_bytes( rs );
    cout << " MB are occupied by the rank_support for bv" << endl;
        cout << "Took " << sw.get_real_time() << " ms" << endl << endl;

        size_type checksum = 0;
        sw.start();
        for(size_type i=0; i<n; ++i){
                checksum += rs(rand()%(n+1));
        }
        sw.stop();
        cout << n << " rank queries were done. Checksum = " << checksum << "." <<endl;
        cout << "Took " << sw.get_real_time() << " ms" << endl;
}

This example shows

  • how to measure the time performance of our solution.
  • how fast the construction of the rank_support_v class is. It is in the same order as the construction of the supported bitvector.
  • Rank queries are extremely fast on uncompressed bitvectors. See the output of the executable. If you have a modern CPU and compiler (GCC version > 4.5) then you can also compile with option -msse4.2. This will result in an even fast version of the rank data structure. If the results are slower than your CPU does probably not support the SSE4.2 or SSE4a instruction set.
  • that a rank_support_v object takes about 25% of the original bit_vector. We will see later, that there exists a second implementation of the same approach which uses only 5% on top of the orignal vector and is only slightly slower (rank_support_v5).
    Remarks: The two rank data structures for uncompressed bit vectors can be parametrized by a bit-pattern x and the length l (l < 3) of the bit-pattern. Then rank(i) will return the number of occurrences of x in the prefix bv[0, i-1]. In the above example we would write rank_support_v<10,2> to be able to support the bit-pattern 10.

Using a select data structure on bitvectors

Using compressed bitvectors