restructure names and update readme, provide count, extract, locate methods in suffix trees
Konstantin Podshumok committed Jul 23, 2018
1 parent 631a16d commit f15e850
Showing 8 changed files with 315 additions and 139 deletions.
9 changes: 0 additions & 9 deletions LICENSE

This file was deleted.

97 changes: 60 additions & 37 deletions README.md
@@ -6,10 +6,10 @@ Most of examples from [SDSL cheat sheet][SDSL-CHEAT-SHEET] and [SDSL tutorial][S

## Mutable bit-compressed vectors

Core classes:
Core classes (see `pysdsl.int_vector` for dict of all of them):

* `pysdsl.IntVector(size, default_value, bit_width=64)` — dynamic bit width
* `pysdsl.BitVector(size, default_value)` — static bit width (1)
* `pysdsl.BitVector(size, default_value)` — static (fixed) bit width (1)
* `pysdsl.Int4Vector(size, default_value)` — static bit width (4)
* `pysdsl.Int8Vector(size, default_value)` — static bit width (8)
* `pysdsl.Int16Vector(size, default_value)` — static bit width (16)
@@ -49,8 +49,21 @@ Out[8]: 896.0000085830688

```

Buffer interface:

```python
In [9]: import array

In [10]: v = pysdsl.Int64Vector([1, 2, 3])

In [11]: array.array('Q', v)
Out[11]: array('Q', [1, 2, 3])
```
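
The conversion above goes through Python's buffer protocol. As a rough sketch that does not need pysdsl at all (standard library only, variable names invented for this illustration), `array.array` exports the same kind of typed buffer, and `memoryview` shows what a consumer sees for unsigned 64-bit items. The `'Q'` type code is the one the `intvector.hpp` hunk in this commit assigns to width 64:

```python
import array

# Stand-in for a 64-bit vector: an array of unsigned 64-bit items ('Q').
# This is NOT a pysdsl object, only an exporter with the same item layout.
source = array.array('Q', [1, 2, 3])

# memoryview reads the exporter's buffer without copying the data.
view = memoryview(source)

item_format = view.format    # struct-style type code of each item
item_size = view.itemsize    # bytes per item
values = view.tolist()       # items decoded back into Python ints
```

Any object that exports such a buffer can be consumed this way, which is presumably why `array.array('Q', v)` works on an `Int64Vector`.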

## Immutable compressed integer vectors

(See `pysdsl.enc_vector`):

* `EncVectorEliasDelta(IntVector)`
* `EncVectorEliasGamma(IntVector)`
* `EncVectorFibonacci(IntVector)`
@@ -66,41 +79,51 @@ In [10]: ev.size_in_mega_bytes
Out[10]: 45.75003242492676
```

Encoding values with variable length codes:
Encoding values with variable length codes (see `pysdsl.variable_length_codes_vector`):

* `VlcVectorEliasDelta(IntVector)`
* `VlcVectorEliasGamma(IntVector)`
* `VlcVectorFibonacci(IntVector)`
* `VlcVectorComma2(IntVector)`
* `VlcVectorComma4(IntVector)`
* `VariableLengthCodesVectorEliasDelta(IntVector)`
* `VariableLengthCodesVectorEliasGamma(IntVector)`
* `VariableLengthCodesVectorFibonacci(IntVector)`
* `VariableLengthCodesVectorComma2(IntVector)`
* `VariableLengthCodesVectorComma4(IntVector)`
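
To make "variable length codes" concrete, here is a minimal pure-Python sketch of the textbook Elias-gamma code. It is an illustration only, not how pysdsl or SDSL store data internally, and the function names are made up for this example:

```python
def elias_gamma_encode(n: int) -> str:
    """Return the Elias-gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias-gamma codes are defined for n >= 1")
    binary = bin(n)[2:]  # binary digits; the leading bit is always 1
    # Prefix with (length - 1) zeros so a decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> list:
    """Decode a concatenation of Elias-gamma codes back into integers."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":  # count the unary length prefix
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values


encoded = "".join(elias_gamma_encode(n) for n in [1, 2, 5, 17])
decoded = elias_gamma_decode(encoded)
```

Small values get short codes (`1` takes a single bit), which is why such codes pay off when most stored integers are small.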

Encoding values with "escaping" technique:
Encoding values with "escaping" technique (see `pysdsl.direct_accessible_codes_vector`):

* `DacVector(IntVector)`
* `DacVectorDP(IntVector)` — number of layers is chosen
with dynamic programming
* `DirectAccessibleCodesVector(IntVector)`
* `DirectAccessibleCodesVector8(IntVector)`,
* `DirectAccessibleCodesVector16(IntVector)`,
* `DirectAccessibleCodesVector63(IntVector)`,
* `DirectAccessibleCodesVectorDP(IntVector)` — number of layers is chosen
with dynamic programming
* `DirectAccessibleCodesVectorDPRRR(IntVector)` — same but built on top of
RamanRamanRaoVector (see later)

Construction from python sequences is also supported.

## Immutable compressed bit (boolean) vectors

* `BitVectorIL64(BitVector)`
* `BitVectorIL128(BitVector)`
* `BitVectorIL256(BitVector)`
* `BitVectorIL512(BitVector)` — A bit vector which interleaves the
original `BitVector` with rank information.
(See `pysdsl.all_immutable_bitvectors`)

* `BitVectorInterLeaved64(BitVector)`
* `BitVectorInterLeaved128(BitVector)`
* `BitVectorInterLeaved256(BitVector)`
* `BitVectorInterLeaved512(BitVector)` — A bit vector which interleaves the
original `BitVector` with rank information
(see later)
* `SDVector(BitVector)` — A bit vector which compresses very sparse populated
bit vectors by representing the positions of 1 by the
Elias-Fano representation for
non-decreasing sequences
* `RRRVector3(BitVector)`
* `RRRVector15(BitVector)`
* `RRRVector63(BitVector)`
* `RRRVector256(BitVector)` — An H₀-compressed bitvector representation.
* `RamanRamanRaoVector15(BitVector)`
* `RamanRamanRaoVector63(BitVector)`
* `RamanRamanRaoVector256(BitVector)` — An H₀-compressed bitvector representation.
* `HybVector8(BitVector)`
* `HybVector16(BitVector)` — A hybrid-encoded compressed bitvector
representation

See also: `pysdsl.raman_raman_rao_vectors`, `pysdsl.sparse_bit_vectors`,
`pysdsl.hybrid_bit_vectors` and `pysdsl.bit_vector_interleaved`.

## Rank and select operations on bitvectors

For bitvector `v` `rank(i)` for pattern `P` (by default `P` is a bitstring of
@@ -134,6 +157,22 @@ the results.
mutable and was modified.


## Wavelet trees

The wavelet tree is a data structure that provides three efficient methods:

* The `[]`-operator: `wt[i]` returns the `i`-th symbol of the vector for which the wavelet tree was built.
* The rank method: `wt.rank(i, c)` returns the number of occurrences of symbol `c` in the prefix `[0..i-1]` of the vector for which the wavelet tree was built.
* The select method: `wt.select(j, c)` returns the index `i` from `[0..size()-1]` of the `j`-th occurrence of symbol `c`.
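
The semantics of the operations listed above can be pinned down with a naive reference implementation. This is only a sketch in plain Python: it scans the sequence in linear time, whereas a wavelet tree answers the same queries much faster. The free functions `rank` and `select` are hypothetical helpers, not the pysdsl methods, and the sketch assumes occurrences are counted from 1:

```python
def rank(sequence, i, c):
    """Number of occurrences of symbol c in the prefix sequence[0..i-1]."""
    return sum(1 for symbol in sequence[:i] if symbol == c)


def select(sequence, j, c):
    """Index of the j-th occurrence of symbol c (j counted from 1)."""
    seen = 0
    for index, symbol in enumerate(sequence):
        if symbol == c:
            seen += 1
            if seen == j:
                return index
    raise ValueError(f"{c!r} occurs fewer than {j} times")


text = "abracadabra"
a_in_first_four = rank(text, 4, "a")   # counts 'a' in "abra"
third_a = select(text, 3, "a")         # position of the third 'a'
```

Under these definitions the two operations are inverse in the sense that `rank(text, select(text, j, c), c) == j - 1`.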

## Compressed suffix arrays

A suffix array is a sorted array of all suffixes of a string.

SDSL supports bitcompressed and compressed suffix arrays.

The byte representation of the original IntVector must contain no zero symbols in order to construct a SuffixArray.
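
As a rough illustration of what a suffix array stores and how `count` and `locate` style queries can be answered from it, here is a naive pure-Python sketch. It builds an uncompressed array by sorting suffixes directly, which is far slower and larger than what SDSL does, and the helper names and signatures below are assumptions made for this example rather than the pysdsl API:

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def locate(text, suffix_array, pattern):
    """Sorted start positions of every occurrence of pattern in text."""
    # Truncating sorted suffixes to len(pattern) keeps them sorted,
    # so the matching suffixes form one contiguous range.
    prefixes = [text[i:i + len(pattern)] for i in suffix_array]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(suffix_array[lo:hi])


def count(text, suffix_array, pattern):
    """Number of occurrences of pattern in text."""
    return len(locate(text, suffix_array, pattern))


text = "banana"
sa = build_suffix_array(text)
positions = locate(text, sa, "ana")
occurrences = count(text, sa, "ana")
```

Because all suffixes that start with the pattern are adjacent in sorted order, two binary searches are enough to find every occurrence.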

## Objects memory structure

Any object has a `.structure` property with technical information about an
@@ -151,22 +190,6 @@ object into a file.
All classes provide `.load_from_checkded_file()` static method allowing one to
load object stored with `.store_to_checked_file()`

## Wavelet trees

The wavelet tree is a data structure that provides three efficient methods:

* The `[]`-operator: `wt[i]` returns the `i`-th symbol of the vector for which the wavelet tree was built.
* The rank method: `wt.rank(i, c)` returns the number of occurrences of symbol `c` in the prefix `[0..i-1]` of the vector for which the wavelet tree was built.
* The select method: `wt.select(j, c)` returns the index `i` from `[0..size()-1]` of the `j`-th occurrence of symbol `c`.

## Compressed suffix arrays

A suffix array is a sorted array of all suffixes of a string.

SDSL supports bitcompressed and compressed suffix arrays.

The byte representation of the original IntVector must contain no zero symbols in order to construct a SuffixArray.


## Building

8 changes: 4 additions & 4 deletions pysdsl/__init__.cpp
@@ -1,6 +1,9 @@
#include <cstdint>
#include <string>
#include <tuple>
#include <stdexcept>

#define assert(x) if(!(x)) {throw std::runtime_error("assertion failed");}

#include <sdsl/vectors.hpp>

@@ -14,7 +17,6 @@
#include "types/suffixarray.hpp"
#include "types/wavelet.hpp"


namespace py = pybind11;


@@ -31,7 +33,7 @@ PYBIND11_MODULE(pysdsl, m)

auto enc_classes = add_encoded_vectors(m);

auto wavelet_classes = add_wavelet(m);
auto wavelet_classes = add_wavelet(m, compressed_bit_vector_classes);

auto csa_classes = add_csa(m);

@@ -56,7 +58,6 @@ PYBIND11_MODULE(pysdsl, m)
for_each_in_tuple(wavelet_classes, make_inits_many_functor(iv_classes));
#ifndef NOCROSSCONSTRUCTORS
for_each_in_tuple(wavelet_classes, make_inits_many_functor(enc_classes));
for_each_in_tuple(wavelet_classes, make_inits_many_functor(compressed_bit_vector_classes));
for_each_in_tuple(wavelet_classes,
make_inits_many_functor(wavelet_classes));
#endif
@@ -67,6 +68,5 @@ PYBIND11_MODULE(pysdsl, m)
//for_each_in_tuple(sd_classes, make_pysequence_init_functor());

for_each_in_tuple(wavelet_classes, make_pysequence_init_functor());

for_each_in_tuple(csa_classes, make_pysequence_init_functor());
}
14 changes: 11 additions & 3 deletions pysdsl/operations/creation.hpp
@@ -22,12 +22,20 @@ namespace py = pybind11;
namespace detail
{

template <class T, typename value_type = typename T::value_type>
template <class T, typename value_type = typename T::value_type,
bool is_bitvector1 = std::is_same<sdsl::int_vector<1>, T>::value>
struct IntermediateVector { using type = sdsl::int_vector<>; };


template <class T>
struct IntermediateVector<T, bool> { using type = sdsl::int_vector<1>; };
template <class T, bool b>
struct IntermediateVector<T, bool, b> { using type = sdsl::int_vector<1>; };


template <uint8_t N, typename value_type>
struct IntermediateVector<sdsl::int_vector<N>, value_type, false>
{
using type = sdsl::int_vector<N>;
};


template <
53 changes: 40 additions & 13 deletions pysdsl/types/intvector.hpp
@@ -18,12 +18,42 @@
namespace py = pybind11;


template <class T, typename S = typename T::value_type, typename KEY>
inline
auto add_int_class(py::module& m, py::dict& dict, KEY key,
const char *name, const char *doc = nullptr)
template <class T,
unsigned int width = static_cast<unsigned int>(T::fixed_int_width)>
inline auto add_int_init(py::module& m, const char* name)
{
auto cls = py::class_<T>(m, name)
if (width == 8 || width == 16 || width == 32 || width == 64)
{
return py::class_<T>(m, name, py::buffer_protocol())
.def_buffer([] (T& self) {
char sym;
if (width == 8) {
sym = 'B'; }
else if (width == 16) {
sym = 'H'; }
else if (width == 32) {
sym = 'I'; }
else if (width == 64) {
sym = 'Q'; }

return py::buffer_info(
reinterpret_cast<void*>(self.data()),
width / 8,
std::string(1, sym),
1,
{ detail::size(self) },
{ width / 8 }
); });
}
return py::class_<T>(m, name);
}


template <class T, typename S = typename T::value_type, typename KEY_T>
inline auto add_int_class(py::module& m, py::dict& dict, KEY_T key,
const char *name, const char *doc = nullptr)
{
auto cls = add_int_init<T>(m, name)
.def_property_readonly("width", (uint8_t(T::*)(void) const) & T::width)
.def_property_readonly("data",
(const uint64_t *(T::*)(void)const) & T::data)
@@ -40,10 +70,8 @@ auto add_int_class(py::module& m, py::dict& dict, KEY key,
.def(
"__setitem__",
[](T &self, size_t position, S value) {
if (position >= self.size())
{
throw std::out_of_range(std::to_string(position));
}
if (position >= self.size()) {
throw std::out_of_range(std::to_string(position)); }
self[position] = value; })

.def("set_to_id",
@@ -141,8 +169,7 @@ auto add_int_class(py::module& m, py::dict& dict, KEY key,
}


inline
auto add_int_vectors(py::module& m)
inline auto add_int_vectors(py::module& m)
{
py::dict int_vectors_dict;

@@ -183,14 +210,14 @@ auto add_int_vectors(py::module& m)
"Flip all bits of bit_vector",
py::call_guard<py::gil_scoped_release>()),

add_int_class<sdsl::int_vector<4>, uint8_t>(
add_int_class<sdsl::int_vector<4>, uint16_t>(
m, int_vectors_dict, 4, "Int4Vector")
.def(py::init(
[](size_t size, uint8_t default_value) {
return sdsl::int_vector<4>(size, default_value, 4); }),
py::arg("size") = 0, py::arg("default_value") = 0),

add_int_class<sdsl::int_vector<8>, uint8_t>(
add_int_class<sdsl::int_vector<8>, uint16_t>(
m, int_vectors_dict, 8, "Int8Vector")
.def(py::init(
[](size_t size, uint8_t default_value) {