restructure names and update readme, provide count, extract, locate methods in suffix trees
QratorLabs · Jul 23, 2018 · f15e850
1 parent 631a16d
commit f15e850

Showing 8 changed files with 315 additions and 139 deletions.
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -6,10 +6,10 @@ Most of examples from [SDSL cheat sheet][SDSL-CHEAT-SHEET] and [SDSL tutorial][S
 
 ## Mutable bit-compressed vectors
 
-Core classes:
+Core classes (see `pysdsl.int_vector` for dict of all of them):
 
  * `pysdsl.IntVector(size, default_value, bit_width=64)` — dynamic bit width
- * `pysdsl.BitVector(size, default_value)` — static bit width (1)
+ * `pysdsl.BitVector(size, default_value)` — static (fixed) bit width (1)
  * `pysdsl.Int4Vector(size, default_value)` — static bit width (4)
  * `pysdsl.Int8Vector(size, default_value)` — static bit width (8)
  * `pysdsl.Int16Vector(size, default_value)` — static bit width (16)
@@ -49,8 +49,21 @@ Out[8]: 896.0000085830688
 
 ```
 
+Buffer interface:
+
+```python
+In [9]: import array
+
+In [10]: v = pysdsl.Int64Vector([1, 2, 3])
+
+In [11]: array.array('Q', v)
+Out[11]: array('Q', [1, 2, 3])
+```
+
 ## Immutable compressed integer vectors
 
+(See `pysdsl.enc_vector`):
+
  * `EncVectorEliasDelta(IntVector)`
  * `EncVectorEliasGamma(IntVector)`
  * `EncVectorFibonacci(IntVector)`
@@ -66,41 +79,51 @@ In [10]: ev.size_in_mega_bytes
 Out[10]: 45.75003242492676
 ```
 
-Encoding values with variable length codes:
+Encoding values with variable length codes (see `pysdsl.variable_length_codes_vector`):
 
- * `VlcVectorEliasDelta(IntVector)`
- * `VlcVectorEliasGamma(IntVector)`
- * `VlcVectorFibonacci(IntVector)`
- * `VlcVectorComma2(IntVector)`
- * `VlcVectorComma4(IntVector)`
+ * `VariableLengthCodesVectorEliasDelta(IntVector)`
+ * `VariableLengthCodesVectorEliasGamma(IntVector)`
+ * `VariableLengthCodesVectorFibonacci(IntVector)`
+ * `VariableLengthCodesVectorComma2(IntVector)`
+ * `VariableLengthCodesVectorComma4(IntVector)`
 
-Encoding values with "escaping" technique:
+Encoding values with "escaping" technique (see `pysdsl.direct_accessible_codes_vector`):
 
- * `DacVector(IntVector)`
- * `DacVectorDP(IntVector)` — number of layers is chosen
-                              with dynamic programming
+ * `DirectAccessibleCodesVector(IntVector)`
+ * `DirectAccessibleCodesVector8(IntVector)`
+ * `DirectAccessibleCodesVector16(IntVector)`
+ * `DirectAccessibleCodesVector63(IntVector)`
+ * `DirectAccessibleCodesVectorDP(IntVector)` — number of layers is chosen
+                                                with dynamic programming
+ * `DirectAccessibleCodesVectorDPRRR(IntVector)` — same but built on top of
+                                                   RamanRamanRaoVector (see later)
 
 Construction from python sequences is also supported.
 
 ## Immutable compressed bit (boolean) vectors
 
- * `BitVectorIL64(BitVector)`
- * `BitVectorIL128(BitVector)`
- * `BitVectorIL256(BitVector)`
- * `BitVectorIL512(BitVector)` — A bit vector which interleaves the
-                                 original `BitVector` with rank information.
+(See `pysdsl.all_immutable_bitvectors`)
+
+ * `BitVectorInterLeaved64(BitVector)`
+ * `BitVectorInterLeaved128(BitVector)`
+ * `BitVectorInterLeaved256(BitVector)`
+ * `BitVectorInterLeaved512(BitVector)` — A bit vector which interleaves the
+                                          original `BitVector` with rank information
+                                          (see later)
  * `SDVector(BitVector)` — A bit vector which compresses very sparse populated
                            bit vectors by representing the positions of 1 by the
                            Elias-Fano representation for
                            non-decreasing sequences
- * `RRRVector3(BitVector)`
- * `RRRVector15(BitVector)`
- * `RRRVector63(BitVector)`
- * `RRRVector256(BitVector)` — An H₀-compressed bitvector representation.
+ * `RamanRamanRaoVector15(BitVector)`
+ * `RamanRamanRaoVector63(BitVector)`
+ * `RamanRamanRaoVector256(BitVector)` — An H₀-compressed bitvector representation.
  * `HybVector8(BitVector)`
  * `HybVector16(BitVector)` — A hybrid-encoded compressed bitvector
                               representation
 
+See also: `pysdsl.raman_raman_rao_vectors`, `pysdsl.sparse_bit_vectors`,
+`pysdsl.hybrid_bit_vectors` and `pysdsl.bit_vector_interleaved`.
+
 ## Rank and select operations on bitvectors
 
 For bitvector `v` `rank(i)` for pattern `P` (by default `P` is a bitstring of
@@ -134,6 +157,22 @@ the results.
 mutable and was modified.
 
 
+## Wavelet trees
+
+The wavelet tree is a data structure that provides three efficient methods:
+
+* The `[]`-operator: `wt[i]` returns the `i`-th symbol of the vector for which the wavelet tree was built.
+* The rank method: `wt.rank(i, c)` returns the number of occurrences of symbol `c` in the prefix `[0..i-1]` of the vector for which the wavelet tree was built.
+* The select method: `wt.select(j, c)` returns the index `i` from `[0..size()-1]` of the `j`-th occurrence of symbol `c`.
+
+## Compressed suffix arrays
+
+A suffix array is a sorted array of all suffixes of a string.
+
+SDSL supports bit-compressed and compressed suffix arrays.
+
+The byte representation of the original IntVector must contain no zero symbols in order to construct a SuffixArray.
+
 ## Objects memory structure
 
 Any object has a `.structure` property with technical information about an
@@ -151,22 +190,6 @@ object into a file.
 All classes provide `.load_from_checkded_file()` static method allowing one to
 load object stored  with `.store_to_checked_file()`
 
-## Wavelet trees
-
-The wavelet tree is a data structure that provides three efficient methods:
-
-* The `[]`-operator: `wt[i]` returns the `i`-th symbol of vector for which the wavelet tree was build for.
-* The rank method: `wt.rank(i, c)` returns the number of occurrences of symbol `c` in the prefix `[0..i-1]` in the vector for which the wavelet tree was build for.
-* The select method: `wt.select(j, c)` returns the index `i` from `[0..size()-1]` of the `j`-th occurrence of symbol `c`.
-
-## Comressed suffix arrays
-
-Suffix array is a sorted array of all suffixes of a string.
-
-SDSL supports bitcompressed and compressed suffix arrays.
-
-Byte representaion of original IntVector should have no zero symbols in order to construct SuffixArray.
-
 
 ## Building
 
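The wavelet tree section added to the README in this hunk describes three operations. As a reading aid, here is a plain-Python reference for their semantics on an ordinary list. It is not the pysdsl implementation and does not use pysdsl at all; the function names `access`, `rank` and `select` are illustrative, and the 1-based `j` in `select` is an assumption carried over from the SDSL C++ API.

```python
from typing import Sequence


def access(seq: Sequence[int], i: int) -> int:
    """Reference for wt[i]: the i-th symbol of the underlying vector."""
    return seq[i]


def rank(seq: Sequence[int], i: int, c: int) -> int:
    """Reference for wt.rank(i, c): occurrences of c in the prefix seq[0..i-1]."""
    return sum(1 for symbol in seq[:i] if symbol == c)


def select(seq: Sequence[int], j: int, c: int) -> int:
    """Reference for wt.select(j, c): index of the j-th occurrence of c.

    Assumption: j counts from 1, as in the SDSL C++ API.
    """
    if j < 1:
        raise ValueError("j must be >= 1")
    seen = 0
    for index, symbol in enumerate(seq):
        if symbol == c:
            seen += 1
            if seen == j:
                return index
    raise ValueError("fewer than j occurrences of c")


text = [2, 1, 2, 3, 1, 2]
print(rank(text, 3, 2))    # occurrences of 2 in text[0..2]
print(select(text, 2, 1))  # index of the second 1
```

A wavelet tree answers the same queries without scanning the sequence; the linear-time versions above only pin down what the answers should be.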

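The README text above defines a suffix array and requires that the input contain no zero symbols. A minimal sketch of why such a restriction is natural, assuming (this is not stated in the diff) that the zero byte is reserved as an end-of-text sentinel that sorts before every other symbol. `naive_suffix_array` and `locate` are hypothetical helper names, not pysdsl functions, and the construction is the naive sort, not what SDSL does internally.

```python
def naive_suffix_array(text: bytes) -> list:
    """Sort all suffix start positions of text plus a zero sentinel."""
    if 0 in text:
        # A zero byte inside the text would be indistinguishable
        # from the sentinel appended below.
        raise ValueError("text must not contain zero symbols")
    terminated = text + b"\0"
    return sorted(range(len(terminated)), key=lambda i: terminated[i:])


def locate(text: bytes, pattern: bytes) -> list:
    """Return the sorted start positions of pattern in text (naive scan)."""
    suffixes = naive_suffix_array(text)
    width = len(pattern)
    return sorted(i for i in suffixes if text[i:i + width] == pattern)


print(naive_suffix_array(b"banana"))  # [6, 5, 3, 1, 0, 4, 2]
print(locate(b"banana", b"ana"))      # [1, 3]
```

Counting occurrences is then `len(locate(text, pattern))`; a real compressed suffix array would answer both queries without materialising the suffixes.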
diff --git a/pysdsl/__init__.cpp b/pysdsl/__init__.cpp
@@ -1,6 +1,9 @@
 #include <cstdint>
 #include <string>
 #include <tuple>
+#include <stdexcept>
+
+#define assert(x) if(!(x)) {throw std::runtime_error("assertion failed");}
 
 #include <sdsl/vectors.hpp>
 
@@ -14,7 +17,6 @@
 #include "types/suffixarray.hpp"
 #include "types/wavelet.hpp"
 
-
 namespace py = pybind11;
 
 
@@ -31,7 +33,7 @@ PYBIND11_MODULE(pysdsl, m)
 
     auto enc_classes = add_encoded_vectors(m);
 
-    auto wavelet_classes = add_wavelet(m);
+    auto wavelet_classes = add_wavelet(m, compressed_bit_vector_classes);
 
     auto csa_classes = add_csa(m);
 
@@ -56,7 +58,6 @@ PYBIND11_MODULE(pysdsl, m)
     for_each_in_tuple(wavelet_classes, make_inits_many_functor(iv_classes));
 #ifndef NOCROSSCONSTRUCTORS
     for_each_in_tuple(wavelet_classes, make_inits_many_functor(enc_classes));
-    for_each_in_tuple(wavelet_classes, make_inits_many_functor(compressed_bit_vector_classes));
     for_each_in_tuple(wavelet_classes,
                       make_inits_many_functor(wavelet_classes));
 #endif
@@ -67,6 +68,5 @@ PYBIND11_MODULE(pysdsl, m)
     //for_each_in_tuple(sd_classes, make_pysequence_init_functor());
 
     for_each_in_tuple(wavelet_classes, make_pysequence_init_functor());
-
     for_each_in_tuple(csa_classes, make_pysequence_init_functor());
 }
diff --git a/pysdsl/operations/creation.hpp b/pysdsl/operations/creation.hpp
@@ -22,12 +22,20 @@ namespace py = pybind11;
 namespace detail
 {
 
-template <class T, typename value_type = typename T::value_type>
+template <class T, typename value_type = typename T::value_type,
+          bool is_bitvector1 = std::is_same<sdsl::int_vector<1>, T>::value>
 struct IntermediateVector { using type = sdsl::int_vector<>; };
 
 
-template <class T>
-struct IntermediateVector<T, bool> { using type = sdsl::int_vector<1>; };
+template <class T, bool b>
+struct IntermediateVector<T, bool, b> { using type = sdsl::int_vector<1>; };
+
+
+template <uint8_t N, typename value_type>
+struct IntermediateVector<sdsl::int_vector<N>, value_type, false>
+{
+    using type = sdsl::int_vector<N>;
+};
 
 
 template <

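The `intvector.hpp` hunk that follows exposes the 8-, 16-, 32- and 64-bit vectors through the Python buffer protocol, using the format codes `B`, `H`, `I` and `Q` with an item size of `width / 8` bytes. The stdlib-only sketch below checks that this width-to-code mapping is self-consistent and shows the layout a consumer sees for a one-dimensional `Q` buffer. It uses `array.array` as a stand-in for a pysdsl vector, and `WIDTH_TO_FORMAT` is a name made up for the illustration.

```python
import array
import struct

# Mapping used in the C++ hunk: bit width -> struct format code.
WIDTH_TO_FORMAT = {8: "B", 16: "H", 32: "I", 64: "Q"}

# In struct's standard-size mode ("=") each code occupies exactly width / 8 bytes.
ITEM_SIZES = {
    width: struct.calcsize("=" + code)
    for width, code in WIDTH_TO_FORMAT.items()
}

# Stand-in for a 64-bit vector: a contiguous array of unsigned 64-bit values.
values = array.array("Q", [1, 2, 3])
view = memoryview(values)

# Format, item size, dimensions, shape and strides describe the buffer layout.
layout = (view.format, view.itemsize, view.ndim, view.shape, view.strides)
print(ITEM_SIZES)
print(layout)
```

Any consumer of the buffer protocol (`memoryview`, `array.array`, NumPy) relies on these fields agreeing with the actual memory layout, which is why only the byte-aligned widths get a buffer definition in the hunk.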
diff --git a/pysdsl/types/intvector.hpp b/pysdsl/types/intvector.hpp
@@ -18,12 +18,42 @@
 namespace py = pybind11;
 
 
-template <class T, typename S = typename T::value_type, typename KEY>
-inline
-auto add_int_class(py::module& m, py::dict& dict, KEY key,
-                   const char *name, const char *doc = nullptr)
+template <class T,
+          unsigned int width = static_cast<unsigned int>(T::fixed_int_width)>
+inline auto add_int_init(py::module& m, const char* name)
 {
-    auto cls = py::class_<T>(m, name)
+    if (width == 8 || width == 16 || width == 32 || width == 64)
+    {
+        return py::class_<T>(m, name, py::buffer_protocol())
+            .def_buffer([] (T& self) {
+                char sym = 'B';  // default; overwritten below for wider types
+                if (width == 8) {
+                    sym = 'B'; }
+                else if (width == 16) {
+                    sym = 'H'; }
+                else if (width == 32) {
+                    sym = 'I'; }
+                else if (width == 64) {
+                    sym = 'Q'; }
+
+                return py::buffer_info(
+                    reinterpret_cast<void*>(self.data()),
+                    width / 8,
+                    std::string(1, sym),
+                    1,
+                    { detail::size(self) },
+                    { width / 8 }
+                ); });
+    }
+    return py::class_<T>(m, name);
+}
+
+
+template <class T, typename S = typename T::value_type, typename KEY_T>
+inline auto add_int_class(py::module& m, py::dict& dict, KEY_T key,
+                          const char *name, const char *doc = nullptr)
+{
+    auto cls = add_int_init<T>(m, name)
         .def_property_readonly("width", (uint8_t(T::*)(void) const) & T::width)
         .def_property_readonly("data",
                                (const uint64_t *(T::*)(void)const) & T::data)
@@ -40,10 +70,8 @@ auto add_int_class(py::module& m, py::dict& dict, KEY key,
         .def(
             "__setitem__",
             [](T &self, size_t position, S value) {
-                if (position >= self.size())
-                {
-                    throw std::out_of_range(std::to_string(position));
-                }
+                if (position >= self.size()) {
+                    throw std::out_of_range(std::to_string(position)); }
                 self[position] = value; })
 
         .def("set_to_id",
@@ -141,8 +169,7 @@ auto add_int_class(py::module& m, py::dict& dict, KEY key,
 }
 
 
-inline
-auto add_int_vectors(py::module& m)
+inline auto add_int_vectors(py::module& m)
 {
     py::dict int_vectors_dict;
 
@@ -183,14 +210,14 @@ auto add_int_vectors(py::module& m)
                  "Flip all bits of bit_vector",
                  py::call_guard<py::gil_scoped_release>()),
 
-        add_int_class<sdsl::int_vector<4>, uint8_t>(
+        add_int_class<sdsl::int_vector<4>, uint16_t>(
                 m, int_vectors_dict, 4, "Int4Vector")
             .def(py::init(
                 [](size_t size, uint8_t default_value) {
                     return sdsl::int_vector<4>(size, default_value, 4); }),
                 py::arg("size") = 0, py::arg("default_value") = 0),
 
-        add_int_class<sdsl::int_vector<8>, uint8_t>(
+        add_int_class<sdsl::int_vector<8>, uint16_t>(
                 m, int_vectors_dict, 8, "Int8Vector")
             .def(py::init(
                 [](size_t size, uint8_t default_value) {