maintenance update for RLE posts #30

Open · wants to merge 3 commits into base: master
14 changes: 7 additions & 7 deletions _posts/2019-12-01-rle-array.markdown
@@ -80,13 +80,13 @@ Run-length encoding is a simple yet powerful technique. Instead of storing array
so called "runs" --- consecutive elements of the array where the same value is stored. For each run, it then just keeps
its value and length:

![run-length encoding, step 1](/assets/images/2019-12-01-rle-array/rle_array1.png)
![run-length encoding, step 1](/assets/images/2019-12-01-rle-array/rle_array1.svg)
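
As a rough illustration of this first step (a sketch of the idea, not rle-array's actual implementation), splitting an array into runs and keeping one (value, length) pair per run could look like this:

```python
import numpy as np

def rle_encode(arr):
    """Sketch: encode a 1-D array as the (values, lengths) of its runs."""
    arr = np.asarray(arr)
    # a new run starts at position 0 and wherever the value changes
    starts = np.flatnonzero(np.concatenate(([True], arr[1:] != arr[:-1])))
    values = arr[starts]
    lengths = np.diff(np.concatenate((starts, [len(arr)])))
    return values, lengths

values, lengths = rle_encode([1, 1, 1, 2, 2, 3, 3, 3, 3])
# values  -> array([1, 2, 3])
# lengths -> array([3, 2, 4])
```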

Pandas requires us to be able to do quick [random access](https://en.wikipedia.org/wiki/Random_access), e.g. for
sorting and group-by operations. Instead of the actual run-lengths, we store the end positions of each run (this is the
cumulative sum of the lengths):

![run-length encoding, step 2](/assets/images/2019-12-01-rle-array/rle_array2.png)
![run-length encoding, step 2](/assets/images/2019-12-01-rle-array/rle_array2.svg)

This way, we can use [binary search](https://en.wikipedia.org/wiki/Binary_search_algorithm) to implement random access.
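
Assuming the `(values, ends)` layout above, element access is then a binary search over the run end positions, e.g. with numpy's `searchsorted` (again only a sketch):

```python
import numpy as np

values = np.array([1, 2, 3])
ends = np.array([3, 5, 9])  # cumulative sum of the run lengths

def rle_getitem(i):
    """Sketch: return element i of the decoded array via binary search."""
    # the element belongs to the first run whose end position is greater than i
    run = np.searchsorted(ends, i, side="right")
    return values[run]

rle_getitem(4)  # -> 2, since the decoded array is [1, 1, 1, 2, 2, 3, 3, 3, 3]
```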

@@ -119,7 +119,7 @@ created as follows:

The whole setup can also be visualized:

![cube](/assets/images/2019-12-01-rle-array/cube.png)
![cube](/assets/images/2019-12-01-rle-array/cube.svg)

You can generate the same data using
[`rle_array.testing.generate_test_dataframe`](https://jdasoftwaregroup.github.io/rle-array/_rst/rle_array.testing.html#rle_array.testing.generate_test_dataframe).
@@ -162,7 +162,7 @@ encouraged to try these and others.
Dictionary encoding replaces the actual payload data with a mapping. The trick is that mapped values can often be more
memory-efficient, especially when the original values are long (e.g. strings) and repeated multiple times:

![dictionary encoding memory layout](/assets/images/2019-12-01-rle-array/dictionary_encoding.png)
![dictionary encoding memory layout](/assets/images/2019-12-01-rle-array/dictionary_encoding.svg)
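
As a sketch of the effect (example data made up), `pd.factorize` produces such integer codes plus the mapping by hand:

```python
import pandas as pd

values = pd.Series(["frankfurt", "hamburg", "frankfurt", "hamburg"] * 250_000)

# integer codes plus a single copy of each distinct string
codes, mapping = pd.factorize(values)
print(codes[:4])      # [0 1 0 1]
print(list(mapping))  # ['frankfurt', 'hamburg']

print(values.memory_usage(deep=True))  # tens of megabytes of repeated strings
print(codes.nbytes)                    # 8 MB of int64 codes (even less with a smaller dtype)
```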

This is what [Pandas Categoricals](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) implement.
For data-at-rest, this is implemented by
@@ -187,7 +187,7 @@ This distinction between semantics and data size is also made by the

Here is what this looks like in memory (on [big endian machines](https://en.wikipedia.org/wiki/Endianness)):

![data types memory layout](/assets/images/2019-12-01-rle-array/data_types.png)
![data types memory layout](/assets/images/2019-12-01-rle-array/data_types.svg)

In this example, we can easily use 16 bits per element instead of 64, resulting in a 75% memory reduction.
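
A quick numpy illustration of that saving (example array assumed; the cast is only safe when every value actually fits into the smaller type):

```python
import numpy as np

a64 = np.ones(1_000_000, dtype=np.int64)
a16 = a64.astype(np.int16)  # safe here because every value fits into 16 bits

print(a64.nbytes)  # 8000000 bytes
print(a16.nbytes)  # 2000000 bytes, i.e. 75% less memory
```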

@@ -198,7 +198,7 @@ noticeable exceptions due to the lacking hardware support on most CPUs), it also
### Bit-packing
Bit-packing is similar to [Data Types](#data-types), but allows creating types with non-standard widths:

![bit packing memory layout](/assets/images/2019-12-01-rle-array/bit_packing.png)
![bit packing memory layout](/assets/images/2019-12-01-rle-array/bit_packing.svg)
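
For boolean data, numpy's `packbits`/`unpackbits` illustrate the idea: eight flags end up in a single byte instead of eight (a sketch, not how any particular extension array stores its data):

```python
import numpy as np

flags = np.random.rand(1_000_000) < 0.5  # boolean array, one byte per element
packed = np.packbits(flags)              # eight flags per byte

print(flags.nbytes)   # 1000000 bytes
print(packed.nbytes)  # 125000 bytes

# values can still be recovered, at the cost of the extra unpacking work
recovered = np.unpackbits(packed, count=len(flags)).astype(bool)
assert (recovered == flags).all()
```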

The advantage is that you can save even more memory, but it comes with heavy performance penalties, since
CPUs cannot read unaligned data that efficiently. In some cases, however, it can even be faster due to the saved memory
@@ -212,7 +212,7 @@ Often we find columns in our DataFrames where information only occurs for a very
is often more efficient to explicitly store and look up these few cases --- e.g. by using a
[HashTable](https://en.wikipedia.org/wiki/Hash_table) --- than to use a simple array:

![sparse data memory layout](/assets/images/2019-12-01-rle-array/sparse_data.png)
![sparse data memory layout](/assets/images/2019-12-01-rle-array/sparse_data.svg)
Contributor comment:

Just a nitpick and the only thing that caught my eye: the space between the "values" and "mapping" blocks looks a bit too big, they seem "unrelated". But again, just a nitpick.
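
A minimal sketch of that look-up idea, using a plain Python dict as the hash table (hypothetical example data, not pandas' actual implementation):

```python
data = [0, 0, 0, 7, 0, 0, 0, 0, 3, 0]
default = 0

# store only the positions that differ from the default value
exceptions = {i: v for i, v in enumerate(data) if v != default}  # {3: 7, 8: 3}

def sparse_getitem(i):
    """Sketch: O(1) hash-table look-up instead of storing every element."""
    return exceptions.get(i, default)

sparse_getitem(3)  # -> 7
sparse_getitem(5)  # -> 0
```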


This is what [Pandas SparseArray](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) implements. Note
that the default value does not need to be `0`, but can be an arbitrary element. One downside of sparse arrays is that
4 changes: 4 additions & 0 deletions assets/css/main.scss
@@ -77,6 +77,10 @@ h6
padding-right: 0;
}

.page__content img[src$=".svg"] {
width: 80%;
}

.page__footer
{
background-color: $primary-color;
Binary file modified assets/images/2019-12-01-rle-array.graffle
Binary file not shown.
Binary file removed assets/images/2019-12-01-rle-array/bit_packing.png
Binary file not shown.
480 changes: 480 additions & 0 deletions assets/images/2019-12-01-rle-array/bit_packing.svg
Binary file removed assets/images/2019-12-01-rle-array/cube.png
Binary file not shown.
89 changes: 89 additions & 0 deletions assets/images/2019-12-01-rle-array/cube.svg
Binary file removed assets/images/2019-12-01-rle-array/data_types.png
Binary file not shown.