Support bitshuffle filter for blosc compressor #1
Conversation
Thanks for this PR!
I made some inline comments. Unfortunately I'm advocating for removal of most of the suggested changes (in favor of keeping the compression keyword arguments matching those expected by blosc compress/decompress).
asdf doesn't really provide a good (or any?) mechanism for documenting compression keyword arguments (nor does it save them to the asdf file). Until this can be improved (and perhaps even after), it makes sense to me to keep the arguments consistent, by default, with the library providing the compression.
I'm open to discussing this further and am not at all opposed to making changes to asdf to make any of this more straightforward to use/develop.
```diff
@@ -7,10 +7,34 @@ class BloscCompressor(Compressor):
     def compress(self, data, **kwargs):
         import blosc
 
-        yield blosc.compress(data, **kwargs)
+        # Type size, necessary for shuffle filters
+        typesize = data.itemsize
```
Good catch! It appears that blosc defaults to 8, whereas here (with the memoryview) the itemsize (based on the blosc examples) should be 1.
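For instance, a memoryview over raw bytes reports an itemsize of 1:

```python
# A raw-bytes buffer has no element type, so its memoryview itemsize is 1.
memoryview(bytes(16)).itemsize  # -> 1
```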
No, the `itemsize` should be the size of the type of the data ultimately stored. This is an optimization hint for the shuffle operators to expose more regularity. For example, when `float32` elements are stored, the `itemsize` should be 4; when `complex128` data are stored, it should be 16, etc.
I'm not familiar with all the details of asdf, and using `data.itemsize` might be the wrong choice. I am looking for the datatype (that asdf knows) of the `ndarray` here.
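In numpy terms, the value wanted here is simply the per-element byte size of the stored dtype:

```python
import numpy as np

np.dtype("uint8").itemsize       # 1
np.dtype("float32").itemsize     # 4
np.dtype("complex128").itemsize  # 16
```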
ASDF blocks are a little weird because they might be shared across arrays (even if the arrays have different dtypes). Because of this, every block is just bytes.
Here's one example:
```python
import asdf
import numpy as np

# 32 bytes, so the same block can be viewed as 8 uint32 or 4 uint64 elements
base_arr = np.arange(32, dtype='uint8')
view1 = base_arr.view(dtype='uint32')
view2 = base_arr.view(dtype='uint64')
asdf.AsdfFile({'v1': view1, 'v2': view2}).write_to('test.asdf')
```
The example produces a file with only 1 ASDF block but with 2 arrays each with a different dtype:
```yaml
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.0.1.dev0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    software: !core/software-1.0.0 {name: asdf, version: 3.0.1.dev0}
v1: !core/ndarray-1.0.0
  source: 0
  datatype: uint32
  byteorder: little
  shape: [8]
v2: !core/ndarray-1.0.0
  source: 0
  datatype: uint64
  byteorder: little
  shape: [4]
...
```
See this issue for some discussion on asdf saving 'base' arrays: asdf-format/asdf#1669
The standard is not very detailed about this behavior but does contain a short statement about blocks being bytes and not having a datatype:
> Blocks represent a contiguous chunk of binary data and nothing more. Information about how to interpret the block, such as the data type or array shape, is stored entirely in ndarray structures in the tree, as described in ndarray.
Is there much of a performance gain to be had for setting `itemsize`?
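One rough way to gauge this (a sketch, not a rigorous benchmark) is to compare compressed sizes with and without the proper type-size hint:

```python
import blosc
import numpy as np

data = np.linspace(0, 1, 1_000_000, dtype="float64").tobytes()

# typesize=1 treats the block as plain bytes; typesize=8 tells the
# bitshuffle filter where the float64 element boundaries are.
naive = blosc.compress(data, typesize=1, shuffle=blosc.BITSHUFFLE)
hinted = blosc.compress(data, typesize=8, shuffle=blosc.BITSHUFFLE)

print(len(naive), len(hinted))  # the hinted call usually compresses noticeably better
```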
```python
        typesize = data.itemsize

        # Compression level
        clevel = kwargs.get("level", 9)
```
I'd prefer to keep this as `clevel` to match the `blosc.compress` signature. As the default is already 9, I'd say let's let `**kwargs` handle the `clevel`.
I was assuming that there is a (small) standard set of arguments that asdf supports, and that people could use these without knowing details about the compressor. Apparently that's not the case; it's `level` for `zlib`, `compresslevel` for `bzip2`, etc.
It's a bit of the "wild west" when it comes to compression keyword arguments. I'd be open to suggestions if you have any ideas for a less "leaky" API. In general, compression has not received much attention in asdf (until now).
```python
        clevel = kwargs.get("level", 9)

        # Shuffle filter
        shuffle_string = kwargs.get("filter", "bitshuffle")
```
Similar to `level` vs `clevel`, I'd say we avoid adding a layer that remaps arguments. Without the remapping we can document this as "kwargs is anything supported by `blosc.compress`" (or something similar). If we map arguments we'll want to document the new mapping.
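For example, usage could then look something like the following (a sketch that assumes this compressor is registered under a "blsc" label and that asdf forwards `compression_kwargs` unchanged to `blosc.compress`):

```python
import asdf
import blosc
import numpy as np

arr = np.linspace(0, 1, 1_000_000, dtype="float32")
af = asdf.AsdfFile({"data": arr})
# The kwargs are exactly blosc.compress arguments; nothing is remapped.
af.set_array_compression(arr, "blsc", clevel=5, cname="zstd", shuffle=blosc.BITSHUFFLE)
af.write_to("compressed.asdf")
```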
```python
        shuffle = blosc.BITSHUFFLE

        # CODEC name
        cname = kwargs.get("codec", "blosclz")
```
Same here.
```python
        # Number of threads
        nthreads = kwargs.get("threads", 1)
        self._api.set_nthreads(nthreads)
```
Should this be `blosc.set_nthreads(nthreads)` (instead of `self._api.set_nthreads(nthreads)`)?
If so, I'd say it's cleaner to leave this up to the user to call prior to saving or opening an asdf file and leave it out of the API for this compressor.
Putting it into the compressor allows people to use the compressor without themselves depending on the `blosc` package.
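A minimal sketch of what that could look like (hypothetical, not the code in this PR): the compressor pops an `nthreads` kwarg and calls `blosc.set_nthreads` on the user's behalf.

```python
def compress(self, data, **kwargs):
    import blosc

    # Hypothetical: accept a threading hint as a compression kwarg so the
    # caller never has to import blosc directly.
    nthreads = kwargs.pop("nthreads", None)
    if nthreads is not None:
        blosc.set_nthreads(nthreads)

    yield blosc.compress(data, **kwargs)
```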
That's a good point; it would mean that a user could just pass `nthreads` in the compression kwargs and not need to call blosc themselves.
Does `set_nthreads` also affect decompression? Looking at how asdf handles decompression, there is no mechanism for providing keyword arguments even though the Compressor supports them (that is certainly a bug and a gap in the API).
I think decompression is also multi-threaded. I'm not sure.
```python
    def decompress(self, data, out, **kwargs):
        import blosc

        # TODO: call `self._api.decompress_ptr` instead to avoid copying the output
```
Great idea! I wasn't sure how to do this with the current structure (where data is an iterator that spits out chunks). Is there a way to iteratively decompress with blosc or does it need access to the full data to decompress (I'm not at all familiar with how it works)?
Internally `blosc` supports all the things you mention. The Python API is very limited and only supports two decompression calls: one that allocates (and we copy the result) and one that stores via a given pointer.
Since asdf seems to want to allocate the output buffer ahead of time, the second API would be nice to use. For this we need to convert asdf's pre-allocated buffer into a pointer. I don't know enough Python for this.
Alternatively, we could change asdf's `decompress` API to not take the `out` buffer as input, but rather have the `decompress` function allocate the output buffer.
How `blosc`'s decompression routine processes its input is unrelated. The Python API doesn't support reading in chunks.
`blosc` compression and decompression are limited in that they can handle at most 2 GByte. `blosc2` handles arbitrary sizes, but there isn't any Python package for it available yet. That Python package should probably support chunked compression and decompression; the `blosc2` C library supports it.
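For illustration only, one way a compressor could stay under that limit is to emit several independent blosc frames (a rough sketch; the chunk boundaries would also have to be stored or otherwise recoverable at decompression time):

```python
import blosc

def compress_chunked(data: bytes, chunk_size: int = blosc.MAX_BUFFERSIZE, **kwargs):
    # Hypothetical workaround: split the input so each piece fits blosc's size limit.
    for start in range(0, len(data), chunk_size):
        yield blosc.compress(data[start:start + chunk_size], **kwargs)
```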
Thanks! I like your idea of allowing decompress to allocate the buffer. This, combined with perhaps passing the file pointer (instead of chunks read out of the file), might allow blosc to use `decompress_ptr` and avoid copying both the input chunks and the output buffer. I think these would be only minor changes in asdf but I'd have to look in more detail to be sure.
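For reference, a minimal sketch (not part of this PR) of what decompressing into a pre-allocated numpy buffer via `decompress_ptr` could look like:

```python
import blosc
import numpy as np

def decompress_into(compressed: bytes, out: np.ndarray) -> int:
    # `out` is assumed to be a writable, C-contiguous buffer allocated by the caller.
    # ndarray.ctypes.data is the integer address of its first byte.
    return blosc.decompress_ptr(compressed, out.ctypes.data)

arr = np.arange(16, dtype="float64")
compressed = blosc.compress(arr.tobytes(), typesize=arr.itemsize)
out = np.empty_like(arr)
assert decompress_into(compressed, out) == out.nbytes
```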
I just moved this repo from braingram to asdf-format (braingram/asdf-compression is now a fork of asdf-format/asdf-compression). Please feel free to re-open this PR against https://github.com/asdf-format/asdf-compression (to hopefully garner some more feedback from other asdf developers). I'm going to close this PR for now but please don't take this as me dismissing your contribution.
I opened an issue on asdf-compression referencing your comment about the blosc decompression and
Well, omitting the things that you'd rather keep external, there are two things left, and these are actually just the things I don't know how to do.
I started opening a new PR, but I realize that it would just have two TODOs:

```diff
diff --git a/asdf_compression/blsc/compressor.py b/asdf_compression/blsc/compressor.py
index 19886f2..61875e9 100644
--- a/asdf_compression/blsc/compressor.py
+++ b/asdf_compression/blsc/compressor.py
@@ -7,10 +7,15 @@ class BloscCompressor(Compressor):
     def compress(self, data, **kwargs):
         import blosc
 
-        yield blosc.compress(data, **kwargs)
+        # Type size, necessary for shuffle filter efficiency
+        # TODO: This should be the type size of the data actually stored, not just "byte" or "uint8"
+        typesize = data.itemsize
+
+        yield blosc.compress(data, typesize=typesize, **kwargs)
 
     def decompress(self, data, out, **kwargs):
         import blosc
 
+        # TODO: call `self._api.decompress_ptr` instead to avoid copying the output
         out[:] = blosc.decompress(b"".join(data), **kwargs)
         return out.nbytes
```
If you'd like, I wouldn't be opposed to a PR with "TODOs". Another option would be to open issues. The changes you've proposed here are great as they've exposed a number of gaps in the asdf extension API for block compression and I just want to make sure that these contributions aren't lost. Thanks again for opening the PR!