Support bitshuffle filter for blosc compressor #1
Conversation
Thanks for this PR!
I made some inline comments. Unfortunately I'm advocating for removal of most of the suggested changes (in favor of keeping the compression keyword arguments matching those expected by blosc compress/decompress).
asdf doesn't really provide a good (or any?) mechanism for documenting compression keyword arguments (nor does it save them to the asdf file). Until this can be improved (and perhaps even after), it makes sense to me to keep the arguments consistent, by default, with the library providing the compression.
I'm open to discussing this further and am not at all opposed to making changes to asdf to make any of this more straightforward to use/develop.
```diff
@@ -7,10 +7,34 @@ class BloscCompressor(Compressor):
     def compress(self, data, **kwargs):
         import blosc
 
-        yield blosc.compress(data, **kwargs)
+        # Type size, necessary for shuffle filters
+        typesize = data.itemsize
```
Good catch! It appears that blosc defaults to 8, whereas here (with the memoryview) the itemsize (based on the blosc examples) should be 1.
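For instance, a memoryview over raw bytes reports an itemsize of 1:

```python
# A raw-bytes buffer has no element type, so its memoryview itemsize is 1.
memoryview(bytes(16)).itemsize  # -> 1
```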
No, the `itemsize` should be the size of the type of the data ultimately stored. This is an optimization hint for the shuffle operators to expose more regularity. For example, when `float32` elements are stored, the `itemsize` should be 4; when `complex128` data are stored, it should be 16, etc.
I'm not familiar with all the details of asdf, and using `data.itemsize` might be the wrong choice. I am looking for the datatype (that asdf knows) of the `ndarray` here.
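In numpy terms, the value wanted here is simply the per-element byte size of the stored dtype:

```python
import numpy as np

np.dtype("uint8").itemsize       # 1
np.dtype("float32").itemsize     # 4
np.dtype("complex128").itemsize  # 16
```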
ASDF blocks are a little weird because they might be shared across arrays (even if the arrays have different dtypes). Because of this, every block is just bytes.
Here's one example:
```python
import asdf
import numpy as np

# 32 bytes, so the same block can be viewed as 8 uint32 or 4 uint64 elements
base_arr = np.arange(32, dtype='uint8')
view1 = base_arr.view(dtype='uint32')
view2 = base_arr.view(dtype='uint64')
asdf.AsdfFile({'v1': view1, 'v2': view2}).write_to('test.asdf')
```
The example produces a file with only 1 ASDF block but with 2 arrays each with a different dtype:
```yaml
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.0.1.dev0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    software: !core/software-1.0.0 {name: asdf, version: 3.0.1.dev0}
v1: !core/ndarray-1.0.0
  source: 0
  datatype: uint32
  byteorder: little
  shape: [8]
v2: !core/ndarray-1.0.0
  source: 0
  datatype: uint64
  byteorder: little
  shape: [4]
...
```
See this issue for some discussion on asdf saving 'base' arrays: asdf-format/asdf#1669
The standard is not very detailed about this behavior but does contain a short statement about blocks being bytes and not having a datatype:
> Blocks represent a contiguous chunk of binary data and nothing more. Information about how to interpret the block, such as the data type or array shape, is stored entirely in ndarray structures in the tree, as described in ndarray.
Is there much of a performance gain to be had for setting `itemsize`?
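One rough way to gauge this (a sketch, not a rigorous benchmark) is to compare compressed sizes with and without the proper type-size hint:

```python
import blosc
import numpy as np

data = np.linspace(0, 1, 1_000_000, dtype="float64").tobytes()

# typesize=1 treats the block as plain bytes; typesize=8 tells the
# bitshuffle filter where the float64 element boundaries are.
naive = blosc.compress(data, typesize=1, shuffle=blosc.BITSHUFFLE)
hinted = blosc.compress(data, typesize=8, shuffle=blosc.BITSHUFFLE)

print(len(naive), len(hinted))  # the hinted call usually compresses noticeably better
```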
```python
        typesize = data.itemsize

        # Compression level
        clevel = kwargs.get("level", 9)
```
I'd prefer to keep this as `clevel` to match the `blosc.compress` signature. As the default is already 9, I'd say let's let `**kwargs` handle the `clevel`.
I was assuming that there is a (small) standard set of arguments that asdf supports, and that people could use these without knowing details about the compressor. Apparently that's not the case; it's `level` for `zlib`, `compresslevel` for `bzip2`, etc.
It's a bit of the "wild west" when it comes to compression keyword arguments. I'd be open to suggestions if you have any ideas for a less "leaky" API. In general, compression has not received much attention in asdf (until now).
```python
        clevel = kwargs.get("level", 9)

        # Shuffle filter
        shuffle_string = kwargs.get("filter", "bitshuffle")
```
Similar to `level` vs `clevel`, I'd say we avoid adding a layer that remaps arguments. Without the remapping we can document this as "kwargs is anything supported by `blosc.compress`" (or something similar). If we map arguments we'll want to document the new mapping.
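For example, usage could then look something like the following (a sketch that assumes this compressor is registered under a "blsc" label and that asdf forwards `compression_kwargs` unchanged to `blosc.compress`):

```python
import asdf
import blosc
import numpy as np

arr = np.linspace(0, 1, 1_000_000, dtype="float32")
af = asdf.AsdfFile({"data": arr})
# The kwargs are exactly blosc.compress arguments; nothing is remapped.
af.set_array_compression(arr, "blsc", clevel=5, cname="zstd", shuffle=blosc.BITSHUFFLE)
af.write_to("compressed.asdf")
```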
```python
        shuffle = blosc.BITSHUFFLE

        # CODEC name
        cname = kwargs.get("codec", "blosclz")
```
Same here.
```python
        # Number of threads
        nthreads = kwargs.get("threads", 1)
        self._api.set_nthreads(nthreads)
```
Should this be `blosc.set_nthreads(nthreads)` (instead of `self._api.set_nthreads(nthreads)`)?
If so, I'd say it's cleaner to leave this up to the user to call prior to saving or opening an asdf file and leave it out of the API for this compressor.
Putting it into the compressor allows people to use the compressor without themselves depending on the `blosc` package.
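A minimal sketch of what that could look like (hypothetical, not the code in this PR): the compressor pops an `nthreads` kwarg and calls `blosc.set_nthreads` on the user's behalf.

```python
def compress(self, data, **kwargs):
    import blosc

    # Hypothetical: accept a threading hint as a compression kwarg so the
    # caller never has to import blosc directly.
    nthreads = kwargs.pop("nthreads", None)
    if nthreads is not None:
        blosc.set_nthreads(nthreads)

    yield blosc.compress(data, **kwargs)
```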
That's a good point; it would mean that a user could just pass `nthreads` in the compression kwargs and not need to call blosc themselves.
Does `set_nthreads` also affect decompression? Looking at how asdf handles decompression, there is no mechanism for providing keyword arguments even though the Compressor supports them (that is certainly a bug and a gap in the API).
I think decompression is also multi-threaded. I'm not sure.
```python
    def decompress(self, data, out, **kwargs):
        import blosc

        # TODO: call `self._api.decompress_ptr` instead to avoid copying the output
```
Great idea! I wasn't sure how to do this with the current structure (where data is an iterator that spits out chunks). Is there a way to iteratively decompress with blosc or does it need access to the full data to decompress (I'm not at all familiar with how it works)?
Internally `blosc` supports all the things you mention. The Python API is very limited and only supports two decompression calls: one that allocates (and we copy the result) and one that stores via a given pointer.
Since asdf seems to want to allocate the output buffer ahead of time, the second API would be nice to use. For this we need to convert asdf's pre-allocated buffer into a pointer. I don't know enough Python for this.
Alternatively, we could change asdf's `decompress` API to not take the `out` buffer as input, but rather have the `decompress` function allocate the output buffer.
How `blosc`'s decompression routine processes its input is unrelated. The Python API doesn't support reading in chunks.
`blosc` compression and decompression are limited in that they can handle at most 2 GByte. `blosc2` handles arbitrary sizes, but there isn't any Python package for it available yet. That Python package should probably support chunked compression and decompression; the `blosc2` C library supports it.
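For illustration only, one way a compressor could stay under that limit is to emit several independent blosc frames (a rough sketch; the chunk boundaries would also have to be stored or otherwise recoverable at decompression time):

```python
import blosc

def compress_chunked(data: bytes, chunk_size: int = blosc.MAX_BUFFERSIZE, **kwargs):
    # Hypothetical workaround: split the input so each piece fits blosc's size limit.
    for start in range(0, len(data), chunk_size):
        yield blosc.compress(data[start:start + chunk_size], **kwargs)
```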
Thanks! I like your idea of allowing decompress to allocate the buffer. This, combined with perhaps passing the file pointer (instead of chunks read out of the file), might allow blosc to use `decompress_ptr` and avoid copying both the input chunks and the output buffer. I think these would be only minor changes in asdf but I'd have to look in more detail to be sure.
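For reference, a minimal sketch (not part of this PR) of what decompressing into a pre-allocated numpy buffer via `decompress_ptr` could look like:

```python
import blosc
import numpy as np

def decompress_into(compressed: bytes, out: np.ndarray) -> int:
    # `out` is assumed to be a writable, C-contiguous buffer allocated by the caller.
    # ndarray.ctypes.data is the integer address of its first byte.
    return blosc.decompress_ptr(compressed, out.ctypes.data)

arr = np.arange(16, dtype="float64")
compressed = blosc.compress(arr.tobytes(), typesize=arr.itemsize)
out = np.empty_like(arr)
assert decompress_into(compressed, out) == out.nbytes
```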
I just moved this repo from braingram to asdf-format (braingram/asdf-compression is now a fork of asdf-format/asdf-compression). Please feel free to re-open this PR against https://github.com/asdf-format/asdf-compression (to hopefully garner some more feedback from other asdf developers). I'm going to close this PR for now but please don't take this as me dismissing your contribution.
I opened an issue on asdf-compression referencing your comment about the blosc decompression and
Well, omitting the things that you'd rather keep external, there are two things left, and these are actually just the things I don't know how to do.
I started opening a new PR, but I realize that it would just have two TODOs:

```diff
diff --git a/asdf_compression/blsc/compressor.py b/asdf_compression/blsc/compressor.py
index 19886f2..61875e9 100644
--- a/asdf_compression/blsc/compressor.py
+++ b/asdf_compression/blsc/compressor.py
@@ -7,10 +7,15 @@ class BloscCompressor(Compressor):
     def compress(self, data, **kwargs):
         import blosc
 
-        yield blosc.compress(data, **kwargs)
+        # Type size, necessary for shuffle filter efficiency
+        # TODO: This should be the type size of the data actually stored, not just "byte" or "uint8"
+        typesize = data.itemsize
+
+        yield blosc.compress(data, typesize=typesize, **kwargs)
 
     def decompress(self, data, out, **kwargs):
         import blosc
 
+        # TODO: call `self._api.decompress_ptr` instead to avoid copying the output
         out[:] = blosc.decompress(b"".join(data), **kwargs)
         return out.nbytes
```
If you'd like, I wouldn't be opposed to a PR with "TODOs". Another option would be to open issues. The changes you've proposed here are great as they've exposed a number of gaps in the asdf extension API for block compression and I just want to make sure that these contributions aren't lost. Thanks again for opening the PR!