Support bitshuffle filter for blosc compressor #1
Changes from all commits:
```diff
@@ -7,10 +7,34 @@ class BloscCompressor(Compressor):
     def compress(self, data, **kwargs):
         import blosc
 
-        yield blosc.compress(data, **kwargs)
+        # Type size, necessary for shuffle filters
+        typesize = data.itemsize
+
+        # Compression level
+        clevel = kwargs.get("level", 9)
```
**Reviewer** (on `clevel = kwargs.get("level", 9)`): I'd prefer to keep this as `clevel`, the keyword that blosc itself uses.

**Author:** I was assuming that there is a (small) standard set of arguments that every compressor accepts. Apparently that's not the case; the keywords are compressor-specific.

**Reviewer:** It's a bit of the "wild west" when it comes to compression keyword arguments. I'd be open to suggestions if you have any ideas for a less "leaky" API. In general, compression has not received much attention in asdf (until now).
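As one possible direction for a less "leaky" interface, the compressor-specific keywords could be wrapped in an explicit options object that is translated to blosc's own keyword names at the boundary. This is purely a sketch: `BloscOptions` and `to_blosc_kwargs` are hypothetical names, not part of asdf or python-blosc.

```python
from dataclasses import dataclass

import blosc


@dataclass
class BloscOptions:
    """Hypothetical explicit options, translated to blosc's keywords."""

    level: int = 9              # becomes blosc's `clevel` (0-9)
    filter: str = "bitshuffle"  # "noshuffle", "byteshuffle", or "bitshuffle"
    codec: str = "blosclz"      # becomes blosc's `cname`

    def to_blosc_kwargs(self) -> dict:
        shuffle = {
            "noshuffle": blosc.NOSHUFFLE,
            "byteshuffle": blosc.SHUFFLE,  # python-blosc's byte-level shuffle
            "bitshuffle": blosc.BITSHUFFLE,
        }[self.filter]
        return {"clevel": self.level, "shuffle": shuffle, "cname": self.codec}
```

One benefit of this shape: a typo in `filter` raises a `KeyError` up front instead of silently falling back to a default.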
```diff
+
+        # Shuffle filter
+        shuffle_string = kwargs.get("filter", "bitshuffle")
```
**Reviewer** (on `shuffle_string = kwargs.get("filter", "bitshuffle")`): Similar to the comment above on the keyword naming.
```diff
+        if shuffle_string == "noshuffle":
+            shuffle = blosc.NOSHUFFLE
+        elif shuffle_string == "byteshuffle":
+            shuffle = blosc.BYTESHUFFLE
+        else:
+            # Fallback and default
+            shuffle = blosc.BITSHUFFLE
+
+        # CODEC name
+        cname = kwargs.get("codec", "blosclz")
```
**Reviewer** (on `cname = kwargs.get("codec", "blosclz")`): Same here.
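As an aside, the practical impact of the shuffle choice is easy to measure with python-blosc directly. A small sketch (it uses `blosc.SHUFFLE`, python-blosc's constant for the byte-level shuffle; the sample data is illustrative):

```python
import blosc
import numpy as np

# Smoothly varying floats: shuffling exposes regularity across bytes/bits.
data = np.linspace(0, 1, 1_000_000, dtype="float64").tobytes()

for name, shuffle in [
    ("noshuffle", blosc.NOSHUFFLE),
    ("byteshuffle", blosc.SHUFFLE),
    ("bitshuffle", blosc.BITSHUFFLE),
]:
    compressed = blosc.compress(data, typesize=8, clevel=9,
                                shuffle=shuffle, cname="blosclz")
    print(f"{name:12s} {len(compressed):>9d} bytes")
```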
```diff
+
+        # Number of threads
+        nthreads = kwargs.get("threads", 1)
+        self._api.set_nthreads(nthreads)
```
**Reviewer** (on `self._api.set_nthreads(nthreads)`): Should this be `blosc.set_nthreads(nthreads)`? If so, I'd say it's cleaner to leave this up to the user to call prior to saving or opening an asdf file, and to leave it out of the API for this compressor.

**Author:** Putting it into the compressor allows people to use the compressor without themselves depending on the `blosc` package directly.

**Reviewer:** That's a good point; it would mean that a user could just pass 'nthreads' in the compression kwargs and not need to call blosc themselves. Does the thread setting affect decompression as well?

**Author:** I think decompression is also multi-threaded. I'm not sure.
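For context, python-blosc's thread setting is a process-global switch, so the user-side alternative discussed above would look something like this sketch:

```python
import blosc

# Size blosc's internal thread pool to the machine before writing/reading.
# set_nthreads returns the previously configured number of threads.
ncores = blosc.detect_number_of_cores()
previous = blosc.set_nthreads(ncores)
print(f"blosc threads: {previous} -> {ncores}")
```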
```diff
+
+        yield blosc.compress(data, typesize=typesize, clevel=clevel, shuffle=shuffle, cname=cname, **kwargs)
 
     def decompress(self, data, out, **kwargs):
         import blosc
 
+        # TODO: call `self._api.decompress_ptr` instead to avoid copying the output
```
**Reviewer** (on the `decompress_ptr` TODO): Great idea! I wasn't sure how to do this with the current structure (where `data` is an iterator that spits out chunks). Is there a way to iteratively decompress with blosc, or does it need access to the full data to decompress? (I'm not at all familiar with how it works.)

**Author:** Internally, blosc needs the complete compressed buffer, so the chunks have to be joined either way; python-blosc has no iterative decompression API. Alternatively, we could change the interface so that `decompress` allocates the output buffer itself.
**Reviewer:** Thanks! I like your idea of allowing `decompress` to allocate the buffer. That, combined with passing in perhaps the file pointer (instead of chunks read out of the file), might allow blosc to use `decompress_ptr` here.
```diff
         out[:] = blosc.decompress(b"".join(data), **kwargs)
         return out.nbytes
```
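For reference, a minimal sketch of what the `decompress_ptr` route discussed above could look like with python-blosc and a pre-allocated NumPy buffer (the array shape and dtype are illustrative):

```python
import blosc
import numpy as np

original = np.arange(1_000_000, dtype="float64")
compressed = blosc.compress(original.tobytes(), typesize=original.itemsize)

# Decompress directly into a pre-allocated buffer: decompress_ptr takes a
# raw memory address and returns the number of bytes written, avoiding the
# intermediate bytes object (and copy) that blosc.decompress would create.
out = np.empty_like(original)
nbytes = blosc.decompress_ptr(compressed, out.__array_interface__["data"][0])
assert nbytes == out.nbytes
assert np.array_equal(original, out)
```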
**Reviewer** (on `typesize = data.itemsize`): Good catch! It appears that blosc defaults to 8, whereas here (with the memoryview) the itemsize (based off the blosc examples) should be 1.
**Author:** No, the `itemsize` should be the type of the data ultimately stored. This is an optimization hint for the shuffle operators to expose more regularity: for example, when `float32` elements are stored, the `itemsize` should be 4; when `complex128` data are stored, it should be 16; etc. I'm not familiar with all the details of `asdf`, and using `data.itemsize` might be the wrong choice. I am looking for the datatype (that `asdf` knows) of the `ndarray` here.
**Reviewer:** ASDF blocks are a little weird because they might be shared across arrays (even if the arrays have different dtypes). Because of this, every block is just bytes. Here's one example:
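A minimal sketch along these lines (the names and sizes are illustrative; it relies on asdf writing the shared base buffer as a single block):

```python
import asdf
import numpy as np

# Two views of the same 16-byte base buffer, with different dtypes.
base = np.zeros(16, dtype="uint8")
tree = {
    "ints": base.view("int32"),      # 4 x int32
    "floats": base.view("float64"),  # 2 x float64
}
asdf.AsdfFile(tree).write_to("shared_block.asdf")
```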
The example produces a file with only one ASDF block, but with two arrays, each with a different dtype. See this issue for some discussion of asdf saving 'base' arrays: asdf-format/asdf#1669. The standard is not very detailed about this behavior, but it does contain a short statement about blocks being bytes and not having a datatype. Is there much of a performance gain to be had for setting `itemsize`?
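One way to answer that closing question empirically (a sketch; the data are illustrative, and the gain depends heavily on how regular the stored values are):

```python
import blosc
import numpy as np

# Compare compressed sizes when the shuffle filter knows the true element
# width (8 bytes for float64) versus being told the block is plain bytes.
data = np.linspace(0, 1, 1_000_000, dtype="float64").tobytes()
for typesize in (1, 8):
    compressed = blosc.compress(data, typesize=typesize,
                                clevel=9, shuffle=blosc.BITSHUFFLE)
    print(f"typesize={typesize}: {len(compressed):>10d} bytes")
```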