Add option to compress individual arrays in ASDF romancal output files #1555
Comments
Comment by Harry Ferguson on JIRA: Implementation thoughts:
- Make the type of compression for each array an option to be specified in the CRDS configuration file for each step of the pipeline?
- On the other hand, it could perhaps be encoded in the version of the data models that is used? Or both: the CRDS config file could point to the appropriate version of the roman data models for the type of compression desired.
- If we are happy with LZ4 compression throughout, we may want to migrate to having that be the default for all the arrays.
Compression is transparent to the user of the data, since the ASDF reader will infer the type of compression from the metadata and decompress it while reading the array.
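As a rough illustration of the per-step configuration idea, the sketch below resolves a compression setting for each array from a step default plus per-array overrides. The step names, array names, the `COMPRESSION_CONFIG` layout, and `resolve_compression` are all hypothetical; nothing here is an existing CRDS or romancal interface.

```python
# Hypothetical layout: one entry per pipeline step, with a default
# algorithm and optional per-array overrides (None = leave uncompressed).
COMPRESSION_CONFIG = {
    "default": {"default": "lz4", "arrays": {}},
    "ramp_fit": {"default": "lz4", "arrays": {"data": None}},
}


def resolve_compression(step, array_name, config=COMPRESSION_CONFIG):
    """Return the compression label for one array, or None for no compression."""
    # Steps without their own entry fall back to the global "default" entry.
    step_cfg = config.get(step, config["default"])
    arrays = step_cfg.get("arrays", {})
    if array_name in arrays:
        return arrays[array_name]
    return step_cfg.get("default")


print(resolve_compression("ramp_fit", "data"))        # per-array override
print(resolve_compression("ramp_fit", "var_rnoise"))  # step default
print(resolve_compression("flat_field", "dq"))        # global default
```

Keeping the override table separate from the data models would let the setting be switched on and off without touching the schemas, at the cost of one more configuration file to keep in sync.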
spacetelescope/roman_datamodels#440 is a draft PR which configures roman_datamodels to compress all arrays with lz4 compression. The PR has some test results (using the regression test data). The current example results table is copied here:
spacetelescope/roman_datamodels#440 enables [...]
Setting per-array compression is possible with asdf but is hindered by the current organization of roman_datamodels. With asdf a user could (assuming [...])
>>> af.set_array_compression(af['roman']['dq'], 'lz4')
to compress only the [...]
I propose that for setting per-array compression we:
The motivations for:
That's great. I think you also mentioned elsewhere that the timing for the regression tests was not significantly affected; i.e., we save ~30% or so in file size without penalties for normal processing. Great. FWIW, I'm happy if we want to continue to pass through things like the extra romanisim extension but I'm also happy with the current model where if romanisim itself produces the file it has that extra extension, and if romancal produces the file it looks like an ordinary roman file without extra stuff. I'd keep the "compress everything" default for now but we could build out per-array options later if needed.
Comment by Brett Graham on JIRA: FWIW I also did some testing with a few other available algorithms. Compressing all arrays in a skycell file: r0099101001001001001_r274dp63x31y81_prompt_F158_i2d.asdf resulted in the following:
lz4f, zstandard and blosc are all supported in https://github.com/asdf-format/asdf-compression using the asdf compression extension API. That package is ready for release but is not yet on PyPI. More algorithms can be added there if needed.
Issue RCAL-861 was created on JIRA by Harry Ferguson:
Compressing some or all of the individual arrays within ASDF files can offer both cost benefits and performance benefits. The performance benefit to users is significantly faster download times. Of the compression algorithms supported by ASDF, LZ4 is by far the fastest, albeit with slightly less compression than the others. There are some notes on compression at https://innerspace.stsci.edu/display/ROMAN/Optimizing+Roman+WFI+Data+Formats.
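The speed versus ratio trade-off can be checked with a small harness like the one below. This is a sketch only: it uses the Python standard-library codecs (`zlib`, `bz2`, `lzma`) as stand-ins because LZ4 is not in the standard library, and the synthetic payload is an assumption, not Roman data. Timings will vary by machine, so no expected numbers are given.

```python
import bz2
import lzma
import time
import zlib

# Synthetic payload: a slowly varying ramp of 16-bit integers (128 KiB).
payload = b"".join((i // 64).to_bytes(2, "little") for i in range(65536))

# Each module exposes compress() and decompress() at module level.
CODECS = {"zlib": zlib, "bz2": bz2, "lzma": lzma}


def benchmark(data):
    """Return compression ratio and compress time for each codec."""
    results = {}
    for name, codec in CODECS.items():
        start = time.perf_counter()
        compressed = codec.compress(data)
        elapsed = time.perf_counter() - start
        # Confirm the round trip is lossless before recording anything.
        assert codec.decompress(compressed) == data
        results[name] = {
            "ratio": len(data) / len(compressed),
            "seconds": elapsed,
        }
    return results


results = benchmark(payload)
for name, r in results.items():
    print(f"{name}: ratio {r['ratio']:.1f}, compress time {r['seconds'] * 1e3:.2f} ms")
```

The same loop could be pointed at real array bytes and at LZ4 once the corresponding package is installed, which is the comparison that matters for this issue.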
Having the ability to compress individual arrays gives us the option to compress the little-used ones (e.g. the individual variance arrays), while leaving others uncompressed.
We don't yet know for sure that compression is the right strategy, so it's useful to make this optional and easy to turn on and off. There may be downsides to compressing -- e.g. possible performance penalties for cutout services, which might want to index into an arbitrary location in the data arrays.
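The cutout concern can be made concrete with a toy example. With an uncompressed block, a reader can seek directly to the bytes it needs; with a block compressed as a single stream, it has to decompress everything first. The sketch uses `zlib` from the standard library and an in-memory buffer; the array shape and helper names are made up for illustration, and real ASDF block handling is more involved.

```python
import io
import zlib

ROWS, COLS, ITEMSIZE = 1024, 1024, 2  # pretend uint16 image
ROW_BYTES = COLS * ITEMSIZE

# Row r is filled with the byte value r % 256, so rows are easy to check.
raw = b"".join(bytes([r % 256]) * ROW_BYTES for r in range(ROWS))
compressed = zlib.compress(raw)


def read_row_uncompressed(buf, row):
    """Seek straight to one row; only ROW_BYTES bytes are read."""
    buf.seek(row * ROW_BYTES)
    return buf.read(ROW_BYTES)


def read_row_compressed(blob, row):
    """Whole-block compression: inflate everything, then slice one row."""
    full = zlib.decompress(blob)
    start = row * ROW_BYTES
    return full[start:start + ROW_BYTES], len(full)


row = 700
direct = read_row_uncompressed(io.BytesIO(raw), row)
via_decompress, bytes_inflated = read_row_compressed(compressed, row)

# Same row either way, but the compressed path inflated the whole image.
print(len(direct), bytes_inflated)
```

Leaving the arrays that cutout services index into uncompressed, while compressing the rarely read ones, is one way to avoid this penalty.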