
Add option to compress individual arrays in ASDF romancal output files #1555

Open
stscijgbot-rstdms opened this issue Dec 7, 2024 · 5 comments


@stscijgbot-rstdms
Collaborator

Issue RCAL-861 was created on JIRA by Harry Ferguson:

Compressing some or all of the individual arrays within ASDF files can offer both cost and performance benefits. The performance benefit to users is significantly faster download times. Of the compression algorithms supported by ASDF, LZ4 is by far the fastest, albeit with slightly less compression than the others. There are some notes on compression at https://innerspace.stsci.edu/display/ROMAN/Optimizing+Roman+WFI+Data+Formats.

Having the ability to compress individual arrays gives us the option to compress the little-used ones (e.g. the individual variance arrays), while leaving others uncompressed.

We don't yet know for sure that compression is the right strategy, so it's useful to make this optional and easy to turn on and off.  There may be downsides to compressing -- e.g. possible performance penalties for cutout services, which might want to index into an arbitrary location in the data arrays. 

@stscijgbot-rstdms
Collaborator Author

Comment by Harry Ferguson on JIRA:

Implementation thoughts:

 - Make the type of compression for each array an option to be specified in the CRDS configuration file for each step of the pipeline? 

 - On the other hand...it could perhaps be encoded in the version of the data models that is used? Or both. The CRDS config file could point to the appropriate version of the roman data models for the type of compression desired. 

 - If we are happy with LZ4 compression throughout, we may want to migrate to having that be the default for all the arrays.

Compression is transparent to the user of the data, since the ASDF reader will infer the type of compression from the metadata and decompress it while reading the array. 
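To illustrate what "transparent to the user" means here, the sketch below mimics the reader's side with the stdlib `zlib` codec (lz4 itself is a third-party package): the writer records which codec was used alongside the compressed bytes, and the reader dispatches on that metadata, so the consumer never sees compressed data. This is an analogy for what ASDF does per binary block, not ASDF's actual implementation:

```python
import zlib

# Writer side: compress the raw block bytes and record the codec used.
raw = bytes(range(256)) * 4096  # ~1 MiB of mildly repetitive data
block = {"compression": "zlib", "data": zlib.compress(raw)}

# Reader side: dispatch on the recorded codec; the user just gets the array bytes.
codecs = {"zlib": zlib.decompress, None: lambda b: b}
restored = codecs[block["compression"]](block["data"])

assert restored == raw                 # round trip is lossless
assert len(block["data"]) < len(raw)   # and the stored block is smaller
```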

@braingram
Collaborator

spacetelescope/roman_datamodels#440 is a draft PR which configures roman_datamodels to compress all arrays with lz4 compression. The PR has some test results (using the regression test data). The current example results table is copied here:

| file name | uncompressed size (MB) | compressed size (MB) |
| --- | --- | --- |
| r0000101001001001001_0001_wfi01_dqinit.asdf | 1480 | 248.05 |
| r0000201001001001001_0001_wfi01_dqinit.asdf | 1480 | 248.04 |
| r0000101001001001001_0001_wfi01_cal.asdf | 459.82 | 385.93 |
| r0099101001001001001_r274dp63x31y81_prompt_F158_coadd.asdf | 339.32 | 267.19 |
| r0099101001001001001_F158_visit_coadd.asdf | 485.10 | 416.08 |

@braingram
Collaborator

spacetelescope/roman_datamodels#440 enables lz4 compression for all arrays (with the option for a user to override this with either no compression or a different algorithm).

Setting per-array compression is possible with asdf but is hindered by the current organization of roman datamodels. With asdf a user could (assuming af is an AsdfFile instance):

>>> af.set_array_compression(af['roman']['dq'], 'lz4')

to compress only the dq array. However, set_array_compression is not currently exposed in roman datamodels, and the AsdfFile instance where these settings are tracked is discarded on write:
https://github.com/spacetelescope/roman_datamodels/blob/313bedd6b6e8b7868e237b1f9193c237b2fd76b7/src/roman_datamodels/datamodels/_core.py#L234-L236
which discards any settings provided by the user or loaded from the file.

I propose that for setting per-array compression we:

  • update roman datamodels to retain array settings (perhaps by reusing the read AsdfFile for write; this has the additional benefit of retaining tree contents outside the roman tag, such as those written by romanisim, which are currently discarded)
  • set defaults in roman_datamodels (perhaps keeping the default 'compress everything')

The motivations for:

  • not using CRDS: there are use cases (e.g. jdaviz) where a CRDS connection is not available, which would make any defaults defined there unreachable
  • not using the schemas: the schema and the data structure are not always one-to-one. This is fine for describing constraints (a single object can be constrained by several subschemas without combining them), but it is a problem when using the schemas to annotate an object: if multiple subschemas define an annotation, some combination logic is required, and jsonschema leaves that undefined. Moreover, since the current plan allows a user to override the compression, any override would disagree with the schema on resave (say a user sets their data array compression to None while the schema specifies 'lz4'; the schema then no longer matches the file). Finally, putting the compression in the schema would require crawling the schema before each save to determine array compression algorithms, which carries a considerable performance cost given the complexity of the schemas.
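The annotation problem can be made concrete with a hypothetical schema fragment (this is not a real roman_datamodels schema). Both subschemas below validate the same `data` array, which is fine for constraints, but jsonschema defines no rule for merging a `compression` annotation when the two disagree:

```yaml
# Hypothetical fragment for illustration only.
allOf:
  - properties:
      data:
        tag: tag:stsci.edu:asdf/core/ndarray-1.*
        compression: lz4     # annotation claimed by one subschema
  - properties:
      data:
        compression: null    # conflicting annotation from another subschema
# allOf cleanly combines *constraints* (both must validate), but jsonschema
# has no defined semantics for combining *annotations* like `compression`.
```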

@schlafly
Collaborator

That's great. I think you also mentioned elsewhere that the timing for the regression tests was not significantly affected; i.e., we save roughly 30% in file size without penalties for normal processing. Great.

FWIW, I'm happy if we want to continue to pass through things like the extra romanisim extension but I'm also happy with the current model where if romanisim itself produces the file it has that extra extension, and if romancal produces the file it looks like an ordinary roman file without extra stuff. I'd keep the "compress everything" default for now but we could build out per-array options later if needed.

@stscijgbot-rstdms
Collaborator Author

Comment by Brett Graham on JIRA:

FWIW I also did some testing with a few other available algorithms. Compressing all arrays in a skycell file:

r0099101001001001001_r274dp63x31y81_prompt_F158_i2d.asdf

resulted in the following:

  • uncompressed: 339M
  • lz4: 267M
  • lz4f (frame-based instead of block-based compression): 267M
  • zstandard: 226M
  • blosc: 223M

lz4f, zstandard, and blosc are all supported by https://github.com/asdf-format/asdf-compression via the asdf compression extension API. That package is ready for release but is not yet on PyPI. More algorithms can be added there if needed.
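The pattern behind these numbers (different codecs trade speed for ratio, and the winner depends on the data) is easy to reproduce with stdlib codecs as stand-ins; this illustrates the tradeoff only, not the actual asdf codecs, since lz4/zstandard/blosc are third-party packages:

```python
import bz2
import lzma
import zlib

# Mildly repetitive payload as a stand-in for a detector array.
payload = (bytes(range(256)) * 2048) + b"\x00" * 65536

sizes = {
    "none": len(payload),
    "zlib": len(zlib.compress(payload)),
    "bz2":  len(bz2.compress(payload)),
    "lzma": len(lzma.compress(payload)),
}
for name, size in sizes.items():
    print(f"{name:5s} {size / 1024:8.1f} KiB")

# Every codec beats the uncompressed size on this payload, but the ranking
# varies with the data, which is why measuring on real files (as above)
# matters more than an algorithm's reputation.
assert all(s < sizes["none"] for n, s in sizes.items() if n != "none")
```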

 
