
Add option to compress individual arrays in ASDF romancal output files #1555

Open
stscijgbot-rstdms opened this issue Dec 7, 2024 · 5 comments


@stscijgbot-rstdms
Collaborator

Issue RCAL-861 was created on JIRA by Harry Ferguson:

Compressing some or all of the individual arrays within ASDF files can offer both cost and performance benefits. The performance benefit to users is significantly faster download times. Of the compression algorithms supported by ASDF, LZ4 is by far the fastest, albeit with slightly less compression than the others. There are some notes on compression at https://innerspace.stsci.edu/display/ROMAN/Optimizing+Roman+WFI+Data+Formats.

Having the ability to compress individual arrays gives us the option to compress the little-used ones (e.g. the individual variance arrays), while leaving others uncompressed.

We don't yet know for sure that compression is the right strategy, so it's useful to make this optional and easy to turn on and off.  There may be downsides to compressing -- e.g. possible performance penalties for cutout services, which might want to index into an arbitrary location in the data arrays. 

@stscijgbot-rstdms
Collaborator Author

Comment by Harry Ferguson on JIRA:

Implementation thoughts:

 - Make the type of compression for each array an option to be specified in the CRDS configuration file for each step of the pipeline? 

 - On the other hand...it could perhaps be encoded in the version of the data models that is used? Or both. The CRDS config file could point to the appropriate version of the roman data models for the type of compression desired. 

 - If we are happy with LZ4 compression throughout, we may want to migrate to having that be the default for all the arrays.

Compression is transparent to the user of the data, since the ASDF reader will infer the type of compression from the metadata and decompress it while reading the array. 
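To illustrate what "transparent to the user" means here, the sketch below mimics the reader's side with the stdlib `zlib` codec (lz4 itself is a third-party package): the writer records which codec was used alongside the compressed bytes, and the reader dispatches on that metadata, so the consumer never sees compressed data. This is an analogy for what ASDF does per binary block, not ASDF's actual implementation:

```python
import zlib

# Writer side: compress the raw block bytes and record the codec used.
raw = bytes(range(256)) * 4096  # ~1 MiB of mildly repetitive data
block = {"compression": "zlib", "data": zlib.compress(raw)}

# Reader side: dispatch on the recorded codec; the user just gets the array bytes.
codecs = {"zlib": zlib.decompress, None: lambda b: b}
restored = codecs[block["compression"]](block["data"])

assert restored == raw                 # round trip is lossless
assert len(block["data"]) < len(raw)   # and the stored block is smaller
```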

@braingram
Collaborator

spacetelescope/roman_datamodels#440 is a draft PR which configures roman_datamodels to compress all arrays with lz4 compression. The PR has some test results (using the regression test data). The current example results table is copied here:

| file name | uncompressed size (MB) | compressed size (MB) |
| --- | --- | --- |
| r0000101001001001001_0001_wfi01_dqinit.asdf | 1480 | 248.05 |
| r0000201001001001001_0001_wfi01_dqinit.asdf | 1480 | 248.04 |
| r0000101001001001001_0001_wfi01_cal.asdf | 459.82 | 385.93 |
| r0099101001001001001_r274dp63x31y81_prompt_F158_coadd.asdf | 339.32 | 267.19 |
| r0099101001001001001_F158_visit_coadd.asdf | 485.10 | 416.08 |

@braingram
Collaborator

spacetelescope/roman_datamodels#440 enables lz4 compression for all arrays (with the option for a user to override this with either no compression or a different algorithm).

Setting per-array compression is possible with asdf but is hindered by the current organization of roman datamodels. With asdf a user could (assuming af is an AsdfFile instance):

>>> af.set_array_compression(af['roman']['dq'], 'lz4')

to compress only the dq array. However, set_array_compression is not currently exposed in roman datamodels, and the AsdfFile instance where these settings are tracked is discarded on write:
https://github.com/spacetelescope/roman_datamodels/blob/313bedd6b6e8b7868e237b1f9193c237b2fd76b7/src/roman_datamodels/datamodels/_core.py#L234-L236
which discards any settings provided by the user or loaded from the file.

I propose that for setting per-array compression we:

  • update roman datamodels to retain array settings (perhaps by reusing the read AsdfFile for write; this has the additional benefit of retaining tree contents outside the roman tag, such as those written by romanisim, which are currently discarded)
  • set defaults in roman_datamodels (perhaps keeping the default 'compress everything')

The motivations for:

  • not using CRDS: there are use cases (e.g. jdaviz) where a CRDS connection is not available, which would make any defaults defined there unreachable
  • not using the schemas: the schema and the data structure are not always one-to-one. This is fine for describing constraints (a single object can be constrained by several subschemas without combining them), but it is a problem when using the schemas to annotate an object: if multiple subschemas define an annotation, some combination logic is required, and jsonschema leaves that undefined. Moreover, since the current plan allows a user to override the compression, any override would disagree with the schema on resave (say a user sets their data array compression to None while the schema specifies 'lz4'; the schema then no longer matches the file). Finally, putting the compression in the schema would require crawling the schema before each save to determine array compression algorithms, which carries a considerable performance cost given the complexity of the schemas.
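The annotation problem can be made concrete with a hypothetical schema fragment (this is not a real roman_datamodels schema). Both subschemas below validate the same `data` array, which is fine for constraints, but jsonschema defines no rule for merging a `compression` annotation when the two disagree:

```yaml
# Hypothetical fragment for illustration only.
allOf:
  - properties:
      data:
        tag: tag:stsci.edu:asdf/core/ndarray-1.*
        compression: lz4     # annotation claimed by one subschema
  - properties:
      data:
        compression: null    # conflicting annotation from another subschema
# allOf cleanly combines *constraints* (both must validate), but jsonschema
# has no defined semantics for combining *annotations* like `compression`.
```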

@schlafly
Collaborator

That's great. I think you also mentioned elsewhere that the timing for the regression tests was not significantly affected; i.e., we save roughly 30% in file size without penalties for normal processing. Great.

FWIW, I'm happy if we want to continue to pass through things like the extra romanisim extension but I'm also happy with the current model where if romanisim itself produces the file it has that extra extension, and if romancal produces the file it looks like an ordinary roman file without extra stuff. I'd keep the "compress everything" default for now but we could build out per-array options later if needed.

@stscijgbot-rstdms
Collaborator Author

Comment by Brett Graham on JIRA:

FWIW I also did some testing with a few other available algorithms. Compressing all arrays in a skycell file:

r0099101001001001001_r274dp63x31y81_prompt_F158_i2d.asdf

resulted in the following:

  • uncompressed: 339M
  • lz4: 267M
  • lz4f (frame-based instead of block-based compression): 267M
  • zstandard: 226M
  • blosc: 223M

lz4f, zstandard, and blosc are all supported by https://github.com/asdf-format/asdf-compression via the asdf compression extension API. That package is ready for release but is not yet on PyPI. More algorithms can be added there if needed.
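The pattern behind these numbers (different codecs trade speed for ratio, and the winner depends on the data) is easy to reproduce with stdlib codecs as stand-ins; this illustrates the tradeoff only, not the actual asdf codecs, since lz4/zstandard/blosc are third-party packages:

```python
import bz2
import lzma
import zlib

# Mildly repetitive payload as a stand-in for a detector array.
payload = (bytes(range(256)) * 2048) + b"\x00" * 65536

sizes = {
    "none": len(payload),
    "zlib": len(zlib.compress(payload)),
    "bz2":  len(bz2.compress(payload)),
    "lzma": len(lzma.compress(payload)),
}
for name, size in sizes.items():
    print(f"{name:5s} {size / 1024:8.1f} KiB")

# Every codec beats the uncompressed size on this payload, but the ranking
# varies with the data, which is why measuring on real files (as above)
# matters more than an algorithm's reputation.
assert all(s < sizes["none"] for n, s in sizes.items() if n != "none")
```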

 
