
Tradeoffs of Storing Supported Grids as Secondary Sources rather than Primary Source Documents #100

Open
Sujay-Shankar opened this issue Jan 18, 2024 · 0 comments


@Sujay-Shankar (Collaborator)

In issue #12 we discussed the prospect of caching grids in an efficient binary format such as .npy files. However, that issue implied that the user would do the caching on a per-subset basis. Here, we instead propose one-time storage of an entire grid in an efficient format.

Open questions: Which storage format do we adopt (.npy, .h5, .parquet, .feather, etc.)? Where do we store it? Strong preference for Zenodo.

Requirements:

  • Whatever format we adopt must support storing metadata (teff, logg, Z, fsed, etc.)
  • It should be binary (compact storage, fast access) and support efficient columnar data access
  • The format should be stable and have longevity
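As a minimal illustration of the metadata requirement, here is a sketch (assuming nothing about gollum's actual internals, and using the .npy family mentioned in issue #12) that bundles a full grid and its per-point parameters into one compressed .npz archive, with the metadata serialized as JSON:

```python
import json
import numpy as np

# Hypothetical example grid: 2 grid points x 4 wavelength samples.
flux = np.arange(8.0).reshape(2, 4)
# One metadata record per grid point (parameter values are illustrative).
meta = [
    {"teff": 1000, "logg": 4.5, "Z": 0.0, "fsed": 2.0},
    {"teff": 1100, "logg": 4.5, "Z": 0.0, "fsed": 2.0},
]

# Store the flux array and JSON-encoded metadata together in one archive.
np.savez_compressed("grid.npz", flux=flux, metadata=json.dumps(meta))

# Reload: the metadata round-trips alongside the binary flux array.
with np.load("grid.npz") as f:
    flux_back = f["flux"]
    meta_back = json.loads(str(f["metadata"]))
```

A columnar format like .parquet or .feather would additionally give per-column reads for free; .npz, by contrast, loads each stored array whole, so this sketch only demonstrates the metadata side of the requirement.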

Pros:

  • Users only have to download once
  • Significantly faster I/O
  • Significantly lower storage footprint
  • Easier for gollum developers to support cross-platform data downloads

Cons:

  • Must be managed by gollum maintainers instead of by grid creators/users
  • Prospect of divergence between the native primary source documents and our secondary source documents, which requires QA
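The divergence risk could be partly mitigated with a checksum manifest that maintainers regenerate whenever the secondary sources are rebuilt. A sketch (the manifest format and helper names are assumptions, not anything gollum currently does):

```python
import hashlib
import json
import pathlib

def sha256_of(path):
    """Hash a file in chunks so large grid files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_divergent(directory, manifest_path):
    """Compare every file listed in a JSON manifest ({filename: sha256})
    against the local copies; return the names whose digests differ."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    directory = pathlib.Path(directory)
    return [name for name, digest in manifest.items()
            if sha256_of(directory / name) != digest]
```

This catches silent drift in the converted files themselves; checking that the conversion still matches the primary source would additionally require re-reading the original grids and comparing the decoded arrays.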

Use case workflow:

  • User installs gollum and says "I want to work with Sonora Diamondback"
  • User runs SonoraDiamondbackSpectrum.download_grid(location=)
  • A heavily compressed archive of the full grid is hosted somewhere online
  • The function fetches the archive and unpacks it into the directory specified by location
  • When the user then asks for a Sonora spectrum, all we do is pick out the columns (grid points) of our dataframe that we want and load them in (very fast, because the data has been specifically optimized for this operation)
  • Same with full grid loading, presto
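The download step above could look roughly like the following; download_grid is the method proposed in this workflow, but its body here is purely a sketch (the zip archive format and unpacking behavior are assumptions):

```python
import io
import pathlib
import urllib.request
import zipfile

def download_grid(url, location):
    """Sketch of the proposed download step: fetch the compressed grid
    archive from `url` and unpack it into the user-chosen `location`."""
    target = pathlib.Path(location)
    target.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        payload = io.BytesIO(response.read())
    with zipfile.ZipFile(payload) as archive:
        archive.extractall(target)
    return target
```

In practice the URL would point at the Zenodo record; since urllib also accepts file:// URLs, the same function is easy to exercise offline against a local archive.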