
Tradeoffs of Storing Supported Grids as Secondary Sources rather than Primary Source Documents #100

Open
Sujay-Shankar opened this issue Jan 18, 2024 · 0 comments


@Sujay-Shankar (Collaborator)

In issue #12 we discussed the prospect of caching grids in an efficient binary format such as .npy files. However, that issue implied that the user would do the caching on a per-subset basis. Here, we instead propose one-time storage of an entire grid in an efficient format.

Open questions: Which storage format do we adopt (.npy, .h5, .parquet, .feather, etc.)? Where do we store it? Strong preference for Zenodo.

Requirements:

  • Whatever format we adopt must support storing metadata (teff, logg, Z, fsed, etc.)
  • It should be binary (compact storage, fast access) and support efficient columnar data access
  • The format should be stable and have longevity
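As a minimal illustration of the metadata requirement, here is a sketch (assuming nothing about gollum's actual internals, and using the .npy family mentioned in issue #12) that bundles a full grid and its per-point parameters into one compressed .npz archive, with the metadata serialized as JSON:

```python
import json
import numpy as np

# Hypothetical example grid: 2 grid points x 4 wavelength samples.
flux = np.arange(8.0).reshape(2, 4)
# One metadata record per grid point (parameter values are illustrative).
meta = [
    {"teff": 1000, "logg": 4.5, "Z": 0.0, "fsed": 2.0},
    {"teff": 1100, "logg": 4.5, "Z": 0.0, "fsed": 2.0},
]

# Store the flux array and JSON-encoded metadata together in one archive.
np.savez_compressed("grid.npz", flux=flux, metadata=json.dumps(meta))

# Reload: the metadata round-trips alongside the binary flux array.
with np.load("grid.npz") as f:
    flux_back = f["flux"]
    meta_back = json.loads(str(f["metadata"]))
```

A columnar format like .parquet or .feather would additionally give per-column reads for free; .npz, by contrast, loads each stored array whole, so this sketch only demonstrates the metadata side of the requirement.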

Pros:

  • Users only have to download once
  • Significantly faster I/O
  • Significantly lower storage footprint
  • Easier for gollum developers to support cross-platform data downloads

Cons:

  • Must be managed by gollum maintainers instead of by grid creators/users
  • Prospect of divergence between the native primary source documents and our secondary source documents, which requires QA
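The divergence risk could be partly mitigated with a checksum manifest that maintainers regenerate whenever the secondary sources are rebuilt. A sketch (the manifest format and helper names are assumptions, not anything gollum currently does):

```python
import hashlib
import json
import pathlib

def sha256_of(path):
    """Hash a file in chunks so large grid files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_divergent(directory, manifest_path):
    """Compare every file listed in a JSON manifest ({filename: sha256})
    against the local copies; return the names whose digests differ."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    directory = pathlib.Path(directory)
    return [name for name, digest in manifest.items()
            if sha256_of(directory / name) != digest]
```

This catches silent drift in the converted files themselves; checking that the conversion still matches the primary source would additionally require re-reading the original grids and comparing the decoded arrays.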

Use case workflow:

  • User installs gollum and says "I want to work with Sonora Diamondback"
  • User runs SonoraDiamondbackSpectrum.download_grid(location=)
  • A heavily compressed archive of the full grid is hosted somewhere online
  • The function fetches the archive and unpacks it into the directory specified by location
  • When the user then asks for a Sonora spectrum, all we do is pick out the columns (grid points) of our dataframe that we want and load them in (very fast, because the data has been specifically optimized for this operation)
  • Same with full grid loading, presto
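The download step above could look roughly like the following; download_grid is the method proposed in this workflow, but its body here is purely a sketch (the zip archive format and unpacking behavior are assumptions):

```python
import io
import pathlib
import urllib.request
import zipfile

def download_grid(url, location):
    """Sketch of the proposed download step: fetch the compressed grid
    archive from `url` and unpack it into the user-chosen `location`."""
    target = pathlib.Path(location)
    target.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        payload = io.BytesIO(response.read())
    with zipfile.ZipFile(payload) as archive:
        archive.extractall(target)
    return target
```

In practice the URL would point at the Zenodo record; since urllib also accepts file:// URLs, the same function is easy to exercise offline against a local archive.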