
Add an end to end test #3

Open
mattjbr123 opened this issue Oct 1, 2024 · 4 comments

@mattjbr123
Collaborator

Is the output zarr dataset the same as the input netcdf dataset?

Exact method to do this TBD

@mattjbr123 mattjbr123 transferred this issue from NERC-CEH/object_store_tutorial Oct 7, 2024
mattjbr123 added 4 commits that referenced this issue Oct 7, 2024
@mattjbr123
Collaborator Author

Ignore the above commits and comments, I think they've been assigned to the wrong issue...

@mattjbr123 mattjbr123 reopened this Oct 7, 2024
@mattjbr123 mattjbr123 changed the title from "Add some an end to end test" to "Add an end to end test" Oct 7, 2024
@dolegi dolegi self-assigned this Oct 18, 2024
@mattjbr123 mattjbr123 linked a pull request Oct 22, 2024 that will close this issue
@mattjbr123
Collaborator Author

mattjbr123 commented Oct 25, 2024

We want to compare the data in the input netcdf file(s) to the output zarr dataset to ensure they are the same.

TL;DR

  • Do we do full data-point by data-point comparison or hashing/summarising?
  • Where do we store the data?
  • Where do we run the test?

If we compare fully, one major issue is the size of the datasets, which is potentially multi-TB. Could hashing or some other summary calculation get around this? Probably not entirely: it would still be a computationally expensive operation and would still need to read in all the data anyway. This must be an issue the EIDC team face and have solutions for; @phtrceh, would you be able to advise? Maybe we just compare the datasets in chunks/slices. It would still take a while, but is probably less computationally expensive than calculating a summary parameter or hash from the data. Given we'd probably still want to use Beam to parallelise this as much as possible, we could build it into the conversion pipeline itself somehow.
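For illustration, a minimal sketch of the chunk/slice-wise comparison, assuming xarray and a "time" dimension to slice along (the file names, variable set, and slicing dimension are placeholders, not the project's real layout):

```python
import numpy as np
import xarray as xr

ds_in = xr.open_dataset("input.nc")    # original netCDF (placeholder path)
ds_out = xr.open_zarr("output.zarr")   # converted zarr store (placeholder path)

# same variables should exist on both sides
assert set(ds_in.data_vars) == set(ds_out.data_vars)

step = 100  # time slices per batch; tune to available memory
for var in ds_in.data_vars:
    for start in range(0, ds_in.sizes["time"], step):
        sl = slice(start, start + step)
        a = ds_in[var].isel(time=sl).values
        b = ds_out[var].isel(time=sl).values
        # equal_nan=True so matching NaNs don't count as a mismatch
        np.testing.assert_allclose(a, b, equal_nan=True)
```

Each (variable, slice) pair is independent, so if this were folded into the Beam pipeline, each pair could become one element of a PCollection and be compared in parallel.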

Then there is the question of where to run the test. If we want to run it via GitHub Actions/CI, we would need to connect to whatever HPC or HPC-like environment we run the conversion on and execute the test there, unless we get the pipeline to calculate a single number that somehow represents each whole dataset, in which case the comparison becomes trivial and can run directly on a teeny tiny GitHub-hosted instance.
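One possible shape for that single number, sketched below under a big assumption: the values must decode bit-identically from both stores (same dtype, no lossy compression), otherwise a digest will never match and a tolerance-based comparison is needed instead. The function name, dimension, and step size are all hypothetical:

```python
import hashlib
import numpy as np
import xarray as xr

def dataset_digest(ds: xr.Dataset, dim: str = "time", step: int = 100) -> str:
    """Reduce a dataset to one hex digest by hashing decoded values
    in a fixed traversal order (sorted variables, ascending slices)."""
    h = hashlib.sha256()
    for var in sorted(ds.data_vars):  # deterministic variable order
        for start in range(0, ds.sizes[dim], step):
            block = ds[var].isel({dim: slice(start, start + step)}).values
            # hash the decoded array, not the on-disk encoding, so netCDF
            # and zarr stores of the same data can produce equal digests
            h.update(np.ascontiguousarray(block).tobytes())
    return h.hexdigest()
```

The HPC-side job would publish one digest per dataset, and the CI job then only compares two short strings.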

Another issue is that we cannot store the data on GitHub, as again it is too big. A potential way around this would be to upload the original and converted datasets to an object store and read them from there. We would have to find a way to safely store the credentials needed to access the object store, but this should be a problem already solved elsewhere (e.g. the time-series FDRI product?). Eventually we will not need to upload the converted data to object storage as a manual step, as it will be done as part of the Beam pipeline anyway; but that requires moving off the DirectRunner, which means creating a Flink or Spark instance for the Beam Flink or Spark runners to use.
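A sketch of how the test could read both datasets from an S3-compatible object store, with credentials pulled from environment variables (which a GitHub Actions workflow could populate from repository secrets). The bucket paths, endpoint URL, and environment variable names are all placeholders:

```python
import os
import fsspec
import xarray as xr

storage_options = {
    "key": os.environ["OBJECT_STORE_KEY"],       # e.g. from CI secrets
    "secret": os.environ["OBJECT_STORE_SECRET"],
    "client_kwargs": {"endpoint_url": "https://object-store.example.org"},
}

# zarr can be opened straight from the store
ds_out = xr.open_zarr("s3://converted/output.zarr",
                      storage_options=storage_options)

# netCDF needs a file-like object; the h5netcdf engine can read from one
with fsspec.open("s3://originals/input.nc", **storage_options) as f:
    ds_in = xr.open_dataset(f, engine="h5netcdf")
```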

Lots of questions!!

@mattjbr123
Collaborator Author

The "Verifying the integrity of Zarr stores" section of this blog post has some ideas, and some helpful links to where it is being discussed! Something to keep an eye on until we get around to tackling this in earnest.

@mattjbr123
Collaborator Author

mattjbr123 commented Dec 5, 2024

Part of PADOCC, developed by @dwest77a, is a verification step.
PADOCC is a tool that converts or kerchunks datasets on the CEDA Archive at scale/in parallel. It includes a verification step (soon to be an importable module!) that checks that the data read into xarray in the original format is the same as the data read into xarray via zarr/kerchunk.
The comparison occurs on the xarray dataset objects.
Both the metadata and the actual data as read in by xarray are compared. The data comparison is done by selecting a random(?) box of data in which at least some proportion of the values is not NaN and comparing these, ideally across chunk boundaries to check that the correct ordering has been maintained.
I'd like to dig a little further into the details here when I get to working on this issue in anger.
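In the meantime, a rough sketch of the kind of check described above (my reading of it, not PADOCC's actual code): pick a randomly placed box, retry until a minimum fraction of it is non-NaN, then compare the two reads. The function name, box size, and thresholds are guesses; to genuinely span chunk boundaries, the box would need to exceed the zarr chunk shape along at least one dimension:

```python
import numpy as np
import xarray as xr

def compare_random_box(da_in: xr.DataArray, da_out: xr.DataArray,
                       box: int = 50, min_valid: float = 0.5,
                       tries: int = 20, seed: int = 0) -> None:
    """Compare one randomly placed box that is at least min_valid non-NaN."""
    rng = np.random.default_rng(seed)
    for _ in range(tries):
        sel = {}
        for dim, n in da_in.sizes.items():
            start = int(rng.integers(0, max(n - box, 1)))
            sel[dim] = slice(start, start + box)
        a = da_in.isel(sel).values
        if np.mean(~np.isnan(a)) >= min_valid:  # enough real data in the box?
            b = da_out.isel(sel).values
            np.testing.assert_allclose(a, b, equal_nan=True)
            return
    raise RuntimeError("no sufficiently non-NaN box found")
```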
