
Add an end to end test #3

Open
mattjbr123 opened this issue Oct 1, 2024 · 4 comments

@mattjbr123
Collaborator

Is the output zarr dataset the same as the input netcdf dataset?

Exact method to do this TBD

@mattjbr123 mattjbr123 transferred this issue from NERC-CEH/object_store_tutorial Oct 7, 2024
mattjbr123 added 4 commits that referenced this issue Oct 7, 2024
@mattjbr123
Collaborator Author

Ignore the above commits and comments, I think they've been assigned to the wrong issue...

@mattjbr123 mattjbr123 reopened this Oct 7, 2024
@mattjbr123 mattjbr123 changed the title from "Add some an end to end test" to "Add an end to end test" Oct 7, 2024
@dolegi dolegi self-assigned this Oct 18, 2024
@mattjbr123 mattjbr123 linked a pull request Oct 22, 2024 that will close this issue
@mattjbr123
Collaborator Author

mattjbr123 commented Oct 25, 2024

We want to compare the data in the input netcdf file(s) to the output zarr dataset to ensure they are the same.

TL;DR

  • Do we do full data-point by data-point comparison or hashing/summarising?
  • Where do we store the data?
  • Where do we run the test?

If we compare fully, one major issue is the size of the datasets, which is potentially multi-TB. Could hashing or some other summary calculation get around this? Probably not entirely: it would still be a computationally expensive operation and would still need to read in all the data anyway. This must be an issue the EIDC team face and have solutions for; @phtrceh, would you be able to advise? Maybe we just compare the datasets in chunks/slices. It would still take a while, but is probably less computationally expensive than calculating a summary parameter or hash from the data. Given we'd probably still want to use Beam to parallelise this as much as possible, we could build it into the conversion pipeline itself somehow.
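For illustration, a minimal sketch of the chunk/slice-wise comparison, assuming xarray and a "time" dimension to slice along (the file names, variable set, and slicing dimension are placeholders, not the project's real layout):

```python
import numpy as np
import xarray as xr

ds_in = xr.open_dataset("input.nc")    # original netCDF (placeholder path)
ds_out = xr.open_zarr("output.zarr")   # converted zarr store (placeholder path)

# same variables should exist on both sides
assert set(ds_in.data_vars) == set(ds_out.data_vars)

step = 100  # time slices per batch; tune to available memory
for var in ds_in.data_vars:
    for start in range(0, ds_in.sizes["time"], step):
        sl = slice(start, start + step)
        a = ds_in[var].isel(time=sl).values
        b = ds_out[var].isel(time=sl).values
        # equal_nan=True so matching NaNs don't count as a mismatch
        np.testing.assert_allclose(a, b, equal_nan=True)
```

Each (variable, slice) pair is independent, so if this were folded into the Beam pipeline, each pair could become one element of a PCollection and be compared in parallel.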

Then there is the question of where to run the test. If we want to run it via GitHub Actions/CI, we would need to connect to whatever HPC or HPC-like environment we run the conversion on and execute the test there, unless we get the pipeline to calculate a single number that somehow represents each whole dataset, in which case the comparison becomes trivial and can run directly on a teeny tiny GitHub-hosted instance.
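One possible shape for that single number, sketched below under a big assumption: the values must decode bit-identically from both stores (same dtype, no lossy compression), otherwise a digest will never match and a tolerance-based comparison is needed instead. The function name, dimension, and step size are all hypothetical:

```python
import hashlib
import numpy as np
import xarray as xr

def dataset_digest(ds: xr.Dataset, dim: str = "time", step: int = 100) -> str:
    """Reduce a dataset to one hex digest by hashing decoded values
    in a fixed traversal order (sorted variables, ascending slices)."""
    h = hashlib.sha256()
    for var in sorted(ds.data_vars):  # deterministic variable order
        for start in range(0, ds.sizes[dim], step):
            block = ds[var].isel({dim: slice(start, start + step)}).values
            # hash the decoded array, not the on-disk encoding, so netCDF
            # and zarr stores of the same data can produce equal digests
            h.update(np.ascontiguousarray(block).tobytes())
    return h.hexdigest()
```

The HPC-side job would publish one digest per dataset, and the CI job then only compares two short strings.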

Another issue is that we cannot store the data on GitHub, as again it is too big. A potential way around this would be to upload the original and converted datasets to an object store and read them from there. We would have to find a way to safely store the credentials needed to access the object store, but this should be a problem already solved elsewhere (e.g. the time-series FDRI product?). Eventually we will not need to upload the converted data to object storage as a manual step, as it will be done as part of the Beam pipeline anyway; but that requires moving off the DirectRunner, which means creating a Flink or Spark instance for the Beam Flink or Spark runners to use.
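A sketch of how the test could read both datasets from an S3-compatible object store, with credentials pulled from environment variables (which a GitHub Actions workflow could populate from repository secrets). The bucket paths, endpoint URL, and environment variable names are all placeholders:

```python
import os
import fsspec
import xarray as xr

storage_options = {
    "key": os.environ["OBJECT_STORE_KEY"],       # e.g. from CI secrets
    "secret": os.environ["OBJECT_STORE_SECRET"],
    "client_kwargs": {"endpoint_url": "https://object-store.example.org"},
}

# zarr can be opened straight from the store
ds_out = xr.open_zarr("s3://converted/output.zarr",
                      storage_options=storage_options)

# netCDF needs a file-like object; the h5netcdf engine can read from one
with fsspec.open("s3://originals/input.nc", **storage_options) as f:
    ds_in = xr.open_dataset(f, engine="h5netcdf")
```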

Lots of questions!!

@mattjbr123
Collaborator Author

The "Verifying the integrity of Zarr stores" section of this blog post has some ideas, and some helpful links to where it is being discussed! Something to keep an eye on until we get around to tackling this in earnest.

@mattjbr123
Collaborator Author

mattjbr123 commented Dec 5, 2024

Part of PADOCC, developed by @dwest77a, is a verification step.
PADOCC is a tool that converts or kerchunks datasets on the CEDA Archive at scale/in parallel. It includes a verification step (soon to be an importable module!) that checks that the data read into xarray in the original format is the same as the data read into xarray via zarr/kerchunk.
The comparison occurs on the xarray dataset objects.
Both the metadata and the actual data as read in by xarray are compared. The data comparison is done by selecting a random(?) box of data in which at least some proportion of the values is not NaN and comparing these, ideally across chunk boundaries to check that the correct ordering has been maintained.
I'd like to dig a little further into the details here when I get to working on this issue in anger.
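In the meantime, a rough sketch of the kind of check described above (my reading of it, not PADOCC's actual code): pick a randomly placed box, retry until a minimum fraction of it is non-NaN, then compare the two reads. The function name, box size, and thresholds are guesses; to genuinely span chunk boundaries, the box would need to exceed the zarr chunk shape along at least one dimension:

```python
import numpy as np
import xarray as xr

def compare_random_box(da_in: xr.DataArray, da_out: xr.DataArray,
                       box: int = 50, min_valid: float = 0.5,
                       tries: int = 20, seed: int = 0) -> None:
    """Compare one randomly placed box that is at least min_valid non-NaN."""
    rng = np.random.default_rng(seed)
    for _ in range(tries):
        sel = {}
        for dim, n in da_in.sizes.items():
            start = int(rng.integers(0, max(n - box, 1)))
            sel[dim] = slice(start, start + box)
        a = da_in.isel(sel).values
        if np.mean(~np.isnan(a)) >= min_valid:  # enough real data in the box?
            b = da_out.isel(sel).values
            np.testing.assert_allclose(a, b, equal_nan=True)
            return
    raise RuntimeError("no sufficiently non-NaN box found")
```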
