Performance roadmap #104
Comments
To demonstrate my point about (2), it looks like storing a million references in-memory can be done using numpy taking up only 24MB.

```python
In [1]: import numpy as np

# Using numpy 2.0 for the variable-length string dtype
In [2]: np.__version__
Out[2]: '2.0.0rc1'

# The number of chunks in this zarr array
In [3]: N_ENTRIES = 1000000

In [4]: SHAPE = (100, 100, 100)

# Taken from Julius' example in issue #93
In [5]: url_prefix = 's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-'

# Notice these strings will have slightly different lengths
In [6]: paths = [f"{url_prefix}{i}.nc" for i in range(N_ENTRIES)]

# Here's where we need numpy 2
In [7]: paths_arr = np.array(paths, dtype=np.dtypes.StringDType).reshape(SHAPE)

# These are the dimensions of our chunk grid
In [8]: paths_arr.shape
Out[8]: (100, 100, 100)

In [9]: paths_arr.nbytes
Out[9]: 16000000

In [10]: offsets_arr = np.arange(N_ENTRIES, dtype=np.dtype('int32')).reshape(SHAPE)

In [11]: offsets_arr.nbytes
Out[11]: 4000000

In [12]: lengths_arr = np.repeat(100, repeats=N_ENTRIES).astype('int32').reshape(SHAPE)

In [13]: lengths_arr.nbytes
Out[13]: 4000000

In [14]: (paths_arr.nbytes + offsets_arr.nbytes + lengths_arr.nbytes) / 1e6
Out[14]: 24.0
```

i.e. only 24MB to store a million references (for one array). This is all we need for (2) I think.

It would be nice to put all 3 fields into one structured array instead of 3 separate numpy arrays, but apparently that isn't possible yet (see Jeremy's comment zarr-developers/zarr-specs#287 (comment)). But that's just an implementation detail, as this would all be hidden inside the manifest class anyway.

24MB per array means that even a really big store with 100 variables, each with a million chunks, still only takes up 2.4GB in memory - i.e. your xarray "virtual" dataset would be ~2.4GB to represent the entire store. This is smaller than worker memory, which implies we shouldn't need dask to perform the concatenation (so we shouldn't need dask for (3)?).

The next thing we should look at is how much space these references would take up on-disk as JSON/parquet/special zarr arrays.
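Until a structured dtype with variable-length strings is possible, the three arrays could simply be grouped behind one small container. This is only an illustrative sketch (the class name and layout are hypothetical, not VirtualiZarr's actual implementation):

```python
from dataclasses import dataclass

import numpy as np


# Hypothetical container grouping the three reference arrays from the session
# above; a sketch of the idea only, not an actual ChunkManifest implementation.
@dataclass
class InMemoryManifest:
    paths: np.ndarray    # variable-length string dtype, shape = chunk grid
    offsets: np.ndarray  # int32, byte offset of each chunk within its file
    lengths: np.ndarray  # int32, byte length of each chunk

    @property
    def nbytes(self) -> int:
        return self.paths.nbytes + self.offsets.nbytes + self.lengths.nbytes


# e.g. manifest = InMemoryManifest(paths_arr, offsets_arr, lengths_arr)
#      manifest.nbytes / 1e6  ->  24.0
```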
See Ryan's comment (#33 (comment)) suggesting specific compressor codecs to use for storing references in zarr arrays.
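Ryan's specific codec suggestions aren't reproduced here, but as a rough sketch of the general idea (assuming the zarr v2 / numcodecs API, with made-up codec choices): the integer offset/length arrays compress well behind a delta filter, and the highly repetitive path strings behind a general-purpose compressor.

```python
import numcodecs
import numpy as np
import zarr

# Dummy reference data standing in for a real manifest (values are made up)
offsets_arr = np.arange(1_000_000, dtype="int32").reshape(100, 100, 100)
paths_arr = np.array(
    [f"s3://bucket/file_{i}.nc" for i in range(1_000_000)], dtype=object
).reshape(100, 100, 100)

# Offsets/lengths: delta filter + Blosc(zstd) - an assumption, not necessarily
# the codecs suggested in #33
offsets_z = zarr.array(
    offsets_arr,
    chunks=(100, 100, 100),
    filters=[numcodecs.Delta(dtype="i4")],
    compressor=numcodecs.Blosc(cname="zstd", clevel=5),
)

# Paths: variable-length strings, which compress very well because they share
# long common prefixes
paths_z = zarr.array(
    paths_arr,
    chunks=(100, 100, 100),
    object_codec=numcodecs.VLenUTF8(),
    compressor=numcodecs.Zstd(level=5),
)

print(offsets_z.nbytes_stored, paths_z.nbytes_stored)
```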
I tried this out, see earth-mover/icechunk#401
We want to be able to create, combine, serialize, and use manifests that point to very large numbers of files. The largest Zarr stores we already see have O(1e6) chunks per array, and 10's or 100's of arrays (e.g. the National Water Model dataset). To be able to virtualize this many references at once, there are multiple places that could become performance bottlenecks.
Reference generation
`open_virtual_dataset` should create `ManifestArray` objects from URLs with minimal memory overhead. The current implementation (i.e. using kerchunk's `SingleHdf5ToZarr` to generate references for a netCDF file) is unlikely to be as efficient as possible, because it first creates a python dict rather than a more efficient in-memory representation.
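For a sense of what the current path does, here is a rough sketch of the kerchunk route (the file URL is a placeholder): the `.translate()` call materializes every reference as an entry in a plain python dict before anything numpy-shaped is built.

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://some-bucket/some-file.nc"  # placeholder URL

# Kerchunk scans the HDF5/netCDF4 file and returns a dict of references
with fsspec.open(url, "rb") as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# refs["refs"] maps every chunk key (e.g. "uo/0.0.0") to [url, offset, length]
# (or to inlined bytes): one python list and several python objects per chunk,
# which is much heavier than a packed numpy representation of the same info.
print(list(refs["refs"].items())[:3])
```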
In-memory representation
It's crucial that our in-memory representation of the manifest be as memory-efficient as possible. What's the smallest that the in-memory representation of e.g. a single chunk manifest containing a million references could be? I think it's only ~100MB, if we use a numpy array (or 3 numpy arrays) and avoid object dtypes. We should be able to test this easily ourselves by creating a manifest with some dummy references data. See "In-memory representation of chunks: array instead of a dict?" #33.
Combining
If we can fit all the `ManifestArray` objects we need into memory on one worker, this part is easy. Again, using numpy arrays is good because `np.concatenate`/`np.stack` should be memory-efficient.

If the combined manifest is too big to fit into a single worker's memory, then we might want to use dask to create a task graph, with the root tasks generating the references, concatenation via a tree of tasks (not technically a reduction because the output is as big as the input), and then writing out chunkwise to some form of storage. Overall the result would be similar to how kerchunk uses `auto_dask`.
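As a minimal sketch of why the in-memory case is cheap, assuming manifests are held as numpy arrays as in the session above (variable names are made up, and numpy >= 2.0 is assumed for the string dtype): combining along a dimension of the chunk grid is just per-field `np.concatenate`.

```python
import numpy as np

# Two hypothetical single-file manifests, each covering a (1, 100, 100) slab of
# the chunk grid, to be combined along axis 0
paths_a = np.full((1, 100, 100), "s3://bucket/a.nc", dtype=np.dtypes.StringDType())
paths_b = np.full((1, 100, 100), "s3://bucket/b.nc", dtype=np.dtypes.StringDType())
offsets_a = np.zeros((1, 100, 100), dtype="int32")
offsets_b = np.zeros((1, 100, 100), dtype="int32")

# Combining is per-field array concatenation - no per-reference python objects
paths = np.concatenate([paths_a, paths_b], axis=0)
offsets = np.concatenate([offsets_a, offsets_b], axis=0)
print(paths.shape)  # (2, 100, 100)
```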
Serialization
Writing out the manifest to JSON as text would create an on-disk representation that is larger than necessary. Writing out to a compressed / chunked format such as parquet could ameliorate this. This could be kerchunk's parquet format, or we could use a compressed format in the Zarr manifest storage transformer ZEP (see discussion in "Manifest storage transformer" zarr-specs#287).
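As a rough sketch of the parquet route (the column names and compression settings here are assumptions, not a defined reference format), the three fields flatten naturally into table columns:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Dummy reference data (values made up); real manifests would flatten their
# chunk-grid-shaped arrays with .ravel() first
paths = np.array([f"s3://bucket/file_{i}.nc" for i in range(1_000_000)])
offsets = np.zeros(1_000_000, dtype="int64")
lengths = np.full(1_000_000, 100, dtype="int64")

table = pa.table({"path": paths, "offset": offsets, "length": lengths})

# zstd compression plus parquet's dictionary encoding should collapse the
# heavily repeated path prefixes, unlike uncompressed JSON text
pq.write_table(table, "manifest.parquet", compression="zstd")
```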
We should work out what the performance of each of these steps is; they are separate, so we can look at them individually.
cc @sharkinsspatial @norlandrhagen @keewis