Replies: 15 comments 2 replies
-
I'll share our use-case at Open Climate Fix:
-
Our use case at WEBKNOSSOS:
-
Camus Energy is reading NOAA weather data to forecast electric load and generation. Here is a fragment of a kerchunk-style reference we use for HRRR GRIB2 data:
"t/.zarray": "{\"chunks\":[1,1059,1799],\"compressor\":null,\"dtype\":\"<f8\",\"fill_value\":9999.0,\"filters\":[{\"dtype\":\"float64\",\"id\":\"grib\",\"var\":\"t\"}],\"order\":\"C\",\"shape\":[1,1059,1799],\"zarr_format\":2}",
"t/.zattrs": "{\"NV\":0,\"_ARRAY_DIMENSIONS\":[\"valid_time\",\"x\",\"y\"],\"cfName\":\"air_temperature\",\"cfVarName\":\"t\",\"dataDate\":20210601,\"dataTime\":0,\"dataType\":\"fc\",\"endStep\":0,\"gridDefinitionDescription\":\"Lambert Conformal can be
secant or tangent, conical or bipolar\",\"gridType\":\"lambert\",\"missingValue\":9999,\"name\":\"Temperature\",\"numberOfPoints\":1905141,\"paramId\":130,\"shortName\":\"t\",\"stepType\":\"instant\",\"stepUnits\":1,\"typeOfLevel\":\"surface\",\"units\":\"
K\"}",
"t/0.0.0": [
"gcs://high-resolution-rapid-refresh/hrrr.20210601/conus/hrrr.t00z.wrfsfcf00.grib2",
33399570,
1357877
],
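For readers unfamiliar with this layout: the mapping above is a kerchunk reference, where each Zarr chunk key points at a byte range inside the original GRIB2 file. Below is a minimal sketch of opening such a reference lazily with fsspec and xarray; the reference file name is hypothetical and the storage options depend on your setup.

```python
# Sketch only: open a kerchunk reference file (e.g. one built with kerchunk.grib2)
# as a lazy xarray Dataset. "hrrr_refs.json" is a hypothetical file containing
# a mapping like the fragment above.
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "hrrr_refs.json",          # the reference JSON
            "remote_protocol": "gcs",        # chunks are byte ranges in GRIB2 files on GCS
            "remote_options": {"token": "anon"},
        },
    },
)
print(ds["t"])  # reading values fetches only the referenced byte range
```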
I think the overall direction of unblocking the Python GIL at key points in this stack using Rust is the way to go.
-
OME-Zarr might be of interest. It is the new standard that everyone in microscopy seems to be converging on. It defines a specific structure and metadata for storing microscope images. More info: https://github.com/ome/ngff The Zarr implementation: https://github.com/ome/ome-zarr-py
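A minimal sketch of reading an OME-Zarr image with ome-zarr-py, following the reader pattern from its documentation (the path is a placeholder):

```python
# Sketch only: lazily read a (hypothetical) OME-Zarr image with ome-zarr-py.
from ome_zarr.io import parse_url
from ome_zarr.reader import Reader

reader = Reader(parse_url("path/to/image.ome.zarr"))
nodes = list(reader())
pyramid = nodes[0].data   # list of dask arrays, one per resolution level
print(pyramid[0].shape)   # highest-resolution level
```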
-
RPS Ocean Science is using zarr to archive, distribute, and visualize model data as a part of IOOS and RPS' Next Generation Data Management Project.
This stack is working really well for us; the next step is figuring out more efficient ways to fetch the data from the cloud as part of web service backends while handling many requests at once. Coordinating async chunk fetching and caching is tough with xarray and dask for many reasons, but figuring out how to improve this is vital to improving our web services.
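One possible direction (a sketch under assumptions, not RPS' actual setup): wrapping the remote store in one of fsspec's caching filesystems so repeated requests against a web backend can reuse already-fetched chunks.

```python
# Sketch only: bucket name and cache directory are hypothetical.
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "blockcache",                            # block-level read cache
    target_protocol="s3",
    target_options={"anon": True},
    cache_storage="/tmp/zarr-chunk-cache",
)
store = fs.get_mapper("my-bucket/model-output.zarr")
ds = xr.open_zarr(store, consolidated=True)  # chunk reads go through the cache
```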
-
Thanks so much for all these use-cases! This will be super-useful, and it's especially useful to see the kerchunk use-cases. I must admit I had been largely focused on improving the performance of datasets stored natively as Zarr on disk, but we clearly also need to think about how to speed up reading GRIB2 and NetCDF. (Which is fine: we were already thinking of splitting our work into at least two parts: a general-purpose, fast, async, batched IO backend, and a "Zarr front end". The IO backend could potentially also be used for reading GRIB and NetCDF.) I've added two more questions to the list (q9 and q10), but please don't feel obliged to update existing answers!
-
In my experience, the speed of reading NetCDF and GRIB tends to be reasonable: a single GFS GRIB chunk decode using cfgrib typically takes on the order of 30 ms uncached. This is on the slower side, too; spatially smaller datasets are much faster, often decoding in under 10 ms. (I am also working on gribberish, experimenting with improving GRIB reading and decoding with Rust, which is showing some promise of being faster for most types of GRIB packing.) We are serving map tiles of the data, though, so we are reading chunks covering the entire world to make small pictures, which is not efficient, but we are locked to using source data as distributed by NOAA or others for robustness reasons. So the bottleneck for us is usually IO, as we are forced to read chunks of at least 1 MB or larger. I hope this is illustrative and helpful!
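For reference, the per-chunk decode step described above looks roughly like this with cfgrib through xarray (the file name and filter keys are illustrative assumptions):

```python
# Sketch only: decode a single surface-temperature field from a GFS GRIB2 file.
import xarray as xr

ds = xr.open_dataset(
    "gfs.t00z.pgrb2.0p25.f000",   # hypothetical local GRIB2 file
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface", "shortName": "t"}},
)
field = ds["t"].values  # the actual GRIB message decode happens here
```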
-
For completeness: also see this comment about "super-zarr" parallelism, especially dask. Perhaps our benchmarking workloads should include workloads which use dask.
-
Here is another use case: ML on digital pathology whole slide images
-
Our use case: Marine Heatwaves web API
A brief description of your use-case:
Performance requirements (if any):
Chunk sizes: (auto) date: 24, lat: 720, lon: 1440
-
Distributed Data Processing with Cubed
Processing large datasets using Cubed. Cubed is a serverless distributed framework that drops into xarray in place of dask, and tries to limit RAM usage in computations by writing intermediate steps to persistent storage (an idea taken from Pangeo's Rechunker).
We haven't done much benchmarking (there is some comparison to using dask for the same problem in this blog post), but any IO speedup should make a big difference for Cubed (xref cubed-dev/cubed#187).
We've been recommending ~100 MB chunks, but I don't think this is a requirement in general.
No strong opinions AFAIK.
In production Cubed is primarily designed to be deployed on serverless infrastructure, so Linux.
Blob storage in the cloud. The whole idea of Cubed is to have many serverless workers reading and writing to cloud storage in parallel.
We can call Cubed from xarray using cubed-xarray; see the blog post for details (and the minimal sketch after this list).
Could be anything that you want to perform numpy-like operations on - Cubed is a domain-agnostic processing framework like dask.
Potentially both. We want to support the whole xarray API.
Mostly similar to the initial read step that is typical when using dask. So we likely read either the entire dataset, or some large subset of it, in a pattern that corresponds to the analysis the user wants to do (e.g. a spatially-contiguous subset, or downsampling in time). Then, as Cubed performs reductions, the intermediate stores we create generally get smaller, and we read and write those in their entirety at each step. cc @tomwhite
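A minimal sketch of the cubed-xarray usage mentioned above (store paths, variable names and Spec values are hypothetical; see the cubed-xarray README for the supported options):

```python
# Sketch only: use Cubed instead of dask as xarray's chunked-array backend.
import cubed
import xarray as xr
import cubed_xarray  # noqa: F401  # usually registered via entry points; import shown for clarity

spec = cubed.Spec(work_dir="s3://my-bucket/cubed-tmp", allowed_mem="2GB")

ds = xr.open_dataset(
    "s3://my-bucket/input.zarr",
    engine="zarr",
    chunks={},
    chunked_array_type="cubed",
    from_array_kwargs={"spec": spec},
)
result = ds["temperature"].mean(dim="time")  # builds a lazy Cubed plan
result.compute()                             # executes, spilling intermediates to work_dir
```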
-
The biggest limiter is conserving metadata and custom grids, and writing the data into an absurd amount of storage and inodes.
-
We use Distributed Temperature Sensing (DTS) and Distributed Acoustic Sensing (DAS) technology, which produces arrays with a dimension of 2,000 x 5,000 each second. Data processing is done close to data production, and a first step is normalisation and standardisation of the data into an intermediate "buffer", from which new pipelines are started for further processing and feature extraction. Both this buffer and the final processed data are stored in Zarr format. A third dimension is added in some cases where we convert to the frequency domain. Due to limitations on storage space, and because there's no need to keep our buffers indefinitely, we have extended Zarr to support "trimmable" stores, i.e. stores where you can trim the start of the time axis while extending the end of the time axis as new data is appended and older data becomes obsolete. We're planning to open-source this code. An API exposes the various datasets.
Our production system should be able to store 35 MB/s of data from its raw format into the buffer store, while also running the various postprocessing pipelines. We've chosen a high-end AMD Ryzen 5 with 128 GB of RAM.
20,000 (time, 10 s) x 1,000 (positions) for the preprocessing and frequency conversion. The time size is defined by the incoming data size, which dictates further processing in chunks of 10 seconds. Since the interrogator switches between fibers, there's a lot of time without data for the other fibers. Zarr does not need to write the empty chunks, which saves a lot of space, although it does require some creative indexing and seeking.
We use the standard Blosc compressor, which seems fast enough for our needs.
Debian Linux (Bullseye)
For one application, we have 10 x 16 TB Seagate IronWolf drives in RAID10 with a Highpoint Rocket 720 controller, giving a single BTRFS volume of about 80 TB.
We use xarray in combination with some custom code. Currently dask is not used in this real-time pipeline, only for data analysis (where we do use it). We use Celery to parallelise processing.
From a DAS or DTS interrogator as HDF5 files. We also want to be able to process streamed data, e.g. via Apache Kafka in the AVRO format.
We use integer array indexes with our custom implementation of the trimmable/shifted Zarr store (which also handles the timestamp-to-integer conversion under the hood).
We read slices (aligned with our chunk definition) for our data processing, and random slices for API access.
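This is not their custom trimmable-store code, but a minimal sketch of the two stock Zarr features the description relies on: appending along the time axis and skipping empty chunks (chunk sizes match the numbers above; the path is hypothetical, zarr-python v2 API assumed):

```python
# Sketch only: append 10 s blocks along time without materialising empty chunks.
import numpy as np
import zarr

z = zarr.open(
    "das_buffer.zarr",
    mode="a",
    shape=(0, 1000),
    chunks=(20_000, 1000),     # 20,000 time samples x 1,000 positions
    dtype="f4",
    fill_value=0.0,
    write_empty_chunks=False,  # all-fill-value chunks are never written, saving space
)

block = np.random.standard_normal((20_000, 1000)).astype("f4")  # one incoming 10 s block
z.append(block, axis=0)                                         # extend the time axis
```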
-
Fantastic thread! Very interesting read - thanks everyone for sharing. Here is our use case.
At Oikolab, we provide weather & climate data (ERA5/ERA5Land/GFS/HRRR etc.) to analysts - our target users are climate analysts or Excel/MATLAB warriors who might have a need for climate data and know enough to request a specific dataset or parameters to feed into their workflow, but don't have time to go to the primary sources or learn about all the tools (reference). We also provide weather parameters that are not normally in the primary dataset, such as wet-bulb temperature, which are calculated on the fly. Most users look for location-based time-series data, say 100m wind data or solar data over 10 years, but we also have users who are looking to download a regional area (e.g. CONUS), say in NetCDF, and just looking for a faster way to do this than going to the primary sources - in seconds or minutes rather than hours. Business users tend to track data for multiple locations, so we provide the ability to query with a list of lat/lon pairs, usually up to several hundred at a time.
We run these on commodity Ubuntu servers, so we need to run our scripts within reasonable memory. For raw data processing, we're normally limited by the network bandwidth (10 Gbps) for fetching/uploading data, but we've had some issues in the past with a memory leak (dask/distributed#5960) where memory is not released back to the OS after each request, which is a problem for running API servers.
We found 10-20 MB to be a sweet spot for most of our datasets, although this needs to be balanced against the total number of chunks. There is some waste here as we typically read more data than needed, but this is not the bottleneck for our use-case.
Nothing fancy here - we use the default as we’re mostly concerned with compression/decompression speed.
Ubuntu/Debian mostly, and Mac for dev.
We use a combination of cloud-hosted S3 services (Wasabi, Digital Ocean etc.) and locally attached volumes. We recently tried out R2 but found it to have lower rate limits than the others in terms of the number of objects written per second.
Yes.
National weather agencies such as NCEP and ECMWF. Most of these are in GRIB2 format, so we run cron jobs to process the data. In our pipeline we also use tools such as wgrib2 and CDO for working with GRIB formats, as we found them easier than going via cfgrib/Python.
We use datetimes.
For our API servers, entire datasets are read lazily on start-up and applicable subsets are loaded per user request - which could be a single point, multiple points, or a region. They are almost always sliced by time and sometimes by a bounding box, as per the request.
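As a rough illustration of the multi-point query pattern described above (the store path, variable and dimension names are assumptions, not Oikolab's actual setup):

```python
# Sketch only: extract a 10-year time series for a list of lat/lon pairs.
import xarray as xr

ds = xr.open_zarr("s3://weather-archive/era5.zarr", consolidated=True)

lats = xr.DataArray([51.5, 40.7, 35.7], dims="points")
lons = xr.DataArray([-0.1, -74.0, 139.7], dims="points")

# Vectorised nearest-neighbour selection: one series per requested point.
series = (
    ds["t2m"]
    .sel(latitude=lats, longitude=lons, method="nearest")
    .sel(time=slice("2013-01-01", "2022-12-31"))
    .load()
)
```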
-
Please share your use-case(s) for Zarr, to help inform our performance benchmarking.
xref: #1479
If possible, please share:
(EDIT: I added questions 9 and 10 on the 7th August)
Two use-cases have already been shared in this thread.