Compatibility with a zarr back end. #52
Comments
For sure! I just learnt enough about Zarr to really get interested in it, I'll have questions... 😀 |
so one question, why not use/require NetCDF itself? I've been wondering about trying this and I finally had a look at my normal system:
I expect none of the remote store stuff will be handled in the NetCDF library like this (?), but wouldn't that be a better pathway for an RNetCDF package? (I expect you've explored this, so I'll just dump my naive questions on you.) I'm actually very keen to learn Zarr at its core. It only just really clicked for me how simple it is, and the fact that we have an entirely-R implementation in pizzarr (where it's using standard packages to support remote access and compression details, afaict) and a C++ one in GDAL makes it very, very accessible to me. But I'm still massively confused about how to find Zarr sources (the pangeo forge example links I find are out of date or not accessible to me, etc). |
Just for example:
f <- system.file("extdata/bcsd.zarr/", package = "rnz", mustWork = TRUE)
nc <- RNetCDF::open.nc(sprintf("file://%s#mode=nczarr,file", f))
RNetCDF::print.nc(nc)
(I only just figured this out, reading the docs: https://docs.unidata.ucar.edu/nug/current/nczarr_head.html) |
I experimented with the NetCDF-C and GDAL pathways before and came to the conclusion that it's worth having a base-R implementation. Stuff like I have also wanted to more fully grasp the fundamentals of zarr... rnz and some contributions to pizzarr are my way to get there. A side benefit has been the opportunity to dabble in R6! (https://github.com/keller-mark/pizzarr/pull/82/files is a recent PR I made to pizzarr.)
For http zarrs, it's so frustrating. There are a few here: https://github.com/keller-mark/pizzarr/blob/main/tests/testthat/test-http-store.R and we have an open storage network pod going now (https://water.usgs.gov/catalog/usecases/8df9f64f-0f38-4849-9c6f-3d931fd2b2ba/) which will grow in its holdings in the next while. Note that test zarrs can be hosted via plain http (rawgit etc.) for basic testing.
Anyways -- happy to work up a PR for a branch with some of these ideas in it if you are interested. I think we are actually pretty close to zarr data "just working" with what I've got via pizzarr already. There are a lot of "#TODO" lines in the pizzarr repo yet though, so probably not production ready for a while. |
f <- "https://raw.githubusercontent.com/DOI-USGS/rnz/main/inst/extdata/bcsd.zarr/"
nc <- RNetCDF::open.nc(sprintf("%s#mode=nczarr,s3", f)) # doesn't work!
rnz::zdump(f) # works! |
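For context on why plain http hosting is enough for basic testing: a Zarr v2 store is just a set of keys (small metadata documents plus one object per chunk), so a reader only has to build URLs and issue GET requests. A minimal sketch in base R, assuming Zarr v2 with the default "." chunk-key separator. The helper names, the array name `pr`, and the chunk shape are hypothetical and are not taken from rnz or pizzarr:

```r
# Hypothetical helpers: build the URLs a reader would request from an
# http-hosted Zarr v2 store. No network access happens here.
zarr_key_url <- function(store, ...) {
  store <- sub("/+$", "", store)  # drop trailing slashes
  paste(c(store, ...), collapse = "/")
}

# Zarr v2 chunk keys are zero-based chunk-grid indices joined by "."
# (assuming the default dimension separator).
zarr_chunk_key <- function(index, chunk_shape) {
  stopifnot(length(index) == length(chunk_shape), all(index >= 0))
  paste(index %/% chunk_shape, collapse = ".")
}

store <- "https://raw.githubusercontent.com/DOI-USGS/rnz/main/inst/extdata/bcsd.zarr/"

# Group-level attributes and array-level metadata (shape, chunks, dtype).
zarr_key_url(store, ".zattrs")
zarr_key_url(store, "pr", ".zarray")  # "pr" is an assumed array name

# URL of the chunk holding zero-based element (10, 40, 70),
# assuming (for illustration only) chunks of 5 x 33 x 81.
zarr_key_url(store, "pr", zarr_chunk_key(c(10, 40, 70), c(5, 33, 81)))
```

Nothing in this sketch needs S3 or any service beyond a static file server, which may be part of why the `#mode=nczarr,s3` route and a plain http reader behave so differently.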
Excellent, totally appreciate your reply and all these details 👍 Fwiw I don't have strong ideas about "should", but it feels weird to extend ncmeta to an R implementation of a store that's "like netcdf". I must admit my ideas about where things "belong" have changed radically over the years, and I'm still coming to terms with how python has changed the landscape 🙏 |
Also, I didn't really understand the prospect for an R Zarr until yesterday when I finally really saw how it's structured, and I absolutely love it. (I wish we could apply smart geotransforms haha but let's see how it goes) |
This has been sitting for a while ... pizzarr is getting pretty good. My fork (https://github.com/dblodgett-usgs/ncmeta) has RNetCDF swapped out for rnz (https://github.com/DOI-USGS/rnz). We still need to get pizzarr on CRAN, but I'm going to start testing zarr as a first class citizen here and see what we see.
library(ncmeta)
http <- "https://usgs.osn.mghpcc.org/mdmf/gdp/hawaii_present.zarr"
nc_meta(http)
#> $dimension
#> # A tibble: 3 × 4
#> id name length coord_dim
#> <dbl> <chr> <int> <lgl>
#> 1 0 Time 175296 TRUE
#> 2 1 south_north 205 FALSE
#> 3 2 west_east 180 FALSE
#>
#> $variable
#> # A tibble: 28 × 6
#> id name type ndims natts dim_coord
#> <dbl> <chr> <chr> <int> <int> <lgl>
#> 1 0 CFRACL <i2 3 8 FALSE
#> 2 1 CFRACT <i2 3 8 FALSE
#> 3 2 FGDP <f4 3 7 FALSE
#> 4 3 GLW <f4 3 7 FALSE
#> 5 4 GRDFLX <f4 3 7 FALSE
#> 6 5 GSW <f4 3 7 FALSE
#> 7 6 HFX <f4 3 7 FALSE
#> 8 7 HGT <f4 2 8 FALSE
#> 9 8 I_RAINNC <i4 3 7 FALSE
#> 10 9 LAI <f4 3 7 FALSE
#> # ℹ 18 more rows
#>
#> $attribute
#> # A tibble: 338 × 4
#> id name variable value
#> <dbl> <chr> <chr> <named list>
#> 1 0 FieldType CFRACL <int [1]>
#> 2 1 MemoryOrder CFRACL <chr [1]>
#> 3 2 coordinates CFRACL <chr [1]>
#> 4 3 description CFRACL <chr [1]>
#> 5 4 grid_mapping CFRACL <chr [1]>
#> 6 5 scale_factor CFRACL <dbl [1]>
#> 7 6 stagger CFRACL <chr [1]>
#> 8 7 units CFRACL <chr [1]>
#> 9 0 FieldType CFRACT <int [1]>
#> 10 1 MemoryOrder CFRACT <chr [1]>
#> # ℹ 328 more rows
#>
#> $extended
#> # A tibble: 3 × 3
#> dimension name time
#> <dbl> <chr> <list>
#> 1 0 Time <CFtime>
#> 2 1 south_north <lgl [1]>
#> 3 2 west_east <lgl [1]>
#>
#> $axis
#> # A tibble: 75 × 3
#> axis variable dimension
#> <int> <chr> <dbl>
#> 1 1 CFRACL 0
#> 2 2 CFRACL 1
#> 3 3 CFRACL 2
#> 4 4 CFRACT 0
#> 5 5 CFRACT 1
#> 6 6 CFRACT 2
#> 7 7 FGDP 0
#> 8 8 FGDP 1
#> 9 9 FGDP 2
#> 10 10 GLW 0
#> # ℹ 65 more rows
#>
#> $grid
#> # A tibble: 3 × 4
#> grid ndims variables nvars
#> <chr> <int> <list> <int>
#> 1 D0,D1,D2 3 <tibble [22 × 1]> 22
#> 2 D1,D2 2 <tibble [4 × 1]> 4
#> 3 D0 1 <tibble [1 × 1]> 1
#>
#> $source
#> # A tibble: 1 × 2
#> access source
#> <dttm> <chr>
#> 1 2024-11-11 22:16:32 https://usgs.osn.mghpcc.org/mdmf/gdp/hawaii_present.zarr
#>
#> attr(,"class")
#> [1] "ncmeta"
Created on 2024-11-11 with reprex v2.1.1 |
Sweet! Btw I've been making virtual Zarr that references netcdfs in object store: https://gist.github.com/mdsumner/c72ff510bf41c433662ef703a635daf8 (Also I met Brianna Pagán in person last week, which was awesome 👍) |
Also I found some bugs in GDAL Zarr and toyed with exposing that better in gdalraster, and played with zarrs, the Rust library, spurred on by Icechunk. Ultimately I think we need xarray in Rust. But I agree with you that R is an excellent place to write Zarr support from scratch. (Although I see you are going down the netcdf-lib for Zarr approach. I couldn't get S3 support built in. Do you have a good build workflow that does that? Maybe I'm doing something wrong.) |
I'm actually not using netcdf to access zarr. I'm using |
I see, I'm sorry, I got confused about the nc-like behaviour 🙏 |
The trick worked! My goal with |
I've been playing with the new pizzarr package (https://github.com/keller-mark/pizzarr) in a package called rnz (https://github.com/DOI-USGS/rnz). So far, rnz implements the read side of the RNetCDF functions.
I am going to play with this in a fork but wanted to run the idea by you @mdsumner. If we were to create ncmeta functions that wrap open.nc, close.nc, file.inq.nc, and att.get.nc, we could call a zarr back end basically seamlessly with what I've worked up in rnz. When this is all up on CRAN, would you be interested in such a set up?
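The wrapping idea could look roughly like the sketch below: pick a back end from the source string, then forward each call to that back end's implementation. Everything here is an assumption for illustration: the wrapper names (`nc_backend`, `meta_open`), the ".zarr" suffix heuristic, the file name `example.nc`, and the stub back ends, which stand in for RNetCDF::open.nc and its rnz counterpart so that the block runs without either package installed.

```r
# Hypothetical heuristic: treat sources ending in ".zarr" (optionally with a
# trailing slash) as Zarr stores; everything else goes to NetCDF.
nc_backend <- function(x) {
  if (grepl("\\.zarr/?$", x, ignore.case = TRUE)) "zarr" else "netcdf"
}

# Stub back ends. In a real package these entries would point at
# RNetCDF::open.nc / close.nc / file.inq.nc / att.get.nc and at the matching
# read-side functions in rnz (their names are not assumed here).
backends <- list(
  netcdf = list(open = function(x) structure(list(source = x), class = "stub_nc")),
  zarr   = list(open = function(x) structure(list(source = x), class = "stub_zarr"))
)

# Hypothetical wrapper: one entry point, dispatching on the detected back end.
meta_open <- function(x, backends) {
  backends[[nc_backend(x)]]$open(x)
}

class(meta_open("https://usgs.osn.mghpcc.org/mdmf/gdp/hawaii_present.zarr", backends))
class(meta_open("example.nc", backends))
```

Keeping the back ends in a plain list is one way to leave the Zarr path optional until rnz and pizzarr are on CRAN; S3 classes on the returned handle would be another reasonable design.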