Compatibility with a zarr back end. #52

Open
dblodgett-usgs opened this issue May 28, 2024 · 13 comments

Comments

@dblodgett-usgs
Contributor

I've been playing with the new pizzarr package (https://github.com/keller-mark/pizzarr) in a package called rnz (https://github.com/DOI-USGS/rnz) -- So far, rnz implements the read side of the RNetCDF functions.

I am going to play with this in a fork but wanted to run the idea by you @mdsumner. If we were to create ncmeta functions that wrap open.nc, close.nc, file.inq.nc, and att.get.nc, we could call a zarr back end basically seamlessly with what I've worked up in rnz. When this is all up on CRAN, would you be interested in such a setup?

@mdsumner
Member

For sure! I just learnt enough about Zarr to really get interested in it, I'll have questions... 😀

@mdsumner
Member

mdsumner commented May 28, 2024

so one question, why not use/require NetCDF itself? I've been wondering about trying this and I finally had a look at my normal system:

nc-config --all | grep z
  --static        -> -lhdf5_hl -lhdf5 -lcrypto -lcurl -lpthread -lsz -lz -ldl -lm
  --has-szlib     -> yes
  --has-nczarr    -> yes

I expect none of the remote store stuff will be handled in the NetCDF library like this (?) but wouldn't that be a better pathway for an RNetCDF package? (I expect you've explored this so will just dump my naive questions on you).

I'm actually very keen to learn Zarr at its core. I only just really clicked to how simple it is, and the fact that we have an entirely-R implementation in pizzarr (where it's using standard packages to support remote access and compression details afaict), and a C++ one in GDAL, makes it very accessible to me. But I'm still massively confused about how to find Zarr sources (the pangeo forge example links I find are out of date or not accessible to me, etc).
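
As a rough illustration of that simplicity (a sketch only, not code from pizzarr or GDAL): in a Zarr v2 store each chunk of an array is an object whose key is the zero-based chunk coordinates joined by a separator, "." by default. The helper name `zarr_chunk_key` and the chunk shape below are made up for the example.

```r
# Sketch: map a 1-based R array index to a Zarr v2 chunk key.
# Assumes the default "." dimension separator; `zarr_chunk_key` is a
# hypothetical helper, not part of pizzarr or rnz.
zarr_chunk_key <- function(index, chunks, sep = ".") {
  stopifnot(length(index) == length(chunks), all(index >= 1), all(chunks >= 1))
  # zero-based chunk coordinate along each dimension
  chunk_id <- (index - 1) %/% chunks
  paste(sprintf("%d", as.integer(chunk_id)), collapse = sep)
}

# Hypothetical 3-D array stored in chunks of 512 x 205 x 180 elements
zarr_chunk_key(c(1, 1, 1), chunks = c(512, 205, 180))
zarr_chunk_key(c(513, 10, 20), chunks = c(512, 205, 180))
```

The first call falls in the first chunk along every dimension; the second crosses into the next chunk along the first dimension only. A reader then fetches the object at `<array name>/<chunk key>` from whatever store holds the data.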

@mdsumner
Member

Just for example:

f <- system.file("extdata/bcsd.zarr/", package = "rnz", mustWork = TRUE)

nc <- RNetCDF::open.nc(sprintf("file://%s#mode=nczarr,file", f))
RNetCDF::print.nc(nc)

(I only just figured this out, reading the docs: https://docs.unidata.ucar.edu/nug/current/nczarr_head.html)
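
A small sketch of how those URLs are put together, assuming the `#mode=nczarr,<store>` fragment form used above with the store types `file`, `zip`, and `s3` described in the linked docs. `nczarr_url` is a hypothetical helper, not an RNetCDF function.

```r
# Sketch: build an NCZarr-style URL for the netCDF-C library.
# `nczarr_url` is a made-up name; only the "#mode=nczarr,<store>" fragment
# form comes from the example above and the NCZarr docs.
nczarr_url <- function(path, store = c("file", "zip", "s3")) {
  store <- match.arg(store)
  # Prefix local paths with file://, leave anything with a scheme untouched
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", path)
  if (!has_scheme) path <- paste0("file://", path)
  paste0(path, "#mode=nczarr,", store)
}

nczarr_url("/data/bcsd.zarr")
# Opening the result needs a netCDF-C build with NCZarr enabled
# (nc-config reports --has-nczarr -> yes), e.g.:
# nc <- RNetCDF::open.nc(nczarr_url("/data/bcsd.zarr"))
```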

@dblodgett-usgs
Contributor Author

I experimented with the NetCDF-C and GDAL pathways before and came to the conclusion that it's worth having a base-R implementation.

Stuff like "file://%s#mode=nczarr,file" 🤢 and the extra layer of obfuscation for http library basics stand out, but just on principle, I want to make sure we don't fully rely on non-R logic for this kind of stuff.

I have also wanted to more fully grasp the fundamentals of zarr... rnz and some contributions to pizzarr are my way to get there. A side benefit has been the opportunity to dabble in R6! (https://github.com/keller-mark/pizzarr/pull/82/files is a recent PR I made to pizzarr)

For http zarrs, it's so frustrating. There are a few here: https://github.com/keller-mark/pizzarr/blob/main/tests/testthat/test-http-store.R and we have an open storage network pod going now (https://water.usgs.gov/catalog/usecases/8df9f64f-0f38-4849-9c6f-3d931fd2b2ba/) which will grow in its holdings in the next while.

Note that test zarrs can be hosted via plain http (rawgit etc.) for basic testing.
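
Plain http hosting works because a Zarr v2 store is just a set of keys under a root. A sketch of the metadata URLs a reader might request, assuming the v2 key names `.zgroup`, `.zattrs`, `.zarray`, and the consolidated `.zmetadata` convention; `zarr_v2_metadata_urls`, the host, and the array name are made up for illustration, and this is not how pizzarr is implemented.

```r
# Sketch: list the metadata URLs for a Zarr v2 group served over plain http.
# Hypothetical helper; key names follow the Zarr v2 layout.
zarr_v2_metadata_urls <- function(root, arrays = character()) {
  root <- sub("/+$", "", root)  # tolerate a trailing slash on the root
  array_keys <- unlist(lapply(arrays, function(a) paste0(a, c("/.zarray", "/.zattrs"))))
  keys <- c(".zgroup", ".zattrs", ".zmetadata", array_keys)
  paste0(root, "/", keys)
}

zarr_v2_metadata_urls("https://example.org/store.zarr/", arrays = "pr")
```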

Anyways -- happy to work up a PR for a branch with some of these ideas in if you are interested. I think we are actually pretty close to zarr data "just working" with what I've got via pizzarr already. There are a lot of "#TODO" lines in the pizzarr repo yet though, so probably not production ready for a while.

@dblodgett-usgs
Contributor Author

f <- "https://raw.githubusercontent.com/DOI-USGS/rnz/main/inst/extdata/bcsd.zarr/"

nc <- RNetCDF::open.nc(sprintf("%s#mode=nczarr,s3", f)) # doesn't work!

rnz::zdump(f) # works!

@mdsumner
Member

mdsumner commented May 29, 2024

Excellent, totally appreciate your reply and all these details 👍

Fwiw I don't have strong ideas about "should", but it feels weird to extend ncmeta to an R implementation of a store that's "like netcdf", but I must admit my ideas about where things "belong" have changed radically over the years and I'm still coming to terms with how python has changed the landscape 🙏

@mdsumner
Member

Also, I didn't really understand the prospect for an R Zarr until yesterday when I finally really saw how it's structured, and I absolutely love it. (I wish we could apply smart geotransforms haha but let's see how it goes)

@dblodgett-usgs
Contributor Author

This has been sitting for a while ... pizzarr is getting pretty good. My fork has RNetCDF swapped out for rnz so we can access NetCDF or ZARR using ncmeta.

https://github.com/dblodgett-usgs/ncmeta https://github.com/DOI-USGS/rnz

We still need to get pizzarr on CRAN, but I'm going to start testing zarr as a first class citizen here and see what we see.

library(ncmeta)

http <- "https://usgs.osn.mghpcc.org/mdmf/gdp/hawaii_present.zarr"

nc_meta(http)
#> $dimension
#> # A tibble: 3 × 4
#>      id name        length coord_dim
#>   <dbl> <chr>        <int> <lgl>    
#> 1     0 Time        175296 TRUE     
#> 2     1 south_north    205 FALSE    
#> 3     2 west_east      180 FALSE    
#> 
#> $variable
#> # A tibble: 28 × 6
#>       id name     type  ndims natts dim_coord
#>    <dbl> <chr>    <chr> <int> <int> <lgl>    
#>  1     0 CFRACL   <i2       3     8 FALSE    
#>  2     1 CFRACT   <i2       3     8 FALSE    
#>  3     2 FGDP     <f4       3     7 FALSE    
#>  4     3 GLW      <f4       3     7 FALSE    
#>  5     4 GRDFLX   <f4       3     7 FALSE    
#>  6     5 GSW      <f4       3     7 FALSE    
#>  7     6 HFX      <f4       3     7 FALSE    
#>  8     7 HGT      <f4       2     8 FALSE    
#>  9     8 I_RAINNC <i4       3     7 FALSE    
#> 10     9 LAI      <f4       3     7 FALSE    
#> # ℹ 18 more rows
#> 
#> $attribute
#> # A tibble: 338 × 4
#>       id name         variable value       
#>    <dbl> <chr>        <chr>    <named list>
#>  1     0 FieldType    CFRACL   <int [1]>   
#>  2     1 MemoryOrder  CFRACL   <chr [1]>   
#>  3     2 coordinates  CFRACL   <chr [1]>   
#>  4     3 description  CFRACL   <chr [1]>   
#>  5     4 grid_mapping CFRACL   <chr [1]>   
#>  6     5 scale_factor CFRACL   <dbl [1]>   
#>  7     6 stagger      CFRACL   <chr [1]>   
#>  8     7 units        CFRACL   <chr [1]>   
#>  9     0 FieldType    CFRACT   <int [1]>   
#> 10     1 MemoryOrder  CFRACT   <chr [1]>   
#> # ℹ 328 more rows
#> 
#> $extended
#> # A tibble: 3 × 3
#>   dimension name        time     
#>       <dbl> <chr>       <list>   
#> 1         0 Time        <CFtime> 
#> 2         1 south_north <lgl [1]>
#> 3         2 west_east   <lgl [1]>
#> 
#> $axis
#> # A tibble: 75 × 3
#>     axis variable dimension
#>    <int> <chr>        <dbl>
#>  1     1 CFRACL           0
#>  2     2 CFRACL           1
#>  3     3 CFRACL           2
#>  4     4 CFRACT           0
#>  5     5 CFRACT           1
#>  6     6 CFRACT           2
#>  7     7 FGDP             0
#>  8     8 FGDP             1
#>  9     9 FGDP             2
#> 10    10 GLW              0
#> # ℹ 65 more rows
#> 
#> $grid
#> # A tibble: 3 × 4
#>   grid     ndims variables         nvars
#>   <chr>    <int> <list>            <int>
#> 1 D0,D1,D2     3 <tibble [22 × 1]>    22
#> 2 D1,D2        2 <tibble [4 × 1]>      4
#> 3 D0           1 <tibble [1 × 1]>      1
#> 
#> $source
#> # A tibble: 1 × 2
#>   access              source                                                  
#>   <dttm>              <chr>                                                   
#> 1 2024-11-11 22:16:32 https://usgs.osn.mghpcc.org/mdmf/gdp/hawaii_present.zarr
#> 
#> attr(,"class")
#> [1] "ncmeta"

Created on 2024-11-11 with reprex v2.1.1
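
A note on the `type` column in the output above: values like `<i2` and `<f4` look like the NumPy-style dtype strings that Zarr v2 stores in array metadata (byte order, kind, byte size) rather than NetCDF type names. A minimal base-R sketch of decoding them, assuming uncompressed bytes and covering only signed integers and floats; `decode_zarr_dtype` and `read_zarr_bytes` are hypothetical helpers, not rnz or pizzarr functions.

```r
# Sketch: decode a dtype string such as "<i2" or "<f4".
decode_zarr_dtype <- function(dtype) {
  m <- regmatches(dtype, regexec("^([<>])([if])([0-9]+)$", dtype))[[1]]
  if (length(m) == 0) stop("unsupported dtype in this sketch: ", dtype)
  list(
    endian = switch(m[2], "<" = "little", ">" = "big"),
    kind   = switch(m[3], i = "integer", f = "float"),
    size   = as.integer(m[4])  # bytes per value
  )
}

# Sketch: turn raw (already decompressed) bytes into R values.
read_zarr_bytes <- function(bytes, dtype) {
  d <- decode_zarr_dtype(dtype)
  what <- if (d$kind == "float") "numeric" else "integer"
  readBin(bytes, what = what, n = length(bytes) %/% d$size,
          size = d$size, endian = d$endian)
}

# Two little-endian 2-byte signed integers
read_zarr_bytes(as.raw(c(0x01, 0x00, 0xff, 0xff)), "<i2")
```

Real chunks are normally compressed, so a full reader would decompress first and then apply attributes such as scale_factor.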

@mdsumner
Member

Sweet! Btw I've been making virtual Zarr that references netcdfs in object store:

https://gist.github.com/mdsumner/c72ff510bf41c433662ef703a635daf8

(Also I met Brianna Pagán in person last week, which was awesome 👍)

@mdsumner
Member

mdsumner commented Nov 12, 2024

Also I found some bugs in GDAL Zarr and toyed with exposing that better in gdalraster, and played with zarrs, the Rust library, spurred on by Icechunk. Ultimately I think we need xarray in Rust. But I agree with you R is an excellent place to write Zarr support from scratch. (Although I see you are going down the netcdf-lib for Zarr approach. I couldn't get S3 support built in; do you have a good build workflow that does that? Maybe I'm doing something wrong.)

@dblodgett-usgs
Contributor Author

I'm actually not using netcdf to access zarr. I'm using pizzarr as the back end for zarr which is a base R ZARR client without any compiled dependencies. rnz is a very thin wrapper around RNetCDF and pizzarr with S3 methods for NetCDF and ZarrGroup classes. See https://github.com/DOI-USGS/rnz/blob/main/R/get_var.R for one of the core functions in rnz.
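
The general pattern can be sketched with plain S3 dispatch. This is an illustration with stand-in objects, not rnz's actual code: the generic `inq_vars`, the list-based handles, and the variable names are all made up, and only the class names "NetCDF" and "ZarrGroup" come from the description above.

```r
# Sketch: one call site, back end chosen by the class of the handle.
inq_vars <- function(x, ...) UseMethod("inq_vars")

# Stand-in for a handle of class "NetCDF" (what RNetCDF::open.nc() returns)
inq_vars.NetCDF <- function(x, ...) x$vars

# Stand-in for a pizzarr group of class "ZarrGroup"
inq_vars.ZarrGroup <- function(x, ...) names(x$arrays)

inq_vars.default <- function(x, ...) {
  stop("unsupported back end: ", class(x)[1])
}

# Mock handles so the sketch runs without RNetCDF or pizzarr installed
nc_handle   <- structure(list(vars = c("pr", "tas")), class = "NetCDF")
zarr_handle <- structure(list(arrays = list(pr = 1, tas = 2)), class = "ZarrGroup")

inq_vars(nc_handle)
inq_vars(zarr_handle)
```

Calling code such as ncmeta only ever sees the generic, which is what lets one package stand in for the other.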

@mdsumner
Member

I see, I'm sorry, I got confused about the nc-like behaviour 🙏

@dblodgett-usgs
Contributor Author

The trick worked!

My goal with rnz is to allow us to just swap out rnz for RNetCDF and have things "just work".
