Could MAAP user workflows benefit from a geoparquet version of the entire ATL08 archive? #971
Replies: 2 comments 2 replies
For context, we used
Here is our custom filter: Because of our need for custom
Thanks @sharkinsspatial for writing this up! I just want to clarify that there are two different things to consider here:

**1. The on-disk file format**

This is what the data providers usually care about. For 3D point clouds, some options are:
My perception is that EPT is a little more 'append-friendly', since you could write multiple small LAZ files with sidecar JSON file indexes, whereas COPC and GeoParquet are less append-friendly, since you would typically need to rewrite the whole file and/or rebuild the index. The EPT vs COPC debate has parallels to kerchunk vs Zarr in the raster world.

**2. The in-memory array format**

This is what the end users/scientists will care about and be interacting with in their analysis workflows. Options are:
**Notes**

**On 3D spatial subsetting**

I would say that the 3D spatial filtering capability of GeoParquet is at a very early stage compared to what is supported in more traditional LiDAR formats like LAS/LAZ (or their cloud-optimized equivalents). As far as I'm aware, GeoArrow only supports predicate pushdown on 2D bounding boxes (geoarrow/geoarrow#20), though @kylebarron could confirm whether 3D predicate pushdown filters, i.e. filtering on a 3D cube (minx, miny, minz, maxx, maxy, maxz), work out of the box already in GeoArrow/GeoParquet.

**Attribute-based filters**

Filtering on beam strength or signal quality might be important, and I'd say this is easier done with tabular in-memory formats like DataFrames. But I'd check with ATL08 users on what their usage patterns are on this.
Hey everyone 👋 I'm Sean Harkins from Development Seed. I was with @betolink several months ago at a conference in Mexico, and we had some discussions over dinner about cloud optimization options for beam-based return data like ATL08. I have some background in generating large cloud-optimized data stores for gridded data from my work on pangeo-forge, but I'm totally unfamiliar with how users and scientists interact with beam-based return datasets.

After our conversation, I was curious whether it would be possible to efficiently convert the `land_segments` groups for the entire ATL08 archive into a `geoparquet` dataset in a streaming/chunkwise, memory-efficient way. With a `geoparquet` dataset for the entire archive, end users could interact directly with subsets of the data using a combination of `pyarrow` and `dataframe` selection logic, so that they only need to load data of interest.

I threw together a rough example of how this might work: https://github.com/developmentseed/icesat-parquet/blob/main/atl08_earthaccess.ipynb. But given my limited experience with the ATL08 data and its applications, I'm not sure how users currently interact with these `land_segments` beam group data, or whether having this optimized version of the data would be worthwhile for folks.

**A few questions**

- … `orbit_info` data). It would be great to hear if there are additional values from outside of the `land_segments` group variables that should also be included for analysis purposes.
- Given a `geoparquet` representation of the entire ATL08 archive, how would users interact with it, and would it be helpful for common workflows that users have today?

@wildintellect mentioned that some MAAP users are using Sliderule's `geoparquet` export options (https://github.com/ICESAT-2HackWeek/h5cloud/blob/main/notebooks/sliderule2geoparquet.ipynb) when working with ATL08, but I know there are some restrictions around the volume of data that can be downloaded/subsetted via the service. It would be great to hear from folks here whether it might be useful to pre-process the entire archive into an optimized `geoparquet` structure rather than generating data subsets on-demand.

Cheers.