Could MAAP user workflows benefit from a geoparquet version of the entire ATL08 archive? #971
Replies: 2 comments 2 replies
For context, we used
Here is our custom filter: Because of our need for custom
Thanks @sharkinsspatial for writing this up! I just want to clarify that there are two different things to consider here:

**1. The on-disk file format**

This is what the data providers usually care about. For 3D point clouds, some options are:
My perception is that EPT is a little more 'append-friendly', since you could write multiple small LAZ files with sidecar JSON file indexes, whereas COPC and GeoParquet are less append-friendly, since you would typically need to rewrite the whole file and/or rebuild the index. The EPT vs COPC debate has parallels to kerchunk vs Zarr in the raster world.

**2. The in-memory array format**

This is what the end users/scientists will care about and be interacting with in their analysis workflows. Options are:
**Notes**

**On 3D spatial subsetting**

I would say that the 3D spatial filtering capability of GeoParquet is at a very early stage compared to what is supported in more traditional LiDAR formats like LAS/LAZ (or their cloud-optimized equivalents). As far as I'm aware, GeoArrow only supports predicate pushdown on 2D bounding boxes (geoarrow/geoarrow#20), though @kylebarron could confirm whether 3D predicate pushdown filters, i.e. filtering on a 3D cube (minx, miny, minz, maxx, maxy, maxz), work out of the box already in GeoArrow/GeoParquet.

**Attribute-based filters**

Filtering on beam strength or signal quality might be important, and I'd say this is easier done with tabular in-memory formats like DataFrames. But I'd check with ATL08 users on what their usage patterns are on this.
Hey everyone 👋 I'm Sean Harkins from Development Seed. I was with @betolink several months ago at a conference in Mexico, and we had some discussions over dinner about cloud optimization options for beam-based return data like ATL08. I have some background in generating large cloud-optimized data stores for gridded data from my work on pangeo-forge, but I'm totally unfamiliar with how users and scientists interact with beam-based return datasets.

After our conversation, I was curious whether it would be possible to efficiently convert the `land_segments` groups for the entire ATL08 archive into a `geoparquet` dataset in a streaming/chunkwise, memory-efficient way. With a `geoparquet` dataset for the entire archive, end users could interact directly with subsets of the data using a combination of `pyarrow` and `dataframe` selection logic, so that they only need to load data of interest.

I threw together a rough example of how this might work: https://github.com/developmentseed/icesat-parquet/blob/main/atl08_earthaccess.ipynb. But given my limited experience with the ATL08 data and its applications, I'm not sure how users currently interact with these `land_segments` beam group data, or whether having this optimized version of the data would be worthwhile for folks.

**A few questions**

- … `orbit_info` data). It would be great to hear if there are additional values from outside of the `land_segments` group variables that should also be included for analysis purposes.
- Given a `geoparquet` representation of the entire ATL08 archive, how would users interact with it, and would it be helpful for common workflows that users have today?

@wildintellect mentioned that some MAAP users are using Sliderule's `geoparquet` export options (https://github.com/ICESAT-2HackWeek/h5cloud/blob/main/notebooks/sliderule2geoparquet.ipynb) when working with ATL08, but I know there are some restrictions around the volume of data that can be downloaded/subsetted via the service. It would be great to hear from folks here whether it might be useful to pre-process the entire archive into an optimized `geoparquet` structure rather than generating data subsets on-demand.

Cheers.