Add GeoDataset interface (#43)
* some initial failing tests

* actually working

* use row groups

* error

* subclass for parquet rowgroup dataset

* keep row group ids

* move indexing methods into the GeoDataset class

* factor out dataset constructor

* prepare for row group metadata

* scaffold the row group metadata scan

* theoretical row group stats working

* handle parquet files with no stats

* basic docs

* reorganize

* maybe docs

* maybe fix docstrings

* fix a type-o (har)

* test + fix parquet field counting

* format, remove unneeded change

* document + test filtering on multiple geometry columns

* more explicit name + type tests

* fix type test

* theoretically pass index through filter

* test wrapping

* fix filtering to empty

* fix docs
paleolimbot authored Aug 22, 2023
1 parent 800df92 commit 91d2ab9
Showing 4 changed files with 719 additions and 0 deletions.
11 changes: 11 additions & 0 deletions docs/source/python/pyarrow.rst
@@ -9,6 +9,11 @@ Integration with pyarrow

.. autofunction:: array

Dataset constructors
--------------------

.. autofunction:: dataset

Type Constructors
-----------------

@@ -94,3 +99,9 @@ Integration with pyarrow

.. autoclass:: MultiPolygonType
    :members:

.. autoclass:: geoarrow.pyarrow._dataset.GeoDataset
    :members:

.. autoclass:: geoarrow.pyarrow._dataset.ParquetRowGroupGeoDataset
    :members:
31 changes: 31 additions & 0 deletions python/geoarrow/pyarrow/__init__.py
@@ -60,4 +60,35 @@
    point_coords,
)


# Use a lazy import here to avoid requiring pyarrow.dataset
def dataset(*args, geometry_columns=None, use_row_groups=None, **kwargs):
    """Construct a GeoDataset

    This constructor is intended to mirror `pyarrow.dataset()`, adding
    geo-specific arguments. See :class:`geoarrow.pyarrow._dataset.GeoDataset` for
    details.

    >>> import geoarrow.pyarrow as ga
    >>> import pyarrow as pa
    >>> table = pa.table([ga.array(["POINT (0.5 1.5)"])], ["geometry"])
    >>> dataset = ga.dataset(table)
    """
    from pyarrow import dataset as _ds
    from ._dataset import GeoDataset, ParquetRowGroupGeoDataset

    parent = _ds.dataset(*args, **kwargs)

    if use_row_groups is None:
        use_row_groups = isinstance(parent, _ds.FileSystemDataset) and isinstance(
            parent.format, _ds.ParquetFileFormat
        )
    if use_row_groups:
        return ParquetRowGroupGeoDataset.create(
            parent, geometry_columns=geometry_columns
        )
    else:
        return GeoDataset(parent, geometry_columns=geometry_columns)


register_extension_types()
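
Reviewer note: the `use_row_groups=None` default in `dataset()` above is a tri-state flag. `None` means "decide from the parent dataset", while an explicit `True` or `False` is passed through unchanged. The sketch below isolates that decision so it can be read without pyarrow installed. The classes are hypothetical stand-ins that only mimic the shape of the pyarrow objects being inspected, and `resolve_use_row_groups` is an illustrative helper name, not part of this commit.

```python
# Stand-in classes (hypothetical): they mimic only the attributes that the
# constructor inspects and are NOT the real pyarrow.dataset classes.
class ParquetFileFormat:
    pass


class CsvFileFormat:
    pass


class InMemoryDataset:
    pass


class FileSystemDataset:
    def __init__(self, file_format):
        self.format = file_format


def resolve_use_row_groups(parent, use_row_groups=None):
    """Return the effective use_row_groups flag for a parent dataset."""
    if use_row_groups is None:
        # Default: only file-backed Parquet datasets get row-group handling.
        use_row_groups = isinstance(parent, FileSystemDataset) and isinstance(
            parent.format, ParquetFileFormat
        )
    return use_row_groups


parquet_files = FileSystemDataset(ParquetFileFormat())
csv_files = FileSystemDataset(CsvFileFormat())
in_memory = InMemoryDataset()

print(resolve_use_row_groups(parquet_files))         # True
print(resolve_use_row_groups(csv_files))             # False
print(resolve_use_row_groups(in_memory))             # False
print(resolve_use_row_groups(parquet_files, False))  # False (explicit override)
```

Because the second `isinstance` check sits behind `and`, `parent.format` is only read when the parent is file-backed, so in-memory parents that lack a `format` attribute do not raise.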