Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: geoarrow export #30

Merged
merged 15 commits into from
Oct 29, 2024
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
226 changes: 226 additions & 0 deletions src/s2geography/geoarrow.cc
Original file line number Diff line number Diff line change
Expand Up @@ -742,6 +742,232 @@ void Reader::ReadGeography(const ArrowArray* array, int64_t offset,
impl_->ReadGeography(array, offset, length, out);
}

class WriterImpl {
public:
WriterImpl() {
error_.message[0] = '\0';
writer_.private_data = nullptr;
}

~WriterImpl() {
if (writer_.private_data != nullptr) {
GeoArrowArrayWriterReset(&writer_);
}
}

void Init(const ArrowSchema* schema, const ImportOptions& options) {
options_ = options;

int code = GeoArrowArrayWriterInitFromSchema(&writer_, schema);
ThrowNotOk(code);

InitCommon();
}

void Init(GeoArrowType type, const ImportOptions& options,
struct ArrowSchema* out_schema) {
options_ = options;

int code = GeoArrowArrayWriterInitFromType(&writer_, type);
ThrowNotOk(code);
code = GeoArrowSchemaInitExtension(out_schema, type);
ThrowNotOk(code);

InitCommon();
}

void InitCommon() {
visitor_.error = &error_;
int code = GeoArrowArrayWriterInitVisitor(&writer_, &visitor_);
ThrowNotOk(code);
}

void WriteGeography(const Geography& geog) { VisitFeature(geog); }

void Finish(struct ArrowArray* out) {
int code = GeoArrowArrayWriterFinish(&writer_, out, &error_);
ThrowNotOk(code);
}

private:
ImportOptions options_;
GeoArrowArrayWriter writer_;
GeoArrowVisitor visitor_;
GeoArrowCoordView coords_view_;
GeoArrowError error_;

int VisitPoints(const PointGeography& point) {
coords_view_.n_coords = 1;
coords_view_.n_values = 2;
coords_view_.coords_stride = 2;
double coords[2];
coords_view_.values[0] = &coords[0];
coords_view_.values[1] = &coords[1];

if (point.Points().size() == 0) {
// empty Point
GEOARROW_RETURN_NOT_OK(visitor_.geom_start(
&visitor_, GEOARROW_GEOMETRY_TYPE_POINT, GEOARROW_DIMENSIONS_XY));
GEOARROW_RETURN_NOT_OK(visitor_.geom_end(&visitor_));

} else if (point.Points().size() == 1) {
// Point
GEOARROW_RETURN_NOT_OK(visitor_.geom_start(
&visitor_, GEOARROW_GEOMETRY_TYPE_POINT, GEOARROW_DIMENSIONS_XY));
S2LatLng ll(point.Points()[0]);
coords[0] = ll.lng().degrees();
coords[1] = ll.lat().degrees();
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_));
GEOARROW_RETURN_NOT_OK(visitor_.geom_end(&visitor_));

} else {
// MultiPoint
GEOARROW_RETURN_NOT_OK(
visitor_.geom_start(&visitor_, GEOARROW_GEOMETRY_TYPE_MULTIPOINT,
GEOARROW_DIMENSIONS_XY));

for (const S2Point& pt : point.Points()) {
GEOARROW_RETURN_NOT_OK(
visitor_.geom_start(&visitor_, GEOARROW_GEOMETRY_TYPE_POINT,
GEOARROW_DIMENSIONS_XY));
S2LatLng ll(pt);
coords[0] = ll.lng().degrees();
coords[1] = ll.lat().degrees();
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_));
GEOARROW_RETURN_NOT_OK(visitor_.geom_end(&visitor_));
}

GEOARROW_RETURN_NOT_OK(visitor_.geom_end(&visitor_));
}
return GEOARROW_OK;
}

int VisitPolylines(const PolylineGeography& geog) {
coords_view_.n_coords = 1;
coords_view_.n_values = 2;
coords_view_.coords_stride = 2;
double coords[2];
coords_view_.values[0] = &coords[0];
coords_view_.values[1] = &coords[1];

if (geog.Polylines().size() == 0) {
// empty LineString
GEOARROW_RETURN_NOT_OK(
visitor_.geom_start(&visitor_, GEOARROW_GEOMETRY_TYPE_LINESTRING,
GEOARROW_DIMENSIONS_XY));
GEOARROW_RETURN_NOT_OK(visitor_.geom_end(&visitor_));

} else if (geog.Polylines().size() == 1) {
// LineString
GEOARROW_RETURN_NOT_OK(
visitor_.geom_start(&visitor_, GEOARROW_GEOMETRY_TYPE_LINESTRING,
GEOARROW_DIMENSIONS_XY));

const auto& poly = geog.Polylines()[0];

for (int i = 0; i < poly->num_vertices(); i++) {
S2LatLng ll(poly->vertex(i));
coords[0] = ll.lng().degrees();
coords[1] = ll.lat().degrees();
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_));
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently for geometries with >1 coordinates (here for the example of a linestring), I loop over the s2points and call the coords() callback multiple times.

The alternative is to build up a single GeoArrowCoordView with the n_coords set to something larger than 1, and then only do a single call to coords().

Now, given that we don't start from a contiguous array of coords, I suppose it will (performance wise) not matter that much if we would do the conversion to a single array ourselves and do a single coords() call, or or call coords() multiple times and let the underlying writer do the conversion?
Especially since we don't know for sure if the output actually needs single buffers (like for WKT/WKB it would iterate over the coords pairs anyway then, when passing a single CoordView).
Unless the typicaly coords() visitor implementation might have a bunch of overhead per call? In that case it might be beneficial to limit the number of calls to it (I suppose I should just test the difference .. ;))

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably better to limit the calls to the coords() callback if it's easy to do it (but not necessary for this PR as long as it is correct!). On other end, we also have to make a copy of the coordinates we send via the callback (i.e., to populate the requisite double[]), so you don't want to send all of them at once. One other slight wrench in that is that some visitor implementations can "return early" (e.g., the WKT formatter, which has a character limit after which it returns EAGAIN to force the reader to stop). As a data point, geoarrow-c's WKB reader buffers 3072 ordinates (had to be divisible by 2, 3, and 4) at a time:

https://github.com/geoarrow/geoarrow-c/blob/2890982956a9f715339abb5e24059300bb147107/src/geoarrow/wkb_reader.c#L103-L111

Adding the ability to transmit more than one coordinate on this callback was one of the things I changed between wk's design (early in my career writing any C code whatsoever) and this visitor. I also never measured the difference comprehensively though...in theory this is in such a tight loop that the function pointer call does matter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then I think I will leave it at a coords() call per point for this PR, and we can optimize that later


GEOARROW_RETURN_NOT_OK(visitor_.geom_end(&visitor_));

} else {
// MultiLineString
GEOARROW_RETURN_NOT_OK(
visitor_.geom_start(&visitor_, GEOARROW_GEOMETRY_TYPE_MULTILINESTRING,
GEOARROW_DIMENSIONS_XY));

for (const auto& poly : geog.Polylines()) {
GEOARROW_RETURN_NOT_OK(
visitor_.geom_start(&visitor_, GEOARROW_GEOMETRY_TYPE_LINESTRING,
GEOARROW_DIMENSIONS_XY));

for (int i = 0; i < poly->num_vertices(); i++) {
S2LatLng ll(poly->vertex(i));
coords[0] = ll.lng().degrees();
coords[1] = ll.lat().degrees();
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_));
}

GEOARROW_RETURN_NOT_OK(visitor_.geom_end(&visitor_));
}

GEOARROW_RETURN_NOT_OK(visitor_.geom_end(&visitor_));
}

return GEOARROW_OK;
}

int VisitFeature(const Geography& geog) {
GEOARROW_RETURN_NOT_OK(visitor_.feat_start(&visitor_));

auto child_point = dynamic_cast<const PointGeography*>(&geog);
if (child_point != nullptr) {
GEOARROW_RETURN_NOT_OK(VisitPoints(*child_point));
} else {
auto child_polyline = dynamic_cast<const PolylineGeography*>(&geog);
if (child_polyline != nullptr) {
GEOARROW_RETURN_NOT_OK(VisitPolylines(*child_polyline));
} else {
throw Exception("Unsupported Geography subclass");
// auto child_polygon = dynamic_cast<const PolygonGeography*>(&geog);
// if (child_polygon != nullptr) {
// HANDLE_OR_RETURN(handle_polygon(*child_polygon, handler));
// } else {
// auto child_collection = dynamic_cast<const
// GeographyCollection*>(&geog); if (child_collection != nullptr) {
// HANDLE_OR_RETURN(handle_collection(*child_collection,
// handler));
// } else {
// throw Exception("Unsupported Geography subclass");
// }
// }
}
}
return GEOARROW_OK;
}

void ThrowNotOk(int code) {
if (code != GEOARROW_OK) {
throw Exception(error_.message);
}
}
};

Writer::Writer() : impl_(new WriterImpl()) {}

Writer::~Writer() { impl_.reset(); }

void Writer::Init(const ArrowSchema* schema, const ImportOptions& options) {
impl_->Init(schema, options);
}

void Writer::Init(OutputType output_type, const ImportOptions& options,
struct ArrowSchema* out_schema) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of initializing the writer with just a type, I added an ArrowSchema as output argument, because for exporting the array to another producer, we need both the array and schema.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this above, too, but I think you want a const ArrowSchema* as your entry point into the Writer (you can expose a function to generate the appropriate ArrowSchema* so that those APIs are decoupled).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, if we have a separate function to construct this schema from a type / encoding (or those inferred from an array of gemetries), then it is perfectly fine to only have the Init here starting from a schema

switch (output_type) {
case OutputType::kPoints:
impl_->Init(GEOARROW_TYPE_INTERLEAVED_POINT, options, out_schema);
break;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added kPoints (mapping to interleaved points) for a quick test. I assume we might want to add some generic "native geoarrow" option like kGeoArrow(Interleaved), and then infer the schema from the geographies? (with something similar like GeoArrowGEOSSchemaCalculator you did in geoarrow-c-geos)

It might still be worth allowing the user to specify the type themselves to avoid inference / choose a specific type (e.g. always the multi-version regardless of presence of any multi geom)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geoarrow-c-geos also has a function to help you make a schema:

https://github.com/geoarrow/geoarrow-c-geos/blob/33ad0ba21c76c09e9d72fc4e4ae0b9ff9da61848/src/geoarrow_geos/geoarrow_geos.h#L122-L123

The right entry point I think is still ArrowSchema* (until a point where geoarrow-c exposes an ABI for specifying geometry type constraints, which is I think a ways off). It's definitely good to let the user specify this (often they already know because they're calculating a centroid or something).

The calculator is a definite must! It can probably use geography->num_shape() and geography->dimension():

virtual int dimension() const {
if (num_shapes() == 0) {
return -1;
}
int dim = Shape(0)->dimension();
for (int i = 1; i < num_shapes(); i++) {
if (dim != Shape(i)->dimension()) {
return -1;
}
}
return dim;
}
// The number of S2Shape objects needed to represent this Geography
virtual int num_shapes() const = 0;

case OutputType::kWKT:
impl_->Init(GEOARROW_TYPE_WKT, options, out_schema);
break;
case OutputType::kWKB:
impl_->Init(GEOARROW_TYPE_WKB, options, out_schema);
break;
default:
throw Exception("Output type not supported");
}
}

void Writer::WriteGeography(const Geography& geog) {
impl_->WriteGeography(geog);
}

void Writer::Finish(struct ArrowArray* out) { impl_->Finish(out); }

} // namespace geoarrow

} // namespace s2geography
28 changes: 28 additions & 0 deletions src/s2geography/geoarrow.h
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,34 @@ class Reader {
std::unique_ptr<ReaderImpl> impl_;
};

class WriterImpl;

/// \brief Array writer for any GeoArrow extension array
///
/// This class is used to convert a vector of Geography objects into an ArrowArray
/// with geoarrow data (serialized or native).
class Writer {
public:
enum class OutputType { kPoints, kWKT, kWKB };
Writer();
~Writer();

void Init(const ArrowSchema* schema) { Init(schema, ImportOptions()); }

void Init(const ArrowSchema* schema, const ImportOptions& options);

void Init(OutputType output_type, struct ArrowSchema* out_schema) { Init(output_type, ImportOptions(), out_schema); }

void Init(OutputType output_type, const ImportOptions& options, struct ArrowSchema* out_schema);

void WriteGeography(const Geography& geog);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also need a WriteNull()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also need a WriteNull()?

That's a good point (for spherely actually not yet, because we don't yet support missing values, but we should support it here in general).
Add a separate method for it, or allow the geog to be a nullptr? (and check that inside WriteGeography?)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a separate method!


void Finish(struct ArrowArray* out);

private:
std::unique_ptr<WriterImpl> impl_;
};

} // namespace geoarrow

} // namespace s2geography
9 changes: 5 additions & 4 deletions src/vendored/geoarrow/geoarrow.c
Original file line number Diff line number Diff line change
Expand Up @@ -8607,10 +8607,11 @@ GeoArrowErrorCode GeoArrowWKTWriterInit(struct GeoArrowWKTWriter* writer) {
ArrowBitmapInit(&private->validity);
ArrowBufferInit(&private->offsets);
ArrowBufferInit(&private->values);
writer->precision = 16;
private->precision = 16;
writer->use_flat_multipoint = 1;
private->use_flat_multipoint = 1;
// TODO should be changed upstream to make this configurable
writer->precision = 6;
private->precision = 6;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paleolimbot what would you recommend here?

Pass that as an extra param to GeoArrowArrayWriterInitFromSchema, which can then pass it down here?

Or a setter after the object has been initialized? (I suppose that is more backwards compatible)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Hang tight, I'll fix that.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is merged into s2geography (or should be!)

#36

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

writer->use_flat_multipoint = 0;
private->use_flat_multipoint = 0;
writer->max_element_size_bytes = -1;
private->max_element_size_bytes = -1;
writer->private_data = private;
Expand Down
Loading