Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: geoarrow export #30

Merged
merged 15 commits into from
Oct 29, 2024

Conversation

jorisvandenbossche
Copy link
Collaborator

@jorisvandenbossche jorisvandenbossche commented Oct 9, 2024

Similar to #28, just opening to be able to add some comments / discuss the design.

For now it just can process points, without any options (but the points do work, as being tested with benbovy/spherely#53

@jorisvandenbossche jorisvandenbossche marked this pull request as draft October 9, 2024 09:04
Comment on lines 909 to 911
case OutputType::kPoints:
impl_->Init(GEOARROW_TYPE_INTERLEAVED_POINT, options, out_schema);
break;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added kPoints (mapping to interleaved points) for a quick test. I assume we might want to add some generic "native geoarrow" option like kGeoArrow(Interleaved), and then infer the schema from the geographies? (with something similar like GeoArrowGEOSSchemaCalculator you did in geoarrow-c-geos)

It might still be worth allowing the user to specify the type themselves to avoid inference / choose a specific type (e.g. always the multi-version regardless of presence of any multi geom)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geoarrow-c-geos also has a function to help you make a schema:

https://github.com/geoarrow/geoarrow-c-geos/blob/33ad0ba21c76c09e9d72fc4e4ae0b9ff9da61848/src/geoarrow_geos/geoarrow_geos.h#L122-L123

The right entry point I think is still ArrowSchema* (until a point where geoarrow-c exposes an ABI for specifying geometry type constraints, which is I think a ways off). It's definitely good to let the user specify this (often they already know because they're calculating a centroid or something).

The calculator is a definite must! It can probably use geography->num_shape() and geography->dimension():

virtual int dimension() const {
if (num_shapes() == 0) {
return -1;
}
int dim = Shape(0)->dimension();
for (int i = 1; i < num_shapes(); i++) {
if (dim != Shape(i)->dimension()) {
return -1;
}
}
return dim;
}
// The number of S2Shape objects needed to represent this Geography
virtual int num_shapes() const = 0;

Comment on lines 888 to 889
void Writer::Init(OutputType output_type, const ImportOptions& options,
struct ArrowSchema* out_schema) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of initializing the writer with just a type, I added an ArrowSchema as output argument, because for exporting the array to another producer, we need both the array and schema.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this above, too, but I think you want a const ArrowSchema* as your entry point into the Writer (you can expose a function to generate the appropriate ArrowSchema* so that those APIs are decoupled).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, if we have a separate function to construct this schema from a type / encoding (or those inferred from an array of gemetries), then it is perfectly fine to only have the Init here starting from a schema

@jorisvandenbossche
Copy link
Collaborator Author

Pushed a small update:

  • Changed the interface to have a scalar WriteGeography(const Geography& geog) and a separate Finish(struct ArrowArray* out), as discussed
  • Did not yet update the Init() to just take a schema (and provide some helpers to create / infer the schema)
  • Added support for Polylines, still have to add Polygon and Collection

Comment on lines 868 to 873
for (int i = 0; i < poly->num_vertices(); i++) {
S2LatLng ll(poly->vertex(i));
coords[0] = ll.lng().degrees();
coords[1] = ll.lat().degrees();
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_));
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently for geometries with >1 coordinates (here for the example of a linestring), I loop over the s2points and call the coords() callback multiple times.

The alternative is to build up a single GeoArrowCoordView with the n_coords set to something larger than 1, and then only do a single call to coords().

Now, given that we don't start from a contiguous array of coords, I suppose it will (performance wise) not matter that much if we would do the conversion to a single array ourselves and do a single coords() call, or or call coords() multiple times and let the underlying writer do the conversion?
Especially since we don't know for sure if the output actually needs single buffers (like for WKT/WKB it would iterate over the coords pairs anyway then, when passing a single CoordView).
Unless the typicaly coords() visitor implementation might have a bunch of overhead per call? In that case it might be beneficial to limit the number of calls to it (I suppose I should just test the difference .. ;))

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably better to limit the calls to the coords() callback if it's easy to do it (but not necessary for this PR as long as it is correct!). On other end, we also have to make a copy of the coordinates we send via the callback (i.e., to populate the requisite double[]), so you don't want to send all of them at once. One other slight wrench in that is that some visitor implementations can "return early" (e.g., the WKT formatter, which has a character limit after which it returns EAGAIN to force the reader to stop). As a data point, geoarrow-c's WKB reader buffers 3072 ordinates (had to be divisible by 2, 3, and 4) at a time:

https://github.com/geoarrow/geoarrow-c/blob/2890982956a9f715339abb5e24059300bb147107/src/geoarrow/wkb_reader.c#L103-L111

Adding the ability to transmit more than one coordinate on this callback was one of the things I changed between wk's design (early in my career writing any C code whatsoever) and this visitor. I also never measured the difference comprehensively though...in theory this is in such a tight loop that the function pointer call does matter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then I think I will leave it at a coords() call per point for this PR, and we can optimize that later

Copy link
Owner

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that some basic tests are going to be needed at some point. At least for this PR, I think you could replace the WKTWriter and add a few cases to the "wkt roundtrip" tests. The nice part about the visitor pattern is that you can basically just check WKT output and in theory everything else "just works" (or is a problem with code that isn't yours 🙂 ).

@jorisvandenbossche
Copy link
Collaborator Author

Just a note that some basic tests are going to be needed at some point.

Yes I know ;) For now I have been testing locally directly with wiring it up in spherely and so tests in python, and this confirms the basics are working. But definitely will have to add tests in s2geography as well!

Comment on lines 8610 to 8612
writer->precision = 16;
private->precision = 16;
writer->use_flat_multipoint = 1;
private->use_flat_multipoint = 1;
// TODO should be changed upstream to make this configurable
writer->precision = 6;
private->precision = 6;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paleolimbot what would you recommend here?

Pass that as an extra param to GeoArrowArrayWriterInitFromSchema, which can then pass it down here?

Or a setter after the object has been initialized? (I suppose that is more backwards compatible)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Hang tight, I'll fix that.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is merged into s2geography (or should be!)

#36

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jorisvandenbossche
Copy link
Collaborator Author

jorisvandenbossche commented Oct 25, 2024

OK, I deleted most of the WKT writer implementation (which I essentially had moved to the geoarrow writer already, updating the "handler" pattern for "visitor" code), and so that this now reuses the geoarrow based writer, like is done for the WKTReader as well.

And expanded the WKT roundtrip tests which thus indirectly test the GeoArrow::Writer

Comment on lines +869 to +872
S2LatLng ll(loop->vertex(loop->num_vertices() - 1));
coords_[0] = ll.lng().degrees();
coords_[1] = ll.lat().degrees();
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paleolimbot In the case of a hole, the code here (which I essentially took from the wkt writer) explicitly repeats the first coordinate pair (i.e. in the last vertex because we are iterating reversed, so we started with the last).

But for VisitLoopShell above, the code does not do that (it just iterates over the vertices, and s2 doesn't duplicate the first/last).

In practice, though (at least in case of WKT), it seems both work fine, and so I assume that the visitor implemented by GeoArrowArrayWriter (or for WKT the GeoArrowWKTWriter used under the hood) checks if the first coordinate was repeated, and if not repeats it themselves? But is that meant to be a guarantee? Or should ideally the one calling the visitor ensure they are providing closed ring coordinates?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, nevermind, I just noticed that in the case for the shell, we are looping until i <= loop->num_vertices() and not until num_vertices() - 1. So the loop.vertex(i) call wraps around, and so we are actually explicitly adding the first coordinate as well
(that trick just not works when iterating in reverse)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember fiddling a lot with that, but I think it exploits the special behaviour of vertex():

https://github.com/google/s2geometry/blob/ca1f3416f1b9907b6806f1be228842c9518d96df/src/s2/s2loop.h#L185-L194

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that's the explanation I also had found

Comment on lines +869 to +872
S2LatLng ll(loop->vertex(loop->num_vertices() - 1));
coords_[0] = ll.lng().degrees();
coords_[1] = ll.lat().degrees();
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember fiddling a lot with that, but I think it exploits the special behaviour of vertex():

https://github.com/google/s2geometry/blob/ca1f3416f1b9907b6806f1be228842c9518d96df/src/s2/s2loop.h#L185-L194

GEOARROW_DIMENSIONS_XY));

for (const auto& child_geog : geog.Features()) {
auto child_point = dynamic_cast<const PointGeography*>(child_geog.get());
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These dynamic casts are necessary given the current design of the Geography but hopefully I'll find a way to clean that up (maybe making an internal but virtual Visit(void* visitor) method on the Geography).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For spherely, we recently added the type as essentially an attribute on the spherely::Geography C++ class (i.e. which otherwise wraps the s2geography::Geography class): https://github.com/benbovy/spherely/blob/4e39dee1bf46c9591c68bd2b64541ff7c2f8b14f/src/geography.cpp#L116-L162, so whenever a Geography object gets constructed we also store the type (https://github.com/benbovy/spherely/blob/4e39dee1bf46c9591c68bd2b64541ff7c2f8b14f/src/geography.hpp#L55-L59). Not sure if something similar could/should be moved upstream at some point.

(the clone method a bit higher up in the file is certainly a candidate of moving into s2geography)


void Init(OutputType output_type, const ExportOptions& options);

void WriteGeography(const Geography& geog);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also need a WriteNull()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also need a WriteNull()?

That's a good point (for spherely actually not yet, because we don't yet support missing values, but we should support it here in general).
Add a separate method for it, or allow the geog to be a nullptr? (and check that inside WriteGeography?)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a separate method!

Comment on lines 8610 to 8612
writer->precision = 16;
private->precision = 16;
writer->use_flat_multipoint = 1;
private->use_flat_multipoint = 1;
// TODO should be changed upstream to make this configurable
writer->precision = 6;
private->precision = 6;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review October 29, 2024 17:28
Comment on lines 14 to +18
WKTWriter writer_default;
EXPECT_EQ(writer_default.write_feature(*geog), "POINT (0 3.333333333333334)");
EXPECT_EQ(writer_default.write_feature(*geog), "POINT (0 3.3333333333333344)");

WKTWriter writer_6digits(6);
EXPECT_EQ(writer_6digits.write_feature(*geog), "POINT (0 3.33333)");

EXPECT_EQ(writer_6digits.write_feature(*geog), "POINT (0 3.333333)");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the main difference I see compared to the previous WKTWriter implementation (maybe because of using ryu?), so that the significant digits are interpreted differently.

(actually, the significant_digits name is then not really correct any more, and should we use precision here as well like is used in geoarrow-c?)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, precision is a better name. I think the previous version was using the sprintf() (or maybe the ostream)-based implementation, which uses slightly different words for all of that.

Copy link
Owner

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this!

There are a few follow-ups (null writing, maybe more efficient coordinate things, tessellation, more comprehensive testing), but I think those are more effectively handled one at a time based on this work.

@jorisvandenbossche
Copy link
Collaborator Author

There are a few follow-ups (null writing, maybe more efficient coordinate things, tessellation, more comprehensive testing), but I think those are more effectively handled one at a time based on this work.

Yep, indeed, and for the tessellation I already have a WIP branch

Will just rename the keyword to precision

@jorisvandenbossche
Copy link
Collaborator Author

OK, all green!

@paleolimbot paleolimbot merged commit 79dce6e into paleolimbot:main Oct 29, 2024
4 checks passed
@paleolimbot
Copy link
Owner

(I think you have write access too? Feel free to merge things!)

@jorisvandenbossche jorisvandenbossche deleted the geoarrow-writer branch October 29, 2024 18:34
@jorisvandenbossche
Copy link
Collaborator Author

I think you have write access too?

I think I forgot to accept your invite and it expired .. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants