-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index: add intersection/union support #210
Index: add intersection/union support #210
Conversation
@hobu I would love to get this in before the next release but it's unclear to me how to compute the intersection of two Items. I could use some giant if-statement so that the logic is completely different for |
I'm afraid I don't have any good suggestions. |
192ffd8
to
08c8ba8
Compare
Added a basic implementation that I think should work. Note that this code should really be in libspatialindex. In the worse case scenario, 2 indices with N elements each and M dimensions will be O(MN^2 + N^2 + N) to compute the intersection of. This could be very slow in pure-Python. |
It would be fine to add to libspatialindex too. Make sure to add C API bits to touch it if you do add a mergeTree method. |
Unfortunately I barely know C... |
If you do the C++, I can do the C bits. |
Unfortunately I don't know any C++ 😆 |
I'm confused. Do you really want mergeTrees or do you want parallel query in two trees? |
So my particular use case is for TorchGeo, but I think it should be useful for other projects as well. TorchGeo is a PyTorch library for geospatial data, especially satellite imagery. We have a Many applications involve combining two or more datasets. For example, you might want to combine all Landsat 7 and Landsat 8 imagery into one giant dataset. In TorchGeo, we call this a In another application, you might want to sample from both datasets simultaneously, meaning that you can only sample from regions where data exists in both datasets. For example, you might want to combine images from Landsat satellites and labels from the Cropland Data Layer. This requires you to know the intersection of these two datasets. In TorchGeo, we call this an See https://github.com/microsoft/torchgeo#geospatial-datasets for a more complex example that involves an intersection of a union: So basically, we need two things: the ability to add all elements of |
For both libspatialindex and Rtree, I have often made the point that they are not databases, and if you are using them as databases, you are not using them as intended (and you get to suffer all of the impedance associated with that disconnect). Your use case sounds like a few simple tables and PostGIS, to be honest. What's the value you're getting from Rtree/libspatialindex? The parallel query requirement can probably be implemented if you were to use threads in Python. The query code should be thread safe (and we should fix any deadlocks if not). The mergeTrees requirement would need a C++ adaptation of the BulkSort code that exists in libspatialindex already. That's going to be more involved, but it isn't a gigantic lift. The first one you can probably do with some frustrated tinkering. If you aren't comfortable doing c++ for the second, if there are resources available to hire my team we could quote it and take it on. |
Yes, I've noticed this when reading through the docs. I guess I'm a bit confused, I think of R-trees as yet another data structure like lists or dicts that can be used to store information and efficiently retrieve that information. When I think of databases, I think of fancier data structures that add properties like persistence of data and multi-user parallel access. TorchGeo does not use or need any of these additional features. We index a directory of files each time you instantiate a new dataset, and once the ML training is done we delete this index. We don't need persistence from one process to the next. Also, each task is single-user only. We do need to support parallel lookups, but we don't need to support parallel insertion or insertion at the same time as lookup. I think PostGIS/PostGRES would add a lot of difficult non-Python dependencies that might not be possible to install on something like Google Colab. |
But this is the requirement you want of file-based Rtree indexes, right? Persistence and parallel access. I wonder if something like https://github.com/flatgeobuf/flatgeobuf could give you more flexibility here without as much misery. It's internally indexed and stored according to that organization. |
I don't think we need persistence, we can just create a new index each time you run the code. All data is stored in the files themselves, we don't manage that in TorchGeo. We currently only support files locally on disk but we're planning on adding support for STAC API calls and other remote data access. All we really need is some sort of index mapping geospatial queries to specific files. As far as parallel access goes, 99.9% of the data loading time is spent actually reading the data from disk and doing any reprojection/resampling that is necessary, not querying rtree. Honestly, R-trees might be overkill for our applications, we could just store everything in a giant list and loop through that list to find files that overlap. |
ok, I'm not trying to push you off (thanks so much for all the contributions!). I just don't want you to build something out have it kind of stink though. |
Yeah definitely, I appreciate all the comments. I don't have any experience with PostGIS so I certainly didn't think through the solution in all possible ways. But I think what we currently have is sufficient for now until someone comes along and complains about performance or persistence. |
All that said, flatgeobuf is slick and if I had it at the time I did all the rtree/libspatialindex work, I would have just used that 😄 |
@hobu so assuming TorchGeo continues to use rtree for this purpose for the time being, and getting back to this PR, is this something that would make sense to contribute to rtree itself? If so, I'll try to add tests for this. |
oh, sure. go for it. It might not be the most performant, but it will be convenient. |
Okay, added tests to this and discovered several bugs in my original implementation (and possibly in rtree itself). See #204 (comment) |
113b65f
to
bcbb3b2
Compare
Bounds to see if CI will get happy. Please merge this one when you're ready. |
Still a work in progress. I was planning on taking the intersection of two Item objects but the Item constructor doesn't seem to contain bounds so it's unclear how to do this. Also, things might be different depending on interleave=True/False. Will definitely need extensive testing.
Closes #209