Replace GTFS Loading Code #1508
Maybe #177 could be supported in this replacement:
|
Working on this currently. |
We have decided to move away from an OBA-like reflection- and annotation-heavy implementation to a much more imperative one that allows straightforward tracing of where and why strings are parsed into the internal model objects. |
Loading GTFS feeds will be done in three stages:
Right now I've completed work on the first stage, although there may of course still be bugs lingering. I can read in an OVapi feed and write it back out in approximately 50 seconds on my laptop, of which approximately 25 seconds is spent just unzipping it.

The resulting TXT files aren't 100% identical to those inside the original GTFS feed, but the difference is limited to trailing whitespace (which CSV parsers are supposed to ignore unless it's quoted). I could make changes that preserve the whitespace, but then it would be quoted when the files are written back out. Personally I don't see how trailing whitespace could ever be significant in a GTFS feed, so I'm inclined to leave things as they are.

Going from a feed to a table is done through a Map that is keyed on a String representing the filename, so you can look up stop_times.txt (or anything else) in a very intuitive way. You can then iterate over the table to fetch the individual rows; allowing random access to rows would be too memory-intensive here. You can, however, go from a row to a cell through another Map, which is keyed on a String representing the field name. The Cell class actually represents a field name and cell contents pair, so the actual cell is represented as a String.

You can find my work in the obaa branch; the main() method that can be used to demo it is in the RoundTrip class. Personally I rather like the design, which is reasonably generic without becoming slow or unreadable. However, the use of inner classes might be described as "excessive" by some. ;) |
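A minimal sketch of the design described above; the names here are illustrative, not the actual classes from the obaa branch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; these are not the actual classes from the obaa branch.
public class FeedSketch {

    /** A field name paired with its contents; the cell itself is just a String. */
    public static class Cell {
        public final String fieldName;
        public final String contents;
        public Cell(String fieldName, String contents) {
            this.fieldName = fieldName;
            this.contents = contents;
        }
    }

    /** A table is iterated row by row; rows map field names to cells. */
    public static class Table implements Iterable<Map<String, Cell>> {
        // A list is used here only to keep the sketch short; in the design
        // described above, rows are streamed rather than held in memory,
        // which is why there is no random access to rows.
        private final List<Map<String, Cell>> rows = new ArrayList<>();

        @Override
        public Iterator<Map<String, Cell>> iterator() {
            return rows.iterator();
        }
    }

    public static void main(String[] args) {
        // The feed is a Map keyed on filename, so lookups read naturally.
        Map<String, Table> feed = new HashMap<>();
        feed.put("stop_times.txt", new Table());
        for (Map<String, Cell> row : feed.get("stop_times.txt")) {
            Cell cell = row.get("arrival_time");
            // ... parse cell.contents into the internal model ...
        }
    }
}
```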
With stage 2 partially implemented, the round-trip time is now at approximately 70 seconds. |
Hi @JordenVerwer, I've reviewed your branch. There is already Map<String, String> style access to the fields of each row in our CSV loader library. Did you make an intentional decision to replicate this functionality (i.e. was there something measurably bad about it)?

I was actually expecting to just use the CSV library for level 1 and move directly to level 2, only replacing the reflection-based filling-in of level-2 objects with straightforward imperative functions. I have adapted a few of the level-2 model classes to this approach to demonstrate what I mean. The results are in the obaa_abyrd branch, which I just pushed to GitHub. As before, runnable code demonstrating its use is in GTFSMain / GTFSFeed. What is your take on this approach? |
No, it was a matter of reusing old code, partially rewriting it, and then not noticing the overloaded get() method. I rewrote the code to see if it would work correctly using that method, and it seems like it would. Performance is in the same ballpark, the round-trip results don't change, and the only real disadvantage is that it becomes a bit harder to implement the Map interface correctly (but that was never a hard requirement anyway). So yes, I could and probably should have done it that way, although I think leaving it as it is now is just as well.
Of course the CSV library performs a major part of the work on level 1, but it doesn't do everything. I wish it had proper documentation so I wouldn't have to guess about the details of what it does half the time. Some methods don't even have JavaDoc. I've been working on adding validation code today and I noticed that the CSV reader does not check that a row has the right column count, for instance. I'm not sure what happens if you use the get() method on a column that doesn't exist for that particular row, so that might be something to look into if you want me to switch the code over.
I've been avoiding implementation inheritance lately and I've never been a fan of the factory pattern, but I suppose it could work. I don't really see any advantages to it, so I may simply be misunderstanding your intentions. Could you explain the rationale behind using factories instead of constructors? |
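As a hypothetical illustration of the missing check mentioned above, something like the following could be layered on top of the library by a caller; the class and method names here are invented, not part of the CSV library:

```java
import java.util.List;

// Hypothetical helper for the gap noted above: the CSV library does not
// itself verify that each row has the same number of fields as the
// header, so a caller could add that check while iterating rows.
public class RowWidthCheck {

    /** Returns an error message, or null if the row width matches the header. */
    public static String check(List<String> header, List<String> row, int rowNumber) {
        if (row.size() != header.size()) {
            return "Row " + rowNumber + " has " + row.size()
                    + " fields, expected " + header.size() + ".";
        }
        return null;
    }
}
```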
Factory methods serve exactly the same purpose as constructors, except that I can specify them as overriding a method on an interface or abstract class. I started out using constructors (since this is based on your code) but shifted to the factory methods because I needed the reused chunk of code in loadTable to call different functions depending on context, to transform a CSV row into instances of each of the various "Entity" types. Of course what we're really trying to do is pass a function object into the reusable table loader (domain is CSV row, range is an instance of a specific Entity type). I see you've favored a functional style using Iterables.transform and Function, which can also accomplish this.

However, I also wanted the code that moves CSV field values into Java object instances to rely on a shared set of utility methods that check whether the parsing happened correctly as each row is converted. Of course those methods could be static methods in a library module, but they need access to the CSV reader field and a place to store parse error messages, etc. I suppose all that could be passed into the constructor, and then passed down from the constructor into the field-fetch helper methods, but that seems unnecessarily cluttered. The factory classes and methods just bundle together all the information the reusable methods need while they are loading one table. I agree that it's often best to avoid inheritance, but I find a single level of inheritance acceptable when it facilitates code reuse.

Regarding the CSV library, I think the lack of documentation is because this library was originally written for another language and translated to Java. I found the Javadoc sufficient to make use of the library, but yes, it's sparse. I believe it returns null if the requested column is not present. If this library turns out to be inefficient or lacking in some other way, I'm sure there are hundreds of other ones that do the same thing. |
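A rough sketch of the factory approach as described here, with invented names rather than the actual obaa_abyrd code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the factory-method approach described above;
// names are hypothetical, not the actual obaa_abyrd code.
public abstract class EntityFactorySketch<E> {

    // Shared state that the field-fetch helpers need: the current row and
    // a place to store parse error messages, bundled with the factory.
    protected Map<String, String> currentRow;
    protected final List<String> errors = new ArrayList<>();

    /** Each concrete factory turns the current CSV row into one entity. */
    public abstract E fromCsv();

    /** Shared helper: fetch a field, recording an error if it won't parse. */
    protected int getIntField(String name, int defaultValue) {
        String value = currentRow.get(name);
        if (value == null) return defaultValue;
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            errors.add("Bad integer in field " + name + ": " + value);
            return defaultValue;
        }
    }

    /** The reused chunk of code: one loader, parameterized by the factory. */
    public List<E> loadTable(Iterable<Map<String, String>> rows) {
        List<E> entities = new ArrayList<>();
        for (Map<String, String> row : rows) {
            currentRow = row;
            entities.add(fromCsv());
        }
        return entities;
    }
}

// One concrete factory per entity type, e.g. a simplified Stop.
class StopFactorySketch extends EntityFactorySketch<StopFactorySketch.Stop> {
    static class Stop {
        String id;
        int locationType;
    }

    @Override
    public Stop fromCsv() {
        Stop stop = new Stop();
        stop.id = currentRow.get("stop_id");
        stop.locationType = getIntField("location_type", 0);
        return stop;
    }
}
```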
Yes, there are multiple valid approaches, and in the end it comes down to personal preference. I don't mind functions or methods with many arguments, but I can see why many people are against them (I, on the other hand, try to avoid global variables pretty much at all costs, which is something other people don't consider all that important). I'll see if I can reconcile our different approaches somehow, because it's a shame to let code go to waste.

Right now I'm working on adding proper types and validation code (much of it as static classes on FeedValidator) to the GTFS classes, which should bring us much closer to finishing stage 2. After that the main thing that remains is the MapDB integration, and then stage 2 should be done (apart from bug fixing).

The main advantage your code has over mine right now is that it's better at continuing the import in the face of errors. This is an important requirement and I'm still pondering what the best solution is. Your approach certainly has merit here. I've looked at other CSV libraries, but I haven't found a single one that I really like. Despite its flaws, this one at least gives us good performance, and I haven't run into any obvious bugs yet. |
It was pretty straightforward to finish off what I'd started, so I just pushed the rest for completeness. There's no real concern about "wasting code"; it's not wasted as long as it's helped illustrate or contribute to the design. Let's just use what we decide is best, or a combination of the best features.

MapDB integration should be essentially done, as long as we're only putting StopTimes into MapDB. The only thing that took any real effort was defining the iteration that pulls out the trips one at a time. The approach where you apply a transformation to an iterator using a function object is particularly well suited to streaming GTFS in one line at a time, and therefore particularly suited to your test of sending a feed through a round trip with essentially no memory consumption. However, in practice we will always want to give the OTP GTFS loader (and validator) random access to the GTFS contents, so I'd prefer to have the output of loading a feed look much like the current GTFSFeed class. |
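A minimal sketch of that functional style using Guava's Iterables.transform; the row representation and the Trip fields here are simplified placeholders, not the real model classes:

```java
import com.google.common.base.Function;
import com.google.common.collect.Iterables;

import java.util.Map;

// Sketch of the functional style mentioned above: a Function object maps
// each CSV row (simplified here to a field-name -> value map) to an entity,
// and Iterables.transform applies it lazily, one row at a time.
public class StreamingSketch {

    static class Trip {
        String tripId;
    }

    static final Function<Map<String, String>, Trip> TO_TRIP =
            new Function<Map<String, String>, Trip>() {
                @Override
                public Trip apply(Map<String, String> row) {
                    Trip trip = new Trip();
                    trip.tripId = row.get("trip_id");
                    return trip;
                }
            };

    /**
     * Rows are consumed and transformed lazily, so an entire feed can be
     * round-tripped while holding only one row in memory at a time.
     */
    public static Iterable<Trip> trips(Iterable<Map<String, String>> rows) {
        return Iterables.transform(rows, TO_TRIP);
    }
}
```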
Ah, right. I was under the assumption you wanted to use MapDB for
|
On GitHub there is a benchmark of apparently every Java CSV parser in existence. Its conclusion: currently, three parsers stand out as the fastest CSV parsers for Java: uniVocity-parsers, SimpleFlatMapper and Jackson CSV. Keep in mind that SimpleFlatMapper is a very simple implementation that does not provide any customization options. The benchmark reports separate results for values not enclosed within quotes and for values enclosed within quotes.

Another thing: we can't forget to support all the non-standard GTFS extensions which OBA supported, like: |
Thanks, @buma. If performance becomes a problem (right now it isn't), I'll definitely consider switching. However, I believe the unzipping performance is at least as important as the CSV parsing performance, so the actual improvements may not be as spectacular as they would seem to be. GTFS extensions will probably be added when the code is integrated into OTP. I think it's important that any extensions are clearly marked as such, so that people know they are non-standard features that may not be standardized in their current form or at all. |
Reading your comment, Jorden, it just occurred to me that the loading phase would probably benefit from parallelization. The tables can be loaded completely separately from one another, with the exception of feed_info, which contains the feed ID (see the sketch after the next comment). |
That's a very interesting idea. Once I get to a point where it makes sense to try that I will certainly do so. |
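A sketch of what that parallelization could look like, assuming a thread pool with one task per table; none of this is actual project code, and loadTable is a placeholder:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the parallelization idea discussed above.
public class ParallelLoadSketch {

    static final List<String> TABLES = Arrays.asList(
            "agency.txt", "stops.txt", "routes.txt", "trips.txt",
            "stop_times.txt", "calendar.txt", "calendar_dates.txt");

    public static void loadAll() throws InterruptedException {
        // feed_info.txt is loaded first, since it supplies the feed ID
        // that the other tables depend on.
        loadTable("feed_info.txt");

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (final String table : TABLES) {
            pool.execute(new Runnable() {
                @Override
                public void run() {
                    loadTable(table);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void loadTable(String name) {
        // Placeholder for the real per-table loading code.
    }
}
```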
For future reference: after removing some time hogs from RoundTrip, I came to the conclusion that commit 86963c4 caused the round-trip time to go from approximately 70 seconds to approximately 130 seconds. |
ADDENDUM: However, my own commit note claims that what I just wrote isn't correct. I will not investigate this further at this time, because I have spent too much time on this already and don't want to get sidetracked anymore. |
Stage 2 is now more or less complete (except parts of the validation code) and it takes about five minutes to import an OVapi GTFS feed and find its trip patterns on my laptop. |
After some necessary changes it now takes about seven minutes to do the same. |
I'm thinking about the best way to store stop times, calendar dates, shapes and transfers. Right now (in code that I will commit and push in a few minutes) they are keyed on tuples, but I'm wondering if it would be better to switch to multimaps instead (at least for stop times and shapes). Any thoughts? |
I think it will make sense to use multimaps for some of these, so we can have more than one value for the same key. The tuple keys just happened to be efficient and useful for grouping stoptimes into ordered trips. The main problem I see with storing stoptimes in a multimap is that MapDB does not directly provide multimaps, and it requires map values to be immutable so the lists can't be built up progressively as we load the table. |
It might be good to store shapes in a MapDB map as well, since that's also often a huge table. RE your load time/performance observations above, while we do of course want to maintain a reasonable speed, the main priority is just getting raw GTFS data into a form usable by OTP and our validator front end. We can leave optimization for later. |
Yeah, using multimaps with MapDB is probably more trouble than it's worth, unfortunately. Performance seems okay for now, so I'm explicitly refraining from performing any further optimization steps at this point. I moved shapes.txt to MapDB, along with all other tables that are keyed on tuples. It appears to work quite well, and memory usage is back to the level it was on before I fixed the keying issue. |
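To illustrate why tuple keys make a multimap unnecessary here, a sketch using a plain java.util.TreeMap; the real code keeps these in a MapDB BTreeMap, but the grouping behavior of tuple keys is the same:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch: tuple keys sorted first by trip, then by stop
// sequence, so each trip's stop times are contiguous and ordered.
public class TupleKeySketch {

    /** A comparable (trip_id, stop_sequence) tuple key. */
    static class Key implements Comparable<Key> {
        final String tripId;
        final int stopSequence;

        Key(String tripId, int stopSequence) {
            this.tripId = tripId;
            this.stopSequence = stopSequence;
        }

        @Override
        public int compareTo(Key other) {
            int c = tripId.compareTo(other.tripId);
            return c != 0 ? c : Integer.compare(stopSequence, other.stopSequence);
        }
    }

    public static void main(String[] args) {
        TreeMap<Key, String> stopTimes = new TreeMap<>();
        stopTimes.put(new Key("trip_A", 2), "08:10:00");
        stopTimes.put(new Key("trip_B", 1), "09:00:00");
        stopTimes.put(new Key("trip_A", 1), "08:00:00");

        // All stop times for trip_A, in stop_sequence order:
        NavigableMap<Key, String> tripA = stopTimes.subMap(
                new Key("trip_A", 0), true,
                new Key("trip_A", Integer.MAX_VALUE), true);
        System.out.println(tripA.values()); // prints [08:00:00, 08:10:00]
    }
}
```

This also sidesteps the immutability problem mentioned above, since no list values ever need to be built up progressively during loading.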
I believe stage 2 is now complete, although there may obviously still be things I overlooked. Some things that could technically be validated at this stage aren't, because the costs seemed to outweigh the benefits for what is essentially GTFS import code for a trip planner. Of course such functionality could still end up inside a separate GTFS validator, but that's outside the scope of OTP. |
And TriMet's GTFS feeds can now also be imported without errors. |
This is implemented in the new TransportNetwork and TransitLayer classes. |
Does this resolve the onebusaway memory problem reported here: |
@marcioaguiar the new gtfs-lib should create the same number of Calendar objects as there are service IDs, i.e. space requirements are proportional to the number of service IDs, not the number of days those service IDs are active.

In addition, it's now trivial to back any of the GTFS tables with disk storage instead of in-memory maps: https://github.com/conveyal/gtfs-lib/blob/master/src/main/java/com/conveyal/gtfs/GTFSFeed.java

If I am not mistaken, the new routing implementation never builds a full table of days when the services are active. At the beginning of each search it scans through all the service IDs and flags the ones that are active on that particular day. cf. com.conveyal.gtfs.model.Service#activeOn |
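A hedged sketch of what such a per-day check looks like; see com.conveyal.gtfs.model.Service#activeOn for the real implementation. The fields below are condensed from calendar.txt and calendar_dates.txt as defined by the GTFS spec:

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.HashSet;
import java.util.Set;

// Simplified sketch, not the actual gtfs-lib code: rather than
// precomputing a table of active days, each service decides at query
// time whether it runs on a given date.
public class ServiceSketch {

    // calendar.txt: a date range plus flags for the days of the week.
    LocalDate startDate;
    LocalDate endDate;
    Set<DayOfWeek> runningDays = new HashSet<>();

    // calendar_dates.txt exceptions: exception_type 1 adds service on a
    // date, exception_type 2 removes it.
    Set<LocalDate> addedDates = new HashSet<>();
    Set<LocalDate> removedDates = new HashSet<>();

    /** Decides at query time whether this service runs on the given date,
     *  so no full table of active days is ever built. */
    public boolean activeOn(LocalDate date) {
        if (addedDates.contains(date)) return true;
        if (removedDates.contains(date)) return false;
        return startDate != null
                && !date.isBefore(startDate)
                && !date.isAfter(endDate)
                && runningDays.contains(date.getDayOfWeek());
    }
}
```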
That's great! Thank you for the comprehensive response. |
- Eliminate the feed/agency confusion we inherited from the One Bus Away GTFS loader.
- Perform all loading via MapDB to limit memory requirements.
- AgencyAndId should be replaced with FeedScopedId first, to make the change incremental.
- Should feeds have an identifier_namespace field, in case two feeds want to declare the same stops? Is that ever necessary, or can we just do stop clustering?
- See existing code at org.opentripplanner.gtfs.model.GTFSMain.
- The calendar lookup service should be in a separate layer on top of the raw GTFS loader. Use an int -> service code map and a list of active service codes for each defined day (with a check for an upper limit on the number of days, e.g. 4 years).
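A minimal sketch of the FeedScopedId idea, assuming a simple value class combining a feed ID and an entity ID; the class and field names here are illustrative, not the actual OTP code:

```java
// Hypothetical sketch of a FeedScopedId replacing AgencyAndId: every
// entity ID carries the ID of the feed that defined it, making the
// per-feed namespaces implicit in GTFS explicit in the internal model.
public class FeedScopedIdSketch {
    public final String feedId;
    public final String id;

    public FeedScopedIdSketch(String feedId, String id) {
        this.feedId = feedId;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FeedScopedIdSketch)) return false;
        FeedScopedIdSketch other = (FeedScopedIdSketch) o;
        return feedId.equals(other.feedId) && id.equals(other.id);
    }

    @Override
    public int hashCode() {
        return 31 * feedId.hashCode() + id.hashCode();
    }

    @Override
    public String toString() {
        return feedId + ":" + id; // e.g. "ovapi:stop123" (hypothetical rendering)
    }
}
```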
Comments copied over from Trello:
Sep 4 - Brandon Martin-Anderson
By "in case two feeds want to declare the same stops" do you mean: (a) two feeds want to use the same stopid for different stops or (b) two feeds want to reference the same physical stop? I think both scenarios are relatively common.
Sep 4 - Andrew Byrd
The situation is that GTFS implicitly has one namespace per feed. That's pretty simple to mirror in our internal representation. So the question is whether it's worth allowing cross-feed references at all, since the worst thing that happens is you get two stops in similar positions that get connected to the same place in the street network. Do you have examples of cases where OTP needs to distinguish between A) two feeds referring to the same logical stop and B) two feeds referring to distinct stops that happen to be really close together?
Brandon Martin-Anderson
Given that (a) there's no way in GTFS to refer to another feed's namespace, (b) hacking in a convention would be awkward and unreliable, and (c) just having two stops in the exact same place is fine, because traversing the street/stop link is (if I recall correctly) free, we should just let stop clustering do the work.