Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace GTFS Loading Code #1508

Closed
abyrd opened this issue Sep 15, 2014 · 29 comments
Closed

Replace GTFS Loading Code #1508

abyrd opened this issue Sep 15, 2014 · 29 comments
Assignees

Comments

@abyrd
Copy link
Member

abyrd commented Sep 15, 2014

Eliminate the feed/agency confusion we inherited from the One Bus Away GTFS loader. Perform all loading via MapDB to limit memory requirements. AgencyAndId should be replaced with FeedScopedId first to make change incremental. Should feeds have an identifier_namespace field, in case two feeds want to declare the same stops? Is that ever necessary or can we just do stop clustering? See existing code at org.opentripplanner.gtfs.model.GTFSMain. Calendar lookup service should be in a separate layer on top of raw GTFS loader. Use an int->servicecode map and a list of active service codes on each defined day (with a check for an upper limit on the number of days, e.g. 4 years).

Comments copied over from Trello:

Sep 4 - Brandon Martin-Anderson

By "in case two feeds want to declare the same stops" do you mean: (a) two feeds want to use the same stopid for different stops or (b) two feeds want to reference the same physical stop? I think both scenarios are relatively common.

Sep 4 - Andrew Byrd

The situation is that GTFS implicitly has one namespace per feed. That's pretty simple to mirror in our internal representation. So the question is whether it's worth allowing cross-feed references at all, since the worst thing that happens is you get two stops in similar positions that get connected to the same place in the street network. Do you have examples of cases where OTP needs to distinguish between A) two feeds referring to the same logical stop and B) two feeds referring to distinct stops that happen to be really close together?

Brandon Martin-Anderson

Whereas (a) there's no way in GTFS to refer to another feed's namespace and (b) hacking in a convention would be awkward and unreliable and (c) just having two stops in the exact same place is fine because traversing the street/stop link is (if I recall correctly) free, we should just let stop clustering do the work.

@abyrd abyrd self-assigned this Sep 15, 2014
@abyrd abyrd added this to the NYSDOT Sprint 2 milestone Sep 15, 2014
@buma
Copy link
Contributor

buma commented Sep 29, 2014

Maybe #177 could be supported in this replacement:

Couldn't there be a way of allowing the one who generates the feed to add additional values to trips.txt file? And for admin to define in OTP configuration what these values represent and allow to add easily checkboxes into the gui which would represent filters for given values (as there is now for wheelchair access)? These characteristics should then be also visible when one looks at detailed info about particular train/bus line, and also stop's timetables.

Updates to the GTFS graph builder to allow people to specify additional fields/data in their feeds, and how their values should effect routing.

@abyrd abyrd modified the milestones: NYSDOT Sprint 3, NYSDOT Sprint 2 Oct 2, 2014
@abyrd
Copy link
Member Author

abyrd commented Oct 2, 2014

Working on this currently.

@abyrd
Copy link
Member Author

abyrd commented Oct 9, 2014

We have decided to move away from an OBA-like reflection and annotation-heavy implemetation to a much more imperative one that allows straightforward tracing of where and why strings are parsed into the internal model objects.

@JordenVerwer
Copy link
Member

Loading GTFS feeds will be done in three stages:

  1. The first stage converts the ZIP file representing the feed to an internal representation having a Feed => Table => Row => Cell hierarchy. No validation or interpretation is performed, although obviously completely malformed feeds (such as one that is not a ZIP, or which contains something other than CSV files) will already be rejected at this stage.
  2. The second stage converts the first stage's representation to a format that closely resembles the official GTFS specification. Validations that will be performed at this stage include checking that required fields are included and that cells contain data of the right type (so no text in columns that are supposed to contain numbers). No validations that require cross-references between the various tables will be performed.
  3. The third stage converts the second stage's representation to objects that can be used in routing and completes the validation process. For some tables (like agency.txt) this will be trivial, for some tables (like calendar_dates.txt) it will require more work.

Right now I've completed work on the first stage, although there may of course still be bugs lingering. I can read in an OVapi feed and write it back out in approximately 50 seconds on my laptop, which requires approximately 25 seconds to just unzip it. The resulting TXT files aren't 100% identical to those inside the original GTFS feed, but this difference is due to trailing whitespace (which CSV parsers are supposed to ignore unless it's quoted). I could make changes that preserve the whitespace, but then it will be quoted when the files are written back. Personally I don't see how trailing whitespace could ever be significant in a GTFS feed, so I'm inclined to leave things as they are.

Going from a feed to a table is done through a Map that is keyed on a String representing the filename, so you can look up stop_times.txt (or something else) in a very intuitive way. You can then iterate over the table to fetch the individual rows. Allowing for random access would be too memory-intensive here. You can, however, go from a row to a cell through another Map, which is keyed on a String representing the field name. The Cell class actually represents a field name and cell contents pair, so the actual cell is represented as a String.

You can find my work in the obaa branch, with the main() method that can be used to demo it being contained in the RoundTrip class. Personally I rather like the design, which is reasonably generic without becoming slow or unreadable. However, the use of inner classes might be described as "excessive" by some. ;)

@JordenVerwer
Copy link
Member

With stage 2 partially implemented, the round-trip time is now at approximately 70 seconds.

@abyrd
Copy link
Member Author

abyrd commented Oct 13, 2014

Hi @JordenVerwer, I've reviewed your branch. There is already Map<String, String> style access to the fields of each row in our CSV loader library. Did you make an intentional decision to replicate this functionality (i.e. was there something measurably bad about it)?

I was actually expecting to just use the CSV library for level 1 and move directly to level 2, only replacing the reflection-based filling-in of level-2 objects with straightforward imperative functions.

I have adapted a few of the level-2 model classes to this approach to demonstrate what I mean. The results are in the obaa_abyrd branch which I just pushed to Github. As before, runnable code demonstrating its use is in GTFSMain / GTFSFeed.

What is your take on this approach?

@JordenVerwer
Copy link
Member

Did you make an intentional decision to replicate this functionality (i.e. was there something measurably bad about it)?

No, it was a matter of reusing old code, partially rewriting it, then not noticing the overloaded get() method. I rewrote the code to see if it would work correctly using that method and it seems like it would. Performance is in the same ballpark, the round-trip results don't change and the only real disadvantage is that it becomes a bit harder to implement the Map interface correctly (but that was never a hard requirement anyway). So yes, I could and probably should have done it that way, although I think leaving it as it is now is just as well.

I was actually expecting to just use the CSV library for level 1 and move directly to level 2, only replacing the reflection-based filling-in of level-2 objects with straightforward imperative functions.

Of course the CSV library performs a major part of the work on level 1, but it doesn't do everything. I wish it had proper documentation so I wouldn't have to guess about the details of what it does half the time. Some methods don't even have JavaDoc. I've been working on adding validation code today and I noticed that the CSV reader does not check that a row has the right column count, for instance. I'm not sure what happens if you use the get() method on a column that doesn't exist for that particular row, so that might be something to look into if you want me to switch the code over.

What is your take on this approach?

I've been avoiding implementation inheritance lately and I've never been a fan of the factory pattern, but I suppose it could work. I don't really see any advantages to it, so I may simply be misunderstanding your intentions. Could you explain the rationale behind using factories instead of constructors?

@abyrd
Copy link
Member Author

abyrd commented Oct 13, 2014

Factory methods serve exactly the same purpose as constructors, except that I can specify them as overriding a method on an interface or abstract class. I started out using constructors (since this is based on your code) but shifted to the factory methods because I needed the reused chunk of code in loadTable to call different functions depending on context, to transform a CSV row into instances of each of the various "Entity" types.

Of course what we're really trying to do is pass a function object into the reusable table loader (domain is CSV row, range is instance of specific Entity type). I see you've favored a functional style using Iterables.transform and Function, which can also accomplish this.

However, I also wanted the code that moved CSV field values into Java object instances to rely on a shared set of utility methods that check whether the parsing happened correctly as each row is converted. Of course those methods could be static methods in a library-module, but they need access to the CSV reader field and a place to store parse error messages, etc. I suppose all that could be passed into the constructor, and then passed down from the constructor into the field-fetch-helper methods but that seems a bit unnecessarily cluttered. The factory classes+methods just bundle together all the information necessary to the reusable methods while they are loading one table.

I agree that it's often best to avoid inheritance, but I find a single level of inheritance acceptable when it facilitates code reuse.

Regarding the CSV library, I think the lack of documentation is because this library was originally for another language and translated to Java. I found the Javadoc sufficient to make use of the library, but yes it's sparse. I believe it returns null if the requested column is not present. If this library turns out to be inefficient or lacking in some other way, I'm sure there are hundreds of other ones that do the same thing.

@JordenVerwer
Copy link
Member

Yes, there are multiple valid approaches and in the end it comes down to personal preference. I don't mind functions or methods with many arguments, but I can see why many people are against (I, on the other hand, try to avoid global variables pretty much at all cost, which is something other people don't consider all that important).

I'll see if I can reconcile our different approaches somehow, because it's a shame to let code go to waste. Right now I'm working on adding proper types and validation code (much of it as static classes on FeedValidator) to the GTFS classes, which should bring us much closer to finishing stage 2. After that the main thing that remains is the MapDB integration, and then stage 2 should be done (apart from bug fixing). The main advantage your code has over mine right now is that it's better at continuing the import in the face of errors. This is an important requirement and I'm still pondering what the best solution is. Your approach certainly has merit here.

I've looked at other CSV libraries, but I haven't found a single one that I really like. Despite its flaws, this one at least gives us good performance, and I haven't run into any obvious bugs yet.

@abyrd
Copy link
Member Author

abyrd commented Oct 13, 2014

It was pretty straightforward to finish off what I'd started, so I just pushed the rest for completeness. There's no real concern about "wasting code", it's not wasted as long as it's helped illustrate or contribute to the design. Let's just use what we decide is best, or a combination of the best features.

MapDB integration should be essentially done, as long as we're only putting StopTimes into MapDB. The only thing that took any real effort was defining the iteration that pulls out the trips one at a time.

The approach where you apply a transformation to an iterator using a function object is particularly well suited to streaming GTFS in a line at a time, and therefore particularly suited to your test of sending a feed through a round-trip with essentially no memory consumption. However, in practice we will always want to give the OTP GTFS loader (and validator) random access to the GTFS contents, so I'd prefer to have the output of loading a feed look much like the current GTFSFeed class.

@JordenVerwer
Copy link
Member

Ah, right. I was under the assumption you wanted to use MapDB for
everything. I'll have more code done by tomorrow and hopefully the big
picture will become a bit clearer then. Signing off.

On Mon, Oct 13, 2014 at 8:48 PM, Andrew Byrd [email protected]
wrote:

It was pretty straightforward to finish off what I'd started, so I just
pushed the rest for completeness. There's no real concern about "wasting
code", it's not wasted as long as it's helped illustrate or contribute to
the design. Let's just use what we decide is best, or a combination of the
best features.

MapDB integration should be essentially done, as long as we're only
putting StopTimes into MapDB. The only thing that took any real effort was
defining the iteration that pulls out the trips one at a time.

The approach where you apply a transformation to an iterator using a
function object is particularly well suited to streaming GTFS in a line at
a time, and therefore particularly suited to your test of sending a feed
through a round-trip with essentially no memory consumption. However, in
practice we will always want to give the OTP GTFS loader (and validator)
random access to the GTFS contents, so I'd prefer to have the output of
loading a feed look much like the current GTFSFeed class.


Reply to this email directly or view it on GitHub
#1508 (comment)
.

@buma
Copy link
Contributor

buma commented Oct 13, 2014

On GitHub is a test of apparently every Java CSV parser in existence.

Results:

Currently, three parsers stand out as the fastest CSV parsers for Java:

uniVocity-parsers, SimpleFlatMapper and Jackson CSV.

Keep in mind that Simpleflatmapper is a very simple implementation that does not provide any customization options.

When processing values not enclosed within quotes:

  • SimpleFlatMapper is barely faster than uniVocity-parsers when processing values that are not enclosed within quotes.
  • Jackson CSV follows closely and all other parsers follow very far behind.

When processing values enclosed within quotes:

  • SimpleFlatMapper and uniVocity-parsers have comparable performance when processing values enclosed within quotes
  • Jackson CSV comes behind, yet it is still way faster than the others.

Another thing is. We can't forget to support all non-standard GTFS which OBA supported like:
journey_duration in fares and bikes_allowed etc. in trips

@JordenVerwer
Copy link
Member

Thanks, @buma. If performance becomes a problem (right now it isn't), I'll definitely consider switching. However, I believe the unzipping performance is at least as important as the CSV parsing performance, so the actual improvements may not be as spectacular as they would seem to be.

GTFS extensions will probably be added when the code is integrated into OTP. I think it's important that any extensions are clearly marked as such, so that people know they are non-standard features that may not be standardized in their current form or at all.

@abyrd
Copy link
Member Author

abyrd commented Oct 14, 2014

Reading your comment Jorden, it just occurred to me that the loading phase would probably benefit from parallelization. The tables can be loaded completely separately from one another with the exception of feed-info which contains the feed id.

@JordenVerwer
Copy link
Member

That's a very interesting idea. Once I get to a point where it makes sense to try that I will certainly do so.

@JordenVerwer
Copy link
Member

After removing some time hoggers from RoundTrip, I came to the conclusion that commit 86963c4 caused the round-trip time to go from approximately 70 seconds to approximately 130 seconds, for future reference.

@JordenVerwer
Copy link
Member

ADDENDUM: However, my own commit note claims that what I just wrote isn't correct. I will not investigate this further at this time, because I have spent too much time on this already and don't want to get sidetracked anymore.

@JordenVerwer
Copy link
Member

Stage 2 is now more or less complete (except parts of the validation code) and it takes about five minutes to import an OVapi GTFS feed and find its trip patterns on my laptop.

@JordenVerwer
Copy link
Member

After some necessary changes it now takes about seven minutes to do the same.

@JordenVerwer
Copy link
Member

I'm thinking about what's the best way to store stop times, calendar dates, shapes and transfers. Right now (in code that I will commit and push in a few minutes) they are keyed on tuples, but I'm wondering if it would be better to switch to multimaps instead (at least for stop times and shapes). Any thoughts?

@abyrd
Copy link
Member Author

abyrd commented Oct 22, 2014

I think it will make sense to use multimaps for some of these, so we can have more than one value for the same key. The tuple keys just happened to be efficient and useful for grouping stoptimes into ordered trips.

The main problem I see with storing stoptimes in a multimap is that MapDB does not directly provide multimaps, and it requires map values to be immutable so the lists can't be built up progressively as we load the table.

@abyrd
Copy link
Member Author

abyrd commented Oct 22, 2014

It might be good to store shapes in a MapDB map as well, since that's also often a huge table. RE your load time/performance observations above, while we do of course want to maintain a reasonable speed, the main priority is just getting raw GTFS data into a form usable by OTP and our validator front end. We can leave optimization for later.

@JordenVerwer
Copy link
Member

Yeah, using multimaps with MapDB is probably more trouble than it's worth, unfortunately.

Performance seems okay for now, so I'm explicitly refraining from performing any further optimization steps at this point.

I moved shapes.txt to MapDB, along with all other tables that are keyed on tuples. It appears to work quite well, and memory usage is back to the level it was on before I fixed the keying issue.

@JordenVerwer
Copy link
Member

I believe stage 2 is now complete, although there may obviously still be things I overlooked. Some things that could technically be validated at this stage aren't, because the costs seemed to outweigh the benefits for what is essentially GTFS import code for a trip planner. Of course such functionality could still end up inside a separate GTFS validator, but that's outside the scope of OTP.

@JordenVerwer
Copy link
Member

And TriMet's GTFS feeds can now also be imported without errors.

@abyrd abyrd assigned abyrd and unassigned JordenVerwer Oct 28, 2014
@abyrd abyrd modified the milestones: NYSDOT Sprint 5, NYSDOT Sprint 7 Nov 24, 2014
@abyrd
Copy link
Member Author

abyrd commented Sep 30, 2015

This is implemented in the new TransportNetwork and TransitLayer classes.

@abyrd abyrd closed this as completed Sep 30, 2015
@marcioaguiar
Copy link

Does this resolve the onebusaway memory problem reported here:
OneBusAway/onebusaway-gtfs-modules#59

@abyrd
Copy link
Member Author

abyrd commented Oct 8, 2015

@marcioaguiar the new gtfs-lib should create the same number of Calendar objects as there are service IDs, i.e. space requirements are proportional to the number of service IDs not the number of days those service IDs are active. In addition, it's now trivial to back any of the GTFS tables with disk storage instead of in-memory maps.

https://github.com/conveyal/gtfs-lib/blob/master/src/main/java/com/conveyal/gtfs/GTFSFeed.java

If I am not mistaken, the new routing implementation never builds a full table of days when the services are active. At the beginning of each search it scans through all the service IDs and flags the ones that are active on that particular day.

cf. com.conveyal.gtfs.model.Service#activeOn

@marcioaguiar
Copy link

That's great!

Thank you for the comprehensive response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants