Best Practices: Reasonable lengths in unique ids #518
Comments
I must admit this one made me laugh - no matter how many rules the spec imposes, data producers will always find new ways to create a mess. Famous last words: "36 characters ought to be enough for anyone."
I'm not certain which problem we are trying to solve here. The zip will compress large and repeated IDs nicely, and the trip count is never so huge that it prevents storing the data, even in memory. Also, giving an (arbitrary) limit could be taken as an excuse by some re-users to justify not being able to consume feeds with a few IDs longer than this limit. I've encountered some re-users that cannot ingest GTFS with IDs longer than 80 or 255 chars, for example, even if only a few IDs are above that threshold. In summary, I'm rather against this; for me it is rather useless, somewhat arbitrary, and open to misinterpretation.
@laurentg in this case the feed of "last week" was 74MB compressed, and this week's is 890MB compressed. For compression (itself) to work properly, some things must be guaranteed first, for example that the data in the files is sorted. But this is not only about compression: processing and matching still require these idiotic long strings to be stored in memory, unless the implementation throws them overboard anyhow and creates hashes. With respect to your other comment, that it is never too big to store in memory: 372996 trips multiplied by 100 bytes is indeed "only" 37MB. But it could also have been just 2.2MB. The example that you give, that there exist GTFS IDs with a length of 80 to 255 characters, already shows we need a best practice. Nobody in their sane mind has more than 10^80 stops in the network.
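For anyone who wants to check the arithmetic above, here is a small back-of-the-envelope sketch. The 6-character figure is an assumption on my part: it is the shortest decimal ID that can number 372996 trips, and it happens to reproduce the 2.2MB quoted. The function name is purely illustrative.

```python
# Rough storage estimate for trip IDs, using the trip count quoted in the thread.
TRIPS = 372_996


def id_storage_mb(count: int, id_length: int) -> float:
    """Raw size of `count` IDs of `id_length` single-byte characters, in MB (10^6 bytes)."""
    return count * id_length / 1_000_000


long_ids_mb = id_storage_mb(TRIPS, 100)  # 100-character IDs, as in the new feed
short_ids_mb = id_storage_mb(TRIPS, 6)   # assumption: 6 digits per ID

print(f"{long_ids_mb:.1f} MB vs {short_ids_mb:.1f} MB")  # 37.3 MB vs 2.2 MB
```

This counts each ID only once. In stop_times.txt the trip ID is repeated on every row, so the effect on the file itself is correspondingly larger.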
@skinkie I agree that there definitely needs to be a best practice on |
As another data point: Edit: this was incorrect. MBTA enforces a maximum service ID length of 39 characters, but does not enforce a maximum trip ID length.
Entity ID length has never been an issue I've worried about, and I have seen a lot of pathologically malformed GTFS files. Regarding memory usage, a consumer has options for dealing with this, such as string interning or using internal integer keys for foreign key references. In my experience the main memory hog is generally shapes.txt.
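To illustrate the "internal integer keys" option, here is a minimal sketch of what a consumer might do. `IdTable` and the sample rows are hypothetical and not taken from any particular GTFS library.

```python
class IdTable:
    """Maps external string IDs to compact internal integer keys (hypothetical helper)."""

    def __init__(self) -> None:
        self._key_by_id: dict[str, int] = {}
        self._id_by_key: list[str] = []

    def key_for(self, external_id: str) -> int:
        """Return the integer key for an ID, assigning the next free key on first sight."""
        key = self._key_by_id.get(external_id)
        if key is None:
            key = len(self._id_by_key)
            self._key_by_id[external_id] = key
            self._id_by_key.append(external_id)
        return key

    def id_for(self, key: int) -> str:
        """Recover the original string ID, e.g. for export or error messages."""
        return self._id_by_key[key]


trips = IdTable()
# Made-up (trip_id, departure_time) pairs standing in for stop_times rows.
stop_time_rows = [
    ("trip-with-a-very-long-identifier-A", "08:00:00"),
    ("trip-with-a-very-long-identifier-A", "08:05:00"),
    ("trip-with-a-very-long-identifier-B", "09:00:00"),
]
compact_rows = [(trips.key_for(trip_id), dep) for trip_id, dep in stop_time_rows]
print(compact_rows)  # [(0, '08:00:00'), (0, '08:05:00'), (1, '09:00:00')]
```

Each long string is stored once in the table, and every stop_times row carries only a small integer. This helps the consumer's memory footprint, but it does nothing for the size of the feed on disk or on the wire.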
My rationale for 36 bytes is that a UUID in its canonical text form is 36 bytes long. Not that I would ever suggest using a UUID; I would even be more in favor of making it a true integer, breaking change ;-)
@skinkie I think everyone is agreeing on adding this as a best practice, the only point of doubt is the character limit. I do agree that since the |
@skalexch I think two things are important:
I agree that compact IDs should be included as a best practice. The exact length suggested is debatable. Since the primary purpose of this recommendation is to avoid unnecessary bloating of file sizes and a bit of excess processing when reading, you may even want to include (weaker) recommendations against use of UUIDs, and in favor of outputting stop_times grouped by trip. These are of course not requirements, only suggestions, but data producers may simply not realize the effect these simple choices have on their file sizes and processing load. For something like a monthly data transfer between infrastructure components this might seem unimportant, but many people pass around GTFS feeds for scenario planning and network analysis purposes, and others load new GTFS feeds every hour or day. Background info: I had a client contact me about the same case cited here. They were using the De Lijn GTFS files at https://gtfs.irail.be/de-lijn/gtfs/ and were concerned that in one day (from September 17 to September 18) the size of the file changed from 51MB to 904MB, an 18X increase in size. This was confusing enough, and enough of an impediment to use, that they have continued using the September 17 feed up until today. In the feed from September 17 the stop_times table looks like this:
By comparison here is the stop_times table from the current version:
The old IDs are 8 digit integers with the same ID occurring in successive lines, while the new ones are 110 characters and highly random, and lines are apparently in random order. Taken together these are probably the source of the file size increase. We do not know what change in software may have led to this difference, and whether they are even conscious of the file size impact of this choice. It is a quirk of GTFS that the largest file (stop_times) requires a trip ID on every single line, so trip ID length has a particularly significant impact. It looks like the rows are now in a more random order, probably worsening the compression performance. It also looks like maybe they're trying to multiplex additional information into the ID field. If people want to share some kind of stable universal ID for trips or include other extra information about trips, they can just add more columns to the trips table. It is rather inefficient to repeat all this information hundreds of times on every row of the stop_times table. I suspect it's just due to people not realizing that they can add arbitrary extra columns to trips.txt, and not realizing how repetitively GTFS uses trip IDs.
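The combined effect of ID length, randomness and row order on compression can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the rows are made up (not from the De Lijn feed), the column layout is simplified to `trip_id,stop_sequence,departure_time`, and `zlib` stands in for the DEFLATE compression typically used inside zip archives.

```python
import random
import zlib

random.seed(0)  # fixed seed so repeated runs produce the same synthetic data
HEX_DIGITS = "0123456789abcdef"


def build_rows(trip_ids, stops_per_trip=20, shuffle=False):
    """Build simplified stop_times-like rows: trip_id,stop_sequence,departure_time."""
    rows = [
        f"{trip_id},{seq},08:00:00"
        for trip_id in trip_ids
        for seq in range(stops_per_trip)
    ]
    if shuffle:
        random.shuffle(rows)  # mimic rows emitted in random order
    return "\n".join(rows).encode("ascii")


n_trips = 500
# Short case: 8-digit sequential IDs, rows grouped by trip.
short_ids = [str(10_000_000 + i) for i in range(n_trips)]
# Long case: 110-character random IDs, rows shuffled.
long_ids = ["".join(random.choice(HEX_DIGITS) for _ in range(110)) for _ in range(n_trips)]

short_grouped = build_rows(short_ids)
long_shuffled = build_rows(long_ids, shuffle=True)

short_size = len(zlib.compress(short_grouped))
long_size = len(zlib.compress(long_shuffled))

print(f"raw:        {len(short_grouped):>9} vs {len(long_shuffled):>9} bytes")
print(f"compressed: {short_size:>9} vs {long_size:>9} bytes")
```

The exact numbers depend on the synthetic data, so none are quoted here. The point is the direction: long random IDs in shuffled rows leave the compressor few nearby repetitions to exploit, whereas short IDs repeated on consecutive lines compress very well.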
Really, 8 characters should be enough. Maybe with some kind of three-character type suffix. :) But consider that 10 digits is enough to give a unique integer ID to every person in the world. And a 64-bit integer has 20 digits in base 10.
A UUID is a true integer, 128 bits long. The representation you're thinking of is more compact than the decimal or binary one ;) Just kidding of course, I don't think anyone's seriously suggesting changing the definition of IDs or minimizing their length, just providing some helpful tips to inform people how they can avoid inadvertently causing their file sizes to jump by an order of magnitude.
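For the curious, the different representations of a UUID can be compared with Python's standard `uuid` module. The UUID value below is an arbitrary example, not taken from any feed.

```python
import uuid

u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # arbitrary example value

text_form = str(u)        # canonical form: 32 hex digits plus 4 hyphens
decimal_form = str(u.int) # the same 128-bit value written in base 10
binary_form = bin(u.int)[2:]  # the same value in base 2, without the "0b" prefix

print(len(text_form))     # 36
print(len(decimal_form))  # at most 39 digits for any 128-bit value
print(len(binary_form))   # at most 128 digits
```

So the 36-character text form is indeed shorter than writing the same integer in decimal (up to 39 digits) or binary (up to 128 digits).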
Suggest making the recommendation user- rather than developer-centric and using digits instead of bytes.
Despite the bit of discussion about numeric IDs here, I don't think any of us really want to restrict these fields to (text representations of) decimal numbers. Short IDs like "RED_LINE" or "F12A" would also be fine, whether or not developers or users are thinking of them as numbers. So might the best user-centric recommendation be in terms of "characters" rather than bytes or digits?
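The difference between characters and bytes only shows up for non-ASCII IDs. Here is a small sketch, assuming the feed is UTF-8 encoded as GTFS expects; the third ID is a made-up example and `id_lengths` is an illustrative name.

```python
def id_lengths(identifier: str) -> tuple[int, int]:
    """Return (character count, UTF-8 byte count) for an ID."""
    return len(identifier), len(identifier.encode("utf-8"))


for example in ["RED_LINE", "F12A", "Zürich_HB"]:
    chars, utf8_bytes = id_lengths(example)
    print(f"{example}: {chars} characters, {utf8_bytes} bytes")
# RED_LINE: 8 characters, 8 bytes
# F12A: 4 characters, 4 bytes
# Zürich_HB: 9 characters, 10 bytes
```

For plain ASCII IDs the two counts coincide, so a limit phrased in characters is easier for producers to reason about and only slightly looser in terms of bytes.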
Describe the problem
Today, I was confronted with a GTFS feed that used 100+ character IDs for individual trip_ids. You may guess what the size of the feed looked like. Their old feeds used only 8 digits as trip_id.
Use cases
Efficient resource usage.
Proposed solution
I want to propose a 36-byte soft limit as a best practice for any identifier used in GTFS. A UUID would fit; I would say even a NeTEx ServiceJourney or ScheduledStopPoint identifier would fit as a whole. If a value exceeds 36 bytes, a clear warning can and should be presented.
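As a rough sketch of what such a warning could look like in a validator, the snippet below scans one ID column of a GTFS table and reports the distinct values over the limit. The function name, the sample data and the warning text are illustrative assumptions, not part of any existing validator.

```python
import csv
import io

SOFT_LIMIT_BYTES = 36  # the soft limit proposed in this issue


def find_long_ids(csv_text: str, id_column: str, limit: int = SOFT_LIMIT_BYTES) -> list[str]:
    """Return each distinct ID in `id_column` whose UTF-8 length exceeds `limit` (hypothetical helper)."""
    seen: set[str] = set()
    too_long: list[str] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[id_column]
        if value not in seen and len(value.encode("utf-8")) > limit:
            too_long.append(value)
        seen.add(value)
    return too_long


# Made-up trips.txt content: an 8-digit ID, a UUID-length ID, and a 110-character ID.
sample = (
    "trip_id,route_id\n"
    "12345678,R1\n"
    "123e4567-e89b-12d3-a456-426614174000,R1\n"
    + "x" * 110 + ",R1\n"
)

long_ids = find_long_ids(sample, "trip_id")
for value in long_ids:
    # A warning, not an error: the feed remains valid.
    print(f"warning: trip_id of {len(value.encode('utf-8'))} bytes exceeds {SOFT_LIMIT_BYTES}-byte soft limit")
```

Because it is a soft limit, the check only reports. It should not be a reason for a consumer to reject the feed.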