Best Practices: Reasonable lengths in unique ids #518

Open
skinkie opened this issue Nov 14, 2024 · 14 comments
Labels
Change: Best Practice, GTFS Schedule, Status: Discussion, Support: Needs Feedback

Comments

@skinkie
Contributor

skinkie commented Nov 14, 2024

Describe the problem

Today, I was confronted with a GTFS feed that used 100+ character IDs for individual trip_ids. You may guess what the size of the feed looked like. Their old feeds used only 8 digits as trip_id.

Use cases

Efficient resource usage.

Proposed solution

I want to propose a 36 byte soft limit as best practice for any identifier used in GTFS. A UUID would fit, I would say even a NeTEx ServiceJourney or ScheduledStopPoint identifier would fit as whole. If a value exceeds 36 bytes, a nice warning can and should be presented.
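
A minimal sketch of how a validator might surface the proposed warning. The function name and message wording are hypothetical, and the 36-byte threshold is the value proposed above, not anything defined by the GTFS specification.

```python
import warnings

SOFT_LIMIT_BYTES = 36  # the soft limit proposed above; not part of the GTFS spec


def check_id_length(field_name: str, value: str, limit: int = SOFT_LIMIT_BYTES) -> bool:
    """Return True if the ID fits within the soft limit; warn and return False otherwise."""
    size = len(value.encode("utf-8"))  # measured in bytes, as in the proposal
    if size > limit:
        warnings.warn(
            f"{field_name} starting with {value[:20]!r} is {size} bytes; "
            f"the recommended maximum is {limit}"
        )
        return False
    return True
```

Because this is a soft limit, the check only warns and never rejects the feed.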

@leonardehrenfried
Contributor

I must admit this one made me laugh - no matter how many rules the spec imposes, data producers will always find new ways to create a mess.

Famous last words: "36 characters ought to be enough for anyone."

@laurentg

I'm not certain which problem we are trying to solve here. Zip compression handles long, repeated IDs nicely, and trip counts are never so large that the data cannot be stored, even in memory. Also, giving an (arbitrary) limit could be taken as an excuse by some re-users to justify not being able to consume feeds with a few IDs larger than this limit. I've encountered some re-users that cannot ingest GTFS with IDs longer than 80 or 255 chars, for example, even if only a few IDs are above that threshold.

In summary, I'm rather against this; to me it is rather useless, somewhat arbitrary and open to misinterpretation.

@skinkie
Contributor Author

skinkie commented Nov 14, 2024

@laurentg in this case last week's feed was 74MB compressed, and this week's is 890MB compressed. For compression to work well, some conditions must hold first, for example that the data in the files is sorted. But this is not only about compression: processing and matching still require these idiotically long strings to be stored in memory, unless the implementation throws them overboard anyway and creates hashes.

As for your other comment, that the data is never too big to store in memory: 372996 trips multiplied by 100 bytes is indeed "only" 37MB. But it could also have been just 2.2MB.
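
The arithmetic can be checked with a short sketch. The 6 bytes per short ID is an assumption inferred from the 2.2MB figure; the comment does not state it.

```python
TRIP_COUNT = 372_996  # number of trips quoted above


def id_storage_mb(trip_count: int, bytes_per_id: int) -> float:
    """Raw size of one copy of every trip_id, in megabytes (10**6 bytes)."""
    return trip_count * bytes_per_id / 1_000_000


long_ids_mb = id_storage_mb(TRIP_COUNT, 100)  # 100-character IDs
short_ids_mb = id_storage_mb(TRIP_COUNT, 6)   # assumed: about 6 bytes per ID
print(f"{long_ids_mb:.1f} MB vs {short_ids_mb:.1f} MB")
```

This counts each ID once, as it appears in trips.txt; any file that repeats the ID on every row multiplies the difference.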

The example you give, that GTFS IDs with a length of 80 to 255 characters already exist, shows that we need a best practice. Nobody in their right mind has more than 10^80 stops in their network.

@skalexch
Collaborator

skalexch commented Nov 18, 2024

@skinkie I agree that there definitely needs to be a best practice on trip_id lengths. I just want to know the rationale behind the 36 byte soft limit. A little analysis of the GTFS feeds in the Mobility Database shows that both the median and max lengths of trip_ids are averaged around 20. Even when looking at trip_ids that are composed of service_ids (I guess that's how they get longer), we find the same result. So a limit under 30 can probably take care of the needs of most agencies.
What do you think? Should there be a range?
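
For anyone who wants to repeat this kind of analysis on their own feeds, here is a rough sketch that reports trip_id length statistics for a single GTFS zip. It is not the script used for the Mobility Database numbers above; the function names are made up for illustration.

```python
import csv
import io
import statistics
import zipfile


def read_trip_ids(feed) -> list[str]:
    """Read every trip_id from trips.txt in a GTFS zip (path or file object)."""
    with zipfile.ZipFile(feed) as archive, archive.open("trips.txt") as raw:
        # utf-8-sig drops a byte-order mark if the producer wrote one
        text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
        return [row["trip_id"] for row in csv.DictReader(text)]


def length_stats(ids: list[str]) -> dict:
    """Summarise ID lengths in characters."""
    lengths = [len(value) for value in ids]
    return {
        "count": len(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }
```

Running `length_stats(read_trip_ids("feed.zip"))` over many feeds and aggregating the results would give the kind of per-feed median and maximum discussed here.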

@paulswartz
Contributor

paulswartz commented Nov 18, 2024

As another data point: MBTA enforces a maximum trip ID length of 39 characters, but we do have trip IDs which approach this limit.

Edit: this was incorrect. MBTA enforces a maximum service ID length of 39 characters, but does not enforce a maximum trip ID length.

@irees

irees commented Nov 18, 2024

Entity ID length has never been an issue I've worried about, and I have seen a lot of pathologically malformed GTFS files.

Regarding memory usage, a consumer has options for dealing with this, such as string interning or using internal integer keys for foreign key references. In my experience the main memory hog is generally shapes.txt.
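
One way a consumer might implement the integer-key option is sketched below. The `IdTable` class is hypothetical and not taken from any particular GTFS library: each external string ID is stored once, and all foreign-key references hold a small integer instead.

```python
class IdTable:
    """Map external string IDs to dense internal integer keys."""

    def __init__(self) -> None:
        self._key_by_id: dict[str, int] = {}
        self._id_by_key: list[str] = []

    def key_for(self, external_id: str) -> int:
        """Return the integer key for an ID, assigning a new one on first sight."""
        key = self._key_by_id.get(external_id)
        if key is None:
            key = len(self._id_by_key)
            self._key_by_id[external_id] = key
            self._id_by_key.append(external_id)
        return key

    def id_for(self, key: int) -> str:
        """Recover the original string ID, e.g. for output or error messages."""
        return self._id_by_key[key]


trip_ids = IdTable()
stop_time_trip_keys = [trip_ids.key_for(t) for t in ["48327837", "48327837", "48327999"]]
```

With this layout the length of the external ID is paid once per trip rather than once per stop_times row. In Python, `sys.intern` offers the string-interning alternative.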

@skinkie
Contributor Author

skinkie commented Nov 18, 2024

> @skinkie I agree that there definitely needs to be a best practice on trip_id lengths. I just want to know the rationale behind the 36 byte soft limit. A little analysis of the GTFS feeds in the Mobility Database shows that both the median and max lengths of trip_ids are averaged around 20. Even when looking at trip_ids that are composed of service_ids (I guess that's how they get longer), we find the same result. So a limit under 30 can probably take care of the needs of most agencies. What do you think? Should there be a range?

My rationale for 36 byte is that UUID has the length of 36 bytes. Not that I would ever suggest using a UUID, I would even be more in favor of making it a true integer, breaking change ;-)

@eliasmbd eliasmbd added GTFS Schedule Issues and Pull Requests that focus on GTFS Schedule Change: Best Practice Changes focusing on recommendations for optimal use of the specification. Status: Discussion Issues and Pull Requests that are currently being discussed and reviewed by the community. Support: Needs Feedback labels Nov 19, 2024
@skalexch
Collaborator

skalexch commented Dec 4, 2024

@skinkie I think everyone is agreeing on adding this as a best practice, the only point of doubt is the character limit. I do agree that since the trip_id is a key, the UUID limit is reasonable. But I also suggest either finding a data-driven number (sort of an average or a max through the feeds in the Mobility Database) or removing the exact number and urging that the ID is as short as possible.

@skinkie
Contributor Author

skinkie commented Dec 4, 2024

@skalexch I think two things are important:

  1. as short as possible, but unique
  2. never derive content from a key (some organisations hide extra attributes directly in their keys, which others then parse)

@abyrd

abyrd commented Dec 6, 2024

I agree that compact IDs should be included as a best practice. The exact length suggested is debatable. Since the primary purpose of this recommendation is to avoid unnecessary bloating of file sizes and a bit of excess processing when reading, you may even want to include (weaker) recommendations against use of UUIDs, and in favor of outputting stop_times grouped by trip. These are of course not requirements, only suggestions, but data producers may simply not realize the effect these simple choices have on their file sizes and processing load. For something like a monthly data transfer between infrastructure components this might seem unimportant, but many people pass around GTFS feeds for scenario planning and network analysis purposes, and others load new GTFS feeds every hour or day.
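
As a small illustration of "grouped by trip", a producer could order rows before writing stop_times.txt along these lines. The function name is hypothetical and rows are assumed to be dictionaries as returned by `csv.DictReader`.

```python
def group_by_trip(stop_times: list[dict]) -> list[dict]:
    """Order rows so each trip's stops are contiguous and in sequence."""
    return sorted(
        stop_times,
        # stop_sequence is text in the CSV, so compare it numerically
        key=lambda row: (row["trip_id"], int(row["stop_sequence"])),
    )
```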

Background info: I had a client contact me about the same case cited here. They were using the De Lijn GTFS files at https://gtfs.irail.be/de-lijn/gtfs/ and were concerned that in one day (from September 17 to September 18) the size of the file changed from 51MB to 904MB, an 18X increase in size. This was confusing enough, and enough of an impediment to use, that they have continued using the September 17 feed up until today.

In the feed from September 17 the stop_times table looks like this:

trip_id,"arrival_time","departure_time",stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,"shape_dist_traveled"
48327837,"6:24:00","6:24:00",14411,1,,0,0,""
48327837,"6:27:00","6:27:00",108451,2,,0,0,""
48327837,"6:29:00","6:29:00",152488,3,,0,0,""

By comparison here is the stop_times table from the current version:

"trip_id","arrival_time","departure_time","stop_id","stop_sequence","pickup_type","drop_off_type"
"1007_117_110_3ANTRAM-23_1128-0710HBK1281_5787_a0126881ee5ec9d45a87041cb8defe97f542a06ad8cad8219ff36d385d338f13","13:55:00","13:55:00","104375","9","0","0"
"1561_29_687_1ME1560-13_1276-7605KRG2761_5399_ab73b272ab206fa9466300168527dfa321446c76acc875678e53d90b05fb688f","12:31:00","12:31:00","108844","21","0","0"
"1860_762_37_1ME1860-03_1294-8913DRS2941_5399_78f6922c62c12d5e0cfecfeaf00b0fe16a288fdeddd56167d78d448604874450","18:58:00","18:58:00","105168","27","0","0"

The old IDs are 8 digit integers with the same ID occurring in successive lines, while the new ones are 110 characters and highly random, and lines are apparently in random order. Taken together these are probably the source of the file size increase. We do not know what change in software may have led to this difference, and whether they are even conscious of the file size impact of this choice.

It is a quirk of GTFS that the largest file (stop_times) requires a trip ID on every single line, so trip ID length has a particularly significant impact. It looks like the rows are now in a more random order, probably worsening the compression performance.
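
The combined effect of ID length and row order on compression can be explored with a synthetic experiment like the one below. This is only a sketch under stated assumptions: the rows are made up, the long IDs merely imitate the shape of the ones shown above (a prefix plus a 64-hex-digit hash), and `zlib` stands in for the DEFLATE compression normally used in zip files.

```python
import hashlib
import random
import zlib


def stop_times_csv(trip_ids, stops_per_trip, shuffle, seed=0):
    """Build a synthetic stop_times body: one row per (trip, stop)."""
    rows = [
        f"{trip_id},08:00:00,08:00:00,{1000 + seq},{seq}"
        for trip_id in trip_ids
        for seq in range(1, stops_per_trip + 1)
    ]
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded so runs are repeatable
    return "\n".join(rows).encode("utf-8")


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


short_ids = [str(48_000_000 + n) for n in range(500)]
long_ids = [f"{n}_117_110_" + hashlib.sha256(str(n).encode()).hexdigest() for n in range(500)]

sizes = {
    "short, grouped": compressed_size(stop_times_csv(short_ids, 20, shuffle=False)),
    "long, grouped": compressed_size(stop_times_csv(long_ids, 20, shuffle=False)),
    "long, shuffled": compressed_size(stop_times_csv(long_ids, 20, shuffle=True)),
}
for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

Hash-like IDs contain little redundancy, so each one has to be stored almost in full at least once. When rows are shuffled, repeats of the same ID tend to fall outside the compressor's 32 KB look-back window and are stored again.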

It also looks like maybe they're trying to multiplex additional information into the ID field. If people want to share some kind of stable universal ID for trips or include other extra information about trips, they can just add more columns to the trips table. It is rather inefficient to repeat all this information hundreds of times on every row of the stop_times table. I suspect it's just due to people not realizing that they can add arbitrary extra columns to trips.txt, and not realizing how repetitively GTFS uses trip IDs.
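
A sketch of that alternative: replace the long IDs with short sequential ones and keep the original value once per trip in an extra trips.txt column. The function and the `original_trip_id` column name are made-up examples, not GTFS fields.

```python
def shorten_trip_ids(trips, stop_times, original_column="original_trip_id"):
    """Replace trip_id values with short sequential IDs in both tables.

    The long ID is preserved once per trip in an extra trips.txt column
    (the column name is a made-up example, not a GTFS field).
    """
    short_by_long = {}
    new_trips = []
    for row in trips:
        # Assign "1", "2", ... in order of first appearance in trips.txt
        short_id = short_by_long.setdefault(row["trip_id"], str(len(short_by_long) + 1))
        new_trips.append({**row, "trip_id": short_id, original_column: row["trip_id"]})
    new_stop_times = [
        {**row, "trip_id": short_by_long[row["trip_id"]]} for row in stop_times
    ]
    return new_trips, new_stop_times
```

Any other file that references trip_id, such as frequencies.txt, would need the same remapping.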

@abyrd

abyrd commented Dec 6, 2024

> Famous last words: "36 characters ought to be enough for anyone."

Really, 8 characters should be enough. Maybe with some kind of three-character type suffix. :)

But consider that 10 digits is enough to give a unique integer ID to every person in the world. And an unsigned 64-bit integer has up to 20 digits in base 10.
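
These digit counts are easy to verify. The population figure below is a rough estimate assumed for illustration only.

```python
WORLD_POPULATION_ESTIMATE = 8_200_000_000  # rough figure, assumed for illustration

ten_digit_ids = 10 ** 10                   # distinct values of a 10-digit decimal ID
max_uint64_digits = len(str(2 ** 64 - 1))  # digits in the unsigned 64-bit maximum
max_int64_digits = len(str(2 ** 63 - 1))   # digits in the signed 64-bit maximum

print(ten_digit_ids > WORLD_POPULATION_ESTIMATE, max_uint64_digits, max_int64_digits)
```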

@abyrd

abyrd commented Dec 6, 2024

> My rationale for 36 byte is that UUID has the length of 36 bytes. Not that I would ever suggest using a UUID, I would even be more in favor of making it a true integer, breaking change ;-)

A UUID is a true integer, 128 bits long. The representation you're thinking of is more compact than the decimal or binary one ;)

Just kidding of course, I don't think anyone's seriously suggesting changing the definition of IDs or minimizing their length, just providing some helpful tips to inform people how they can avoid inadvertently causing their file sizes to jump by an order of magnitude.
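
For reference, the lengths of the different UUID representations mentioned above can be checked as follows. The sample value is arbitrary.

```python
import uuid

example = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # arbitrary sample value

text_length = len(str(example))              # hyphenated hex form
hex_length = len(example.hex)                # hex digits without hyphens
max_decimal_length = len(str(2 ** 128 - 1))  # longest decimal form of a 128-bit value
binary_length = 128                          # one character per bit

print(text_length, hex_length, max_decimal_length, binary_length)
```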

@e-lo

e-lo commented Dec 12, 2024

Suggest making the recommendation user- rather than developer-centric and use digits instead of bytes.

@abyrd

abyrd commented Dec 13, 2024

> Suggest making the recommendation user- rather than developer-centric and use digits instead of bytes.

Despite the bit of discussion about numeric IDs here, I don't think any of us really want to restrict these fields to (text representations of) decimal numbers. Short IDs like "RED_LINE" or "F12A" would also be fine, whether or not developers or users are thinking of them as numbers. So might the best user-centric recommendation be in terms of "characters" rather than bytes or digits?
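
The distinction between characters and bytes only shows up for non-ASCII IDs, as this small sketch illustrates. The helper name and the third example ID are hypothetical; GTFS files are UTF-8 encoded, so that is the encoding measured here.

```python
def id_sizes(value: str) -> tuple[int, int]:
    """Return (characters, UTF-8 bytes) for an ID."""
    return len(value), len(value.encode("utf-8"))


for sample in ["RED_LINE", "F12A", "Z\u00fcrich_HB"]:
    characters, utf8_bytes = id_sizes(sample)
    print(f"{sample}: {characters} characters, {utf8_bytes} bytes")
```

For plain ASCII IDs the two measures are identical, so a recommendation phrased in characters costs nothing in precision for most feeds.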

9 participants