[QUESTION] Why are the feeds from a direct_download and latest not the same? #296

Closed
wklumpen opened this issue Sep 12, 2023 · 10 comments · Fixed by #299

@wklumpen

Apologies if this is asked and answered but a quick search didn't turn anything up.

I've noticed that the feeds archived at the latest URL often do not match the datasets that come from the direct_download URL (e.g. fewer calendar_dates rows, etc.).

An example: the Arlington Transit (mdb_id = 485) direct download has calendar dates that extend to 20240131, while the latest URL has dates only to 20230902.

Is this simply because the set hasn't been updated on a recent pass?

Some further info/documentation on the differences between the two would be ideal, as I'm struggling to understand them from the current field descriptions.
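For anyone wanting to reproduce this kind of comparison, here is a minimal sketch in Python. It assumes both zips have already been downloaded into memory as bytes; the function names `table_row_counts` and `row_count_differences` are hypothetical and not part of any Mobility Database tooling.

```python
import csv
import io
import zipfile


def table_row_counts(zip_bytes: bytes) -> dict:
    """Map each .txt table in a GTFS zip to its number of data rows (header excluded)."""
    counts = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt"):
                continue
            with archive.open(name) as raw:
                # utf-8-sig drops a leading byte-order mark if the file has one.
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                rows = sum(1 for row in csv.reader(text) if row)
            counts[name] = max(rows - 1, 0)
    return counts


def row_count_differences(source_zip: bytes, latest_zip: bytes) -> dict:
    """Return {table: (source_rows, latest_rows)} for tables whose counts differ."""
    source = table_row_counts(source_zip)
    latest = table_row_counts(latest_zip)
    return {
        name: (source.get(name), latest.get(name))
        for name in sorted(set(source) | set(latest))
        if source.get(name) != latest.get(name)
    }
```

An empty result only means the row counts agree; it does not prove the two feeds are identical.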

@emmambd
Contributor

emmambd commented Sep 12, 2023

@wklumpen Thanks for flagging this issue! There shouldn't be any difference between the datasets from the two URLs. If there is, it likely indicates a bug in our GitHub Actions. Right now there's a cronjob that runs each day to check the direct_download URL and update the latest URL if the dataset at the direct_download URL has changed. Our team will take a look to see why this is happening.
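As a rough illustration of the "update latest only if the source changed" check described above, here is a sketch that compares SHA-256 digests of the two zips. This is an assumption about how such a check could work, not the actual logic of the GitHub Action; `needs_update` is a hypothetical name.

```python
import hashlib
from typing import Optional


def sha256_hex(data: bytes) -> str:
    """Hex digest of the raw zip bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_update(source_zip: bytes, latest_zip: Optional[bytes]) -> bool:
    """Return True when the archived copy is missing or differs from the source."""
    if latest_zip is None:
        return True
    return sha256_hex(source_zip) != sha256_hex(latest_zip)
```

A byte-level comparison can report a change even when the tables inside are identical (for example if the producer re-zips the same files), so it errs on the side of refreshing the archived copy.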

Example of failed action

emmambd added the bug (Something isn't working) label Sep 12, 2023
@wklumpen
Author

Of note: I've come across a few broken direct_download links, e.g. 498 Frederick County: https://maps.frederickcountymd.gov/google/google_transit.zip

I'd like to trust the latest URL as it's much more stable, but at the moment it's not, and I imagine with a broken link the stable URL wouldn't be updating anyway.

Should I raise an issue (that would then be linked presumably to a PR) for each broken URL? I don't want to come in and stomp on whatever workflow you have going for this.

@emmambd
Contributor

emmambd commented Sep 12, 2023

@wklumpen I think the broken direct_download URLs are a separate issue (stale data) from the out-of-date latest URLs (broken pipeline). You're right that you should be able to trust the latest URLs, so we'll prioritize looking at this issue ASAP.

For the direct_download links, you can open a separate issue listing all the broken URLs you've found, with one linked PR for the working replacements you've found. (1 PR for each broken URL will likely take more of your time than it's worth!)

Thanks for checking in and asking about the best approach for this — it's very helpful that you've flagged this!

@wklumpen
Author

Sounds good. There will probably be more to come as I go through basically every agency in a number of US urban areas :)

@emmambd
Contributor

emmambd commented Sep 12, 2023

@wklumpen We always welcome the help in our data updating/cleaning efforts! Really appreciate it 🚀

emmambd linked a pull request Sep 13, 2023 that will close this issue
@emmambd
Contributor

emmambd commented Sep 13, 2023

@wklumpen PR #299 has fixed this error. Arlington Transit's latest URL now pulls the most recent dataset. Let us know if you encounter this issue again - for now, it's closed.

emmambd closed this as completed Sep 13, 2023
@wklumpen
Author

Thanks! Maybe I'll write a little validation script for the feeds I'm interested in. Stale feeds will cause a problem but that's a separate issue.

@emmambd
Contributor

emmambd commented Sep 13, 2023

> Maybe I'll write a little validation script for the feeds I'm interested in

You mean to check if the datasets at latest and direct_download match? For the actual repo, we'd probably need some kind of issue/GitHub alert that fires when the Store latest datasets cronjob fails 2+ days in a row, so we can troubleshoot it. Maybe @fredericsimard has thoughts about how this could be implemented. I don't expect this issue to recur in the near future, though, thanks to #299.
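The "fails 2+ days in a row" condition could be sketched roughly as below. This assumes the outcomes of the daily runs are available as a list of strings, most recent first, with "failure" marking a failed run; how that list is fetched is left out, and `failed_n_runs_in_a_row` is a hypothetical helper, not an existing part of the repo.

```python
def failed_n_runs_in_a_row(conclusions, threshold=2):
    """Return True when the `threshold` most recent runs all failed.

    `conclusions` is ordered most recent first, e.g.
    ["failure", "failure", "success"]. With one run per day,
    threshold=2 corresponds to two failed days in a row.
    """
    recent = conclusions[:threshold]
    return len(recent) == threshold and all(c == "failure" for c in recent)
```

A single failed run does not trigger the alert, which avoids noise from one-off network errors on the source side.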

> Stale feeds will cause a problem but that's separate issue.

Stale as in there's another URL somewhere with more up-to-date data?

@wklumpen
Author

> You mean to check if the datasets at latest and direct_download match?

Yes - I was going to do this for the feeds I'm using, but internal validation on the MDB end would be even better.

> Stale as in there's another URL somewhere with more up-to-date data?

Yes, correct. I wonder if it's possible to detect when "active" feeds aren't actually being updated (e.g. the latest feed no longer covers the current date).
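A minimal sketch of that staleness check, assuming the feed zip is already in memory as bytes and that dates use the GTFS YYYYMMDD format. The names `last_service_date` and `is_stale` are hypothetical; this is one possible approach, not how the Mobility Database does or will do it.

```python
import csv
import io
import zipfile
from datetime import date, datetime


def last_service_date(zip_bytes: bytes):
    """Latest date found in calendar.txt (end_date) or calendar_dates.txt (date).

    Simplification: every calendar_dates.txt row counts, including service
    removals (exception_type 2). Returns None if no date is found.
    """
    latest = None
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = set(archive.namelist())
        for table, column in (("calendar.txt", "end_date"),
                              ("calendar_dates.txt", "date")):
            if table not in names:
                continue
            with archive.open(table) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                for row in csv.DictReader(text):
                    value = (row.get(column) or "").strip()
                    if not value:
                        continue
                    day = datetime.strptime(value, "%Y%m%d").date()
                    if latest is None or day > latest:
                        latest = day
    return latest


def is_stale(zip_bytes: bytes, today: date) -> bool:
    """True when the feed has no service dates or none on or after `today`."""
    last = last_service_date(zip_bytes)
    return last is None or last < today
```

Passing `today` explicitly keeps the check easy to test and lets a caller add a grace period by passing an earlier date.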

@emmambd
Contributor

emmambd commented Sep 18, 2023

@wklumpen re: detecting the date range from the actual GTFS calendars, we plan on opening the feeds and sharing the dynamic data from the datasets in V2 of the API we're developing right now (the logic will come from the GTFS Validator). But that won't be ready for another 3-6 months, so if you want to do validation based on the text files themselves, I'd suggest going ahead and doing it yourself.

However, re: internal validation, if you are open to just relying on our cronjob pass/fail to verify if latest and direct_download match, that could be a contribution to the Mobility Database.
