Validator v6 Fails to Validate GTFS File (UNPARSABLE_ROWS) #1918

Open
tafflin opened this issue Nov 7, 2024 · 5 comments
Labels: bug, status: Needs triage

Comments

tafflin commented Nov 7, 2024

Describe the bug

At busmaps.com we are experiencing an issue with the Mobility Data Validator v6 when validating GTFS feeds that were previously validated successfully in version 5.0.1. Specifically, the latest version fails to process certain files with route names containing specific characters, which were acceptable in v5.0.1.

The GTFS file in question, located at GTFS File URL, includes route names such as "Funo - Z.I. Còde Fabbri", containing non-standard symbols. In v6, this causes the validator to label certain rows as unparseable, which subsequently blocks all validation rules from executing for affected files, including routes validation. This behavior differs from v5.0.1, which processed these routes without issues, allowing complete validation.

Validation Log Summary:

agency.txt – 1 row
calendar_dates.txt – 7,694 rows
feed_info.txt – 1 row
routes.txt – UNPARSABLE_ROWS
shapes.txt – 440,628 rows
stop_times.txt – 663,387 rows
stops.txt – 6,578 rows
trips.txt – 21,473 rows

Additional Information:
We validate over 3,000 GTFS feeds, and consistency across versions is critical for our use case. This change in behavior has introduced significant challenges in managing our validation workflow.

Steps/Code to Reproduce

Steps to Reproduce:

  1. Download the GTFS file from the provided URL.

  2. Run the Mobility Data Validator on an Ubuntu system using the minimal command outlined in the documentation:

    java -jar gtfs-validator-5.0.1-cli.jar -i {path to the GTFS file} -o {name of the output directory that will be created}
    java -jar gtfs-validator-6.0.0-cli.jar -i {path to the GTFS file} -o {name of the output directory that will be created}

  3. Observe the failure in routes.txt and the unparseable row errors.

Expected Results

The validator should successfully parse and validate the GTFS file, including all rows in routes.txt without marking them as unparseable, even if route names contain non-standard characters. The validation should complete with a full report of any detected issues across all files, as it did in version 5.0.1. If any encoding-related issues are detected in routes.txt, they should be logged as warnings rather than errors that block further rule execution.

Actual Results

When running the validation in version 6, the routes.txt file is marked as containing "UNPARSABLE_ROWS," preventing further validation of its contents. This differs from version 5.0.1, where the file was fully validated even with non-standard characters in route names. As a result, validation rules for routes are not executed, and a complete validation report is not generated.

Screenshots

No response

Files used

No response

Validator version

6.0.0

Operating system

Linux Ubuntu 22.04

Java version

openjdk version "17.0.12" 2024-07-16

Additional notes

No response

tafflin added the bug and status: Needs triage labels Nov 7, 2024

welcome bot commented Nov 7, 2024

Thanks for opening your first issue in this project! If you haven't already, you can join our slack and join the #gtfs-validators channel to meet our awesome community. Come say hi 👋!

Welcome to the community and thank you for your engagement in open source! 🎉

github-project-automation bot moved this to Requires investigation in Bug triage Nov 7, 2024

emmambd (Contributor) commented Nov 7, 2024

Hi @tafflin - thanks for raising this concern. You can follow the development of the invalid_character rule here for more context: #1840

Could you share approximately how many feeds in your pipeline are impacted by the invalid_character rule? Part of our rationale for making this an error rather than a warning was that fewer than 1% of feeds were impacted by it in our acceptance tests.

qcdyx self-assigned this Nov 7, 2024

tafflin (Author) commented Nov 8, 2024

If I understand correctly, the issue lies with the replacement character '�' (\uFFFD). I haven’t yet validated all our GTFS feeds, but according to our database, this issue is likely to affect five GTFS feeds: four in routes.txt and one in stops.txt. Here are a few additional examples:

"Facolt�di Ingegneria - Ospedale Maggiore - Stazione Centrale - Fiera - Facolt�di Agraria"
"Piazza dell' Unit� Opedale Maggiore"
"Rotterdam, Selma Lagerl�fweg"
"Rotterdam, Ca�rostraat"
"Vlaardingen, Verploegh Chass�plein"

The problem is that one of these feeds contains 20,000 routes, and as of now it cannot be validated correctly by v6.0.0.
What is the expected fix for values flagged by the invalid characters rule? Should the characters be deleted or replaced with some other character?
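
To gauge how many rows are affected before choosing a fix, a scan for U+FFFD across a feed may help. This is a minimal sketch and not part of the validator: `find_replacement_chars` is a hypothetical helper name, and it assumes the feed is a zip archive of UTF-8 CSV files with a header row.

```python
import csv
import io
import zipfile

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character


def find_replacement_chars(feed_path):
    """Yield (file name, line number, field name, value) for each field containing U+FFFD.

    Hypothetical helper for illustration; it is not part of the GTFS validator.
    """
    with zipfile.ZipFile(feed_path) as feed:
        for name in feed.namelist():
            if not name.endswith(".txt"):
                continue
            with feed.open(name) as raw:
                # utf-8-sig drops an optional BOM. errors="replace" means invalid
                # byte sequences also surface as U+FFFD instead of raising.
                text = io.TextIOWrapper(
                    raw, encoding="utf-8-sig", errors="replace", newline=""
                )
                reader = csv.DictReader(text)
                for row in reader:
                    for field, value in row.items():
                        # Skip missing values and surplus columns (non-string values).
                        if isinstance(value, str) and REPLACEMENT_CHAR in value:
                            # line_num is the last physical line of the current record.
                            yield name, reader.line_num, field, value


if __name__ == "__main__":
    import sys

    for hit in find_replacement_chars(sys.argv[1]):
        print(*hit, sep="\t")
```

Run against a feed, this lists every affected field, which makes it easier to see whether the problem is confined to a few names or spread across thousands of rows.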

emmambd (Contributor) commented Nov 12, 2024

@tafflin The solution in this case would be to replace the � characters with the correct accented characters, encoded as UTF-8. This issue often happens when the string was not originally encoded as UTF-8, so when it is converted to UTF-8 the accented characters are replaced with �.

Since this text is rider-facing (stop names, route names), it can impact their experience.
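
The mechanism described above can be reproduced in a few lines. This is a sketch under stated assumptions: the sample string "Città Alta" and the Latin-1 source encoding are illustrative and not taken from the affected feeds, and both function names are hypothetical.

```python
def simulate_bad_conversion(text, source_encoding="latin-1"):
    """Encode text in a legacy encoding, then decode the bytes as if they were UTF-8.

    Bytes that are not valid UTF-8 are replaced with U+FFFD, which is how the
    accented character is lost.
    """
    raw = text.encode(source_encoding)
    return raw.decode("utf-8", errors="replace")


def decode_with_known_encoding(raw, source_encoding="latin-1"):
    """Decode the original bytes with their real encoding (assumed to be known)."""
    return raw.decode(source_encoding)


original = "Città Alta"  # illustrative value, not from the feeds in this issue
mangled = simulate_bad_conversion(original)
print(mangled)  # the accented letter is now U+FFFD

# Once U+FFFD is written out, the original byte is gone. The repair has to start
# from the source data and decode it with the correct encoding before exporting
# the feed as UTF-8.
source_bytes = original.encode("latin-1")
repaired = decode_with_known_encoding(source_bytes)
utf8_bytes = repaired.encode("utf-8")
```

The point of the sketch is that the replacement is lossy: U+FFFD cannot be mapped back to the original letter from the exported file alone, so the fix belongs in the export step that produces the feed.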

emmambd (Contributor) commented Dec 2, 2024

Hi @tafflin! Following up on this to see whether you were able to find a solution.
