Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise error when id tag doesn't match filename book id #141

Merged
merged 3 commits into from
Nov 15, 2024

Conversation

mshannon-sil
Copy link
Collaborator

@mshannon-sil mshannon-sil commented Nov 11, 2024

This PR addresses sillsdev/silnlp#574. ParatextTextCorpus and ParatextBackupTextCorpus now raise a ValueError if the book id in the filename and the \id tag inside the file don't match for a given SFM file in the Paratext project. For example, if the filename is 07JDG.SFM but the \id tag is JUD, this will now raise an error during initialization, whereas before initialization would succeed without any message to the user and would use the incorrect \id tag for that book's verse refs. I added two error messages, one for if the \id tag itself is invalid, and another for if the \id tag is valid but does not match the book id in the filename.


This change is Reviewable

@mshannon-sil mshannon-sil added the bug Something isn't working label Nov 11, 2024
@mshannon-sil mshannon-sil requested a review from ddaspit November 11, 2024 23:43
@mshannon-sil mshannon-sil self-assigned this Nov 11, 2024
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)


machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

                        settings.name,
                    )
                    with text.get_rows() as rows:

We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when get_rows is called?

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)


machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when get_rows is called?

That's fair, and yes I took a look and we should be able to do a similar check in get_rows, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)


machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

Previously, mshannon-sil wrote…

That's fair, and yes I took a look and we should be able to do a similar check in get_rows, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.

It would be ideal if we could just perform the check when we hit the \id marker.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)


machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It would be ideal if we could just perform the check when we hit the \id marker.

Done.

@codecov-commenter
Copy link

codecov-commenter commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.33%. Comparing base (183fdfb) to head (894e2ba).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #141      +/-   ##
==========================================
+ Coverage   88.30%   88.33%   +0.03%     
==========================================
  Files         275      275              
  Lines       16171    16192      +21     
==========================================
+ Hits        14280    14304      +24     
+ Misses       1891     1888       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 14 of 14 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

@mshannon-sil mshannon-sil merged commit bd8707f into main Nov 15, 2024
13 of 14 checks passed
@mshannon-sil mshannon-sil deleted the id_mismatch branch November 15, 2024 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: 👀 In review
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

3 participants