-
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raise error when id tag doesn't match filename book id #141
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)
machine/corpora/paratext_backup_text_corpus.py
line 33 at r1 (raw file):
settings.name, ) with text.get_rows() as rows:
We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when get_rows
is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)
machine/corpora/paratext_backup_text_corpus.py
line 33 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when
get_rows
is called?
That's fair, and yes I took a look and we should be able to do a similar check in get_rows
, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)
machine/corpora/paratext_backup_text_corpus.py
line 33 at r1 (raw file):
Previously, mshannon-sil wrote…
That's fair, and yes I took a look and we should be able to do a similar check in
get_rows
, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.
It would be ideal if we could just perform the check when we hit the \id
marker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)
machine/corpora/paratext_backup_text_corpus.py
line 33 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
It would be ideal if we could just perform the check when we hit the
\id
marker.
Done.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #141 +/- ##
==========================================
+ Coverage 88.30% 88.33% +0.03%
==========================================
Files 275 275
Lines 16171 16192 +21
==========================================
+ Hits 14280 14304 +24
+ Misses 1891 1888 -3 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 14 of 14 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)
60c78d6
to
894e2ba
Compare
This PR addresses sillsdev/silnlp#574. ParatextTextCorpus and ParatextBackupTextCorpus now raise a ValueError if the book id in the filename and the \id tag inside the file don't match for a given SFM file in the Paratext project. For example, if the filename is 07JDG.SFM but the \id tag is JUD, this will now raise an error during initialization, whereas before initialization would succeed without any message to the user and would use the incorrect \id tag for that book's verse refs. I added two error messages, one for if the \id tag itself is invalid, and another for if the \id tag is valid but does not match the book id in the filename.
This change is