-
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port USFM code from Machine up to commit c93f8dc #118
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #118 +/- ##
==========================================
+ Coverage 87.81% 87.99% +0.17%
==========================================
Files 249 260 +11
Lines 15082 15569 +487
==========================================
+ Hits 13245 13700 +455
- Misses 1837 1869 +32 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 49 files reviewed, 3 unresolved discussions (waiting on @ddaspit)
machine/corpora/memory_paratext_project_terms_parser.py
line 1 at r13 (raw file):
from io import BytesIO
In Machine, this file is under the test folder. I moved it to machine/corpora because there are other similar "memory" files that are in the corpora folder, like usfm_memory_texts.py, that seem to only be used for tests. I can move it back to the test folder though if that would be preferable.
machine/corpora/usfm_parser.py
line 72 at r13 (raw file):
self.state.index += 1 assert self.state.token is not None
I moved up the assert statement in usfm parser so that the error checking tools don't complain about the possibility of self.state.token being None
tests/corpora/test_paratext_project_terms_parser.py
line 9 at r13 (raw file):
UsfmStylesheet, ) from machine.corpora.paratext_project_terms_parser_base import _get_glosses, _strip_parens
I know most of the time we just import from machine.corpora when we're in the test folder, rather than from the next level down inside the corpora module. However, that only works if the method is in the init.py, and I didn't think we wanted to add a private helper function for a class to init.py. Is there an issue with importing these specific methods the way I did? I also thought about making them static methods of the class they're helping, since they're static methods in Machine, but I've noticed that the preferred way to handle Machine's static methods has been to define them as private helper functions below the class.
See sillsdev/machine#236 - call it "character" |
Does this bring it up to match the tests fully? Weren't there more changes than this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 49 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/corpora/usfm_text_base.py
line 53 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
See sillsdev/machine#236 - call it "character"
Done.
tests/testutils/data/usfm/Tes/41MATTes.SFM
line 41 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Does this bring it up to match the tests fully? Weren't there more changes than this?
There aren't any more changes than this that I'm aware of. Although maybe the changes you're thinking of were in my two previous PRs, since I ended up splitting the porting process into 3 chunks so that it would be easier to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 11 files at r1, 4 of 5 files at r2, 7 of 7 files at r3, 2 of 5 files at r4, 3 of 7 files at r5, 2 of 2 files at r7, 1 of 3 files at r8, 1 of 3 files at r9, 5 of 9 files at r11, 3 of 3 files at r12, 18 of 18 files at r13, 1 of 1 files at r14, all commit messages.
Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)
machine/corpora/alignment_corpus.py
line 30 at r13 (raw file):
def count(self, include_empty: bool = True, text_ids: Optional[Iterable[str]] = None) -> int: with self.get_rows(text_ids) as rows: return len(list(rows)) if include_empty else sum(1 for r in self.get_rows(text_ids) if not r.is_empty)
You accidentally called get_rows
here. Just sum(1 for row in rows if include_empty or not row.is_empty)
should be sufficient. This will avoid unnecessarily creating a list
when include_empty
is set to True
.
machine/corpora/alignment_corpus.py
line 50 at r13 (raw file):
return _FilterAlignmentCorpus(self, predicate) def filter_texts(self, text_ids: Iterable[str]) -> AlignmentCorpus:
You should rename text_ids
to filter
, so that it is consistent with my comment on TextCorpus
. Also, the parameter should be optional.
machine/corpora/dictionary_alignment_corpus.py
line 3 at r13 (raw file):
from typing import Generator, Iterable, Optional, overload from machine.corpora.alignment_row import AlignmentRow
This import should be relative.
machine/corpora/dictionary_text_corpus.py
line 3 at r13 (raw file):
from typing import Generator, Iterable, Optional, overload from machine.corpora.text_row import TextRow
This import should be relative.
machine/corpora/memory_paratext_project_terms_parser.py
line 1 at r13 (raw file):
Previously, mshannon-sil wrote…
In Machine, this file is under the test folder. I moved it to machine/corpora because there are other similar "memory" files that are in the corpora folder, like usfm_memory_texts.py, that seem to only be used for tests. I can move it back to the test folder though if that would be preferable.
This was intentionally moved to the test folder. Because it defines "fake" files, it seems more like a test mock.
machine/corpora/parallel_text_corpus.py
line 108 at r13 (raw file):
def _get_rows( self, text_ids: Optional[Iterable[str]] = None ) -> ContextManagedGenerator[ParallelTextRow, None, None]: ...
The return type should be Generator
.
machine/corpora/parallel_text_corpus.py
line 594 at r13 (raw file):
return len(self._df) return len(self._df[(self._df[self._source_column] != "") & (self._df[self._target_column] != "")]) return sum(1 for row in self._get_rows(text_ids) if include_empty or not row.is_empty)
You should be able to filter the dataframe by the list of text ids. Something like this:
len(self._df[self._df[self._text_column].isin(text_ids_set)])
machine/corpora/parallel_text_corpus.py
line 602 at r13 (raw file):
if self._text_id_column is not None and self._text_id_column in self._df: text_id = cast(str, row[self._text_id_column]) if text_ids_set is not None and text_id not in text_ids_set:
I would move this if
up a level.
machine/corpora/parallel_text_corpus.py
line 695 at r13 (raw file):
if self._text_id_column is not None and self._text_id_column in example: text_id = cast(str, example[self._text_id_column]) if text_ids_set is not None and text_id not in text_ids_set:
I would move this if
up a level.
machine/corpora/paratext_project_settings_parser_base.py
line 15 at r13 (raw file):
@abstractmethod def exists(self, file_name: str) -> bool: ...
These methods should be prefixed with an underscore.
machine/corpora/paratext_project_terms_parser_base.py
line 8 at r13 (raw file):
from xml.etree import ElementTree from machine.corpora.paratext_project_settings import ParatextProjectSettings
This should be a relative import.
machine/corpora/paratext_project_terms_parser_base.py
line 92 at r13 (raw file):
@abstractmethod def exists(self, file_name: str) -> bool: ...
These methods should be prefixed with an underscore.
machine/corpora/paratext_project_text_updater_base.py
line 23 at r13 (raw file):
self, book_id: str, rows: Optional[List[Tuple[List[ScriptureRef], str]]] = None,
Instead of List
, you should use Sequence
.
machine/corpora/paratext_project_text_updater_base.py
line 48 at r13 (raw file):
@abstractmethod def exists(self, file_name: StrPath) -> bool: ...
These methods should be prefixed with an underscore.
machine/corpora/text_corpus.py
line 45 at r13 (raw file):
def count(self, include_empty: bool = True, text_ids: Optional[Iterable[str]] = None) -> int: with self.get_rows(text_ids) as rows: return len(list(rows)) if include_empty else sum(1 for row in rows if include_empty or not row.is_empty)
Just sum(1 for row in rows if include_empty or not row.is_empty)
should be sufficient. This will avoid unnecessarily creating a list
when include_empty
is set to True
.
machine/corpora/text_corpus.py
line 110 at r13 (raw file):
def filter_texts( self, predicate: Optional[Callable[[Text], bool]] = None, text_ids: Optional[Iterable[str]] = None
I would merge predicate
and text_ids
into a single parameter called filter
that is a Union
of the two types. You will need to check the type of filter
using something like isinstance
or callable
.
machine/corpora/usfm_memory_texts.py
line 0 at r13 (raw file):
This file name should be called usfm_memory_text.py
.
machine/corpora/usfm_parser.py
line 72 at r13 (raw file):
Previously, mshannon-sil wrote…
I moved up the assert statement in usfm parser so that the error checking tools don't complain about the possibility of self.state.token being None
That should be fine.
machine/corpora/zip_paratext_project_terms_parser.py
line 5 at r13 (raw file):
from zipfile import ZipFile from machine.corpora.paratext_project_settings import ParatextProjectSettings
These imports should be relative.
machine/corpora/zip_paratext_project_text_updater.py
line 5 at r13 (raw file):
from zipfile import ZipFile from machine.corpora.paratext_project_text_updater_base import ParatextProjectTextUpdaterBase
These imports should be relative.
tests/corpora/test_paratext_project_terms_parser.py
line 9 at r13 (raw file):
Previously, mshannon-sil wrote…
I know most of the time we just import from machine.corpora when we're in the test folder, rather than from the next level down inside the corpora module. However, that only works if the method is in the init.py, and I didn't think we wanted to add a private helper function for a class to init.py. Is there an issue with importing these specific methods the way I did? I also thought about making them static methods of the class they're helping, since they're static methods in Machine, but I've noticed that the preferred way to handle Machine's static methods has been to define them as private helper functions below the class.
This is fine when testing private functions.
tests/corpora/test_update_usfm_parser_handler.py
line 6 at r14 (raw file):
from machine.corpora import ScriptureRef, parse_usfm from machine.corpora.file_paratext_project_text_updater import FileParatextProjectTextUpdater
These classes should be imported from machine. Corpora
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/corpora/alignment_corpus.py
line 30 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You accidentally called
get_rows
here. Justsum(1 for row in rows if include_empty or not row.is_empty)
should be sufficient. This will avoid unnecessarily creating alist
wheninclude_empty
is set toTrue
.
Done.
machine/corpora/alignment_corpus.py
line 50 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You should rename
text_ids
tofilter
, so that it is consistent with my comment onTextCorpus
. Also, the parameter should be optional.
Done.
machine/corpora/dictionary_alignment_corpus.py
line 3 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This import should be relative.
Done.
machine/corpora/dictionary_text_corpus.py
line 3 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This import should be relative.
Done.
machine/corpora/memory_paratext_project_terms_parser.py
line 1 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This was intentionally moved to the test folder. Because it defines "fake" files, it seems more like a test mock.
Should we move all the memory files to the test folder then?
machine/corpora/parallel_text_corpus.py
line 108 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
The return type should be
Generator
.
Done.
machine/corpora/parallel_text_corpus.py
line 594 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You should be able to filter the dataframe by the list of text ids. Something like this:
len(self._df[self._df[self._text_column].isin(text_ids_set)])
Done.
machine/corpora/parallel_text_corpus.py
line 602 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would move this
if
up a level.
Done.
machine/corpora/parallel_text_corpus.py
line 695 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would move this
if
up a level.
Done.
machine/corpora/paratext_project_settings_parser_base.py
line 15 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
These methods should be prefixed with an underscore.
Done.
machine/corpora/paratext_project_terms_parser_base.py
line 8 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This should be a relative import.
Done.
machine/corpora/paratext_project_terms_parser_base.py
line 92 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
These methods should be prefixed with an underscore.
Done.
machine/corpora/paratext_project_text_updater_base.py
line 23 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Instead of
List
, you should useSequence
.
Done. I also changed it for some other function definitions where Machine specified an IReadOnlyList
.
machine/corpora/paratext_project_text_updater_base.py
line 48 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
These methods should be prefixed with an underscore.
Done.
machine/corpora/text_corpus.py
line 45 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Just
sum(1 for row in rows if include_empty or not row.is_empty)
should be sufficient. This will avoid unnecessarily creating alist
wheninclude_empty
is set toTrue
.
Done.
machine/corpora/text_corpus.py
line 110 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would merge
predicate
andtext_ids
into a single parameter calledfilter
that is aUnion
of the two types. You will need to check the type offilter
using something likeisinstance
orcallable
.
Done.
machine/corpora/usfm_memory_texts.py
line at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This file name should be called
usfm_memory_text.py
.
Done.
machine/corpora/zip_paratext_project_terms_parser.py
line 5 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
These imports should be relative.
Done.
machine/corpora/zip_paratext_project_text_updater.py
line 5 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
These imports should be relative.
Done.
tests/corpora/test_update_usfm_parser_handler.py
line 6 at r14 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
These classes should be imported from
machine. Corpora
.
Done. I also checked the rest of my files for this and fixed every other instance I found this occurring.
Previously, mshannon-sil wrote…
Ok - I am not reviewing in depth, I am just wanting to make sure that the testing is the same between the two. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 24 of 24 files at r15, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @mshannon-sil)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @mshannon-sil)
machine/corpora/memory_paratext_project_terms_parser.py
line 1 at r13 (raw file):
Previously, mshannon-sil wrote…
Should we move all the memory files to the test folder then?
As we discussed in the standup, you just need to move this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 50 of 53 files reviewed, 1 unresolved discussion (waiting on @ddaspit)
machine/corpora/memory_paratext_project_terms_parser.py
line 1 at r13 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
As we discussed in the standup, you just need to move this class.
Done. And just to confirm, it's okay that I'm importing it as from tests.corpora.memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser
right? I first tried to just use `from .memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser, but that caused a pytest test discovery error, so I think there's something about how packages work that I might not be understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 50 of 53 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)
tests/testutils/data/usfm/Tes/41MATTes.SFM
line 41 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Ok - I am not reviewing in depth, I am just wanting to make sure that the testing is the same between the two.
Yes, the test cases look to match up between both repos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r16, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)
machine/corpora/memory_paratext_project_terms_parser.py
line 1 at r13 (raw file):
Previously, mshannon-sil wrote…
Done. And just to confirm, it's okay that I'm importing it as
from tests.corpora.memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser
right? I first tried to just use `from .memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser, but that caused a pytest test discovery error, so I think there's something about how packages work that I might not be understanding.
Yes, that is perfectly fine.
…including zipped projects
0778bb7
to
56ba148
Compare
This is the final PR for catching machine.py up to the USFM changes in Machine (until more USFM changes are added to Machine).
This change is