small fix for NMT build job #119

johnml1135 · 2024-09-03T16:48:22Z

codecov-commenter · 2024-09-03T16:50:49Z

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 88.12%. Comparing base (88f5bad) to head (6df7ece).

Files with missing lines	Patch %	Lines
...tion/huggingface/hugging_face_nmt_model_trainer.py	25.00%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
- Coverage   88.12%   88.12%   -0.01%     
==========================================
  Files         273      273              
  Lines       15987    15990       +3     
==========================================
+ Hits        14089    14091       +2     
- Misses       1898     1899       +1


ddaspit

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

machine/jobs/nmt_engine_build_job.py line 77 at r1 (raw file):

            model_trainer.train(progress=phase_progress, check_canceled=check_canceled)
            model_trainer.save()
            train_corpus_size = parallel_corpus.count()

Why was this change necessary?

johnml1135 · 2024-09-04T11:32:26Z

machine/jobs/nmt_engine_build_job.py line 77 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Why was this change necessary?

When I refactored the code to split out the SMT engines, I mirrored the SMT return type, which returns the train corpus size.

ddaspit

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

machine/jobs/nmt_engine_build_job.py line 77 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

When I refactored the code to split out the SMT engines, I mirrored the SMT return type, which returns the train corpus size.

parallel_corpus.count() will result in the full parallel corpus being read, so it would be better to use model_trainer.stats.train_corpus_size. I looked at HuggingFaceNmtModelTrainer and it looks like it might not be properly setting train_corpus_size. We should fix that bug instead.
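The cost concern above can be illustrated with a minimal sketch. `LazyParallelCorpus`, `TrainStats`, and `SketchTrainer` are hypothetical stand-ins, not the classes in this repository; the sketch only assumes that counting a streamed corpus needs a full pass, while a trainer can record the size during the pass it already makes.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


class LazyParallelCorpus:
    """Hypothetical corpus that is streamed again on every pass."""

    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        self._rows = rows
        self.full_reads = 0  # number of complete passes started

    def __iter__(self) -> Iterator[Tuple[str, str]]:
        self.full_reads += 1
        yield from self._rows

    def count(self) -> int:
        # Counting has to walk the whole corpus.
        return sum(1 for _ in self)


@dataclass
class TrainStats:
    """Hypothetical stats holder, mirroring the train_corpus_size field."""

    train_corpus_size: int = 0


class SketchTrainer:
    """Hypothetical trainer that records the corpus size while it trains."""

    def __init__(self, corpus: LazyParallelCorpus) -> None:
        self._corpus = corpus
        self.stats = TrainStats()

    def train(self) -> None:
        size = 0
        for _source, _target in self._corpus:
            size += 1  # a real trainer would also consume the pair here
        self.stats.train_corpus_size = size


corpus = LazyParallelCorpus([("hola", "hello"), ("adios", "goodbye"), ("si", "yes")])
trainer = SketchTrainer(corpus)
trainer.train()

reads_after_training = corpus.full_reads   # one pass, made by train()
size_from_stats = trainer.stats.train_corpus_size  # no extra pass needed
size_from_count = corpus.count()           # triggers a second full pass
reads_after_count = corpus.full_reads
```

In this sketch both approaches agree on the size, but only `count()` pays for an extra read of the corpus.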

johnml1135 · 2024-09-05T14:41:25Z

machine/jobs/nmt_engine_build_job.py line 77 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

parallel_corpus.count() will result in the full parallel corpus being read, so it would be better to use model_trainer.stats.train_corpus_size. I looked at HuggingFaceNmtModelTrainer and it looks like it might not be properly setting train_corpus_size. We should fix that bug instead.

I looked at the code and can't tell where the fix should go. If you already know what needs to change, can you go ahead and make it?

ddaspit

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 370 at r2 (raw file):

        self._metrics = train_result.metrics
        self._metrics["train_samples"] = len(train_dataset)
        self._stats.train_corpus_size = len(train_dataset)

You should compute the length once. Each len() call might require iterating over the entire dataset.
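A minimal sketch of that suggestion, assuming a dataset whose `__len__` is expensive. `CountingDataset` and `record_sizes` are hypothetical names used only for illustration; the actual change in the trainer may look different.

```python
from types import SimpleNamespace
from typing import Dict, Iterable


class CountingDataset:
    """Hypothetical dataset whose __len__ walks every row."""

    def __init__(self, rows: Iterable[object]) -> None:
        self._rows = list(rows)
        self.len_calls = 0  # how many times the length was computed

    def __len__(self) -> int:
        self.len_calls += 1
        return sum(1 for _ in self._rows)


def record_sizes(train_dataset: CountingDataset, metrics: Dict[str, int], stats: SimpleNamespace) -> None:
    # Compute the length a single time and reuse the value for both fields.
    train_size = len(train_dataset)
    metrics["train_samples"] = train_size
    stats.train_corpus_size = train_size


dataset = CountingDataset(range(5))
metrics: Dict[str, int] = {}
stats = SimpleNamespace(train_corpus_size=0)
record_sizes(dataset, metrics, stats)
```

Storing the value in a local variable keeps the two fields consistent and avoids a second pass over the data.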

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 382 at r2 (raw file):

        self._trainer.save_state()
        if isinstance(self._model, PreTrainedModel):
            model: PreTrainedModel = self._model

This shouldn't be necessary. The type checker can infer the type from the isinstance call.
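As a small illustration of the narrowing mentioned here: `BaseModel` and `PreTrainedLike` below are hypothetical stand-ins for the real model classes, used so the sketch runs without any third-party package. Type checkers such as Pyright and mypy narrow the variable inside the `isinstance` branch, so no annotated alias is needed.

```python
from typing import Optional


class BaseModel:
    """Hypothetical base type with no save method."""


class PreTrainedLike(BaseModel):
    """Hypothetical subclass standing in for a pretrained model."""

    def save_pretrained(self, path: str) -> str:
        return f"saved to {path}"


def save_if_pretrained(model: BaseModel, path: str) -> Optional[str]:
    if isinstance(model, PreTrainedLike):
        # Inside this branch the checker treats `model` as PreTrainedLike,
        # so the subclass method can be called on it directly.
        return model.save_pretrained(path)
    return None


saved = save_if_pretrained(PreTrainedLike(), "out")
skipped = save_if_pretrained(BaseModel(), "out")
```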

johnml1135 · 2024-09-06T16:42:35Z

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 370 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should compute the length once. Each len() call might require iterating over the entire dataset.

Done.

johnml1135 · 2024-09-06T16:51:02Z

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 382 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This shouldn't be necessary. The type checker can infer the type from the isinstance call.

Done.

ddaspit

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

johnml1135 assigned ddaspit Sep 3, 2024

ddaspit reviewed Sep 3, 2024

ddaspit requested changes Sep 4, 2024

johnml1135 force-pushed the nmt_build_stats_bug branch from 1b848ec to 6df7ece Compare September 5, 2024 21:00

ddaspit reviewed Sep 5, 2024

small fix for NMT build job

433fb93

johnml1135 force-pushed the nmt_build_stats_bug branch from 6df7ece to 433fb93 Compare September 6, 2024 16:51

ddaspit approved these changes Sep 6, 2024

johnml1135 merged commit 7987890 into main Sep 6, 2024
13 checks passed

johnml1135 deleted the nmt_build_stats_bug branch September 6, 2024 16:52
