small fix for NMT build job #119

Merged
merged 1 commit into from
Sep 6, 2024

Conversation

johnml1135
Collaborator

@johnml1135 johnml1135 commented Sep 3, 2024


@codecov-commenter

codecov-commenter commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 88.12%. Comparing base (88f5bad) to head (6df7ece).

Files with missing lines                                 Patch %   Lines
...tion/huggingface/hugging_face_nmt_model_trainer.py    25.00%    6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
- Coverage   88.12%   88.12%   -0.01%     
==========================================
  Files         273      273              
  Lines       15987    15990       +3     
==========================================
+ Hits        14089    14091       +2     
- Misses       1898     1899       +1     


Contributor

@ddaspit ddaspit left a comment


Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


machine/jobs/nmt_engine_build_job.py line 77 at r1 (raw file):

            model_trainer.train(progress=phase_progress, check_canceled=check_canceled)
            model_trainer.save()
            train_corpus_size = parallel_corpus.count()

Why was this change necessary?

@johnml1135
Collaborator Author

machine/jobs/nmt_engine_build_job.py line 77 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Why was this change necessary?

While refactoring to split out the SMT engines, I was mirroring the return type and returning the train corpus size.

Contributor

@ddaspit ddaspit left a comment


Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


machine/jobs/nmt_engine_build_job.py line 77 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

While refactoring to split out the SMT engines, I was mirroring the return type and returning the train corpus size.

parallel_corpus.count() will result in the full parallel corpus being read, so it would be better to use model_trainer.stats.train_corpus_size. I looked at HuggingFaceNmtModelTrainer and it looks like it might not be properly setting train_corpus_size. We should fix that bug instead.
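
The cost difference described above can be sketched as follows. This is only an illustration: `FakeCorpus`, `FakeTrainer`, and `Stats` are hypothetical stand-ins, not the project's actual classes. Only the names `count()` and `stats.train_corpus_size` come from the discussion; the assumption is that the corpus is read lazily, so counting means iterating every row.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


class FakeCorpus:
    """Hypothetical stand-in for a lazily read parallel corpus."""

    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        self._rows = rows
        self.full_reads = 0  # number of times the whole corpus was iterated

    def __iter__(self) -> Iterator[Tuple[str, str]]:
        self.full_reads += 1
        return iter(self._rows)

    def count(self) -> int:
        # No shortcut is assumed here: counting walks every row.
        return sum(1 for _ in self)


@dataclass
class Stats:
    train_corpus_size: int = 0


class FakeTrainer:
    """Hypothetical trainer that records the corpus size while it trains."""

    def __init__(self, corpus: FakeCorpus) -> None:
        self._corpus = corpus
        self.stats = Stats()

    def train(self) -> None:
        size = 0
        for _src, _trg in self._corpus:  # the single pass training needs anyway
            size += 1
        self.stats.train_corpus_size = size


corpus = FakeCorpus([("hola", "hello"), ("adios", "goodbye"), ("si", "yes")])
trainer = FakeTrainer(corpus)
trainer.train()

size_from_stats = trainer.stats.train_corpus_size  # reuses the training pass
reads_after_stats = corpus.full_reads
size_from_count = corpus.count()  # triggers a second full read
reads_after_count = corpus.full_reads
```

Both approaches yield the same number, but in this sketch the `count()` call iterates the corpus a second time, while reading the stats does not.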

@johnml1135
Collaborator Author

machine/jobs/nmt_engine_build_job.py line 77 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

parallel_corpus.count() will result in the full parallel corpus being read, so it would be better to use model_trainer.stats.train_corpus_size. I looked at HuggingFaceNmtModelTrainer and it looks like it might not be properly setting train_corpus_size. We should fix that bug instead.

I looked at the code and can't tell where it should be fixed (added to). If you already know what fix needs to be made, can you go ahead and make it?

Contributor

@ddaspit ddaspit left a comment


Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 370 at r2 (raw file):

        self._metrics = train_result.metrics
        self._metrics["train_samples"] = len(train_dataset)
        self._stats.train_corpus_size = len(train_dataset)

You should compute the length once. This might require iterating over the entire dataset.
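
A minimal sketch of what "compute the length once" could look like. `StreamedDataset` is a hypothetical class, not the dataset type the trainer actually uses; it assumes the worst case where `len()` has to iterate every row, and it counts how often that happens.

```python
from typing import Dict, Iterable, Iterator


class StreamedDataset:
    """Hypothetical dataset whose length is only known by iterating it."""

    def __init__(self, rows: Iterable[str]) -> None:
        self._rows = list(rows)
        self.len_calls = 0  # how many times len() was requested

    def __iter__(self) -> Iterator[str]:
        return iter(self._rows)

    def __len__(self) -> int:
        self.len_calls += 1
        return sum(1 for _ in self)  # assumed to be a full pass over the data


train_dataset = StreamedDataset(["a", "b", "c", "d"])
metrics: Dict[str, float] = {}

# Store the length in a local variable and reuse it for both assignments.
train_samples = len(train_dataset)
metrics["train_samples"] = train_samples
train_corpus_size = train_samples
```

With the local variable, the potentially expensive `len()` call runs once instead of twice.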


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 382 at r2 (raw file):

        self._trainer.save_state()
        if isinstance(self._model, PreTrainedModel):
            model: PreTrainedModel = self._model

This shouldn't be necessary. The type checker can infer the type from the isinstance call.
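
A small sketch of the narrowing behavior referred to here. `PreTrainedModelLike` and `WrappedModel` are hypothetical stand-ins (the real code checks against `PreTrainedModel`), and the example narrows a function parameter rather than an attribute to keep it short.

```python
from typing import Union


class PreTrainedModelLike:
    """Hypothetical stand-in for a model class that can save itself."""

    def save_pretrained(self, path: str) -> str:
        return f"saved to {path}"


class WrappedModel:
    """Hypothetical wrapper type that has no save_pretrained method."""


def save(model: Union[PreTrainedModelLike, WrappedModel], path: str) -> str:
    if isinstance(model, PreTrainedModelLike):
        # Inside this branch, type checkers such as Pyright and mypy treat
        # `model` as PreTrainedModelLike, so no annotated alias is needed.
        return model.save_pretrained(path)
    return "skipped"


saved = save(PreTrainedModelLike(), "out")
skipped = save(WrappedModel(), "out")
```

The `isinstance` check is enough for the checker to accept the `save_pretrained` call, which is why the extra `model: PreTrainedModel = self._model` line adds nothing.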

@johnml1135
Collaborator Author

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 370 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should compute the length once. This might require iterating over the entire dataset.

Done.

@johnml1135
Collaborator Author

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 382 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This shouldn't be necessary. The type checker can infer the type from the isinstance call.

Done.

Contributor

@ddaspit ddaspit left a comment


:lgtm:

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 johnml1135 merged commit 7987890 into main Sep 6, 2024
13 checks passed
@johnml1135 johnml1135 deleted the nmt_build_stats_bug branch September 6, 2024 16:52