Add RETURNN compute perplexity job #563

Draft
wants to merge 2 commits into base: main
Conversation

christophmluscher (Contributor)

No description provided.

self,
returnn_config: ReturnnConfig,
returnn_model: Union[PtCheckpoint, Checkpoint],
eval_dataset: tk.Path,
Contributor:

Suggested change
eval_dataset: tk.Path,
eval_dataset: Union[tk.Path, Dict[str, Any]],

In principle any dataset is valid? Actually, does eval_datasets = {"eval": "/path/to/file"} constitute a correct dataset definition?

Contributor Author:

I have not tested whether any dataset is valid. I am assuming that any dataset with text should work, but this is meant to be used with LmDataset.

No, that is not a correct dataset definition. It needs to be something like: {"eval": {"class": "LmDataset", ...}}
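For illustration, a minimal sketch of such a mapping; the LmDataset parameter names below are assumptions based on typical RETURNN setups, not taken from this PR, and should be checked against the RETURNN documentation:

```python
# Hypothetical eval_datasets mapping with an LmDataset definition (illustrative only).
eval_datasets = {
    "eval": {
        "class": "LmDataset",
        "corpus_file": "/path/to/eval.txt.gz",          # plain-text corpus, one sentence per line
        "orth_symbols_map_file": "/path/to/vocab.txt",   # vocabulary / symbol map
        "word_based": True,
        "unknown_symbol": "<unk>",
    },
}
```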

# TODO verify paths
if isinstance(returnn_model, PtCheckpoint):
    model_path = returnn_model.path
    self.add_input(returnn_model.path)
Contributor:

I think it should already be an input to the Job, since we pass the PtCheckpoint/Checkpoint object to the Job.

Contributor Author:

Are you sure?
PtCheckpoint/Checkpoint are normal Python classes, which I think are not covered by extract_paths from sisyphus: https://github.com/rwth-i6/sisyphus/blob/master/sisyphus/tools.py#L74
Or am I missing something?

Contributor:

> Are you sure?

Never without actually testing ;)

But I am very confident, because:

  • in recognition we also just pass the checkpoint object and it works

> are not covered by extract_paths from sisyphus

  • extract_paths should arrive in the last else branch and then call get_object_state, which should then, via get_members_, descend into the dict of the Checkpoint object and return the underlying tk.Path object.
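To illustrate the traversal described above, here is a simplified stand-in (not the actual sisyphus code, just the idea of how the attributes of a Checkpoint/PtCheckpoint object get descended into):

```python
from sisyphus import tk

def collect_paths(obj, found=None):
    """Simplified sketch of what sisyphus' extract_paths/get_object_state achieve:
    recursively collect tk.Path objects, descending into containers and into the
    __dict__ of plain Python objects such as Checkpoint/PtCheckpoint."""
    found = set() if found is None else found
    if isinstance(obj, tk.Path):
        found.add(obj)
    elif isinstance(obj, (list, tuple, set)):
        for item in obj:
            collect_paths(item, found)
    elif isinstance(obj, dict):
        for value in obj.values():
            collect_paths(value, found)
    elif hasattr(obj, "__dict__"):
        # a Checkpoint/PtCheckpoint ends up here: its attributes
        # (e.g. the underlying checkpoint/index tk.Path) are inspected recursively
        for value in vars(obj).values():
            collect_paths(value, found)
    return found
```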

Contributor Author:

Ah yes, thanks for the explanation :)

returnn/training.py (outdated review thread, resolved)
shutil.move("returnn_log", self.out_returnn_log.get_path())

def gather(self):
    for data_key in self.out_perplexities.keys():
Contributor:

tk.Variable has no keys()

Contributor Author:

Thanks, I think that is left over from an earlier version...
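A possible direction for the fix discussed here, as a hedged sketch (attribute layout, file names, and the score-file format are assumptions, not the PR's actual code): keep one tk.Variable per evaluation dataset in a dict, so gather() iterates over dataset keys instead of calling .keys() on a single tk.Variable.

```python
from typing import Any, Dict
from sisyphus import Job, tk

class ComputePerplexityJobSketch(Job):
    """Illustrative sketch only, not the implementation from this PR."""

    def __init__(self, eval_datasets: Dict[str, Dict[str, Any]]):
        self.eval_datasets = eval_datasets
        # one output variable per dataset instead of a single tk.Variable
        self.out_perplexities = {key: self.output_var(f"ppl.{key}") for key in eval_datasets}

    def gather(self):
        for data_key, ppl_var in self.out_perplexities.items():
            # the per-dataset score-file name/format is an assumption for illustration
            with open(f"ppl.{data_key}.txt", "rt") as f:
                ppl_var.set(float(f.read().strip()))
```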

JackTemaki (Contributor) left a comment:

I have to say I do not like the concept of this job, because it does not really add any value on top of a ReturnnForwardJob. Also, currently we are not doing anything with the perplexities, so I wonder what the task of the Job is.

I would suggest just running a Forward Job and then passing the learning_rate files to an "ExtractPerplexities" job or so. Also, in the end we cannot guarantee anyway that people have the right settings in their config to calculate the perplexities correctly, so I feel a ComputePerplexityJob as it is here is a little misleading / fakes confidence about doing the right thing.

Another addition that you could make to an extract-PPL job is to pass the corpus and a BPE vocab, so you could directly compute word-level PPL from the BPE-level PPL.
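The BPE-to-word-level conversion only needs the token counts, since both perplexities normalize the same total log-probability. As a hedged sketch (not part of this PR):

```python
def word_level_ppl(bpe_ppl: float, num_bpe_tokens: int, num_words: int) -> float:
    """Convert subword-level perplexity to word-level perplexity.

    Both perplexities share the same total log-probability; only the
    normalization count differs, hence the exponent num_bpe_tokens / num_words.
    """
    return bpe_ppl ** (num_bpe_tokens / num_words)

# e.g. a BPE-level PPL of 40.0 with 1.2 subword tokens per word corresponds to
# a word-level PPL of 40.0 ** 1.2 ≈ 83.7
```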

michelwi (Contributor) commented Dec 9, 2024:

> it does not really add any value on top of a ReturnnForwardJob.

Maybe inherit from ReturnnForwardJob and add some checks of the returnn-config or learning-rate file parsing as a value add?
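As a rough sketch of the learning-rate-file parsing mentioned here (assuming the usual RETURNN learning_rates file format, i.e. a Python literal with EpochData entries; names and error keys are illustrative, not from this PR):

```python
def parse_returnn_learning_rate_file(path: str) -> dict:
    """Sketch: read a RETURNN learning-rate file, which is Python code of the form
    {1: EpochData(learningRate=..., error={"dev_score_...": ...}), ...}."""
    with open(path, "rt") as f:
        text = f.read()

    def EpochData(learningRate, error):
        # stand-in so the literal can be evaluated without RETURNN installed
        return {"learning_rate": learningRate, "error": error}

    return eval(text, {"EpochData": EpochData, "nan": float("nan"), "inf": float("inf")})
```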

Co-authored-by: michelwi <[email protected]>
christophmluscher (Contributor Author):

Thanks for the feedback. I think @JackTemaki makes a good point: automatically adding the word-level PPL for subword-level models would be a useful addition.

In summary:

  • Take output from ReturnnForwardJob
  • Word-level PPL for subword-level models

What do you think of @michelwi's suggestion:

> Maybe inherit from ReturnnForwardJob and add some checks of the returnn-config or learning-rate file parsing as a value add?
Inheriting would mean one less job, but depending on the pipeline you could end up rerunning the forward part several times. So I would go with a separate job instead. What are your thoughts on this?

michelwi (Contributor):

> Inheriting would mean one less job, but depending on the pipeline you could end up rerunning the forward part several times. So I would go with a separate job instead. What are your thoughts on this?

Forwarding should not be done more than needed. I don't currently see how we would do unnecessary forwardings (i.e., I think each "compute PPL from LR file" would go with exactly one forwarding), but I have no hard feelings against separate jobs if that makes more sense.

christophmluscher (Contributor Author):

I was thinking about a situation where you compute the PPL of a model on a dataset and additionally do some other step, for example some sort of prior computation... but maybe that is a bit far-fetched 🤔
