Add RETURNN compute perplexity job #563
base: main
Conversation
        self,
        returnn_config: ReturnnConfig,
        returnn_model: Union[PtCheckpoint, Checkpoint],
        eval_dataset: tk.Path,
Suggested change:
- eval_dataset: tk.Path,
+ eval_dataset: Union[tk.Path, Dict[str, Any]],
in principle any dataset is valid? Actually, does eval_datasets = {"eval": "/path/to/file"}
constitute a correct dataset definition?
I have not tested if any dataset is valid. I am assuming that any dataset with text should be valid... but this should be used with LmDataset.
No, that is not a correct dataset definition. It needs to be something like: {"eval": {"class": "LmDataset", ...}}
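For reference, a minimal sketch of what such a definition could look like (the LmDataset option names and file paths are illustrative assumptions, not taken from this PR; check the RETURNN LmDataset documentation for the exact parameters):

```python
# Sketch only: a possible eval dataset definition using RETURNN's LmDataset.
# Option names and paths below are illustrative assumptions.
eval_datasets = {
    "eval": {
        "class": "LmDataset",
        "corpus_file": "/path/to/eval.txt.gz",          # hypothetical text corpus
        "orth_symbols_map_file": "/path/to/vocab.txt",  # hypothetical vocabulary
        "seq_end_symbol": "</s>",
    }
}
```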
        # TODO verify paths
        if isinstance(returnn_model, PtCheckpoint):
            model_path = returnn_model.path
            self.add_input(returnn_model.path)
I think it should already be an input to the Job, as we pass the PtCheckpoint/Checkpoint object to the Job.
Are you sure??
PtCheckpoint/Checkpoint are normal Python classes, which I think are not covered by extract_paths from sisyphus (https://github.com/rwth-i6/sisyphus/blob/master/sisyphus/tools.py#L74), or am I missing something?
> Are you sure??

Never without actually testing ;) but I am very confident, because:

- in recognition we also just give the checkpoint object and it works

> are not covered by the extract_paths from sisyphus

- extract_paths should arrive in the last else branch and then call get_object_state, which should then, via get_members_, descend into the dict of the Checkpoint object and return the underlying tk.Path object.
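For illustration, a simplified re-implementation of that descent (not sisyphus's actual code): extract_paths recursively walks containers and, for generic objects, their attribute dict, until it hits tk.Path instances, which is why the path inside a Checkpoint ends up registered as a job input.

```python
# Simplified illustration (NOT the real sisyphus implementation) of how
# extract_paths can find the tk.Path hidden inside a Checkpoint-like object.
from sisyphus import tk


def find_paths(obj, found=None):
    if found is None:
        found = []
    if isinstance(obj, tk.Path):
        found.append(obj)
    elif isinstance(obj, dict):
        for value in obj.values():
            find_paths(value, found)
    elif isinstance(obj, (list, tuple, set)):
        for item in obj:
            find_paths(item, found)
    elif hasattr(obj, "__dict__"):
        # roughly what the descent into object state achieves for arbitrary objects
        find_paths(vars(obj), found)
    return found
```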
ah yes. thanks for the explanation :)
        shutil.move("returnn_log", self.out_returnn_log.get_path())

    def gather(self):
        for data_key in self.out_perplexities.keys():
tk.Variable has no keys()
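A hypothetical sketch of what this means (attribute and method names are illustrative, not the PR's actual code): with one tk.Variable per evaluated data key, out_perplexities is a plain dict and can be iterated, whereas a single tk.Variable cannot.

```python
# Hypothetical sketch: one output variable per data key makes gather() iterable.
from sisyphus import Job


class ComputePerplexityJob(Job):
    def __init__(self, eval_datasets):
        # dict of data_key -> tk.Variable instead of a single tk.Variable
        self.out_perplexities = {
            data_key: self.output_var(f"ppl.{data_key}")
            for data_key in eval_datasets
        }

    def gather(self):
        for data_key, ppl_var in self.out_perplexities.items():
            ppl = ...  # parse the perplexity for data_key from the job's output
            ppl_var.set(ppl)
```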
Thanks, I think that is from an earlier version...
I have to say I do not like the concept of this job, because it does not really add any value on top of a ReturnnForwardJob. Also, currently we are not doing anything with the perplexities, so I wonder what the task of the Job is.
I would suggest to just run a Forward Job and then pass the learning_rate files to an "ExtractPerplexities" Job or so. Also, in the end we cannot guarantee anyway that people have the right settings in their config to calculate the perplexities correctly, so I feel a ComputePerplexityJob in the form it has here is a little misleading / fakes confidence about doing the right thing.
Another addition you could make to an extract-PPL job is to pass the corpus and a BPE vocab, so you could directly compute the word-level PPL from the BPE-level PPL.
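The BPE-to-word conversion is just a renormalization: the total log-probability of the corpus stays the same, only the token count used for averaging changes. A minimal sketch (the function is illustrative and not part of this PR):

```python
import math


def word_level_ppl(bpe_ppl: float, num_bpe_tokens: int, num_words: int) -> float:
    """Convert a BPE-level perplexity to a word-level perplexity.

    Perplexity is exp(-log_prob / num_tokens), so re-normalizing the same
    total log-probability by the word count gives the word-level value.
    """
    total_neg_log_prob = num_bpe_tokens * math.log(bpe_ppl)
    return math.exp(total_neg_log_prob / num_words)
```

The word count would come from the reference corpus and the BPE token count from applying the vocab to it, which is why the corpus and the BPE vocab would need to be passed to such a job.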
Maybe inherit from
Co-authored-by: michelwi <[email protected]>
Thanks for the feedback. I think @JackTemaki makes a good point, and automatically adding the word-level PPL for subword-level models would be a useful addition.
To sum up: what do you think of @michelwi's suggestion?
> forwarding should not be done more than needed

I don't currently see how we do unnecessary forwardings (i.e. I think each "compute PPL from LR file" would go with exactly one forwarding), but I have no hard feelings against separate jobs if it makes more sense.
I was thinking about a situation where you compute the PPL of a model on a dataset and additionally do some other step, for example some kind of prior computation... but maybe that is a bit far-fetched... 🤔