-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
random seed in parameters #948
Conversation
Hi @silil ! I will review this on my flight to Chile this Sunday. It seems correct at first glance, but I want to double-check some related code/design decisions we made at DSaPP a while ago regarding this parameter. If I recall correctly, we had a discussion about this behavior. Let me review it, and I will add comments to this PR accordingly. |
@kasunamare do you have any comments? |
@nanounanue Did you have a chance to look at the changes? |
I am on it. But still untangling it. My concern is that in some point of time triage/src/triage/component/catwalk/utils.py Line 253 in 18bbbe4
And then, it was an issue related to when no random_seed is specified, the RandomGenerator of python didn't work correctly in multithread (maybe I am misusing that word). But I am not able to find that discussion :( So, I am still checking all of those behaviours. |
hey @nanounanue, From the code and the chat I had with @silil, I feel like the only change of this PR is making sure that we use the specified random seed when we initialize the model object? Am I missing something? |
Thanks @nanounanue. I didn't change how Triage generates the random seed. This update ensures that the generated random seed is consistently used as part of the hyperparameters when initializing any predictive model ( For example, a config file with I hope this clarifies things. Let me know if you'd like to go over this in a meeting. |
Thank you @silil and @kasunamare. Just one more question, when you mention unreproducible (is this a word?) results ... do you mean the predicted entities list? If you prefer we can have a video call to see this behaviour |
Following our conversation with @rayidghani last Friday, I made changes to the code. The random seed is now set only at the moment of instantiation, right before fitting the model. In this way:
I tested these changes (with @kasunamare) and the results are as follows: When using a different experiment random seed in the
Is that the expected behavior? I assumed changing the random seed in the experiment would result in different model random seeds. What do you think @nanounanue @kasunamare @rayidghani? |
Regarding the model ID and model hash, I believe that's the desired behavior (or at least, that was the consensus). The rationale is that the random_seed of the model is not a hyperparameter, so it shouldn't change the identifier (i.e., the model’s behavior should be resilient to the seed). However, this doesn't mean we want several models with the same model ID. Again, the predictions shouldn't be heavily affected by the seed, meaning the model should be relatively "stable." That said, I’m not sure what @rayidghani thinks about this. Regarding, the model random_state, that's generated based on the following code
I will check if that's overwriting your code. |
If the list is very different between runs, maybe the problem is the seed in postgresql (the rank and all of that) |
The lists generated on both sides (DSaPP and JoCo) weren't that different, but we were expecting to have the same list (we know we have the same data in the DB on both sides), we were assuming that because we were using the same random seed, that's how we found out we weren't using our random seed in the model. |
Going by the screenshot @silil has posted, it looks like changing the My understanding of the desired behavior for changing the random seed at the experiment level:
Is this correct @nanounanue? If so, looks like Step 1. is not working? |
Yes, totally agree. It shouldn't be a new |
With the help of @kasunamare we were able to fix the last part of the riddle. We were missing a condition in the join clause of this function. triage/src/triage/component/catwalk/utils.py Line 253 in 18bbbe4
When a new random seed is used in the experiment config file -and only that- we need to check if the triage/src/triage/component/catwalk/utils.py Line 271 in 18bbbe4
Now everything works as expected -> with a new random seed and everything else we have the same experiment hash, same model group id, different model id. The model random seed is only saved in the |
c55adb7
to
6d07cc2
Compare
We added logic to deal with the case of a custom The changes associated with this behavior have been tested in Donors Choose project in Linux, RedHat, and Microsoft (WSL), as well as by using SingleThread and MultiCore options with Baselines, DecisionTrees, and Thresholder models. |
Unfortunately, the random seed wasn't being set on any algorithm used in Triage. This leads to unreproducible results, mainly in the outreach-generated lists.
This bug was identified while comparing the outputs Triage generated on JoCo and DSaPP sides.
This pull request adds the random seed to the hyper-parameters of each model set up on the YAML file on the dictionary of the
parameters
andunique_parameters
once the random seed has been generated. The changes affect the scriptsrc/triage/component/catwalk/model_trainers.py
.The tests associated with this change can be found in this notebook in the DoJo repository, where 3 different models have been trained in and outside of Triage using the same hyperparameters stored in the trained model by Triage and the scores obtained compared to verify that in both cases the scores are the same.
This pull request is associated with issue #951