
random seed in parameters #948

Merged · 14 commits · Oct 16, 2024

Conversation

@silil (Contributor) commented Aug 16, 2024

Unfortunately, the random seed wasn't being set for any algorithm used in Triage. This led to unreproducible results, mainly in the outreach-generated lists.

This bug was identified while comparing the outputs Triage generated on JoCo and DSaPP sides.

This pull request adds the generated random seed to the hyperparameters of each model configured in the YAML file, inserting it into the parameters and unique_parameters dictionaries once the random seed has been generated. The changes affect the script src/triage/component/catwalk/model_trainers.py.
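The change described above can be sketched as follows. This is a minimal, hypothetical illustration (the function name and dict layout are assumptions, not Triage's actual API): the experiment-level seed is merged into each model's parameter dictionary before the classifier is instantiated.

```python
# Hypothetical sketch of the change: inject the generated random seed into a
# model's hyperparameter dict before instantiation. sklearn, xgboost, and
# lightgbm all accept the seed through the `random_state` keyword, so one
# key covers all three families.

def with_random_seed(parameters, random_seed):
    """Return a copy of the hyperparameter dict with the seed added."""
    seeded = dict(parameters)
    seeded["random_state"] = random_seed
    return seeded

params = {"n_estimators": 100, "max_depth": 5}
seeded = with_random_seed(params, 1684690576)

assert seeded["random_state"] == 1684690576
assert "random_state" not in params  # the original dict is left untouched
```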

The tests associated with this change can be found in this notebook in the DoJo repository, where three different models were trained both inside and outside of Triage using the same hyperparameters stored in the trained model by Triage, and the resulting scores were compared to verify that they are the same in both cases.

This pull request is associated with issue #951

@nanounanue (Contributor)

Hi @silil !

I will review this on my flight to Chile this Sunday. It seems correct at first glance, but I want to double-check some related code/design decisions we made at DSaPP a while ago regarding this parameter. If I recall correctly, we had a discussion about this behavior. Let me review it, and I will add comments to this PR accordingly.

@silil (Contributor, Author) commented Aug 28, 2024

@kasunamare do you have any comments?

@silil (Contributor, Author) commented Aug 28, 2024

@nanounanue Did you have a chance to look at the changes?

@nanounanue (Contributor)

I am on it, but still untangling it.

My concern is that at some point random_seed was added (#671), including the instructions in experiment.yaml, but that caused the experiment_hash to change when the seed wasn't set (if I remember correctly, issues #726 and #661), and that behaviour was fixed in #799. Sometime later @shaycrk patched a problem with existing models (they didn't use the original random seed); that was fixed in #848, which added methods like

def retrieve_existing_model_random_seeds(

And then there was an issue where, if no random_seed was specified, Python's random generator didn't work correctly in multithreaded runs (maybe I am misusing that word). But I am not able to find that discussion :(

So, I am still checking all of those behaviours.

@kasunamare (Contributor) commented Aug 30, 2024

Hey @nanounanue, from the code and the chat I had with @silil, I feel like the only change in this PR is making sure that we use the specified random seed when we initialize the model object?
Currently, the random seeds generated at the experiment level are not being passed to the constructor of the model class; this PR forces that by making the seed a part of the parameters.
This will mean that if the random seed is changed (either intentionally or by not setting one in the experiment config), it will generate a separate model_hash (and id), and will result in multiple instances of the same model_group_id for the same train_end_time. Reading your notes on PR #799, I feel like that's the desired behavior?

Am I missing something?

@silil (Contributor, Author) commented Aug 30, 2024

Thanks @nanounanue.

I didn't change how Triage generates the random seed. This update ensures that the generated random seed is consistently used as part of the hyperparameters when initializing any predictive model (sklearn, xgboost, or lightgbm). Specifically, it now explicitly assigns the random_seed to the random_state parameter. Previously, we didn't set random_state at all, which led to non-reproducible results even when using the same matrices and configuration originally used to train a model.

For example, a config file with random_seed: 23895478 will generate a random seed value of 1684690576 for the model. The latter will be saved in the triage_metadata.models table under the random_seed column (and in some other tables), while the former will be saved in the triage_metadata.triage_runs table in the random_seed column. However, the generated model random_seed won't be used when initializing the model, so sklearn, xgboost, or lightgbm would assign a different random seed that is not persisted in our pickle object. As a result, the exact results can't be reproduced.

I hope this clarifies things. Let me know if you'd like to go over this in a meeting.
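The core reproducibility problem can be illustrated with the standard library alone (using `random.Random` as a stand-in for a model's internal RNG; this is an illustration of the principle, not Triage's code): a model initialised with an explicit, persisted seed is exactly reproducible, while one that picks its own seed is not tied to anything we can store.

```python
import random

def draw_scores(seed=None, n=5):
    # Stand-in for a model whose behaviour depends on an internal RNG.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Same stored seed -> identical "scores" on a re-run.
assert draw_scores(1684690576) == draw_scores(1684690576)

# Different seeds -> different results, which is what happened when
# sklearn picked its own seed instead of receiving Triage's.
assert draw_scores(1) != draw_scores(2)
```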

@nanounanue (Contributor)

Thank you @silil and @kasunamare.

Just one more question: when you mention unreproducible results, do you mean the predicted entities list?

If you prefer, we can have a video call to look at this behaviour.

@silil (Contributor, Author) commented Sep 9, 2024

@nanounanue @kasunamare:

Following our conversation with @rayidghani last Friday, I made changes to the code. The random seed is now set only at the moment of instantiation, right before fitting the model. In this way:

  1. The hyperparameters stored in the DB don't include the random seed.
  2. The model hash is not affected by the random seed.
  3. The random seed is included in the model pickle.
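The three points above can be sketched in miniature (names here are illustrative stand-ins, not Triage's actual functions): the model hash is computed from seed-free hyperparameters, and the seed is injected only at instantiation time, so it still travels with the persisted model.

```python
import hashlib
import json
import pickle

def model_hash(parameters):
    # The seed never enters this dict, so it cannot affect the hash (point 2).
    return hashlib.md5(json.dumps(parameters, sort_keys=True).encode()).hexdigest()

class StubModel:
    # Minimal stand-in for an sklearn-style estimator.
    def __init__(self, **params):
        self.params = params

hyperparams = {"n_estimators": 100}      # stored in the DB, seed-free (point 1)
h_before = model_hash(hyperparams)

# Seed injected only right before fitting:
model = StubModel(**hyperparams, random_state=1684690576)
blob = pickle.dumps(model.params)        # the persisted model carries the seed (point 3)

assert model_hash(hyperparams) == h_before                 # hash unaffected by the seed
assert pickle.loads(blob)["random_state"] == 1684690576    # seed recoverable from the pickle
```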

I tested these changes (with @kasunamare) and the results are as follows:

When using a different experiment random seed in the yaml file but keeping the same model and hyperparameters we observe for each time split:

  • Same model group id
  • Same model random seed
  • Same model_id

In DB: [screenshot from 2024-09-09 10-28-14]

In pickle: [screenshot from 2024-09-09 10-26-00]

Is that the expected behavior? I assumed changing the random seed in the experiment would result in different model random seeds. What do you think @nanounanue @kasunamare @rayidghani?

@nanounanue (Contributor)

Regarding the model ID and model hash, I believe that's the desired behavior (or at least, that was the consensus). The rationale is that the model's random_seed is not a hyperparameter, so it shouldn't change the identifier (i.e., the model's behavior should be resilient to the seed). However, this doesn't mean we want several models with the same model ID. Again, the predictions shouldn't be heavily affected by the seed, meaning the model should be relatively "stable."

That said, I'm not sure what @rayidghani thinks about this.

Regarding the model random_state, it's generated by the following code:

def get_or_generate_random_seed(self, model_group_id, matrix_metadata, train_matrix_uuid):

I will check whether that's overwriting your code.

@nanounanue (Contributor)

If the list is very different between runs, maybe the problem is the seed in PostgreSQL (the rank and all of that).

@silil (Contributor, Author) commented Sep 12, 2024

The lists generated on both sides (DSaPP and JoCo) weren't that different, but we were expecting identical lists (we know both sides have the same data in the DB) because we assumed we were using the same random seed. That's how we found out our random seed wasn't being used in the model.

@kasunamare (Contributor) commented Sep 12, 2024

Going by the screenshot @silil posted, it looks like changing the experiment_random_seed doesn't result in a new set of model_random_seed values? I'm guessing that's not our desired behavior?

My understanding of the desired behavior for changing the random seed at the experiment level:

  1. It generates a series of new model (or model group) level random seeds
  2. These new model/model-group level seeds should not create a new model group (model_group_id), as the seed is not a hyperparameter;
  3. However, it should create new instances of the model group (models) for the same train_end_times (new model_id and model_hash)
  4. This results in multiple instances of the model group for the same train_end_times. If the model is relatively stable, the variance in performance across these models should not be high.

Is this correct @nanounanue? If so, it looks like step 1 is not working?
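Step 1 of the desired behavior above can be sketched as a deterministic derivation of model-level seeds from the experiment-level seed. The derivation shown here (a dedicated RNG seeded with the experiment seed) is an assumption for illustration, not Triage's exact implementation.

```python
import random

def generate_model_seeds(experiment_seed, n_models):
    # One dedicated RNG per experiment, so model seeds depend only on the
    # experiment seed and the order in which models are created.
    rng = random.Random(experiment_seed)
    return [rng.randint(1, 2**31 - 1) for _ in range(n_models)]

# Re-running with the same experiment seed reproduces the same model seeds...
assert generate_model_seeds(23895478, 3) == generate_model_seeds(23895478, 3)

# ...while a different experiment seed yields a different series, which is
# exactly what the screenshot suggests was NOT happening.
assert generate_model_seeds(23895478, 3) != generate_model_seeds(42, 3)
```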

@nanounanue (Contributor) commented Sep 13, 2024

Yes, totally agree.

It shouldn't be a new model group, but maybe a new model_id. So if we change the random seed, we now have several points at each as_of_date per model_group_id. Nothing is broken (apparently) in training, but it will break in Aequitas, postmodeling (maybe in how the stochastic evaluation is calculated?), and plotting, right?

@silil (Contributor, Author) commented Sep 17, 2024

@nanounanue @kasunamare

With the help of @kasunamare we were able to fix the last piece of the riddle. We were missing a condition in the join clause of this function:

def retrieve_existing_model_random_seeds(

When a new random seed is used in the experiment config file - and only that changes - we need to check whether the triage_metadata.models table already has a record associated with this triage run id; we were missing the triage run id condition on this join:

on (experiment_models.experiment_hash = triage_runs.run_hash)

Now everything works as expected: with a new random seed and everything else unchanged, we get the same experiment hash, the same model group id, and a different model id. The model random seed is saved only in the random_seed column of the triage_metadata.models table (not in the hyperparameters), and also in the trained model pickle.

[screenshot]
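A sketch of the corrected lookup query. The table names follow the comment above; the run-id column name and the query-parameter style are assumptions for illustration, not Triage's actual SQL.

```python
# Hypothetical sketch of the fix: the join that looks up an existing model's
# random seed must also match the current triage run, not just the
# experiment hash. Without the run-id restriction, a model trained under an
# earlier run of the same experiment (with a different seed) would match.

EXISTING_SEED_QUERY = """
    select models.random_seed
    from triage_metadata.experiment_models
    join triage_metadata.triage_runs
      on (experiment_models.experiment_hash = triage_runs.run_hash
          and triage_runs.id = %(run_id)s)  -- the condition that was missing
    join triage_metadata.models
      on (experiment_models.model_hash = models.model_hash)
"""

assert "triage_runs.id" in EXISTING_SEED_QUERY
```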

@silil silil marked this pull request as draft September 24, 2024 19:30
@silil silil force-pushed the hotfix_random_seed_in_models branch from c55adb7 to 6d07cc2 Compare September 25, 2024 15:03
@silil silil marked this pull request as ready for review September 26, 2024 18:28
@silil (Contributor, Author) commented Oct 15, 2024

We added logic to handle a custom random_state defined by the user in the grid_config for any model type defined in that section. If a custom random_state exists in the grid_config, Triage will use that seed for all the model ids generated from the hyperparameters defined for that model type. The custom random_state won't be saved in the hyperparameters column in the DB; it will be saved in the random_seed column of the triage_metadata.models table.
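The precedence logic described above can be sketched as follows (the function name and return shape are illustrative assumptions, not Triage's code): a user-supplied random_state in grid_config wins over the generated model seed, and in either case the chosen value is stripped from the hyperparameters before they are stored.

```python
def resolve_seed(grid_params, generated_seed):
    """Return (hyperparameters_to_store, seed_to_use).

    A user-specified random_state in grid_config takes precedence over the
    generated model seed; the seed never stays in the stored parameters.
    """
    params = dict(grid_params)
    seed = params.pop("random_state", generated_seed)
    return params, seed

# User pinned a seed in grid_config -> it wins, and is not stored:
stored, seed = resolve_seed({"max_depth": 3, "random_state": 7}, 1684690576)
assert seed == 7 and "random_state" not in stored

# No user seed -> fall back to the generated one:
stored, seed = resolve_seed({"max_depth": 3}, 1684690576)
assert seed == 1684690576 and stored == {"max_depth": 3}
```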

The changes associated with this behavior have been tested on the Donors Choose project on Linux, RedHat, and Microsoft WSL, as well as with the single-thread and multicore options, using Baselines, DecisionTrees, and Thresholder models.

@silil silil merged commit 0abb177 into master Oct 16, 2024
1 check failed
@silil silil deleted the hotfix_random_seed_in_models branch October 22, 2024 16:32