
random seed in parameters #948

Merged · 14 commits · Oct 16, 2024

Conversation

@silil (Contributor) commented Aug 16, 2024

Unfortunately, the random seed wasn't being set for any algorithm used in Triage. This led to unreproducible results, mainly in the outreach-generated lists.

This bug was identified while comparing the outputs Triage generated on JoCo and DSaPP sides.

This pull request adds the generated random seed to the hyperparameters of each model configured in the YAML file, inserting it into the parameters and unique_parameters dictionaries once the random seed has been generated. The changes affect the script src/triage/component/catwalk/model_trainers.py.
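The change described above can be sketched as follows. This is a minimal, hypothetical illustration (the function name and dict layout are assumptions, not Triage's actual API): the experiment-level seed is merged into each model's parameter dictionary before the classifier is instantiated.

```python
# Hypothetical sketch of the change: inject the generated random seed into a
# model's hyperparameter dict before instantiation. sklearn, xgboost, and
# lightgbm all accept the seed through the `random_state` keyword, so one
# key covers all three families.

def with_random_seed(parameters, random_seed):
    """Return a copy of the hyperparameter dict with the seed added."""
    seeded = dict(parameters)
    seeded["random_state"] = random_seed
    return seeded

params = {"n_estimators": 100, "max_depth": 5}
seeded = with_random_seed(params, 1684690576)

assert seeded["random_state"] == 1684690576
assert "random_state" not in params  # the original dict is left untouched
```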

The tests associated with this change can be found in this notebook in the DoJo repository, where three different models were trained both inside and outside of Triage using the same hyperparameters stored in the trained model by Triage, and the resulting scores were compared to verify that they are the same in both cases.

This pull request is associated with issue #951

@nanounanue (Contributor)

Hi @silil !

I will review this on my flight to Chile this Sunday. It seems correct at first glance, but I want to double-check some related code/design decisions we made at DSaPP a while ago regarding this parameter. If I recall correctly, we had a discussion about this behavior. Let me review it, and I will add comments to this PR accordingly.

@silil (Contributor, Author) commented Aug 28, 2024

@kasunamare do you have any comments?

@silil (Contributor, Author) commented Aug 28, 2024

@nanounanue Did you have a chance to look at the changes?

@nanounanue (Contributor)

I am on it, but still untangling it.

My concern is that at some point random_seed was added (#671), including the instructions in experiment.yaml, but that caused the experiment_hash to change when the seed wasn't set (if I remember correctly, issues #726 and #661), and that behaviour was fixed in #799. Sometime later @shaycrk patched a problem with existing models (they didn't use the original random seed); that was fixed in #848, which added methods like

def retrieve_existing_model_random_seeds(

And then there was an issue where, if no random_seed was specified, Python's random generator didn't work correctly in multithreaded runs (maybe I am misusing that word). But I am not able to find that discussion :(

So, I am still checking all of those behaviours.

@kasunamare (Contributor) commented Aug 30, 2024

Hey @nanounanue, from the code and the chat I had with @silil, I feel like the only change in this PR is making sure that we use the specified random seed when we initialize the model object?
Currently, the random seeds generated at the experiment level are not being passed to the constructor of the model class; this PR forces that by making the seed a part of the parameters.
This will mean that if the random seed is changed (either intentionally or by not setting one in the experiment config), it will generate a separate model_hash (and id), and will result in multiple instances of the same model_group_id for the same train_end_time. Reading your notes on PR #799, I feel like that's the desired behavior?

Am I missing something?

@silil (Contributor, Author) commented Aug 30, 2024

Thanks @nanounanue.

I didn't change how Triage generates the random seed. This update ensures that the generated random seed is consistently used as part of the hyperparameters when initializing any predictive model (sklearn, xgboost, or lightgbm). Specifically, it now explicitly assigns the random_seed to the random_state parameter. Previously, we didn't set random_state at all, which led to non-reproducible results even when using the same matrices and configuration originally used to train a model.

For example, a config file with random_seed: 23895478 will generate a random seed value of 1684690576 for the model. The latter will be saved in the triage_metadata.models table under the random_seed column (and in some other tables), while the former will be saved in the triage_metadata.triage_runs table in the random_seed column. However, the generated model random_seed won't be used when initializing the model, so sklearn, xgboost, or lightgbm would assign a different random seed that is not persisted in our pickle object. As a result, the exact results can't be reproduced.

I hope this clarifies things. Let me know if you'd like to go over this in a meeting.
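The core reproducibility problem can be illustrated with the standard library alone (using `random.Random` as a stand-in for a model's internal RNG; this is an illustration of the principle, not Triage's code): a model initialised with an explicit, persisted seed is exactly reproducible, while one that picks its own seed is not tied to anything we can store.

```python
import random

def draw_scores(seed=None, n=5):
    # Stand-in for a model whose behaviour depends on an internal RNG.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Same stored seed -> identical "scores" on a re-run.
assert draw_scores(1684690576) == draw_scores(1684690576)

# Different seeds -> different results, which is what happened when
# sklearn picked its own seed instead of receiving Triage's.
assert draw_scores(1) != draw_scores(2)
```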

@nanounanue (Contributor)

Thank you @silil and @kasunamare.

Just one more question: when you mention unreproducible results, do you mean the predicted entities list?

If you prefer, we can have a video call to look at this behaviour.

@silil (Contributor, Author) commented Sep 9, 2024

@nanounanue @kasunamare:

Following our conversation with @rayidghani last Friday, I made changes to the code. The random seed is now set only at the moment of instantiation, right before fitting the model. In this way:

  1. The hyperparameters stored in the DB don't include the random seed.
  2. The model hash is not affected by the random seed.
  3. The random seed is included in the model pickle.
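The three points above can be sketched in miniature (names here are illustrative stand-ins, not Triage's actual functions): the model hash is computed from seed-free hyperparameters, and the seed is injected only at instantiation time, so it still travels with the persisted model.

```python
import hashlib
import json
import pickle

def model_hash(parameters):
    # The seed never enters this dict, so it cannot affect the hash (point 2).
    return hashlib.md5(json.dumps(parameters, sort_keys=True).encode()).hexdigest()

class StubModel:
    # Minimal stand-in for an sklearn-style estimator.
    def __init__(self, **params):
        self.params = params

hyperparams = {"n_estimators": 100}      # stored in the DB, seed-free (point 1)
h_before = model_hash(hyperparams)

# Seed injected only right before fitting:
model = StubModel(**hyperparams, random_state=1684690576)
blob = pickle.dumps(model.params)        # the persisted model carries the seed (point 3)

assert model_hash(hyperparams) == h_before                 # hash unaffected by the seed
assert pickle.loads(blob)["random_state"] == 1684690576    # seed recoverable from the pickle
```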

I tested these changes (with @kasunamare) and the results are as follows:

When using a different experiment random seed in the yaml file but keeping the same model and hyperparameters we observe for each time split:

  • Same model group id
  • Same model random seed
  • Same model_id

In DB: [screenshot from 2024-09-09 10-28-14]

In pickle: [screenshot from 2024-09-09 10-26-00]

Is that the expected behavior? I assumed changing the random seed in the experiment would result in different model random seeds. What do you think @nanounanue @kasunamare @rayidghani?

@nanounanue (Contributor)

Regarding the model ID and model hash, I believe that's the desired behavior (or at least, that was the consensus). The rationale is that the model's random_seed is not a hyperparameter, so it shouldn't change the identifier (i.e., the model's behavior should be resilient to the seed). However, this doesn't mean we want several models with the same model ID. Again, the predictions shouldn't be heavily affected by the seed, meaning the model should be relatively "stable."

That said, I'm not sure what @rayidghani thinks about this.

Regarding the model random_state, it's generated by the following code:

def get_or_generate_random_seed(self, model_group_id, matrix_metadata, train_matrix_uuid):

I will check whether that's overwriting your code.

@nanounanue (Contributor)

If the list is very different between runs, maybe the problem is the seed in PostgreSQL (the rank and all of that).

@silil (Contributor, Author) commented Sep 12, 2024

The lists generated on both sides (DSaPP and JoCo) weren't that different, but we were expecting identical lists (we know both sides have the same data in the DB) because we assumed we were using the same random seed. That's how we found out our random seed wasn't being used in the model.

@kasunamare (Contributor) commented Sep 12, 2024

Going by the screenshot @silil posted, it looks like changing the experiment_random_seed doesn't result in a new set of model_random_seed values? I'm guessing that's not our desired behavior?

My understanding of the desired behavior for changing the random seed at the experiment level:

  1. It generates a series of new model (or model group) level random seeds
  2. These new model/model-group level seeds should not create a new model group (model_group_id), as the seed is not a hyperparameter;
  3. However, it should create new instances of the model group (models) for the same train_end_times (new model_id and model_hash)
  4. This results in multiple instances of the model group for the same train_end_times. If the model is relatively stable, the variance in performance across these models should not be high.

Is this correct @nanounanue? If so, it looks like step 1 is not working?
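Step 1 of the desired behavior above can be sketched as a deterministic derivation of model-level seeds from the experiment-level seed. The derivation shown here (a dedicated RNG seeded with the experiment seed) is an assumption for illustration, not Triage's exact implementation.

```python
import random

def generate_model_seeds(experiment_seed, n_models):
    # One dedicated RNG per experiment, so model seeds depend only on the
    # experiment seed and the order in which models are created.
    rng = random.Random(experiment_seed)
    return [rng.randint(1, 2**31 - 1) for _ in range(n_models)]

# Re-running with the same experiment seed reproduces the same model seeds...
assert generate_model_seeds(23895478, 3) == generate_model_seeds(23895478, 3)

# ...while a different experiment seed yields a different series, which is
# exactly what the screenshot suggests was NOT happening.
assert generate_model_seeds(23895478, 3) != generate_model_seeds(42, 3)
```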

@nanounanue (Contributor) commented Sep 13, 2024

Yes, totally agree.

It shouldn't be a new model group, but maybe a new model_id. So if we change the random seed, we now have several points at each as_of_date per model_group_id. Nothing is broken (apparently) in training, but it will break in Aequitas, postmodeling (maybe in how the stochastic evaluation is calculated?), and plotting, right?

@silil (Contributor, Author) commented Sep 17, 2024

@nanounanue @kasunamare

With the help of @kasunamare we were able to fix the last piece of the riddle. We were missing a condition in the join clause of this function:

def retrieve_existing_model_random_seeds(

When a new random seed is used in the experiment config file - and only that changes - we need to check whether the triage_metadata.models table already has a record associated with this triage run id; we were missing the triage run id condition on this join:

on (experiment_models.experiment_hash = triage_runs.run_hash)

Now everything works as expected: with a new random seed and everything else unchanged, we get the same experiment hash, the same model group id, and a different model id. The model random seed is saved only in the random_seed column of the triage_metadata.models table (not in the hyperparameters), and also in the trained model pickle.

[screenshot]
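A sketch of the corrected lookup query. The table names follow the comment above; the run-id column name and the query-parameter style are assumptions for illustration, not Triage's actual SQL.

```python
# Hypothetical sketch of the fix: the join that looks up an existing model's
# random seed must also match the current triage run, not just the
# experiment hash. Without the run-id restriction, a model trained under an
# earlier run of the same experiment (with a different seed) would match.

EXISTING_SEED_QUERY = """
    select models.random_seed
    from triage_metadata.experiment_models
    join triage_metadata.triage_runs
      on (experiment_models.experiment_hash = triage_runs.run_hash
          and triage_runs.id = %(run_id)s)  -- the condition that was missing
    join triage_metadata.models
      on (experiment_models.model_hash = models.model_hash)
"""

assert "triage_runs.id" in EXISTING_SEED_QUERY
```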

@silil silil marked this pull request as draft September 24, 2024 19:30
@silil silil force-pushed the hotfix_random_seed_in_models branch from c55adb7 to 6d07cc2 Compare September 25, 2024 15:03
@silil silil marked this pull request as ready for review September 26, 2024 18:28
@silil (Contributor, Author) commented Oct 15, 2024

We added logic to handle a custom random_state defined by the user in the grid_config for any model type defined in that section. If a custom random_state exists in the grid_config, Triage will use that seed for all the model ids generated from the hyperparameters defined for that model type. The custom random_state won't be saved in the hyperparameters column in the DB; it will be saved in the random_seed column of the triage_metadata.models table.
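The precedence logic described above can be sketched as follows (the function name and return shape are illustrative assumptions, not Triage's code): a user-supplied random_state in grid_config wins over the generated model seed, and in either case the chosen value is stripped from the hyperparameters before they are stored.

```python
def resolve_seed(grid_params, generated_seed):
    """Return (hyperparameters_to_store, seed_to_use).

    A user-specified random_state in grid_config takes precedence over the
    generated model seed; the seed never stays in the stored parameters.
    """
    params = dict(grid_params)
    seed = params.pop("random_state", generated_seed)
    return params, seed

# User pinned a seed in grid_config -> it wins, and is not stored:
stored, seed = resolve_seed({"max_depth": 3, "random_state": 7}, 1684690576)
assert seed == 7 and "random_state" not in stored

# No user seed -> fall back to the generated one:
stored, seed = resolve_seed({"max_depth": 3}, 1684690576)
assert seed == 1684690576 and stored == {"max_depth": 3}
```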

The changes associated with this behavior have been tested on the Donors Choose project on Linux, RedHat, and Microsoft WSL, as well as with the single-thread and multicore options, using Baselines, DecisionTrees, and Thresholder models.

@silil silil merged commit 0abb177 into master Oct 16, 2024
1 check failed
@silil silil deleted the hotfix_random_seed_in_models branch October 22, 2024 16:32