Gridsearch Design
A feature Alex has asked for is a way for OneModel to automatically search for the best hyperparameters for
regmod smooth. This is already implemented for the weave and swimr stages, so we can probably extend that framework to regmod smooth.
Additionally, we may also want to do cross validation. We currently only use the test_col indicated
in the config to create a training and testing set, but we could do k-fold CV using the specified holdout sets.
This is a proposed design for efficiently selecting an optimal hyperparameter set for regmod smooth.
Config
Borrowing from weave, we'll make the config parameters list-like. We can then construct Subsets from the cross
product of the parameters. We'll define parameters that can possibly be included in this cross product:
```yaml
# TODO: What parameters would've been useful to mess around with?
# Maybe allow for different var groupings? Different dimensions?
regmod_smooth:
  model_type: binomial
  obs: obs_rate
  lam: [0.1, 0.4, 1.0, 1.5]
  var_groups:
    group1:
      - col: "intercept"
      - col: "intercept"
        dim: "super_region_id"
        gprior: [0, 0.35]
    group2:
      - col: "intercept"
      - col: "intercept"
        dim: "age_group_id"
  dims:
    dim1:
      - name: "age_group_id"
        type: "categorical"
      - name: "age_mid"
        type: "numerical"
    dim2:
      - name: "age_group_id"
        type: "categorical"
      - name: "super_region_id"
        type: "categorical"
  fit_args:
    options:
      verbose: false
      m_scale: 0.1
```
The idea is that we'll now have 16 component submodels: the cross product of 2 var_groups, 2 dims, and 4 values of the smoothing
parameter lam.
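For concreteness, here's a minimal sketch (not onemod code; the config dict and the tuple ids are just for illustration) of how the list-valued entries above expand into those 16 submodels:

```python
from itertools import product

# Toy stand-in for the parsed regmod_smooth config above
config = {
    "var_groups": ["group1", "group2"],
    "dims": ["dim1", "dim2"],
    "lam": [0.1, 0.4, 1.0, 1.5],
}

hyperparams = ["var_groups", "dims", "lam"]
submodels = list(product(*(config[param] for param in hyperparams)))
print(len(submodels))  # 2 * 2 * 4 = 16 component submodels
print(submodels[0])    # ('group1', 'dim1', 0.1)
```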
We'll have to do something like this in the stage definitions:
```python
# Pseudocode
from itertools import product


class Stage:

    def __init__(self, hyperparams):
        self.hyperparams = hyperparams
        self.subsets = Subsets(hyperparams)
        self.create_subsets()

    def create_subsets(self):
        self.submodel_ids = self.subsets._create_subsets()
        self.dataif.dump_submodels(self.submodel_ids, "submodels.csv")

    def create_tasks(self):
        for submodel_id in self.submodel_ids:
            create_task(submodel_id=submodel_id)


class Subsets:

    def __init__(self, hyperparams: list[str]):
        self.hyperparams = hyperparams

    def _create_subsets(self):
        # Cross product of the list-valued config entries, one id per combination
        submodel_ids = list(product(*(config[param] for param in self.hyperparams)))
        return submodel_ids


# Decide whether regmod stage is an instance or a subclass of Stage
# This snippet treats regmod stage as an instance of Stage
regmod_stage = Stage(hyperparams=["dims", "lam", "var_groups"])
regmod_stage.create_tasks()
```
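For comparison, a hypothetical subclass version of the same thing (the class name and attribute are placeholders mirroring the pseudocode above):

```python
# Hypothetical subclass alternative to the instance approach above
class RegmodSmoothStage(Stage):
    # Which hyperparameters are expandable is baked into the stage definition
    hyperparams = ["dims", "lam", "var_groups"]

    def __init__(self):
        super().__init__(hyperparams=self.hyperparams)


regmod_stage = RegmodSmoothStage()
regmod_stage.create_tasks()
```

The subclass route keeps stage-specific defaults in one place, while the instance route keeps Stage generic; this is the first item under Decisions below.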
Question: Any other hyperparameters to be expanded? What was included in the grid search?
Parallelization
We have two additional axes of parallelization: the subsets and the folds. We can make each subset + holdout combination
a separate task in onemod and run them concurrently with Jobmon. This might make debugging a little harder, but it's probably a
useful tradeoff: with the above 16 submodels and 5 holdout columns we'd end up with 80 jobs where there was previously 1 smoother job.
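Roughly, the task expansion could look like the following (submodel_ids, the holdout column names, and create_modeling_task are hypothetical stand-ins for however onemod builds Jobmon tasks):

```python
from itertools import product

# Hypothetical inputs: submodel_ids would come from the Subsets expansion above
submodel_ids = [f"submodel_{i}" for i in range(16)]
holdout_cols = ["holdout1", "holdout2", "holdout3", "holdout4", "holdout5"]


def create_modeling_task(submodel_id, holdout_col):
    # Placeholder for a Jobmon task definition
    return {"submodel_id": submodel_id, "holdout_col": holdout_col}


tasks = [
    create_modeling_task(sid, col)
    for sid, col in product(submodel_ids, holdout_cols)
]
print(len(tasks))  # 16 * 5 = 80 jobs where there was previously 1
```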
Another idea is to perform subset-specific cross validation in memory. This would reduce the amount of necessary IO and simplify
ensembling later on, with less need for collection modules and the like, but fitting models in sequence instead of in separate
cluster jobs could take much longer when there are a lot of holdout folds.
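A rough sketch of what the in-memory version might look like for a single subset (fit_model, the 0/1 holdout column convention, and the obs_rate column are assumptions, not existing onemod code):

```python
import numpy as np
import pandas as pd


def cross_validate(df: pd.DataFrame, holdout_cols: list[str], fit_model) -> float:
    """Fit on each training split (holdout == 0) and score on the held-out rows."""
    scores = []
    for col in holdout_cols:
        train = df[df[col] == 0]
        test = df[df[col] == 1]
        model = fit_model(train)    # assumed model-fitting callable
        pred = model.predict(test)  # assumed predict interface
        rmse = np.sqrt(np.mean((test["obs_rate"] - pred) ** 2))
        scores.append(rmse)
    return float(np.mean(scores))   # average out-of-sample RMSE across folds
```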
Ensembling
The first step in ensembling is probably coming up with an ensemble of the fold-specific runs. We could, for example, average
the fold-specific coefficient values into a single ensemble model per subset.
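As a starting point, the fold ensemble could just be an unweighted mean of the fold-specific coefficient vectors, something like this hypothetical helper (applied one subset at a time):

```python
import numpy as np


def ensemble_folds(fold_coefs: list[np.ndarray]) -> np.ndarray:
    """Average fold-specific coefficient vectors into one set of coefficients."""
    return np.stack(fold_coefs).mean(axis=0)


coef = ensemble_folds([np.array([0.2, 1.1]), np.array([0.3, 0.9])])
print(coef)  # [0.25 1.  ]
```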
To average over the subsets and select a "best" model we might want to consider an approach similar to weave's.
The component weave models are ensembled by selecting the n best submodels by score (presumably out-of-sample RMSE) and
calculating a weighted average of those n submodels.
We could do the same thing in regmod_smooth for a single ensemble estimate, or we could simply report the best hyperparameter
combination and use that as a single model going forward.
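If we go the weave-like route, one possible weighting is inverse out-of-sample RMSE over the n best submodels; the actual weave weighting scheme may differ, so treat this as a sketch:

```python
import pandas as pd


def ensemble_weights(scores: pd.Series, n_best: int) -> pd.Series:
    """Keep the n best submodels (lowest RMSE) and weight them by inverse RMSE."""
    best = scores.nsmallest(n_best)
    weights = 1.0 / best
    return weights / weights.sum()


scores = pd.Series({"submodel_0": 0.10, "submodel_1": 0.12, "submodel_2": 0.50})
print(ensemble_weights(scores, n_best=2))  # submodel_0 ~ 0.545, submodel_1 ~ 0.455
```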
Decisions
- Should we think about model-specific stages as subclasses of a general template, or as instances?
- Should we do cross validation in memory or in separate jobs?
- How should we ensemble the cross validation folds?
- Should we report a single model or an ensemble of models?
- What hyperparameters should we include in the grid search?
- Other proposals for hyperparameter listings in the config?
- I wonder whether the config distinguishes clearly enough between what is included in the grid search and what isn't.