Support submitit Hydra launcher, to launch runs on SLURM clusters #69
base: dev
Conversation
@lemairecarl @ThierryJudge I'm trying to test this new launcher to launch a few jobs on Beluga, but I'm having trouble creating an environment with up-to-date dependencies on the cluster. That's why I decided not to delay this PR any longer and to gather your feedback ASAP. I looked at the output config generated by Hydra, and as far as that goes everything should be okay 🤞
I think this might require some extra documentation (maybe another README), but looks good. I added a few questions and might add some more when I test it out.
```yaml
timeout_min: ${oc.select:run_time_min,60}
setup:
  - "module load httpproxy"  # load module allowing to connect to whitelisted domains
  - "source $ALLIANCE_VENV_PATH/bin/activate"  # activate the pre-installed virtual environment
```
I am not very familiar with venv, but does this activate the venv from a path that is not on the compute node?
From the assignment in `.env.example`, I suppose that the venv will typically be in project storage. That is what is meant by "pre-installed". Doing it this way works pretty well and is the most flexible. However, there is debate about what the best practice is. I pretty much always use venvs from shared storage instead of installing them every time on the compute node.
@lemairecarl explained pretty much the whole thing. On the compute node, you still have access to everything on the cluster's distributed filesystem, e.g. your home folder, etc. It's just that some folders of the filesystem are on the compute node itself, and some are on other machines altogether. So, depending on how much you expect to read/write some files, it can be a good idea to move them to the folders physically on the compute node.
Regarding the Python environment, unless I'm very much mistaken, the Python code is only read from disk once when you launch the interpreter (at which point all the imported code is loaded into RAM), and afterwards everything is read from RAM. So I don't think there are major performance drawbacks to keeping the Python environment on a distributed filesystem, since it will only be accessed once per job, and this way you avoid potential issues with reproducing exactly the same environment each time you run a job.
The dataset is another matter entirely, since it's big, probably won't fit in RAM, and you'll be reading from it constantly throughout the job's runtime. That's why, in that case, it's worth going through the hassle of copying it to the compute node, to avoid having to access it remotely.
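Building on that, a natural place for the dataset copy is the launcher's `setup` list shown above, since those commands run on the allocated compute node before the job itself starts. A minimal sketch, assuming a single-file dataset (the `$SCRATCH` path and file name are illustrative, not from this PR; `$SLURM_TMPDIR` is the node-local scratch space SLURM provides):

```yaml
hydra:
  launcher:
    setup:
      - "module load httpproxy"
      - "source $ALLIANCE_VENV_PATH/bin/activate"
      # copy the dataset to the compute node's local storage before the job script runs
      - "cp $SCRATCH/datasets/my_dataset.h5 $SLURM_TMPDIR/"
```

The job would then read the dataset from `$SLURM_TMPDIR` instead of the distributed filesystem.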
```yaml
hydra:
  launcher:
    timeout_min: ${oc.select:run_time_min,60}
```
Could you explain this line to me, please?
Absolutely! The `hydra.launcher.timeout_min` field is originally provided by submitit to specify the duration (in minutes) to request for the job. As for the interpretation of the field's value itself, `oc.select` tries to interpolate another value from the Hydra config (in this case `run_time_min`), but uses the second value as a default (in this case `60`) if the first value doesn't match a key in the Hydra config.

There is no example configuration in `vital` that takes advantage of this option, but it's the way I found to easily specialize the runtime of each Hydra "experiment". In my project, I specified values suited to the runtime of my models in my experiment configs (e.g. 60 for an ENet, 500 for a VAE, etc.), but if the user doesn't specify it, the job gets a default runtime of 60 minutes. That's submitit's default value, so I simply copied it here to keep the default behavior.

Below is an excerpt of the config for my `vae.yaml` experiment, to better illustrate what I mean:
```yaml
# @package _global_

defaults:
  [...]

trainer:
  max_epochs: 200
  precision: 16

data:
  batch_size: 128
  [...]

task:
  optim:
    lr: 5e-4
    weight_decay: 1e-3

run_time_min: 500
```
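With this excerpt, `${oc.select:run_time_min,60}` resolves to `500` whenever the `vae` experiment is selected; for an experiment config that doesn't define `run_time_min`, the launcher falls back to submitit's default of 60 minutes.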
A few questions, but it looks good to me!
In the end, when testing the output configs one last time before moving on to real-world tests on Beluga, I couldn't reproduce the good output config I initially got, due to a bug I discovered in some edge-case behavior in Hydra. I opened an issue with Hydra (#2316); we should wait for it to be resolved before finalizing and merging the current PR.
@lemairecarl @ThierryJudge I finally received an answer recently to the issue I had opened with Hydra regarding the composition order of the config, and how built-in nodes are a special case. The short answer is that Hydra's current behavior is a feature, not a bug, so the way I had previously designed the switch between local/Slurm runs wouldn't work. Therefore, I changed the design a bit to introduce a new custom `launcher` config group. The purpose of the new config group is to "override", in a way, the built-in launcher config, so that we use it rather than configuring the built-in group directly.

I know it's been a long time (and I'm in no hurry to get this merged), but if you ever have the time I would appreciate it if you could give your opinions on the new design. It would be nice to finally have support for launching runs on SLURM clusters!
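For illustration, here is a minimal sketch of what such a custom `launcher` config group might look like (the file location and `defaults` wiring are assumptions, not necessarily what the PR does; the field values come from the snippets above):

```yaml
# launcher/slurm.yaml -- hypothetical location for the new config group
# @package _global_
defaults:
  - override /hydra/launcher: submitit_slurm

hydra:
  launcher:
    timeout_min: ${oc.select:run_time_min,60}
    setup:
      - "module load httpproxy"
      - "source $ALLIANCE_VENV_PATH/bin/activate"
```

A run could then opt into SLURM with something like `launcher=slurm` on the command line, while local runs would leave the group unset (again, an assumption about how the PR wires it up).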
…tom `launcher`

This allows us to revert the custom checks on the config that made sure required launcher config fields exist (because now we can rely on Hydra's built-in parsing).
The use of this launcher, once tested over a few weeks to launch real jobs, is meant to completely replace the `vital.utils.jobs` package, which will eventually be deleted.