Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support submitit Hydra launcher, to launch runs on SLURM clusters #69

Open
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

nathanpainchaud
Copy link
Member

The use of this launcher, once tested over a few weeks to launch real jobs, is meant to completely replace the vital.utils.jobs package, which will eventually be deleted.

@nathanpainchaud nathanpainchaud added the enhancement New feature or request label Jul 25, 2022
@nathanpainchaud nathanpainchaud self-assigned this Jul 25, 2022
@nathanpainchaud nathanpainchaud force-pushed the feature/submitit-launcher branch from bcf2090 to baae5a9 Compare July 25, 2022 15:49
@nathanpainchaud
Copy link
Member Author

@lemairecarl @ThierryJudge I'm trying to test this new launcher to launch a few jobs on Beluga, but I'm having trouble creating an environment with up-to-date dependencies on the cluster. That's why I decide not to delay this PR any longer and to gather your feedback ASAP. I looked at the output config generated by Hydra, and as far as that goes everything should be okay 🤞

Copy link
Collaborator

@ThierryJudge ThierryJudge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might require some extra documentation (maybe another readme) but looks good. I added a few questions and might add some more when I test it out.

timeout_min: ${oc.select:run_time_min,60}
setup:
- "module load httpproxy" # load module allowing to connect to whitelisted domains
- "source $ALLIANCE_VENV_PATH/bin/activate" # activate the pre-installed virtual environment
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very familiar with venv but does this activate the venv from a path that is not on the compute node ?

Copy link
Contributor

@lemairecarl lemairecarl Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the assignation in .env.example, I suppose that the venv will typically be in project storage. That is what is meant by "pre-installed". Doing it this way works pretty well and is the most flexible. However, there is debate about what is the best practice. I pretty much always use venvs from shared storage instead of installing them every time on the compute node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lemairecarl explained pretty much the whole thing. On the compute node, you still have access to everything all the cluster's distributed filesystem, e.g. your home folder, etc. It's just that some folders of the filesystem are on the compute node itself, and some are on other machines altogether. So, depending on how much you expect to read/write to some files, it can be a good idea to move them to the folders physically on the compute node.

Regarding the python environment, unless I'm very much mistaken, the python code is only read from disk once when you launch the interpreter (at which point all the imported code is loaded into RAM), and afterwards everything is read from the RAM. So I don't think there are major performance drawbacks to keeping the python environment on a distributed filesystem, since it will only be accessed once per job, and this way you avoid potential issues with reproducing exactly the same environment each time you run a job.

The dataset is another matter entirely, since it's big, probably won't fit on the RAM, and you'll be reading from it constantly throughout the job's runtime. That's why in that case, it's worth to go through the hassle of copying it on the compute node, to avoid having to access it remotely.

vital/config/hydra/launcher/alliance.yaml Outdated Show resolved Hide resolved
@vitalab vitalab deleted a comment from ThierryJudge Jul 25, 2022
vital/runner.py Outdated Show resolved Hide resolved

hydra:
launcher:
timeout_min: ${oc.select:run_time_min,60}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pourrais-tu m'expliquer cette ligne stp?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolument! Le champ hydra.launcher.timeout_min est à la base fourni par submitit pour spécifier la durée (en minutes) à demander pour la job. Quand à l'interprétation de la valeur du champ elle-même, oc.select essaie d'interpoler une autre valeur dans la config Hydra (dans ce cas-ci run_time_min), mais en utilisant la 2e valeur par défaut (dans ce cas-ci 60) si la 1ère valeur ne correspond pas à une clé dans la config Hydra.

Il n'y a pas d'exemple de configuration dans vital qui profite de cette option, mais c'est ce que j'ai trouvé comme façon de faire pour permettre facilement de spécialiser le runtime de chaque "experiment" Hydra. Dans mon projet, je spécifiais des valeurs adaptées au runtime de mes modèles dans mes config d'experiment (e.g. 60 pour un ENet, 500 pour un VAE, etc.), mais si l'utilisateur ne le spécifie pas, la job va avoir un runtime par défaut de 60 minutes. C'est la valeur par défaut de submitit, donc je l'ai simplement copiée ici pour garder le comportement par défaut.

Ci-dessous, un extrait de ma config pour mon experiment vae.yaml, pour mieux illustrer ce que je veux dire:

# @package _global_

defaults:
  [...]

trainer:
  max_epochs: 200
  precision: 16

data:
  batch_size: 128
  [...]

task:
  optim:
    lr: 5e-4
    weight_decay: 1e-3

run_time_min: 500

Copy link
Contributor

@lemairecarl lemairecarl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quelques questions, mais ça me semble bon!

@nathanpainchaud nathanpainchaud force-pushed the feature/submitit-launcher branch from baae5a9 to c214ef4 Compare July 25, 2022 21:04
@nathanpainchaud
Copy link
Member Author

nathanpainchaud commented Jul 26, 2022

In the end, when testing the output configs one last time before moving on to real-world tests on Beluga, I couldn't reproduce the good output config I initially got, and it's due to a bug I discovered for some edge-case behavior in Hydra.

I openened an issue with Hydra (#2316), which we should wait to be resolved before finalizing + merging the current PR.

@nathanpainchaud nathanpainchaud force-pushed the feature/submitit-launcher branch 2 times, most recently from 7488e59 to e8b1554 Compare August 1, 2022 17:16
@nathanpainchaud nathanpainchaud force-pushed the feature/submitit-launcher branch 2 times, most recently from 6378a22 to fa8184c Compare September 5, 2022 23:25
@nathanpainchaud nathanpainchaud force-pushed the feature/submitit-launcher branch 2 times, most recently from 6084769 to 9fc633a Compare September 14, 2022 23:30
@nathanpainchaud nathanpainchaud marked this pull request as draft September 14, 2022 23:57
@nathanpainchaud nathanpainchaud force-pushed the feature/submitit-launcher branch from 9fc633a to 43a9215 Compare September 30, 2022 12:29
The use of this launcher is meant in time to completely replace the `vital.utils.jobs` package
@nathanpainchaud nathanpainchaud force-pushed the feature/submitit-launcher branch from 43a9215 to 5a06309 Compare February 21, 2023 16:08
@nathanpainchaud nathanpainchaud marked this pull request as ready for review February 21, 2023 16:11
@nathanpainchaud
Copy link
Member Author

@lemairecarl @ThierryJudge I finally received an answer recently to the issue I had opened with Hydra regarding the composition order of the config, and how built-in nodes are a special case.

The short answer is that Hydra's current behavior is a feature, not a bug, so the previous way I designed to handle switching between local/Slurm runs wouldn't work. Therefore, I changed the design a bit to introduce a new custom launcher config group (not to be confused with the built-in hydra/launcher config group).

The purpose of the new config group is to "override" in a way the built-in launcher, so that we use it rather than hydra/launcher to switch between local/Slurm runs. This is less elegant than before, but it was necessary because, from my discussions with the Hydra dev, we essentially can't override application options from the override of a built-in config group. So I had no choice but to a create new config group (that I could set to be resolved after application-level options), to enable the necessary overrides.

I know it's been a long time (and I'm in no hurry to get this merged), but if you ever have the time I would appreciate if you could give your opinions on the new design. It would be nice to finally have support for launching vital jobs on Slurm clusters again 🙂

…tom `launcher`

This allows us to revert the custom checks on the config that made sure required launcher config fields exist (because now we can rely on Hydra's built-in parsing)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants