* basic launcher/config structure from official hydra examples in place
* remote_launcher_plugin progress
* basic launcher/config structure from official hydra examples in place
* remote_launcher_plugin progress
  Signed-off-by: César Miguel Valdez Córdova <[email protected]>
* removed repo_dir param, config interpolation
* add uv lock
* remote launcher params, configs
* unrolled config to try params to circumvent pre-emption
* more configs
* Tweaks
  Signed-off-by: Fabrice Normandin <[email protected]>
* Add a `cluster` config group, tweak resources
  Signed-off-by: Fabrice Normandin <[email protected]>
* Tweak the configs
  Signed-off-by: Fabrice Normandin <[email protected]>
* Remove hydra_plugins folder, tweak hydra `Plugins`
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix weird PL .log bug in callback
  Signed-off-by: Fabrice Normandin <[email protected]>
* add first mock test
* removed direct mock call, overrides for argv
* add trainer params, assertion
* added assertion to mock calls
* avoid parameter overwriting on executor
* hotpatched uv not found error, path instead of string
* Debug issues with test for cluster group
  Signed-off-by: Fabrice Normandin <[email protected]>
* Try to make things more explicit
  Signed-off-by: Fabrice Normandin <[email protected]>
* Add "cpu" and "gpu" resources configs
  Signed-off-by: Fabrice Normandin <[email protected]>
* Add Beluga and Cedar configs
  Signed-off-by: Fabrice Normandin <[email protected]>
* Update the remote executor
  Signed-off-by: Fabrice Normandin <[email protected]>
* Update the remote slurm executor to latest commit
  Signed-off-by: Fabrice Normandin <[email protected]>
* Update executor plugin, add todos
  Signed-off-by: Fabrice Normandin <[email protected]>
* debug config revamp
* Add back the one_gpu.yaml config (local only)
  Signed-off-by: Fabrice Normandin <[email protected]>
* keep basic assertion within test
* Fix pre-commit version issue
  Signed-off-by: Fabrice Normandin <[email protected]>
* empty line at the end of debug.yaml
* Fix pre-commit issues
  Signed-off-by: Fabrice Normandin <[email protected]>
* Rename test module
  Signed-off-by: Fabrice Normandin <[email protected]>
* Add missing marks on test
  Signed-off-by: Fabrice Normandin <[email protected]>
* Remove debug config
  Signed-off-by: Fabrice Normandin <[email protected]>
* Show git diff when a pre-commit hook fails
  Signed-off-by: Fabrice Normandin <[email protected]>
* Add back weirdly formatted import in main.py?!
  Signed-off-by: Fabrice Normandin <[email protected]>
* Remove duplicated, outdated test in main_test.py
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix the one_gpu.yaml config not having a target
  Signed-off-by: Fabrice Normandin <[email protected]>
* Don't include DRAC slurm account in configs
  Signed-off-by: Fabrice Normandin <[email protected]>
* WIP: Use the same resource group local and remote
  Signed-off-by: Fabrice Normandin <[email protected]>
* Add a patched config for submitit_slurm
  Signed-off-by: Fabrice Normandin <[email protected]>
* Use patched config in `cluster=current`
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix structure of log dirs with submitit launchers
  Signed-off-by: Fabrice Normandin <[email protected]>
* Remove the duplicate 'one_gpu' config
  Signed-off-by: Fabrice Normandin <[email protected]>
* Undo change to Trainer default config
  Signed-off-by: Fabrice Normandin <[email protected]>
* Silence tiny typing error in PatchedSlurmQueueConf
  Signed-off-by: Fabrice Normandin <[email protected]>
* Add test to load the configs
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix test for remote launcher plugin
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix bug in overwriting of `setup` from executor
  Signed-off-by: Fabrice Normandin <[email protected]>
* Add marks to tests for the remote launcher plugin
  Signed-off-by: Fabrice Normandin <[email protected]>

---------

Signed-off-by: César Miguel Valdez Córdova <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Co-authored-by: Fabrice Normandin <[email protected]>
1 parent f753a20, commit 0031536
Showing 19 changed files with 1,151 additions and 285 deletions.
    @@ -0,0 +1,12 @@
    # @package _global_
    defaults:
      - narval

    # Use this to specify which remote slurm cluster the job should run on.
    # Remember to also use the resources group to select the resources allocated to the job!

    hydra:
      launcher:
        executor:
          cluster_hostname: beluga
          internet_access_on_compute_nodes: false
    @@ -0,0 +1,12 @@
    # @package _global_
    defaults:
      - narval

    # Use this to specify which remote slurm cluster the job should run on.
    # Remember to also use the resources group to select the resources allocated to the job!

    hydra:
      launcher:
        executor:
          cluster_hostname: cedar
          internet_access_on_compute_nodes: true
    @@ -0,0 +1,14 @@
    # @package _global_
    defaults:
      - override /hydra/launcher: patched_submitit_slurm
    hydra:
      mode: MULTIRUN
      run:
        # output directory, generated dynamically on each run
        dir: logs/${name}/runs/${now:%Y-%m-%d}/${now:%H-%M-%S}
      sweep:
        dir: logs/${name}/multiruns
        subdir: ${hydra.job.id}
      launcher:
        stderr_to_stdout: true
        submitit_folder: ${hydra.sweep.dir}/%j
    @@ -0,0 +1,24 @@
    # @package _global_
    defaults:
      - override /hydra/launcher: remote_submitit_slurm

    # Use this to specify which remote slurm cluster the job should run on.
    # Remember to also use the resources group to select the resources allocated to the job!
    hydra:
      mode: MULTIRUN
      run:
        # output directory, generated dynamically on each run
        dir: logs/${name}/runs/${now:%Y-%m-%d}/${now:%H-%M-%S}
      sweep:
        dir: logs/${name}/multiruns
        subdir: ${hydra.job.id}

      launcher:
        executor:
          _target_: remote_slurm_executor.RemoteSlurmExecutor
          _partial_: true
          folder: "${hydra.sweep.dir}/%j"
          cluster_hostname: mila
          internet_access_on_compute_nodes: true

        stderr_to_stdout: true
    @@ -0,0 +1,9 @@
    # @package _global_
    defaults:
      - mila.yaml

    hydra:
      launcher:
        executor:
          cluster_hostname: narval
          internet_access_on_compute_nodes: false
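The cluster configs above chain through their `defaults` lists: the Narval config starts from `mila.yaml`, and the Beluga and Cedar configs start from `narval`, each overriding only `cluster_hostname` and `internet_access_on_compute_nodes`. The following is a minimal sketch of that layering in plain Python, assuming each file's own keys are applied after the entries in its defaults list. The `deep_merge` and `executor` helpers are hypothetical names used for illustration; this is not how Hydra implements composition.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` merged in recursively (override wins)."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def executor(hostname: str, internet: bool) -> dict:
    """Build the nested `hydra.launcher.executor` fragment that each cluster file sets."""
    return {
        "hydra": {
            "launcher": {
                "executor": {
                    "cluster_hostname": hostname,
                    "internet_access_on_compute_nodes": internet,
                }
            }
        }
    }


# Values copied from the diffs. The mila config also sets keys that the others inherit.
mila = deep_merge(
    executor("mila", True),
    {"hydra": {"mode": "MULTIRUN", "launcher": {"stderr_to_stdout": True}}},
)
narval = deep_merge(mila, executor("narval", False))
beluga = deep_merge(narval, executor("beluga", False))
cedar = deep_merge(narval, executor("cedar", True))
```

Under this assumption, `beluga` and `cedar` keep `mode: MULTIRUN` and `stderr_to_stdout: true` from the Mila config without repeating them, which is why the per-cluster files stay so short.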
    @@ -0,0 +1,23 @@
    # @package _global_
    defaults:
      - override /hydra/launcher: patched_submitit_slurm
    trainer:
      accelerator: cpu
      devices: auto

    hydra:
      mode: MULTIRUN
      launcher:
        nodes: 1
        tasks_per_node: 1
        cpus_per_task: 8
        mem_gb: 16
        array_parallelism: 16 # max num of jobs to run in parallel
        # Other things to pass to `sbatch`:
        additional_parameters:
          time: 1-00:00:00 # maximum wall time allocated for the job (D-HH:MM:SS)


        ## A list of commands to add to the generated sbatch script before running srun:
        # setup:
        # - export LD_PRELOAD=/some/folder/with/libraries/
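The `time: 1-00:00:00` entry above uses the `D-HH:MM:SS` wall-time format noted in its comment. As a small illustration of how to read such a value, the sketch below converts it to a `timedelta`. The `parse_wall_time` helper is a hypothetical name, and it handles only `D-HH:MM:SS` and `HH:MM:SS`; Slurm accepts other spellings that are not covered here.

```python
import re
from datetime import timedelta

# Optional "D-" prefix followed by HH:MM:SS.
_WALL_TIME = re.compile(r"(?:(\d+)-)?(\d+):(\d{2}):(\d{2})")


def parse_wall_time(value: str) -> timedelta:
    """Convert a `D-HH:MM:SS` or `HH:MM:SS` string into a timedelta."""
    match = _WALL_TIME.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported wall-time format: {value!r}")
    days, hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


one_day = parse_wall_time("1-00:00:00")
```

Here `one_day` equals `timedelta(days=1)`, so the config requests at most 24 hours per job.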
    @@ -0,0 +1,22 @@
    # @package _global_
    defaults:
      - override /hydra/launcher: patched_submitit_slurm

    hydra:
      mode: MULTIRUN
      launcher:
        cpus_per_task: 4
        gpus_per_task: 1
        array_parallelism: 16 # max num of jobs to run in parallel
        # Other things to pass to `sbatch`:
        additional_parameters:
          time: 1-00:00:00 # maximum wall time allocated for the job (D-HH:MM:SS)
          # TODO: It would be better to have those be arguments to the launcher (as is the case in the
          # RemoteLauncherPlugin), that way we could use only SLURM argument names..
          nodes: 1
          mem: 16G
          ntasks_per_node: 1

        ## A list of commands to add to the generated sbatch script before running srun:
        # setup:
        # - export LD_PRELOAD=/some/folder/with/libraries/
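The TODO in the GPU config above points at a naming split: the CPU config uses launcher argument names (`mem_gb`, `tasks_per_node`), while the entries under `additional_parameters` use SLURM option names (`mem`, `ntasks_per_node`). The sketch below shows one plausible way such a mapping could be rendered as `#SBATCH` directives, assuming underscores in keys become hyphens. The `sbatch_lines` helper is a hypothetical illustration, not the launcher's actual code.

```python
from __future__ import annotations


def sbatch_lines(parameters: dict) -> list[str]:
    """Render a mapping of SLURM option names to values as `#SBATCH` directives."""
    lines = []
    for key, value in parameters.items():
        flag = key.replace("_", "-")  # e.g. ntasks_per_node -> ntasks-per-node
        lines.append(f"#SBATCH --{flag}={value}")
    return lines


# Values copied from the GPU resources config in the diff.
additional_parameters = {
    "time": "1-00:00:00",
    "nodes": 1,
    "mem": "16G",
    "ntasks_per_node": 1,
}

for line in sbatch_lines(additional_parameters):
    print(line)
```

Because the keys are passed through almost verbatim, any SLURM option can be set this way without the launcher having to know about it, which is the flexibility the TODO wants to make the default.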
This file was deleted.
This file was deleted.