From a7883e60556bfe60f8d70328ae2220984540cdfb Mon Sep 17 00:00:00 2001
From: jctian98
Date: Fri, 23 Aug 2024 15:56:56 -0400
Subject: [PATCH 1/2] babel guidance version 1

---
 _pages/info.md                   |   1 +
 _posts/2024-08-20-babel-usage.md | 104 +++++++++++++++++++++++++++++++
 2 files changed, 105 insertions(+)
 create mode 100644 _posts/2024-08-20-babel-usage.md

diff --git a/_pages/info.md b/_pages/info.md
index d1e779f9..bbc0e888 100644
--- a/_pages/info.md
+++ b/_pages/info.md
@@ -12,6 +12,7 @@ This page has some information guidelines for members of WAVLab.
 * [AWS use instructions]({% post_url 2022-01-01-aws-usage %})
 * [PSC cluster use instructions]({% post_url 2022-01-01-psc-usage %})
 * [Delta cluster use instructions]({% post_url 2023-04-02-delta-usage %})
+* [Babel cluster use instructions]({% post_url 2024-08-20-babel-usage %})
 * [ESPnet2 recipes]({% post_url 2022-01-01-espnet2-recipe %})
 * [Lab logos and slides template](https://github.com/shinjiwlab/lab_logo) (Need to request access)
 
diff --git a/_posts/2024-08-20-babel-usage.md b/_posts/2024-08-20-babel-usage.md
new file mode 100644
index 00000000..d19f841c
--- /dev/null
+++ b/_posts/2024-08-20-babel-usage.md
@@ -0,0 +1,104 @@
+---
+layout: post
+title: Babel Usage
+date: 2024-08-20 09:00:00-0800
+description: Babel cluster usage.
+comments: false
+---
+
+## Important Information
+* Documentation: Babel is the cluster hosted by LTI at CMU. Besides this page, **please also check the [official documentation](https://hpc.lti.cs.cmu.edu/wiki/index.php?title=Main_Page)**. You will need a CMU identity (i.e., an Andrew ID) to access it.
+* Slack Channel: Babel users should join the `babel-babble` channel in the `LTI` Slack workspace to receive the latest information. You can also contact the cluster admins through that channel.
+* Use Policy:
+  * Generally, each user can use up to 8 GPUs without notifying the cluster admins.
+  * Occasionally, you may use more than 8 GPUs, but you need to post a message in the Slack channel stating the number of GPUs and the estimated time to finish. The admins may ask you to lower your usage when the cluster is busy.
+  * There is no charging mechanism on Babel, but please still use it reasonably.
+* `swl_general` and `swl_short` partitions:
+  * Nodes named `babel-11-*` come from the former SWL cluster. Our lab members have priority on these nodes as long as you use the `swl_general` and `swl_short` partitions.
+  * Nodes `babel-11-[13,29]` are exclusively reserved for `WavLab` users. Jobs on these nodes do not count toward the 8-GPU limit.
+
+
+## Cluster Access
+* Before you proceed, please make sure your access to Babel has been approved by Prof. Shinji Watanabe.
+* Submit this [form](https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform). You will receive an email from the cluster admin when your account is created.
+  * HPC Cluster Name: `babel`
+  * Department Association: `LTI`
+  * Faculty Sponsoring Account: `swatanab`
+  * Additional Groups: `swl`
+* Connect to the cluster with `ssh <andrew-id>@babel.lti.cs.cmu.edu`.
+
+## Login nodes, working nodes and working directories
+* After logging in, you land on a `login` node. These nodes are for logging in only and are not meant for real jobs.
+* Your jobs run on `working` nodes. You can allocate CPU/GPU resources for your jobs, and once allocated, you can also `ssh` into those nodes from the login node. E.g., if one of your jobs is running on `babel-11-29`, you can log in to that node with `ssh babel-11-29`.
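+  As a minimal sketch (generic Slurm usage rather than Babel-specific guidance; the partition, resource amounts, and time limit below are only illustrative and should follow the use policy above), you can either open an interactive shell on a working node with `srun`, or find where a submitted job is running with `squeue` and then `ssh` to that node:
+  ```
+  # open an interactive shell on a GPU node (illustrative resource request)
+  srun --partition=swl_general --gres=gpu:1 --cpus-per-task=8 --mem=30000M --time=2:00:00 --pty bash
+
+  # or: list your jobs and the nodes they run on (NODELIST column), then ssh to one of them
+  squeue -u $USER
+  ssh babel-11-29
+  ```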
+* Working directories below are commonly used. Note that `/data` is not visible from the `login` nodes.
+  * Personal directory: `/data/user_data/<andrew-id>`
+  * Shared corpus storage: `/data/group_data/swl/corpora`
+  * Legacy working directory of previous SWL users: `/data/group_data/swl/old_home`
+  * Personal home directory, with very limited space; do not use it for your work: `/home/<andrew-id>`
+
+
+## Resource Allocation
+* Resources on Babel are managed by Slurm. For general use cases, please refer to [this document](https://hpc.lti.cs.cmu.edu/wiki/index.php?title=Slurm).
+* For ESPnet users, jobs are submitted to Slurm automatically.
+  * For each recipe (e.g., `espnet/egs2/librispeech/asr1`), there are `cmd.sh` and `conf/slurm.conf` files. Selecting the `slurm` backend in `cmd.sh` and configuring `conf/slurm.conf` properly should be sufficient to use Babel resources. An example `conf/slurm.conf` is below.
+    ```
+    # Default configuration
+    command sbatch --export=PATH
+    option name=* --job-name $0
+    default time=2-00:00:00
+    option time=* --time $0
+    option mem=* --mem-per-cpu $0
+    option mem=0
+    option num_threads=* --cpus-per-task $0
+    option num_threads=1 --cpus-per-task 1
+    option num_nodes=* --nodes $0
+    default gpu=0
+    option gpu=0 -p swl_general --mem 2000M
+    option gpu=1 -p swl_general --gres=gpu:1 -c 8 --mem 30000M
+    option gpu=2 -p swl_general --gres=gpu:2 -c 16 --mem 60000M
+    option gpu=3 -p swl_general --gres=gpu:3 -c 24 --mem 90000M
+    option gpu=4 -p swl_general --gres=gpu:4 -c 32 --mem 120000M
+    option gpu=8 -p swl_general --gres=gpu:8 -c 48 --mem 240000M
+    ```
+  * Based on the number of GPUs you request, the matching setup above is selected automatically. E.g., if 2 GPUs are requested, the configuration `gpu=2 -p swl_general --gres=gpu:2 -c 16 --mem 60000M` will be used.
+  * `-p swl_general` specifies which partition the jobs are submitted to. Use `sinfo` to list all available partitions; each partition contains different resources. Members of `WavLab` can use the partitions `debug`, `general`, `long`, `cpu`, `swl_general` and `swl_short`.
+  * `-c` specifies the number of CPU cores to allocate, usually 8 cores per GPU.
+  * `--mem` specifies the CPU memory to allocate, usually 30 GB per GPU.
+  * Make sure `gpu=N` matches `--gres=gpu:N`.
+  * `default time=2-00:00:00` specifies the default time limit requested for your jobs. The maximum allowed time differs by partition; use `sinfo` to check it for each partition.
+  * Your jobs will fail if the requested number of GPUs, CPU cores, or memory exceeds what the partition can provide.
+  * By adding `--exclude=<node-names>`, you can avoid submitting your jobs to certain nodes, e.g., `--exclude=babel-11-[13,29]`.
+  * By adding `-w <node-names>`, you can submit your jobs to specific nodes, e.g., `-w babel-11-[13,29]`.
+  * You can also specify the GPU type. E.g., to request A6000 GPUs, replace `--gres=gpu:4` with `--gres=gpu:A6000:4`.
+
+## ESPnet
+Using ESPnet on Babel requires no Babel-specific tricks. To set up the environment:
+```
+git clone https://github.com/espnet/espnet.git
+cd espnet/tools
+./setup_anaconda.sh <conda-path> <env-name> <python-version>  # E.g., ./setup_anaconda.sh /data/user_data/<andrew-id>/tools/miniconda3 espnet 3.10
+make TH_VERSION=<torch-version> CUDA_VERSION=<cuda-version>   # E.g., make TH_VERSION=2.1.0 CUDA_VERSION=11.8
+```
+* Note: You will not need `module load` as on other clusters; the conda environment handles CUDA automatically.
+
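+As a quick sanity check (a minimal sketch; it assumes the `activate_python.sh` script created by the setup above, and should be run on a GPU node rather than a login node, since login nodes have no GPUs), you can verify that PyTorch sees the expected CUDA version:
+```
+cd espnet/tools
+. ./activate_python.sh   # activate the conda environment created by setup_anaconda.sh
+python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
+```
+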
+Then you can run ESPnet recipes, e.g.:
+```
+cd espnet/egs2/librispeech/asr1/
+# configure cmd.sh to use the slurm backend
+# configure conf/slurm.conf as above
+# add your dataset path to db.sh
+bash run.sh
+```
+Further ESPnet usage guidance is beyond the scope of this page. Readers can refer to the [tutorials](https://espnet.github.io/espnet/tutorial.html) on our [website](https://github.com/espnet/espnet).
+
+## Misc.
+* VSCode: Both login nodes and working nodes can be accessed with VSCode. Search for `VSCode` in the Babel official documentation for guidance.
+* Since the `/data` directory is not visible from login nodes, you can keep a small CPU job running for coding. Please use only a small amount of memory and CPU cores for this purpose. For short-term use, you can also allocate some GPUs, but please do not hold GPUs for a long time just for coding and debugging.
+
+  ```
+  sbatch --partition=swl_general --nodes=1 --ntasks=1 --ntasks-per-node=1 --cpus-per-task=4 --mem=8000M -w babel-11-17 --time=15-00:00:00 /home/<andrew-id>/run.sh
+
+  ### with the run.sh example below
+  #!/bin/bash
+  sleep 15d
+  ```

From 56a00b4d5a04c6ed9f49cf23170ac5e912b499bc Mon Sep 17 00:00:00 2001
From: jctian98
Date: Mon, 9 Sep 2024 16:24:18 -0400
Subject: [PATCH 2/2] remove sensitive info

---
 _posts/2024-08-20-babel-usage.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/_posts/2024-08-20-babel-usage.md b/_posts/2024-08-20-babel-usage.md
index d19f841c..700be064 100644
--- a/_posts/2024-08-20-babel-usage.md
+++ b/_posts/2024-08-20-babel-usage.md
@@ -15,12 +15,11 @@ comments: false
   * There is no charging mechanism on Babel, but please still use it reasonably.
 * `swl_general` and `swl_short` partitions:
   * Nodes named `babel-11-*` come from the former SWL cluster. Our lab members have priority on these nodes as long as you use the `swl_general` and `swl_short` partitions.
-  * Nodes `babel-11-[13,29]` are exclusively reserved for `WavLab` users. Jobs on these nodes do not count toward the 8-GPU limit.
 
 
 ## Cluster Access
 * Before you proceed, please make sure your access to Babel has been approved by Prof. Shinji Watanabe.
-* Submit this [form](https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform). You will receive an email from the cluster admin when your account is created.
+  * Go to the [LTI intranet](https://lti.cs.cmu.edu/misc-pages/intranet-forms.html) and then submit the `HPC Cluster User Account Request Form`.
   * HPC Cluster Name: `babel`
   * Department Association: `LTI`
   * Faculty Sponsoring Account: `swatanab`