Any example script to run multi-node training for slurm? #1378
Comments
We don't have a slurm example, but here are the environment variables that the composer launcher sets/requires: https://github.com/mosaicml/composer/blob/6d4628a1043d1f118dc38eb359ede5524e0a9aa0/composer/cli/launcher.py#L344-L352. It should just be the normal torch distributed env vars. And here are the env vars that mcli sets for you: https://docs.mosaicml.com/projects/mcli/en/latest/quick_start/environment.html#runtime-environment-variables
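For reference, here is a minimal sketch of what those standard torch distributed variables typically look like for a multi-node job. The concrete values below are illustrative assumptions (2 nodes, 8 GPUs per node); the authoritative list is in the launcher.py lines linked above:

```bash
# Standard torch distributed environment variables (illustrative values only).
export MASTER_ADDR=node000      # hostname of the rank-0 node, reachable from all nodes (assumed name)
export MASTER_PORT=29500        # any free port, identical on every node
export WORLD_SIZE=16            # total ranks = nodes * GPUs per node
export LOCAL_WORLD_SIZE=8       # ranks (GPUs) on this node
export NODE_RANK=0              # this node's index, 0..NNODES-1 (differs per node)
# RANK and LOCAL_RANK are assigned per process by the launcher, so they normally
# do not need to be exported in a batch script.
```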
Thanks for helping me! @dakinggg Here is the SLURM batch script I used:

```bash
#!/bin/bash
#SBATCH --job-name=wavy-llmfoundry-test
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=8G
#SBATCH --gres=gpu:8
#SBATCH --output=slurm-logs/%x-%j.out
GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
MASTER_PORT=19963
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
WORK_DIR="/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/llm-foundry"
export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_LAUNCH_BLOCKING=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=INFO
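# Note: the next line sets RANK to the node count; per-process RANK is normally
# assigned by the launcher itself, so this export is most likely ignored.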
export RANK=$NNODES
export WORLD_SIZE=$WORLD_SIZE
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export LOCAL_WORLD_SIZE=$GPUS_PER_NODE
export NUM_NODES=$NNODES
export LAUNCHER="composer --world_size $WORLD_SIZE \
--master_addr $MASTER_ADDR \
--master_port 19963"
export CMD="$WORK_DIR/scripts/train/train.py \
$WORK_DIR/scripts/train/yamls/pretrain/llama3-8b.yaml"
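
# One srun task per node (--ntasks-per-node=1); each task is meant to start the
# launcher with its node rank taken from SLURM_PROCID.
# Caveat: inside double quotes, $SLURM_PROCID expands when this command string is
# built by the batch script (typically to 0), not per task; escaping it as
# \$SLURM_PROCID or single-quoting defers expansion to each node.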
srun \
--container-image /mnt/datafs/ib-a100-cluster-a-pri/lmt/images/wavy-llm-foundry-v0.10.0.sqsh \
--container-mounts /mnt/datafs:/mnt/datafs \
--container-workdir $WORK_DIR \
--jobid $SLURM_JOBID \
bash -c "export NODE_RANK=$SLURM_PROCID && $LAUNCHER --node_rank $SLURM_PROCID $CMD \
save_folder=/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/checkpoints/composer/llama3-8b-slurm"
```

However, an error was thrown, so I tried switching the launcher to torchrun:

```bash
# export LAUNCHER="composer --world_size $WORLD_SIZE \
# --master_addr $MASTER_ADDR \
# --master_port 19963"
export LAUNCHER="torchrun \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
--rdzv_backend c10d"
```
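For reference, the torchrun-based launcher would plug into the same srun call roughly like this (an untested sketch that reuses the container flags and the $LAUNCHER/$CMD variables defined above):

```bash
# Sketch: one srun task per node; torchrun spawns GPUS_PER_NODE ranks on each.
srun \
  --container-image /mnt/datafs/ib-a100-cluster-a-pri/lmt/images/wavy-llm-foundry-v0.10.0.sqsh \
  --container-mounts /mnt/datafs:/mnt/datafs \
  --container-workdir $WORK_DIR \
  --jobid $SLURM_JOBID \
  bash -c "$LAUNCHER $CMD"
# With --rdzv_backend c10d, torchrun derives each node's rank from the rendezvous
# at $MASTER_ADDR:$MASTER_PORT, so no explicit node rank has to be passed.
```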
Ah, looks like an issue on a shared fs (see #1253 (comment) for more discussion of this). I haven't quite finished fixing that yet.
Could you try this PR: #1381? You may also need composer with this PR: mosaicml/composer#3485.
@dakinggg Thanks! I'll try with those PRs.
@dakinggg It seems that #1381 was reverted in 221d3e2. I tried pulling the latest docker image (mosaicml/llm-foundry:2.3.1_cu121-e882658), but I am still getting this error when trying to run in a multi-node setting:

```
[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
```

Is this expected? Thanks in advance!
Yes, we will reapply it soon, but you can still try with that PR. The unhandled system error seems different, though, and suggests your distributed env is not set up correctly.
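A quick, generic way to sanity-check the distributed environment (not a command from this thread) is to print the launch-related variables from every node in the allocation before starting training:

```bash
# Run one task per node and print what each node sees.
srun --ntasks-per-node=1 bash -c 'echo "$(hostname): SLURM_PROCID=$SLURM_PROCID MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT WORLD_SIZE=$WORLD_SIZE"'
# Every node should agree on MASTER_ADDR/MASTER_PORT/WORLD_SIZE and report a
# distinct SLURM_PROCID; disagreements are a common cause of rendezvous failures.
```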
Original issue text:

Hi, I was trying to run multi-node training on slurm nodes, but I have no idea how to configure the `composer` arguments and commands. Is there any example script to run training on slurm nodes with `composer`?