Frequently asked questions
Check the Slurm MPI Users Guide: Slurm is responsible for launching the tasks, so `mpirun` is not needed.
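For instance, a typical multi-node launch might look like this (a minimal sketch; the node count, image, and `my_mpi_app` are placeholders):
# From the login-node: Slurm starts the MPI tasks directly, no mpirun involved
$ salloc -N2 --ntasks-per-node=4
$ srun --mpi=pmix --container-image=ubuntu ./my_mpi_app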
How do I configure pyxis for multi-node workloads through PMIx?
Make sure you configure enroot with the extra PMIx hook, as described in the enroot configuration page. If it doesn't work, check the slurmd configuration.
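On a typical installation, enabling the hook and verifying PMIx support might look like this (the hook path assumes enroot's default packaging; adjust for your install):
# Enable enroot's PMIx hook by copying it into the active hooks directory
$ sudo cp /usr/share/enroot/hooks.d/50-slurm-pmix.sh /etc/enroot/hooks.d/
# List the MPI plugin types Slurm supports; pmix should appear
$ srun --mpi=list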
Under a PMIx allocation, i.e. `srun --mpi=pmix`, you can only do a single `MPI_Init`. In other words, you can't have `srun` execute a script that launches multiple MPI applications in sequence. Instead, you can save the container state with `--container-name` and then do multiple invocations of `srun`, one for each MPI application:
# From the login-node:
$ salloc -N2
$ srun --container-name=tf --container-image=tensorflow bash -c 'apt-get update && apt-get install -y ...'
$ srun --mpi=pmix --container-name=tf mpiapp1 ....
$ srun --mpi=pmix --container-name=tf mpiapp2 ....
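Note that `--container-name` persists the container filesystem across job steps of the same allocation, which is what lets the package installation and the two MPI runs above share state.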
Under a PMIx allocation, you can only do a single `MPI_Init` (see above). In addition, `MPI_Comm_spawn` is known to not be available with PMIx and Slurm.
This is a known issue in older versions of Slurm when using `srun --pty`. We recommend using at least Slurm 20.02.5 and pyxis 0.8.1 to solve this problem.
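If you are unsure which Slurm version is deployed, a quick check might look like this:
# Print the installed Slurm version
$ srun --version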
You can do `sbatch --container-image` with pyxis 0.12. It will run the sbatch script inside the container; as a result, you will not be able to use `srun` from within the containerized `sbatch` script.
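A minimal sketch of such a submission (the script name `job.sh` is a placeholder):
# The entire batch script executes inside the container; it must not call srun itself
$ sbatch --container-image=nvidia/cuda:12.4.1-base-ubuntu22.04 job.sh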
Enroot does not create a network namespace for the container, so you don't need to "publish" ports as you would with Docker; networking is no different from running outside the container, or from Docker's `--network=host`. However, as an unprivileged user you won't be able to listen on privileged ports (ports 1 to 1023 by default).
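For example, an unprivileged user can bind a high port directly (the image and port below are illustrative):
# Serves on port 8080 of the host network; binding port 80 would be denied
$ srun --container-image=python python3 -m http.server 8080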
For example, with `ENROOT_RUNTIME_PATH ${XDG_RUNTIME_DIR}/enroot` in `enroot.conf`:
$ srun --export NVIDIA_VISIBLE_DEVICES=0 --container-image nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
pyxis: importing docker image: nvidia/cuda:12.4.1-base-ubuntu22.04
slurmstepd: error: pyxis: child 1692947 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: /usr/bin/enroot: line 44: HOME: unbound variable
slurmstepd: error: pyxis: /usr/bin/enroot: line 44: XDG_RUNTIME_DIR: unbound variable
slurmstepd: error: pyxis: mkdir: cannot create directory '/run/enroot': Permission denied
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: child 1692966 failed with error code: 1
In this case, the issue is that `--export` will unset all other environment variables from the user environment, and only set `NVIDIA_VISIBLE_DEVICES=0`. It is recommended to add the `ALL` option when using `--export`:
$ srun --export ALL,NVIDIA_VISIBLE_DEVICES=0 --container-image nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
pyxis: importing docker image: nvidia/cuda:12.4.1-base-ubuntu22.04
pyxis: imported docker image: nvidia/cuda:12.4.1-base-ubuntu22.04
GPU 0: NVIDIA GeForce RTX 3070 (UUID: GPU-acce903c-39ee-787e-3dbc-f1d82df43fe7)
This behavior can be surprising for users familiar with Docker, as the `--export` argument of Slurm does not behave like the `--env` argument of Docker Engine.
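A side-by-side sketch of the two behaviors (commands are illustrative):
# Docker: --env adds one variable on top of the existing container environment
$ docker run --env NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
# Slurm: --export replaces the job environment unless ALL is included
$ srun --export=ALL,NVIDIA_VISIBLE_DEVICES=0 --container-image=nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L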