Running on JLab ifarm GPUs

This note briefly explains the necessary steps to run any code that requires GPUs on the JLab ifarm. Nearly all code snippets shown here have been taken from this excellent source.

Prerequisites

In order for all this to work, you need a JLab ifarm account, which is explained here in more detail. After that, follow the instructions in the README.md and create your environment; let's call it tomography_env.

Running interactively

First, log onto the ifarm:

ssh -Y <user_name>@scilogin.jlab.org (enter two-factor authentication) 
ssh ifarm (enter your password)

Now you are able to allocate nodes with GPUs. Let's say you require 3 GPUs on one node, you need 128 GB of memory, and your script will run for about 5 hours. You would run:

salloc --gres gpu:TitanRTX:3 --partition gpu --nodes 1 --time=05:00:00 --mem=128GB (allocate the required resources)
srun --pty bash (request interactive node)

Now you have accessed the node and need to set up your environment so that you can run your code. To activate the environment, run:

conda activate <YOUR_ENVIRONMENT>

If you plan to use TensorFlow with GPU support, you additionally need to run these two commands:

CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH
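
To check that TensorFlow actually picks up the allocated GPUs after setting these paths, a minimal sketch (assuming TensorFlow is installed in your environment) looks like this:

import tensorflow as tf

# List the GPUs TensorFlow can see; this should match the number of GPUs you allocated.
gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")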

Now you are all set up and you can run your script:

python <full_path_to_your_script>/<your_script>.py

Check if your job is running via:

sacct

Or monitor the GPU usage:

nvidia-smi
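
If you prefer to check the allocation from inside Python rather than with nvidia-smi, a small sketch using PyTorch (assuming it is available in your environment) could look like this:

import torch

# Report whether CUDA is available and list the devices visible to this job.
print(f"CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")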

Running via Batch System

If you intend to run multiple jobs, the above method becomes tedious very quickly. A more elegant way is to submit jobs via sbatch using a bash script; let's call it submission_script.sh. Running with the same resources as above, the submission script should look like this:

#!/usr/bin/env bash                                                                                             
#SBATCH --partition=gpu (You wish to run on GPU)                                                                                     
#SBATCH --gres=gpu:TitanRTX:3 (You request 3 TitanRTX GPUs)
#SBATCH --time=05:00:00 (Your jobs will last no longer than 5 h)                                                                                      
#SBATCH --mem=128GB (You have 128GB at your disposal)                                                                                            
#SBATCH --nodes=1 (You are running on one node only)                                                                                                                                                                                  
#SBATCH --output=./<name_of_logfile>.log  (All error messages and your script output will be stored in that .log file)                                                                        
#SBATCH --job-name=<name_of_your_job> (Name of your job)

# Make the shell aware that you intend to use conda --> this is needed because Slurm starts a fresh, non-interactive shell:
source /etc/profile.d/conda.sh
  
# Activate your environment
source activate <YOUR_ENVIRONMENT>

# Make sure that CUDA/cuDNN is available for TensorFlow
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH

# Run your script
python <full_path_to_your_script>/<your_script>.py

As one can see, we are using the exact same commands that were used for the interactive session. To submit your job to the ifarm, run:

sbatch submission_script.sh

and monitor it:

squeue -u <user_name>

Using multiple Nodes and/or GPUs via mpi4py

Your script might want to execute multiple tasks in parallel. One common tool for this is mpi4py. You can either install it with mpich:

conda install -c conda-forge mpi4py mpich

or with (the very common) openmpi:

conda install -c conda-forge mpi4py openmpi

I tried both methods and they worked equally well for me. More details about installation options can be found here.
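
To verify that the installation works, you can run a minimal mpi4py test script; the file name mpi_test.py below is just a placeholder. Launch it with, e.g., mpirun -n 3 python mpi_test.py:

# mpi_test.py -- minimal sketch to confirm that mpi4py works
from mpi4py import MPI

comm = MPI.COMM_WORLD      # communicator spanning all ranks
rank = comm.Get_rank()     # rank of this process
size = comm.Get_size()     # total number of ranks
print(f"Hello from rank {rank} of {size}")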

If we go back to the example above: we wish to use 3 GPUs, and we want each rank to have exactly one GPU. In interactive mode, we need to modify our commands to:

salloc --gres gpu:TitanRTX:3 --partition gpu --nodes 1 --ntasks-per-node=3 --time=05:00:00 --mem=128GB (allocate the required resources)
srun --pty bash 
# If you installed mpi4py with openmpi:
mpirun --mca btl_tcp_port_min_v4 32768 --mca btl_tcp_port_range_v4 28230 python <full_path_to_your_script>/<your_script>.py
# Or if you installed mpi4py with mpich:
mpirun python <full_path_to_your_script>/<your_script>.py

Similarly, the batch submission script needs to be changed to:

#!/usr/bin/env bash                                                                                             
#SBATCH --partition=gpu (You wish to run on GPU)                                                                                     
#SBATCH --gres=gpu:TitanRTX:3 (You request 3 TitanRTX GPUs)
#SBATCH --time=05:00:00 (Your jobs will last no longer than 5 h)                                                                                      
#SBATCH --mem=128GB (You have 128GB at your disposal)                                                                                            
#SBATCH --nodes=1 (You are running on one node only)   
#SBATCH --ntasks-per-node=3 (Run 3 tasks on the node, i.e. one per GPU)
#SBATCH --output=./<name_of_logfile>.log  (All error messages and your script output will be stored in that .log file)                                                                        
#SBATCH --job-name=<name_of_your_job> (Name of your job)

# Make the shell aware that you intend to use conda --> this is needed because Slurm starts a fresh, non-interactive shell:
source /etc/profile.d/conda.sh
  
# Activate your environment
source activate <YOUR_ENVIRONMENT>

# Make sure that CUDA/cuDNN is available for TensorFlow
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH

# Run your script via mpi
# openmpi installation:
mpirun --mca btl_tcp_port_min_v4 32768 --mca btl_tcp_port_range_v4 28230 python <full_path_to_your_script>/<your_script>.py
# or mpich installation:
mpirun python <full_path_to_your_script>/<your_script>.py

Now we make things a bit more interesting and try to run on 2 nodes with 4 GPUs each. The corresponding allocation looks like:

salloc --gres gpu:TitanRTX:4 --ntasks-per-node=4 --partition gpu --nodes 2 --time=05:00:00 --mem=128GB
srun --pty bash
# If you installed mpi4py with openmpi:
mpirun --mca btl_tcp_port_min_v4 32768 --mca btl_tcp_port_range_v4 28230 python <full_path_to_your_script>/<your_script>.py
# Or if you installed mpi4py with mpich:
mpirun python <full_path_to_your_script>/<your_script>.py

Your bash script will change to:

#!/usr/bin/env bash                                                                                             
#SBATCH --partition=gpu                                                                                         
#SBATCH --gres=gpu:TitanRTX:4
#SBATCH --time=05:00:00                                                                                         
#SBATCH --mem=128GB     
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=2                                                                                                                                                                                
#SBATCH --output=./<name_of_logfile>.log                                                                          
#SBATCH --job-name=<name_of_your_job>    

# Make the shell aware that you intend to use conda --> this is needed because Slurm starts a fresh, non-interactive shell:
source /etc/profile.d/conda.sh
  
# Activate your environment
source activate <YOUR_ENVIRONMENT>

CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH

# mpi4py installed with openmpi:
mpirun --mca btl_tcp_port_min_v4 32768 --mca btl_tcp_port_range_v4 28230 python <full_path_to_your_script>/<your_script>.py

# mpi4py installed with mpich:
mpirun python <full_path_to_your_script>/<your_script>.py

The additional options --mca btl_tcp_port_min_v4 32768 --mca btl_tcp_port_range_v4 28230 ensure that MPI communication works across different nodes. Technically, this addition should not be needed, but for some reason it is required to run on multiple JLab nodes.

A Simple Example

The lines below show a simple script that requires 2 nodes with 4 GPUs each and registers a tensor on each GPU:

from mpi4py import MPI
import torch

# Get MPI related info:                                                                                                                                                                              
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Get the number of available GPUs:                                                                                                                                                                  
n_gpus = torch.cuda.device_count()

# Make sure that we assign the correct gpu to the corresponding rank:                                                                                                                                
gpu_idx = rank
if rank >= n_gpus:
    gpu_idx = rank-n_gpus

# Register a torch tensor on a GPU:                                                                                                                                                                  
dev = "cuda:"+str(gpu_idx)
tensor = torch.tensor([0.0,1.1,2.2],device=dev,dtype=torch.float32)

print(f"This is rank {rank} with device {tensor.device}")

Running this script via mpirun will show 8 ranks: ranks 0-3 use CUDA devices 0, 1, 2, 3, and ranks 4-7 also use CUDA devices 0, 1, 2, 3. This is because the first four ranks run on the first node and the last four ranks run on the second node, and each node has 4 GPUs. The code snippets above were tested with MPI version: mpirun (Open MPI) 4.0.4.

Resources

If you do not possess a working conda environment, do not worry. Enter the following directory on the ifarm:

cd /w/data_science-sciwork18/conda_ifarm_envs/

There you will find a yml file that covers the basics (e.g. tensorflow, pytorch, torchmetrics, scikit, mpi with mpich, ...):

ifarm_conda_env_27june2024.yml

You can create your very own conda environment via:

conda env create --name <NAME_YOUR_ENVIRONMENT> -f ifarm_conda_env_27june2024.yml

This may take a while (as in a "grab a coffee and work on something else" kind of while). Once this is done, you should be able to run most of your projects.
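
Once the environment is created, a quick hedged check (assuming the packages listed in the yml file installed successfully) confirms that the GPU-enabled libraries are importable; run it on a GPU node:

# Minimal sketch: confirm the main packages are importable and detect the GPUs.
import tensorflow as tf
import torch
from mpi4py import MPI

print("TensorFlow:", tf.__version__, "| GPUs:", len(tf.config.list_physical_devices("GPU")))
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("MPI ranks:", MPI.COMM_WORLD.Get_size())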