Enable Nvidia GPU Support #57

Draft · wants to merge 4 commits into main

Conversation

@dsloanm (Contributor) commented Dec 20, 2024

Draft PR as there are outstanding actions:

  • Tests need to be written.
  • Accounting configuration needs to be added: AccountingStorageTRES=gres/gpu, etc. (see the sketch after this list).
  • Better logging and status messages are needed.
    • Especially after the unit reboot that follows driver installation: if slurmd and slurmctld are not related, the unit status is set to "slurmd not running" instead of "relation needed".
  • Approaches for package installation, reboot detection and (lack of) NVLink detection are not yet finalised.
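
For reference, the accounting configuration mentioned above would likely amount to slurm.conf settings along these lines (a rough sketch only, not the final configuration; exact values depend on how accounting is set up for the cluster):

# slurm.conf sketch: count GPUs as a trackable resource (TRES) in accounting
GresTypes=gpu
AccountingStorageTRES=gres/gpu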

This PR enables use of Nvidia GPUs on a Charmed HPC cluster. The slurmd charm is extended to perform automated GPU detection and driver installation. The slurmctld charm is extended to be aware of GPU-enabled compute nodes and to provide the necessary configuration in slurm.conf and the new gres.conf configuration file.
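
For reviewers, the gres.conf generation boils down to roughly the following (an illustrative shell sketch of the idea only, not the charm code itself; the single-GPU assumption and the /dev/nvidia0 path are taken from the sample output further down):

# Sketch: derive a gres.conf entry from the detected GPU on a single-GPU node.
# The model name is lowercased and spaces become underscores to form the Slurm
# "Type" string, e.g. "Tesla T4" -> "tesla_t4".
model="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1 \
  | tr '[:upper:]' '[:lower:]' | tr ' ' '_')"
echo "NodeName=$(hostname) Name=gpu Type=${model} File=/dev/nvidia0" >> gres.conf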

Usage

# Assuming a bootstrapped AWS juju controller and a new model
# Deploy controller, login node and compute node
juju deploy slurmctld
juju deploy sackd
# Testing has been performed on AWS g4dn.xlarge instances, each equipped with an Nvidia Tesla T4 GPU
juju deploy --constraints="instance-type=g4dn.xlarge" slurmd compute-gpu

# Relate applications
juju integrate slurmctld:login-node sackd:slurmctld
juju integrate slurmctld:slurmd compute-gpu:slurmctld

# Wait for deployment to complete

# Bring up compute node
juju run compute-gpu/0 node-configured

# Connect to login node and submit a GPU job
juju ssh sackd/0
sudo su -

# Create a new job script.
# Note the "#SBATCH --gres=gpu:tesla_t4:1" requesting a GPU.
cat << 'EOF' > test.submit
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=slurmd
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:00:30
#SBATCH --error=test.%J.err
#SBATCH --output=test.%J.out
#SBATCH --gres=gpu:tesla_t4:1

echo "Hello from GPU-enabled node `hostname`"
echo "gres.conf contents:"
cat /run/slurm/conf/gres.conf
EOF

# Submit and check output
sbatch test.submit
logout
juju ssh compute-gpu/0
sudo su -
cat test.1.out

# Sample output:
#
#   Hello from GPU-enabled node ip-172-31-2-0
#   gres.conf contents:
#   NodeName=ip-172-31-2-0 Name=gpu Type=tesla_t4 File=/dev/nvidia0
#
# A GPU-enabled job was submitted, scheduled and executed. Success!
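
To double-check that the scheduler actually allocated the GPU (rather than the job merely landing on a GPU-equipped node), something along these lines can be run from the login node (illustrative commands; the node name, partition and GPU type are taken from the sample above and will differ per deployment):

# Check that the node advertises the GPU gres
scontrol show node ip-172-31-2-0 | grep -i gres

# Run nvidia-smi inside an allocation that explicitly requests the GPU
srun --partition=slurmd --gres=gpu:tesla_t4:1 nvidia-smi -L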

@jamesbeedy (Contributor) commented Dec 26, 2024

Hey @dsloanm, really nice! I only have one concern: auto-rebooting. Juju has default model configs auto-update: true, auto-upgrade: true; oftentimes, depending on how dated the boot image is, there are kernel and/or driver updates that will automatically install when the charm initializes and put the machine into a reboot-required state. I don't see any issues with what you have done here... just wondering if we want to account for this needs-reboot case separately instead of rolling it into the GPU context.

In revs past, we put the application into a blocked status if the machine needed a reboot, so we could reboot onto the new kernel before installing device drivers. That avoided installing drivers against the outgoing kernel (as happens if you install a new kernel and then install other drivers before rebooting into it).

I'm not sure whether the Nvidia drivers installed via apt will have this issue, but we hit it previously when another charm installed InfiniBand drivers: the charm installed drivers for the running, outgoing kernel.
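
For reference, that "block until rebooted" pattern boils down to something like this early in a hook (a rough illustrative sketch, not the actual code from those revisions; the status message is made up):

# Sketch: refuse to install drivers against an outgoing kernel
if [[ -f /var/run/reboot-required ]]; then
    status-set blocked "Reboot required before GPU driver installation"
    exit 0
fi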

Thoughts?

@jamesbeedy (Contributor) commented Dec 26, 2024

Possibly a conditional reboot of the machine in the dispatch file would be best so we can catch it before any charm code actually runs?

Something like:

#!/bin/bash
# Filename: dispatch

# One-time check: reboot before any charm code runs if the image needs it
if ! [[ -f '.init-reboot' ]]
then
	if [[ -f '/var/run/reboot-required' ]]
	then
		# Stop here so charm code never runs against the outgoing kernel;
		# /var/run/reboot-required is cleared by the reboot, so dispatch
		# falls through to the charm once the machine is back up.
		reboot
		exit 0
	fi

	touch .init-reboot
fi

JUJU_DISPATCH_PATH="${JUJU_DISPATCH_PATH:-$0}" PYTHONPATH=lib:venv /usr/bin/env python3 ./src/charm.py

@jamesbeedy (Contributor)

Side note (if you do modify the dispatch file): since we now build NHC in the charmcraft build step, we probably don't need to install make in the dispatch file with apt anymore.

if ! [[ -f '.installed' ]]
then
    # Necessary to compile and install NHC
    apt-get install --assume-yes make
    touch .installed
fi

^ can be safely removed.
