Enable Nvidia GPU Support #57

Draft · wants to merge 4 commits into main

Conversation

@dsloanm (Contributor) commented Dec 20, 2024

Draft PR as there are outstanding actions:

  • Tests need to be written.
  • Accounting configuration needs to be added: AccountingStorageTRES=gres/gpu, etc. (see the sketch after this list).
  • Better logging and status messages are needed.
    • Especially after the unit reboot that follows driver installation: if slurmd and slurmctld are not related, the unit status is set to "slurmd not running" instead of "relation needed".
  • Approaches for package installation, reboot detection and (lack of) NVLink detection are not yet finalised.
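
For reference, the accounting configuration mentioned above would likely amount to slurm.conf settings along these lines (a rough sketch only, not the final configuration; exact values depend on how accounting is set up for the cluster):

# slurm.conf sketch: count GPUs as a trackable resource (TRES) in accounting
GresTypes=gpu
AccountingStorageTRES=gres/gpu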

This PR enables use of Nvidia GPUs on a Charmed HPC cluster. The slurmd charm is extended to perform automated GPU detection and driver installation. The slurmctld charm is extended to be aware of GPU-enabled compute nodes and to provide the necessary configuration in slurm.conf and the new gres.conf configuration file.
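
For reviewers, the gres.conf generation boils down to roughly the following (an illustrative shell sketch of the idea only, not the charm code itself; the single-GPU assumption and the /dev/nvidia0 path are taken from the sample output further down):

# Sketch: derive a gres.conf entry from the detected GPU on a single-GPU node.
# The model name is lowercased and spaces become underscores to form the Slurm
# "Type" string, e.g. "Tesla T4" -> "tesla_t4".
model="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1 \
  | tr '[:upper:]' '[:lower:]' | tr ' ' '_')"
echo "NodeName=$(hostname) Name=gpu Type=${model} File=/dev/nvidia0" >> gres.conf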

Usage

# Assuming a bootstrapped AWS juju controller and a new model
# Deploy controller, login node and compute node
juju deploy slurmctld
juju deploy sackd
# Testing has been performed on AWS g4dn.xlarge instances, each equipped with an Nvidia Tesla T4 GPU
juju deploy --constraints="instance-type=g4dn.xlarge" slurmd compute-gpu

# Relate applications
juju integrate slurmctld:login-node sackd:slurmctld
juju integrate slurmctld:slurmd compute-gpu:slurmctld

# Wait for deployment to complete

# Bring up compute node
juju run compute-gpu/0 node-configured

# Connect to login node and submit a GPU job
juju ssh sackd/0
sudo su -

# Create a new job script.
# Note the "#SBATCH --gres=gpu:tesla_t4:1" requesting a GPU.
cat << 'EOF' > test.submit
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=slurmd
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:00:30
#SBATCH --error=test.%J.err
#SBATCH --output=test.%J.out
#SBATCH --gres=gpu:tesla_t4:1

echo "Hello from GPU-enabled node `hostname`"
echo "gres.conf contents:"
cat /run/slurm/conf/gres.conf
EOF

# Submit and check output
sbatch test.submit
logout
juju ssh compute-gpu/0
sudo su -
cat test.1.out

# Sample output:
#
#   Hello from GPU-enabled node ip-172-31-2-0
#   gres.conf contents:
#   NodeName=ip-172-31-2-0 Name=gpu Type=tesla_t4 File=/dev/nvidia0
#
# A GPU-enabled job was submitted, scheduled and executed. Success!
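
To double-check that the scheduler actually allocated the GPU (rather than the job merely landing on a GPU-equipped node), something along these lines can be run from the login node (illustrative commands; the node name, partition and GPU type are taken from the sample above and will differ per deployment):

# Check that the node advertises the GPU gres
scontrol show node ip-172-31-2-0 | grep -i gres

# Run nvidia-smi inside an allocation that explicitly requests the GPU
srun --partition=slurmd --gres=gpu:tesla_t4:1 nvidia-smi -L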

@jamesbeedy (Contributor) commented Dec 26, 2024

Hey @dsloanm, really nice! I only have one concern: auto-rebooting. Juju has default model configs auto-update: true, auto-upgrade: true; oftentimes, depending on how dated the boot image is, there are kernel and/or driver updates that will automatically install when the charm initializes and put the machine into a reboot-required state. I don't see any issues with what you have done here... just wondering if we want to account for this needs-reboot case separately instead of rolling it into the GPU context.

In revs past, we put the application into a blocked status if the machine needed a reboot, so we could reboot onto the new kernel before installing device drivers. That avoided installing drivers against the outgoing kernel (as happens if you install a new kernel and then install other drivers before rebooting into it).

I'm not sure whether the Nvidia drivers installed via apt will have this issue, but we hit it previously when another charm installed InfiniBand drivers: the charm installed drivers for the running, outgoing kernel.
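
For reference, that "block until rebooted" pattern boils down to something like this early in a hook (a rough illustrative sketch, not the actual code from those revisions; the status message is made up):

# Sketch: refuse to install drivers against an outgoing kernel
if [[ -f /var/run/reboot-required ]]; then
    status-set blocked "Reboot required before GPU driver installation"
    exit 0
fi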

Thoughts?

@jamesbeedy (Contributor) commented Dec 26, 2024

Possibly a conditional reboot of the machine in the dispatch file would be best so we can catch it before any charm code actually runs?

Something like:

#!/bin/bash
# Filename: dispatch

# One-time check: reboot before any charm code runs if the image needs it
if ! [[ -f '.init-reboot' ]]
then
	if [[ -f '/var/run/reboot-required' ]]
	then
		# Stop here so charm code never runs against the outgoing kernel;
		# /var/run/reboot-required is cleared by the reboot, so dispatch
		# falls through to the charm once the machine is back up.
		reboot
		exit 0
	fi

	touch .init-reboot
fi

JUJU_DISPATCH_PATH="${JUJU_DISPATCH_PATH:-$0}" PYTHONPATH=lib:venv /usr/bin/env python3 ./src/charm.py

@jamesbeedy (Contributor)

Side note (if you do modify the dispatch file): since we now build NHC in the charmcraft build step, we probably don't need to install make in the dispatch file with apt anymore.

if ! [[ -f '.installed' ]]
then
    # Necessary to compile and install NHC
    apt-get install --assume-yes make
    touch .installed
fi

^ can be safely removed.
