Enable Nvidia GPU Support #57
Done temporarily to get GPU support working. Undo this once `slurm_ops` is properly updated.
Hey @dsloanm, really nice! I only have one concern: auto-rebooting. Juju has default model configs. In past revisions, we put the application into a blocked status if the machine needed a reboot, so that we could reboot into the new kernel before installing device drivers. That way we didn't install drivers for the outgoing kernel (as happens if you install a kernel and then install other drivers before rebooting into it). I'm not sure whether the Nvidia drivers installed via apt will have this issue, but we hit it previously when installing InfiniBand drivers with another charm, where the charm installed drivers for the running, outgoing kernel. Thoughts?
Possibly a conditional reboot of the machine in the dispatch file would be best, so we can catch it before any charm code actually runs? Something like:
#!/bin/bash
# Filename: dispatch
if ! [[ -f '.init-reboot' ]]
then
    if [[ -f '/var/run/reboot-required' ]]
    then
        reboot
    fi
    touch .init-reboot
fi
JUJU_DISPATCH_PATH="${JUJU_DISPATCH_PATH:-$0}" PYTHONPATH=lib:venv /usr/bin/env python3 ./src/charm.py
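A minimal sketch of the same first-run guard, factored into a function so it can be dry-run without actually rebooting. The function name `init_reboot_guard` and the `MARKER`, `REBOOT_FLAG`, and `REBOOT_CMD` variables are illustrative assumptions, not part of the charm. One deliberate difference from the snippet above: the marker is written before the reboot command is issued, on the assumption that `reboot` may return before the machine goes down, so the guard never fires twice.

```bash
#!/bin/bash
# Sketch only: first-run reboot guard for a dispatch script.
# MARKER, REBOOT_FLAG and REBOOT_CMD are hypothetical overrides for testing.

init_reboot_guard() {
    local marker="${MARKER:-.init-reboot}"
    local reboot_flag="${REBOOT_FLAG:-/var/run/reboot-required}"
    local reboot_cmd="${REBOOT_CMD:-reboot}"

    # The guard has already run once on this unit: nothing to do.
    if [[ -f "$marker" ]]; then
        return 0
    fi

    # Record that the guard ran before issuing the reboot,
    # so the check is not repeated after the machine comes back.
    touch "$marker"

    # Reboot only if the system has flagged a pending reboot
    # (for example, after a kernel upgrade).
    if [[ -f "$reboot_flag" ]]; then
        # Intentionally unquoted so REBOOT_CMD may carry arguments.
        $reboot_cmd
    fi
}
```

In a dispatch file this would be called once, before the line that launches `./src/charm.py`. For a dry run, something like `REBOOT_CMD="echo reboot-requested" init_reboot_guard` prints the decision instead of rebooting.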
Side note (if you do modify the dispatch file):
if ! [[ -f '.installed' ]]
then
    # Necessary to compile and install NHC
    apt-get install --assume-yes make
    touch .installed
fi
^ can be safely removed.
This PR enables use of Nvidia GPUs on a Charmed HPC cluster. The `slurmd` charm is extended to perform automated GPU detection and driver installation. The `slurmctld` charm is extended to be aware of GPU-enabled compute nodes and to provide the necessary configuration in `slurm.conf` and the new `gres.conf` configuration file.
Usage