Oscar has two DGX H100 nodes. The H100 is based on the Nvidia Hopper architecture, which accelerates the training of AI models. The two DGX nodes provide better performance when multiple GPUs are used, in particular with Nvidia software such as NGC containers.
{% hint style="info" %} Multi-Instance GPU (MIG) is not enabled on the DGX H100 nodes. {% endhint %}
Each DGX H100 node has 112 Intel CPU cores, 2 TB of memory, and 8 Nvidia H100 GPUs. Each H100 GPU has 80 GB of memory.
The two DGX H100 nodes are in the gpu-he partition. To access H100 GPUs, users need to submit jobs to the gpu-he partition and request the h100 feature, i.e.,
```
#SBATCH --partition=gpu-he
#SBATCH --constraint=h100
```
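For example, a minimal batch script requesting a single H100 GPU might look like the sketch below. It assumes the standard Slurm `--gres` syntax for requesting GPUs; the job name, time limit, and GPU count are placeholders to adjust for your workload.

```bash
#!/bin/bash
#SBATCH --partition=gpu-he        # partition containing the DGX H100 nodes
#SBATCH --constraint=h100         # request the h100 feature
#SBATCH --gres=gpu:1              # number of GPUs (up to 8 per node)
#SBATCH --time=01:00:00           # walltime (placeholder; adjust as needed)
#SBATCH --job-name=h100-test      # placeholder job name

# Confirm that an H100 GPU is visible to the job
nvidia-smi
```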
NGC containers provide the best performance on the DGX H100 nodes. Running a TensorFlow NGC container is one example of running NGC containers on these nodes.
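As a minimal sketch, assuming Apptainer is used to run containers on Oscar and that the NGC TensorFlow image tag below is still published (check the NGC catalog for current tags), a container could be pulled and run as follows:

```bash
# Pull the NGC TensorFlow image (the tag is an example; check the NGC catalog)
apptainer pull tensorflow.sif docker://nvcr.io/nvidia/tensorflow:23.10-tf2-py3

# Run a quick GPU check inside the container; --nv exposes the host GPUs
apptainer exec --nv tensorflow.sif \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```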
The two DGX nodes have Intel CPUs, so existing Oscar modules can still be loaded and run on them.
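For instance, modules can be searched for and loaded inside a job on a DGX node as usual (the module name below is a placeholder; use `module avail` to see what is installed):

```bash
# Search for a module, then load it (name is a placeholder)
module avail cuda
module load cuda
```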