Add tests for DSS on NVIDIA GPUs and only CPUs (New) (#1609)

Changes to test jobs

- Individual test jobs that used to check for available GPUs have been replaced by the graphics_card resource, which allows the respective tests to be skipped when the relevant GPUs are not available.
- Tests have been added to check setting up DSS for using CUDA with NVIDIA GPUs, followed by some simple tests to see if PyTorch and TensorFlow can actually use the GPUs.
- Tests have been added to run only on the CPU. These tests will be run on all machines irrespective of available GPUs.
- Shell scripts have been refactored to have more re-usable functions.
- Tests verifying Intel GPUs were updated since they had a bug where they started counting NVIDIA GPUs as Intel GPUs too. The tests are less precise now: they check that at least a minimum expected GPU count and capacity are available, instead of looking for exact counts.

Changes to the test plan

- Changes related to resource.pxu and additional NVIDIA tests, as explained above.
- Unused test plans for individually testing ITEX or IPEX have been removed since they get tested in the main test plan anyway.

Changes to the snap

- The command to trigger the tests from the checkbox-dss snap produced with the provider has been changed from validate-intel-gpu to validate-with-gpu.
- The install-deps command, which installs all the dependencies for running the tests, has been refactored and now accepts the version (snap channel) of each of the main snaps to be installed, which currently include DSS itself, Microk8s, and kubectl.
- These are backwards-incompatible changes to the snap, hence its version has been bumped from 2.0 to 3.0, and changes have been made to the relevant snapcraft.yaml and to the README.

Changes to the relevant GitHub workflow

- The GitHub workflow for running DSS has been refactored to need a single job definition that can be used for all the values from the test matrix.
- An NVIDIA DGX machine has been added as a target machine, representing a machine that has only NVIDIA GPUs and no Intel GPUs.
- Multiple Microk8s versions have been added to the test matrix.

Full Changelog

* add jobs to DSS validation for setup and test on NVIDIA GPUs
  For the moment we lump it together in the validate-intel-gpu launcher... more refactoring coming
* fix cuda test for tensorflow and give more time for things to settle
* fix dependency of nvidia_gpu_addon/enable job
* fix wrong dependency for cuda jobs and make validation more reliable
* fix shebang to use control instead of remote in launcher script
* fix flaky gpu addon rollout checking in better order and more sleep
* make the GPU checking into resources to control GPU tests are run
* remove flaky mlflow deployed test
  This is covered by checking that DSS's status says 'MLFlow deployment: Ready'. The way the removed test was implemented assumed position of the service's name in the output and made it flaky, especially when re-running the tests.
* update other dss test-plans to use the GPU as resources
* reduce max_attempts for retry to 2
  Since many tests here depend on some resources to be available, specifically: GPUs from Intel or NVIDIA, not all tests are expected to pass on a given machine and hence we should not waste our time too much retrying these tests.
* add cpu-only tests for dss
* rename validate script to not contain intel and bump snap's version
* refactor testflinger job file builder to unify into one re-usable one
* add nvidia dgx as target machine for DSS testflinger jobs
* allow other workflow jobs in matrix to continue running if one fails
* add notebook removal tests and rename cases to be consistent
  Notebook removal is part of the CLI of DSS anyway, and makes sense to be tested. Nevertheless, the main reason to add these tests is so that the entire checkbox test plan can be repeated without having to uninstall everything; removing notebook resets DSS into a re-testable state.
* skip installing intel gpu plugin if it is already there
* remove unused itex- and ipex-only test plans
* rename check_dss.sh to check_dss for pseudo-fluent usage
* refactor remove notebook test to accept multiple arguments
* extract out notebook creation to reused function
* disable intel gpu capacity tests temporarily
  the tests fail on re-runs because they start counting nvidia gpus too
* rename test case for dss to be more fluid
* refactor checking dss status into reusable function
* add missing usage string for dss create notebook function
* use pushd popd instead of cd-ing to HOME in check dss
* rename check_cuda.sh to check_cuda to have a pseudo-fluent usage
* refactor cuda notebook tests to reusable script
* refactor out the notebook tests for cpu
* refactor out itex tests to common notebook script
  one redundant test job has been removed since the new test-case now implicitly tests importing itex as well
* refactor out ipex tests to common notebook script
  one redundant test job has been removed since the new test-case now implicitly tests importing ipex as well
* reformat long requires clauses to multi-line ones
* drop .sh extension from check_intel script
* fix failing intel gpu verification tests
  There seems to be a bug in the Intel GPU plugin where it starts counting NVIDIA GPUs too under its label once NVIDIA's plugin is enabled. The tests are now updated to check for matching the minimum slot count instead of an exact one.
* reduce sleep time in steps while enabling nvidia gpu addon
* fix help string for check_notebook
* refactor install-deps script allowing customization of microk8s and kubectl too
* add customized microk8s channels to github workflow for dss
* fix default dss_snap_channel to latest/stable instead of non-existent 1/stable
* add .sh extension back to the test runner scripts
  It helps to know which script is being run
* use graphics_card resource for checking GPU instead of own
* change to detecting GPU based on vendor
  the previous approach was checking for driver, but that does not work for NVIDIA GPUs because we don't install their drivers on the machine (the drivers are installed in the k8s operator).
* fix mention of default channel for DSS in the README
* remove unnecessary dss integration tests script (coming later)

Fix CHECKBOX-1586
Fix CHECKBOX-1668
canonical · Dec 3, 2024 · 355ecf0
1 parent 843f2a8
commit 355ecf0

Showing 15 changed files with 504 additions and 374 deletions.
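
The commit message says the Intel GPU verification now checks for a minimum slot count instead of an exact one. The updated check script is not part of the diffs shown here, so the following is only a minimal sketch of that idea; the function name `check_min_gpu_count` and its two arguments are hypothetical and not taken from the commit.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the commit): succeed when the observed GPU
# slot count is at least the expected minimum, instead of requiring equality.
check_min_gpu_count() {
    local observed="$1" minimum="$2"
    if [ "$observed" -ge "$minimum" ]; then
        echo "Test success: found ${observed} GPU slot(s), expected at least ${minimum}"
    else
        >&2 echo "Test failure: found ${observed} GPU slot(s), expected at least ${minimum}"
        return 1
    fi
}
```

With a check of this shape, a node that reports extra slots (for example because another vendor's GPUs are counted under the same label) still passes, while a node reporting fewer slots than expected fails.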
diff --git a/.github/workflows/testflinger-contrib-dss-regression.yaml b/.github/workflows/testflinger-contrib-dss-regression.yaml
@@ -21,36 +21,31 @@ jobs:
       run:
         working-directory: contrib/checkbox-dss-validation
     strategy:
+      fail-fast: false
       matrix:
         dss_channel:
           - latest/stable
           - latest/edge
+        microk8s_channel:
+          - 1.28/stable
+          - 1.31/stable
         queue:
-          - dell-precision-3470-c30322 #ADL iGPU + NVIDIA GPU
-          - dell-precision-5680-c31665 #RPL iGPU + Arc Pro A60M dGPU
+          - name: dell-precision-3470-c30322 #ADL iGPU + NVIDIA GPU
+            provision_data: "distro: jammy"
+          - name: dell-precision-5680-c31665 #RPL iGPU + Arc Pro A60M dGPU
+            provision_data: "url: http://10.102.196.9/somerville/Platforms/jellyfish-muk/X96_A00/dell-bto-jammy-jellyfish-muk-X96-20230419-19_A00.iso"
+          - name: nvidia-dgx-station-c25989  # NO iGPU + NVIDIA GPU
+            provision_data: "distro: jammy"
     steps:
       - name: Check out code
         uses: actions/checkout@v4
-      - name: Build job file from template with maas2 provisioning
-        if: ${{ matrix.queue == 'dell-precision-3470-c30322' }}
-        env:
-          PROVISION_DATA: "distro: jammy"
+      - name: Build job file from template
         run: |
           sed -e "s|REPLACE_BRANCH|${BRANCH}|" \
-          -e "s|REPLACE_QUEUE|${{ matrix.queue }}|" \
-          -e "s|REPLACE_PROVISION_DATA|${PROVISION_DATA}|" \
-          -e "s|REPLACE_DSS_CHANNEL|${{ matrix.dss_channel }}|" \
-          ${GITHUB_WORKSPACE}/contrib/checkbox-dss-validation/testflinger/job-def.yaml > \
-          ${GITHUB_WORKSPACE}/job.yaml
-      - name: Build job file from template with oemscript provisioning
-        if: ${{ matrix.queue == 'dell-precision-5680-c31665' }}
-        env:
-          PROVISION_DATA: "url: http://10.102.196.9/somerville/Platforms/jellyfish-muk/X96_A00/dell-bto-jammy-jellyfish-muk-X96-20230419-19_A00.iso"
-        run: |
-          sed -e "s|REPLACE_BRANCH|${BRANCH}|" \
-          -e "s|REPLACE_QUEUE|${{ matrix.queue }}|" \
-          -e "s|REPLACE_PROVISION_DATA|${PROVISION_DATA}|" \
+          -e "s|REPLACE_QUEUE|${{ matrix.queue.name }}|" \
+          -e "s|REPLACE_PROVISION_DATA|${{ matrix.queue.provision_data }}|" \
           -e "s|REPLACE_DSS_CHANNEL|${{ matrix.dss_channel }}|" \
+          -e "s|REPLACE_MICROK8S_CHANNEL|${{ matrix.microk8s_channel }}|" \
           ${GITHUB_WORKSPACE}/contrib/checkbox-dss-validation/testflinger/job-def.yaml > \
           ${GITHUB_WORKSPACE}/job.yaml
       - name: Submit testflinger job
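
The unified "Build job file from template" step above relies on plain `sed` substitution of `REPLACE_*` placeholders. The sketch below reproduces that substitution locally under stated assumptions: the helper name `render_job_file` and the two-line demo template are invented for illustration, the real `job-def.yaml` is not shown in this diff, and the `REPLACE_BRANCH` substitution is left out for brevity.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the commit): fill the REPLACE_* placeholders
# of a template file and print the result to stdout.
# "|" is used as the sed delimiter so values containing "/" (channels, URLs)
# work; values containing "|", "&" or "\" would need escaping first.
render_job_file() {
    local template="$1" queue="$2" provision_data="$3"
    local dss_channel="$4" microk8s_channel="$5"
    sed -e "s|REPLACE_QUEUE|${queue}|" \
        -e "s|REPLACE_PROVISION_DATA|${provision_data}|" \
        -e "s|REPLACE_DSS_CHANNEL|${dss_channel}|" \
        -e "s|REPLACE_MICROK8S_CHANNEL|${microk8s_channel}|" \
        "$template"
}

# Demo with a made-up template; the field names here are illustrative only.
demo_template=$(mktemp)
printf 'job_queue: REPLACE_QUEUE\nprovision: REPLACE_PROVISION_DATA\n' >"$demo_template"
render_job_file "$demo_template" nvidia-dgx-station-c25989 "distro: jammy" latest/edge 1.31/stable
rm -f "$demo_template"
```

Because each matrix entry carries its own `name` and `provision_data`, one substitution step of this kind can serve every queue, which is what lets the workflow drop the per-machine `if:` branches.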

diff --git a/contrib/checkbox-dss-validation/README.md b/contrib/checkbox-dss-validation/README.md
@@ -1,12 +1,14 @@
 # Welcome to the Checkbox DSS project!
 
-This repository contains the Checkbox DSS Provider (test cases and test plans for validating Intel GPU support in the [Data Science Stack](https://documentation.ubuntu.com/data-science-stack/en/latest/)) as well as everything that is required to build the `checkbox-dss` snap.
+This repository contains the Checkbox DSS Provider (test cases and test plans for validating Intel and NVIDIA GPU support in the [Data Science Stack](https://documentation.ubuntu.com/data-science-stack/en/latest/)) as well as everything that is required to build the `checkbox-dss` snap.
 
 # Requirements
 
 - Ubuntu Jammy (22.04)
 - Supported hardware platforms:
+  - No GPUs
   - Intel platforms with recent GPU (>= Broadwell)
+  - Recent NVIDIA GPU
 
 # Installation
 
@@ -19,7 +21,7 @@ lxd init --auto
 git clone https://github.com/canonical/checkbox
 cd checkbox/contrib/checkbox-dss-validation
 snapcraft
-sudo snap install --dangerous --classic ./checkbox-dss_2.0_amd64.snap
+sudo snap install --dangerous --classic ./checkbox-dss_3.0_amd64.snap
 ```
 
 Make sure that the provider service is running and active:
@@ -40,15 +42,27 @@ By default this will install the `data-science-stack` snap from the `latest/stab
 channel. To instead install from `latest/edge` use:
 
 ```shell
-checkbox-dss.install-deps --dss-snap-channel=latest/edge
+checkbox-dss.install-deps --dss-snap-channel latest/edge
 ```
 
+Furthermore, the default `microk8s` snap channel is `1.28/stable` in classic mode,
+but this can be customized as
+(please note that this snap must be `--classic` to enable GPU support):
+
+```shell
+checkbox-dss.install-deps --microk8s-snap-channel 1.31/stable
+```
+
+These validations also need the `kubectl` snap installed, and the default channel
+used for that is `1.29/stable`, but can be customized as shown previously by passing
+the appropriate channel name for `--kubectl-snap-channel`.
+
 # Automated Run
 
 To run the test plans:
 
 ```shell
-checkbox-dss.validate-intel-gpu
+checkbox-dss.validate-with-gpu
 ```
 
 # Cleanup

diff --git a/contrib/checkbox-dss-validation/bin/install-deps b/contrib/checkbox-dss-validation/bin/install-deps
@@ -1,56 +1,89 @@
 #!/bin/bash
 set -e
 
-echo -e "\nStep 1/5: Installing microk8s snap"
-sudo snap install microk8s --channel 1.28/stable --classic
-
-USER=$(id -nu ${SNAP_UID})
-HOME=${SNAP_REAL_HOME}
-
-# microk8s commands run from tests are run without sudo
-sudo usermod -a -G microk8s $USER
-# Directory needed for sharing microk8s config with kubectl snap
-mkdir -p $HOME/.kube
-
-echo -e "\nStep 2/5: Configuring microk8s addons"
-sudo microk8s status --wait-ready
-# Give microk8s another minute to stabilize
-# to avoid intermittent failures when
-# enabling hostpath-storage
-echo "Giving microk8s a minute to stabilize..."
-sleep 60
-sudo microk8s enable hostpath-storage
-sudo microk8s enable dns
-sudo microk8s enable rbac
-
-echo "Waiting for microk8s addons to become ready..."
-sudo microk8s.kubectl wait \
-  --for=condition=available \
-  --timeout 1800s \
-  -n kube-system \
-  deployment/coredns \
-  deployment/hostpath-provisioner
-sudo microk8s.kubectl -n kube-system rollout status ds/calico-node
-
-# This is needed to overcome the following bug within microk8s:
-# https://github.com/canonical/microk8s/issues/4453
-echo -e "\nStep 3/5: Installing kubectl snap"
-sudo snap install kubectl --classic --channel=1.29/stable
-# hack as redirecting stdout anywhere but /dev/null throws a permission denied error
-# see: https://forum.snapcraft.io/t/eksctl-cannot-write-to-stdout/17254/4
-sudo microk8s.kubectl config view --raw | tee $HOME/.kube/config > /dev/null
-
-# intel_gpu_top command used for host-level GPU check
-# jq used for cases where jsonpath is insufficient for parsing json results
-echo -e "\nStep 4/5: Installing intel-gpu-tools"
-DEBIAN_FRONTEND=noninteractive sudo apt install -y intel-gpu-tools jq
-
-echo -e "\nStep 5/5: Installing data-science-stack snap"
-optional_arg=$1
-if [ "${optional_arg}" = "--dss-snap-channel=latest/edge" ]; then
-  echo "Installing from edge"
-  sudo snap install data-science-stack --channel latest/edge
-else
-  echo "Installing from stable"
-  sudo snap install data-science-stack --channel latest/stable
-fi
+dss_snap_channel="latest/stable"
+microk8s_snap_channel="1.28/stable"
+kubectl_snap_channel="1.29/stable"
+
+setup_microk8s_snap() {
+    echo -e "\nInstalling microk8s snap from channel $1"
+    sudo snap install microk8s --channel "$1" --classic
+
+    SNAP_USER=$(id -nu "${SNAP_UID}")
+
+    # microk8s commands run from tests are run without sudo
+    sudo usermod -a -G microk8s "$SNAP_USER"
+    # Directory needed for sharing microk8s config with kubectl snap
+    mkdir -p "${SNAP_REAL_HOME}/.kube"
+
+    echo -e "\nConfiguring microk8s addons"
+    sudo microk8s status --wait-ready
+    # Give microk8s another minute to stabilize
+    # to avoid intermittent failures when
+    # enabling hostpath-storage
+    echo "Giving microk8s a minute to stabilize..."
+    sleep 60
+    sudo microk8s enable hostpath-storage
+    sudo microk8s enable dns
+    sudo microk8s enable rbac
+
+    echo "Waiting for microk8s addons to become ready..."
+    sudo microk8s.kubectl wait \
+        --for=condition=available \
+        --timeout 1800s \
+        -n kube-system \
+        deployment/coredns \
+        deployment/hostpath-provisioner
+    sudo microk8s.kubectl -n kube-system rollout status ds/calico-node
+}
+
+setup_kubectl_snap() {
+    # This is needed to overcome the following bug within microk8s:
+    # https://github.com/canonical/microk8s/issues/4453
+    echo -e "\nInstalling kubectl snap from channel $1"
+    sudo snap install kubectl --classic --channel="$1"
+    # hack as redirecting stdout anywhere but /dev/null throws a permission denied error
+    # see: https://forum.snapcraft.io/t/eksctl-cannot-write-to-stdout/17254/4
+    sudo microk8s.kubectl config view --raw | tee "${SNAP_REAL_HOME}/.kube/config" >/dev/null
+}
+
+help_function() {
+    echo "This script is used to install all dependencies for checkbox-dss to run; defaults for optional arguments are shown in usage"
+    echo "Usage: checkbox-dss.install-deps [--dss-snap-channel $dss_snap_channel] [--microk8s-snap-channel $microk8s_snap_channel] [--kubectl-snap-channel $kubectl_snap_channel]"
+}
+
+main() {
+    while [ $# -ne 0 ]; do
+        case $1 in
+        --dss-snap-channel)
+            dss_snap_channel="$2"
+            shift 2
+            ;;
+        --microk8s-snap-channel)
+            microk8s_snap_channel="$2"
+            shift 2
+            ;;
+        --kubectl-snap-channel)
+            kubectl_snap_channel="$2"
+            shift 2
+            ;;
+        *) help_function; exit 1 ;;
+        esac
+    done
+
+    echo -e "\nStep 1/4: Setting up microk8s"
+    setup_microk8s_snap "$microk8s_snap_channel"
+
+    echo -e "\nStep 2/4: Setting up kubectl"
+    setup_kubectl_snap "$kubectl_snap_channel"
+
+    # intel_gpu_top command used for host-level GPU check
+    # jq used for cases where jsonpath is insufficient for parsing json results
+    echo -e "\nStep 3/4: Installing intel-gpu-tools"
+    DEBIAN_FRONTEND=noninteractive sudo apt install -y intel-gpu-tools jq
+
+    echo -e "\nStep 4/4: Installing data-science-stack snap from channel $dss_snap_channel"
+    sudo snap install data-science-stack --channel "$dss_snap_channel"
+}
+
+main "$@"
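
The option loop in `main` above takes each flag's value from `$2`. As a hedged illustration of how such a loop can also reject unknown options and flags given without a value, here is a standalone sketch; the function name `parse_channel_args` and the error messages are hypothetical, while the flag names and default channels mirror the script.

```shell
#!/usr/bin/env bash

# Hypothetical standalone variant of the option loop (not the commit's code).
# Sets the three *_snap_channel variables; returns 2 on a usage error.
parse_channel_args() {
    dss_snap_channel="latest/stable"
    microk8s_snap_channel="1.28/stable"
    kubectl_snap_channel="1.29/stable"
    while [ $# -ne 0 ]; do
        case "$1" in
        --dss-snap-channel | --microk8s-snap-channel | --kubectl-snap-channel)
            # Every supported flag needs a value right after it.
            if [ $# -lt 2 ]; then
                >&2 echo "Missing value for $1"
                return 2
            fi
            ;;
        *)
            >&2 echo "Unknown option: $1"
            return 2
            ;;
        esac
        case "$1" in
        --dss-snap-channel) dss_snap_channel="$2" ;;
        --microk8s-snap-channel) microk8s_snap_channel="$2" ;;
        --kubectl-snap-channel) kubectl_snap_channel="$2" ;;
        esac
        shift 2
    done
}
```

For example, `parse_channel_args --microk8s-snap-channel 1.31/stable` leaves the DSS and kubectl channels at their defaults, while `parse_channel_args --bogus` returns a non-zero status instead of looping on the unconsumed argument.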
diff --git a/...box-dss-validation/bin/validate-intel-gpu → ...kbox-dss-validation/bin/validate-with-gpu b/...box-dss-validation/bin/validate-intel-gpu → ...kbox-dss-validation/bin/validate-with-gpu
@@ -1,4 +1,4 @@
-#!/usr/bin/env -S checkbox-cli-wrapper remote 127.0.0.1
+#!/usr/bin/env -S checkbox-cli-wrapper control 127.0.0.1
 [launcher]
 app_id = com.canonical.contrib.dss-validation:checkbox
 launcher_version = 1
@@ -14,5 +14,5 @@ forced = yes
 [ui]
 type = silent
 auto_retry = yes
-max_attempts = 10
+max_attempts = 2
 delay_before_retry = 10
diff --git a/contrib/checkbox-dss-validation/checkbox-provider-dss/bin/check_cuda.sh b/contrib/checkbox-dss-validation/checkbox-provider-dss/bin/check_cuda.sh
@@ -0,0 +1,54 @@
+#!/usr/bin/env bash
+
+set -euxo pipefail
+
+check_nvidia_gpu_addon_can_be_enabled() {
+    # TODO: enable changing GPU_OPERATOR_VERSION
+    GPU_OPERATOR_VERSION=24.6.2
+    echo "[INFO]: enabling the NVIDIA GPU addon"
+    sudo microk8s enable gpu --driver=operator --version="$GPU_OPERATOR_VERSION"
+    SLEEP_SECS=10
+    echo "[INFO]: sleeping for ${SLEEP_SECS} seconds before checking GPU feature discovery has rolled out."
+    sleep ${SLEEP_SECS}
+    microk8s.kubectl -n gpu-operator-resources rollout status ds/gpu-operator-node-feature-discovery-worker
+    echo "[INFO]: sleeping for ${SLEEP_SECS} seconds before checking if daemonsets have rolled out."
+    sleep ${SLEEP_SECS}
+    microk8s.kubectl -n gpu-operator-resources rollout status ds/nvidia-device-plugin-daemonset
+    echo "[INFO]: sleeping for ${SLEEP_SECS} seconds before checking GPU validations have rolled out."
+    sleep ${SLEEP_SECS}
+    echo "[INFO]: Waiting for the GPU validations to roll out"
+    microk8s.kubectl -n gpu-operator-resources rollout status ds/nvidia-operator-validator
+    echo "Test success: NVIDIA GPU addon enabled."
+}
+
+check_nvidia_gpu_validations_succeed() {
+    SLEEP_SECS=5
+    echo "[INFO]: sleeping for ${SLEEP_SECS} seconds before checking if GPU validations were successful."
+    sleep ${SLEEP_SECS}
+    result=$(microk8s.kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator)
+    if [ "${result}" = "all validations are successful" ]; then
+        echo "Test success: NVIDIA GPU validations were successful!"
+    else
+        >&2 echo "Test failure: NVIDIA GPU validations were not successful, got ${result}"
+        exit 1
+    fi
+}
+
+help_function() {
+    echo "This script is used for tests related to CUDA"
+    echo "Usage: check_cuda.sh <test_case>"
+    echo
+    echo "Test cases currently implemented:"
+    echo -e "\t<gpu_addon_can_be_enabled>: check_nvidia_gpu_addon_can_be_enabled"
+    echo -e "\t<gpu_validations_succeed>: check_nvidia_gpu_validations_succeed"
+}
+
+main() {
+    case ${1:-} in
+    gpu_addon_can_be_enabled) check_nvidia_gpu_addon_can_be_enabled ;;
+    gpu_validations_succeed) check_nvidia_gpu_validations_succeed ;;
+    *) help_function ;;
+    esac
+}
+
+main "$@"
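
The script above spaces out its `rollout status` checks with fixed sleeps. A different technique, not used by this commit, is to retry a command a bounded number of times with a delay between attempts. The sketch below shows such a helper; the name `retry` and its argument order are assumptions made for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the commit): run a command up to <attempts>
# times, sleeping <delay> seconds between failed attempts.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            return 0
        fi
        # No need to sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

A call such as `retry 5 10 microk8s.kubectl -n gpu-operator-resources rollout status ds/nvidia-operator-validator` would then tolerate the daemonset not existing yet on the first attempts. Whether this is more reliable than the fixed sleeps on the target machines is untested here.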