Add tests for DSS on NVIDIA GPUs and only CPUs (New) (#1609)
Changes to test jobs

- The individual test jobs that checked for available GPUs have been replaced by the graphics_card resource, which allows the respective tests to be skipped when the relevant GPUs are not available.
- Tests have been added to set up DSS for CUDA with NVIDIA GPUs, followed by simple tests that check whether PyTorch and TensorFlow can actually use the GPUs.
- Tests have been added to run only on the CPU. These tests will be run on all machines irrespective of available GPUs.
- Shell scripts have been refactored to have more re-usable functions.
- The tests verifying Intel GPUs were updated because they had a bug where NVIDIA GPUs were also counted as Intel GPUs. The tests are less precise now: they check that at least a minimum expected GPU count and capacity are available, instead of looking for exact counts.
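
The switch from exact counts to minimum counts described above can be sketched as follows. This is a minimal illustration, not the check used in this commit; the helper name `check_min_count` and its messages are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed when the observed count is at least the
# expected minimum, instead of requiring an exact match.
# $1: observed count, $2: minimum expected count
check_min_count() {
    local actual="$1" minimum="$2"
    if [ "$actual" -ge "$minimum" ]; then
        echo "Test success: found ${actual} GPU slot(s), expected at least ${minimum}"
    else
        >&2 echo "Test failure: found ${actual} GPU slot(s), expected at least ${minimum}"
        return 1
    fi
}
```

With this kind of check, a machine that reports extra slots (for example because another vendor's GPUs are counted under the same label) still passes, at the cost of no longer detecting over-counting.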

Changes to the test plan

- Changes related to resource.pxu and the additional NVIDIA tests, as explained above.
- Unused test plans for individually testing ITEX or IPEX have been removed, since both are tested in the main test plan anyway.

Changes to the snap

- The command to trigger the tests from the checkbox-dss snap produced with the provider has been changed from validate-intel-gpu to validate-with-gpu.
- The install-deps command, which installs all the dependencies for running the tests, has been refactored and now accepts the snap channel, and thus the version, of each main snap to be installed; these currently include DSS itself, Microk8s, and kubectl.
- These are backwards-incompatible changes to the snap, so its version has been bumped from 2.0 to 3.0, and the relevant snapcraft.yaml and the README have been updated accordingly.
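
As a rough sketch of how `--flag value` channel options of this kind can be parsed, the following uses the three flag names and defaults from this commit. The function name `parse_channels` and its error handling are assumptions for illustration, not the script shipped in the snap.

```shell
#!/usr/bin/env bash

# Defaults as named in this commit.
dss_snap_channel="latest/stable"
microk8s_snap_channel="1.28/stable"
kubectl_snap_channel="1.29/stable"

# Hypothetical parser: fills the variables above from "--flag value" pairs
# and returns non-zero on an unknown flag or a missing value.
parse_channels() {
    while [ $# -gt 0 ]; do
        if [ $# -lt 2 ]; then
            >&2 echo "Missing value or unknown argument: $1"
            return 2
        fi
        case "$1" in
            --dss-snap-channel) dss_snap_channel="$2" ;;
            --microk8s-snap-channel) microk8s_snap_channel="$2" ;;
            --kubectl-snap-channel) kubectl_snap_channel="$2" ;;
            *)
                >&2 echo "Unknown option: $1"
                return 2
                ;;
        esac
        # Every accepted flag consumes itself and its value.
        shift 2
    done
}
```

Returning early on an unrecognized argument keeps the loop from spinning on input it never consumes.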

Changes to the relevant GitHub workflow

- The GitHub workflow for running DSS has been refactored to now need a single job definition that can be used for all the values from the test matrix.
- An NVIDIA DGX machine has been added as a target machine, representing a machine that has no Intel GPUs, only NVIDIA GPUs.
- Multiple Microk8s versions have been added to the test-matrix.

Full Changelog

* add jobs to DSS validation for setup and test on NVIDIA GPUs

For the moment we lump it together in the validate-intel-gpu launcher...
more refactoring coming

* fix cuda test for tensorflow and give more time for things to settle

* fix dependency of nvidia_gpu_addon/enable job

* fix wrong dependency for cuda jobs and make validation more reliable

* fix shebang to use control instead of remote in launcher script

* fix flaky gpu addon rollout checking in better order and more sleep

* turn the GPU checks into resources to control which GPU tests are run

* remove flaky mlflow deployed test

This is covered by checking that DSS's status says 'MLFlow deployment: Ready'.
The way the removed test was implemented assumed position of the service's name
in the output and made it flaky, especially when re-running the tests.
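
A position-independent check of the kind described above might look like the sketch below. The helper name `status_has_line` is hypothetical, and it simply searches its standard input for a fixed string instead of relying on where a field appears.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads status text on stdin and succeeds when the
# given fixed string occurs anywhere in it, regardless of line position.
status_has_line() {
    grep -qF -- "$1"
}
```

Assuming the status text is produced by a command such as `dss status`, usage would be along the lines of `dss status | status_has_line 'MLFlow deployment: Ready'`.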

* update other dss test-plans to use the GPU as resources

* reduce max_attempts for retry to 2

Since many tests here depend on specific resources being available, namely
GPUs from Intel or NVIDIA, not all tests are expected to pass on a given machine,
so we should not waste too much time retrying them.

* add cpu-only tests for dss

* rename validate script to not contain intel and bump snap's version

* refactor testflinger job file builder to unify into one re-usable one

* add nvidia dgx as target machine for DSS testflinger jobs

* allow other workflow jobs in matrix to continue running if one fails

* add notebook removal tests and rename cases to be consistent

Notebook removal is part of the DSS CLI anyway, so it makes sense to test it.
Nevertheless, the main reason to add these tests is so that the
entire checkbox test plan can be repeated without having to uninstall
everything; removing the notebook resets DSS into a re-testable state.

* skip installing intel gpu plugin if it is already there

* remove unused itex- and ipex-only test plans

* rename check_dss.sh to check_dss for pseudo-fluent usage

* refactor remove notebook test to accept multiple arguments

* extract out notebook creation to reused function

* disable intel gpu capacity tests temporarily

the tests fail on re-runs because they start counting nvidia gpus too

* rename test case for dss to be more fluid

* refactor checking dss status into reusable function

* add missing usage string for dss create notebook function

* use pushd popd instead of cd-ing to HOME in check dss
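
The pushd/popd pattern named in this entry can be sketched as follows. The wrapper `run_in_dir` is a hypothetical name used for illustration; the point is that the caller's working directory is restored whether or not the command succeeds.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command from a given directory, then return
# to the caller's directory and pass the command's exit status through.
# $1: directory to run in, remaining arguments: the command to run
run_in_dir() {
    local dir="$1" status=0
    shift
    pushd "$dir" >/dev/null || return 1
    "$@" || status=$?
    popd >/dev/null || return 1
    return "$status"
}
```

Unlike a bare `cd`, this leaves later steps of the same script unaffected by the directory change.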

* rename check_cuda.sh to check_cuda to have a pseudo-fluent usage

* refactor cuda notebook tests to reusable script

* refactor out the notebook tests for cpu

* refactor out itex tests to common notebook script

one redundant test job has been removed since the new test-case now implicitly
tests importing itex as well

* refactor out ipex tests to common notebook script

one redundant test job has been removed since the new test-case now implicitly
tests importing ipex as well

* reformat long requires clauses to multi-line ones

* drop .sh extension from check_intel script

* fix failing intel gpu verification tests

There seems to be a bug in the Intel GPU plugin where it starts counting NVIDIA
GPUs too under its label once NVIDIA's plugin is enabled.  The tests are now
updated to check for matching the minimum slot count instead of an exact one.

* reduce sleep time in steps while enabling nvidia gpu addon

* fix help string for check_notebook

* refactor install-deps script allowing customization of microk8s and kubectl too

* add customized microk8s channels to github workflow for dss

* fix default dss_snap_channel to latest/stable instead of non-existent 1/stable

* add .sh extension back to the test runner scripts

It helps to know which script is being run

* use graphics_card resource for checking GPU instead of own

* change to detecting GPU based on vendor

the previous approach was checking for driver, but
that does not work for NVIDIA GPUs because we don't
install their drivers on the machine (the drivers
are installed in the k8s operator).
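
To illustrate why vendor-based detection works without host drivers, here is a minimal sketch that counts display-class PCI devices by vendor ID using sysfs. It is not how the graphics_card resource is implemented; the function name `count_gpus_by_vendor` and the optional directory argument are assumptions made so the example is self-contained.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count display-class PCI devices from one vendor.
# $1: PCI vendor ID (0x8086 for Intel, 0x10de for NVIDIA)
# $2: sysfs PCI devices directory (defaults to /sys/bus/pci/devices)
count_gpus_by_vendor() {
    local vendor="$1" root="${2:-/sys/bus/pci/devices}" count=0 dev class
    for dev in "$root"/*; do
        [ -r "$dev/vendor" ] && [ -r "$dev/class" ] || continue
        class="$(cat "$dev/class")"
        # PCI base class 0x03 is "display controller".
        case "$class" in
            0x03*) ;;
            *) continue ;;
        esac
        if [ "$(cat "$dev/vendor")" = "$vendor" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

The vendor and class files are populated by the kernel's PCI enumeration, so they are present even when no driver is bound to the device.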

* fix mention of default channel for DSS in the README

* remove unnecessary dss integration tests script (coming later)



Fix CHECKBOX-1586
Fix CHECKBOX-1668
motjuste authored Dec 3, 2024
1 parent 843f2a8 commit 355ecf0
Showing 15 changed files with 504 additions and 374 deletions.
33 changes: 14 additions & 19 deletions .github/workflows/testflinger-contrib-dss-regression.yaml
@@ -21,36 +21,31 @@ jobs:
run:
working-directory: contrib/checkbox-dss-validation
strategy:
fail-fast: false
matrix:
dss_channel:
- latest/stable
- latest/edge
microk8s_channel:
- 1.28/stable
- 1.31/stable
queue:
- dell-precision-3470-c30322 #ADL iGPU + NVIDIA GPU
- dell-precision-5680-c31665 #RPL iGPU + Arc Pro A60M dGPU
- name: dell-precision-3470-c30322 #ADL iGPU + NVIDIA GPU
provision_data: "distro: jammy"
- name: dell-precision-5680-c31665 #RPL iGPU + Arc Pro A60M dGPU
provision_data: "url: http://10.102.196.9/somerville/Platforms/jellyfish-muk/X96_A00/dell-bto-jammy-jellyfish-muk-X96-20230419-19_A00.iso"
- name: nvidia-dgx-station-c25989 # NO iGPU + NVIDIA GPU
provision_data: "distro: jammy"
steps:
- name: Check out code
uses: actions/checkout@v4
- name: Build job file from template with maas2 provisioning
if: ${{ matrix.queue == 'dell-precision-3470-c30322' }}
env:
PROVISION_DATA: "distro: jammy"
- name: Build job file from template
run: |
sed -e "s|REPLACE_BRANCH|${BRANCH}|" \
-e "s|REPLACE_QUEUE|${{ matrix.queue }}|" \
-e "s|REPLACE_PROVISION_DATA|${PROVISION_DATA}|" \
-e "s|REPLACE_DSS_CHANNEL|${{ matrix.dss_channel }}|" \
${GITHUB_WORKSPACE}/contrib/checkbox-dss-validation/testflinger/job-def.yaml > \
${GITHUB_WORKSPACE}/job.yaml
- name: Build job file from template with oemscript provisioning
if: ${{ matrix.queue == 'dell-precision-5680-c31665' }}
env:
PROVISION_DATA: "url: http://10.102.196.9/somerville/Platforms/jellyfish-muk/X96_A00/dell-bto-jammy-jellyfish-muk-X96-20230419-19_A00.iso"
run: |
sed -e "s|REPLACE_BRANCH|${BRANCH}|" \
-e "s|REPLACE_QUEUE|${{ matrix.queue }}|" \
-e "s|REPLACE_PROVISION_DATA|${PROVISION_DATA}|" \
-e "s|REPLACE_QUEUE|${{ matrix.queue.name }}|" \
-e "s|REPLACE_PROVISION_DATA|${{ matrix.queue.provision_data }}|" \
-e "s|REPLACE_DSS_CHANNEL|${{ matrix.dss_channel }}|" \
-e "s|REPLACE_MICROK8S_CHANNEL|${{ matrix.microk8s_channel }}|" \
${GITHUB_WORKSPACE}/contrib/checkbox-dss-validation/testflinger/job-def.yaml > \
${GITHUB_WORKSPACE}/job.yaml
- name: Submit testflinger job
22 changes: 18 additions & 4 deletions contrib/checkbox-dss-validation/README.md
@@ -1,12 +1,14 @@
# Welcome to the Checkbox DSS project!

This repository contains the Checkbox DSS Provider (test cases and test plans for validating Intel GPU support in the [Data Science Stack](https://documentation.ubuntu.com/data-science-stack/en/latest/)) as well as everything that is required to build the `checkbox-dss` snap.
This repository contains the Checkbox DSS Provider (test cases and test plans for validating Intel and NVIDIA GPU support in the [Data Science Stack](https://documentation.ubuntu.com/data-science-stack/en/latest/)) as well as everything that is required to build the `checkbox-dss` snap.

# Requirements

- Ubuntu Jammy (22.04)
- Supported hardware platforms:
- No GPUs
- Intel platforms with recent GPU (>= Broadwell)
- Recent NVIDIA GPU

# Installation

@@ -19,7 +21,7 @@ lxd init --auto
git clone https://github.com/canonical/checkbox
cd checkbox/contrib/checkbox-dss-validation
snapcraft
sudo snap install --dangerous --classic ./checkbox-dss_2.0_amd64.snap
sudo snap install --dangerous --classic ./checkbox-dss_3.0_amd64.snap
```

Make sure that the provider service is running and active:
@@ -40,15 +42,27 @@ By default this will install the `data-science-stack` snap from the `latest/stab
channel. To instead install from `latest/edge` use:

```shell
checkbox-dss.install-deps --dss-snap-channel=latest/edge
checkbox-dss.install-deps --dss-snap-channel latest/edge
```

Furthermore, the default `microk8s` snap channel is `1.28/stable` in classic mode,
but this can be customized as
(please note that this snap must to be `--classic` to enable GPU support):

```shell
checkbox-dss.install-deps --microk8s-snap-channel 1.31/stable
```

These validations also need the `kubectl` snap installed, and the default channel
used for that is `1.29/stable`, but can be customized as shown previously by passing
the appropriate channel name for `--kubectl-snap-channel`.

# Automated Run

To run the test plans:

```shell
checkbox-dss.validate-intel-gpu
checkbox-dss.validate-with-gpu
```

# Cleanup
139 changes: 86 additions & 53 deletions contrib/checkbox-dss-validation/bin/install-deps
@@ -1,56 +1,89 @@
#!/bin/bash
set -e

echo -e "\nStep 1/5: Installing microk8s snap"
sudo snap install microk8s --channel 1.28/stable --classic

USER=$(id -nu ${SNAP_UID})
HOME=${SNAP_REAL_HOME}

# microk8s commands run from tests are run without sudo
sudo usermod -a -G microk8s $USER
# Directory needed for sharing microk8s config with kubectl snap
mkdir -p $HOME/.kube

echo -e "\nStep 2/5: Configuring microk8s addons"
sudo microk8s status --wait-ready
# Give microk8s another minute to stabilize
# to avoid intermittent failures when
# enabling hostpath-storage
echo "Giving microk8s a minute to stabilize..."
sleep 60
sudo microk8s enable hostpath-storage
sudo microk8s enable dns
sudo microk8s enable rbac

echo "Waiting for microk8s addons to become ready..."
sudo microk8s.kubectl wait \
--for=condition=available \
--timeout 1800s \
-n kube-system \
deployment/coredns \
deployment/hostpath-provisioner
sudo microk8s.kubectl -n kube-system rollout status ds/calico-node

# This is needed to overcome the following bug within microk8s:
# https://github.com/canonical/microk8s/issues/4453
echo -e "\nStep 3/5: Installing kubectl snap"
sudo snap install kubectl --classic --channel=1.29/stable
# hack as redirecting stdout anywhere but /dev/null throws a permission denied error
# see: https://forum.snapcraft.io/t/eksctl-cannot-write-to-stdout/17254/4
sudo microk8s.kubectl config view --raw | tee $HOME/.kube/config > /dev/null

# intel_gpu_top command used for host-level GPU check
# jq used for cases where jsonpath is insufficient for parsing json results
echo -e "\nStep 4/5: Installing intel-gpu-tools"
DEBIAN_FRONTEND=noninteractive sudo apt install -y intel-gpu-tools jq

echo -e "\nStep 5/5: Installing data-science-stack snap"
optional_arg=$1
if [ "${optional_arg}" = "--dss-snap-channel=latest/edge" ]; then
echo "Installing from edge"
sudo snap install data-science-stack --channel latest/edge
else
echo "Installing from stable"
sudo snap install data-science-stack --channel latest/stable
fi
dss_snap_channel="latest/stable"
microk8s_snap_channel="1.28/stable"
kubectl_snap_channel="1.29/stable"

setup_microk8s_snap() {
echo -e "\nInstalling microk8s snap from channel $1"
sudo snap install microk8s --channel "$1" --classic

SNAP_USER=$(id -nu "${SNAP_UID}")

# microk8s commands run from tests are run without sudo
sudo usermod -a -G microk8s "$SNAP_USER"
# Directory needed for sharing microk8s config with kubectl snap
mkdir -p "${SNAP_REAL_HOME}/.kube"

echo -e "\nConfiguring microk8s addons"
sudo microk8s status --wait-ready
# Give microk8s another minute to stabilize
# to avoid intermittent failures when
# enabling hostpath-storage
echo "Giving microk8s a minute to stabilize..."
sleep 60
sudo microk8s enable hostpath-storage
sudo microk8s enable dns
sudo microk8s enable rbac

echo "Waiting for microk8s addons to become ready..."
sudo microk8s.kubectl wait \
--for=condition=available \
--timeout 1800s \
-n kube-system \
deployment/coredns \
deployment/hostpath-provisioner
sudo microk8s.kubectl -n kube-system rollout status ds/calico-node
}

setup_kubectl_snap() {
# This is needed to overcome the following bug within microk8s:
# https://github.com/canonical/microk8s/issues/4453
echo -e "\nInstalling kubectl snap from channel $1"
sudo snap install kubectl --classic --channel="$1"
# hack as redirecting stdout anywhere but /dev/null throws a permission denied error
# see: https://forum.snapcraft.io/t/eksctl-cannot-write-to-stdout/17254/4
sudo microk8s.kubectl config view --raw | tee "${SNAP_REAL_HOME}/.kube/config" >/dev/null
}

help_function() {
echo "This script is used install all dependencies for checkbox-dss to run; defaults for optional arguments are shown in usage"
echo "Usage: checkbox-dss.install-deps [--dss-snap-channel $dss_snap_channel] [--microk8s-snap-channel $microk8s_snap_channel] [--kubectl-snap-channel $kubectl_snap_channel]"
}

main() {
while [ $# -ne 0 ]; do
case $1 in
--dss-snap-channel)
dss_snap_channel="$2"
shift 2
;;
--microk8s-snap-channel)
microk8s_snap_channel="$2"
shift 2
;;
--kubectl-snap-channel)
kubectl_snap_channel="$2"
shift 2
;;
*) help_function ;;
esac
done

echo -e "\n Step 1/4: Setting up microk8s"
setup_microk8s_snap "$microk8s_snap_channel"

echo -e "\n Step 2/4: Setting up kubectl"
setup_kubectl_snap "$kubectl_snap_channel"

# intel_gpu_top command used for host-level GPU check
# jq used for cases where jsonpath is insufficient for parsing json results
echo -e "\nStep 3/4: Installing intel-gpu-tools"
DEBIAN_FRONTEND=noninteractive sudo apt install -y intel-gpu-tools jq

echo -e "\nStep 4/4: Installing data-science-stack snap from channel $dss_snap_channel"
sudo snap install data-science-stack --channel "$dss_snap_channel"
}

main "$@"
@@ -1,4 +1,4 @@
#!/usr/bin/env -S checkbox-cli-wrapper remote 127.0.0.1
#!/usr/bin/env -S checkbox-cli-wrapper control 127.0.0.1
[launcher]
app_id = com.canonical.contrib.dss-validation:checkbox
launcher_version = 1
@@ -14,5 +14,5 @@ forced = yes
[ui]
type = silent
auto_retry = yes
max_attempts = 10
max_attempts = 2
delay_before_retry = 10
@@ -0,0 +1,54 @@
#!/usr/bin/env bash

set -euxo pipefail

check_nvidia_gpu_addon_can_be_enabled() {
# TODO: enable changing GPU_OPERATOR_VERSION
GPU_OPERATOR_VERSION=24.6.2
echo "[INFO]: enabling the NVIDIA GPU addon"
sudo microk8s enable gpu --driver=operator --version="$GPU_OPERATOR_VERSION"
SLEEP_SECS=10
echo "[INFO]: sleeping for ${SLEEP_SECS} seconds before checking GPU feature discovery has rolled out."
sleep ${SLEEP_SECS}
microk8s.kubectl -n gpu-operator-resources rollout status ds/gpu-operator-node-feature-discovery-worker
echo "[INFO]: sleeping for ${SLEEP_SECS} seconds before checking if daemonsets have rolled out."
sleep ${SLEEP_SECS}
microk8s.kubectl -n gpu-operator-resources rollout status ds/nvidia-device-plugin-daemonset
echo "[INFO]: sleeping for ${SLEEP_SECS} seconds before checking GPU validations have rolled out."
sleep ${SLEEP_SECS}
echo "[INFO]: Waiting for the GPU validations to rollout"
microk8s.kubectl -n gpu-operator-resources rollout status ds/nvidia-operator-validator
echo "Test success: NVIDIA GPU addon enabled."
}

check_nvidia_gpu_validations_succeed() {
SLEEP_SECS=5
echo "[INFO]: sleeping for ${SLEEP_SECS} seconds before checking if GPU validations were successful."
sleep ${SLEEP_SECS}
result=$(microk8s.kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator)
if [ "${result}" = "all validations are successful" ]; then
echo "Test success: NVIDIA GPU validations were successful!"
else
>&2 echo "Test failure: NVIDIA GPU validations were not successful, got ${result}"
exit 1
fi
}

help_function() {
echo "This script is used for tests related to CUDA"
echo "Usage: check_dss.sh <test_case>"
echo
echo "Test cases currently implemented:"
echo -e "\t<gpu_addon_can_be_enabled>: check_nvidia_gpu_addon_can_be_enabled"
echo -e "\t<gpu_validations_succeed>: check_nvidia_gpu_validations_succeed"
}

main() {
case ${1} in
gpu_addon_can_be_enabled) check_nvidia_gpu_addon_can_be_enabled ;;
gpu_validations_succeed) check_nvidia_gpu_validations_succeed ;;
*) help_function ;;
esac
}

main "$@"