Add tests for DSS on NVIDIA GPUs and only CPUs (New) (#1609)
Changes to tests jobs

- Individual test jobs that used to check for available GPUs have been replaced by the graphics_card resource, which allows the respective tests to be skipped when the relevant GPUs are not available.
- Tests have been added to check setting up DSS for using CUDA with NVIDIA GPUs, followed by some simple tests to see if PyTorch and TensorFlow can actually use the GPUs.
- Tests have been added to run only on the CPU. These tests will be run on all machines irrespective of available GPUs.
- Shell scripts have been refactored to have more re-usable functions.
- Tests verifying Intel GPU were updated since they had a bug where they started counting NVIDIA GPUs too as Intel GPUs. The tests are less precise now, i.e. they now test for a minimum expected value for GPU counts and capacity to be available, instead of looking for exact counts as before.

Changes to the test plan

- Changes related to resource.pxu and additional NVIDIA tests, as explained above.
- Unused test plans for individually testing ITEX or IPEX have been removed since they get tested in the main test plan anyway.

Changes to the snap

- The command to trigger the tests from the checkbox-dss snap produced with the provider has been changed from validate-intel-gpu to validate-with-gpu.
- The install-deps command, which installs all the dependencies for running the tests, has been refactored and now accepts the versions of the main snaps to be installed, which currently include DSS itself, MicroK8s, and kubectl.
- These are backwards-incompatible changes to the snap and hence its version has been bumped from 2.0 to 3.0, and changes have been made to the relevant snapcraft.yaml and to the README.

Changes to the relevant GitHub workflow

- The GitHub workflow for running DSS has been refactored to now need a single job definition that can be used for all the values from the test matrix.
- An NVIDIA DGX machine has been added as a target machine, representing a machine that has no Intel GPUs, only NVIDIA GPUs.
- Multiple MicroK8s versions have been added to the test matrix.

Full Changelog

* add jobs to DSS validation for setup and test on NVIDIA GPUs
  For the moment we lump it together in the validate-intel-gpu launcher... more refactoring coming
* fix cuda test for tensorflow and give more time for things to settle
* fix dependency of nvidia_gpu_addon/enable job
* fix wrong dependency for cuda jobs and make validation more reliable
* fix shebang to use control instead of remote in launcher script
* fix flaky gpu addon rollout checking in better order and more sleep
* make the GPU checking into resources to control GPU tests are run
* remove flaky mlflow deployed test
  This is covered by checking that DSS's status says 'MLFlow deployment: Ready'. The way the removed test was implemented assumed position of the service's name in the output and made it flaky, especially when re-running the tests.
* update other dss test-plans to use the GPU as resources
* reduce max_attempts for retry to 2
  Since many tests here depend on some resources to be available, specifically: GPUs from Intel or NVIDIA, not all tests are expected to pass on a given machine and hence we should not waste our time too much retrying these tests.
* add cpu-only tests for dss
* rename validate script to not contain intel and bump snap's version
* refactor testflinger job file builder to unify into one re-usable one
* add nvidia dgx as target machine for DSS testflinger jobs
* allow other workflow jobs in matrix to continue running if one fails
* add notebook removal tests and rename cases to be consistent
  Notebook removal is part of the CLI of DSS anyway, and makes sense to be tested.
  Nevertheless, the main reason to add these tests is so that the entire checkbox test plan can be repeated without having to uninstall everything; removing notebook resets DSS into a re-testable state.
* skip installing intel gpu plugin if it is already there
* remove unused itex- and ipex-only test plans
* rename check_dss.sh to check_dss for pseudo-fluent usage
* refactor remove notebook test to accept multiple arguments
* extract out notebook creation to reused function
* disable intel gpu capacity tests temporarily
  the tests fail on re-runs because they start counting nvidia gpus too
* rename test case for dss to be more fluid
* refactor checking dss status into reusable function
* add missing usage string for dss create notebook function
* use pushd popd instead of cd-ing to HOME in check dss
* rename check_cuda.sh to check_cuda to have a pseudo-fluent usage
* refactor cuda notebook tests to reusable script
* refactor out the notebook tests for cpu
* refactor out itex tests to common notebook script
  one redundant test job has been removed since the new test-case now implicitly tests importing itex as well
* refactor out ipex tests to common notebook script
  one redundant test job has been removed since the new test-case now implicitly tests importing ipex as well
* reformat long requires clauses to multi-line ones
* drop .sh extension from check_intel script
* fix failing intel gpu verification tests
  There seems to be a bug in the Intel GPU plugin where it starts counting NVIDIA GPUs too under its label once NVIDIA's plugin is enabled. The tests are now updated to check for matching the minimum slot count instead of an exact one.
* reduce sleep time in steps while enabling nvidia gpu addon
* fix help string for check_notebook
* refactor install-deps script allowing customization of microk8s and kubectl too
* add customized microk8s channels to github workflow for dss
* fix default dss_snap_channel to latest/stable instead of non-existent 1/stable
* add .sh extension back to the test runner scripts
  It helps to know which script is being run
* use graphics_card resource for checking GPU instead of own
* change to detecting GPU based on vendor
  the previous approach was checking for driver, but that does not work for NVIDIA GPUs because we don't install their drivers on the machine (the drivers are installed in the k8s operator).
* fix mention of default channel for DSS in the README
* remove unnecessary dss integration tests script (coming later)

Fix CHECKBOX-1586
Fix CHECKBOX-1668
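
As an illustration of the vendor-based GPU detection mentioned in the changelog, here is a minimal shell sketch. It is not the code from this commit: the helper name has_gpu_vendor, the "vendor:" record layout, and the sample records are all assumptions made up for the example. It only shows why matching on the vendor field still works when no NVIDIA driver is bound on the host.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the commit): succeed if any "vendor:" line
# read from stdin contains the given vendor name, ignoring case.
# The "key: value" record layout is an assumption for this sketch.
has_gpu_vendor() {
    local vendor="$1"
    awk -v v="$vendor" '
        tolower($0) ~ /^vendor:/ && index(tolower($0), tolower(v)) { found = 1 }
        END { exit !found }
    '
}

# Made-up sample records. The second card has no driver field, mirroring a
# host where the NVIDIA driver lives in the k8s operator rather than on the
# machine itself; a driver-based check would miss it.
sample_records='vendor: Intel Corporation
driver: i915

vendor: NVIDIA Corporation'

if printf '%s\n' "$sample_records" | has_gpu_vendor "nvidia"; then
    echo "NVIDIA GPU present"
fi
```

The same helper can gate a test in either direction, for example running the CUDA checks only when it succeeds for "nvidia".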