From 98ec1f586ba77815dd8461d451df781b61e52b79 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Szczepanik?=
Date: Mon, 7 Aug 2023 17:42:24 +0200
Subject: [PATCH] Update DVC workflow for DVC v3

DVC 3.0 was released with a set of breaking changes. However, the only
one which affects us is the removal of `dvc run` in favour of
`dvc stage add --run`. This commit updates the DVC workflow to use the
3.0 command, removing the cause of build failures reported in #965. This
is effectively a reversal of 47b3ed9 which pinned dvc < 3.0 for our
build.

While there were multiple quick-release 3.x versions (going from 3.0 to
3.12 in the course of 3 weeks), they seem to adhere to semver, as the
changelog reports no breaking changes. At a quick glance, none of the
new features or bug fixes are particularly relevant for our (rather
basic) workflow, so it seems ok to go with an updated major version
dependency.
---
 docs/beyond_basics/101-168-dvc.rst | 11 +++++++----
 requirements-devel.txt             |  2 +-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/docs/beyond_basics/101-168-dvc.rst b/docs/beyond_basics/101-168-dvc.rst
index fb81bacf4..e1e70db96 100644
--- a/docs/beyond_basics/101-168-dvc.rst
+++ b/docs/beyond_basics/101-168-dvc.rst
@@ -584,7 +584,7 @@ The final script, ``src/evaluate.py`` is used to evaluate the trained classifier
 There are more detailed insights and explanations of the actual analysis code in the `Tutorial `_ if you're interested in finding out more.
 
 For workflow management, DVC has the concept of a "DVC pipeline".
-A pipeline consists of multiple stages and is executed using a :shcmd:`dvc run` command.
+A pipeline consists of multiple stages, which are set up and executed using a :shcmd:`dvc stage add [--run]` command.
 Each stage has three components: "deps", "outs", and "command".
 Each of the scripts in the repository will be represented by a stage in the DVC pipeline.
 
@@ -615,9 +615,10 @@ The following command sets up the stage:
    :language: console
 
    ### DVC
-   $ dvc run -n prepare \
+   $ dvc stage add -n prepare \
    -d src/prepare.py -d data/raw \
    -o data/prepared/train.csv -o data/prepared/test.csv \
+   --run \
    python src/prepare.py
 
 The ``-n`` parameter gives the stage a name, the ``-d`` parameter passes the dependencies -- the raw data -- to the command, and the ``-o`` parameter defines the outputs of the command -- the CSV files that ``prepare.py`` will create.
@@ -662,9 +663,10 @@ The following command sets it up:
    :workdir: DVCvsDL/DVC
    :language: console
 
-   $ dvc run -n train \
+   $ dvc stage add -n train \
    -d src/train.py -d data/prepared/train.csv \
    -o model/model.joblib \
+   --run \
    python src/train.py
 
 Afterwards, ``train.py`` has been executed, and the pipelines have been updated with a second stage.
@@ -684,9 +686,10 @@ The following command sets it up:
    :workdir: DVCvsDL/DVC
    :language: console
 
-   $ dvc run -n evaluate \
+   $ dvc stage add -n evaluate \
    -d src/evaluate.py -d model/model.joblib \
    -M metrics/accuracy.json \
+   --run \
    python src/evaluate.py
 
 .. runrecord:: _examples/DL-101-168-158
diff --git a/requirements-devel.txt b/requirements-devel.txt
index 6b81de0da..4293b7667 100644
--- a/requirements-devel.txt
+++ b/requirements-devel.txt
@@ -7,5 +7,5 @@ scikit-learn
 scikit-image
 # https://github.com/mwaskom/seaborn/issues/3192
 numpy < 1.24
-dvc < 3.0
+dvc >= 3.0
 datalad-catalog
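
As a rough illustration of the "deps", "outs", and "command" components that the updated commands define, the `prepare` stage from this patch would be recorded in `dvc.yaml` approximately as sketched below. This is reconstructed from the flags in the diff (`-n`, `-d`, `-o`), not captured from an actual build, and the exact key ordering and formatting may differ between DVC versions.

```yaml
# Sketch of the dvc.yaml entry for the "prepare" stage (assumed layout,
# derived from the command-line flags used in this patch).
stages:
  prepare:
    # the command given after all options
    cmd: python src/prepare.py
    # one entry per -d flag
    deps:
    - data/raw
    - src/prepare.py
    # one entry per -o flag
    outs:
    - data/prepared/test.csv
    - data/prepared/train.csv
```

With `--run`, the command is executed as soon as the stage has been written. Without it, the stage is only defined and can be executed later with `dvc repro`.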