Add deepspeed test to amd scheduled CI #27633

Merged
merged 35 commits into main from run_amd_scheduled_ci_caller_deepspeed_test
Dec 11, 2023

Conversation

echarlaix
Collaborator

No description provided.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@echarlaix echarlaix marked this pull request as ready for review November 30, 2023 14:43
@@ -271,3 +271,39 @@ jobs:
REF=main
push: true
tags: huggingface/transformers-tensorflow-gpu

# latest-pytorch-deepspeed-amd:
Collaborator Author

Also disabled, as:

# Need to be fixed with the help from Guillaume.
cc @ydshieh

Contributor

@ydshieh is there something we need to do here?

Collaborator

This is another issue that we can deal with outside this PR.

Collaborator

But here we have to build the image manually.

Collaborator Author

Just added one manually so that we can verify the deepspeed tests: echarlaix/amd-deepspeed-test

@@ -5,7 +5,7 @@ on:
- cron: "17 2 * * *"
push:
branches:
- run_amd_scheduled_ci_caller*
- run_amd_scheduled_ci_caller__*
Collaborator Author

Will remove this modification before merging (it was added to disable all the other AMD scheduled tests).

Collaborator

@ydshieh ydshieh left a comment

Thank you @echarlaix!

LGTM 🚀, except for one question in the Dockerfile.

Comment on lines 1 to 26
FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive
ARG PYTORCH='2.0.1'
ARG ROCM='5.7'

RUN apt update && \
    apt install -y --no-install-recommends libaio-dev git && \
    apt clean && \
    rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --no-cache-dir --upgrade pip

RUN python3 -m pip uninstall -y apex

ARG REF=main
WORKDIR /
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]

# When installing in editable mode, `transformers` is not recognized as a package.
# this line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop

RUN python3 -c "from deepspeed.launcher.runner import main"
Collaborator

This will (always) use torch 2.0.1, which is not what we want. I suggest one of the following:

  • either base it on docker/transformers-pytorch-amd-gpu/Dockerfile on current main

  • or don't build an AMD+deepspeed image at all: just build+install deepspeed at CI runtime

  • otherwise, keep this base image, but install the pinned versions (see the sketch below):

    • ARG PYTORCH='2.1.0'
    • ARG TORCH_VISION='0.16.0'
    • ARG TORCH_AUDIO='2.1.0'
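
For the third option, a minimal sketch of what the extra Dockerfile lines could look like; the rocm5.6 wheel index URL is an assumption, use whichever index matches the base image's ROCm stack:

ARG PYTORCH='2.1.0'
ARG TORCH_VISION='0.16.0'
ARG TORCH_AUDIO='2.1.0'

# Replace the torch 2.0.1 shipped with the base image by the pinned versions.
# The rocm5.6 index below is an assumption; use the index matching the image's ROCm version.
RUN python3 -m pip uninstall -y torch torchvision torchaudio && \
    python3 -m pip install --no-cache-dir torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO \
        --index-url https://download.pytorch.org/whl/rocm5.6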

Collaborator Author

Thanks @ydshieh, that makes sense; I just upgraded the torch version in f846b80.
I cannot use docker/transformers-pytorch-amd-gpu/Dockerfile at the moment, as there is an incompatibility between ROCm and deepspeed for some reason (coming from the adam extension), but I can modify it and change its parent image if needed.

Collaborator

In this case, try to build the image and run it to make sure we are still good.
(More importantly, check that the installed torch etc. are the expected versions; we do get some surprises sometimes :-).)

Regarding where to build the image, let's talk.

Collaborator Author

Sure, I can update huggingface/transformers-pytorch-deepspeed-amd-gpu and trigger the tests again so that we can check everything is working as expected before merging. When running the tests locally (with the updated image, so with torch==2.1.0+rocm5.6), the failing tests looked the same as for the current CI.

@ydshieh
Collaborator

ydshieh commented Dec 4, 2023

Let's cancel it. I will show you how we usually do the experimentation tomorrow.

@fxmarty
Contributor

fxmarty commented Dec 4, 2023

Sorry!

Comment on lines +26 to +27
# Invalidate docker cache from here if new commit is available.
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
Collaborator

Is this necessary? I mean, did you see an issue that adding this line avoids?

Contributor

For the motivation: https://stackoverflow.com/questions/36996046/how-to-prevent-dockerfile-caching-git-clone

We do not want docker to cache the following git clone.

Collaborator

I am not familiar with this part. Is it relevant only if we build the image on the same machine multiple times?
So far we haven't had such an issue, but maybe that is because we build the image on a GitHub Actions hosted machine rather than a self-hosted VM.

Contributor

I had the issue locally when using this docker image, where I would not pick up the latest transformers commit due to the docker cache. I think it is useful to make sure we use the latest commit, even though in the CI this is not an issue.

Collaborator

OK.

So the issue occurs when you run the docker image, not at the time of docker build, right?

Collaborator

If this is more about when we use the image, I think it is expected. These images are not built for long-term usage: they are re-built on a daily basis to run the CI at one specific commit.

Anyone who wants to use the image locally is responsible for doing git fetch && git pull (or git checkout).

If we accept this change (and IIUC), it means that, on the CI (GitHub Actions), each job may get a different commit to run the tests against, which is not what we want.

The above is just to explain the current behavior (before this change), not to say we have an issue, as on CI we have

git fetch && git checkout ${{ github.sha }}

so we are safe.
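
For reference, a sketch of how that checkout typically appears as a workflow step (the step name here is illustrative, not copied from the workflow file):

      - name: Update clone to the commit under test
        working-directory: /transformers
        run: git fetch && git checkout ${{ github.sha }}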

Contributor

@fxmarty fxmarty Dec 5, 2023

So the issue occurs when you run the docker image, not at the time of docker build, right?

No, docker build caches intermediate layers, one of them being RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF. If we use REF=main and actually want to use the latest commit, we need to invalidate the docker cache, and that is what this ADD does.

Otherwise, the cached intermediate layer is used and we may end up with an outdated commit compared to the latest commit available at build time.
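
To make the caching behavior concrete, here is the relevant Dockerfile fragment again with annotations added (the two instructions are from this PR; only the comments are new):

# Everything above this point can safely be served from the docker build cache.
# The ADD below downloads the current ref of main on every build; when main moves,
# the downloaded file changes, which invalidates the cache for this layer and for
# every layer that follows it.
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json

# Because the cache was just invalidated, this clone is re-executed and picks up
# the latest commit instead of reusing a stale cached layer.
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF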

Collaborator

OK, so during build time. Thanks for the detailed explanation.


ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF

RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]
Collaborator

I would suggest having

RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1

or whatever the equivalent is for deepspeed on ROCm, if necessary.

Collaborator Author

Sure, we can pre-compile deepspeed for this set of ops; I was just wondering whether we can keep it in JIT mode so that all the machine-compatible ops can be dynamically built at runtime.

Collaborator

JIT mode will make some tests slower and potentially time out, right?

Collaborator Author

I didn't observe this when testing, but added it just in case: 9696cc4

Collaborator

Thanks @echarlaix. The most important thing is to do this again at CI time, as mentioned in the comment below.

(It may or may not be relevant now, but I never checked again. I keep both just to avoid potential issues popping up.)

- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .

Collaborator

maybe add

      # To avoid unknown test failures
      - name: Pre build DeepSpeed *again*
        working-directory: /workspace
        run: |
          python3 -m pip uninstall -y deepspeed
          DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

to make it the same as in the other workflow file.

Collaborator Author

Not sure I understand why we need to uninstall and reinstall deepspeed here; what issue does it solve?

Collaborator

I don't remember exactly; it was a year or more ago. I can try to find it in the history if you would like the information.

Collaborator Author

In our case we don't need it at the moment, so does it work if we keep it that way? If you want me to uninstall/reinstall it in the tests, I can directly update and use huggingface/transformers-pytorch-amd-gpu and just install deepspeed in the tests directly (to avoid this step).


@ydshieh
Collaborator

ydshieh commented Dec 5, 2023

Hi @fxmarty, let me know if you have any questions or need help regarding my above comments.

@echarlaix
Collaborator Author

Thanks @fxmarty @ydshieh for the updates while I was sick. Let me push the new image manually; I'm just launching all the tests locally to verify everything is working with the updated image before pushing.

@ydshieh
Collaborator

ydshieh commented Dec 5, 2023

Hi @echarlaix, you don't need to run all the tests. Just make sure that

  • the image builds
  • the deepspeed build step works
  • the workflow contains no bugs, i.e. it can be triggered and run on GH Actions
  • the deepspeed tests can be launched (we don't really care how many failing tests there are for now)

(a quick local sanity check is sketched below)
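
A minimal local sanity check along those lines; the image tag is illustrative, the test selected with -k is just one of the tests listed below, and the --device flags assume a machine with an AMD GPU:

# Build the image from the PR's Dockerfile (tag name is illustrative).
docker build -t transformers-pytorch-deepspeed-amd-gpu -f docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile .

# Check that the installed torch is the expected ROCm build.
docker run --rm transformers-pytorch-deepspeed-amd-gpu python3 -c "import torch; print(torch.__version__)"

# Confirm a deepspeed test can at least be launched (failures are fine for now).
docker run --rm --device=/dev/kfd --device=/dev/dri transformers-pytorch-deepspeed-amd-gpu \
    python3 -m pytest /transformers/tests/deepspeed/test_deepspeed.py -k test_do_eval_no_train -v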

@echarlaix
Collaborator Author

echarlaix commented Dec 6, 2023

There are 11 failing tests for AMD vs 7 for the current CI; the 4 extra tests are bf16 variants of already-failing tests.

Failing tests, current CI:

tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_do_eval_no_train
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero2_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero3_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero2_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero3_fp16
tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero3_qa_led
tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero3_trans_fsmt

Failing tests for AMD:

tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_do_eval_no_train
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero2_bf16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero2_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero3_bf16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero3_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero2_bf16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero2_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero3_bf16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero3_fp16
tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero3_trans_t5_v1
tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero3_trans_fsmt

Collaborator

@ydshieh ydshieh left a comment

Thanks @fxmarty @echarlaix! I pushed a commit to finalize it.

Ping @LysandreJik for a core maintainer's review.

@ydshieh ydshieh requested a review from LysandreJik December 7, 2023 15:52
Member

@LysandreJik LysandreJik left a comment

Thanks all!

@ydshieh
Collaborator

ydshieh commented Dec 11, 2023

I will merge after fix #27951 is merged.

@ydshieh ydshieh merged commit 39acfe8 into main Dec 11, 2023
24 of 40 checks passed
@ydshieh ydshieh deleted the run_amd_scheduled_ci_caller_deepspeed_test branch December 11, 2023 15:33
iantbutler01 pushed a commit to BismuthCloud/transformers that referenced this pull request Dec 16, 2023
* add deepspeed scheduled test for amd

* fix image

* add dockerfile

* add comment

* enable tests

* trigger

* remove trigger for this branch

* trigger

* change runner env to trigger the docker build image test

* use new docker image

* remove test suffix from docker image tag

* replace test docker image with original image

* push new image

* Trigger

* add back amd tests

* fix typo

* add amd tests back

* fix

* comment until docker image build scheduled test fix

* remove deprecated deepspeed build option

* upgrade torch

* update docker & make tests pass

* Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile

* fix

* tmp disable test

* precompile deepspeed to avoid timeout during tests

* fix comment

* trigger deepspeed tests with new image

* comment tests

* trigger

* add sklearn dependency to fix slow tests

* enable back other tests

* final update

---------

Co-authored-by: Felix Marty <[email protected]>
Co-authored-by: Félix Marty <[email protected]>
Co-authored-by: ydshieh <[email protected]>
staghado pushed a commit to staghado/transformers that referenced this pull request Jan 15, 2024