Add deepspeed test to amd scheduled CI #27633

Merged
merged 35 commits into main from run_amd_scheduled_ci_caller_deepspeed_test
Dec 11, 2023

Conversation

echarlaix
Collaborator

No description provided.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@echarlaix echarlaix marked this pull request as ready for review November 30, 2023 14:43
@@ -271,3 +271,39 @@ jobs:
REF=main
push: true
tags: huggingface/transformers-tensorflow-gpu

# latest-pytorch-deepspeed-amd:
Collaborator Author

Also disabled, as:

# Need to be fixed with the help from Guillaume.
cc @ydshieh

Contributor

@ydshieh is there something we need to do here?

Collaborator

This is another issue that we can deal with outside this PR.

Collaborator

But here we have to build the image manually.

Collaborator Author

Just added one manually so that we can verify the deepspeed tests: echarlaix/amd-deepspeed-test

@@ -5,7 +5,7 @@ on:
- cron: "17 2 * * *"
push:
branches:
- run_amd_scheduled_ci_caller*
- run_amd_scheduled_ci_caller__*
Collaborator Author

Will remove this modification before merging (it was added to disable all the other AMD scheduled tests).

Collaborator

@ydshieh ydshieh left a comment

Thank you @echarlaix!

LGTM 🚀, except for one question in the Dockerfile.

Comment on lines 1 to 26
FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive
ARG PYTORCH='2.0.1'
ARG ROCM='5.7'

RUN apt update && \
    apt install -y --no-install-recommends libaio-dev git && \
    apt clean && \
    rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --no-cache-dir --upgrade pip

RUN python3 -m pip uninstall -y apex

ARG REF=main
WORKDIR /
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]

# When installing in editable mode, `transformers` is not recognized as a package.
# this line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop

RUN python3 -c "from deepspeed.launcher.runner import main"
Collaborator

This will (always) use torch 2.0.1, which is not what we want. I suggest one of the following:

  • either base it on docker/transformers-pytorch-amd-gpu/Dockerfile on current main

  • or don't build an AMD+deepspeed image at all: just build+install deepspeed at CI runtime

  • otherwise, keep this base image, but install the pinned versions (see the sketch below):

    • ARG PYTORCH='2.1.0'
    • ARG TORCH_VISION='0.16.0'
    • ARG TORCH_AUDIO='2.1.0'
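
For the third option, a minimal sketch of what the extra Dockerfile lines could look like; the rocm5.6 wheel index URL is an assumption, use whichever index matches the base image's ROCm stack:

ARG PYTORCH='2.1.0'
ARG TORCH_VISION='0.16.0'
ARG TORCH_AUDIO='2.1.0'

# Replace the torch 2.0.1 shipped with the base image by the pinned versions.
# The rocm5.6 index below is an assumption; use the index matching the image's ROCm version.
RUN python3 -m pip uninstall -y torch torchvision torchaudio && \
    python3 -m pip install --no-cache-dir torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO \
        --index-url https://download.pytorch.org/whl/rocm5.6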

Collaborator Author

Thanks @ydshieh, that makes sense; I just upgraded the torch version in f846b80.
I cannot use docker/transformers-pytorch-amd-gpu/Dockerfile at the moment, as there is an incompatibility between ROCm and deepspeed for some reason (coming from the adam extension), but I can modify it and change its parent image if needed.

Collaborator

In this case, try to build the image and run it to make sure we are still good.
(More importantly, check that the installed torch etc. are the expected versions; we do get some surprises sometimes :-).)

Regarding where to build the image, let's talk.

Collaborator Author

Sure, I can update huggingface/transformers-pytorch-deepspeed-amd-gpu and trigger the tests again so that we can check everything is working as expected before merging. When running the tests locally (with the updated image, so with torch==2.1.0+rocm5.6), the failing tests looked the same as for the current CI.

@ydshieh
Collaborator

ydshieh commented Dec 4, 2023

Let's cancel it. I will show you how we usually do the experimentation tomorrow.

@fxmarty
Contributor

fxmarty commented Dec 4, 2023

Sorry!

Comment on lines +26 to +27
# Invalidate docker cache from here if new commit is available.
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
Collaborator

Is this necessary? I mean, did you see an issue that adding this line avoids?

Contributor

For the motivation: https://stackoverflow.com/questions/36996046/how-to-prevent-dockerfile-caching-git-clone

We do not want docker to cache the following git clone.

Collaborator

I am not familiar with this part. Is it relevant only if we build the image on the same machine multiple times?
So far we haven't had such an issue, but maybe that is because we build the image on a GitHub Actions hosted machine rather than a self-hosted VM.

Contributor

I had the issue locally when using this docker image, where I would not pick up the latest transformers commit due to the docker cache. I think it is useful to make sure we use the latest commit, even though in the CI this is not an issue.

Collaborator

OK.

So the issue occurs when you run the docker image, not at the time of docker build, right?

Collaborator

If this is more about when we use the image, I think it is expected. These images are not built for long-term usage: they are re-built on a daily basis to run the CI at one specific commit.

Anyone who wants to use the image locally is responsible for doing git fetch && git pull (or git checkout).

If we accept this change (and IIUC), it means that, on the CI (GitHub Actions), each job may get a different commit to run the tests against, which is not what we want.

The above is just to explain the current behavior (before this change), not to say we have an issue, as on CI we have

git fetch && git checkout ${{ github.sha }}

so we are safe.
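
For reference, a sketch of how that checkout typically appears as a workflow step (the step name here is illustrative, not copied from the workflow file):

      - name: Update clone to the commit under test
        working-directory: /transformers
        run: git fetch && git checkout ${{ github.sha }}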

Contributor

@fxmarty fxmarty Dec 5, 2023

So the issue occurs when you run the docker image, not at the time of docker build, right?

No, docker build caches intermediate layers, one of them being RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF. If we use REF=main and actually want to use the latest commit, we need to invalidate the docker cache, and that is what this ADD does.

Otherwise, the cached intermediate layer is used and we may end up with an outdated commit compared to the latest commit available at build time.
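
To make the caching behavior concrete, here is the relevant Dockerfile fragment again with annotations added (the two instructions are from this PR; only the comments are new):

# Everything above this point can safely be served from the docker build cache.
# The ADD below downloads the current ref of main on every build; when main moves,
# the downloaded file changes, which invalidates the cache for this layer and for
# every layer that follows it.
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json

# Because the cache was just invalidated, this clone is re-executed and picks up
# the latest commit instead of reusing a stale cached layer.
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF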

Collaborator

OK, so during build time. Thanks for the detailed explanation.


ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF

RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]
Collaborator

I would suggest having

RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1

or whatever the equivalent is for deepspeed on ROCm, if necessary.

Collaborator Author

Sure, we can pre-compile deepspeed for this set of ops; I was just wondering whether we can keep it in JIT mode so that all the machine-compatible ops can be dynamically built at runtime.

Collaborator

JIT mode will make some tests slower and potentially time out, right?

Collaborator Author

I didn't observe this when testing, but added it just in case: 9696cc4

Collaborator

Thanks @echarlaix. The most important thing is to do this again at CI time, as mentioned in the comment below.

(It may or may not be relevant now, but I never checked again. I keep both just to avoid potential issues popping up.)

- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .

Collaborator

maybe add

      # To avoid unknown test failures
      - name: Pre build DeepSpeed *again*
        working-directory: /workspace
        run: |
          python3 -m pip uninstall -y deepspeed
          DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

to make it the same as in the other workflow file.

Collaborator Author

Not sure I understand why we need to uninstall and reinstall deepspeed here; what issue does it solve?

Collaborator

I don't remember exactly; it was a year or more ago. I can try to find it in the history if you would like the information.

Collaborator Author

In our case we don't need it at the moment, so does it work if we keep it that way? If you want me to uninstall/reinstall it in the tests, I can directly update and use huggingface/transformers-pytorch-amd-gpu and just install deepspeed in the tests directly (to avoid this step).


@ydshieh
Collaborator

ydshieh commented Dec 5, 2023

Hi @fxmarty, let me know if you have any questions or need help regarding my above comments.

@echarlaix
Collaborator Author

Thanks @fxmarty @ydshieh for the updates while I was sick. Let me push the new image manually; I'm just launching all the tests locally to verify everything is working with the updated image before pushing.

@ydshieh
Collaborator

ydshieh commented Dec 5, 2023

Hi @echarlaix, you don't need to run all the tests. Just make sure that

  • the image builds
  • the deepspeed build step works
  • the workflow contains no bugs, i.e. it can be triggered and run on GH Actions
  • the deepspeed tests can be launched (we don't really care how many failing tests there are for now)

(a quick local sanity check is sketched below)
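
A minimal local sanity check along those lines; the image tag is illustrative, the test selected with -k is just one of the tests listed below, and the --device flags assume a machine with an AMD GPU:

# Build the image from the PR's Dockerfile (tag name is illustrative).
docker build -t transformers-pytorch-deepspeed-amd-gpu -f docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile .

# Check that the installed torch is the expected ROCm build.
docker run --rm transformers-pytorch-deepspeed-amd-gpu python3 -c "import torch; print(torch.__version__)"

# Confirm a deepspeed test can at least be launched (failures are fine for now).
docker run --rm --device=/dev/kfd --device=/dev/dri transformers-pytorch-deepspeed-amd-gpu \
    python3 -m pytest /transformers/tests/deepspeed/test_deepspeed.py -k test_do_eval_no_train -v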

@echarlaix
Collaborator Author

echarlaix commented Dec 6, 2023

There are 11 failing tests for AMD vs 7 for the current CI; the 4 extra tests are bf16 variants of already-failing tests.

Failing tests, current CI:

tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_do_eval_no_train
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero2_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero3_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero2_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero3_fp16
tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero3_qa_led
tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero3_trans_fsmt

Failing tests for AMD:

tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_do_eval_no_train
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero2_bf16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero2_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero3_bf16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_fp32_non_distributed_zero3_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero2_bf16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero2_fp16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero3_bf16
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_resume_train_not_from_ds_checkpoint_zero3_fp16
tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero3_trans_t5_v1
tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero3_trans_fsmt

Collaborator

@ydshieh ydshieh left a comment

Thanks @fxmarty @echarlaix! I pushed a commit to finalize it.

Ping @LysandreJik for a core maintainer's review.

@ydshieh ydshieh requested a review from LysandreJik December 7, 2023 15:52
Member

@LysandreJik LysandreJik left a comment

Thanks all!

@ydshieh
Collaborator

ydshieh commented Dec 11, 2023

I will merge after fix #27951 is merged.

@ydshieh ydshieh merged commit 39acfe8 into main Dec 11, 2023
24 of 40 checks passed
@ydshieh ydshieh deleted the run_amd_scheduled_ci_caller_deepspeed_test branch December 11, 2023 15:33
iantbutler01 pushed a commit to BismuthCloud/transformers that referenced this pull request Dec 16, 2023
* add deepspeed scheduled test for amd

* fix image

* add dockerfile

* add comment

* enable tests

* trigger

* remove trigger for this branch

* trigger

* change runner env to trigger the docker build image test

* use new docker image

* remove test suffix from docker image tag

* replace test docker image with original image

* push new image

* Trigger

* add back amd tests

* fix typo

* add amd tests back

* fix

* comment until docker image build scheduled test fix

* remove deprecated deepspeed build option

* upgrade torch

* update docker & make tests pass

* Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile

* fix

* tmp disable test

* precompile deepspeed to avoid timeout during tests

* fix comment

* trigger deepspeed tests with new image

* comment tests

* trigger

* add sklearn dependency to fix slow tests

* enable back other tests

* final update

---------

Co-authored-by: Felix Marty <[email protected]>
Co-authored-by: Félix Marty <[email protected]>
Co-authored-by: ydshieh <[email protected]>
staghado pushed a commit to staghado/transformers that referenced this pull request Jan 15, 2024