Add deepspeed test to amd scheduled CI #27633

Merged: 35 commits, Dec 11, 2023

Commits:
1e8ce66  add deepspeed scheduled test for amd (echarlaix, Nov 21, 2023)
bf276ed  fix image (echarlaix, Nov 21, 2023)
2cfb53d  add dockerfile (echarlaix, Nov 23, 2023)
5a9a529  add comment (echarlaix, Nov 23, 2023)
af46e87  enable tests (echarlaix, Nov 23, 2023)
c29d249  trigger (echarlaix, Nov 27, 2023)
a0c3daf  remove trigger for this branch (echarlaix, Nov 27, 2023)
4cb9d6f  trigger (echarlaix, Nov 28, 2023)
a703349  change runner env to trigger the docker build image test (echarlaix, Nov 28, 2023)
a47ac2c  use new docker image (echarlaix, Nov 28, 2023)
233bd7f  remove test suffix from docker image tag (echarlaix, Nov 28, 2023)
971ba80  replace test docker image with original image (echarlaix, Nov 28, 2023)
da4774c  push new image (echarlaix, Nov 29, 2023)
cbe995f  Trigger (echarlaix, Nov 30, 2023)
e16c271  add back amd tests (echarlaix, Nov 30, 2023)
70c3580  fix typo (echarlaix, Nov 30, 2023)
090b88e  add amd tests back (echarlaix, Nov 30, 2023)
508ae29  fix (echarlaix, Nov 30, 2023)
09fee9e  comment until docker image build scheduled test fix (echarlaix, Nov 30, 2023)
407cfe9  remove deprecated deepspeed build option (echarlaix, Nov 30, 2023)
f846b80  upgrade torch (echarlaix, Nov 30, 2023)
785b63a  update docker & make tests pass (Dec 4, 2023)
ba8cc9f  Merge branch 'main' into run_amd_scheduled_ci_caller_deepspeed_test (fxmarty, Dec 4, 2023)
f0f931e  Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile (fxmarty, Dec 5, 2023)
40398b9  fix (echarlaix, Dec 5, 2023)
3332cd2  tmp disable test (echarlaix, Dec 5, 2023)
9696cc4  precompile deepspeed to avoid timeout during tests (echarlaix, Dec 5, 2023)
84a7a33  fix comment (echarlaix, Dec 5, 2023)
fc6d890  Merge branch 'main' into run_amd_scheduled_ci_caller_deepspeed_test (echarlaix, Dec 5, 2023)
df00cff  trigger deepspeed tests with new image (echarlaix, Dec 5, 2023)
92c402d  comment tests (echarlaix, Dec 5, 2023)
fa82a9c  trigger (echarlaix, Dec 6, 2023)
ecb9239  add sklearn dependency to fix slow tests (echarlaix, Dec 6, 2023)
cfcc312  enable back other tests (echarlaix, Dec 6, 2023)
ae82b3f  final update (ydshieh, Dec 7, 2023)
36 changes: 36 additions & 0 deletions .github/workflows/build-docker-images.yml
@@ -271,3 +271,39 @@ jobs:
            REF=main
          push: true
          tags: huggingface/transformers-tensorflow-gpu

  # latest-pytorch-deepspeed-amd:
Collaborator (PR author) commented:
Also disabled, as the inline comment notes:
# Need to be fixed with the help from Guillaume.
cc @ydshieh

Contributor commented:
@ydshieh is there something we need to do here?

Collaborator commented:
This is another issue that we can deal with outside this PR.

Collaborator commented:
But here we have to build the image manually.

Collaborator (PR author) commented:
I just added one manually so that we can verify the deepspeed tests: echarlaix/amd-deepspeed-test.

  #   name: "PyTorch + DeepSpeed (AMD) [dev]"
  #   runs-on: [self-hosted, docker-gpu, amd-gpu, single-gpu, mi210]
  #   steps:
  #     - name: Set up Docker Buildx
  #       uses: docker/setup-buildx-action@v3
  #     - name: Check out code
  #       uses: actions/checkout@v3
  #     - name: Login to DockerHub
  #       uses: docker/login-action@v3
  #       with:
  #         username: ${{ secrets.DOCKERHUB_USERNAME }}
  #         password: ${{ secrets.DOCKERHUB_PASSWORD }}
  #     - name: Build and push
  #       uses: docker/build-push-action@v5
  #       with:
  #         context: ./docker/transformers-pytorch-deepspeed-amd-gpu
  #         build-args: |
  #           REF=main
  #         push: true
  #         tags: huggingface/transformers-pytorch-deepspeed-amd-gpu${{ inputs.image_postfix }}
  #     # Push CI images still need to be re-built daily
  #     -
  #       name: Build and push (for Push CI) on a daily basis
  #       # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
  #       # The latter case is useful for manual image building for debugging purposes. Use another tag in this case!
  #       if: inputs.image_postfix != '-push-ci'
  #       uses: docker/build-push-action@v5
  #       with:
  #         context: ./docker/transformers-pytorch-deepspeed-amd-gpu
  #         build-args: |
  #           REF=main
  #         push: true
  #         tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci
4 changes: 2 additions & 2 deletions .github/workflows/self-nightly-scheduled.yml
@@ -212,7 +212,7 @@ jobs:
          python3 -m pip uninstall -y deepspeed
          rm -rf DeepSpeed
          git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build
-         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
+         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

      - name: NVIDIA-SMI
        run: |

@@ -286,4 +286,4 @@ jobs:
        with:
          name: |
            single-*
-           multi-*
\ No newline at end of file
+           multi-*
4 changes: 2 additions & 2 deletions .github/workflows/self-past.yml
@@ -267,7 +267,7 @@ jobs:
          python3 -m pip uninstall -y deepspeed
          rm -rf DeepSpeed
          git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build
-         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
+         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

      - name: NVIDIA-SMI
        run: |

@@ -353,4 +353,4 @@ jobs:
        with:
          name: |
            single-*
-           multi-*
\ No newline at end of file
+           multi-*
4 changes: 2 additions & 2 deletions .github/workflows/self-push.yml
@@ -366,7 +366,7 @@ jobs:
        working-directory: /workspace
        run: |
          python3 -m pip uninstall -y deepspeed
-         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
+         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

      - name: NVIDIA-SMI
        run: |

@@ -456,7 +456,7 @@ jobs:
        working-directory: /workspace
        run: |
          python3 -m pip uninstall -y deepspeed
-         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
+         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

      - name: NVIDIA-SMI
        run: |
61 changes: 59 additions & 2 deletions .github/workflows/self-scheduled-amd.yml
@@ -356,6 +356,63 @@ jobs:
          name: ${{ matrix.machine_type }}_run_tests_torch_pipeline_gpu
          path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_pipeline_gpu

  run_tests_torch_deepspeed_gpu:
    name: Torch ROCm deepspeed tests
    strategy:
      fail-fast: false
      matrix:
        machine_type: [single-gpu, multi-gpu]

    runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}']
    needs: setup
    container:
      image: huggingface/transformers-pytorch-deepspeed-amd-gpu
      options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
    steps:
      - name: Update clone
        working-directory: /transformers
        run: git fetch && git checkout ${{ github.sha }}

      - name: Reinstall transformers in edit mode (remove the one installed during docker image build)
        working-directory: /transformers
        run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .

Collaborator commented:
Maybe add:

      # To avoid unknown test failures
      - name: Pre build DeepSpeed *again*
        working-directory: /workspace
        run: |
          python3 -m pip uninstall -y deepspeed
          DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

to be the same as in the other workflow files.

Collaborator (PR author) commented:
I'm not sure I understand why we need to uninstall and reinstall deepspeed here. What issue does it solve?

Collaborator commented:
I don't remember exactly; it was a year or more ago. I can try to find the reason in the history if you'd like the information.

Collaborator (PR author) commented:
In our case we don't need it at the moment, so does it work if we keep it this way? If you want me to uninstall and reinstall it in the tests, I can instead update the job to use huggingface/transformers-pytorch-amd-gpu and just install deepspeed in the tests directly (to avoid this step).

      - name: ROCM-SMI
        run: |
          rocm-smi

      - name: ROCM-INFO
        run: |
          rocminfo | grep "Agent" -A 14

      - name: Show ROCR environment
        run: |
          echo "ROCR: $ROCR_VISIBLE_DEVICES"

      - name: Environment
        working-directory: /transformers
        run: |
          python3 utils/print_env.py

      - name: Show installed libraries and their versions
        working-directory: /transformers
        run: pip freeze

      - name: Run all tests on GPU
        working-directory: /transformers
        run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_torch_deepspeed_gpu tests/deepspeed tests/extended

      - name: Failure short reports
        if: ${{ failure() }}
        continue-on-error: true
        run: cat /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu/failures_short.txt

      - name: Test suite reports artifacts
        if: ${{ always() }}
        uses: actions/upload-artifact@v3
        with:
          name: ${{ matrix.machine_type }}_run_tests_torch_deepspeed_gpu_test_reports
          path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu

  run_extract_warnings:
    name: Extract warnings in CI artifacts
    runs-on: ubuntu-22.04

@@ -368,7 +425,7 @@ jobs:
      run_tests_multi_gpu,
      run_examples_gpu,
      run_pipelines_torch_gpu,
-     # run_all_tests_torch_cuda_extensions_gpu
+     run_tests_torch_deepspeed_gpu
    ]
    steps:
      - name: Checkout transformers

@@ -417,7 +474,7 @@ jobs:
      run_tests_multi_gpu,
      run_examples_gpu,
      run_pipelines_torch_gpu,
-     # run_all_tests_torch_cuda_extensions_gpu,
+     run_tests_torch_deepspeed_gpu,
      run_extract_warnings
    ]
    steps:
2 changes: 1 addition & 1 deletion .github/workflows/self-scheduled.yml
@@ -366,7 +366,7 @@ jobs:
        working-directory: /workspace
        run: |
          python3 -m pip uninstall -y deepspeed
-         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
+         DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

      - name: NVIDIA-SMI
        run: |
4 changes: 4 additions & 0 deletions docker/transformers-pytorch-amd-gpu/Dockerfile
@@ -22,7 +22,11 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip setuptools ninja git+htt

ARG REF=main
WORKDIR /

# Invalidate docker cache from here if new commit is available.
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
Collaborator commented (on lines +26 to +27):
Is this necessary? I mean, did you see an issue that adding this line avoids?

Contributor commented:
For the motivation: https://stackoverflow.com/questions/36996046/how-to-prevent-dockerfile-caching-git-clone
We do not want docker to cache the following git clone.

Collaborator commented:
I am not familiar with this part. Is it relevant only if we build the image on the same machine multiple times? So far we haven't had such an issue, but maybe that is because we build the image on a GitHub Actions hosted machine rather than a self-hosted VM.

Contributor commented:
I had the issue locally when using this docker image: I would not pick up the latest transformers commit due to the docker cache. I think it is useful to make sure we use the latest commit, even though in the CI this is not an issue.

Collaborator commented:
OK. So the issue occurs when you run the docker image, not at the time of docker build, right?

Collaborator commented:
If this is more about when we use the image, I think it is expected: these images are not built for long-term use; they are re-built on a daily basis to run the CI at a specific, fixed commit.
Anyone who wants to use the image locally is responsible for doing git fetch && git pull (or git checkout).
If we accept this change (and IIUC), it means that on the CI (GitHub Actions) each job may get a different commit to run the tests against, which is not what we want.
The above is just to explain the current behavior (before this change), not to say we have an issue, because on the CI we run
git fetch && git checkout ${{ github.sha }}
so we are safe.

Contributor (@fxmarty, Dec 5, 2023) commented:
> So the issue occurs when you run the docker image, not at the time of docker build, right?
No, docker build caches intermediate layers, one of them being RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF. If we use REF=main and actually want to use the latest commit, we need to invalidate the docker cache, and that is what this ADD is doing. Otherwise, the cached intermediate layer is used and we may get an outdated commit compared to the latest commit available at build time.

Collaborator commented:
OK, so during build time. Thanks for the detailed explanation.
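To illustrate the mechanism discussed above (a sketch, not part of the PR): the ref file fetched by the ADD line contains the current commit SHA of main, so its content changes on every push, and Docker rebuilds every layer after it.

    # The "sha" field in this JSON changes with each new commit to main,
    # which invalidates the cached ADD layer and everything after it.
    curl -s https://api.github.com/repos/huggingface/transformers/git/refs/heads/main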

RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF

RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing,video]

RUN python3 -m pip uninstall -y tensorflow flax
45 changes: 45 additions & 0 deletions docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile
@@ -0,0 +1,45 @@
FROM rocm/dev-ubuntu-22.04:5.6
LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive
ARG PYTORCH='2.1.1'
ARG TORCH_VISION='0.16.1'
ARG TORCH_AUDIO='2.1.1'
ARG ROCM='5.6'

RUN apt update && \
apt install -y --no-install-recommends \
libaio-dev \
git \
# These are required to build deepspeed.
python3-dev \
python-is-python3 \
rocrand-dev \
rocthrust-dev \
hipsparse-dev \
hipblas-dev \
rocblas-dev && \
apt clean && \
rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --no-cache-dir --upgrade pip ninja "pydantic<2"
RUN python3 -m pip uninstall -y apex torch torchvision torchaudio
RUN python3 -m pip install torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO --index-url https://download.pytorch.org/whl/rocm$ROCM --no-cache-dir

# Pre-build DeepSpeed, so it will be ready for testing (to avoid timeout)
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache-dir -v --disable-pip-version-check 2>&1

ARG REF=main
WORKDIR /

# Invalidate docker cache from here if new commit is available.
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF

RUN python3 -m pip install --no-cache-dir ./transformers[accelerate,testing,sentencepiece,sklearn]

# When installing in editable mode, `transformers` is not recognized as a package.
# This line must be added in order for Python to be aware of transformers.
RUN cd transformers && python3 setup.py develop

RUN python3 -c "from deepspeed.launcher.runner import main"
4 changes: 2 additions & 2 deletions docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile
@@ -34,7 +34,7 @@ RUN python3 -m pip uninstall -y torch-tensorrt

# recompile apex
RUN python3 -m pip uninstall -y apex
-RUN git clone https://github.com/NVIDIA/apex
+# RUN git clone https://github.com/NVIDIA/apex
# `MAX_JOBS=1` disables parallel building to avoid cpu memory OOM when building image on GitHub Action (standard) runners
# TODO: check if there is an alternative way to install the latest apex
# RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
@@ -44,7 +44,7 @@ RUN python3 -m pip uninstall -y deepspeed
# This has to be run (again) inside the GPU VMs running the tests.
# The installation works here, but some tests fail if we don't pre-build deepspeed again in the VMs running the tests.
# TODO: Find out why tests fail.
-RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
+RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1

# When installing in editable mode, `transformers` is not recognized as a package.
# This line must be added in order for Python to be aware of transformers.
4 changes: 2 additions & 2 deletions tests/deepspeed/test_deepspeed.py
@@ -561,8 +561,8 @@ def test_gradient_accumulation(self, stage, dtype):
        self.assertAlmostEqual(no_grad_accum_a, yes_grad_accum_a, places=5)
        self.assertAlmostEqual(no_grad_accum_b, yes_grad_accum_b, places=5)

-       # see the note above how to get identical loss on a small bs
-       self.assertAlmostEqual(no_grad_accum_loss, yes_grad_accum_loss, places=2)
+       # Relative difference. See the note above how to get identical loss on a small bs
Contributor commented:
test_gradient_accumulation is flaky on ROCm (no_grad_accum_loss=10.083333333333334, yes_grad_accum_loss=10.0751953125 on an MI250; no_grad_accum_loss=10.078125, yes_grad_accum_loss=10.0751953125 on an A100), hence this change to compare the relative difference instead.

+       self.assertTrue((no_grad_accum_loss - yes_grad_accum_loss) / (no_grad_accum_loss + 1e-15) <= 1e-3)

    def check_saved_checkpoints_deepspeed(self, output_dir, freq, total, stage, dtype):
        # adapted from TrainerIntegrationCommon.check_saved_checkpoints
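As a sanity check of the new tolerance, plugging in the MI250 losses quoted in the review (an illustrative calculation, not part of the test suite):

    python3 -c "print((10.083333333333334 - 10.0751953125) / (10.083333333333334 + 1e-15))"
    # ~8.07e-04, safely within the 1e-3 tolerance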