Add deepspeed test to amd scheduled CI #27633
AMD scheduled CI workflow:

@@ -356,6 +356,63 @@ jobs:
          name: ${{ matrix.machine_type }}_run_tests_torch_pipeline_gpu
          path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_pipeline_gpu

  run_tests_torch_deepspeed_gpu:
    name: Torch ROCm deepspeed tests
    strategy:
      fail-fast: false
      matrix:
        machine_type: [single-gpu, multi-gpu]

    runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}']
    needs: setup
    container:
      image: huggingface/transformers-pytorch-deepspeed-amd-gpu
      options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
    steps:
      - name: Update clone
        working-directory: /transformers
        run: git fetch && git checkout ${{ github.sha }}

      - name: Reinstall transformers in edit mode (remove the one installed during docker image build)
        working-directory: /transformers
        run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
Review thread at this point in the job:

- "Maybe add … to be the same as in the other workflow file."
- "Not sure I understand why we need to uninstall and reinstall deepspeed here; what issue does it solve?"
- "I don't remember exactly; it was a year or more ago. I can try to find it in the history if you would like the information."
- "In our case we don't need it at the moment, so does it work if we keep it that way? If you want me to uninstall / reinstall it in the tests, I can directly update and use …"
      - name: ROCM-SMI
        run: |
          rocm-smi

      - name: ROCM-INFO
        run: |
          rocminfo | grep "Agent" -A 14

      - name: Show ROCR environment
        run: |
          echo "ROCR: $ROCR_VISIBLE_DEVICES"

      - name: Environment
        working-directory: /transformers
        run: |
          python3 utils/print_env.py

      - name: Show installed libraries and their versions
        working-directory: /transformers
        run: pip freeze

      - name: Run all tests on GPU
        working-directory: /transformers
        run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_torch_deepspeed_gpu tests/deepspeed tests/extended

      - name: Failure short reports
        if: ${{ failure() }}
        continue-on-error: true
        run: cat /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu/failures_short.txt

      - name: Test suite reports artifacts
        if: ${{ always() }}
        uses: actions/upload-artifact@v3
        with:
          name: ${{ matrix.machine_type }}_run_tests_torch_deepspeed_gpu_test_reports
          path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu

  run_extract_warnings:
    name: Extract warnings in CI artifacts
    runs-on: ubuntu-22.04

@@ -368,7 +425,7 @@ jobs:
        run_tests_multi_gpu,
        run_examples_gpu,
        run_pipelines_torch_gpu,
-       # run_all_tests_torch_cuda_extensions_gpu
+       run_tests_torch_deepspeed_gpu
      ]
    steps:
      - name: Checkout transformers

@@ -417,7 +474,7 @@ jobs:
        run_tests_multi_gpu,
        run_examples_gpu,
        run_pipelines_torch_gpu,
-       # run_all_tests_torch_cuda_extensions_gpu,
+       run_tests_torch_deepspeed_gpu,
        run_extract_warnings
      ]
    steps:
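As a quick complement to the ROCM-SMI and ROCR environment steps in the new job, here is a minimal sketch (not part of the PR) of a check one could run inside the container to confirm that the ROCm build of PyTorch actually sees the devices the runner exposes:

```python
# Hypothetical sanity check, not part of the workflow: confirm that the ROCm
# build of torch can see the GPUs exposed through ROCR_VISIBLE_DEVICES.
import os

import torch

print("ROCR_VISIBLE_DEVICES:", os.environ.get("ROCR_VISIBLE_DEVICES"))
print("HIP version:", torch.version.hip)             # set only on ROCm builds of torch
print("GPU available:", torch.cuda.is_available())   # torch.cuda also covers HIP devices
print("Device count:", torch.cuda.device_count())
```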
Existing Dockerfile (PyTorch image):

@@ -22,7 +22,11 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip setuptools ninja git+htt

ARG REF=main
WORKDIR /

# Invalidate docker cache from here if new commit is available.
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
Review thread on lines +26 to +27 (the cache-invalidation comment and the ADD instruction):

- "Is this necessary? I mean, did you see an issue that adding this line avoids?"
- "For the motivation: https://stackoverflow.com/questions/36996046/how-to-prevent-dockerfile-caching-git-clone. We do not want docker to cache the following `git clone`."
- "I am not familiar with this part. Is it relevant only if we build the image on the same machine multiple times?"
- "I had the issue locally when using this docker image, where I would not pick up the latest transformers commit due to the docker cache. I think it is useful to make sure we use the latest commit, even though in the CI this is not an issue."
- "OK. So the issue occurs when you run the docker image, not at the time of the docker build, right?"
- "If this is more about when we use the image, I think it is expected. These images are not built for long-term usage: they are re-built on a daily basis to run the CI at a specific, shared commit. Anyone who wants to use the image locally is responsible for doing … If we accept this change (and IIUC), it means that on the CI (GitHub Actions) each job may get a different commit to run the tests against, which is not what we want. The above is just to explain the current behavior (before this change), not to say we have an issue, as on the CI we have … so we are safe."
- "No. Otherwise, the docker-cached intermediate layer is used and we may end up with an outdated commit compared to the latest commit available at build time."
- "OK, so during build time. Thanks for the detailed explanation."
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF

RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing,video]

RUN python3 -m pip uninstall -y tensorflow flax
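To make the cache-invalidation trick discussed in the thread above concrete: the ADD instruction downloads the ref JSON for main at build time, so the downloaded content (and therefore the layer checksum) changes whenever main gets a new commit, forcing every later layer, including the git clone, to be rebuilt. Roughly, docker fetches something like this (illustration only, not part of the PR):

```python
# Illustration only: fetch the same JSON the ADD instruction downloads. A new
# commit on main yields a new SHA, hence a changed version.json layer, which
# busts the Docker build cache from that point onward.
import json
import urllib.request

URL = "https://api.github.com/repos/huggingface/transformers/git/refs/heads/main"

with urllib.request.urlopen(URL) as response:
    ref = json.load(response)

print(ref["object"]["sha"])  # the current head commit of main
```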
New Dockerfile (the ROCm + DeepSpeed image used by the job above):

@@ -0,0 +1,45 @@
FROM rocm/dev-ubuntu-22.04:5.6
LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive
ARG PYTORCH='2.1.1'
ARG TORCH_VISION='0.16.1'
ARG TORCH_AUDIO='2.1.1'
ARG ROCM='5.6'

RUN apt update && \
    apt install -y --no-install-recommends \
    libaio-dev \
    git \
    # These are required to build deepspeed.
    python3-dev \
    python-is-python3 \
    rocrand-dev \
    rocthrust-dev \
    hipsparse-dev \
    hipblas-dev \
    rocblas-dev && \
    apt clean && \
    rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --no-cache-dir --upgrade pip ninja "pydantic<2"
RUN python3 -m pip uninstall -y apex torch torchvision torchaudio
RUN python3 -m pip install torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO --index-url https://download.pytorch.org/whl/rocm$ROCM --no-cache-dir

# Pre-build DeepSpeed so it is ready for testing (to avoid timeouts).
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache-dir -v --disable-pip-version-check 2>&1

ARG REF=main
WORKDIR /

# Invalidate docker cache from here if new commit is available.
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF

RUN python3 -m pip install --no-cache-dir ./transformers[accelerate,testing,sentencepiece,sklearn]

# When installing in editable mode, `transformers` is not recognized as a package.
# This line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop

RUN python3 -c "from deepspeed.launcher.runner import main"
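The final RUN only verifies that the DeepSpeed launcher imports. A slightly broader sanity check, sketched below under the assumption that the DS_BUILD_CPU_ADAM / DS_BUILD_FUSED_ADAM flags pre-built those two ops (this is not something the PR adds), would load the ops explicitly so a broken pre-build surfaces at image build time rather than in the first test:

```python
# Hypothetical check, not part of the image: load the ops that the
# DS_BUILD_CPU_ADAM=1 / DS_BUILD_FUSED_ADAM=1 flags are expected to pre-build.
# With a working pre-build this returns immediately; otherwise DeepSpeed would
# fall back to JIT-compiling the extension here.
from deepspeed.ops.op_builder import CPUAdamBuilder, FusedAdamBuilder

for builder in (CPUAdamBuilder(), FusedAdamBuilder()):
    module = builder.load()
    print(f"{builder.name}: loaded {module.__name__}")
```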
DeepSpeed test suite:

@@ -561,8 +561,8 @@ def test_gradient_accumulation(self, stage, dtype):
        self.assertAlmostEqual(no_grad_accum_a, yes_grad_accum_a, places=5)
        self.assertAlmostEqual(no_grad_accum_b, yes_grad_accum_b, places=5)

-        # see the note above how to get identical loss on a small bs
-        self.assertAlmostEqual(no_grad_accum_loss, yes_grad_accum_loss, places=2)
+        # Relative difference. See the note above how to get identical loss on a small bs
+        self.assertTrue((no_grad_accum_loss - yes_grad_accum_loss) / (no_grad_accum_loss + 1e-15) <= 1e-3)

    def check_saved_checkpoints_deepspeed(self, output_dir, freq, total, stage, dtype):
        # adapted from TrainerIntegrationCommon.check_saved_checkpoints
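The change swaps an absolute tolerance (assertAlmostEqual to 2 decimal places, i.e. an absolute difference below 0.005) for a relative one, bounding the gradient-accumulation loss mismatch to roughly 0.1% of the reference loss. A standalone sketch of the difference, with made-up loss values that are not taken from the test:

```python
# Made-up numbers, for illustration only: compare the old absolute check with
# the new relative one from the diff above.
import math

no_grad_accum_loss = 2.3471
yes_grad_accum_loss = 2.3468

# Old check: assertAlmostEqual(..., places=2) passes iff round(diff, 2) == 0,
# i.e. the absolute difference is below 0.005 regardless of the loss magnitude.
assert round(no_grad_accum_loss - yes_grad_accum_loss, 2) == 0

# New check: the difference must stay within 0.1% of the reference loss
# (the 1e-15 term only guards against division by a zero loss).
assert (no_grad_accum_loss - yes_grad_accum_loss) / (no_grad_accum_loss + 1e-15) <= 1e-3

# A symmetric variant of the same relative bound, using the standard library:
assert math.isclose(no_grad_accum_loss, yes_grad_accum_loss, rel_tol=1e-3)
```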
Review thread on building the new image:

- "Also disabled, as in transformers/.github/workflows/build-docker-images.yml (line 211 in 510270a)."
- "@ydshieh is there something we need to do here?"
- "This is another issue that we can deal with outside this PR."
- "But here we have to build the image manually."
- "Just added one manually so that we can verify the deepspeed tests: echarlaix/amd-deepspeed-test"