Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First round of ML updates #413

Open
wants to merge 4 commits into
base: ml
Choose a base branch
from

Conversation

lipovsek-aws
Copy link

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Lets create `ami.yml` (see [DLAMI release notes](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html) to get AMI ARN (`ParentImage` in config bellow)):
```
Build:
SecurityGroupIds: [<insert you SG - in requires outbound traffic>]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you automate this?

Copy link
Author

@lipovsek-aws lipovsek-aws Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's automated with pcluster CLI, we could add cloudformation that sets up VPC, subnet, SG and runs pcluster CLI via bash runner (or even better trigger lambda to run it). I though to create short bash script with templating for config yaml file but I don't think it's worth the effort since it's not abstracting anything and adds boilerplate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% chance someone won't know how to find a security group and or even what it is. Provide the steps to create one and retrieve ID or provide the retrieve security group id step.

First let's fetch the assets required to build the image:

```bash
wget https://ml.hpcworkshops.com/scripts/packer/packer.tar.gz
Copy link
Collaborator

@mhuguesaws mhuguesaws Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain the content of the archive before proposing to download.
For sake of clarity, I suggest you make the reader download the 3 files separately. At least they can review the file on github before hands. That's also give an opportunity to reviewer to look at the content of the files that are of this workshop.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sean-smith you added this and I just moved it, any specific reason for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this anymore. This can just be:

git clone [email protected]:aws-samples/parallelcluster-efa-gpu-preflight-ami.git

You can install Packer using [Brew](https://brew.sh/) on OSX or Linux as follows:

```bash
brew install packer
Copy link
Collaborator

@mhuguesaws mhuguesaws Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standardize on Cloud9 or cloudshell..
If I have windows how do I do?

Provide a specific version to prevent regression in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll standardize on Cloud9. Cloudshell storage space is too limited. IMHO most ML devops don't need instructions on how to use the cli. This is different than HPC.

brew install packer
```

Alternatively, you can download the Packer binary through the [tool website](https://www.packer.io/). Ensure your `PATH` is set to use the binary or use its absolute path. Once Packer installed, proceed to the next stage.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many options.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll remove.

Copy link
Collaborator

@mhuguesaws mhuguesaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments


Now run [pcluster command](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.build-image-v3.html) that will add all pcluster dependencies to your DLAMI of choice:
```
pcluster build-image -c ami.yml -i NEW_AMI_ID -r REGION
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variables...

Suggested change
pcluster build-image -c ami.yml -i NEW_AMI_ID -r REGION
pcluster build-image -c ami.yml -i $NEW_AMI_ID -r $AWS_REGION

@@ -0,0 +1,10 @@
---
title: "b. Download, compile and run the NCCL tests"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
title: "b. Download, compile and run the NCCL tests"
title: "b. Run the NCCL tests"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

```bash
cd ~

cat > compile_nccl.sh << EOF
Copy link
Collaborator

@mhuguesaws mhuguesaws Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolute path

Suggested change
cat > compile_nccl.sh << EOF
cat > ~/compile_nccl.sh << EOF

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Create your job submission script for *OSU Latency* and use **sbatch** to submit your job:

```bash
cat > nccl_test.sbatch << \EOF
Copy link
Collaborator

@mhuguesaws mhuguesaws Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cat > nccl_test.sbatch << \EOF
cat > ~/nccl_test.sbatch << EOF

NCCL_TEST_PATH=${HOME}/nccl-tests/build
MPI_PATH=/opt/amazon/openmpi

export LD_LIBRARY_PATH=${HOME}/nccl/build/lib:${HOME}/aws-ofi-nccl/install/lib
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use module....

#SBATCH --output=nccl.out

NCCL_TEST_PATH=${HOME}/nccl-tests/build
MPI_PATH=/opt/amazon/openmpi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

module load...


git clone -b v2.17.1-1 https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE='-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope cuda version does not change...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's default "system" cuda - new AMIs will have new CUDA on the same path. I can add a note that if they have custom cuda they change this path, I assumed if someone is more advanced to add specific CUDA version they will be familiar with these parameters. But good point, I'll add the note.


git clone -b v2.13.6 https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 CUDA_HOME=/usr/local/cuda MPI_HOME=/opt/amazon/openmpi NCCL_HOME=${HOME}/nccl/build
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

module load

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's using OpenMPI form pcluster AMI, no need for IntelMPI to get correct performance.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

module load openmpi. That's what I said.

export NCCL_DEBUG=INFO
export FI_LOG_LEVEL=1

${MPI_PATH}/bin/mpirun --map-by ppr:8:node --rank-by slot \
Copy link
Collaborator

@mhuguesaws mhuguesaws Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One the openmpi is loaded no need to have path like this.

export NCCL_DEBUG=INFO
export FI_LOG_LEVEL=1

${MPI_PATH}/bin/mpirun --map-by ppr:8:node --rank-by slot \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
${MPI_PATH}/bin/mpirun --map-by ppr:8:node --rank-by slot \
${MPI_PATH}/bin/mpirun --map-by ppr:4:socket \

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EbsSettings:
VolumeType: gp3
Size: 200
Throughput: 300
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

tags : ["Huggingface", "data", "ML", "srun", "slurm"]
---

In this section, you will learn how to run script from Huggingface examples with PyTorch FSDP and DDP.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What am I running exactly? a Script? What does it do?

Copy link
Collaborator

@mhuguesaws mhuguesaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants