Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release - SuperBench v0.11.0 #653

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 8 additions & 6 deletions .github/workflows/build-image.yml
Original file line number Diff line number Diff line change
Expand Up @@ -40,16 +40,16 @@ jobs:
tags: superbench/main:cuda11.1.1,superbench/superbench:latest
runner: ubuntu-latest
build_args: "NUM_MAKE_JOBS=8"
- name: rocm5.7
dockerfile: rocm5.7.x
tags: superbench/main:rocm5.7
runner: [self-hosted, rocm-build]
build_args: "NUM_MAKE_JOBS=64"
- name: rocm6.0
dockerfile: rocm6.0.x
tags: superbench/main:rocm6.0
runner: [self-hosted, rocm-build]
build_args: "NUM_MAKE_JOBS=64"
build_args: "NUM_MAKE_JOBS=16"
- name: rocm6.2
dockerfile: rocm6.2.x
tags: superbench/main:rocm6.2
runner: [self-hosted, rocm-build]
build_args: "NUM_MAKE_JOBS=16"
steps:
- name: Checkout
uses: actions/checkout@v2
Expand All @@ -68,6 +68,8 @@ jobs:
else
echo "No Docker images found with the specified references."
fi
sudo docker ps -q | grep build | xargs -r sudo docker stop
echo y | sudo docker system prune -a --volumes
df -h
- name: Prepare metadata
id: metadata
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@

__SuperBench__ is a validation and profiling tool for AI infrastructure.

📢 [v0.10.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.10.0) has been released!
📢 [v0.11.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.11.0) has been released!

## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._

Expand Down
10 changes: 6 additions & 4 deletions dockerfile/cuda11.1.1.dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -83,11 +83,13 @@ RUN cd /tmp && \
rm -rf /tmp/MLNX_OFED_LINUX-${OFED_VERSION}*

# Install HPC-X
ENV HPCX_VERSION=v2.9.0
RUN cd /opt && \
wget -q https://azhpcstor.blob.core.windows.net/azhpc-images-store/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu20.04-x86_64.tbz && \
tar xf hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu20.04-x86_64.tbz && \
ln -s hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu20.04-x86_64 hpcx && \
rm hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu20.04-x86_64.tbz
rm -rf hpcx && \
wget -q https://content.mellanox.com/hpc/hpc-x/${HPCX_VERSION}/hpcx-${HPCX_VERSION}-gcc-inbox-ubuntu20.04-x86_64.tbz -O hpcx.tbz && \
tar xf hpcx.tbz && \
mv hpcx-${HPCX_VERSION}-gcc-inbox-ubuntu20.04-x86_64 hpcx && \
rm hpcx.tbz

# Install NCCL RDMA SHARP plugins
RUN cd /tmp && \
Expand Down
19 changes: 18 additions & 1 deletion dockerfile/cuda12.4.dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ FROM nvcr.io/nvidia/pytorch:24.03-py3
# - CUDA: 12.4.0
# - cuDNN: 9.0.0.306
# - cuBLAS: 12.4.2.65
# - NCCL: v2.20
# - NCCL: v2.23.4-1
# - TransformerEngine 1.4
# Mellanox:
# - OFED: 23.07-0.5.1.2
Expand Down Expand Up @@ -115,6 +115,23 @@ RUN cd /tmp && \
mv amd-blis /opt/AMD && \
rm -rf aocl-blis-linux-aocc-4.0.tar.gz

# Install NCCL 2.23.4
RUN cd /tmp && \
git clone -b v2.23.4-1 https://github.com/NVIDIA/nccl.git && \
cd nccl && \
make -j ${NUM_MAKE_JOBS} src.build && \
make install && \
rm -rf /tmp/nccl

# Install UCX v1.16.0 with multi-threading support
RUN cd /tmp && \
wget https://github.com/openucx/ucx/releases/download/v1.16.0/ucx-1.16.0.tar.gz && \
tar xzf ucx-1.16.0.tar.gz && \
cd ucx-1.16.0 && \
./contrib/configure-release-mt --prefix=/usr/local && \
make -j ${NUM_MAKE_JOBS} && \
make install

ENV PATH="${PATH}" \
LD_LIBRARY_PATH="/usr/local/lib:/usr/local/mpi/lib:${LD_LIBRARY_PATH}" \
SB_HOME=/opt/superbench \
Expand Down
5 changes: 5 additions & 0 deletions dockerfile/rocm6.0.x.dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,11 @@ RUN make RCCL_HOME=/opt/rccl/build/ ROCBLAS_BRANCH=release/rocm-rel-6.0 HIPBLASL
RUN cd third_party/Megatron/Megatron-DeepSpeed && \
git apply ../megatron_deepspeed_rocm6.patch

# Install AMD SMI Python Library
RUN apt install amd-smi-lib -y && \
cd /opt/rocm/share/amd_smi && \
python3 -m pip install .

ADD . .
ENV USE_HIP_DATATYPE=1
ENV USE_HIPBLAS_COMPUTETYPE=1
Expand Down
193 changes: 193 additions & 0 deletions dockerfile/rocm6.2.x.dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,193 @@
ARG BASE_IMAGE=rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0

FROM ${BASE_IMAGE}

# OS:
# - Ubuntu: 22.04
# - Docker Client: 20.10.8
# ROCm:
# - ROCm: 6.2
# Lib:
# - torch: 2.3.0
# - rccl: 2.18.3+hip6.0 develop:7e1cbb4
# - hipblaslt: release-staging/rocm-rel-6.2
# - rocblas: release-staging/rocm-rel-6.2
# - openmpi: 4.1.x
# Intel:
# - mlc: v3.11

LABEL maintainer="SuperBench"

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get -q install -y --no-install-recommends \
autoconf \
automake \
bc \
build-essential \
curl \
dmidecode \
git \
hipify-clang \
iproute2 \
jq \
libaio-dev \
libboost-program-options-dev \
libcap2 \
libcurl4-openssl-dev \
libnuma-dev \
libpci-dev \
libssl-dev \
libtinfo5 \
libtool \
lshw \
net-tools \
numactl \
openssh-client \
openssh-server \
pciutils \
python3-mpi4py \
rsync \
sudo \
util-linux \
vim \
wget \
&& \
rm -rf /tmp/*

ARG NUM_MAKE_JOBS=64

# Check if CMake is installed and its version
RUN cmake_version=$(cmake --version 2>/dev/null | grep -oP "(?<=cmake version )(\d+\.\d+)" || echo "0.0") && \
required_version="3.24.1" && \
if [ "$(printf "%s\n" "$required_version" "$cmake_version" | sort -V | head -n 1)" != "$required_version" ]; then \
echo "existing cmake version is ${cmake_version}" && \
cd /tmp && \
wget -q https://github.com/Kitware/CMake/releases/download/v${required_version}/cmake-${required_version}.tar.gz && \
tar xzf cmake-${required_version}.tar.gz && \
cd cmake-${required_version} && \
./bootstrap --prefix=/usr --no-system-curl --parallel=16 && \
make -j ${NUM_MAKE_JOBS} && \
make install && \
rm -rf /tmp/cmake-${required_version}* \
else \
echo "CMake version is greater than or equal to 3.24.1"; \
fi

# Install Docker
ENV DOCKER_VERSION=20.10.8
RUN cd /tmp && \
wget -q https://download.docker.com/linux/static/stable/x86_64/docker-${DOCKER_VERSION}.tgz -O docker.tgz && \
tar --extract --file docker.tgz --strip-components 1 --directory /usr/local/bin/ && \
rm docker.tgz

# Update system config
RUN mkdir -p /root/.ssh && \
touch /root/.ssh/authorized_keys && \
mkdir -p /var/run/sshd && \
sed -i "s/[# ]*PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config && \
sed -i "s/[# ]*PermitUserEnvironment no/PermitUserEnvironment yes/" /etc/ssh/sshd_config && \
sed -i "s/[# ]*Port.*/Port 22/" /etc/ssh/sshd_config && \
echo "* soft nofile 1048576\n* hard nofile 1048576" >> /etc/security/limits.conf && \
echo "root soft nofile 1048576\nroot hard nofile 1048576" >> /etc/security/limits.conf


# Get Ubuntu version and set as an environment variable
RUN export UBUNTU_VERSION=$(lsb_release -r -s)
RUN echo "Ubuntu version: $UBUNTU_VERSION"
ENV UBUNTU_VERSION=${UBUNTU_VERSION}

# Install OFED
ENV OFED_VERSION=5.9-0.5.6.0
# Check if ofed_info is present and has a version
RUN if ! command -v ofed_info >/dev/null 2>&1; then \
echo "OFED not found. Installing OFED..."; \
cd /tmp && \
wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \
tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \
PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \
rm -rf MLNX_OFED_LINUX-${OFED_VERSION}* ; \
fi

ENV ROCM_PATH=/opt/rocm

# Install OpenMPI
ENV OPENMPI_VERSION=4.1.x
ENV MPI_HOME=/usr/local/mpi
# Check if Open MPI is installed
RUN cd /tmp && \
git clone --recursive https://github.com/open-mpi/ompi.git -b v${OPENMPI_VERSION} && \
cd ompi && \
./autogen.pl && \
mkdir build && \
cd build && \
../configure --prefix=/usr/local/mpi --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --enable-prte-prefix-by-default --with-rocm=/opt/rocm && \
make -j $(nproc) && \
make -j $(nproc) install && \
ldconfig && \
cd / && \
rm -rf /tmp/openmpi-${OPENMPI_VERSION}*

# Install Intel MLC
RUN cd /tmp && \
wget -q https://downloadmirror.intel.com/763324/mlc_v3.10.tgz -O mlc.tgz && \
tar xzf mlc.tgz Linux/mlc && \
cp ./Linux/mlc /usr/local/bin/ && \
rm -rf ./Linux mlc.tgz

# Install RCCL
RUN cd /opt/ && \
git clone -b release/rocm-rel-6.2 https://github.com/ROCmSoftwarePlatform/rccl.git && \
cd rccl && \
mkdir build && \
cd build && \
CXX=/opt/rocm/bin/hipcc cmake -DHIP_COMPILER=clang -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=1 \
-DCMAKE_PREFIX_PATH="${ROCM_PATH}/hsa;${ROCM_PATH}/hip;${ROCM_PATH}/share/rocm/cmake/;${ROCM_PATH}" \
.. && \
make -j${NUM_MAKE_JOBS}

# Install AMD SMI Python Library
RUN apt install amd-smi-lib -y && \
cd /opt/rocm/share/amd_smi && \
python3 -m pip install .

ENV PATH="/usr/local/mpi/bin:/opt/superbench/bin:/usr/local/bin/:/opt/rocm/hip/bin/:/opt/rocm/bin/:${PATH}" \
LD_PRELOAD="/opt/rccl/build/librccl.so:$LD_PRELOAD" \
LD_LIBRARY_PATH="/usr/local/mpi/lib:/usr/lib/x86_64-linux-gnu/:/usr/local/lib/:/opt/rocm/lib:${LD_LIBRARY_PATH}" \
SB_HOME=/opt/superbench \
SB_MICRO_PATH=/opt/superbench \
ANSIBLE_DEPRECATION_WARNINGS=FALSE \
ANSIBLE_COLLECTIONS_PATH=/usr/share/ansible/collections

RUN echo PATH="$PATH" > /etc/environment && \
echo LD_LIBRARY_PATH="$LD_LIBRARY_PATH" >> /etc/environment && \
echo SB_MICRO_PATH="$SB_MICRO_PATH" >> /etc/environment

RUN apt install rocm-cmake -y && \
python3 -m pip install --upgrade pip wheel setuptools==65.7

WORKDIR ${SB_HOME}

ADD third_party third_party
# Apply patch
RUN cd third_party/perftest && \
git apply ../perftest_rocm6.patch
RUN make RCCL_HOME=/opt/rccl/build/ ROCBLAS_BRANCH=release-staging/rocm-rel-6.2 HIPBLASLT_BRANCH=release-staging/rocm-rel-6.2 ROCM_VER=rocm-5.5.0 -C third_party rocm -o cpu_hpl -o cpu_stream -o megatron_lm
RUN cp -r /opt/superbench/third_party/hipBLASLt/build/release/hipblaslt-install/lib/* /opt/rocm/lib/ && \
cp -r /opt/superbench/third_party/hipBLASLt/build/release/hipblaslt-install/include/* /opt/rocm/include/
RUN cd third_party/Megatron/Megatron-DeepSpeed && \
git apply ../megatron_deepspeed_rocm6.patch

# Install transformer_engine
RUN git clone --recursive https://github.com/ROCm/TransformerEngine.git && \
cd TransformerEngine && \
export NVTE_FRAMEWORK=pytorch && \
pip install .

ADD . .
ENV USE_HIP_DATATYPE=1
ENV USE_HIPBLAS_COMPUTETYPE=1
RUN python3 -m pip install .[amdworker] && \
CXX=/opt/rocm/bin/hipcc make cppbuild && \
make postinstall

2 changes: 1 addition & 1 deletion docs/getting-started/installation.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
:::note Note
You should checkout corresponding tag to use release version, for example,

`git clone -b v0.10.0 https://github.com/microsoft/superbenchmark`
`git clone -b v0.11.0 https://github.com/microsoft/superbenchmark`
:::

```bash
Expand Down
2 changes: 1 addition & 1 deletion docs/getting-started/run-superbench.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
:::note Note
You should deploy corresponding Docker image to use release version, for example,

`sb deploy -f local.ini -i superbench/superbench:v0.10.0-cuda12.2`
`sb deploy -f local.ini -i superbench/superbench:v0.11.0-cuda12.4`

You should note that version of git repo only determines version of sb CLI, and not the sb container. You should define the container version even if you specified a release version for the git clone.

Expand Down
2 changes: 1 addition & 1 deletion docs/superbench-config.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ superbench:
<TabItem value='example'>

```yaml
version: v0.10
version: v0.11
superbench:
enable: benchmark_1
monitor:
Expand Down
6 changes: 6 additions & 0 deletions docs/user-tutorial/container-images.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,9 @@ available tags are listed below for all stable versions.

| Tag | Description |
|--------------------|-------------------------------------|
| v0.11.0-cuda12.4 | SuperBench v0.11.0 with CUDA 12.4 |
| v0.11.0-cuda12.2 | SuperBench v0.11.0 with CUDA 12.2 |
| v0.11.0-cuda11.1.1 | SuperBench v0.11.0 with CUDA 11.1.1 |
| v0.10.0-cuda12.2 | SuperBench v0.10.0 with CUDA 12.2 |
| v0.10.0-cuda11.1.1 | SuperBench v0.10.0 with CUDA 11.1.1 |
| v0.9.0-cuda12.1 | SuperBench v0.9.0 with CUDA 12.1 |
Expand All @@ -50,6 +53,9 @@ available tags are listed below for all stable versions.

| Tag | Description |
|-------------------------------|--------------------------------------------------|
| v0.11.0-rocm6.2 | SuperBench v0.11.0 with ROCm 6.2 |
| v0.11.0-rocm6.0 | SuperBench v0.11.0 with ROCm 6.0 |
| v0.10.0-rocm6.0 | SuperBench v0.10.0 with ROCm 6.0 |
| v0.10.0-rocm5.7 | SuperBench v0.10.0 with ROCm 5.7 |
| v0.9.0-rocm5.1.3 | SuperBench v0.9.0 with ROCm 5.1.3 |
| v0.9.0-rocm5.1.1 | SuperBench v0.9.0 with ROCm 5.1.1 |
Expand Down
6 changes: 3 additions & 3 deletions docs/user-tutorial/data-diagnosis.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ superbench:
example:
```yaml
# SuperBench rules
version: v0.10
version: v0.11
superbench:
rules:
failure-rule:
Expand All @@ -83,8 +83,8 @@ superbench:
criteria: lambda x:x>0.05
categories: KernelLaunch
metrics:
- kernel-launch/event_overhead:\d+
- kernel-launch/wall_overhead:\d+
- kernel-launch/event_time:\d+
- kernel-launch/wall_time:\d+
rule1:
# Rule 1: If H2D_Mem_BW or D2H_Mem_BW test suffers > 5% downgrade, label it as defective
function: variance
Expand Down
6 changes: 3 additions & 3 deletions docs/user-tutorial/result-summary.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ superbench:

```yaml title="Example"
# SuperBench rules
version: v0.10
version: v0.11
superbench:
rules:
kernel_launch:
Expand All @@ -70,8 +70,8 @@ superbench:
aggregate: True
categories: KernelLaunch
metrics:
- kernel-launch/event_overhead
- kernel-launch/wall_overhead
- kernel-launch/event_time
- kernel-launch/wall_time
nccl:
statistics: mean
categories: NCCL
Expand Down
Loading
Loading