Merge branch 'main' into fix/logging

microsoft · Nov 27, 2024 · 5c97c37 · 5c97c37
2 parents 3754dcc + 96f5cce
commit 5c97c37
Show file tree

Hide file tree

Showing 15 changed files with 594 additions and 16 deletions.
diff --git a/.azure-pipelines/ansible-integration-test.yml b/.azure-pipelines/ansible-integration-test.yml
@@ -7,6 +7,7 @@ trigger:
 
 pool:
   name: SuperBench CI
+  demands: ansible-agent
   vmImage: ubuntu-latest
 
 container:

diff --git a/.azure-pipelines/cpu-unit-test.yml b/.azure-pipelines/cpu-unit-test.yml
@@ -7,12 +7,12 @@ trigger:
 
 strategy:
   matrix:
-    python-3.6:
-      imageTag: '3.6'
     python-3.7:
       imageTag: '3.7'
     python-3.8:
       imageTag: '3.8'
+    python-3.10:
+      imageTag: '3.10'
     # TODO
     #python-latest:
     #  imageTag: '3'

diff --git a/.azure-pipelines/cuda-unit-test.yml b/.azure-pipelines/cuda-unit-test.yml
@@ -7,22 +7,26 @@ trigger:
 
 pool:
   name: SuperBench CI
+  demands: cuda-agent
   vmImage: ubuntu-latest
 
 container:
-  image: nvcr.io/nvidia/pytorch:20.12-py3
-  options: '-v /var/run/docker.sock:/var/run/docker.sock -v /usr/bin/docker:/usr/bin/docker -v /usr/bin/sudo:/usr/bin/sudo -v /usr/lib/sudo/:/usr/lib/sudo/'
+  image: nvcr.io/nvidia/pytorch:24.03-py3
+  options: '--name cuda-ci -v /var/run/docker.sock:/var/run/docker.sock -v /usr/bin/docker:/usr/bin/docker:ro'
 
 steps:
   - script: |
       echo "##vso[task.prependpath]$HOME/.local/bin"
     displayName: Export path
   - script: |
+      docker exec -t -u root -e DEBIAN_FRONTEND=noninteractive cuda-ci bash -c \
+        "apt-get update -y -q && \
+        yes '' | apt-get install -y -q sudo && \
+        apt-get install -y -q \
+        ffmpeg libavcodec-dev libavformat-dev libavutil-dev libboost-program-options-dev libswresample-dev"
       python3 -m pip install --upgrade pip setuptools==65.7
       python3 -m pip install .[test,nvworker]
       make postinstall
-      sudo DEBIAN_FRONTEND=noninteractive apt-get update
-      sudo DEBIAN_FRONTEND=noninteractive apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev
     displayName: Install dependencies
   - script: |
       python3 setup.py lint

diff --git a/.codecov.yml b/.codecov.yml
@@ -14,16 +14,18 @@ coverage:
         target: 80%
         threshold: 1%
         flags:
-          - cpu-python3.6-unit-test
           - cpu-python3.7-unit-test
+          - cpu-python3.8-unit-test
+          - cpu-python3.10-unit-test
           - cuda-unit-test
           - directx-unit-test
     patch:
       default:
         target: 80%
         threshold: 1%
         flags:
-          - cpu-python3.6-unit-test
           - cpu-python3.7-unit-test
+          - cpu-python3.8-unit-test
+          - cpu-python3.10-unit-test
           - cuda-unit-test
           - directx-unit-test
diff --git a/docs/getting-started/installation.mdx b/docs/getting-started/installation.mdx
@@ -26,7 +26,7 @@ Here're the system requirements for control node.
 ### Requirements
 
 * Latest version of Linux, you're highly encouraged to use Ubuntu 18.04 or later.
-* [Python](https://www.python.org/) version 3.6 or later (which can be checked by running `python3 --version`).
+* [Python](https://www.python.org/) version 3.7 or later (which can be checked by running `python3 --version`).
 * [Pip](https://pip.pypa.io/en/stable/installing/) version 18.0 or later (which can be checked by running `python3 -m pip --version`).
 
 :::note

diff --git a/docs/user-tutorial/benchmarks/micro-benchmarks.md b/docs/user-tutorial/benchmarks/micro-benchmarks.md
@@ -384,6 +384,82 @@ with topology distance of 2, 4, 6, respectively.
 | ib-traffic/ib\_write\_bw\_${msg_size}\_${direction}\_${line}\_${pair}:${server}\_${client}  | bandwidth (GB/s) | The max bandwidth of perftest (ib_write_bw, ib_send_bw, ib_read_bw) using ${msg_size} with ${direction}('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu') run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config, ${server} and ${client} are the hostname of server and client.  |
 | ib-traffic/ib\_write\_lat\_${msg_size}\_${direction}\_${line}\_${pair}:${server}\_${client} | time (us)        | The max latency of perftest (ib_write_lat, ib_send_lat, ib_read_lat) using ${msg_size} with ${direction}('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu') run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config, ${server} and ${client} are the hostname of server and client. |
 
+### `nvbandwidth`
+
+#### Introduction
+
+Measures bandwidth and latency for various memcpy patterns across different links using copy engine or kernel copy methods,
+performed by [nvbandwidth](https://github.com/NVIDIA/nvbandwidth)
+
+#### Metrics
+
+| Metrics                                                 | Unit                   | Description                                                                                                                                                                |
+|---------------------------------------------------------|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| host_to_device_memcpy_ce_cpu[0-9]_gpu[0-9]_bw      | GB/s                | Host to device CE memcpy using cuMemcpyAsync                      |
+| host_to_device_memcpy_ce_sum_bw                    | GB/s                | Sum of the output matrix                                           |
+| device_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw      | GB/s                | Device to host CE memcpy using cuMemcpyAsync                      |
+| device_to_host_memcpy_ce_sum_bw                    | GB/s                | Sum of the output matrix                                           |
+| host_to_device_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s      | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. |
+| host_to_device_bidirectional_memcpy_ce_sum_bw      | GB/s                | Sum of the output matrix                                           |
+| device_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s      | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. |
+| device_to_host_bidirectional_memcpy_ce_sum_bw      | GB/s                | Sum of the output matrix                                           |
+| device_to_device_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s               | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
+| device_to_device_memcpy_read_ce_sum_bw             | GB/s                | Sum of the output matrix                                           |
+| device_to_device_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s              | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
+| device_to_device_memcpy_write_ce_sum_bw            | GB/s                | Sum of the output matrix                                           |
+| device_to_device_bidirectional_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Read tests launch a copy from the peer device to the target using the target's context. |
+| device_to_device_bidirectional_memcpy_read_ce_sum_bw | GB/s               | Sum of the output matrix                                           |
+| device_to_device_bidirectional_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Write tests launch a copy from the target device to the peer using the target's context. |
+| device_to_device_bidirectional_memcpy_write_ce_sum_bw | GB/s               | Sum of the output matrix                                           |
+| all_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw         | GB/s                | Measures bandwidth of cuMemcpyAsync between a single device and the host while simultaneously running copies from all other devices to the host. |
+| all_to_host_memcpy_ce_sum_bw                       | GB/s                | Sum of the output matrix                                           |
+| all_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s              | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
+| all_to_host_bidirectional_memcpy_ce_sum_bw         | GB/s                | Sum of the output matrix                                           |
+| host_to_all_memcpy_ce_cpu[0-9]_gpu[0-9]_bw         | GB/s                | Measures bandwidth of cuMemcpyAsync between the host to a single device while simultaneously running copies from the host to all other devices. |
+| host_to_all_memcpy_ce_sum_bw                       | GB/s                | Sum of the output matrix                                           |
+| host_to_all_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s              | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
+| host_to_all_bidirectional_memcpy_ce_sum_bw         | GB/s                | Sum of the output matrix                                           |
+| all_to_one_write_ce_gpu[0-9]_gpu[0-9]_bw           | GB/s                | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
+| all_to_one_write_ce_sum_bw                         | GB/s                | Sum of the output matrix                                           |
+| all_to_one_read_ce_gpu[0-9]_gpu[0-9]_bw            | GB/s                | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
+| all_to_one_read_ce_sum_bw                          | GB/s                | Sum of the output matrix                                           |
+| one_to_all_write_ce_gpu[0-9]_gpu[0-9]_bw           | GB/s                | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
+| one_to_all_write_ce_sum_bw                         | GB/s                | Sum of the output matrix                                           |
+| one_to_all_read_ce_gpu[0-9]_gpu[0-9]_bw            | GB/s                | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
+| one_to_all_read_ce_sum_bw                          | GB/s                | Sum of the output matrix                                           |
+| host_to_device_memcpy_sm_cpu[0-9]_gpu[0-9]_bw      | GB/s                | Host to device SM memcpy using a copy kernel                      |
+| host_to_device_memcpy_sm_sum_bw                    | GB/s                | Sum of the output matrix                                           |
+| device_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw      | GB/s                | Device to host SM memcpy using a copy kernel                      |
+| device_to_host_memcpy_sm_sum_bw                    | GB/s                | Sum of the output matrix                                           |
+| device_to_device_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s               | Measures bandwidth of a copy kernel between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
+| device_to_device_memcpy_read_sm_sum_bw             | GB/s                | Sum of the output matrix                                           |
+| device_to_device_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s              | Measures bandwidth of a copy kernel between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
+| device_to_device_memcpy_write_sm_sum_bw            | GB/s                | Sum of the output matrix                                           |
+| device_to_device_bidirectional_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Read tests launch a copy from the peer device to the target using the target's context. |
+| device_to_device_bidirectional_memcpy_read_sm_sum_bw | GB/s               | Sum of the output matrix                                           |
+| device_to_device_bidirectional_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Write tests launch a copy from the target device to the peer using the target's context. |
+| device_to_device_bidirectional_memcpy_write_sm_sum_bw | GB/s               | Sum of the output matrix                                           |
+| all_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw         | GB/s                | Measures bandwidth of a copy kernel between a single device and the host while simultaneously running copies from all other devices to the host. |
+| all_to_host_memcpy_sm_sum_bw                       | GB/s                | Sum of the output matrix                                           |
+| all_to_host_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s              | A device to host bandwidth of a copy kernel is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
+| all_to_host_bidirectional_memcpy_sm_sum_bw         | GB/s                | Sum of the output matrix                                           |
+| host_to_all_memcpy_sm_cpu[0-9]_gpu[0-9]_bw         | GB/s                | Measures bandwidth of a copy kernel between the host to a single device while simultaneously running copies from the host to all other devices. |
+| host_to_all_memcpy_sm_sum_bw                       | GB/s                | Sum of the output matrix                                           |
+| host_to_all_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s              | A host to device bandwidth of a copy kernel is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
+| host_to_all_bidirectional_memcpy_sm_sum_bw         | GB/s                | Sum of the output matrix                                           |
+| all_to_one_write_sm_gpu[0-9]_gpu[0-9]_bw           | GB/s                | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
+| all_to_one_write_sm_sum_bw                         | GB/s                | Sum of the output matrix                                           |
+| all_to_one_read_sm_gpu[0-9]_gpu[0-9]_bw            | GB/s                | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
+| all_to_one_read_sm_sum_bw                          | GB/s                | Sum of the output matrix                                           |
+| one_to_all_write_sm_gpu[0-9]_gpu[0-9]_bw           | GB/s                | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
+| one_to_all_write_sm_sum_bw                         | GB/s                | Sum of the output matrix                                           |
+| one_to_all_read_sm_gpu[0-9]_gpu[0-9]_bw            | GB/s                | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
+| one_to_all_read_sm_sum_bw                          | GB/s                | Sum of the output matrix                                           |
+| host_device_latency_sm_cpu[0-9]_gpu[0-9]_lat       | µs                  | Host - device SM copy latency using a ptr chase kernel            |
+| host_device_latency_sm_sum_lat                     | µs                  | Sum of the output matrix                                           |
+| device_to_device_latency_sm_gpu[0-9]_gpu[0-9]_lat  | µs                  | Measures latency of a pointer dereference operation between each pair of accessible peers. Memory is allocated on a GPU and is accessed by the peer GPU to determine latency. |
+| device_to_device_latency_sm_sum_lat                | µs                  | Sum of the output matrix                                           |
+
 
 ## Computation-communication Benchmarks
 

diff --git a/examples/benchmarks/nvbandwidth.py b/examples/benchmarks/nvbandwidth.py
@@ -0,0 +1,33 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT license.
+
+"""Micro benchmark example for nvbandwidth benchmark.
+
+Commands to run:
+  python3 examples/benchmarks/nvbandwidth.py
+"""
+
+from superbench.benchmarks import BenchmarkRegistry, Platform
+from superbench.common.utils import logger
+
+if __name__ == '__main__':
+    context = BenchmarkRegistry.create_benchmark_context(
+        'nvbandwidth',
+        platform=Platform.CPU,
+        parameters=(
+            '--buffer_size 128 '
+            '--test_cases 0,1,19,20 '
+            '--skip_verification '
+            '--disable_affinity '
+            '--use_mean '
+            '--num_loops 10'
+        )
+    )
+
+    benchmark = BenchmarkRegistry.launch_benchmark(context)
+    if benchmark:
+        logger.info(
+            'benchmark: {}, return code: {}, result: {}'.format(
+                benchmark.name, benchmark.return_code, benchmark.result
+            )
+        )
diff --git a/setup.py b/setup.py
@@ -131,17 +131,17 @@ def run(self):
         'Operating System :: POSIX',
         'Programming Language :: Python :: 3',
         'Programming Language :: Python :: 3 :: Only',
-        'Programming Language :: Python :: 3.6',
         'Programming Language :: Python :: 3.7',
         'Programming Language :: Python :: 3.8',
         'Programming Language :: Python :: 3.9',
+        'Programming Language :: Python :: 3.10',
         'Topic :: System :: Benchmark',
         'Topic :: System :: Clustering',
         'Topic :: System :: Hardware',
     ],
     keywords='benchmark, AI systems',
     packages=find_packages(exclude=['tests']),
-    python_requires='>=3.6, <4',
+    python_requires='>=3.7, <4',
     use_scm_version={
         'local_scheme': 'node-and-date',
         'version_scheme': lambda _: superbench.__version__,

diff --git a/superbench/benchmarks/base.py b/superbench/benchmarks/base.py
@@ -48,6 +48,8 @@ def __init__(self, name, parameters=''):
             allow_abbrev=False,
             formatter_class=SortedMetavarTypeHelpFormatter,
         )
+        # Fix optionals title in Python 3.10
+        self._parser._optionals.title = 'optional arguments'
         self._args = None
         self._curr_run_index = 0
         self._result = None