Bugfix: nvbandwidth benchmark needs to handle N/A values #675

Open
wants to merge 12 commits into main

Conversation

@polarG (Contributor) commented Dec 2, 2024

Description

  1. Fixed the bug where the nvbandwidth benchmark did not handle 'N/A' values in the nvbandwidth command output (a parsing sketch follows this list).
  2. Replaced the test case input format with a list.
  3. Added an nvbandwidth configuration example to the default config files.
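
As context for item 1, a minimal parsing sketch that skips 'N/A' entries, assuming the benchmark reads a bandwidth matrix whose first line holds the column indices; the function name and input layout are illustrative, not the PR's actual implementation:

```python
def parse_bandwidth_matrix(matrix_lines):
    """Parse an nvbandwidth bandwidth matrix, skipping 'N/A' entries.

    matrix_lines: the matrix portion of the nvbandwidth output, where the
    first line lists the column indices and each following line is one row.
    Returns a dict mapping (row, col) -> bandwidth in GB/s.
    """
    results = {}
    col_indices = matrix_lines[0].split()
    for row_line in matrix_lines[1:]:
        tokens = row_line.split()
        row_idx = tokens[0]
        for col_idx, value in zip(col_indices, tokens[1:]):
            if value == 'N/A':
                # e.g. the diagonal of a peer-to-peer matrix; float('N/A')
                # would raise ValueError, so skip the entry instead.
                continue
            results[(int(row_idx), int(col_idx))] = float(value)
    return results
```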

polarG added the bug (Something isn't working) and configuration (Benchmark configurations) labels on Dec 2, 2024
polarG requested a review from a team as a code owner on December 2, 2024 06:21

codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 70.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 85.45%. Comparing base (249e21c) to head (bd6aab2).

Files with missing lines | Patch % | Lines
...erbench/benchmarks/micro_benchmarks/nvbandwidth.py | 70.00% | 18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #675      +/-   ##
==========================================
- Coverage   85.61%   85.45%   -0.16%     
==========================================
  Files          99       99              
  Lines        7165     7210      +45     
==========================================
+ Hits         6134     6161      +27     
- Misses       1031     1049      +18     
Flag | Coverage | Δ
cpu-python3.10-unit-test | 71.14% <70.00%> (-0.70%) ⬇️
cpu-python3.7-unit-test | 71.11% <69.49%> (-0.70%) ⬇️
cpu-python3.8-unit-test | 71.15% <70.00%> (-0.68%) ⬇️
cuda-unit-test | 83.27% <70.00%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.


@dpower4 (Contributor) commented Dec 5, 2024

@polarG, do we also handle the case where we run tests that are not valid for the underlying system? Such tests are reported in the output as "waived".
For example, running device_to_device_memcpy_read_sm on a single-GPU machine gives:

nvidia@localhost:/home/nvidia/nvbandwidth$ ./nvbandwidth -t 18
nvbandwidth Version: v0.5
Built from Git version: 

NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.

CUDA Runtime Version: 12040
CUDA Driver Version: 12040
Driver Version: 550.54.15

Device 0: NVIDIA GH200 480GB

Waived: 
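
The output above ends with an empty "Waived:" section and no bandwidth matrix. A minimal detection sketch, assuming the raw output is available as a single string; the helper name is illustrative and not part of this PR:

```python
def has_waived_tests(raw_output):
    """Return True if the nvbandwidth output reports waived test cases.

    Waived tests (e.g. peer-to-peer tests on a single-GPU machine) print a
    'Waived:' line instead of a result matrix, so there are no numbers to parse.
    """
    return any(line.strip().startswith('Waived:') for line in raw_output.splitlines())
```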

@abuccts (Member) left a comment

Please add test cases for the "N/A" and "Waived" cases.

'Specify the test case(s) to run, either by name or index. By default, all test cases are executed. '
'Example: --test_cases 0,1,2,19,20'
'Specify the test case(s) to execute, either by name or index. '
'To view the available test case names or indices, run the command nvbandwidth on the host. '
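
The strings above are the help text for the --test_cases argument discussed in this review. A minimal argparse sketch of a list-style argument, assuming names or indices are passed as a space-separated list; the exact flag handling in the PR may differ:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--test_cases',
    type=str,
    nargs='*',          # accept a list of test case names or indices
    default=[],
    required=False,
    help=(
        'Specify the test case(s) to execute, either by name or index. '
        'To view the available test case names or indices, run the command nvbandwidth on the host. '
        'By default, all test cases are executed.'
    ),
)

# e.g. --test_cases host_to_device_memcpy_ce device_to_host_memcpy_ce
args = parser.parse_args(['--test_cases', 'host_to_device_memcpy_ce', 'device_to_host_memcpy_ce'])
```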
Member:
Can you provide the names directly, or use a link to the nvbandwidth documentation instead?

Contributor Author:
It seems that the test cases are not listed in the doc.

Contributor:

./nvbandwidth -l
nvbandwidth Version: v0.6
Built from Git version: v0.6

Index, Name:
        Description
=======================
0, host_to_device_memcpy_ce:
        Host to device CE memcpy using cuMemcpyAsync

1, device_to_host_memcpy_ce:
        Device to host CE memcpy using cuMemcpyAsync

2, host_to_device_bidirectional_memcpy_ce:
        A host to device copy is measured while a device to host copy is run simultaneously.
        Only the host to device copy bandwidth is reported.

3, device_to_host_bidirectional_memcpy_ce:
        A device to host copy is measured while a host to device copy is run simultaneously.
        Only the device to host copy bandwidth is reported.

4, device_to_device_memcpy_read_ce:
        Measures bandwidth of cuMemcpyAsync between each pair of accessible peers.
        Read tests launch a copy from the peer device to the target using the target's context.

5, device_to_device_memcpy_write_ce:
        Measures bandwidth of cuMemcpyAsync between each pair of accessible peers.
        Write tests launch a copy from the target device to the peer using the target's context.

6, device_to_device_bidirectional_memcpy_read_ce:
        Measures bandwidth of cuMemcpyAsync between each pair of accessible peers.
        A copy in the opposite direction of the measured copy is run simultaneously but not measured.
        Read tests launch a copy from the peer device to the target using the target's context.

7, device_to_device_bidirectional_memcpy_write_ce:
        Measures bandwidth of cuMemcpyAsync between each pair of accessible peers.
        A copy in the opposite direction of the measured copy is run simultaneously but not measured.
        Write tests launch a copy from the target device to the peer using the target's context.

8, all_to_host_memcpy_ce:
        Measures bandwidth of cuMemcpyAsync between a single device and the host while simultaneously
        running copies from all other devices to the host.

9, all_to_host_bidirectional_memcpy_ce:
        A device to host copy is measured while a host to device copy is run simultaneously.
        Only the device to host copy bandwidth is reported.
        All other devices generate simultaneous host to device and device to host interferring traffic.

10, host_to_all_memcpy_ce:
        Measures bandwidth of cuMemcpyAsync between the host to a single device while simultaneously
        running copies from the host to all other devices.

11, host_to_all_bidirectional_memcpy_ce:
        A host to device copy is measured while a device to host copy is run simultaneously.
        Only the host to device copy bandwidth is reported.
        All other devices generate simultaneous host to device and device to host interferring traffic.

12, all_to_one_write_ce:
        Measures the total bandwidth of copies from all accessible peers to a single device, for each
        device. Bandwidth is reported as the total inbound bandwidth for each device.
        Write tests launch a copy from the target device to the peer using the target's context.

13, all_to_one_read_ce:
        Measures the total bandwidth of copies from all accessible peers to a single device, for each
        device. Bandwidth is reported as the total outbound bandwidth for each device.
        Read tests launch a copy from the peer device to the target using the target's context.

14, one_to_all_write_ce:
        Measures the total bandwidth of copies from a single device to all accessible peers, for each
        device. Bandwidth is reported as the total outbound bandwidth for each device.
        Write tests launch a copy from the target device to the peer using the target's context.

15, one_to_all_read_ce:
        Measures the total bandwidth of copies from a single device to all accessible peers, for each
        device. Bandwidth is reported as the total inbound bandwidth for each device.
        Read tests launch a copy from the peer device to the target using the target's context.

16, host_to_device_memcpy_sm:
        Host to device SM memcpy using a copy kernel

17, device_to_host_memcpy_sm:
        Device to host SM memcpy using a copy kernel

18, device_to_device_memcpy_read_sm:
        Measures bandwidth of a copy kernel between each pair of accessible peers.
        Read tests launch a copy from the peer device to the target using the target's context.

19, device_to_device_memcpy_write_sm:
        Measures bandwidth of a copy kernel between each pair of accessible peers.
        Write tests launch a copy from the target device to the peer using the target's context.

20, device_to_device_bidirectional_memcpy_read_sm:
        Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run
        in both directions between each pair, and the sum is reported.
        Read tests launch a copy from the peer device to the target using the target's context.

21, device_to_device_bidirectional_memcpy_write_sm:
        Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run
        in both directions between each pair, and the sum is reported.
        Write tests launch a copy from the target device to the peer using the target's context.

22, all_to_host_memcpy_sm:
        Measures bandwidth of a copy kernel between a single device and the host while simultaneously
        running copies from all other devices to the host.

23, all_to_host_bidirectional_memcpy_sm:
        A device to host bandwidth of a copy kernel is measured while a host to device copy is run simultaneously.
        Only the device to host copy bandwidth is reported.
        All other devices generate simultaneous host to device and device to host interferring traffic using copy kernels.

24, host_to_all_memcpy_sm:
        Measures bandwidth of a copy kernel between the host to a single device while simultaneously
        running copies from the host to all other devices.

25, host_to_all_bidirectional_memcpy_sm:
        A host to device bandwidth of a copy kernel is measured while a device to host copy is run simultaneously.
        Only the host to device copy bandwidth is reported.
        All other devices generate simultaneous host to device and device to host interferring traffic using copy kernels.

26, all_to_one_write_sm:
        Measures the total bandwidth of copies from all accessible peers to a single device, for each
        device. Bandwidth is reported as the total inbound bandwidth for each device.
        Write tests launch a copy from the target device to the peer using the target's context.

27, all_to_one_read_sm:
        Measures the total bandwidth of copies from all accessible peers to a single device, for each
        device. Bandwidth is reported as the total outbound bandwidth for each device.
        Read tests launch a copy from the peer device to the target using the target's context.

28, one_to_all_write_sm:
        Measures the total bandwidth of copies from a single device to all accessible peers, for each
        device. Bandwidth is reported as the total outbound bandwidth for each device.
        Write tests launch a copy from the target device to the peer using the target's context.

29, one_to_all_read_sm:
        Measures the total bandwidth of copies from a single device to all accessible peers, for each
        device. Bandwidth is reported as the total inbound bandwidth for each device.
        Read tests launch a copy from the peer device to the target using the target's context.

30, host_device_latency_sm:
        Host - device SM copy latency using a ptr chase kernel

31, device_to_device_latency_sm:
        Measures latency of a pointer derefernce operation between each pair of accessible peers.
        Memory is allocated on a GPU and is accessed by the peer GPU to determine latency.

Contributor:
This is from v0.6; the list might change with the next version, though, which adds multinode test cases, e.g. multinode_device_to_device_...

Contributor Author:
Since the list might change in the future, selecting test cases by index is not reliable. I will let the benchmark accept test case names only.
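
A sketch of what name-based selection could look like when building the command line, assuming nvbandwidth's -t flag accepts test case names as stated in the help text above; the helper name is illustrative:

```python
def build_nvbandwidth_command(bin_path, test_cases):
    """Build the nvbandwidth command line from a list of test case names.

    Selecting by name (e.g. 'device_to_device_memcpy_read_sm') keeps configs
    stable even if a future nvbandwidth release reorders or adds test indices,
    such as the multinode_* cases mentioned above.
    """
    command = [bin_path]
    if test_cases:
        command += ['-t'] + list(test_cases)
    return ' '.join(command)

# Example: build_nvbandwidth_command('./nvbandwidth', ['device_to_device_memcpy_read_sm'])
# -> './nvbandwidth -t device_to_device_memcpy_read_sm'
```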

Resolved review threads (outdated): superbench/benchmarks/micro_benchmarks/nvbandwidth.py (×2), tests/benchmarks/micro_benchmarks/test_nvbandwidth.py
@polarG (Contributor Author) commented Dec 5, 2024

> (quoting @dpower4's comment above about waived test cases)

Good point! I will try to catch this in the code.
For the waived test cases, shall we show a negative value in the report, or just add a log entry containing the name/index? @abuccts @dpower4

@dpower4 (Contributor) commented Dec 6, 2024

> (quoting the waived-test output and @polarG's question above)

It's better to show the waived tests in the report, in line with how other failed benchmarks are treated.
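
A rough sketch of that reporting convention, using a hypothetical sentinel value and helper name; the PR may represent waived tests differently:

```python
# Hypothetical convention: give every waived test a sentinel value so it still
# appears in the report, similar to how failed benchmarks are surfaced.
WAIVED_SENTINEL = -1.0

def record_waived_tests(waived_test_names, results):
    """Add a sentinel entry for each waived test to the results dict."""
    for name in waived_test_names:
        results[f'{name}_bandwidth'] = WAIVED_SENTINEL
    return results
```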

Labels
bug (Something isn't working), configuration (Benchmark configurations)
Projects
None yet
4 participants