Bugfix: nvbandwidth benchmark needs to handle N/A values #675

Open
wants to merge 12 commits into main

Conversation

@polarG (Contributor) commented Dec 2, 2024

Description

  1. Fixed the bug where the nvbandwidth benchmark did not handle 'N/A' values in the nvbandwidth command output (a parsing sketch follows this list).
  2. Replaced the test case input format with a list.
  3. Added an nvbandwidth configuration example to the default config files.
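
As context for item 1, a minimal parsing sketch that skips 'N/A' entries, assuming the benchmark reads a bandwidth matrix whose first line holds the column indices; the function name and input layout are illustrative, not the PR's actual implementation:

```python
def parse_bandwidth_matrix(matrix_lines):
    """Parse an nvbandwidth bandwidth matrix, skipping 'N/A' entries.

    matrix_lines: the matrix portion of the nvbandwidth output, where the
    first line lists the column indices and each following line is one row.
    Returns a dict mapping (row, col) -> bandwidth in GB/s.
    """
    results = {}
    col_indices = matrix_lines[0].split()
    for row_line in matrix_lines[1:]:
        tokens = row_line.split()
        row_idx = tokens[0]
        for col_idx, value in zip(col_indices, tokens[1:]):
            if value == 'N/A':
                # e.g. the diagonal of a peer-to-peer matrix; float('N/A')
                # would raise ValueError, so skip the entry instead.
                continue
            results[(int(row_idx), int(col_idx))] = float(value)
    return results
```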

polarG added the bug (Something isn't working) and configuration (Benchmark configurations) labels on Dec 2, 2024
polarG requested a review from a team as a code owner on December 2, 2024 06:21

codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 70.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 85.45%. Comparing base (249e21c) to head (bd6aab2).

Files with missing lines | Patch % | Lines
...erbench/benchmarks/micro_benchmarks/nvbandwidth.py | 70.00% | 18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #675      +/-   ##
==========================================
- Coverage   85.61%   85.45%   -0.16%     
==========================================
  Files          99       99              
  Lines        7165     7210      +45     
==========================================
+ Hits         6134     6161      +27     
- Misses       1031     1049      +18     
Flag | Coverage | Δ
cpu-python3.10-unit-test | 71.14% <70.00%> (-0.70%) ⬇️
cpu-python3.7-unit-test | 71.11% <69.49%> (-0.70%) ⬇️
cpu-python3.8-unit-test | 71.15% <70.00%> (-0.68%) ⬇️
cuda-unit-test | 83.27% <70.00%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.


@dpower4 (Contributor) commented Dec 5, 2024

@polarG, do we also handle the case where we run tests that are not valid for the underlying system? Such tests are reported in the output as "waived".
For example, running device_to_device_memcpy_read_sm on a single-GPU machine gives:

nvidia@localhost:/home/nvidia/nvbandwidth$ ./nvbandwidth -t 18
nvbandwidth Version: v0.5
Built from Git version: 

NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.

CUDA Runtime Version: 12040
CUDA Driver Version: 12040
Driver Version: 550.54.15

Device 0: NVIDIA GH200 480GB

Waived: 
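
The output above ends with an empty "Waived:" section and no bandwidth matrix. A minimal detection sketch, assuming the raw output is available as a single string; the helper name is illustrative and not part of this PR:

```python
def has_waived_tests(raw_output):
    """Return True if the nvbandwidth output reports waived test cases.

    Waived tests (e.g. peer-to-peer tests on a single-GPU machine) print a
    'Waived:' line instead of a result matrix, so there are no numbers to parse.
    """
    return any(line.strip().startswith('Waived:') for line in raw_output.splitlines())
```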

@abuccts (Member) left a comment

Please add test cases for the "N/A" and "Waived" cases.

'Specify the test case(s) to run, either by name or index. By default, all test cases are executed. '
'Example: --test_cases 0,1,2,19,20'
'Specify the test case(s) to execute, either by name or index. '
'To view the available test case names or indices, run the command nvbandwidth on the host. '
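
The strings above are the help text for the --test_cases argument discussed in this review. A minimal argparse sketch of a list-style argument, assuming names or indices are passed as a space-separated list; the exact flag handling in the PR may differ:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--test_cases',
    type=str,
    nargs='*',          # accept a list of test case names or indices
    default=[],
    required=False,
    help=(
        'Specify the test case(s) to execute, either by name or index. '
        'To view the available test case names or indices, run the command nvbandwidth on the host. '
        'By default, all test cases are executed.'
    ),
)

# e.g. --test_cases host_to_device_memcpy_ce device_to_host_memcpy_ce
args = parser.parse_args(['--test_cases', 'host_to_device_memcpy_ce', 'device_to_host_memcpy_ce'])
```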
Member:
Can you provide the names directly, or use a link to the nvbandwidth documentation instead?

Contributor Author:
It seems that the test cases are not listed in the doc.

Contributor:

./nvbandwidth -l
nvbandwidth Version: v0.6
Built from Git version: v0.6

Index, Name:
        Description
=======================
0, host_to_device_memcpy_ce:
        Host to device CE memcpy using cuMemcpyAsync

1, device_to_host_memcpy_ce:
        Device to host CE memcpy using cuMemcpyAsync

2, host_to_device_bidirectional_memcpy_ce:
        A host to device copy is measured while a device to host copy is run simultaneously.
        Only the host to device copy bandwidth is reported.

3, device_to_host_bidirectional_memcpy_ce:
        A device to host copy is measured while a host to device copy is run simultaneously.
        Only the device to host copy bandwidth is reported.

4, device_to_device_memcpy_read_ce:
        Measures bandwidth of cuMemcpyAsync between each pair of accessible peers.
        Read tests launch a copy from the peer device to the target using the target's context.

5, device_to_device_memcpy_write_ce:
        Measures bandwidth of cuMemcpyAsync between each pair of accessible peers.
        Write tests launch a copy from the target device to the peer using the target's context.

6, device_to_device_bidirectional_memcpy_read_ce:
        Measures bandwidth of cuMemcpyAsync between each pair of accessible peers.
        A copy in the opposite direction of the measured copy is run simultaneously but not measured.
        Read tests launch a copy from the peer device to the target using the target's context.

7, device_to_device_bidirectional_memcpy_write_ce:
        Measures bandwidth of cuMemcpyAsync between each pair of accessible peers.
        A copy in the opposite direction of the measured copy is run simultaneously but not measured.
        Write tests launch a copy from the target device to the peer using the target's context.

8, all_to_host_memcpy_ce:
        Measures bandwidth of cuMemcpyAsync between a single device and the host while simultaneously
        running copies from all other devices to the host.

9, all_to_host_bidirectional_memcpy_ce:
        A device to host copy is measured while a host to device copy is run simultaneously.
        Only the device to host copy bandwidth is reported.
        All other devices generate simultaneous host to device and device to host interferring traffic.

10, host_to_all_memcpy_ce:
        Measures bandwidth of cuMemcpyAsync between the host to a single device while simultaneously
        running copies from the host to all other devices.

11, host_to_all_bidirectional_memcpy_ce:
        A host to device copy is measured while a device to host copy is run simultaneously.
        Only the host to device copy bandwidth is reported.
        All other devices generate simultaneous host to device and device to host interferring traffic.

12, all_to_one_write_ce:
        Measures the total bandwidth of copies from all accessible peers to a single device, for each
        device. Bandwidth is reported as the total inbound bandwidth for each device.
        Write tests launch a copy from the target device to the peer using the target's context.

13, all_to_one_read_ce:
        Measures the total bandwidth of copies from all accessible peers to a single device, for each
        device. Bandwidth is reported as the total outbound bandwidth for each device.
        Read tests launch a copy from the peer device to the target using the target's context.

14, one_to_all_write_ce:
        Measures the total bandwidth of copies from a single device to all accessible peers, for each
        device. Bandwidth is reported as the total outbound bandwidth for each device.
        Write tests launch a copy from the target device to the peer using the target's context.

15, one_to_all_read_ce:
        Measures the total bandwidth of copies from a single device to all accessible peers, for each
        device. Bandwidth is reported as the total inbound bandwidth for each device.
        Read tests launch a copy from the peer device to the target using the target's context.

16, host_to_device_memcpy_sm:
        Host to device SM memcpy using a copy kernel

17, device_to_host_memcpy_sm:
        Device to host SM memcpy using a copy kernel

18, device_to_device_memcpy_read_sm:
        Measures bandwidth of a copy kernel between each pair of accessible peers.
        Read tests launch a copy from the peer device to the target using the target's context.

19, device_to_device_memcpy_write_sm:
        Measures bandwidth of a copy kernel between each pair of accessible peers.
        Write tests launch a copy from the target device to the peer using the target's context.

20, device_to_device_bidirectional_memcpy_read_sm:
        Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run
        in both directions between each pair, and the sum is reported.
        Read tests launch a copy from the peer device to the target using the target's context.

21, device_to_device_bidirectional_memcpy_write_sm:
        Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run
        in both directions between each pair, and the sum is reported.
        Write tests launch a copy from the target device to the peer using the target's context.

22, all_to_host_memcpy_sm:
        Measures bandwidth of a copy kernel between a single device and the host while simultaneously
        running copies from all other devices to the host.

23, all_to_host_bidirectional_memcpy_sm:
        A device to host bandwidth of a copy kernel is measured while a host to device copy is run simultaneously.
        Only the device to host copy bandwidth is reported.
        All other devices generate simultaneous host to device and device to host interferring traffic using copy kernels.

24, host_to_all_memcpy_sm:
        Measures bandwidth of a copy kernel between the host to a single device while simultaneously
        running copies from the host to all other devices.

25, host_to_all_bidirectional_memcpy_sm:
        A host to device bandwidth of a copy kernel is measured while a device to host copy is run simultaneously.
        Only the host to device copy bandwidth is reported.
        All other devices generate simultaneous host to device and device to host interferring traffic using copy kernels.

26, all_to_one_write_sm:
        Measures the total bandwidth of copies from all accessible peers to a single device, for each
        device. Bandwidth is reported as the total inbound bandwidth for each device.
        Write tests launch a copy from the target device to the peer using the target's context.

27, all_to_one_read_sm:
        Measures the total bandwidth of copies from all accessible peers to a single device, for each
        device. Bandwidth is reported as the total outbound bandwidth for each device.
        Read tests launch a copy from the peer device to the target using the target's context.

28, one_to_all_write_sm:
        Measures the total bandwidth of copies from a single device to all accessible peers, for each
        device. Bandwidth is reported as the total outbound bandwidth for each device.
        Write tests launch a copy from the target device to the peer using the target's context.

29, one_to_all_read_sm:
        Measures the total bandwidth of copies from a single device to all accessible peers, for each
        device. Bandwidth is reported as the total inbound bandwidth for each device.
        Read tests launch a copy from the peer device to the target using the target's context.

30, host_device_latency_sm:
        Host - device SM copy latency using a ptr chase kernel

31, device_to_device_latency_sm:
        Measures latency of a pointer derefernce operation between each pair of accessible peers.
        Memory is allocated on a GPU and is accessed by the peer GPU to determine latency.

Contributor:
This is from v0.6; the list might change with the next version, though, which adds multinode test cases, e.g. multinode_device_to_device_...

Contributor Author:
Since the list might change in the future, selecting test cases by index is not reliable. I will let the benchmark accept test case names only.
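
A sketch of what name-based selection could look like when building the command line, assuming nvbandwidth's -t flag accepts test case names as stated in the help text above; the helper name is illustrative:

```python
def build_nvbandwidth_command(bin_path, test_cases):
    """Build the nvbandwidth command line from a list of test case names.

    Selecting by name (e.g. 'device_to_device_memcpy_read_sm') keeps configs
    stable even if a future nvbandwidth release reorders or adds test indices,
    such as the multinode_* cases mentioned above.
    """
    command = [bin_path]
    if test_cases:
        command += ['-t'] + list(test_cases)
    return ' '.join(command)

# Example: build_nvbandwidth_command('./nvbandwidth', ['device_to_device_memcpy_read_sm'])
# -> './nvbandwidth -t device_to_device_memcpy_read_sm'
```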

Resolved review threads (outdated): superbench/benchmarks/micro_benchmarks/nvbandwidth.py (×2), tests/benchmarks/micro_benchmarks/test_nvbandwidth.py
@polarG (Contributor Author) commented Dec 5, 2024

> (quoting @dpower4's comment above about waived test cases)

Good point! I will try to catch this in the code.
For the waived test cases, shall we show a negative value in the report, or just add a log entry containing the name/index? @abuccts @dpower4

@dpower4 (Contributor) commented Dec 6, 2024

> (quoting the waived-test output and @polarG's question above)

It's better to show the waived tests in the report, in line with how other failed benchmarks are treated.
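
A rough sketch of that reporting convention, using a hypothetical sentinel value and helper name; the PR may represent waived tests differently:

```python
# Hypothetical convention: give every waived test a sentinel value so it still
# appears in the report, similar to how failed benchmarks are surfaced.
WAIVED_SENTINEL = -1.0

def record_waived_tests(waived_test_names, results):
    """Add a sentinel entry for each waived test to the results dict."""
    for name in waived_test_names:
        results[f'{name}_bandwidth'] = WAIVED_SENTINEL
    return results
```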

Labels
bug (Something isn't working), configuration (Benchmark configurations)
Projects
None yet
4 participants