Software path not available for running benchmark. #45

Open
luminus7 opened this issue Jul 13, 2024 · 1 comment

luminus7 commented Jul 13, 2024

Hi :)

I'm new to this project and am trying to run some benchmarks provided by DML.

I've successfully configured DSA and run the hardware path.
However, the software path fails when I run the performance tests with the provided benchmark framework.

  1. I first checked that my server can run both the software and hardware paths.
    It meets the requirements for both, and the example tests finish successfully. I've included sample output below.
$ sudo ./build/examples/high-level-api/hl_mem_move_example software_path
Executing using dml::software path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Finished successfully.

$ sudo ./build/examples/high-level-api/hl_mem_move_example hardware_path
Executing using dml::hardware path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Finished successfully.
  2. The Benchmarking section of the documentation says that path:cpu is used for benchmarks running on the CPU.
    However, I cannot get any CPU-related results: every benchmark in the output uses path:dsa.
    Here is a sample run.
$ sudo ./build/bin/dml_benchmarks --benchmark_filter="copy/.*/exec:sync/.*/size:4096/.*" --benchmark_min_time=0.1
== Host:   XXX
== Kernel: 6.6.38-XXX
== CPU:    Intel(R) Xeon(R) Gold 6454S (143)
  --> Microcode: 0x2b0005c0
  --> Stepping:  8
  --> Logical:   64
  --> Physical:  64
  --> Socket:    32
  --> Cluster:   8
== Accelerators: 4
  --> NUMA 0: 2
  --> NUMA 1: 2
== Affinity Map [device_index: thread_index(cpu_index)...]:
  --> 0(32) 1(40) 2(48) 3(56) 4(33) 5(41) 6(49) 7(57)
2024-07-13T18:15:11+00:00
Running ./build/bin/dml_benchmarks
Run on (64 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 2048 KiB (x64)
  L3 Unified 61440 KiB (x2)
Load Average: 0.02, 0.05, 0.07
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                           Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
copy/api:c/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time          5393 ns         5394 ns        25908 Latency=5.39321us Latency/Op=337.076ns Throughput=12.1516G/s
copy/api:cpp/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time        7773 ns         7775 ns        20004 Latency=7.77254us Latency/Op=485.784ns Throughput=8.43174G/s
copy/api:c/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time          6148 ns         6150 ns        23922 Latency=6.14786us Latency/Op=384.241ns Throughput=10.66G/s
copy/api:cpp/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time        7457 ns         7460 ns        17897 Latency=7.45672us Latency/Op=466.045ns Throughput=8.78886G/s
copy/api:c/path:dsa/exec:sync/qsize:32/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time          9363 ns         9364 ns        14903 Latency=9.36282us Latency/Op=292.588ns Throughput=13.9992G/s
copy/api:cpp/path:dsa/exec:sync/qsize:32/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time       12516 ns        12526 ns        11455 Latency=12.5165us Latency/Op=391.14ns Throughput=10.4719G/s
copy/api:c/path:dsa/exec:sync/qsize:32/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time         10308 ns        10309 ns        13652 Latency=10.308us Latency/Op=322.126ns Throughput=12.7155G/s
copy/api:cpp/path:dsa/exec:sync/qsize:32/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time       12214 ns        12218 ns        11123 Latency=12.2144us Latency/Op=381.699ns Throughput=10.731G/s
copy/api:c/path:dsa/exec:sync/qsize:64/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time         18702 ns        18711 ns         7474 Latency=18.702us Latency/Op=292.219ns Throughput=14.0169G/s
copy/api:cpp/path:dsa/exec:sync/qsize:64/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time       23672 ns        23688 ns         5919 Latency=23.6724us Latency/Op=369.881ns Throughput=11.0738G/s
copy/api:c/path:dsa/exec:sync/qsize:64/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time         20185 ns        20202 ns         6897 Latency=20.1849us Latency/Op=315.389ns Throughput=12.9871G/s
copy/api:cpp/path:dsa/exec:sync/qsize:64/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time       22601 ns        22613 ns         5871 Latency=22.6012us Latency/Op=353.144ns Throughput=11.5987G/s
copy/api:c/path:dsa/exec:sync/qsize:128/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time        34582 ns        34600 ns         4046 Latency=34.5817us Latency/Op=270.17ns Throughput=15.1608G/s
copy/api:cpp/path:dsa/exec:sync/qsize:128/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time      44488 ns        44524 ns         3149 Latency=44.4884us Latency/Op=347.566ns Throughput=11.7848G/s
copy/api:c/path:dsa/exec:sync/qsize:128/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time        38350 ns        38386 ns         3648 Latency=38.3497us Latency/Op=299.607ns Throughput=13.6712G/s
copy/api:cpp/path:dsa/exec:sync/qsize:128/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time      44497 ns        44534 ns         3135 Latency=44.4972us Latency/Op=347.634ns Throughput=11.7825G/s
  3. I also tried to run the cpu path benchmarks directly, but the filter regex does not match any benchmark.
$ sudo ./build/bin/dml_benchmarks --benchmark_filter="copy/api:c/path:cpu/exec:sync/.*" --no_hw
== Host:   XXX
== Kernel: 6.6.38-XXX
== CPU:    Intel(R) Xeon(R) Gold 6454S (143)
  --> Microcode: 0x2b0005c0
  --> Stepping:  8
  --> Logical:   64
  --> Physical:  64
  --> Socket:    32
  --> Cluster:   8
== Accelerators: 4
  --> NUMA 0: 2
  --> NUMA 1: 2
== Affinity Map [device_index: thread_index(cpu_index)...]:
  --> 0(32) 1(40) 2(48) 3(56) 4(33) 5(41) 6(49) 7(57)
Failed to match any benchmarks against regex: copy/api:c/path:cpu/exec:sync/.*

The Introduction section says that the software path is used if a hardware accelerator is not available.
So I disabled DSA with 'accel-config' and re-ran the benchmark, but the software path still doesn't work.

I'm sorry for the multiple questions. My goal is to run the software path on the CPU so that I can compare its performance with the hardware path (DSA).

If any further information is needed, please ask.

Thank you :)

abdelrahim-hentabli (Contributor) commented

Hey @luminus7, sorry for the delay. I think this is actually an issue with the way we set up the default out_mem location for the benchmarks.

Could you try adding --out_mem=def to the command to run the cpu path benchmarks?
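
For anyone reading along, here is a rough sketch of why the second command reports "Failed to match any benchmarks". It assumes the filter is matched against the full generated benchmark names, which is what the output posted above suggests. The names are copied from that output; grep -E is only a stand-in for the benchmark binary's own regex matching, and count_matches is a hypothetical helper, not part of DML.

```shell
#!/usr/bin/env bash
# Benchmark names copied from the first four rows of the run posted in this issue.
names='copy/api:c/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time
copy/api:cpp/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time
copy/api:c/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time
copy/api:cpp/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time'

# count_matches REGEX: print how many of the names contain a match for REGEX.
# Assumption: grep -E approximates how the benchmark filter is applied.
# "|| true" keeps the exit status clean when grep finds nothing.
count_matches() {
  printf '%s\n' "$names" | grep -E -c -- "$1" || true
}

# Filter from the first run: matches every listed name.
count_matches 'copy/.*/exec:sync/.*/size:4096/.*'

# Filter from the failing run: none of the listed names contains path:cpu.
count_matches 'copy/api:c/path:cpu/exec:sync/.*'
```

Every name in the posted output carries path:dsa together with out_mem:cc_def, so a path:cpu filter has nothing to match among them. That is consistent with the default out_mem setting being the cause, but whether appending --out_mem=def to the failing command produces path:cpu rows has not been confirmed in this thread.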

@abdelrahim-hentabli abdelrahim-hentabli self-assigned this Oct 23, 2024