-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Threads time taken report. #1834
Comments
i don't see how we can safely change this behaviour without breaking a lot of existing users. |
With an optional argument from the command-line to turn it on you wouldn't break existing users. I'm testing with the following code removed. Ideally it should be enabled / disabled by that command-line argument. diff --git a/src/benchmark_runner.cc b/src/benchmark_runner.cc
index 3a8c307..9e3bbc0 100644
--- a/src/benchmark_runner.cc
+++ b/src/benchmark_runner.cc
@@ -289,18 +289,12 @@ BenchmarkRunner::IterationResults BenchmarkRunner::DoNIterations() {
// And get rid of the manager.
manager.reset();
- // Adjust real/manual time stats since they were reported per thread.
- i.results.real_time_used /= b.threads();
- i.results.manual_time_used /= b.threads();
- // If we were measuring whole-process CPU usage, adjust the CPU time too.
- if (b.measure_process_cpu_time()) i.results.cpu_time_used /= b.threads();
-
BM_VLOG(2) << "Ran in " << i.results.cpu_time_used << "/"
<< i.results.real_time_used << "\n";
// By using KeepRunningBatch a benchmark can iterate more times than
// requested, so take the iteration count from i.results.
- i.iters = i.results.iterations / b.threads();
+ i.iters = i.results.iterations;
// Base decisions off of real time if requested by this benchmark.
i.seconds = i.results.cpu_time_used;
I'm testing it now in our environment, I will post the results. |
i'm not thrilled about the idea of fundamentally changing how we report on threaded benchmarks with a flag. it's the sort of thing that should be an opinion (and it is today) rather than having to a) maintain two ways of doing the same thing and b) explaining to users which they should use and when. do you think this is so fundamental a problem that we should consider changing the default behaviour? |
Yes, I believe it's a fundamental problem. I was executing the benchmark on a code that was "divide and conquer" on a 128 core processor (with hyperthread / turbo boost disabled), I was dialing up the number of threads, I was expecting 128 threads to be roughly the "sweet spot", but the times were still drastically improving with 1024 threads. So I went for the manual creation and destruction of threads. Then I was getting the expected results. Sorry for taking a long time... I'm will do the tests and report back, in the next hour or so. (It's been a busy day.) |
I'm happy with the test results.
CPU Caches:
L1 Data 48 KiB (x128)
L1 Instruction 32 KiB (x128)
L2 Unified 2048 KiB (x128)
L3 Unified 327680 KiB (x2)
Load Average: 1.33, 5.90, 4.25
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
simple_run0/threads:1 1000110269 ns 21783 ns 10
simple_run0/threads:2 1000113257 ns 20000 ns 40
simple_run0/threads:4 1000130917 ns 22819 ns 4
simple_run0/threads:8 1000117407 ns 23134 ns 8
simple_run0/threads:16 1000171957 ns 23078 ns 16
simple_run0/threads:32 1000184530 ns 22492 ns 32
simple_run0/threads:64 1000242465 ns 25044 ns 64
simple_run0/threads:128 1000154692 ns 13733 ns 128
simple_run0/threads:256 1000116723 ns 11508 ns 256
simple_run0/threads:512 1000151504 ns 13930 ns 512
simple_run0/threads:1024 1000097065 ns 23422 ns 1024 A more process intensive benchmark (monte-carlo simulation): simple_run1/8192/16/threads:1 23921 ms 23919 ms 1
simple_run1/8192/16/threads:2 11999 ms 11999 ms 2
simple_run1/8192/16/threads:4 6021 ms 6021 ms 4
simple_run1/8192/16/threads:8 3020 ms 3020 ms 8
simple_run1/8192/16/threads:16 1519 ms 1519 ms 16
simple_run1/8192/16/threads:32 765 ms 765 ms 32
simple_run1/8192/16/threads:64 395 ms 388 ms 64
simple_run1/8192/16/threads:128 325 ms 199 ms 128
simple_run1/8192/16/threads:256 230 ms 103 ms 256
simple_run1/8192/16/threads:512 250 ms 54.4 ms 512
simple_run1/8192/16/threads:1024 244 ms 29.7 ms 1024 I believe that the code change is in the "right direction". At least for our purposes we will have to use the forked / altered version. |
are you using |
No, in the latest run (above), I'm not using |
it's been a while since i looked at this, but iirc the real time divides by threads, as does iterations. manual time also divides both by iterations and threads. process CPU time does the same, but non-process CPU time doesn't. ie, this is why the CPU column stays ~constant for your sleep test and shows improvement for your monte carlo test (though i'm curious why the CPU time continues to shrink for larger numbers of threads but the real-time flattens out. your change manipulates iterations which i think is incorrect based on the comment -- ie we get an invalid iterations when running in batches and need to correct by taking from results, and as you can see from your results the number of iterations of a single loop is removing the thread division from the real time being stored in results gives similar results to what you have (1s timings for sleeping for 1s). given the previous discussion on this, i wonder if we should add a column. i'm not sure what to call it though. maybe "per-thread time"? |
Yes, I've noticed something was slightly off. Thanks! |
i'm going to see what it looks like to add this to reports. |
Ok
…On Wed, 7 Aug 2024, 13:08 dominic, ***@***.***> wrote:
i'm going to see what it looks like to add this to reports.
—
Reply to this email directly, view it on GitHub
<#1834 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIDKORI55N4WRDGHPTGFYTZQIE2RAVCNFSM6AAAAABMB7NU7SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZTGMYTMNBTHE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
See #1834 for detailed description of why this is useful.
See #1834 for detailed description of why this is useful.
Describe the bug
A benchmark that has only sleep_for(10s) gets reported / number of threads.
System
Which OS, compiler, and compiler version are you using:
To reproduce
Steps to reproduce the behavior:
create a benchmark that only hwas a std::this_thread::sleep_for(10s).
Execute it using
->Threads(2);
Expected behavior
It should report 10s, but only reports 5s.
Additional context
I understand that this issue may include several changes downstream.
However, I can't find it right the that a code, which is takes 1s to run in 1 thread (as per
sleep_for(1s)
), gets reported as 0.5s in 2 threads, 0.25 in 4 threads and so on."A question for you: Would you agree that a code that just has a wait time of 1s should be reported that it took 1s to run, regardless of the number of threads?"
Digging deeper into the code:
The following function
DoNIterations()
will callRunInThread()
:And
RunInThread()
will eventually post its resultsBut then further down in
RunInThread
you have:So, you're averaging the times with the number of threads, but the number of iterations doesn't get average causing the discrepancy in measurements that use the
->Threads([greater than 1])
functionality.Since the averaging happens later with the total number of iterations, eliminating the time adjusting (all of
=/ b.threads();
) from the code would be the best solution. Also, this could be an optional parameter (for initial testing purposes).Again, this would only be relevant for benchmarks that use the
->Threads
function.The text was updated successfully, but these errors were encountered: