Skip to content

Latest commit

 

History

History
704 lines (545 loc) · 46.3 KB

File metadata and controls

704 lines (545 loc) · 46.3 KB

Lock free debug tools

Concurrency is hard but can improve speed of programs. For that it must be well designed. As for all programs it must be with the less bugs as possible, therefore we use and need tools to guide and correct us.

There are many technics to do concurrent programs and at some point we have to synchronize our threads to make them communicate. These synchronizations can be effectively made with locks (eg: mutexes) or with lock-free technics allowing our threads to cooperate enabling maximum concurrency (some thread makes progress every step). This is also more robust, there is less dependencies with other threads (no locks).

Unfortunately lock-free programming has drawbacks and is more complicated than lock based counterpart. For instance to design lock-free algorithms you must be sure to not break invariants all the time as you can't exclude other threads when updating things. You must be aware of the order in which the data will be visible to other threads. And use atomics to avoid undefined behavior. I will discuss about these problems in the next sections.

If you want an introduction to lock-free programming I would recommend you to read this article by Jeff Preshing. And an in-depth book about concurrency: C++ Concurrency in action by Anthony Williams See references for more.

Table of Contents:

Problems in lock-free programming

Data race

A data race is a situation in which at least two threads access a shared variable at the same time (unsynchronized), and at least one thread tries to modify the variable.

See an example.

Race Condition

A race condition is a situation in which the result of an operation depends on the interleaving of certain individual operations.

In red the data races, in green the race conditions (code):

data race and race condition

Memory reordering

Memory reordering can lead to a completely different program from the source code. This is due to compiler optimizations but also during runtime the processor may reorder stores and loads.

For example, can you guess possible outputs of this program ? (Suppose operations on x and y atomic):

int main() {
  int x = 0;
  int y = 0;
  int r1;
  int r2;
  std::thread      t1([&] {
    y = 1;  // #1
    r1 = x; // #2
  });
  std::thread      t2([&] {
    x = 1;  // #3
    r2 = y; // #4
  });
  t1.join();
  t2.join();
  std::cout << "r1: " << r1 << ", r2: " << r2 << "\n";
  return 0;
}

/* it could be:
r1: 0, r2: 1
r1: 1, r2: 0
r1: 1, r2: 1
right ? but this could also be:
r1: 0, r2: 0
it can arise if #1 #2 or #3 #4 are reordered (store/load reorder)
to prevent memory reordering use barrier or std::atomic with memory_order
(by default it is sequentially consistent)
*/

Here is the reference of C++ memory ordering it explain with example different type of memory ordering.

I recommend you to read memory ordering at compile time by Jeff Preshing. And also memory barriers are like source control operations.

Livelock

Livelock, like deadlock, is when one thread is waiting for another, which is in turn waiting for the former. The difference is the wait is non blocking but active such as spin-lock.

Starvation

Starvation is when some thread(s) access shared part frequently preventing other thread(s) to make progress
Lock-free program does not necessarily prevent starvation.

False sharing

False sharing is when 2 variable accessed frequently be 2 different threads are on the same cache line, thus the threads must acquire exclusivity of that cache line preventing real parallelism.

ABA' problem

ABA can appear in lock-free program when using CAS (compare and swap), example or example trying to provoc ABA

aba problem

Multiple solutions exist:

  • Double length CAS with modification counter
  • Reference counter (shared_ptr with atomic, not necessarily lock-free but portable C++11, simplified in C++20 std::atomic<shared_ptr<>>)
  • Hazard pointers
  • lazy garbadge collection
  • garbage collector
  • Not freeing memory

note lock-free programs are not necessarily starvation free

Tools

These tools have been tested on:

  • Machine 1(M1): ArchLinux Linux 5.1.15-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 25 04:49:39 UTC 2019 x86_64 GNU/Linux (Intel i7-6700HQ (8 cores) 3.5GHz)
    • gcc version 9.1.0-2 (GCC)
    • clang version 8.0.0-4 (tags/RELEASE_800/final)

Note: Outputs may differs depending on compilers and options:

  • Compilers:
    • gcc
    • clang
  • Standard library:
    • libstdc++ (gnu)
    • libc++ (llvm)
  • Options:
    • -g: debug info (file name, symbols, line:column number)

The tools will be tested with different options and differences will be reported.

The tools will be tested with small C++ programs (multiple times for dynamic tools), focused on lock-free.\

Each test with error will be run on a loop, for the first run it will not loop. Then if the tool does not detect the error, we will try to loop 10 times and try again with 100 times and 1000 times.

Each correct test will be run on a loop, we will do 4 run: no loop, loop 10 times, loop 100 times, loop 1000 times. For static tools we won't use loops of course

Tests list:

1: This program is just here to tries to provoc the ABA' problem.
It won't be tested with tools because the tools instruments malloc and other things that prevent the tweak to work.

Compilation, it will make 3 build (gcc, clang and clang with libc++):

./build_all.sh

Valgrind: DRD

M1 Version: valgrind-3.14.0
Type: dynamic on-the-fly
Plateform: Linux/MacOS/Android 32/64bits

DRD principal usage

DRD can detect data races, improper use of POSIX threads, false sharing, deadlock and monitor lock contention. But can't detect wrong lock order.
Also DRD support detached threads.

It is faster than Helgring but can be less precise.

By default DRD does not check for local variable (stack).

  1. Compile normally (with -g for debug info)
  2. Run: valgrind --tool=drd ./program

note: you can colorize the output: pip install colour-valgrind

--check-stack-var=<yes|no> [default: no]
Controls whether DRD detects data races on stack variables.

--segment-merging=<yes|no> [default: yes]
Segment merging is an algorithm to limit memory usage. Disabling segment merging may improve the accuracy displayed in race reports but can also trigger an out of memory error.

--suppressions=<suppressions-file>
Specify user suppressions file, example with atomics. doc

--gen-suppressions=all
Generate suppression for detected problem. Useful to rapidly suppress problem.

You can also annotate your code to help DRD understand happens-before relation example with std::atomic<>. Unfortunately this can silence errors if you get wrong with these, be sure to really understand you code before. It may be possible to wrap annotations and check for ordering(std::memory_order) in user defined atomic class.

=> Check other options

DRD tests

Run command:

valgrind --gen-suppressions=all --suppressions=valgrind.supp --check-stack-var=yes --tool=drd ./prog

The suppressions file if you want.

Sample with errors Gcc Clang Clang libc++ Details
Simple data race more
Data race on std::string more
Data race notify more
Data race on std::map more
Data race and race condition more
Data race on object destruction more
Data race on small string destruction more
Data race on string destruction more
ABA' problem in a stack DS more
ABA' problem in a stack DS Sync ✔10 ✔10 ✔10 more
Notification load relaxed ~ ~ ~ more
Notification load relaxed in loop ~ ~ ~ more
Notification load/store relaxed ~ ~ ~ more
Correct sample Gcc Clang Clang libc++ Details
Simple data race fix ~ more
Simple data race fix relaxed ~ ~ ~ more
Notification sequentially consistent ~ ~ ~ more
Notification acquire release ~ ~ ~ more
ABA' fixed more
store/load sequentially consistent DRD does nothing on that, not the purpose of it
store/load acquire release DRD does nothing on that, not the purpose of it
store/load relaxed DRD does nothing on that, not the purpose of it
  • ✔: The tool has correctly detected the error or correctly reported no error
  • ?: The tool has reported an error even though there was no error
  • ✘: The tool has not reported the error
  • ~: Out of scope
  • !: The tool has crashed
  • <n> if a number is specified it means that the error is manifesting when looping n times.

You can see output samples.

Valgrind: Helgrind

M1 Version: valgrind-3.14.0
Type: dynamic on-the-fly
Plateform: Linux/MacOS/Android 32/64bits

Helgrind principal usage

Helgrind can detect data races (concerned read/write position showed), improper use of POSIX threads, deadlock linked to lock ordering problems.

It is more precise than DRD but slower.

By default Helgrind checks for local variable (stack).

  1. Compile normally (with -g for debug info)
  2. Run: valgrind --tool=helgrind ./program

note: you can colorize the output: pip install colour-valgrind

--check-stack-refs=no|yes [default: yes]
This flag enables you to skip checking for accesses to thread stacks (local variables). This can improve performance, but comes at the cost of missing races on stack-allocated data.

--track-lockorders=no|yes [default: yes]
Helgrind performs lock order consistency checking. If you're only interested in race errors, you may want to disable lock order checking.

--history-level=none|approx|full [default: full]\

  • full: causes Helgrind collects enough information about "old" accesses that it can produce two stack traces in a race report: both the stack trace for the current access, and the trace for the older, conflicting access.
    Collecting such information is expensive in both speed and memory. You may not need it in situations where you just want to check for the presence or absence of races, for example, when doing regression testing of a previously race-free program.

  • none: is the opposite. It causes Helgrind not to collect any information about previous accesses.

  • approx: provides a compromise between these two. It causes Helgrind to show a full trace for the later access, and approximate information regarding the earlier access. This approximate information consists of two stacks, and the earlier access is guaranteed to have occurred somewhere between program points denoted by the two stacks. This is not as useful as showing the exact stack for the previous access but it is almost as fast as none.

You can also annotate your code, like DRD, to help Helgrind understand happens-before relation example with std::atomic<>. Unfortunately this can silence errors if you get wrong with these, be sure to really understand you code before. It may be possible to wrap annotations and check for ordering(std::memory_order) in user defined atomic class.

=> Check other options

Helgrind tests

Run command:

valgrind --gen-suppressions=all --suppressions=valgrind.supp --tool=helgrind ./prog

The suppressions file if you want.

Sample with errors Gcc Clang Clang libc++ Details
Simple data race more
Data race on std::string more
Data race notify more
Data race on std::map more
Data race and race condition more
Data race on object destruction more
Data race on small string destruction more
Data race on string destruction more
ABA' problem in a stack DS more
ABA' problem in a stack DS Sync ✔10 more
Notification load relaxed ~ ~ ~ more
Notification load relaxed in loop ~ ~ ~ more
Notification load/store relaxed ~ ~ ~ more
Correct sample Gcc Clang Clang libc++ Details
Simple data race fix ~ more
Simple data race fix relaxed ~ ~ ~ more
Notification sequentially consistent ~ ~ ~ more
Notification acquire release ~ ~ ~ more
ABA' fixed more
store/load sequentially consistent Helgrind does nothing on that, not the purpose of it though
store/load acquire release Helgrind does nothing on that, not the purpose of it though
store/load relaxed Helgrind does nothing on that, not the purpose of it though
  • ✔: The tool has correctly detected the error or correctly reported no error
  • ?: The tool has reported an error even though there was no error
  • ✘: The tool has not reported the error
  • ~: Out of scope
  • !: The tool has crashed
  • <n> if a number is specified it means that the error is manifesting when looping n times.

You can see output samples.

Type: dynamic on-the-fly
Plateform: Linux/MacOS/Android 64bits only

ThreadSanitizer principal usage

ThreadSanitizer can detect data races (concerned read/write position showed), improper use of POSIX threads, deadlock linked to lock ordering problems. see more.
It also support std::atomic<>

ThreadSanitizer as for other sanitizers is integrated in Gcc and Clang

  1. Compile with -fsanitize=thread (and with -g for debug info)
  2. Run normally: ./program, you could use ./path_colorizer.sh to colorize paths to source.
  3. you can pass options at runtime by setting TSAN_OPTIONS variable

There is compile time options and options that can be passed at runtime avoiding recompilation.
And also common flags with other sanitizer.

It is possible to annotate code or suppress warnings.

ThreadSanitizer tests

Run command:

PATH_COLORIZER_SEARCH=$HOME ./path_colorizer.sh ./program
Sample with errors Gcc Clang Clang libc++ Details
Simple data race more
Data race on std::string more
Data race notify more
Data race on std::map more
Data race and race condition more
Data race on object destruction more
Data race on small string destruction more
Data race on string destruction more
ABA' problem in a stack DS more
ABA' problem in a stack DS Sync ✔10 ✔10 ✔1000 more
Notification load relaxed more
Notification load relaxed in loop more
Notification load/store relaxed more
Correct sample Gcc Clang Clang libc++ Details
Simple data race fix more
Simple data race fix relaxed more
Notification sequentially consistent more
Notification acquire release more
ABA' fixed more
store/load sequentially consistent Tsan does nothing on that, not the purpose of it though
store/load acquire release Tsan does nothing on that, not the purpose of it though
store/load relaxed Tsan does nothing on that, not the purpose of it though
  • ✔: The tool has correctly detected the error or correctly reported no error
  • ?: The tool has reported an error even though there was no error
  • ✘: The tool has not reported the error
  • !: The tool has crashed
  • <n> if a number is specified it means that the error is manifesting when looping n times.

You can see output samples.

Intel Inspector (Free version)

M1 Version: 2019.3.199-3
Type: dynamic on-the-fly instrumentation, post-mortem analysis
Plateform: Linux/MacOS/Windows 32/64bits

Intel Inspector principal usage

Intel Inspector can detect data race, deadlock, it partially support atomics (MSVC/Intel(icc)/Clang).
Here is a more detailed list.

But it has some limitations, especially C++11 Atomics on Clang/Gcc aren't supported.

  1. Compile normally (and with -g for debug info)
  2. Run intel inspector or inspxe-cl command line
  3. Create/open a project
  4. Tell it where your binary is
    project
  5. Launch analyses
    project

When creating a new analysis you can specify options by clicking on Edit button:

options

It is pretty straightforward, by default data race on stack is disabled, you may want to enable it.

Intel Inspector tests

using the cli, I written a script.

inspxe-cl -collect=ti-3 -- ./prog; inspxe-cl -report=problem
# or
./lauch_inspxe.sh
Sample with errors Gcc Clang Clang libc++ Details
Simple data race more
Data race on std::string more
Data race notify more
Data race on std::map more
Data race and race condition more
Data race on object destruction more
Data race on small string destruction more
Data race on string destruction more
ABA' problem in a stack DS more
ABA' problem in a stack DS Sync ✔1000 ✔1000 more
Notification load relaxed ~ ~ ~ more
Notification load relaxed in loop ~ ~ ~ more
Notification load/store relaxed ~ ~ ~ more
Correct sample Gcc Clang Clang libc++ Details
Simple data race fix ~ more
Simple data race fix relaxed ~ ~ ~ more
Notification sequentially consistent ~ ~ ~ more
Notification acquire release ~ ~ ~ more
ABA' fixed more
  • ✔: The tool has correctly detected the error or correctly reported no error
  • ?: The tool has reported an error even though there was no error
  • ✘: The tool has not reported the error
  • ~: Out of scope
  • !: The tool has crashed
  • <n> if a number is specified it means that the error is manifesting when looping n times.

You can see output samples.

Type: static
Plateform: On the web

CppMem principal usage

CppMem has been designed to explore the C/C++ memory model. It support a tiny subset of pseudo C/C++ (no struct/classes)

  1. Go to http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
  2. Write/enter your "program"
  3. Run
    how to run cppmem
  4. See different consistent executions
    how to explore results cppmem

You can use std::thread or you could use this syntax:

{{{
  {
    // thread1
  }
|||
  {
    // thread2
  }
}}}

Output explanation

how to explore results cppmem

You could rapidly see on top global information about the current possible execution (races, locks, ...).
Then the graph describing that execution.

  • W: write
  • R: read
  • (R/W)na: non atomic R/W
  • (R/W)(sc|rlx): atomic R/W (seq_cst|relaxed)
  • Racq: atomic read acquire
  • Wrel: atomic write release
  • sb: sequenced before
  • mo: modification order
  • sw: synchronized-with
  • hb: happens-before
  • rf: read from
  • dr: data race
  • sc: sequentially consistent

More details on their help page.

Tests

We have to convert the C++ into CppMem syntax, some test aren't there because CppMem does not support features used in it.

Sample Result Details
Simple data race CppMem more
Data race on notify CppMem more
Notification load relaxed CppMem more
Notification load relaxed in loop CppMem more
Notification load/store relaxed CppMem more
Fix simple data race CppMem more
Fix simple data race relaxed CppMem more
Notification sequentially consistent CppMem more
Notification acquire release CppMem more
store/load sequentially consistent CppMem more
store/load acquire release CppMem more
store/load relaxed CppMem more

You can see output samples.

Other tools

There are some other tools not covered here with given reason:

  • Other sanitizers: these must be useful to catch other bugs not related to concurrency, check it out.
  • Debuggers (gdb, ...) of course, use it to control interleaving of threads or to explore the execution near detected errors.
  • Relacy race detector: The code must be instrumented (unit test) and compiled C++03 or C++11 but std::* must be changed by rl::* see example
  • CDSChecker: poor support of C++, old.
  • IFRit: too old, based on old llvm/clang

Notes

  • To initialize something in concurrency (double checked locking) think about using static local variable
  • Do not use volatile for synchronization!

Summary

Most tools have poor support of lock-free programming.
CppMem can shows us what is going on in lock free but has serious limitations (no struct, tiny subset of pseudo C supported).

ThreadSanitizer can work well in lock-free context but has not been seriously tested on that specific area[7] and there are no paper describing the new Thread Sanitizer (v2).

Even though we hadn't benchmark the time, during the tests Tsan appears to be the fastest.
Intel Inspector seems to be really slow, actually the analysis take a long time.
DRD even faster than Helgrind seems both to be significantly slower than Tsan.

We had ideas for further testing, but due to lack of time we don't have tested:

  • programs with more than 2 threads
  • real examples
    • boost lock-free
    • TBB
    • PPLP (program develop at Verimag)
  • More benchmark (on real examples)
    • Execution time
    • Memory overhead
  • Other hardware:
    • ARM64
    • Verimag PC

Finally it would be interesting to:

  1. Do further testing on real lock free programs with TSan, as this haven't been done, maybe by stressing the tool with generated correct code with corner cases.
    In order to potentially complete TSan specifically for lock-free programs.
  2. See to make an optimizer that would transform seq_cst into more relaxed barrier.
    That will simplify coder task to produce less error prone program, without losing performance due to relative cost of full memory barriers.
  3. Explore how we could detect the ABA' problem as no tools could simply detect it.

References

  1. CppReference C++ reference close to the standard but more digest.
  2. Book C++ concurrency in action by Anthony Williams good book about concurrency and lock-free
  3. Conference CppCon 2014 Herb Sutter "Lock-Free Programming (or, Juggling Razor Blades)"
  4. Papers
    1. Memory Barriers: a Hardware View for Software Hackers, Paul E. Mckenney, 10.1.1.170.3279
    2. ThreadSanitizer – data race detection in practice, Konstantin Serebryany and Timur Iskhodzhanov, 10.1.1.308.3963
  5. Blogs
    1. ModernsCpp
    2. Jeff Preshing
    3. Concurrency Freaks
  6. Awesome lock-free resources
  7. TSan supports [...] C++ <atomic> operations are supported with llvm libc++ (not very thoroughly tested, though).