doc: Update according to build system changes
lhmouse committed Jan 28, 2024
1 parent efd7f8e commit aa0a12a
Showing 3 changed files with 150 additions and 46 deletions.
53 changes: 41 additions & 12 deletions MUTEX.md

### Synopsis

The `_MCF_mutex` type in MCF Gthread is a structure the size of one pointer, with fields as follows:

```c
typedef struct __MCF_mutex _MCF_mutex;
struct __MCF_mutex
  {
    /* Bit-fields `__locked`, `__sp_mask` and `__nsleep`; their exact
       widths are collapsed in this diff view.  */
  };
```
Initially all bits are zeroes, which means the mutex is not locked, and no thread is spinning or waiting. In order to acquire (lock) a mutex, a thread has to change the `__locked` bit from zero to one. If it cannot do so because the mutex has been locked by another thread, it must wait until the `__locked` bit becomes zero.

In an attempt to lock a mutex, a thread must test and update the mutex with _atomic operations_, taking _exactly_ one of the three possible actions:

1. Change `__locked` from zero to one;
2. Change one bit of `__sp_mask` from zero to one;
3. Increment `__nsleep`.

If a thread takes _Action 1_, it will have locked the mutex. If a thread takes _Action 3_, it will go to sleep on the global _Keyed Event_, passing the address of the mutex as the key, so it can be woken up later. These are straightforward.


The most interesting and complex part is _Action 2_. If a thread has set a bit of `__sp_mask`, it gains a chance to perform some busy-waiting.


### The Problems with Busy-waiting

A naive way of busy-waiting is to read and test the `__locked` bit repeatedly in a loop. This is not incorrect, but with many threads running it can put a lot of pressure on the CPU bus.

This can be addressed by allocating a separate flag byte for each spinning thread, so the same location is not shared between CPU cores. An example is the [K42 lock algorithm](https://locklessinc.com/articles/locks/), which organizes waiters into a linked list, and **is also what Microsoft has adopted in their `SRWLOCK`s**.

The fundamental problem with a linked list of waiters is that it has to be _reliable_. Because a thread that attempts to unlock a mutex must notify the next waiter, it has to access this linked list. If a waiter could time out and remove a node from the linked list, the notifier could access deallocated memory, which would be catastrophic.

We have already noticed that a thread may stop spinning and go to sleep at any time. So, is it possible to trade reliability for some efficiency?

### The Ultimate Solution
The ultimate solution adopted by mcfgthread is a static hash table. A waiter is assigned a flag byte from the table, according to the hash value of the mutex and its bit index in `__sp_mask`. It then only needs to check its own flag byte: if the byte is zero, it continues spinning; otherwise, it resets the byte to zero and makes an attempt to lock the mutex. The flag bytes for the spinning threads of the same mutex are scattered in the hash table such that they never share the same cache line. The table is never deallocated, so no notifier can access deallocated memory.

The last question is, **what about hash collisions**? We simply ignore them, because they don't matter. This notification mechanism is not reliable, and never has to be, as the number of spinning iterations is capped. A spinning thread will use up its iterations sooner or later, and accidental hash collisions can never produce incorrect results.
117 changes: 88 additions & 29 deletions README.md
# The MCF Gthread Library

**MCF Gthread** is a threading support library for **Windows 7** and above that implements the _gthread interface set_, which is used internally both by **GCC** to provide synchronization of initialization of local static objects, and by **libstdc++** to provide C++11 threading facilities.

I decided to recreate everything from scratch. Apologies for the trouble.

## How to Build

### Cross-compile from Debian, Ubuntu or Linux Mint

This performs cross-compilation. At the moment, only x86-64 is being tested.

```sh
sudo apt-get install -y --no-install-recommends mingw-w64-{x86-64-dev,tools} {gcc,binutils}-mingw-w64-x86-64 meson
meson setup --cross-file meson.cross.x86_64-w64-mingw32 build_dir
cd build_dir
ninja
```

In order to run tests, Wine is required.

```sh
sudo apt-get install -y --no-install-recommends wine64
meson setup --cross-file meson.cross.x86_64-w64-mingw32 build_dir
cd build_dir
ninja
```

### Build from MSYS2

```sh
# For MSYS2 on Windows:
# UCRT64
pacman -S --noconfirm mingw-w64-ucrt-x86_64-{{headers,crt,tools}-git,gcc,binutils,meson}
# MINGW32
pacman -S --noconfirm mingw-w64-i686-{{headers,crt,tools}-git,gcc,binutils,meson}
```

## Notes

In order for `__cxa_atexit()` (and the non-standard `__cxa_at_quick_exit()`) to
conform to the Itanium C++ ABI, it is required 1) for a process to call
`__cxa_finalize(NULL)` when exiting, and 2) for a DLL to call
`__cxa_finalize(&__dso_handle)` when it is unloaded dynamically. This requires
[hacking the CRT](https://github.com/lhmouse/MINGW-packages/blob/0274a6e7e0da258cf5e32efe6e4427454741fa32/mingw-w64-crt-git/9003-crt-Implement-standard-conforming-termination-suppor.patch). If you don't
have the modified CRT, you may still get standard compliance by 1) calling
`__MCF_exit()` instead of `exit()` from your program, and 2) calling
`__cxa_finalize(&__dso_handle)` followed by `fflush(NULL)` upon receipt of
`DLL_PROCESS_DETACH` in your `DllMain()`.

This project uses some undocumented NT system calls and is not guaranteed to work on some Windows versions. The author gives no warranty for this project. Use it at your own risk.

## Benchmarking

[The test program](mutex_performance.c) was compiled and run on a **Windows 10** machine with a 10-core **Intel i9 10900K** processor.

* **#THREADS**: number of threads
* **#ITERATIONS**: number of iterations per thread

### The condition variable

A condition variable is implemented as an atomic counter of threads that are currently waiting on it. Initially the counter is zero, which means no thread is waiting.

When a thread is about to start waiting on a condition variable, it increments the counter and suspends itself using the global keyed event, passing the address of the condition variable as the key. Another thread may read the counter to tell how many threads it will have to wake up (note this has to be atomic), and release them from the global keyed event, also passing the address of the condition variable as the key.

### The primitive mutex

A primitive mutex is just a condition variable with a boolean bit, which designates whether the mutex is LOCKED. A mutex is initialized to all-bits-zero, which means it is unlocked and no thread is waiting.

When a thread wishes to lock a mutex, it checks whether the LOCKED bit is clear. If so, it sets the LOCKED bit and returns, having taken ownership of the mutex. If the LOCKED bit has been set by another thread, it goes to wait on the condition variable. When the thread wishes to unlock this mutex, it clears the LOCKED bit and wakes up at most one thread waiting on the condition variable, if any.

### The 'real' mutex

In reality, critical sections are fairly small. If a thread fails to lock a mutex, it is likely to succeed soon after, and we don't want it to give up its time slice, as a syscall would be overkill. Therefore, it is reasonable for a thread to perform some spinning (busy-waiting) before it actually decides to sleep.

This could however lead to severe problems in case of heavy contention. When there are hundreds of threads attempting to lock the same mutex, the system scheduler has no idea whether they are spinning or not. As it is likely that a lot of threads will eventually give up spinning and make a syscall to sleep, we waste a lot of CPU time and aggravate the situation.

mcfgthread solves this issue by encoding a spin failure counter in each mutex. If a thread gives up spinning because it couldn't lock the mutex within a given number of iterations, the counter is incremented. If a thread locks a mutex successfully while spinning, the counter is decremented. This counter provides a heuristic measure of how heavily a mutex is contended. If there have been many spin failures, newcomers will not attempt to spin, but will make a syscall to sleep on the mutex directly.

### The once-initialization flag

A once-initialization flag contains a READY byte (the first byte, according to the Itanium ABI), which indicates whether initialization has completed. The other bytes are used as a primitive mutex.

A thread that sees the READY byte set to non-zero knows initialization has been done, so it returns immediately. A thread that sees the READY byte set to zero locks the bundled primitive mutex, then performs initialization. If initialization fails, it unlocks the primitive mutex without setting the READY byte, so the next thread that locks the primitive mutex will perform initialization. If initialization succeeds, it sets the READY byte and unlocks the primitive mutex, releasing all threads that are waiting on it. (Remember that a primitive mutex actually contains a condition variable.)
26 changes: 21 additions & 5 deletions patches/README.md
These are patches that I use to build GCC 11 with mcfgthread support.

Normally, mingw-w64 CRT performs per-thread cleanup upon receipt of `DLL_PROCESS_DETACH` in a TLS callback (of an EXE) or the `DllMainCRTStartup()` function (of a DLL). There are some major issues in this approach:

1. These callbacks are invoked after Windows has terminated all the other threads. If another thread is terminated while it holds a mutex, the mutex will never get unlocked. If a destructor of a static object, or a callback registered with `atexit()`, attempts to acquire that same mutex, a deadlock occurs.
2. These callbacks are still invoked if the user calls `_Exit()` or `quick_exit()`, such as [in LLVM](https://reviews.llvm.org/D102944). As specified by the C++ standard, they shall not be called.
3. Per-thread cleanup may be performed after destructors of static objects. The C++ standard does not allow this behavior.

GCC uses `atexit()` to register destructors for static objects. Therefore, the CRT has to be modified to forward such calls to `__MCF_cxa_atexit()`, passing the address of the module-specific `__dso_handle` as the third argument. The modified CRT also forwards calls to `exit()`, `_Exit()`, `_exit()` and `quick_exit()` to standard-conforming ones in mcfgthread, which eventually call `TerminateProcess()` instead of `ExitProcess()`, to address those issues. Per-thread and process cleanup is performed by `__cxa_finalize()`, in accordance with the Itanium ABI.
