doc: Update according to build system changes
lhmouse committed Jan 28, 2024
1 parent efd7f8e commit aa0a12a
Showing 3 changed files with 150 additions and 46 deletions.
53 changes: 41 additions & 12 deletions MUTEX.md

### Synopsis

The `_MCF_mutex` type in MCF Gthread is a structure the size of one pointer, with fields as follows:

```c
typedef struct __MCF_mutex _MCF_mutex;
struct __MCF_mutex
  {
    /* Bit-fields `__locked`, `__sp_mask` and `__nsleep`; their exact
       widths are collapsed in this diff view.  */
  };
```
Initially all bits are zeroes, which means the mutex is not locked, and no thread is spinning or waiting. In order to acquire (lock) a mutex, a thread has to change the `__locked` bit from zero to one. If it cannot do so because the mutex has been locked by another thread, it must wait until the `__locked` bit becomes zero.

In an attempt to lock a mutex, a thread must test and update the mutex with _atomic operations_, taking _exactly_ one of the three possible actions:

1. Change `__locked` from zero to one;
2. Change one bit of `__sp_mask` from zero to one;
3. Increment `__nsleep`.

If a thread takes _Action 1_, it will have locked the mutex. If a thread takes _Action 3_, it will go to sleep on the global _Keyed Event_, passing the address of the mutex as the key, so it can be woken up later. These are straightforward.


The most interesting and complex part is _Action 2_. If a thread has set a bit of `__sp_mask`, it gains a chance to perform some busy-waiting.


### The Problems with Busy-waiting

A naive way of busy-waiting is to read and test the `__locked` bit repeatedly in a loop. This is not incorrect, but with many threads running it can put a lot of pressure on the CPU bus.

This can be addressed by allocating a separate flag byte for each spinning thread, so the same location is not shared between CPU cores. An example is the [K42 lock algorithm](https://locklessinc.com/articles/locks/), which organizes waiters into a linked list, and **is also what Microsoft has adopted in their `SRWLOCK`s**.

The fundamental problem with a linked list of waiters is that it has to be _reliable_. Because a thread that attempts to unlock a mutex must notify the next waiter, it has to access this linked list. If a waiter could time out and remove a node from the linked list, the notifier could access deallocated memory, which would be catastrophic.

We have already noticed that a thread may stop spinning and go to sleep at any time. So, is it possible to trade reliability for some efficiency?

### The Ultimate Solution
The ultimate solution adopted by mcfgthread is a static hash table. A waiter is assigned a flag byte from the table, according to the hash value of the mutex and its bit index in `__sp_mask`. It then only needs to check its own flag byte: if the byte is zero, it continues spinning; otherwise, it resets the byte to zero and makes an attempt to lock the mutex. The flag bytes for the spinning threads of the same mutex are scattered in the hash table such that they never share the same cache line. The table is never deallocated, so no notifier can access deallocated memory.

The last question is, **what about hash collisions**? We simply ignore them, because they don't matter. This notification mechanism is not reliable, and never has to be, as the number of spinning iterations is capped. A spinning thread will use up its iterations sooner or later, and accidental hash collisions can never produce incorrect results.
117 changes: 88 additions & 29 deletions README.md
# The MCF Gthread Library

**MCF Gthread** is a threading support library for **Windows 7** and above that implements the _gthread interface set_, which is used internally both by **GCC** to provide synchronization of initialization of local static objects, and by **libstdc++** to provide C++11 threading facilities.

I decided to recreate everything from scratch. Apologies for the trouble.

## How to Build

### Cross-compile from Debian, Ubuntu or Linux Mint

This performs cross-compilation. At the moment, only x86-64 is being tested.

```sh
sudo apt-get install -y --no-install-recommends mingw-w64-{x86-64-dev,tools} {gcc,binutils}-mingw-w64-x86-64 meson
meson setup --cross-file meson.cross.x86_64-w64-mingw32 build_dir
cd build_dir
ninja
```

In order to run tests, Wine is required.

```sh
sudo apt-get install -y --no-install-recommends wine64
meson setup --cross-file meson.cross.x86_64-w64-mingw32 build_dir
cd build_dir
ninja
```

### Build from MSYS2

```sh
# For MSYS2 on Windows:
# UCRT64
pacman -S --noconfirm mingw-w64-ucrt-x86_64-{{headers,crt,tools}-git,gcc,binutils,meson}
# MINGW32
pacman -S --noconfirm mingw-w64-i686-{{headers,crt,tools}-git,gcc,binutils,meson}
```

## Notes

In order for `__cxa_atexit()` (and the non-standard `__cxa_at_quick_exit()`) to
conform to the Itanium C++ ABI, it is required 1) for a process to call
`__cxa_finalize(NULL)` when exiting, and 2) for a DLL to call
`__cxa_finalize(&__dso_handle)` when it is unloaded dynamically. This requires
[hacking the CRT](https://github.com/lhmouse/MINGW-packages/blob/0274a6e7e0da258cf5e32efe6e4427454741fa32/mingw-w64-crt-git/9003-crt-Implement-standard-conforming-termination-suppor.patch). If you don't
have the modified CRT, you may still get standard compliance by 1) calling
`__MCF_exit()` instead of `exit()` from your program, and 2) calling
`__cxa_finalize(&__dso_handle)` followed by `fflush(NULL)` upon receipt of
`DLL_PROCESS_DETACH` in your `DllMain()`.

This project uses some undocumented NT system calls and is not guaranteed to work on some Windows versions. The author gives no warranty for this project. Use it at your own risk.

## Benchmarking

[The test program](mutex_performance.c) was compiled and run on a **Windows 10** machine with a 10-core **Intel i9 10900K** processor.

* **#THREADS**: number of threads
* **#ITERATIONS**: number of iterations per thread

### The condition variable

A condition variable is implemented as an atomic counter of threads that are currently waiting on it. Initially the counter is zero, which means no thread is waiting.

When a thread is about to start waiting on a condition variable, it increments the counter and suspends itself using the global keyed event, passing the address of the condition variable as the key. Another thread may read the counter to tell how many threads it will have to wake up (note this has to be atomic), and release them from the global keyed event, also passing the address of the condition variable as the key.

### The primitive mutex

A primitive mutex is just a condition variable with a boolean bit, which designates whether the mutex is LOCKED. A mutex is initialized to all-bits-zero, which means it is unlocked and no thread is waiting.

When a thread wishes to lock a mutex, it checks whether the LOCKED bit is clear. If so, it sets the LOCKED bit and returns, having taken ownership of the mutex. If the LOCKED bit has been set by another thread, it goes to wait on the condition variable. When the thread wishes to unlock this mutex, it clears the LOCKED bit and wakes up at most one thread waiting on the condition variable, if any.

### The 'real' mutex

In reality, critical sections are fairly small. If a thread fails to lock a mutex, it is likely to succeed soon after, and we don't want it to give up its time slice, as a syscall would be overkill. Therefore, it is reasonable for a thread to perform some spinning (busy-waiting) before it actually decides to sleep.

This could however lead to severe problems in case of heavy contention. When there are hundreds of threads attempting to lock the same mutex, the system scheduler has no idea whether they are spinning or not. As it is likely that a lot of threads will eventually give up spinning and make a syscall to sleep, we waste a lot of CPU time and aggravate the situation.

mcfgthread solves this issue by encoding a spin failure counter in each mutex. If a thread gives up spinning because it couldn't lock the mutex within a given number of iterations, the counter is incremented. If a thread locks a mutex successfully while spinning, the counter is decremented. This counter provides a heuristic measure of how heavily a mutex is contended. If there have been many spin failures, newcomers will not attempt to spin, but will make a syscall to sleep on the mutex directly.

### The once-initialization flag

A once-initialization flag contains a READY byte (the first byte, according to the Itanium ABI), which indicates whether initialization has completed. The other bytes are used as a primitive mutex.

A thread that sees the READY byte set to non-zero knows initialization has been done, so it returns immediately. A thread that sees the READY byte set to zero locks the bundled primitive mutex, then performs initialization. If initialization fails, it unlocks the primitive mutex without setting the READY byte, so the next thread that locks the primitive mutex will perform initialization. If initialization succeeds, it sets the READY byte and unlocks the primitive mutex, releasing all threads that are waiting on it. (Remember that a primitive mutex actually contains a condition variable.)
26 changes: 21 additions & 5 deletions patches/README.md
These are patches that I use to build GCC 11 with mcfgthread support.

Normally, mingw-w64 CRT performs per-thread cleanup upon receipt of `DLL_PROCESS_DETACH` in a TLS callback (of an EXE) or the `DllMainCRTStartup()` function (of a DLL). There are some major issues in this approach:

1. These callbacks are invoked after Windows has terminated all the other threads. If another thread is terminated while it holds a mutex, the mutex will never get unlocked. If a destructor of a static object, or a callback registered with `atexit()`, attempts to acquire that same mutex, a deadlock occurs.
2. These callbacks are still invoked if the user calls `_Exit()` or `quick_exit()`, such as [in LLVM](https://reviews.llvm.org/D102944). As specified by the C++ standard, they shall not be called.
3. Per-thread cleanup may be performed after destructors of static objects. The C++ standard does not allow this behavior.

GCC uses `atexit()` to register destructors for static objects. Therefore, the CRT has to be modified to forward such calls to `__MCF_cxa_atexit()`, passing the address of the module-specific `__dso_handle` as the third argument. The modified CRT also forwards calls to `exit()`, `_Exit()`, `_exit()` and `quick_exit()` to standard-conforming ones in mcfgthread, which eventually call `TerminateProcess()` instead of `ExitProcess()`, to address those issues. Per-thread and process cleanup is performed by `__cxa_finalize()`, in accordance with the Itanium ABI.
