This lesson will examine how to provide efficient synchronization among cores and threads that are working on the same program.
(N.B. The figure shown above erroneously uses `R` rather than `R1` in the relevant "downstream" instructions. This error is not repeated in the following text, which uses `R1` exclusively.)
Consider an example (as in the figure shown above) to demonstrate why synchronization is necessary.
The system in question is characterized as follows:
- Two threads count the occurrences of different letters in a document
- Thread `A` counts the first half of the document
- Thread `B` counts the second half of the document
- Lastly, the counts from the two threads are combined at the end
The corresponding program code is as follows:
(thread `A`)
LW L, 0(R1) # load a letter into `L`
LW R1, Count[L] # load the count for letter `L` from memory
ADD R1, R1, 1 # increment the count
SW R1, Count[L] # store the new count back to memory
(thread `B` - same work as thread `A`)
LW L, 0(R1) # load a letter into `L` -- N.B. `R1` here is distinct from that of thread A
LW R1, Count[L] # load the count for letter `L` from memory
ADD R1, R1, 1 # increment the count
SW R1, Count[L] # store the new count back to memory
As long as the two threads process different letters, this program works correctly.
However, if both threads encounter the same letter (e.g., `'A'`), then both threads attempt to load the same counter value (i.e., `Count[L]`, having current value `15`). On increment (i.e., `ADD R1, R1, 1`), both threads compute the same new value (i.e., `16`) and store it into memory (i.e., `SW R1, Count[L]`). However, since there were two occurrences of the letter (i.e., `'A'`), this count is incorrect (i.e., `16` rather than `17` is stored in memory).
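For concreteness, the following illustrative interleaving (hypothetical timing, with `Count['A']` initially `15`) shows how one of the two increments is lost:

| Step | Thread A | Thread B | `Count['A']` in memory |
|---|---|---|---|
| 1 | `LW R1, Count[L]` (reads `15`) | | `15` |
| 2 | | `LW R1, Count[L]` (reads `15`) | `15` |
| 3 | `ADD R1, R1, 1` (`R1` = `16`) | | `15` |
| 4 | | `ADD R1, R1, 1` (`R1` = `16`) | `15` |
| 5 | `SW R1, Count[L]` (writes `16`) | | `16` |
| 6 | | `SW R1, Count[L]` (writes `16` again) | `16` |

The final value is `16`, whereas `17` is expected; whichever thread stores last simply overwrites the other thread's update.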
Therefore, to ensure correct program behavior, the count-incrementing operations must be performed sequentially across the two threads when handling the same data.
- For example, if thread `A` increments first, then cache coherence ensures that the value `16` is read by thread `B`.
- Thread `B` then subsequently increments the value to `17`.
N.B. The ordering of these two operations among the threads is not significant (i.e., the equivalent holds if thread `B` had written first instead).
These regions of thread-coordinated operations are called atomic (or critical) sections. Synchronization, i.e., additional code that enforces this one-thread-at-a-time behavior, is therefore necessary.
Continuing from the previous example (cf. Section 2), the type of synchronization used for atomic sections is called mutual exclusion (or lock). Such a lock is used to "flank" the atomic section in order to coordinate among the threads.
To perform this locking, an explicit mechanism of `lock` and `unlock` is used (as in the figure shown above).
The lock `CountLock[L]` has a status of open or closed at any given time.
- When the lock `CountLock[L]` is open, the atomic section can be entered.
- Otherwise, when the lock `CountLock[L]` is closed, the thread spins until the lock is once again opened.
On acquisition of the lock (and consequently closing the lock to other threads), the thread enters the atomic section and performs its operations. On exit of the atomic section, the lock becomes unlocked and subsequently available to another thread.
By having the lock present in this manner, mutual exclusion of the atomic-section code is enforced. However, this does not otherwise impose an order of execution among the respective threads (it simply prevents simultaneous execution at any given time).
Consider the following code fragment denoting an atomic section:
lock(CountLock[L]);
count[L]++;
unlock(CountLock[L]);
How is `CountLock[L]` described in this context? (Select the correct option.)
- Just another location in shared memory: CORRECT
- A location in a special synchronization memory: INCORRECT
- A special variable without a memory address: INCORRECT
Explanation:
A lock is essentially just another "variable" like any other, having a memory address, which in turn can be loaded, modified, etc.
This lesson will subsequently explore the nature of these lock variables and the associated functions `lock()` and `unlock()`.
To further examine lock-based synchronization, consider the following function definitions:
typedef int mutex_type;
void lock_init(mutex_type &lock_var) {
lock_var = 0;
}
void lock(mutex_type &lock_var) {
while (lock_var == 1);
lock_var = 1;
}
void unlock(mutex_type &lock_var) {
lock_var = 0;
}
For simplicity, an integer is used here to represent "a location in (shared) memory" (cf. Section 4).
The function `lock_init()` initializes the lock variable `lock_var` to `0` (unlocked).
The function `lock()` "spins" via the `while` loop as long as `lock_var` has value `1` (locked), until `lock_var` is set to `0`. On exit of the `while` loop, `lock()` sets `lock_var` to `1` (i.e., the lock is acquired, for subsequent entry into the critical section).
The function `unlock()` sets the lock `lock_var` to `0` on exit from the critical section, thereby opening/freeing the lock for subsequent use.
- Coherence ensures that the other thread(s) waiting on `lock_var` within the function `lock()` at this point observe this update to `lock_var` accordingly (i.e., for subsequent lock acquisition)
However, note that the function `lock()` does not work in practice as implemented here.
- Suppose there are two threads (one purple, one green), which both initially encounter the `while` loop with `lock_var` having value `0`, and subsequently both acquire the lock by setting `lock_var` to `1`.
- Now, both threads are simultaneously accessing the critical section.
Therefore, in order to correctly implement the function `lock()`, the checking and the setting of the lock value `lock_var` must together form an atomic operation (i.e., performed in its own critical section).
This gives rise to an apparent paradox: A critical section is needed in order to implement a critical-section-based lock.
Per the apparent "paradox" identified in the previous section (cf. Section 5), some sort of "magic lock" modification is necessary, as follows:
void lock(mutex_type &lock_var) {
Wait:
lock_magic();
if (lock_var == 0) {
lock_var = 1;
unlock_magic();
goto Exit;
}
unlock_magic();
goto Wait;
Exit:
}
Of course, there is no such "magic." Instead, the available ways to resolve this issue are as follows:
- Use Lamport's bakery algorithm (or another comparable algorithm), which accomplishes this using only normal load/store instructions (a sketch follows after this list)
  - This approach is fairly expensive and complicated to implement (i.e., tens of instructions), however.
- Use special atomic read and write instructions
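As an illustration of the first option, the following is a minimal sketch of Lamport's bakery algorithm for `N` threads, using only ordinary loads and stores. It is illustrative only (the names `choosing`, `number`, and `bakery_lock()` are not from the lesson) and assumes sequentially consistent memory; on real hardware, additional memory fences would be required.

```c
#define N 2                          // number of threads (assumption for this sketch)

volatile int choosing[N];            // 1 while thread i is picking its ticket
volatile int number[N];              // thread i's ticket (0 = not competing)

void bakery_lock(int i) {            // i = index of the calling thread
    choosing[i] = 1;
    int max = 0;                     // take a ticket larger than all outstanding tickets
    for (int j = 0; j < N; j++)
        if (number[j] > max) max = number[j];
    number[i] = max + 1;
    choosing[i] = 0;
    for (int j = 0; j < N; j++) {
        while (choosing[j]);         // wait until thread j has finished picking
        while (number[j] != 0 &&     // wait while j holds a smaller ticket
               (number[j] < number[i] ||
                (number[j] == number[i] && j < i)));  // ties broken by thread index
    }
}

void bakery_unlock(int i) {
    number[i] = 0;                   // discard the ticket, releasing the lock
}
```

Even this stripped-down version requires two loops over per-thread arrays on every acquisition, which is why such algorithms are characterized here as expensive (tens of instructions) relative to hardware atomic instructions.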
To implement atomic instructions for locks relatively easily (i.e., without otherwise resorting to complicated algorithms), which of the following properties is necessary? (Select the correct option.)
- A load instruction (read memory): INCORRECT
- A store instruction (write memory): INCORRECT
- An instruction that both reads and writes memory: CORRECT
- An instruction that does not access memory at all: INCORRECT
Explanation:
Recall (cf. Section 6) that an instruction of the following general form must be performed:
if (lock_var == 0) {
lock_var = 1;
}
This entails both a read (i.e., `lock_var == 0`) and a write (i.e., `lock_var = 1`).
Recall (cf. Section 7) that in order to implement atomic instructions, both read and write operations are necessary. There are three main types of such atomic instructions accordingly.
The first such atomic instruction is an atomic exchange instruction, which performs the following transformation:
(instruction)
EXCH R1, 78(R2)
(transformation)
R1 = 1;
while (R1 == 1)
EXCH R1, lock_var;
The instruction `EXCH` resembles a load (`LW`) or store (`SW`) instruction; however, it essentially performs both simultaneously, atomically swapping the contents of the register and the memory location.
In the transformed version, the value stored in `R1` is exchanged with `lock_var`. In this manner, looping persists until `R1` receives the value `0`, which means the exchange has atomically set `lock_var` to `1` (thereby acquiring the lock and precluding other threads from doing so at this point).
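The same exchange-based lock can be expressed portably; the following is a minimal sketch using C11's `atomic_exchange` as a stand-in for the `EXCH` instruction (the function names are illustrative, not from the lesson):

```c
#include <stdatomic.h>

typedef atomic_int mutex_type;

void exch_lock(mutex_type *lock_var) {
    // Keep exchanging 1 into the lock; receiving 0 back means the lock was free
    // and has now been acquired by this thread.
    while (atomic_exchange(lock_var, 1) == 1)
        ;  // spin -- note that every iteration writes to the lock, even while it is busy
}

void exch_unlock(mutex_type *lock_var) {
    atomic_store(lock_var, 0);  // free the lock
}
```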
A key drawback of this atomic exchange instruction is that it writes persistently to the memory location, even while the lock is busy (i.e., locked by another thread).
The second type of atomic instruction is a family of instructions which are generally classified as test-and-write.
- First, the location is tested, and then if it satisfies some conditions, then (and only then) writing occurs.
For example, consider a representative instruction `TSTSW` (i.e., test-and-store word), as follows:
(instruction)
TSTSW R1, 78(R2)
(transformation)
if (Mem[78 + R2] == 0) {
Mem[78 + R2] = R1;
R1 = 1;
} else {
R1 = 0;
}
The idea here is to test whether the lock is free (i.e., via the condition `Mem[78 + R2] == 0`), and if so, then (and only then) to write the value held in `R1` to the lock; otherwise, simply proceed as usual (i.e., without writing to the lock).
Consider a test-and-set atomic instruction of the general form `TSET R1, Addr`, defined equivalently as follows:
if (Mem[Addr] == 0) {
Mem[Addr] = 1;
R1 = 1;
} else {
R1 = 0;
}
N.B. `Addr` is a memory address (typically given in the form of a register and a corresponding offset).
For the corresponding implementation of the function `lock()`:
lock(mutex_type &lock_var) {
R1 = 0;
while (R1 == 0) {
// TODO: Specify instruction here
}
}
How can this atomic instruction be used here accordingly?
Answer and Explanation:
lock(mutex_type &lock_var) {
R1 = 0;
while (R1 == 0) {
TSET R1, lock_var; // Answer
}
}
Here, `TSET R1, lock_var` checks the status of `lock_var`, and will only perform the write if the value of `lock_var` is currently `0`; otherwise, it leaves the lock untouched (setting `R1` to `0`, so the loop simply retries).
Returning to the topic of atomic instructions (cf. Section 8), with respect to the test-and-store atomic instruction, the representative instruction `TSTSW` (i.e., test-and-store word) can be used to implement a lock as follows:
(instruction)
TSTSW R1, 78(R2)
(transformation)
do {
R1 = 1;
TSTSW R1, lock_var;
} while (R1 == 0)
N.B. This is an alternate implementation to that shown in Section 8.
Given the instruction `TSTSW`, as long as the lock is occupied, the store "sub-operation" is suppressed: the instruction only reads the lock variable, and does not otherwise write to it.
- With respect to coherence (cf. Lesson 19), this is desirable, as this prevents superfluous broadcasts of write invalidations across the respective threads' copies of the data. Instead, the "waiting" threads simply "share" the lock variable, iterating on their respective cached copies of the data at this point.
- Subsequently, on freeing of the lock, these shared copies are invalidated via the write to the lock variable by the corresponding `unlock()`, at which point the update is broadcast to the other threads.
- Therefore, cross-thread communication with respect to the lock occurs only on acquisition.
This test-and-write approach solves the problem of continuously writing to the lock variable; however, it still involves the somewhat "strange" instruction `TSTSW` (and similar), i.e., an instruction which is neither a "pure load" nor a "pure store."
- From a processor design standpoint, it would be preferable to use operations that are more similar to "pure" load and store operations, while still providing the corresponding atomic functionality.
To address this particular issue, there is a third type of atomic instruction, comprising a pair called load linked / store conditional (LL/SC) (as discussed in the next section).
Performing an atomic read and an atomic write in the same instruction (even if the atomic write only occurs when the atomic read detects the appropriate condition) is very bad for pipelining insofar as processor design is concerned.
Consider a classical five-stage pipeline (as in the figure shown above).
- A load instruction is fetched (F), decoded/read (D/R), has its address computed (A), is accessed via this computed address from memory (M), and then finally written to the register (W).
- Given an atomic read/write operation, on reaching stage M, it cannot perform all of its necessary tasks in only one memory access (reading and writing are mutually exclusive in any given cycle at this point in the pipeline, without otherwise complicating stage M). Therefore, just for this special case of an atomic read/write operation, stage M would have to be expanded into a two- or three-stage (or more) memory stage (correspondingly increasing the length of the pipeline) in order to perform the appropriate check immediately prior to writing to memory.
- N.B. Such an expanded pipeline would impact all other instructions, not just these particular atomic read/write operations!
Consequently, the paired atomic instructions load linked (LL) and store conditional (SC) resolve this issue accordingly by splitting these atomic operations into two distinct instructions.
- Load linked (LL) implements the "read" atomic operation
  - Load linked (LL) behaves as a "normal load" instruction (i.e., `LW`), simply reading from a memory location and placing the corresponding value into the target register
  - However, load linked (LL) additionally saves the address from which it loaded into a special link register
- Store conditional (SC) implements the "write" atomic operation
  - Store conditional (SC) first checks if the address that it computes is present in the link register
    - If the address is present in the link register, then store conditional (SC) performs a "normal store" (i.e., `SW`) to the memory location in question, and consequently returns the value `1` in its register
    - Conversely, if the address is not present in the link register, then store conditional (SC) simply returns the value `0` in its register (without otherwise storing a value)
Collectively, these instruction pairs effectively form a "single/composite" atomic operation via the link register.
So, then, how exactly do load linked (LL) and store conditional (SC) behave in an atomic manner, such that intermediate writes from other processors/cores do not otherwise interfere with this paired operation?
To achieve this atomicity, the instruction pairs are used as follows:
LL R1, lock_var
SC R2, lock_var
If load linked (LL) observes the variable `lock_var` in the appropriate (free) state, then the subsequent store conditional (SC) should only succeed if the lock is still free at the point of its execution; otherwise, if the lock has already been written to in the interim, then store conditional (SC) should fail.
In order to ensure this desired behavior, the key is to snoop the writes to `lock_var` and to clear the link register (i.e., set it to `0`) if such a write by another core is observed. Consequently, the address comparison performed by store conditional (i.e., the address of `lock_var` against the link register) will fail.
- Therefore, there is a reliance on cache coherence (cf. Lesson 19) to accomplish synchronization in this manner.
Note that load linked (LL) and store conditional (SC) are each individually atomic operations (each is a single load or a single store). Therefore, it is not necessary to use a "sub-lock" to further ensure atomicity with respect to either of these instructions.
Correspondingly, observe the following transformation:
(instruction)
lock(...);
var++;
unlock(...);
(transformation)
Try:
LL R1, var;
R1++;
SC R1, var;
if (R1 == 0)
goto Try;
If multiple threads perform these operations concurrently, they effectively "compete," with the first thread to complete its store conditional (SC) succeeding, and the other threads consequently failing (i.e., observing `R1 == 0` and retrying).
- N.B. This specific example increments `var` atomically; however, the same general principle/construct applies for other critical-section operations performed in this manner as well.

Therefore, load linked / store conditional gives rise to a relatively simple/direct implementation of a critical section with corresponding atomic operations, obviating the need for locks in the transformed implementation (the atomic instructions load linked and store conditional are instead used directly on the variable in question itself [i.e., `var`]).
- N.B. For more complicated critical sections, such as those accessing multiple variables simultaneously, this does not work as well, however.
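At a higher level, this retry pattern is what C11's `atomic_compare_exchange_weak` expresses, and on LL/SC-based machines compilers typically lower it to exactly such an LL/SC loop. The following is a minimal sketch (illustrative, not the lesson's code) of the atomic increment of `var`:

```c
#include <stdatomic.h>

void atomic_increment(atomic_int *var) {
    int old = atomic_load(var);   // analogous to LL: read the current value
    // Retry until the store succeeds without an intervening write by another thread
    // (analogous to SC failing and jumping back to Try).
    while (!atomic_compare_exchange_weak(var, &old, old + 1))
        ;  // on failure, `old` is automatically refreshed with the current value
}
```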
Given the following function `lock()`:
void lock(mutex_type &lock_var) {
try_lock:
MOV R1, 1
LL R2, lock_var
SC R1, lock_var
// TODO: Select instructions
}
The intended behavior of the function `lock()` is such that on exit of the function, the lock (i.e., `lock_var`) has been acquired exclusively (i.e., `lock_var` has been set to `1` via the store of `R1`).
How can load linked (LL) / store conditional (SC) be used to implement this function (where `LINK` is the link register, as applicable)? (Select the applicable set of instructions.)
(Instructions 1)
BEQZ R1, try_lock
INCORRECT
(Instructions 2)
BEQZ R2, try_lock
INCORRECT
(Instructions 3)
BNEZ R2, try_lock
BEQZ LINK, try_lock
INCORRECT
(Instructions 4)
BNEZ R2, try_lock
BEQZ R1, try_lock
CORRECT
(Instructions 5)
BNEZ R2, try_lock
BNEZ LINK, try_lock
INCORRECT
Explanation:
All of these prospective instruction sets repeatedly attempt to acquire the lock `lock_var` via `try_lock`, using the respectively provided conditions.
In order to select the correct conditions to accomplish this, two conditions must be checked, as follows:
- 1 - Was a free lock observed with respect to `LL R2, lock_var`? If so, then rather than jumping back to `try_lock`, instead proceed out of the function `lock()`.
- 2 - Otherwise, if a free lock is not observed, then jump back to `try_lock`.
The instruction `BNEZ R2, try_lock` satisfies both conditions.
- A non-zero value in `R2` indicates that the lock is busy, necessitating a jump back to `try_lock`.
Additionally, it must be determined whether `R1` has been stored successfully, i.e., before another thread managed to do so first. `R1` indicates a successful store if it contains the value `1` following the (given) instruction `SC R1, lock_var`.
The instruction `BEQZ R1, try_lock` performs this determination.
- If the store via the upstream instruction `SC` fails (leaving `R1` with the value `0`), then the function must jump back to `try_lock`.
Therefore, the correct set of instructions is Instructions 4. In summary, in the implementation of `lock()` using this pair of instructions:
- If the lock is observed as free (via `BNEZ R2, try_lock` with `R2` having value `0`) and the store succeeds (via `BEQZ R1, try_lock` with `R1` having a non-zero value), then the lock is acquired and the function `lock()` consequently exits.
- Otherwise, if the lock is observed as busy/acquired (via `BNEZ R2, try_lock` with `R2` having a non-zero value), then the function `lock()` jumps back to `try_lock` for a subsequent attempt to acquire the lock.
- Otherwise, if the lock is observed as free (via `BNEZ R2, try_lock` with `R2` having value `0`) but another thread has already stored to it at this point (via `BEQZ R1, try_lock` with `R1` having value `0`), then the function `lock()` jumps back to `try_lock` for a subsequent attempt to acquire the lock.
N.B. Instructions 3 and 5 check the link register (`LINK`) directly; however, note that this is a "hidden/abstracted" register, which generally cannot be "checked directly" in this manner. The corresponding access is performed implicitly by the atomic operations load linked (LL) / store conditional (SC).
Consider now how different lock implementations interact with the coherence protocols (cf. Lesson 19), and what the corresponding implications are with respect to performance.
Consider a system of three cores numbered `0` through `2`, as in the figure shown above.
The following sequence is performed involving the atomic operation `EXCH`:

| Sequence | Cache 0 | Cache 1 | Cache 2 |
|---|---|---|---|
| S1 | `EXCH R1, lock_var` | (N/A) | (N/A) |
| S2 | (N/A) | `EXCH R1, lock_var` | (N/A) |
| S3 | (N/A) | (N/A) | `EXCH R1, lock_var` |
| ⋮ | ⋮ | ⋮ | ⋮ |
| SN | `lock_var = 0` | (N/A) | (N/A) |
In sequence S1:
- `lock_var` transitions to the modified (M) state with respect to cache `0`
- `lock_var` is absent (i.e., invalid) in the other two cores
Eventually, cache `0` (which acquires the lock in sequence S1) will unlock by setting `lock_var = 0` (i.e., in downstream sequence SN). However, several events transpire in the intervening sequences prior to this.
In sequence S2:
- Since `lock_var` still has value `1` at this point (via cache `0`'s prior acquisition in sequence S1), the attempted acquisition by cache `1` does not obtain the lock, but it does transfer the cache block (holding `lock_var`) to cache `1`.
- Cache `1` writes to `lock_var` (via its `EXCH`) and broadcasts the corresponding invalidation; consequently, the modified (M) state is transferred from cache `0` to cache `1`.
Similarly, in sequence S3:
- Since `lock_var` still has value `1` at this point (the lock is still held, with the block currently residing in cache `1` from sequence S2), the attempted acquisition by cache `2` does not obtain the lock, but it does transfer the cache block to cache `2`.
- Cache `2` writes to `lock_var` (via its `EXCH`) and broadcasts the corresponding invalidation; consequently, the modified (M) state is transferred from cache `1` to cache `2`.
Subsequent to sequence S3, as long as the lock remains busy in any given sequence, the other cores will continue to spin. Consequently, the cache block will move among the cores many times before the lock is finally freed/unlocked (i.e., via `lock_var = 0` in downstream sequence SN). This additionally incurs communication on the shared bus, with corresponding power consumption, since none of this activity succeeds in acquiring the lock (i.e., until `lock_var` resets to `0`).
If at some later point cache `1` is the core which next acquires the lock (i.e., immediately following sequence SN, during which cache `0` is in the modified [M] state), the corresponding write will consequently transfer the modified (M) state back to cache `1`. Downstream of this, similar superfluous spinning and power consumption will commence as before.
Therefore, this is a very power-intensive process, as each such movement of the cache block requires considerable energy. Furthermore, the interconnect among the cores (e.g., the shared bus) experiences a high volume of traffic, correspondingly "throttling" the system (e.g., the one "active" core at any given time, if experiencing cache misses, will also handle these more slowly as a result, thereby effectively reducing the useful work performed by the system as a whole).
Recall (cf. Section 14) that performing atomic read and atomic write operations on the lock repeatedly (even while waiting for the lock) is inefficient with respect to power consumption, and correspondingly can even have a deleterious effect on the operation of the thread currently performing useful work (i.e., inside of the critical section).
In order to reduce this inefficiency, a test-and-atomic-op lock approach can be used accordingly. A representative example of this can be implemented as follows:
(original version - repeated atomic op)
R1 = 1;
while (R1 == 1)
EXCH R1, lock_var;
(improved version - test-and-atomic-op approach)
R1 = 1;
while (R1 == 1) {
while (lock_var == 1); // use "normal" load operations
EXCH R1, lock_var;
}
In the original version, the "bare" `EXCH R1, lock_var` yields persistent invalidations (and corresponding cache misses) while waiting for the lock to be freed.
Conversely, in the improved version, "normal" load operations are used while the lock (i.e., `lock_var`) is busy. Once it is observed that the lock is free (i.e., `lock_var == 0`), then (and only then) is there a subsequent attempt to acquire the lock using `EXCH` (but not via a "normal" store operation, as an atomic operation is necessary to acquire the lock exclusively at this point).
With this corresponding update, the following sequence is performed involving the atomic operation `EXCH` (cf. the similar sequence from Section 14):

| Sequence | Cache 0 | Cache 1 | Cache 2 |
|---|---|---|---|
| S1 | `EXCH R1, lock_var` | (N/A) | (N/A) |
| S2 | (N/A) | `EXCH R1, lock_var` | (N/A) |
| S3 | (N/A) | (N/A) | `EXCH R1, lock_var` |
| ⋮ | ⋮ | ⋮ | ⋮ |
In sequence S1, cache `0` acquires the lock in the modified (M) state (the other two caches are in the invalid [I] state at this point).
In sequence S2, cache `1` attempts to acquire the lock and transfers the cache block to itself in the shared (S) state, while cache `0` transitions to the owned (O) state with respect to this cache block. While cache `1` is in the shared (S) state, its subsequent loads will yield cache hits, rather than cache misses, thereby obviating the need to access the bus.
Similarly, in sequence S3, cache `2` attempts to acquire the lock and transfers the cache block to itself in the shared (S) state. While cache `2` is in the shared (S) state, its subsequent loads will similarly yield cache hits, rather than cache misses, thereby obviating the need to access the bus.
Therefore, while cache `0` owns the lock, the other two caches do not generate superfluous bus traffic during this time; effectively, cache `0` has exclusive access to the coherence bus during this time, thereby allowing it to service its cache misses much more quickly and efficiently. Furthermore, cache hits on cache `0` are also correspondingly much more efficient compared to coherence bus accesses.
The system proceeds in this manner until cache `0` ultimately releases the lock, at which point there is a consequent invalidation and a subsequent read by the other two cores in the system (which now compete for the lock). However, these "activity bursts" are now confined strictly to the (relatively brief) "lock exchange" periods. Otherwise, during periods when the lock is "busy," the other caches simply use "normal" loads in the meantime, which is correspondingly much more energy-efficient.
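The test-and-atomic-op idea corresponds to what is commonly called a "test-and-test-and-set" spinlock. The following is a minimal C11 sketch (illustrative names, not from the lesson), in which the inner loop spins using plain loads on the locally cached copy, and only the final `atomic_exchange` generates an invalidation:

```c
#include <stdatomic.h>

void ttas_lock(atomic_int *lock_var) {
    for (;;) {
        // Test: spin with ordinary loads while the lock is busy (cache hits, no bus traffic).
        while (atomic_load(lock_var) == 1)
            ;
        // Atomic op: the lock looked free, so attempt the actual acquisition.
        if (atomic_exchange(lock_var, 1) == 0)
            return;  // acquired
        // Another thread won the race between our load and the exchange; test again.
    }
}

void ttas_unlock(atomic_int *lock_var) {
    atomic_store(lock_var, 0);  // free the lock (an ordinary store suffices here)
}
```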
Recall (cf. Section 13) the following implementation of the function `lock()`:
void lock(mutex_type &lock_var) {
try_lock:
MOV R1, 1
LL R2, lock_var
SC R1, lock_var // Instruction A
BNEZ R2, try_lock // Instruction B
BEQZ R1, try_lock // Instruction C
}
For the corresponding test-and-atomic-op version, reorder instructions A, B, and C appropriately.
Answer and Explanation:
void lock(mutex_type &lock_var) {
try_lock:
MOV R1, 1
LL R2, lock_var
BNEZ R2, try_lock // Instruction B′
SC R1, lock_var // Instruction A′
BEQZ R1, try_lock // Instruction C
}
The idea with the test-and-atomic-op version is to simply refrain from writing until the lock is observed to be free. Therefore, prior to attempting to store (Instruction A), first the corresponding check of the lock is performed accordingly (Instruction B).
- On loading the value into `R2` (i.e., `LL R2, lock_var`), if the value is `1`, then the lock is not free and the function correspondingly jumps back to `try_lock`. This pattern continues as long as the lock is busy. Correspondingly, while waiting, the operation load linked (LL) simply behaves as a "normal" load (and there is correspondingly no further "noisy traffic" among the caches at this point).
- Subsequently, when `R2` eventually loads the value `0` (i.e., the lock becomes free), then (and only then) is a subsequent store conditional (SC) operation performed (i.e., `SC R1, lock_var`). Additionally, the subsequent check is performed to ensure that the lock has actually been acquired (i.e., `BEQZ R1, try_lock`).
As a follow-up, now consider the corresponding test-and-atomic-op implementation of the function `unlock()`, as follows:
void unlock(mutex_type &lock_var) {
// TODO: Select appropriate instruction
}
Which of the following is the appropriate choice for this implementation? (Select the correct option.)
- `SW 0, lock_var`: CORRECT
- `LL`, check if `1`, then `SC`: INCORRECT
- Additional atomic instructions are required: INCORRECT
Explanation:
void unlock(mutex_type &lock_var) {
SW 0, lock_var;
}
Unlike the function `lock()` (cf. Section 16), which required two checks (whether the lock is available, and then whether it was actually acquired), the function `unlock()` is only ever performed by the thread currently possessing the lock. Therefore, to free the lock, `lock_var` is simply set to `0`, without otherwise requiring any additional atomic operations to accomplish this.
Another type of synchronization that is often needed in programs is called barrier synchronization. This often occurs when there is a parallel section wherein several threads perform independent actions simultaneously (as in the figure shown above), e.g., each thread adds the numbers in a designated sub-array section of a shared array.
Eventually, this separate work must be combined (e.g., determining the total sum for the entire array, across all of the threads). In order to do this, it is necessary to ensure that each thread has concluded its own individual work; this is what the barrier provides. Only once all threads have reached the barrier in this manner can the final result be used (e.g., `print(sum);`).
Another possibility is that on computation, all of the threads must access the sum simultaneously (as in the figure shown above). For this purpose, the barrier also serves as a corresponding "checkpoint" in this manner, serving as a global wait which ensures that all threads have first entered the barrier prior to commencing with access of this "global" value. The barrier synchronization ensures that all threads have arrived first, prior to commencing with departure from the barrier.
A barrier implementation typically consists of two variables, as follows:
- counter → counts how many threads have arrived
  - Each arriving thread increments this value on arrival at the barrier
- flag → set when the counter finally reaches the total thread count (i.e., condition `counter == N`)
  - Each arriving thread checks the flag on arrival: the last thread to arrive sets the flag to indicate "all have arrived," while the others spin on the flag pending arrival of the remaining threads

However, it is not quite so straightforward to implement this in practice; the remainder of this section will discuss this matter accordingly.
Consider the following relatively simple implementation of a barrier in a program:
lock(counter_lock); // begin critical section
if (count == 0) release = 0; // (re)initialize release flag to `0` (not yet released)
count++; // count the arrivals
unlock(counter_lock); // end critical section
if (count == total) {
count = 0; // (re)initialize the counter to `0`
release = 1; // (re)set release flag to `1` (released)
} else {
spin(release == 1); // wait for `release` to attain value `1` (released)
}
The counter `count` is a shared variable, which is contended over by the threads (for incrementing).
If `count == total` holds on exit of the critical section, this implies that the thread in question is either the last thread to arrive, or that the last thread has arrived by the time of the check.
- N.B. These two conditions are distinct. For example, in general, the currently incrementing thread might not necessarily be the last-arriving one. However, by the point of checking the total immediately following exit of the critical section, the last thread has arrived.
Subsequently:
- If `count == total` at this point, then the counter `count` is re-initialized (i.e., reset to `0`) for subsequent reuse. Furthermore, the release flag `release` is set to `1`, thereby informing the waiting threads that the barrier can now be released.
- Otherwise, the thread waits via `spin()`, pending release by the last thread.

Furthermore, on entry into the critical section, the first-arriving thread resets the release flag `release` to `0` (i.e., not yet released), so that threads do not pass the barrier prematurely.
This simple implementation is not entirely correct, however. Consider the scenario of two threads synchronizing on this barrier (as in the figure shown above).
- In this case, `total == 2` for the total number of threads in the system.
- On the first pass, the barrier works correctly, as intended.
- However, on subsequent arrivals at the barrier, a deadlock condition can result (as discussed in the next section).
Recall (cf. Section 19) the simple barrier implementation as follows:
lock(counter_lock); // begin critical section
if (count == 0) release = 0; // (re)initialize release flag to `0` (not yet released)
count++; // count the arrivals
unlock(counter_lock); // end critical section
if (count == total) {
count = 0; // (re)initialize the counter to `0`
release = 1; // (re)set release flag to `1` (released)
} else {
spin(release == 1); // wait for `release` to attain value `1` (released)
}
Furthermore, consider two cores (numbered `0`/blue and `1`/magenta) running the same program simultaneously (as in the figure shown above).
On arrival at the critical section, assume that core `0` arrives first for the sake of example. Thread `0` (corresponding to core `0`) passes through the critical section, and eventually waits at `spin()`.
- The analogous operation is `LW release` (yielding `0`) at this point. It continues to spin in this manner, checking the status of `release`.
Subsequently, on arrival at the critical section, thread `1` (corresponding to core `1`) should ideally pass through the critical section (setting `count` to `2` in the process), and then subsequently set `count` to `0` and `release` to `1`.
- The analogous operation is `SW release` (setting it to `1`) at this point, which in turn causes thread `0` to exit from `spin()` (thereby exiting the barrier).
However, consider the case where thread `0`'s check of the status of `release` is delayed relative to the release by thread `1` (e.g., thread `0` is suspended by an intervening interrupt at this point, precluding a "sufficiently fast" exit from `spin()`).
- In this case, rather than thread `0` exiting `spin()` (via corresponding detection of the value `1` via `LW release`), thread `1` instead proceeds to re-enter the critical section (detecting `count == 0` at this point from its own preceding reset) and subsequently resets `release` to `0` (i.e., not released).
- Consequently, on the subsequent check by thread `0` in `spin()`/waiting, it no longer sees a "released" system state.
Now, both threads end up waiting: thread `0` never observes the (since-reverted) release and remains in `spin()`, and consequently it never arrives at the barrier again to increment `count`; therefore, thread `1` also eventually ends up in the `spin()`/waiting state, since `count` never reaches `total`. This condition is called a deadlock.
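The problematic interleaving described above can be summarized by the following illustrative trace (hypothetical timing, with `total == 2`):

| Step | Thread 0 | Thread 1 | `count` | `release` |
|---|---|---|---|---|
| 1 | arrives: `release = 0`, `count = 1` | | `1` | `0` |
| 2 | spins on `release == 1` | | `1` | `0` |
| 3 | | arrives: `count = 2`, sees `count == total` | `2` | `0` |
| 4 | delayed (e.g., interrupt) | resets `count = 0`, sets `release = 1`, exits barrier | `0` | `1` |
| 5 | still delayed | re-enters barrier: sees `count == 0`, resets `release = 0`, `count = 1`, then spins | `1` | `0` |
| 6 | resumes: sees `release == 0`, keeps spinning | spins on `release == 1` (`count` never reaches `2`) | `1` | `0` |

Both threads are now spinning, and no thread remains to set `release`, which is precisely the deadlock.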
To better understand the problem with the simple barrier implementation, consider the same code from previously (cf. Section 20), as follows:
lock(counter_lock); // begin critical section
if (count == 0) release = 0; // (re)initialize release flag to `0` (not yet released)
count++; // count the arrivals
unlock(counter_lock); // end critical section
if (count == total) {
count = 0; // (re)initialize the counter to `0`
release = 1; // (re)set release flag to `1` (released)
} else {
spin(release == 1); // wait for `release` to attain value `1` (released)
}
Now, assume that on exit from the barrier by both threads (i.e., `0` and `1`), only thread `0` proceeds to do subsequent work, while thread `1` simply "terminates."
In this case, is the simple barrier implementation effective/correct?
YES
Explanation:
If the barrier is only used once (i.e., via the corresponding critical-section code and `release` mechanism), then eventually, when both threads perform the "first passthrough" of the critical section, the system will reach a state whereby thread `0` waits on `spin()`, and then thread `1` resets `count` to `0` and `release` to `1`. Eventually, due to coherence, the wait will resolve, and thread `1` simply exits the barrier (and subsequently terminates), while thread `0` continues on to subsequent work.
- Alternatively, neither thread may end up waiting via `spin()`, but rather both simply release instead; however, the net effect is the same (i.e., thread `1` terminates, and thread `0` proceeds onto subsequent work).
Therefore, the incorrectness of the simple barrier implementation does not stem from this "first passthrough" scenario, but rather from the fact that this implementation's release mechanism is not reusable (i.e., on subsequent entries by the threads into the barrier).
So, then, how can such a reusable barrier be implemented? This can be done as follows:
local_sense = !local_sense;
lock(counter_lock);
count++;
if (count == total) {
count = 0;
release = local_sense;
}
unlock(counter_lock);
spin(release == local_sense);
The idea here is that the value which releases the barrier (i.e., the value that `release` must attain) is not the same for all instances of the barrier: in some instances the waiting threads are released when `release` becomes `0`, while in the alternating instances they are released when `release` becomes `1`.
Therefore, it is not actually necessary to ever reinitialize `release` to `0`; rather, the releasing value is simply "flipped" from one barrier instance to the next. Correspondingly, each thread has its own `local_sense`, which at any given time indicates "which" release value to detect in order to exit from waiting.
For example, consider two threads (numbered `0` and `1`), as in the figure shown above. Assume that `local_sense` is initialized to `0` (respectively in each thread, with each owning this independently as its own local variable), and is then flipped to the value `1` immediately prior to entry of either thread into the critical section.
As the threads pass through the critical section, one thread will eventually detect the condition `count == total` and subsequently set `release = local_sense` (i.e., `1` in this case), which is the pending release value for the particular barrier instance in question. At this point, the other thread has exited the critical section and is waiting (i.e., via `spin(release == local_sense)`); that is, it waits on the release condition (`release` attaining the value `1`, matching its own `local_sense`).
Consider the situation where the setting of `release` to `1` is not detected promptly by the waiting thread (i.e., due to "delayed" detection prior to exiting `spin()`).
- In this case, the other (releasing) thread proceeds onto the next iteration and re-enters the barrier. However, on this re-entry, it will invert its own local `local_sense` back to `0`.
- It then falls through the critical section and itself reaches the waiting condition (i.e., `spin(release == local_sense)`); however, having flipped its own `local_sense` to `0`, it now waits on a different release value than the delayed thread does (i.e., at this point, the two threads are waiting on "disparate" values of `release`). The delayed thread (whose `local_sense` is still `1`) therefore still detects `release == 1` as its release condition, which in turn releases that thread for subsequent work.
  - N.B. At this point, "globally" `release` still has a value of `1`, which is how it was set most recently.
Generalizing in this manner, the released thread will subsequently (on re-entering the barrier as the last arriver) "invert" the value of `release` again, thereby releasing the currently waiting thread, and so on. Therefore, this inversion allows the barrier to be reused, without otherwise risking a deadlock condition occurring in the program during execution.
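As a minimal compilable sketch of this sense-reversing barrier, the following uses C11 atomics and the C11 threads/mutex API; the names (`sr_barrier_wait`, `TOTAL`) and the fixed thread count are illustrative assumptions rather than the lesson's code:

```c
#include <stdatomic.h>
#include <threads.h>

#define TOTAL 2                              // number of participating threads (assumption)

static int count = 0;                        // arrivals so far (protected by counter_lock)
static atomic_int release_flag = 0;          // only ever flipped, never reset
static mtx_t counter_lock;

static _Thread_local int local_sense = 0;    // each thread's private sense

void sr_barrier_init(void) {
    mtx_init(&counter_lock, mtx_plain);
}

void sr_barrier_wait(void) {
    local_sense = !local_sense;              // the release value for this barrier episode
    mtx_lock(&counter_lock);
    count++;
    if (count == TOTAL) {
        count = 0;                           // reset the counter for the next episode
        atomic_store(&release_flag, local_sense);  // release all waiting threads
    }
    mtx_unlock(&counter_lock);
    while (atomic_load(&release_flag) != local_sense)
        ;                                    // spin until this episode is released
}
```

Because `release_flag` is only ever flipped, a thread that races ahead into the next barrier episode cannot accidentally "un-release" a thread that is still waiting on the previous one, which is exactly the property that avoids the deadlock described earlier.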
This lesson has demonstrated that synchronization operations (e.g., locks and barriers) are necessary in order to coordinate the activities among multiple threads. Furthermore, special atomic operations are necessary in order to implement locks efficiently.
The next lesson will examine how synchronization can be perturbed when a processor is capable of reordering load and store operations, and consequently how to guard against this accordingly.