We have seen (cf. Lesson 8) that an out-of-order processor will track when one instruction uses a register value produced by another.
However, load and store instructions can access the same memory location without using the same register to perform the access; this lesson explores how exactly this can be done correctly in an out-of-order processor.
We have seen (cf. Lessons 7 and 8) that the reorder buffer (ROB) and Tomasulo's algorithm can be used to correctly handle register dependencies between instructions.
However, a question still remains: What about memory access ordering?
- There are load and store instructions available; do they really need to be performed strictly in program order, or can these be reordered, too?
So far, we have eliminated control dependencies using branch prediction.
Furthermore, we have eliminated false dependencies (i.e., write-after-read and write-after-write dependencies through registers) using register renaming.
Next, we will see how to obey read-after-write (RAW) register dependencies using Tomasulo-like scheduling.
- Here, we have instructions waiting in reservation stations until their dependencies have been satisfied, at which point they can proceed to execution out of program order.
Note that the data-dependency handling performed thus far (i.e., false dependencies and RAW register dependencies) pertains only to register dependencies, in which one instruction produces a value in a register that another instruction uses. But what about memory dependencies?
Consider the following program:
SW R1, 0(R3)
LW R2, 0(R4)
Here, there may be a dependency between the memory value written by `SW` (store) and that read by `LW` (load).
- For example, if registers `R3` and `R4` hold the same address, then `LW` must get the value from `SW`, in which case these instructions must be performed in order.
- Conversely, if `R3` and `R4` hold different addresses, then (as previously seen with registers) it is not strictly necessary to maintain program order between these memory accesses.
Nevertheless, we have not yet seen how to handle out-of-order memory access; correspondingly, this topic will be the focal point of this lesson.
The first point we must consider is: For store instructions, how does the memory write operation actually happen? This occurs at commit.
- It is necessary for the write to memory to occur at commit, because it is unsafe to update memory at any point before the instruction commits.
- Any instruction that has not yet committed is subject to being cancelled (e.g., due to branch misprediction or occurrence of an exception).
- Therefore, performing a memory write prematurely may later require a memory "unaccess" (i.e., restoring the old memory value), which is extremely difficult to do. To avoid this, the memory writes performed by stores are delayed until commit.
Does this mean that a memory load also must wait until commit before retrieving the data (i.e., since only then is the data present in memory)? In other words, if store instructions write to memory only at commit, then where do loads get their data from? (Furthermore, it is desirable for loads to occur as early as possible, so that they can supply their data to subsequent dependent instructions promptly.)
- For this purpose, we introduce the structure called the load-store queue, which maintains all of the loads and stores.
Next, we will see how loads get data from the stores in this manner.
Because the loads in our processor must be performed as soon as possible (i.e., for immediate availability to downstream instructions which are dependent on the data), while stores do not write to memory until commit, the load-store queue (LSQ) is required in order to supply the values from stores to loads.
This load-store queue (LSQ) (shown in the figure above) is just like the reorder buffer (ROB), meaning that it has entries placed in order, which are subsequently removed at commit. However, the load-store queue (LSQ) only holds load and store instructions.
- The bit `L/S` indicates whether the instruction is a load (`L`) or store (`S`).
- The field `Addr` specifies which address the instruction is accessing.
- The field `Val` indicates the value that the instruction stores or loads.
- The field `C` indicates whether or not the instruction has been completed.
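To make these fields concrete, the following is a minimal sketch (not the actual hardware structure) of what one LSQ entry might look like in a simple software model; the field names (`is_store`, `addr`, `val`, `done`) are hypothetical stand-ins for `L/S`, `Addr`, `Val`, and `C`. Later sketches in this lesson build on this structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LSQEntry:
    """One load-store queue entry (hypothetical model of the fields above)."""
    is_store: bool                 # L/S: True for a store (S), False for a load (L)
    addr: Optional[int] = None     # Addr: resolved memory address (None until computed)
    val: Optional[int] = None      # Val: value to be stored, or value produced by the load
    done: bool = False             # C: has the instruction completed?
```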
Consider the following program:
LW R1, 0(R1) # I1
SW R2, 0(R3) # I2
LW R4, 0(R4) # I3
SW R5, 0(R0) # I4
LW R5, 0(R8) # I5
Initially, the load-store queue (LSQ) is empty, as shown above. It is subsequently populated in program-order.
The first instruction (`I1`) accesses the computed memory address `104`. Because there are no previous store instructions, this instruction fetches directly from memory (`MEM`).
The second instruction (`I2`) stores the value `15` at the computed address `204`. The instruction is marked as completed in field `C`; however, its memory write is effectively "delayed," and therefore it remains in the load-store queue (LSQ).
The third instruction (`I3`) accesses the computed memory address `204`. In general, for every load instruction, the load-store queue (LSQ) is checked to determine whether any (upstream) store instruction matches the computed address.
- If there is no matching store instruction for that address (e.g., as seen previously for instruction `I1`), then the load proceeds to access memory (`MEM`) directly.
- Conversely, if there is a matching store instruction for that address (as in this case for instruction `I3`), memory (`MEM`) is not accessed directly; instead, store-to-load forwarding is performed, whereby the value from the matching upstream store instruction (i.e., `15` from instruction `I2`, forwarded to `I3`) is produced by the load instruction in question without a corresponding memory (`MEM`) access.
N.B. Store-to-load forwarding as described here assumes that the upstream store instruction(s) have already executed by the time the load executes, so that their computed addresses (and values) are known/available at that point.
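As a rough illustration of this check (a sketch under the assumption above that all older store addresses are already known, reusing the hypothetical `LSQEntry` from earlier), a load scans the older LSQ entries from youngest to oldest and either forwards from a matching store or falls through to memory:

```python
def forward_or_load(lsq, load_index, memory):
    """Store-to-load forwarding sketch (assumes all older store addresses are known).

    `lsq` is a list of LSQEntry objects in program order, `load_index` is the
    position of the load, and `memory` is a dict modeling the data cache/memory.
    """
    load = lsq[load_index]
    # Scan older entries, youngest first, for a store to the same address.
    for entry in reversed(lsq[:load_index]):
        if entry.is_store and entry.addr == load.addr:
            load.val = entry.val          # store-to-load forwarding
            break
    else:
        load.val = memory[load.addr]      # no matching store: access MEM
    load.done = True
    return load.val
```

Scanning from youngest to oldest ensures that, if several older stores match, the most recent one supplies the value; this property is revisited in the LSQ example later in this lesson.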
Proceeding similarly with the subsequent instructions (i.e., `I4` and `I5`), observe that it is entirely possible for the store instruction (`I4`) to not yet have a computed address available at the point at which the load instruction (`I5`) produces its own computed address (e.g., `174`).
The possible resolution options in such a scenario are as follows:
- 1 - In-order execution of loads and stores (i.e., a load instruction is not allowed to execute until all previous load and store instructions have been completed).
  - This guarantees that at the point of execution of the load instruction, all of the store addresses and values are available.
  - However, this is restrictive in practice: no load can proceed until all upstream memory instructions have resolved. For example, if the load instruction `I1` were a cache miss (requiring resolution before it can complete), then all of the downstream memory instructions would have to wait for this resolution to occur.
- 2 - Such waiting can be reduced by not waiting for all upstream memory instructions to complete, but rather waiting only until all previous store addresses are known/computed.
  - Once all store addresses are known, the load address can be compared against them, and the outcome determined accordingly: memory (`MEM`) is accessed if nothing matches, the value is retrieved from the matching store if it is available, or the load waits for that store to produce its value if the address matches but the value has not yet been computed.
- 3 - Alternatively, the most aggressive option is to simply proceed with the load instruction anyway.
Following the third option, once the address is computed for the load instruction (e.g., `174`), the computed addresses of the previous store instructions are checked as follows:
- If one of the store instructions matches, then the load waits for it (i.e., fetching from memory [`MEM`] is clearly inappropriate in this case).
- Otherwise, if none of the store instructions match, then we ignore the fact that some of them might still match once their addresses are resolved; in this case, the load instruction fetches from memory (`MEM`) (denoted by the red arrow in the figure shown above). At this point, one of two things can occur:
  - 1 - The store instruction resolves to a different address (e.g., `74`, denoted by green font in the figure shown above), in which case the load instruction is confirmed to have completed correctly.
  - 2 - The store instruction computes the same address (i.e., `174`, denoted by magenta font in the figure shown above), in which case the load instruction has loaded an incorrect (stale) value from memory, and may have inadvertently supplied it to downstream instructions; in this case, we must recover accordingly. When the store instruction produces its address, we check whether any downstream load instructions match that address and have already completed execution; if so, the load instruction (along with its dependent downstream instructions) is requested to be "redone," thereby resolving the issue of the load having used the incorrect value.
Most modern processors use the third option ("go anyway") because it yields the best performance. In practice, if load instructions are permitted to proceed without waiting for unresolved store instructions, the result is correct most of the time, achieving a net speedup: there is far less waiting on store instructions, which outweighs the occasional cost of recovery when the assumption about an upstream store's address turns out to be wrong.
- Furthermore, there are also entire schemes around attempting to predict when such an incorrect assumption will occur, thereby proceeding with load instructions only if this occurrence is unlikely.
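To make the "go anyway" recovery check more concrete, here is a minimal sketch (reusing the hypothetical `LSQEntry` from earlier, not any particular processor's mechanism) of what is checked when a store finally resolves its address:

```python
def loads_to_redo(lsq, store_index):
    """Recovery check sketch for the "go anyway" policy.

    When the store at `store_index` finally resolves its address, return the
    younger loads that already executed against that same address; those loads
    (and their dependent instructions) must be squashed and redone.
    """
    store = lsq[store_index]
    return [entry for entry in lsq[store_index + 1:]
            if (not entry.is_store) and entry.done and entry.addr == store.addr]
```

A real implementation would also stop the scan at a younger store to the same address (which shields later loads), but that refinement is omitted here for brevity.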
Now, consider what occurs with out-of-order execution of load and store instructions under this "aggressive" approach, in which a load fetches from memory as soon as its address is computed and no preceding store has (yet) computed the same address, even though a preceding store may later turn out to resolve to that same address after all.
Consider the following example program, whose load and store instructions will occupy the load-store queue (LSQ):
LOAD R3 = 0(R6) # I1
ADD R7 = R3 + R9 # I2
STORE R4 → 0(R7) # I3
SUB R1 = R1 - R2 # I4
LOAD R8 = 0(R1) # I5
In an out-of-order processor, when attempting to execute these instructions, the following might occur. (Assume for purposes of discussion that all of these instructions have already been fetched, decoded, etc.)
For instruction `I1` (as in the figure shown above), because the `LOAD` only depends on `R6` (which is not produced by any other instruction), it can dispatch. Therefore, it fetches from memory and eventually returns the corresponding value for `R3`.
However, in this scenario, there is a cache miss. Consequently, the `LOAD` instruction is delayed pending retrieval of the correct memory value, and therefore instruction `I1` cannot complete until this occurs, thereby delaying the execution of downstream instructions `I2` and `I3` (via the corresponding dependency on `R3`).
Meanwhile, instruction `I4` can be dispatched (as in the figure shown above), thereby producing `R1` very quickly.
At this point, `I5` can also subsequently dispatch (as in the figure shown above), which in this scenario results in a cache hit. Consequently, this `LOAD` instruction produces `R8` very quickly, thereby supplying it to downstream instructions accordingly.
At this point in the program (as in the figure shown above):
- Instructions `I4` and `I5` have completed and are pending commit.
- Meanwhile, upstream instructions `I1`, `I2`, and `I3` are still pending execution, waiting on the `LOAD` instruction (`I1`) to complete.
Eventually, the `LOAD` instruction (`I1`) completes (as in the figure shown above). The resulting value `R3` is subsequently fed into instruction `I2` (which in turn completes execution promptly thereafter), and similarly `R7` is fed from instruction `I2` to `I3` for its subsequent execution.
Assume the resulting address from instruction `I5` was `X`. First, suppose that the resulting address from instruction `I3` is not `X` (i.e., the address referenced by `R7` in instruction `I3` is different from the address referenced by `R1` in downstream instruction `I5`). In this case, there is no issue/conflict, since the `STORE` in `I3` stores to a different, unrelated address.
Conversely, consider the situation in which instruction `I3` does compute the value `X` for `R7` (as in the figure shown above). In this case, the value stored from `R4` by instruction `I3` is the value that downstream instruction `I5` should actually have loaded (from the address in `R1`); however, `I5` has already executed by this point, using a stale value loaded from memory.
We will next discuss the resolution measures for this scenario.
The first solution to the problem encountered in the previous section is to simply avoid the strategy of out-of-order load-store execution altogether, and instead perform in-order load-store execution.
In the case of in-order load-store execution (as in the figure shown above):
- The `LOAD` in instruction `I1` is a cache miss, resulting in a stall of the subsequent two instructions (`ADD`/`I2` and `STORE`/`I3`, respectively) pending completion of instruction `I1`.
- The `SUB` in instruction `I4` completes, providing the address `R1` for the subsequent `LOAD` in instruction `I5`. At this point, the load-store queue checks whether any preceding memory instruction might resolve to the same address and, more fundamentally, whether all of the preceding memory instructions have even completed in the first place (if not, the downstream `LOAD` is stalled anyway, pending completion of the upstream instructions). Consequently, the `LOAD` in instruction `I5` does not fetch from memory, despite having a resolved address available (i.e., via `R1`).
Eventually, the cache miss in instruction `I1` is resolved, and the `LOAD` instruction is able to proceed (as in the figure shown above). Consequently, instruction `I2` is able to compute `R7` promptly thereafter, and similarly instruction `I3` can now perform its corresponding `STORE`. Furthermore, as instruction `I1` completes, instruction `I4` completes subsequently thereafter.
Note that the `STORE` in instruction `I3` is not considered "completed" when it commits, but rather once its address operand `R7` becomes known (along with the value to store, i.e., `R4`); only at this point can the `LOAD` in instruction `I5` proceed, by checking the preceding instructions to determine whether any `STORE` to its address (via `R1`) supplies the value it needs. Therefore, instruction `I5` can only execute after `I3`, which in turn only completes execution after the `LOAD` in instruction `I1` has completed.
Effectively, most of the instructions are executed out of order; however, load and store instructions still proceed in order. Of course, this is clearly suboptimal when the load and store instructions involve distinct addresses, as it introduces otherwise unnecessary delays for downstream instructions that do not actually depend on the upstream instructions' resolved addresses. This problem is further exacerbated if, in addition to this delay, the downstream load instructions (e.g., `I5`) result in cache misses, thereby introducing additional execution delays.
Therefore, strictly in-order load-store execution is not a very high-performance resolution strategy; however, it does ensure program correctness.
Consider the following program:
LW R1, 0(R2) # I1
SW R1, 4(R2) # I2
LW R1, 0(R3) # I3
SW R1, 4(R3) # I4
LW R1, 0(R4) # I5
SW R1, 4(R4) # I6
Assume that all of the load instructions (`LW`) are cache misses, each resulting in a `40`-cycle delay.
The execution proceeds as follows:
| Instruction | Cycle of request to memory (`MEM`) | Cycle of response from memory (`MEM`) | Cycle of store execution |
|---|---|---|---|
| `I1` | C1 | C41 | (N/A) |
| `I2` | (N/A) | (N/A) | C42 |
| `I3` | C2 | C42 | (N/A) |
| `I4` | (N/A) | (N/A) | C43 |
| `I5` | C3 | C43 | (N/A) |
| `I6` | (N/A) | (N/A) | C44 |
Therefore, the overall execution requires `44` cycles in total. However, observe that the load instructions (`LW`) are performed "prematurely" with respect to the store instructions (`SW`): e.g., in the case of `I3`, since `R3 + 0 == R2 + 4` (the address written by upstream store instruction `I2`), instruction `I3` must in actuality wait for instruction `I2` to write the correct memory value first before proceeding with its own execution; and so on for the remaining `LW`-`SW` instruction pairs.
Therefore, modify this sequence accordingly for in-order execution to ensure program correctness (i.e., in which cycle does each instruction send a request to memory, and when is the corresponding result returned).
Answer and Explanation:
| Instruction | Cycle of request to memory (`MEM`) | Cycle of response from memory (`MEM`) | Cycle of store execution |
|---|---|---|---|
| `I1` | C1 | C41 | (N/A) |
| `I2` | (N/A) | (N/A) | C42 |
| `I3` | C43 | C83 | (N/A) |
| `I4` | (N/A) | (N/A) | C84 |
| `I5` | C85 | C125 | (N/A) |
| `I6` | (N/A) | (N/A) | C126 |
As before, instruction `I1` sends a request to memory (`MEM`) in cycle `C1`, with a corresponding response in cycle `C41`. However, instruction `I2` cannot proceed fully until `R1` is produced by upstream instruction `I1`; this resolution occurs in cycle `C42`.
Consequently, per the in-order execution requirement, instruction `I3` cannot proceed until cycle `C43`, with a corresponding response from `MEM` in cycle `C83`. At this point, instruction `I4` can commence execution in cycle `C84`.
Analogously, instruction `I5` does not proceed until cycle `C85`, with a corresponding response from `MEM` in cycle `C125`. Finally, instruction `I6` commences execution in cycle `C126`.
Therefore, with in-order execution, this program executes in `126` total cycles: nearly a 3× slowdown relative to out-of-order execution (cf. `44` total cycles). This demonstrates that there is a marked advantage in reordering load and store instructions; however, this carries the risk of potentially requiring recovery in the event of loading an incorrect memory value.
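As a quick back-of-the-envelope check of these totals (assuming, as in the tables above, one cycle to issue each request or store and a 40-cycle wait for each miss response):

$$
T_{\text{in-order}} = 3 \times (1 + 40 + 1) = 126 \ \text{cycles}
\qquad
T_{\text{out-of-order}} = 1 + 40 + 3 = 44 \ \text{cycles}
$$

Each in-order `LW`-`SW` pair occupies 42 cycles before the next pair can begin, whereas out of order the three misses overlap, so only the last few responses and stores extend beyond the first miss.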
Before discussing how to recover from load operations performed prematurely, first consider the store-to-load forwarding problem.
For a load instruction, we must consider: Which upstream store instruction does it retrieve the value from?
- Generally, there can be multiple store instructions corresponding to the same address, any/all of which pertain to the load instruction in question. Therefore, it is necessary to determine which of these in particular will supply the necessary value to the load instruction.
- Furthermore, if none of the upstream store instructions matches, then this too must be determined, so that the value can instead be fetched from memory (`MEM`).
When a store instruction is finally resolved, we then must consider: Which downstream load instruction(s) does it supply the value to?
- There might be a load instruction whose address is already determined; once the store instruction in question determines the corresponding address and value, it should correspondingly relay this value to the load instruction.
Furthermore, note that there can be multiple downstream load instructions pending this store-produced value; in this case, how is the correct recipient determined? This is done via the load-store queue (LSQ), as discussed next (and sketched below).
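As a rough sketch of this direction of the check (reusing the hypothetical `LSQEntry` model from earlier, and assuming for simplicity that the relevant addresses are already resolved), a newly resolved store can walk forward through the LSQ and supply its value to the younger loads still waiting on that address:

```python
def wake_waiting_loads(lsq, store_index):
    """When the store at `store_index` has both its address and value, forward
    that value to younger loads waiting on the same address."""
    store = lsq[store_index]
    for entry in lsq[store_index + 1:]:
        if entry.is_store and entry.addr == store.addr:
            break                      # a newer store now "owns" this address
        if (not entry.is_store) and entry.addr == store.addr and not entry.done:
            entry.val = store.val      # forward the store's value to the load
            entry.done = True
```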
To illustrate the operation of the load-store queue (LSQ), consider the example in the figure shown above. The load-store queue (LSQ) itself is ordered from oldest to newest instructions (stored in program-order), and consists of the following entries:
- `L/S` - denotes a `LOAD` (`L`) or `STORE` (`S`) instruction
- `PC` - the address (per the program counter) from which the load or store instruction was fetched
- `Seq` - the sequence number, which is simply incremented with each instruction
  - N.B. This section's text will reference instructions unambiguously by `Seq` (e.g., "instruction `41773`").
- `Addr` - the address to which the load or store instruction resolves
- `Value` - the resulting value computed by the load or store instruction
Furthermore, the data cache contents are also shown for additional reference, initialized as in the figure shown above.
Instruction `41773` executes first, accessing address `0x3290` (via the data cache) and producing the corresponding value `42`.
Subsequently, instruction `41774` (prematurely) computes its result as `25` (i.e., the tentative value that the store will eventually write to memory); however, this value is not yet placed in the data cache, pending the instruction's final commit.
Similarly, instruction `41775` (prematurely) computes its result as `-17`.
Instruction `41776` accesses address `0x3418` and determines whether any of the upstream store instructions matches this address. Since no upstream store instruction matches, instruction `41776` consequently loads the value `1234` from the data cache.
Instruction `41777` accesses address `0x3290` and determines whether any of the upstream store instructions matches this address. It determines that instruction `41775` matches (the most recent upstream store to this address), and correspondingly obtains the value `-17` directly via forwarding, rather than retrieving it from the data cache.
Instruction `41778` accesses address `0x3300` and determines whether any of the upstream store instructions matches this address. Since no upstream store instruction matches, instruction `41778` consequently loads the value `1` from the data cache.
Instruction `41779` (prematurely) computes its result as `0` (i.e., the tentative value that the store will eventually write to memory); however, this value is not yet placed in the data cache, pending the instruction's final commit.
Instruction `41780` accesses address `0x3410` and determines whether any of the upstream store instructions matches this address. It determines that instruction `41774` matches (the most recent upstream store to this address), and correspondingly obtains the value `25` directly via forwarding, rather than retrieving it from the data cache.
Instruction `41781` accesses address `0x3290` and determines whether any of the upstream store instructions matches this address. It determines that instruction `41779` matches (the most recent upstream store to this address), and correspondingly obtains the value `0` directly via forwarding, rather than retrieving it from the data cache.
- Observe that by this point there are several upstream store instructions to address `0x3290`; however, it is imperative to forward from the most recent one, as this is the value that memory location would hold at this point in an in-order execution of the program. Meanwhile, as far as the data cache is concerned, the corresponding value is still `42`: despite the several intermediate modifications since the first instruction (i.e., instruction `41773`), none of the intervening stores has yet sent its value to the data cache.
Finally, instruction `41782` accesses address `0x3300` and determines whether any of the upstream store instructions matches this address. Since no upstream store instruction matches, instruction `41782` consequently loads the value `1` from the data cache.
- N.B. Upstream instruction `41778`, which also accesses address `0x3300`, is a load instruction rather than a store instruction, and therefore does not affect the current instruction (i.e., instruction `41782`).
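For a concrete (purely illustrative) check of the forwarding behavior described above, the earlier `forward_or_load` sketch can be run over a small model of this LSQ; the entry list and cache contents below are transcribed from the walkthrough, with the sequence numbers shown only in the comments:

```python
# Entries in program order (Seq 41773 .. 41781); the stores' addresses and
# values are assumed already resolved, as in the walkthrough above.
lsq = [
    LSQEntry(is_store=False, addr=0x3290, val=42,   done=True),  # 41773 LOAD
    LSQEntry(is_store=True,  addr=0x3410, val=25,   done=True),  # 41774 STORE
    LSQEntry(is_store=True,  addr=0x3290, val=-17,  done=True),  # 41775 STORE
    LSQEntry(is_store=False, addr=0x3418, val=1234, done=True),  # 41776 LOAD
    LSQEntry(is_store=False, addr=0x3290),                       # 41777 LOAD
    LSQEntry(is_store=False, addr=0x3300, val=1,    done=True),  # 41778 LOAD
    LSQEntry(is_store=True,  addr=0x3290, val=0,    done=True),  # 41779 STORE
    LSQEntry(is_store=False, addr=0x3410),                       # 41780 LOAD
    LSQEntry(is_store=False, addr=0x3290),                       # 41781 LOAD
]
cache = {0x3290: 42, 0x3300: 1, 0x3410: 38, 0x3418: 1234}

print(forward_or_load(lsq, 4, cache))  # 41777 -> -17 (forwarded from 41775)
print(forward_or_load(lsq, 7, cache))  # 41780 -> 25  (forwarded from 41774)
print(forward_or_load(lsq, 8, cache))  # 41781 -> 0   (most recent matching store, 41779)
```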
At some later point, the load and store instructions begin to commit.
Instruction `41773` commits, copying its value to the corresponding register, thereby advancing the oldest pointer to the subsequent instruction `41774`.
Instruction `41774` commits, copying its value `25` to the corresponding data-cache address (overwriting its existing value `38`), thereby advancing the oldest pointer to the subsequent instruction `41775`.
Instruction `41775` commits, copying its value `-17` to the corresponding data-cache address (overwriting its existing value `42`), thereby advancing the oldest pointer to the subsequent instruction `41776`.
- N.B. The lecture video shows `17` rather than `-17` in the data cache, presumably a transcription error (as correspondingly suggested in the figure shown above).
Recall that values are sent to memory or to the cache at the point of commit (not at the point of execution). The reason for this is as follows. Consider the scenario in which an exception occurs at this point in the program (i.e., with the oldest pointer currently at instruction `41776`). In such a case, we can simply flush the remaining (uncommitted) instructions from the load-store queue (LSQ), thereby maintaining the integrity of the data cache: its contents reflect exactly the committed state of the program at this point (and, more generally, at any given point in the program's execution as of the most recent commit).
- This corresponds analogously to the architectural register file (ARF) seen previously (cf. Lesson 8), which similarly involves copying values to registers at commit, thereby ensuring that at any given time there is no ambiguity as to the intended (i.e., committed) register values at that point in the program's execution.
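Continuing the earlier sketch (a hypothetical model, not the actual hardware), commit-time behavior and exception recovery can be summarized as follows: only a committing store touches the cache, and uncommitted entries can simply be discarded.

```python
def commit_oldest(lsq, cache):
    """Commit sketch: a store's value reaches the data cache only at commit."""
    entry = lsq.pop(0)                 # advance the "oldest" pointer
    if entry.is_store:
        cache[entry.addr] = entry.val  # memory/cache is updated only now
    # A load has already delivered its value to registers; nothing to write here.

def flush_on_exception(lsq):
    """Exception recovery sketch: uncommitted LSQ entries are discarded, and the
    data cache is untouched because their stores never wrote to it."""
    lsq.clear()
```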
Proceeding accordingly, instruction `41776` commits, copying its value to the corresponding register, thereby advancing the oldest pointer to the subsequent instruction `41777`; the subsequent two load instructions (i.e., `41777` and `41778`) then commit correspondingly.
Instruction `41779` commits, copying its value `0` to the corresponding data-cache address (overwriting the `-17` written there by instruction `41775`), thereby advancing the oldest pointer to the subsequent instruction `41780`.
Finally, the last-remaining load instructions commit.
Consider the relationship between the load-store queue (LSQ), the reorder buffer (ROB), and reservation stations (RSes).
When issuing a load or store instruction, this requires the following:
- A ROB entry, which is required for every instruction in general
- An LSQ entry, which corresponds analogously (i.e., for load and store instructions specifically) to the RS in a ROB-based configuration
When issuing instructions other than a load or store instruction, this requires the following:
- A ROB entry, which (as before) is required for every instruction in general
- An RS of the corresponding instruction type
Note that a load or store instruction cannot be issued unless both a ROB entry and an LSQ entry are available for the instruction in question. Correspondingly, for instructions other than load or store instructions, the instruction in question cannot be issued unless both a ROB entry and an RS (of the corresponding instruction type) are available.
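This issue condition is simple enough to summarize in a few lines (a sketch; the parameter names are hypothetical free-entry counts):

```python
def can_issue(is_mem_op, rob_free, lsq_free, rs_free):
    """Issue check sketch: a load/store needs a ROB entry *and* an LSQ entry;
    any other instruction needs a ROB entry *and* a matching reservation station."""
    if is_mem_op:
        return rob_free > 0 and lsq_free > 0
    return rob_free > 0 and rs_free > 0
```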
When executing a load or store instruction, this consists of two steps:
- 1 - Compute the address
- 2 - Produce the value
For a load instruction, this entails first computing the address and then retrieving the value from memory. Conversely, for a store instruction, these steps can be performed in either order; either way, a store instruction computes the address while also attempting to obtain the value of the register whose contents it will store at that address.
The operation write result only occurs for load instructions; conversely, for a store instruction, there is no such corresponding operation (i.e., a store instruction does not write the result, but rather maintains the address and the value in the load-store queue [LSQ] for subsequent use by downstream load instructions, as well as for eventual committing to memory).
- As soon as a load instruction obtains its result (whether from memory or forwarded from an upstream store instruction), it broadcasts this result to downstream dependent instructions, thereby ensuring that all reservation stations (RSes) awaiting this register value can then proceed accordingly. In this manner, the LSQ plays a role analogous to that of the RSes.
To subsequently commit the load or store instruction, the ROB head is advanced (thereby freeing the ROB entry accordingly), and correspondingly the LSQ head is advanced as well (thereby freeing the LSQ entry accordingly).
Additionally, for store instructions, the write must be sent to memory (`MEM`).
- On commit, the address and value retained (up to this point) in the LSQ are finally written out to memory on behalf of the program.
Given the following consecutive program instructions:
SW R1 → 0(R2)
LW R2 ← 0(R2)
Does the instruction `LW` access cache or memory? (Indicate `Yes` or `No`.)
No
Answer and Explanation:
The instruction `LW` does not access cache or memory. Since `R2` refers to the same address in both instructions, the `LW` retrieves its value from the store (via store-to-load forwarding).
As a follow-up to the quiz in the previous section, given that the instruction `LW` does not retrieve its value from the cache or memory (but rather from the store), where exactly does the instruction `LW` get its value from? (Select all applicable choices.)
- A result broadcast - DOES NOT APPLY
  - The store does not broadcast its resulting value; results are only broadcast in this manner by instructions producing a register result (which a store instruction does not).
- A reservation station (RS) - DOES NOT APPLY
  - RSes never provide results to subsequent instructions, but rather only capture values for the current instruction. Furthermore, store instructions do not interact with RSes in this manner anyhow.
- A reorder buffer (ROB) entry - DOES NOT APPLY
  - A ROB entry maintains the result of a register-producing instruction between the time of broadcast and the time of commit; however, because a store instruction does not produce a register value, it does not place its result in a ROB entry. (In fact, the store does not even have such a "result" to place in a ROB entry in the first place.)
- A load-store queue (LSQ) entry - APPLIES
  - The store instruction maintains its value in the load-store queue. This is where the downstream load instruction(s) search for a value when attempting to match their address against the corresponding upstream store instruction(s).
In this lesson, we have learned that a load-store queue is required to track dependencies through memory, thereby ensuring correct execution of memory instructions in an otherwise out-of-order processor.
Modern processors have fairly sophisticated load-store queues which facilitate a lot of reordering, including for memory instructions.