[hardware] 🐛 Suboptimal fix to reshuffle with LMUL > 1
If LMUL_X has X > 1, Ara injects one reshuffle at a time for each register
between Vn and V(n+X-1) that has an EEW mismatch.
Each of these reshuffles targets a different Vm with LMUL_1, but they all
target the same register (Vn with LMUL_X) from the point of view of the hazard
checks on the next instruction that has a dependency on Vn with LMUL_X.
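
As a rough illustration of the paragraph above, here is a small Python
behavioral sketch (not Ara's RTL) that lists the LMUL_1 reshuffles injected for
a register group. The names plan_reshuffles and reg_eew are hypothetical, and
the register state is made up for the example.

```python
def plan_reshuffles(vn, lmul, target_eew, reg_eew):
    """Return the registers in Vn..V(n+lmul-1) whose stored EEW differs
    from target_eew, lowest register first (assumed injection order)."""
    group = range(vn, vn + lmul)
    return [r for r in group if reg_eew[r] != target_eew]


# Hypothetical state: v0..v7 hold 32-bit elements, except v1 and v5 (64-bit).
reg_eew = {r: 32 for r in range(8)}
reg_eew[1] = 64
reg_eew[5] = 64

# An LMUL_8 instruction on v0 with EEW = 64 needs one reshuffle per
# mismatching register, each of them an LMUL_1 operation.
needed = plan_reshuffles(0, 8, 64, reg_eew)
```

Every entry of needed is a separate micro-operation, yet the next instruction
sees all of them as writes to the single LMUL_8 register v0.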

We cannot just inject one macro reshuffle, since the registers between
Vn and V(n+X-1) can have different encodings. So we need finer-grained
reshuffles, which mess up the dependency tracking.

For example,
vst @, v0 (LMUL_8)
will use the registers from v0 to v7. If they are all reshuffled, we
will end up with 8 reshuffle instructions that will get IDs from
0 to 7. The store will then see a dependency only on the reshuffle ID
that targets v0. This is wrong: if the store opreq is faster than
the slide opreq once the v0 reshuffle is over, the store will violate the
RAW dependency on the registers that are still being reshuffled.
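
The hazard in this example can be sketched with a few lines of Python. This is
a simplified model under the assumptions stated in the text (all eight
registers are reshuffled and the IDs are assigned in order from 0); the names
reshuffle_ids, required, tracked and unprotected are hypothetical.

```python
def reshuffle_ids(vn, lmul):
    """Map each register of the group to the ID of its reshuffle,
    assuming every register is reshuffled and IDs start at 0."""
    return {vn + i: i for i in range(lmul)}


ids = reshuffle_ids(0, 8)      # v0..v7 -> reshuffle IDs 0..7

# The LMUL_8 store reads v0..v7, so it should wait for all eight reshuffles.
required = set(ids.values())

# The hazard check, however, only records the reshuffle that targets v0.
tracked = {ids[0]}

# Reshuffles that the store is free to overtake (potential RAW violations).
unprotected = sorted(required - tracked)
```

In this model, seven of the eight reshuffles are not covered by the recorded
dependency.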

To avoid this, the safest, if suboptimal, fix is to just
wait in WAIT_IDLE after a reshuffle with LMUL > 1.

There are many possible optimizations to this:
 1) Check if, when LMUL > 1, we reshuffled more than 1 register.
If we reshuffle 1 reg only, we can also skip the WAIT_IDLE.
 2) Check if all the X registers need to be reshuffled (common case).
If this is the case, inject a single large reshuffle with LMUL_X and
skip WAIT_IDLE.
 3) To avoid waiting until idle, instead of going to WAIT_IDLE we can inject
the reshuffles starting from V(n+X-1) instead of Vn. This will automatically
adjust the dependency check and will speed up the whole operation a bit.
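
The three optimizations can be summarized as a decision procedure. The Python
sketch below is only a behavioral illustration, not the dispatcher's logic:
choose_strategy, reverse_injection_order and the strategy labels are
hypothetical names, and option 3 is shown under the assumption that reshuffles
complete in the order in which they are injected.

```python
def choose_strategy(vn, lmul, target_eew, reg_eew):
    """Pick a strategy for reshuffling the group Vn..V(n+lmul-1)."""
    mismatched = [r for r in range(vn, vn + lmul) if reg_eew[r] != target_eew]
    if len(mismatched) <= 1:
        # Option 1: at most one reshuffle, so one dependency is enough.
        return "skip_wait", mismatched
    if len(mismatched) == lmul:
        # Option 2: every register mismatches, one LMUL_X reshuffle on Vn.
        return "macro_reshuffle", [vn]
    # Mixed case: fall back to the fix in this commit.
    return "wait_idle", mismatched


def reverse_injection_order(mismatched):
    """Option 3: inject from the highest register down to the lowest."""
    return sorted(mismatched, reverse=True)
```

For instance, with v1 and v5 already at the target EEW in an LMUL_8 group
starting at v0, choose_strategy reports the mixed case, and
reverse_injection_order would issue the reshuffles as v7, v6, v4, v3, v2, v0.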
mp-17 committed Jun 18, 2024
1 parent 0eb5a76 commit 9cad8c8
Showing 1 changed file with 14 additions and 1 deletion.
15 changes: 14 additions & 1 deletion hardware/src/ara_dispatcher.sv
@@ -401,7 +401,20 @@ module ara_dispatcher import ara_pkg::*; import rvv_pkg::*; #(
     default:;
   endcase

-  if (reshuffle_req_d == 3'b0) state_d = NORMAL_OPERATION;
+  if (reshuffle_req_d == 3'b0) begin
+    // If LMUL_X has X > 1, Ara can inject different reshuffle ops during RESHUFFLE,
+    // one per LMUL_1-register that needs to be reshuffled. In mixed cases, we have
+    // multiple instructions that reshuffle parts of the original LMUL_X-register
+    // (e.g., LMUL_8, vd = v0, eew = 64, and only v1 and v5 have eew = 64). In this
+    // case, the dependency of the next LMUL_8 instruction on v0 should be on all
+    // the reshuffle micro operations. This is not possible with the current architecture.
+    // Therefore, we either set the dependency on the very last instruction only, or
+    // we just wait until the reshuffle is over.
+    // The best optimization would be injecting contiguous reshuffles with X > 1 and
+    // an extended vl. If we injected only one reshuffle, we can skip the wait idle.
+    if (csr_vtype_q.vlmul != LMUL_1) state_d = WAIT_IDLE;
+    else state_d = NORMAL_OPERATION;
+  end
 // The register is not completely reshuffled (LMUL > 1)
 end else begin
   // Count up