[hardware] 🐛 Suboptimal fix to reshuffle with LMUL > 1

If LMUL_X has X > 1, Ara injects one reshuffle at a time for each register between Vn and V(n+X-1) that has an EEW mismatch. Each of these reshuffles operates on a different Vm with LMUL_1, but from the point of view of the hazard checks on the next instruction that depends on Vn with LMUL_X, they all target the same register (Vn with LMUL_X). We cannot just inject one macro reshuffle, since the registers between Vn and V(n+X-1) can have different encodings. So we need finer-grained reshuffles, and these mess up the dependency tracking.

For example, vst @, v0 (LMUL_8) will use the registers from v0 to v7. If they are all reshuffled, we end up with 8 reshuffle instructions that get IDs from 0 to 7. The store will then see a dependency only on the reshuffle ID that targets v0. This is wrong: if the store opreq is faster than the slide opreq once the v0 reshuffle is over, the store will violate the RAW dependency on the registers that are still being reshuffled.

To avoid this, the safest (and least optimal) fix is to just wait in WAIT_IDLE after a reshuffle with LMUL > 1.

There are many possible optimizations to this:
1) Check whether, when LMUL > 1, we reshuffled more than 1 register. If we reshuffled 1 register only, we can skip the WAIT_IDLE.
2) Check whether all the X registers need to be reshuffled (common case). If so, inject a single large reshuffle with LMUL_X and skip WAIT_IDLE.
3) Instead of waiting in WAIT_IDLE, inject the reshuffles starting from V(n+X-1) instead of Vn. This will automatically adjust the dependency check and will speed up the whole operation a bit.
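The hazard described above, and why optimization 3 would avoid it, can be illustrated with a small behavioral model. This is a sketch only, not Ara RTL: the function name `pending_at_release` is hypothetical, and the model assumes that reshuffles complete one at a time in injection order and that the dependent instruction tracks only the reshuffle targeting the base register Vn.

```python
def pending_at_release(base, lmul, reverse=False):
    """Return the registers still waiting to be reshuffled at the moment
    the dependent instruction (e.g. the LMUL_X store) is released.

    Assumptions (not taken from the Ara sources):
      - one LMUL_1 reshuffle is injected per register in [base, base + lmul)
      - reshuffles complete sequentially, in injection order
      - the dependent instruction tracks only the reshuffle that targets `base`
    """
    regs = list(range(base, base + lmul))
    # Optimization 3 injects from V(n+X-1) down to Vn instead of Vn upwards.
    order = regs[::-1] if reverse else regs

    done = []
    for reg in order:
        done.append(reg)
        if reg == base:
            # The only tracked reshuffle is over: the dependent op is released.
            break

    return [r for r in order if r not in done]


if __name__ == "__main__":
    # vst @, v0 with LMUL_8: forward injection releases the store too early.
    print("forward:", pending_at_release(0, 8))
    # Reverse injection: the tracked reshuffle (v0) is the last one to finish.
    print("reverse:", pending_at_release(0, 8, reverse=True))
```

Under these assumptions, forward injection leaves v1 to v7 pending when the store is released, which is the RAW violation the WAIT_IDLE fix guards against. Reverse injection leaves nothing pending, because the tracked reshuffle is the last one to complete.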
pulp-platform · Jun 18, 2024 · 9cad8c8
1 parent 0eb5a76
commit 9cad8c8
Showing 1 changed file with 14 additions and 1 deletion.
diff --git a/hardware/src/ara_dispatcher.sv b/hardware/src/ara_dispatcher.sv
@@ -401,7 +401,20 @@ module ara_dispatcher import ara_pkg::*; import rvv_pkg::*; #(
               default:;
             endcase
 
-            if (reshuffle_req_d == 3'b0) state_d = NORMAL_OPERATION;
+            if (reshuffle_req_d == 3'b0) begin
+              // If LMUL_X has X > 1, Ara can inject different reshuffle ops during RESHUFFLE,
+              // one per LMUL_1-register that needs to be reshuffled. In mixed cases, we have
+              // multiple instructions that reshuffle parts of the original LMUL_X-register
+              // (e.g., LMUL_8, vd = v0, eew = 64, and only v1 and v5 have eew = 64). In this
+              // case, the dependency of the next LMUL_8 instruction on v0 should be on all
+              // the reshuffle micro operations. This is not possible with the current architecture.
+              // Therefore, we either set the dependency on the very last instruction only, or
+              // we just wait until the reshuffle is over.
+              // The best optimization would be injecting contiguous reshuffles with X > 1 and
+              // an extended vl. If we injected only one reshuffle, we can skip the wait idle.
+              if (csr_vtype_q.vlmul != LMUL_1) state_d = WAIT_IDLE;
+              else state_d = NORMAL_OPERATION;
+            end
           // The register is not completely reshuffled (LMUL > 1)
           end else begin
             // Count up