Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[hardware] 🐛 Suboptimal fix to reshuffle with LMUL > 1
If LMUL_X has X > 1, Ara injects one reshuffle at a time for each register within Vn and V(n+X-1) that has an EEW mismatch. All these reshuffles are reshuffling different Vm with LMUL_1, but also the same register (Vn with LMUL_X) from the point of view of the hazard checks on the next instruction that has a dependency on Vn with LMUL_X. We cannot just inject one macro reshuffle since the registers between Vn and V(n+X-1) can have different encodings. So, we need finer-grain reshuffles that messes up the dependency tracking. For example, vst @, v0 (LMUL_8) will use the registers from v0 to v7. If they are all reshuffled, we will end up with 8 reshuffle instructions that will get IDs from 0 to 7. The store will then see a dependency on the reshuffle ID that targets v0 only. This is wrong, since if the store opreq is faster than the slide opreq once the v0-reshuffle is over, it will violate the RAW dependency. Not to mess this up, the safest and most suboptimal fix is to just wait in WAIT_IDLE after a reshuffle with LMUL > 1. There are many possible optimizations to this: 1) Check if, when LMUL > 1, we reshuffled more than 1 register. If we reshuffle 1 reg only, we can also skip the WAIT_IDLE. 2) Check if all the X registers need to be reshuffled (common case). If this is the case, inject a large reshuffle with LMUL_X only and skip WAIT_IDLE. 3) Not to wait until idle, instead of WAIT_IDLE we can inject the reshuffles starting from V(n+X-1) instead than Vn. This will automatically adjust the dependency check and will speed up a bit the whole operation.
- Loading branch information