Home
This page is intended to present design rationale and notes on future extensions that are currently out of scope.
-
Statically encoding SEW and LMUL
-
Predicates
- Predicating instructions with the complement of v0
- Predicating instructions with a register other than v0
Note: for straightforward implementations, this feature adds another regfile read port (or a map-table read port for renamed implementations).
- 2 input predicates? - useful in SIMT emulation (aggressive, interleaving diverged)
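As a rough illustration of the predicate options listed above, the following Python sketch models a masked elementwise add. The function names, the `complement` flag, and the choice to combine two predicates with a logical AND are all assumptions made for illustration; none of them are defined by the spec or proposed on this page.

```python
def masked_add(vd, vs1, vs2, mask, complement=False):
    """Add vs1 and vs2 where the mask is active; keep the old vd value elsewhere."""
    out = list(vd)
    for i in range(len(out)):
        # With complement=True the sense of the mask is flipped.
        active = bool(mask[i]) != complement
        if active:
            out[i] = vs1[i] + vs2[i]
    return out


def masked_add_two_predicates(vd, vs1, vs2, mask_a, mask_b):
    """Hypothetical two-predicate form: active only where both masks are set."""
    combined = [bool(a) and bool(b) for a, b in zip(mask_a, mask_b)]
    return masked_add(vd, vs1, vs2, combined)
```

Any list can stand in for the mask here, which corresponds to predicating on a register other than v0; in hardware that flexibility is what costs the extra read port.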
-
Memory addressing modes
- Indexed memory accesses that implicitly scale the index by SEW/8
- Indexed memory accesses that decouple index width from data width
- BaseReg + scale * IndexReg + offset
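The three addressing ideas above can be sketched with one address calculation. This is a minimal model, assuming byte addresses and zero-extended indices; the parameter names (`index_bits`, `scale`, `offset`) are hypothetical and do not correspond to any existing instruction field.

```python
def indexed_addresses(base, indices, sew_bits, index_bits=None, scale=None, offset=0):
    """Per-element byte addresses for base + scale * index + offset.

    scale defaults to SEW/8 (implicit scaling by the element size in bytes).
    index_bits defaults to SEW; setting it separately decouples the index
    width from the data width. Indices are zero-extended (an assumption).
    """
    if index_bits is None:
        index_bits = sew_bits
    if scale is None:
        scale = sew_bits // 8
    index_mask = (1 << index_bits) - 1
    return [base + scale * (idx & index_mask) + offset for idx in indices]
```

For example, `indexed_addresses(0x1000, [0, 1, 2], 64, index_bits=8)` models 8-bit indices addressing 64-bit elements, so a narrow index vector can drive a wide gather.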
-
Combinatoric explosion of operand types
-
This has historically been the biggest reason why I (Ag) want more than 32 bits of instruction encoding for vectors: each of the following is fairly simple and could fit in the 32-bit instruction format on its own, but there are just too many of them!
-
Mixed width, widening
- e.g. vs1.8[i] * vs2.16[i] =+ vd.32[i]
- signed X signed, signed X unsigned, unsigned X unsigned
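To make the operand-type explosion concrete, here is a small Python sketch of the mixed-width widening multiply-accumulate above (8-bit times 16-bit accumulated into 32-bit). The function and parameter names are illustrative assumptions; the signedness flags show how each combination would need its own encoding.

```python
def as_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def as_unsigned(x, bits):
    """Interpret the low `bits` bits of x as an unsigned value."""
    return x & ((1 << bits) - 1)


def widening_macc(vd, vs1, vs2, w1=8, w2=16, wd=32, signed1=True, signed2=True):
    """vd[i] += vs1[i] * vs2[i], with the result wrapped to wd bits."""
    conv1 = as_signed if signed1 else as_unsigned
    conv2 = as_signed if signed2 else as_unsigned
    out = []
    for acc, a, b in zip(vd, vs1, vs2):
        prod = conv1(a, w1) * conv2(b, w2)
        out.append((acc + prod) & ((1 << wd) - 1))
    return out
```

The same bit patterns give different results depending on signedness: with `vs1 = [0xFF]` and `vs2 = [0xFFFF]`, signed X signed accumulates 1, while unsigned X unsigned accumulates 255 * 65535.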
-
DSP datatypes, with saturation
- SS: saturate signed N bits --> signed M bits, M < N
- UU: saturate unsigned N bits --> unsigned M bits, M < N
- US: saturate unsigned N bits --> signed M bits, M < N
- SU: saturate signed N bits --> unsigned M bits, M < N
- this is ReLU, a common function in DL
- although this particular saturation would mainly be used at the end of a dot product
- e.g. in a reduction, or in an actual dot product
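The four saturation variants differ only in how the input is interpreted and which output range is used. A minimal sketch, assuming the input has already been decoded to a Python integer (signed or unsigned as appropriate), with a hypothetical `saturate` helper:

```python
def saturate(x, m_bits, signed_out):
    """Clamp integer x to the range of an M-bit signed or unsigned result."""
    if signed_out:
        lo, hi = -(1 << (m_bits - 1)), (1 << (m_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << m_bits) - 1
    return max(lo, min(hi, x))


# SS: signed in, signed out     -> saturate(x, M, signed_out=True)
# UU: unsigned in, unsigned out -> saturate(x, M, signed_out=False)
# US: unsigned in, signed out   -> saturate(x, M, signed_out=True)
# SU: signed in, unsigned out   -> saturate(x, M, signed_out=False)
```

In the SU case every negative input clamps to 0, which is why it behaves like ReLU (with an additional upper clamp at 2^M - 1).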
-
New FP types including instructions with Mixed FP types
- single X single =+ double
- FP16, BFLOAT16
- fp16 X fp16 =+ {single, fp16}
- bfloat16 X bfloat16 =+ {single, bfloat16}
- fp16 X single =+ single
- bfloat16 X single =+ single
- eight bit floating-point types...
- emerging standards? e.g. se4m3
- https://en.wikipedia.org/wiki/Minifloat
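The choice of accumulator type in the mixed FP forms above matters numerically. The sketch below emulates fp16, bfloat16, and single rounding in Python and applies a multiply-accumulate with a single final rounding. The helper names are assumptions, the bfloat16 rounding handles finite values only, and the single final rounding is a simplification rather than a statement about how any proposed instruction would round.

```python
import struct


def round_fp16(x):
    """Round a Python float to IEEE binary16 and back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def round_fp32(x):
    """Round a Python float to IEEE binary32 and back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


def round_bf16(x):
    """Round to bfloat16 (round-to-nearest-even); finite inputs only."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties to even
    bits &= 0xFFFF0000                   # keep the upper 16 bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]


def mixed_macc(acc, a, b, round_in, round_acc):
    """acc + a * b with inputs in one format and the accumulator in another."""
    return round_acc(acc + round_in(a) * round_in(b))
```

For instance, adding 1.0 * 1.0 to an accumulator holding 2048.0 leaves an fp16 accumulator at 2048.0, because 2049 is not representable in fp16, whereas a single-precision accumulator yields 2049.0.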
-
Mixed integer/fixed/floating point instructions
-
unums ??
-
Complex
- chunky or interleaved (re,im) vs (im,re)
- planar or SOA
- most common for existing GPU and/or vectors without complex support
- e.g. planar vector vector ops like add needs four inputs and two outputs
- but doing it as one instruction rather than decomposing improves ratio of compute to data movement
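The two layouts can be sketched as follows. This is an illustrative Python model, assuming the (re,im) interleaved order; the function names are hypothetical. Note that the planar operations take four input vectors and produce two, matching the operand count mentioned above.

```python
def interleaved_to_planar(v):
    """Split [re0, im0, re1, im1, ...] into ([re0, re1, ...], [im0, im1, ...])."""
    return v[0::2], v[1::2]


def planar_add(a_re, a_im, b_re, b_im):
    """Complex add on planar (SOA) data: four inputs, two outputs."""
    out_re = [x + y for x, y in zip(a_re, b_re)]
    out_im = [x + y for x, y in zip(a_im, b_im)]
    return out_re, out_im


def planar_mul(a_re, a_im, b_re, b_im):
    """Complex multiply on planar data: (a + bi)(c + di) = (ac - bd) + (ad + bc)i."""
    out_re = [ar * br - ai * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    out_im = [ar * bi + ai * br for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    return out_re, out_im
```

Decomposed into existing real-valued operations, `planar_mul` needs four multiplies and two add/subtracts, each re-reading operands; a single fused instruction would read each input once.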
-
Improved "scalar" support in vector registers
- e.g. instead of having reductions always write vd[0], and "wasting" rest of vd, specify which vector element the reduction "scalar" should be written to
- both static, and dynamic determined by another scalar
- similarly for "large scalars" that occupy more than one vector element * LMUL max, as occurs in some crypto instruction proposals
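A minimal sketch of the reduction idea above, assuming a sum reduction and a hypothetical `dest_index` operand that selects which element of vd receives the result (supplied either as a static immediate or from a scalar register):

```python
def reduce_sum_into(vd, vs, dest_index=0):
    """Sum-reduce vs and write the result to vd[dest_index]; other elements are kept."""
    out = list(vd)
    out[dest_index] = sum(vs)
    return out


# Packing several reduction results into one vector register instead of
# using only element 0 of four separate registers.
rows = [[1, 2], [3, 4], [5, 6], [7, 8]]
packed = [0, 0, 0, 0]
for i, row in enumerate(rows):
    packed = reduce_sum_into(packed, row, dest_index=i)
```

With `dest_index` fixed at 0 this degenerates to the current behaviour, where each reduction uses only vd[0].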