-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for ZCMT Extension for Code-Size Reduction in CVA6 #2659
base: master
Are you sure you want to change the base?
Conversation
❌ failed run, report available here. |
Hi @JeanRochCoulon how can we know which line in the code this spyglass failure refers to? |
@ASintzoff do you know how to help ? |
❌ failed run, report available here. |
❌ failed run, report available here. |
❌ failed run, report available here. |
5973264
to
645d24f
Compare
❌ failed run, report available here. |
@JeanRochCoulon |
is the extension completely optional? All the RTL added for Zcmt should be removed when the extension is not set. If not, it can increase the gate count. |
❌ failed run, report available here. |
❌ failed run, report available here. |
); | ||
end | ||
if (CVA6Cfg.RVZCMP) begin | ||
if (CVA6Cfg.RVZCMP || (CVA6Cfg.RVZCMT & ~CVA6Cfg.MmuPresent)) begin //MMU should be off when using ZCMT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet 2 cents that the gate count increase comes from this condition: decoder_macro and zcmt_decoder are both inferred by the same if condition. To me, decoder_macro depends on zcmp and zcmt_decoder depends on zcmt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JeanRochCoulon almost ~1k gate count is increased in issue_stage, which I think is because we added one signal in scoreboard struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we increase the expected_synth gate count. What do you think, @JeanRochCoulon?
❌ failed run, report available here. |
❌ failed run, report available here. |
❌ failed run, report available here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution. Here are some suggestions.
By the way, do you think it would be feasible to make it compatible with superscalar?
Also, there is still the area increase which is really impacting even when it is disabled at elaboration time.
input logic clk_i, // Clock | ||
input logic rst_ni, // Synchronous reset | ||
input logic [31:0] instr_i, // instruction | ||
input logic [CVA6Cfg.VLEN-1:0] pc_i, // PC | ||
input logic is_zcmt_instr_i, // Intruction is of macro extension | ||
input logic illegal_instr_i, // From compressed decoder | ||
input logic is_compressed_i, // is compressed instruction | ||
input jvt_t jvt_i, | ||
input dcache_req_o_t req_port_i, // Data cache request ouput - CACHE | ||
|
||
output logic [31:0] instr_o, // Instruction out | ||
output logic illegal_instr_o, // Illegel instruction | ||
output logic is_compressed_o, // is compressed instruction | ||
output logic fetch_stall_o, // Wait while address fetched from table | ||
output dcache_req_i_t req_port_o // Data cache request input - CACHE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most port documentations here miss the "connexion" field, used to produce tables such as https://cva6.readthedocs.io/en/latest/04_cv32a65x/design/design.html#frontend-description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cathales I have updated the description and added the IO port connection table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot see it as of 2124fa6.
The expected format is:
- A line with only a comment, before the port declaration
- In the comment, the port description followed by " - " and the connection
For instance:
// Data cache request input - CACHE
output dcache_req_i_t req_port_o
Not:
output dcache_req_i_t req_port_o // Data cache request input - CACHE
Nor:
// Data cache request input
output dcache_req_i_t req_port_o
jump_addr[10:1], | ||
jump_addr[11], | ||
jump_addr[19:12], | ||
5'h1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved_branch
does not currently update RAS in frontend
.
So this instruction does not update RAS, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cathales The resolved_branch signal is updated in the branch unit, ensuring that the RAS is also updated and enabling jump on calculated address instead of predicted address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU, resolved_branch
has no influence on the RAS as of f4355fa
cva6/core/frontend/frontend.sv at f4355fa49b6ed592579146dbfc65a34f3bbd6ed2 · openhwgroup/cva6
Instead, the frontend could be modified:
intsr_scan
could detect that it is a call- this call detection would rise
ras_push
- this
ras_push
signal would push the correct address into the RAS
end | ||
TABLE_JUMP: begin | ||
if (req_port_i.data_rvalid) begin | ||
jump_addr = $unsigned($signed(req_port_i.data_rdata) - $signed(pc_i)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not work if the target address is too far from the current program counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cathales This mechanism supports jumps within a range of ±1 MB from the current program counter, as the addr_jump field uses 20 bits as an offset immediate from the PC, it is similar to the JAL instruction in 32-bit RISC-V.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of this limitation, this implementation is not compliant with Zcmt, is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cathales according to ZCMT spec cm.jt and cm.jalt is equivalent to
• 32-bit j calls
• 32-bit jal ra calls
and J / jal ra calls only jump to ±1 MB
Thus implementation is complaint with zcmt as it is only for 32 bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At https://github.com/riscv/riscv-isa-manual/blob/main/src/zc.adoc#table-jump-overview RISC-V specification states
cm.jt and cm.jalt encodings index the table, giving access to functions within the full XLEN wide address space.
Therefore, the target addresses shall be any address and not only addresses near the current PC.
@cathales Currently, it is not compatible with a superscalar architecture because it stalls the pipeline in the decode stage until the implicit fetch is completed. Compatibility can be addressed in a future update alongside the ZCMP extension. |
@@ -48,6 +48,8 @@ module decoder | |||
input logic is_last_macro_instr_i, | |||
// Is mvsa01/mva01s macro instruction - macro_decoder | |||
input logic is_double_rd_macro_instr_i, | |||
//zcmt instruction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//zcmt instruction | |
// Zcmt instruction - FRONTEND |
@@ -373,6 +376,7 @@ package config_pkg; | |||
assert (Cfg.NrPMPEntries <= 64); | |||
assert (!(Cfg.SuperscalarEn && Cfg.RVF)); | |||
assert (!(Cfg.SuperscalarEn && Cfg.RVZCMP)); | |||
assert (!(Cfg.SuperscalarEn && Cfg.RVZCMT && ~CVA6Cfg.MmuPresent)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cfg.SuperscalarEn && Cfg.RVZCMT
won’t trigger this assertion. You could split into two distinct assertions:
assert (!(Cfg.SuperscalarEn && Cfg.RVZCMT && ~CVA6Cfg.MmuPresent)); | |
assert (!(Cfg.SuperscalarEn && Cfg.RVZCMT)); | |
assert (!(Cfg.RVZCMT && ~Cfg.MmuPresent)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the jobs fail because it should be Cfg
instead of CVA6Cfg
here.
resolved_branch_o.is_mispredict = 1'b1; // miss prediction for ZCMT | ||
resolved_branch_o.cf_type = ariane_pkg::Jump; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could send a JumpR
instead of Jump
. It would update BTB in the frontend to correctly predict the destination next time.
We could then calculate is_mispredict
accordingly, to use this prediction when we later encounter the same cm.jt
or cm.jalt
.
end | ||
TABLE_JUMP: begin | ||
if (req_port_i.data_rvalid) begin | ||
jump_addr = $unsigned($signed(req_port_i.data_rdata) - $signed(pc_i)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of this limitation, this implementation is not compliant with Zcmt, is it?
❌ failed run, report available here. |
Introduction
This PR implements the ZCMT extension in the CVA6 core, targeting the 32-bit embedded-class platforms. ZCMT is a code-size reduction feature that utilizes compressed table jump instructions (cm.jt and cm.jalt) to reduce code size for embedded systems
Note: Due to implementation complexity, ZCMT extension is primarily targeted at embedded class CPUs. Additionally, it is not compatible with architecture class profiles.(Ref. Unprivilege spec 27.20)
Key additions
Added zcmt_decoder module for compressed table jump instructions: cm.jt (jump table) and cm.jalt (jump-and-link table)
Implemented the Jump Vector Table (JVT) CSR to store the base address of the jump table in csr_reg module
Implemented a return address stack, enabling cm.jalt to behave equivalently to jal ra (jump-and-link with return address), by pushing the return address onto the stack in zcmt_decoder module
Implementation in CVA6
The implementation of the ZCMT extension involves the following major modifications:
compressed decoder
The compressed decoder scans and identifies the cm.jt and cm.jalt instructions, and generates signals indicating that the instruction is both compressed and a ZCMT instruction.
zcmt_decoder
A new zcmt_decoder module was introduced to decode the cm.jt and cm.jalt instructions, fetch the base address of the JVT table from JVT CSR, extract the index and construct jump instructions to ensure efficient integration of the ZCMT extension in embedded platforms. Table.1 shows the IO port connection of zcmt_decoder module. High-level block diagram of zcmt implementation in CVA6 is shown in Figure 1.
Table. 1 IO port connection with zcmt_decoder module
branch unit condition
A condition is implemented in the branch unit to ensure that ZCMT instructions always cause a misprediction, forcing the program to jump to the calculated address of the newly constructed jump instruction.
JVT CSR
A new JVT csr is implemented in csr_reg which holds the base address of the JVT table. The base address is fetched from the JVT CSR, and combined with the index value to calculate the effective address.
No MMU
Embedded platform does not utilize the MMU, so zcmt_decoder is connected with cache through port 0 of the Dcache module for implicit read access from the memory.
Figure. 1 High level block diagram of ZCMT extension implementation
Known Limitations
The implementation targets 32-bit instructions for embedded-class platforms without an MMU. Since the core does not utilize an MMU, it is leveraged to connect the zcmt_decoder to the cache via port 0.
Testing and Verification
Test Plan
A test plan is developed to test the functionality of ZCMT extension along with JVT CSR. Directed Assembly test executed to check the functionality.
Table. 2 Test plan
Note: Please find the test under CVA6_REPO_DIR/verif/tests/custom/zcmt"