From 06dcb966dac06ed204153f9ebf394fe7d5c6b248 Mon Sep 17 00:00:00 2001
From: Matteo Perotti
Date: Tue, 18 Jun 2024 14:25:04 +0200
Subject: [PATCH] doc: Update tutorial and add trace analysis page (#149)

* doc: Update tutorial

* docs: Add trace analysis page

* requirements.txt: Upgrade torch after numpy>2.0 upgrade

---------

Co-authored-by: Luca Colagrande
---
 docs/ug/benchmarking.md         |  69 ----------------
 docs/ug/trace_analysis.md       | 135 ++++++++++++++++++++++++++++++++
 mkdocs.yml                      |   6 +-
 sw/dnn/requirements.txt         |   2 +-
 target/snitch_cluster/README.md |  76 +++++++++---------
 5 files changed, 178 insertions(+), 110 deletions(-)
 delete mode 100644 docs/ug/benchmarking.md
 create mode 100644 docs/ug/trace_analysis.md

diff --git a/docs/ug/benchmarking.md b/docs/ug/benchmarking.md
deleted file mode 100644
index 63f01f784..000000000
--- a/docs/ug/benchmarking.md
+++ /dev/null
@@ -1,69 +0,0 @@
# Benchmarking

## Choosing the right platform

To start developing and benchmarking applications on Snitch, we advise starting development on `banshee` and benchmarking on the RTL in a later step for cycle-accurate results. `banshee` is a good starting point to functionally verify your application and to get a first impression of its performance. `banshee` can generate traces for you, which already allow a rough estimate of the FPU utilization, for instance. However, `banshee` does not model delays for instructions and memory accesses; cycle-accurate results are therefore only possible with RTL simulations or on the FPGA.

## Generating traces

To generate traces, `spike-dasm` must be installed and available in the `PATH`. Building it from the source in this repository adds support for disassembling Snitch-custom instructions. We refer to the [Quick Start](./getting_started.md#Quick-Start) for instructions on installing `spike-dasm`.

Traces are automatically generated when you run the following target for RTL simulations from the `build` folder:

```bash
make run-rtl-my_binary
```

and, for `banshee`:

```bash
make run-banshee-my_binary
```

Alternatively, you can also generate traces for the RTL by running the following target in the `hw/system/my_platform` folder:

```bash
make traces
```

## RTL Traces

The traces are stored in the `logs` folder, one per core. The trace for core 0 is stored in `trace_hart_00000000.txt`, the trace for core 1 in `trace_hart_00000001.txt`, and so on. A summary of a few statistics for the specific core is appended at the end of each trace file. The following example shows such a summary:

```
## Performance metrics

Performance metrics for section 0 @ (11, 3459):
snitch_loads 89
snitch_stores 89
fpss_loads 0
snitch_avg_load_latency 22.9888
snitch_occupancy 0.1334
snitch_fseq_rel_offloads 0.0650
fseq_yield 1.0
fseq_fpu_yield 1.0
fpss_section_latency 0
fpss_avg_fpu_latency 1.0
fpss_avg_load_latency 0
fpss_occupancy 0.0093
fpss_fpu_occupancy 0.0093
fpss_fpu_rel_occupancy 1.0
cycles 3449
total_ipc 0.1427
```

The trace script also allows splitting the execution into multiple sections. The sections are delimited by reads of the `mcycle` CSR: reading this register returns the current cycle count, but also serves as a trigger for the trace script to start a new section.
The following example shows how to split the execution into two sections:

```c
#include "sw/vendor/riscv-opcodes/encoding.h"
size_t benchmark_get_cycle() { return read_csr(mcycle); }

// End of section 0, start of section 1
benchmark_get_cycle();

// Execute kernel to be benchmarked
my_kernel();

// End of section 1, start of section 2
benchmark_get_cycle();
```
diff --git a/docs/ug/trace_analysis.md b/docs/ug/trace_analysis.md
new file mode 100644
index 000000000..d1eddf633
--- /dev/null
+++ b/docs/ug/trace_analysis.md
@@ -0,0 +1,135 @@
# Trace Analysis

## Trace generation

During RTL simulation, the Snitch core complex (CC) dumps a wide set of information to the `logs/trace_hart_XXXXX.dasm` file (see [snitch_cc.sv](../../hw/snitch_cluster/src/snitch_cc.sv)), with `XXXXX` denoting the index of the Snitch core in the system.

The [gen_trace.py](../../util/trace/gen_trace.py) script can be used to process this information into a human-readable form; it is invoked by the `make traces` target to generate `logs/trace_hart_XXXXX.txt`.

## Trace walkthrough

Here is an excerpt of a trace generated by the previous script:
```
 1147000 1118 M 0x8000022c li a1, 47 #; (wrb) a1 <-- 47
 1148000 1119 M 0x80000230 scfgwi a1, 64 #; a1 = 47
 1149000 1120 M 0x80000234 scfgwi a3, 192 #; a3 = 8
 1150000 1121 M 0x80000238 scfgwi a1, 65 #; a1 = 47
 1151000 1122 M 0x8000023c scfgwi a3, 193 #; a3 = 8
 1152000 1123 M 0x80000240 scfgwi a1, 66 #; a1 = 47
 1153000 1124 M 0x80000244 scfgwi a3, 194 #; a3 = 8
 1154000 1125 M 0x80000248 mul a2, a2, a7 #; a2 = 0, a7 = 384
 1157000 1128 M #; (acc) a2 <-- 0
 1158000 1129 M 0x8000024c add a0, a0, a2 #; a0 = 0x10000000, a2 = 0, (wrb) a0 <-- 0x10000000
 1159000 1130 M 0x80000250 scfgwi a0, 768 #; a0 = 0x10000000
 1160000 1131 M 0x80000254 add a0, a5, a2 #; a5 = 0x10000c00, a2 = 0, (wrb) a0 <-- 0x10000c00
 1161000 1132 M 0x80000258 scfgwi a0, 769 #; a0 = 0x10000c00
 1162000 1133 M 0x8000025c auipc a0, 0x5 #; (wrb) a0 <-- 0x8000525c
 1163000 1134 M 0x80000260 addi a0, a0, -1396 #; a0 = 0x8000525c, (wrb) a0 <-- 0x80004ce8
 1165000 1136 M 0x80000268 add a0, a6, a2 #; a6 = 0x10001800, a2 = 0, (wrb) a0 <-- 0x10001800
 1166000 1137 M 0x8000026c scfgwi a0, 898 #; a0 = 0x10001800
              M 0x80000264 fld ft3, 0(a0) #; ft3 <~~ Doub[0x80004ce8]
 1168000 1139 M 0x80000270 csrsi ssr, 1 #;
 1169000 1140 M 0x80000274 frep 48, 1 #; outer, 48 issues
 1176000 1147 M #; (f:lsu) ft3 <-- -32.7504962
```

From left to right, the columns contain the following information: simulation time (in picoseconds), CPU cycle, privilege mode, program counter (i.e. the memory address of the current instruction), instruction mnemonic, instruction arguments, and a comment block reporting additional information.

We will now break down the previous trace and understand what is going on piece by piece.

```
 1147000 1118 M 0x8000022c li a1, 47 #; (wrb) a1 <-- 47
 1148000 1119 M 0x80000230 scfgwi a1, 64 #; a1 = 47
 1149000 1120 M 0x80000234 scfgwi a3, 192 #; a3 = 8
 1150000 1121 M 0x80000238 scfgwi a1, 65 #; a1 = 47
 1151000 1122 M 0x8000023c scfgwi a3, 193 #; a3 = 8
 1152000 1123 M 0x80000240 scfgwi a1, 66 #; a1 = 47
 1153000 1124 M 0x80000244 scfgwi a3, 194 #; a3 = 8
```

Snitch is a single-issue, single-stage, in-order core implementing the RV32I instruction set. Instructions are fetched in order and complete in a single cycle.
As can be seen from cycles 1118 to 1124, all instructions execute in a single cycle, and the program counter increases regularly in steps of 4 bytes (all instructions are 32 bits wide). The comment on line 1118, `(wrb) a1 <-- 47`, tells us that in the same cycle in which the `li` instruction is issued and executed, the core also writes back (`wrb`) the value `47` to destination register `a1`, as a result of that instruction. The comments on lines 1119 to 1124 report the values of the source registers involved in the respective instructions at the time they are issued.

```
 1154000 1125 M 0x80000248 mul a2, a2, a7 #; a2 = 0, a7 = 384
 1157000 1128 M #; (acc) a2 <-- 0
 1158000 1129 M 0x8000024c add a0, a0, a2 #; a0 = 0x10000000, a2 = 0, (wrb) a0 <-- 0x10000000
```

Additional instructions are supported by means of accelerators connected to Snitch's accelerator interface. When Snitch fetches and decodes an instruction which is supported through one of the accelerators, it offloads the instruction to the corresponding accelerator. This is the case for instructions in the RISC-V "M" Standard Extension for Integer Multiplication and Division, which are supported through an external MULDIV unit shared by all cores in a Snitch cluster. Offloaded instructions need to travel a longer distance to reach the dedicated functional unit and may be pipelined. To tolerate this latency, the Snitch core may fetch, issue and execute successive instructions even before the offloaded instruction commits its result to the register file. Only instructions which have no RAW or WAW dependency on any outstanding offloaded instruction are allowed to execute. As an example, at cycle 1125 Snitch issues the `mul` instruction, which is offloaded to the shared MULDIV unit. The next trace line skips three cycles, as Snitch is stalled waiting for the result of the `mul`. Indeed, if you peek at line 1129, you can see that the next instruction presents a RAW dependency on the previous instruction through register `a2`. The trace line at 1128 is also unique in that it does not contain any instruction: no instruction can be issued yet, but the comment informs us that the accelerator interface (`acc`) is writing back the value `0` to register `a2`. Therefore, in the next cycle the `add` instruction can read the contents of register `a2`, which reflect the value written in the previous cycle.

```
 1159000 1130 M 0x80000250 scfgwi a0, 768 #; a0 = 0x10000000
 1160000 1131 M 0x80000254 add a0, a5, a2 #; a5 = 0x10000c00, a2 = 0, (wrb) a0 <-- 0x10000c00
 1161000 1132 M 0x80000258 scfgwi a0, 769 #; a0 = 0x10000c00
 1162000 1133 M 0x8000025c auipc a0, 0x5 #; (wrb) a0 <-- 0x8000525c
 1163000 1134 M 0x80000260 addi a0, a0, -1396 #; a0 = 0x8000525c, (wrb) a0 <-- 0x80004ce8
 1165000 1136 M 0x80000268 add a0, a6, a2 #; a6 = 0x10001800, a2 = 0, (wrb) a0 <-- 0x10001800
 1166000 1137 M 0x8000026c scfgwi a0, 898 #; a0 = 0x10001800
              M 0x80000264 fld ft3, 0(a0) #; ft3 <~~ Doub[0x80004ce8]
 1168000 1139 M 0x80000270 csrsi ssr, 1 #;
 1169000 1140 M 0x80000274 frep 48, 1 #; outer, 48 issues
 1176000 1147 M #; (f:lsu) ft3 <-- -32.7504962
```

Everything proceeds regularly up to cycle 1136. Here we observe a jump in the program counter which is not justified by a branch or any other control-flow instruction. At the same time, we observe a lost cycle between this and the previous instruction. What happened? As mentioned earlier, instructions are fetched and issued in order.
We would therefore expect instruction `0x80000264` to be issued at cycle 1135. If we peek a little further into the trace, we find the "missing" instruction reported at cycle 1137. This delay is explained by the fact that the RISC-V "D" Standard Extension for Double-Precision Floating-Point is also implemented by an accelerator: the FPU subsystem (FPSS), which, differently from the integer MULDIV unit, is private to each core and located in the core complex. Differently from the offloading instance at cycle 1125, here the two successive instructions are allowed to execute before the offloaded instruction completes, since they have no dependency on it. When, at cycle 1137, the `fld` instruction is issued in the FPSS, the `scfgwi` instruction is simultaneously issued in the base Snitch core. This does not contradict our statement that Snitch is a single-issue core: the `fld` instruction was indeed issued by Snitch in cycle 1135, whereas in cycle 1137 it issues only the `scfgwi` instruction. Hence, two instructions can only appear to execute simultaneously in the trace if a cycle was lost earlier.

Note that the comment next to the `fld` instruction uses a syntax similar to the one we previously encountered for writebacks, with the differences that 1) it presents a "wavy" arrow, 2) the agent of the writeback is not specified, and 3) the right-hand side of the arrow carries a memory address rather than a numerical value. This notation informs us that a load from that memory address has been initiated; the corresponding writeback only occurs on line 1147, where the agent of the writeback is reported to be the floating-point LSU (`f:lsu`). The value of the source register `a0` is omitted in the `fld` line's comment, as it can easily be recovered from the memory address and the offset in the instruction arguments. If you were paying close attention, you may have noticed that the last value written back to `a0` was `0x10001800`, on line 1136, so why does the load address now evaluate to `0x80004ce8`? The reason is, once again, that the `fld` was actually issued in cycle 1135, and thus observes the value which was last written back at that time (`0x80004ce8`, on line 1134), in accordance with the original program order.

One last note concerns `frep` loops. While not visible in this trace, under `frep` operation Snitch becomes a pseudo-dual-issue processor: instructions can be issued by the Snitch core while the FREP sequencer repeatedly issues floating-point instructions. In this situation it is possible to consistently observe the core executing two instructions per cycle in the trace. The sequencer, however, does not have its own instruction fetch unit, only a buffer in which it caches the loop-body instructions and from which it issues them. The Snitch core must issue these instructions for the first loop iteration, during which they are also cached in the FREP sequencer; only from the second iteration on can the FREP sequencer operate independently. Thus, to observe the pseudo-dual-issue behaviour, you must make sure that the `frep` loop appears in program order before any other operation you want to overlap with it.

## Performance metrics

Finally, a collection of performance metrics automatically computed from the trace is reported at the end of the trace. The performance metrics are associated with measurement regions defined in your code.
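For instance, in the following minimal sketch (which assumes the snRuntime `snrt.h` header used throughout the Snitch tutorial; `my_kernel` is a placeholder), every read of the `mcycle` CSR closes the current measurement region and opens the next one:

```c
#include "snrt.h"

void my_kernel();  // placeholder for the code to benchmark

int main() {
    // Region 0: everything from startup to this point.
    uint32_t start_cycle = snrt_mcycle();  // closes region 0, opens region 1

    my_kernel();                           // region 1: the code to benchmark

    uint32_t end_cycle = snrt_mcycle();    // closes region 1, opens region 2
    return 0;
}
```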
More information on how to define these regions can be found in the Snitch [tutorial](../../target/snitch_cluster/README.md).

```
## Performance metrics

Performance metrics for section 1 @ (1118, 1203):
tstart 1146.0000
fpss_loads 1
tend 1233.0000
snitch_loads 0
snitch_avg_load_latency 0
snitch_occupancy 0.2069
snitch_fseq_rel_offloads 0.25
fseq_yield 8.6667
fseq_fpu_yield 8.6667
fpss_section_latency 1
fpss_avg_fpu_latency 3.0204
fpss_avg_load_latency 10.0
fpss_occupancy 0.5977
fpss_fpu_occupancy 0.5632
fpss_fpu_rel_occupancy 0.9423
cycles 87
total_ipc 0.8046
```

The trace reports the most relevant performance metrics for manual inspection. These and additional performance metrics can also be dumped to a JSON file for further processing (see [gen_trace.py](../../util/trace/gen_trace.py)).
The following table lists all the performance metrics extracted from the trace, along with their descriptions:

|Metric |Description |
|--------------------------|------------|
|`tstart` |The global simulation time when the `mcycle` instruction opening the current measurement region is issued. |
|`tend` |The global simulation time when the `mcycle` instruction closing the current measurement region is issued. |
|`start` |The core complex's cycle count when the `mcycle` instruction opening the current measurement region is issued. |
|`end` |The core complex's cycle count when the `mcycle` instruction closing the current measurement region is issued. |
|`end_fpss` |The core complex's cycle count when the last FP operation issued in the current measurement region retires. |
|`cycles` |Overall cycles spent in the current measurement region, calculated as `max(end, end_fpss) - start + 1`. |
|`snitch_load_latency` |Cumulative latency of all loads issued by Snitch's own LSU. The latency of a load is measured from the cycle the load is issued to the cycle it is retired, i.e. it writes back to the RF. |
|`snitch_avg_load_latency` |Average latency of a load issued by Snitch's own LSU (see `snitch_load_latency`). |
|`fpss_load_latency` |Similar to `snitch_load_latency`, but for loads issued by the FP subsystem's LSU. |
|`fpss_avg_load_latency` |Similar to `snitch_avg_load_latency`, but for loads issued by the FP subsystem's LSU. |
|`fpss_fpu_latency` |Cumulative latency of all FPU instructions. The latency of an FPU instruction is measured from the cycle the instruction is issued to the cycle it is retired, i.e. it writes back to the RF. |
|`fpss_avg_fpu_latency` |Average latency of an FPU instruction (see `fpss_fpu_latency`). |
|`fpss_section_latency` |`max(end_fpss - end, 0)` |
|`snitch_issues` |Total number of instructions issued by Snitch, excluding those offloaded to the FPSS (see `snitch_fseq_offloads`). |
|`snitch_fseq_offloads` |Total number of instructions offloaded by Snitch to the FPSS. |
|`snitch_fseq_rel_offloads`|The ratio between `snitch_fseq_offloads` and the total number of instructions issued by the Snitch core proper, i.e. `snitch_issues + snitch_fseq_offloads`. |
|`fpss_issues` |Total number of instructions issued by the FPSS. It counts repeated issues from the FREP sequencer. |
|`fpss_fpu_issues` |Similar to `fpss_issues`, but counts only instructions destined for the FPU proper. For instance, it does not include instructions issued to the FPSS's LSU. |
|`fseq_yield` |The ratio between `fpss_issues` and `snitch_fseq_offloads`. The two counts differ when the FREP sequencer replicates instructions; if the sequencer is not used, this ratio should amount to 1. |
|`fseq_fpu_yield` |Currently identical to `fseq_yield`, probably a bug. Most likely originally intended to be the ratio between `fpss_fpu_issues` and `snitch_fseq_offloads`. |
|`snitch_occupancy` |IPC of the Snitch core proper, calculated as `snitch_issues / cycles`. |
|`fpss_occupancy` |IPC of the FPSS, calculated as `fpss_issues / cycles`. |
|`fpss_fpu_occupancy` |IPC of the FPU, calculated as `fpss_fpu_issues / cycles`. |
|`fpss_fpu_rel_occupancy` |The ratio between `fpss_fpu_occupancy` and `fpss_occupancy`, equal to `fpss_fpu_issues / fpss_issues`. |
|`total_ipc` |The IPC of the core complex, calculated as `snitch_occupancy + fpss_occupancy`. |

As a sanity check against the dump above: `total_ipc = snitch_occupancy + fpss_occupancy = 0.2069 + 0.5977 = 0.8046`, which matches the reported value.
diff --git a/mkdocs.yml b/mkdocs.yml
index 6ad869786..a87fa2dc5 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -37,10 +37,12 @@ nav:
   - User Guide:
     - Getting Started: ug/getting_started.md
     - Repository Structure: ug/directory_structure.md
-    - Tutorial: ug/tutorial.md
+    - Tutorial:
+      - Base: ug/tutorial.md
+      - Advanced:
+        - Trace Analysis: ug/trace_analysis.md
     - Documentation: ug/documentation.md
   # Remove
-  # - Benchmarking: ug/benchmarking.md
   # - Trace Example: ug/example_trace.html
   - Reference Manual:
     - Hardware:
diff --git a/sw/dnn/requirements.txt b/sw/dnn/requirements.txt
index f73316ce9..c42e8ddf8 100644
--- a/sw/dnn/requirements.txt
+++ b/sw/dnn/requirements.txt
@@ -3,5 +3,5 @@
 # SPDX-License-Identifier: Apache-2.0

 numpy
-torch==2.1
+torch
 json5
diff --git a/target/snitch_cluster/README.md b/target/snitch_cluster/README.md
index 69777ada7..ce67ea9ec 100644
--- a/target/snitch_cluster/README.md
+++ b/target/snitch_cluster/README.md
@@ -17,19 +17,18 @@ In the following tutorial you can assume the working directory to be `target/sni
 To compile the hardware for simulation run one of the following commands, depending on the desired simulator:

 ```shell
-# Verilator (for Docker users)
+# Verilator
 make bin/snitch_cluster.vlt

-# Verilator (for IIS users)
-verilator-4.110 make bin/snitch_cluster.vlt
-
-# Questa (for IIS users)
-questa-2022.3 make bin/snitch_cluster.vsim
+# Questa
+make DEBUG=ON bin/snitch_cluster.vsim

-# VCS (for IIS users)
-vcs-2020.12 make bin/snitch_cluster.vcs
+# VCS
+make bin/snitch_cluster.vcs
 ```

 These commands compile the RTL sources respectively in `work-vlt`, `work-vsim` and `work-vcs`. Additionally, common C++ testbench sources (e.g. the [frontend server (fesvr)](https://github.com/riscv-software-src/riscv-isa-sim)) are compiled under `work`. Each command will also generate a script or an executable (e.g. `bin/snitch_cluster.vsim`) which you can invoke to simulate the hardware. We will see how to do this in a later section.
+The `DEBUG=ON` variable preserves the visibility of all internal signals during simulation.

 ### Building the Banshee simulator

 Instead of running an RTL simulation, you can use our instruction-accurate simulator called `banshee`. To install the simulator, please follow the instructions of the Banshee repository: [https://github.com/pulp-platform/banshee](https://github.com/pulp-platform/banshee).
@@ -77,36 +76,29 @@ ___Note:__ on GVSOC, it is better to use OpenOCD semi-hosting to prevent putchar

 ### Running a simulation

-Create the `logs` directory to host the simulation traces:
-
-```shell
-# If it's the first time you run this the logs/ folder won't exist and you will have to create it
-mkdir logs
-```
-
-Run one of the executables which was compiled in the previous step on your Snitch cluster hardware with your preferred simulator:
+Run one of the executables which was compiled in the previous step on your Snitch cluster simulator of choice:

 ```shell
-# Verilator (for Docker users)
+# Verilator
 bin/snitch_cluster.vlt sw/apps/blas/axpy/build/axpy.elf

-# Verilator (for IIS users)
-verilator-4.110 bin/snitch_cluster.vlt sw/apps/blas/axpy/build/axpy.elf
-
-# Questa (for IIS users)
-questa-2022.3 bin/snitch_cluster.vsim sw/apps/blas/axpy/build/axpy.elf
+# Questa
+bin/snitch_cluster.vsim sw/apps/blas/axpy/build/axpy.elf

-# VCS (for IIS users)
-vcs-2020.12 bin/snitch_cluster.vcs sw/apps/blas/axpy/build/axpy.elf
+# VCS
+bin/snitch_cluster.vcs sw/apps/blas/axpy/build/axpy.elf

 # Banshee
 banshee --no-opt-llvm --no-opt-jit --configuration src/banshee.yaml --trace sw/apps/blas/axpy/build/axpy.elf
 ```

-The previous commands will run the simulation in your current terminal. You can also run the simulation in the QuestaSim GUI by adapting the previous command to:
+The Snitch cluster simulator binaries can be invoked from any directory: simply adapt the relative paths in the preceding commands accordingly, or use absolute paths. We refer to the working directory where the simulation is launched as the simulation directory. Within it, you will find several log files produced by the RTL simulation.
+
+The previous commands will launch the simulation on the console. QuestaSim simulations can also be launched in the QuestaSim GUI, by adapting the previous command to:

 ```shell
-# Questa (for IIS users)
-questa-2022.3 bin/snitch_cluster.vsim.gui sw/apps/blas/axpy/build/axpy.elf
+# Questa
+bin/snitch_cluster.vsim.gui sw/apps/blas/axpy/build/axpy.elf
 ```

 For Banshee, you need to give a specific cluster configuration to the simulator with the flag `--configuration <config>`. A default Snitch cluster configuration is given (`src/banshee.yaml`). The flag `--trace` enables the printing of traces, similarly to the RTL simulation.

@@ -146,14 +138,14 @@ void axpy(uint32_t l, double a, double *x, double *y, double *z) {
 int main() {
     // Read the mcycle CSR (this is our way to mark/delimit a specific code region for benchmarking)
-    uint32_t start_cycle = mcycle();
+    uint32_t start_cycle = snrt_mcycle();

     // DM core does not participate in the computation
     if(snrt_is_compute_core())
         axpy(L, a, x, y, z);

     // Read the mcycle CSR
-    uint32_t end_cycle = mcycle();
+    uint32_t end_cycle = snrt_mcycle();
 }
 ```

@@ -162,7 +154,7 @@ The `snrt.h` file implements the snRuntime API, a library of convenience functio

 ___Note:__ Have a look at the files inside `sw/snRuntime` in the root of this repository to see what kind of functionality the snRuntime API defines. Note this is only an API, with some base implementations. The Snitch cluster implementation of the snRuntime for RTL simulation can be found under `target/snitch_cluster/sw/runtime/rtl`. It is automatically built and linked with user applications thanks to our compilation scripts._
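To give a flavor of the API, here is a minimal sketch: `snrt_is_compute_core()` and `snrt_mcycle()` appear elsewhere in this tutorial, while `snrt_cluster_hw_barrier()` is assumed from the snRuntime's synchronization primitives:

```c
#include "snrt.h"

int main() {
    // Only compute cores run the kernel; the DM core skips it.
    if (snrt_is_compute_core()) {
        // ... kernel code ...
    }

    // Wait for every core in the cluster to reach this point.
    snrt_cluster_hw_barrier();
    return 0;
}
```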
-We will have to instead create the `data.h` file ourselves. Create a `target/snitch_cluster/sw/apps/axpy/data/data` folder to host the data for your kernel to operate on:
+We will have to instead create the `data.h` file ourselves. Create a `target/snitch_cluster/sw/apps/axpy/data` folder to host the data for your kernel to operate on:

 ```bash
 mkdir sw/apps/axpy/data
 ```

 Inside this folder, create a `data.h` file along the following lines:

 ```c
 uint32_t L = 16;

 double a = 2;

 double x[16] = { /* 16 hardcoded input values */ };
 double y[16] = { /* 16 hardcoded input values */ };
 double z[16];
 ```

-In this file we hardcode the data to be used by the kernel. This data will be loaded into memory together with your application code. In general, to verify your code you may want to randomly generate the above data. You may also want to test your kernel on different problem sizes, e.g. varying the length of the vectors, without having to manually rewrite the file. This can be achieved by generating the data header file with a Python script. You may have a look at the `sw/blas/axpy/datagen` folder in the root of this repository as an example. You may reuse several of the functions defined in `sw/blas/axpy/datagen/datagen.py`. Eventually, we will promote these functions to a dedicated Python module which can be easily reused.
+In this file we hardcode the data to be used by the kernel. This data will be loaded into memory together with your application code. In general, to verify your code you may want to randomly generate the above data. You may also want to test your kernel on different problem sizes, e.g. varying the length of the vectors, without having to manually rewrite the file. This can be achieved by generating the data header file with a Python script. You may have a look at the `sw/blas/axpy/scripts/datagen.py` script in the root of this repository as an example. As you can see, it reuses many convenience classes and functions for data generation from the `data_utils` module. Documentation for this module can be found [here](https://pulp-platform.github.io/snitch_cluster/rm/sim/data_utils.html).

 #### Compiling the C Code

@@ -199,10 +191,10 @@ include ../common.mk
 ```

 This Makefile will be invoked recursively by the top-level Makefile, compiling your source code into an executable with the name provided in the `APP` variable.

-In order for the top-level Makefile to find your application, add your application's directory to the `SUBDIRS` variable in `sw/apps/Makefile`:
+In order for the top-level Makefile to find your application, add your application's directory to the `APPS` variable in `sw.mk`:

 ```
-SUBDIRS += axpy
+APPS += sw/apps/axpy
 ```

 Now you can recompile all software, including your newly added AXPY application:

@@ -222,12 +214,12 @@ If you want to dig deeper into how our build system works and how these files we

 You can run your application in simulation as shown in the previous sections. Make sure to pick up the right binary, e.g.:

 ```shell
-questa-2022.3 bin/snitch_cluster.vsim sw/apps/axpy/build/axpy.elf
+bin/snitch_cluster.vsim sw/apps/axpy/build/axpy.elf
 ```

 ### Debugging and benchmarking

-When you run the simulation, every core will log all the instructions it executes (along with additional information, such as the value of the registers before/after the instruction) in a trace file, located in the `target/snitch_cluster/logs` directory. The traces are identified by their hart ID, that is a unique ID for every hardware thread (hart) in a RISC-V system (and since all our cores have a single thread that is a unique ID per core)
+When you run the simulation, every core will log all the instructions it executes (along with additional information, such as the values of the registers before/after the instruction) in a trace file. The traces are located in the `logs` folder within the simulation directory. The traces are identified by their hart ID, that is, a unique ID for every hardware thread (hart) in a RISC-V system; since all our cores are single-threaded, this is also a unique ID per core.

 The simulation logs the traces in a non-human-readable format with the `.dasm` extension. To convert these to a human-readable form run:

 ```bash
 make -j traces
 ```

+If the simulation directory does not coincide with the current working directory, you will have to specify its path explicitly:
+
+```bash
+make -j traces SIM_DIR=<path_to_simulation_directory>
+```
+
+Detailed information on how to interpret the generated traces can be found [here](../../docs/ug/trace_analysis.md).
+
 In addition to generating readable traces (`.txt` format), the above command also computes several performance metrics from the trace and appends them at the end of the trace. These can be collected into a single CSV file with the following target:

 ```bash
 make logs/perf.csv
 # View the CSV file
 libreoffice logs/perf.csv
 ```

-In this file you can find the `X_tstart` and `X_tend` metrics. These are the cycles in which a particular code region `X` starts and ends, and can hence be used to profile your code. Code regions are defined by calls to `mcycle()`. Every call to this function defines two code regions:
-- the code preceding the call, up to the previous `mcycle()` call or the start of the source file
-- the code following the call, up to the next `mcycle()` call or the end of the source file
+In this file you can find the `X_tstart` and `X_tend` metrics. These are the cycles in which a particular code region `X` starts and ends, and can hence be used to profile your code. Code regions are defined by calls to `snrt_mcycle()`. Every call to this function defines two code regions:
+- the code preceding the call, up to the previous `snrt_mcycle()` call or the start of the source file
+- the code following the call, up to the next `snrt_mcycle()` call or the end of the source file

 The CSV file can be useful to automate collection and post-processing of benchmarking data.

@@ -324,14 +324,14 @@ void axpy(uint32_t l, double a, double *x, double *y, double *z) {
 int main() {
     // Read the mcycle CSR (this is our way to mark/delimit a specific code region for benchmarking)
-    uint32_t start_cycle = mcycle();
+    uint32_t start_cycle = snrt_mcycle();

     // DM core does not participate in the computation
     if(snrt_is_compute_core())
         axpy(L / snrt_cluster_compute_core_num(), a, x, y, z);

     // Read the mcycle CSR
-    uint32_t end_cycle = mcycle();
+    uint32_t end_cycle = snrt_mcycle();
 }
 ```
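Note that `main()` passes `L / snrt_cluster_compute_core_num()` as the length argument, i.e. each compute core is handed one fraction of the vector; the per-core offset computation happens inside `axpy`, which this hunk does not show. A minimal sketch of what such a body could look like, assuming the snRuntime's `snrt_cluster_core_idx()` to retrieve the calling core's index within the cluster (illustrative, not the tutorial's exact code):

```c
// l is the per-core fraction: main() passes L / snrt_cluster_compute_core_num().
void axpy(uint32_t l, double a, double *x, double *y, double *z) {
    // Give every compute core its own contiguous slice of the vectors.
    uint32_t offset = l * snrt_cluster_core_idx();
    for (uint32_t i = offset; i < offset + l; i++) {
        z[i] = a * x[i] + y[i];
    }
}
```

With this partitioning the slices are disjoint, so the cores only need to synchronize (e.g. with a cluster barrier) after the kernel, before the results are consumed.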