From 6c3e34041b51aff0a12e669f5d7a3b0c87cf0a3f Mon Sep 17 00:00:00 2001 From: Luca Colagrande Date: Tue, 1 Oct 2024 16:11:05 +0200 Subject: [PATCH] treewide: Update docs and tutorial --- docs/ug/documentation.md | 23 +- docs/ug/trace_analysis.md | 4 +- docs/ug/tutorial.md | 413 ++++++++++++++++++++++++++++++-- mkdocs.yml | 7 +- target/snitch_cluster/README.md | 341 +------------------------- util/container/README.md | 11 +- 6 files changed, 423 insertions(+), 376 deletions(-) diff --git a/docs/ug/documentation.md b/docs/ug/documentation.md index 24467bc24..ad464b590 100644 --- a/docs/ug/documentation.md +++ b/docs/ug/documentation.md @@ -1,22 +1,23 @@ # Documentation -Documentation of the generator and related infrastructure is hosted under -`docs`. Static `html` documentation is build from the latest `main` branch by -the CI. We use [mkdocs](https://www.mkdocs.org/) together with the [material -theme](https://squidfunk.github.io/mkdocs-material/). Before building the -documentation, make sure you have the required dependencies installed: +Documentation pages for the Snitch cluster are hosted under `docs`. Static +`html` documentation is built and deployed from the latest `main` branch by the +CI. We use [mkdocs](https://www.mkdocs.org/) together with the [material +theme](https://squidfunk.github.io/mkdocs-material/). -```shell -pip install . -``` - -After everything is installed, you can build a static copy of the `html` documentation by -executing (in the root directory): +You can build a static copy of the `html` documentation by +executing (in the root of this repository): ```shell make docs ``` +Documentation for the Python sources in this repository is generated from the +docstrings contained within the sources themselves, using +[mkdocstrings](https://mkdocstrings.github.io/). +Documentation for the C sources in this repository is generated from the +Doxygen-style comments within the sources themselves, using Doxygen. + ## Organization The `docs` folder is organized as follows: diff --git a/docs/ug/trace_analysis.md b/docs/ug/trace_analysis.md index 0fefb2dfd..ca833a872 100644 --- a/docs/ug/trace_analysis.md +++ b/docs/ug/trace_analysis.md @@ -79,7 +79,7 @@ One last note should be made about `frep` loops. While not visible from this tra ## Performance metrics -Finally, at the end of the trace, a collection of performance metrics automatically computed from the trace is reported. The performance metrics are associated to regions defined in your code. More information on how to define these regions can be found in the Snitch [tutorial](../../target/snitch_cluster/README.md). +Finally, at the end of the trace, a collection of performance metrics automatically computed from the trace is reported. The performance metrics are associated to regions defined in your code. More information on how to define these regions can be found in the Snitch [tutorial](tutorial.md). ``` ## Performance metrics @@ -104,7 +104,7 @@ cycles 87 total_ipc 0.8046 ``` -The trace will contain the most relevant performance metrics for manual inspection. These and additional performance metrics can also be dumped to a JSON file for further processing (see [gen_trace.py](../../util/trace/gen_trace.py)). +The trace will contain the most relevant performance metrics for manual inspection. These and additional performance metrics can also be dumped to a JSON file for further processing (see [gen_trace.py](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/util/trace/gen_trace.py)). 
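+
+As a sketch of such further processing, the short Python snippet below loads a dumped metrics file and prints two metrics per region. The file name, and the assumption that the dump is a list of per-region dictionaries keyed by the metric names listed below, are illustrative only; inspect the actual output of `gen_trace.py` for the exact format:
+
+```python
+import json
+
+# Hypothetical file name: use whatever path you chose when dumping the metrics.
+with open("trace_hart_00000.json") as f:
+    regions = json.load(f)
+
+# Assumes a list of per-region dictionaries holding the metrics described on
+# this page (e.g. "cycles", "total_ipc").
+for i, region in enumerate(regions):
+    print(f"region {i}: {region.get('cycles')} cycles, IPC {region.get('total_ipc')}")
+```
+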
In the following table you can find a complete list of all the performance metrics extracted from the trace along with their description: |Metric |Description | diff --git a/docs/ug/tutorial.md b/docs/ug/tutorial.md index 363fa82e3..3437a2e89 100644 --- a/docs/ug/tutorial.md +++ b/docs/ug/tutorial.md @@ -2,27 +2,408 @@ The following tutorial will guide you through the use of the Snitch cluster. You will learn how to develop, simulate, debug and benchmark software for the Snitch cluster architecture. - +You can assume the working directory to be `target/snitch_cluster`. All paths are to be assumed relative to this directory. Paths relative to the root of the repository are prefixed with a slash. + +### Setup + +If you don't have access to an IIS machine, and you set up the Snitch Docker container as described in the [getting started guide](getting_started.md), all of the commands presented in this tutorial will have to be executed in the Docker container. + {% - include-markdown '../../target/snitch_cluster/README.md' + include-markdown '../../util/container/README.md' + start="## Usage" + end="## Limitations" comments=false - start="## Tutorial" + heading-offset=1 %} -## Using Verilator with LLVM +Where you should replace `` with the path to the root directory of the Snitch cluster repository cloned on your machine. + +!!! warning + As QuestaSim and VCS are proprietary tools and require a license, only Verilator is provided within the container for RTL simulations. + +### Building the hardware + +To run software on Snitch without a physical chip, you will need a simulation model of the Snitch cluster. You can build a cycle-accurate simulation model from the RTL sources directly using QuestaSim, VCS or Verilator, with either of the following commands: + +=== "Verilator" + ```shell + make bin/snitch_cluster.vlt + ``` +=== "Questa" + ```shell + make DEBUG=ON bin/snitch_cluster.vsim + ``` +=== "VCS" + ```shell + make bin/snitch_cluster.vcs + ``` + +These commands compile the RTL sources respectively in `work-vlt`, `work-vsim` and `work-vcs`. Additionally, common C++ testbench sources (e.g. the [frontend server (fesvr)](https://github.com/riscv-software-src/riscv-isa-sim)) are compiled under `work`. Each command will also generate a script or an executable (e.g. `bin/snitch_cluster.vsim`) which we can use to simulate software on Snitch, as we will see in section [Running a simulation](#running-a-simulation). + +!!! info + The variable `DEBUG=ON` is required when using QuestaSim to preserve the visibility of all internal signals. If you need to inspect the simulation waveforms, you should set this variable when building the simulation model. For faster simulations you can omit the variable assignment, as QuestaSim may be able to optimize internal signals away. + + +### Building the Banshee simulator + +Instead of building a simulation model from the RTL sources, you can use our instruction-accurate (note: not cycle-accurate) simulator called `banshee`. To install the simulator, please follow the instructions provided in the [Banshee repository](https://github.com/pulp-platform/banshee). + +### Configuring the hardware + +The Snitch cluster RTL sources are partly automatically generated from a configuration file provided in `.hjson` format. Several RTL files are templated and use the `.hjson` configuration file to fill in the template. An example is [snitch_cluster_wrapper.sv.tpl](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/hw/snitch_cluster/src/snitch_cluster_wrapper.sv.tpl). 
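+
+To build an intuition for what this template filling involves, here is a minimal, self-contained Python sketch. It is purely illustrative: the parameter names and the use of `string.Template` are assumptions made for this example only, and the actual generator and template syntax used in this repository differ.
+
+```python
+from string import Template
+
+# Hypothetical configuration values, standing in for entries of the .hjson file.
+cfg = {"num_cores": 9, "tcdm_size_kib": 128}
+
+# A toy "template" for a couple of SystemVerilog parameters.
+tpl = Template(
+    "localparam int unsigned NumCores    = ${num_cores};\n"
+    "localparam int unsigned TcdmSizeKiB = ${tcdm_size_kib};\n"
+)
+
+# Fill the template with the configuration values, as a generator would.
+print(tpl.substitute(cfg))
+```
+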
+ +In the [`cfg`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/target/snitch_cluster/cfg) folder, different configurations are provided. The [`cfg/default.hjson`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/target/snitch_cluster/cfg/default.hjson) configuration instantiates 8 compute cores + 1 DMA core in the cluster. + +The command you previously executed automatically generated the RTL sources from the templates, and it implicitly used the default configuration file. In this configuration the FPU is not equipped with a floating-point divide and square-root unit. +To override the default configuration file, e.g. to use the configuration with FDIV/FSQRT unit, define the following variable when you invoke `make`: +```shell +make CFG_OVERRIDE=cfg/fdiv.hjson bin/snitch_cluster.vlt +``` + +If you want to use a custom configuration, just point `CFG_OVERRIDE` to the path of your configuration file. + +!!! tip + When you override the configuration file on the `make` command-line, the configuration is stored in the `cfg/lru.hjson` file. Successive invocations of `make` will automatically pick up the `cfg/lru.hjson` file. You can therefore omit the `CFG_OVERRIDE` definition in successive commands unless you want to override the least-recently used configuration. + +### Building the software + +To build all of the software for the Snitch cluster, run the following command. Different simulators may require different runtime implementations, so different options have to be specified to select the appropriate implementation, e.g. for Banshee simulations or OpenOCD semi-hosting: + +=== "RTL" -LLVM+clang can be used to build the Verilator model. Optionally specify a path -to the LLVM toolchain in `CLANG_PATH` and set `VLT_USE_LLVM=ON`. -For the verilated model itself to be complied with LLVM, verilator must be built -with LLVM (`CC=clang CXX=clang++ ./configure`). The `VLT` environment variable -can then be used to point to the verilator binary. + ```bash + make DEBUG=ON sw -j + ``` + +=== "Banshee" + + ```bash + make DEBUG=ON SELECT_RUNTIME=banshee sw -j + ``` + +=== "OpenOCD" + + ```bash + make DEBUG=ON OPENOCD_SEMIHOSTING=ON sw -j + ``` + +This builds all software targets defined in the repository, e.g. the Snitch runtime library and all applications. Artifacts are stored in the build directory of each target. For example, have a look inside `sw/apps/blas/axpy/build/` and you will find the artifacts of the AXPY application build, e.g. the compiled executable `axpy.elf` and a disassembly `axpy.dump`. + +If you only want to build a specific software target, you can do so by replacing `sw` with the name of that target, e.g. the name of an application: ```bash -# Optional: Specify which llvm to use -export CLANG_PATH=/path/to/llvm-12.0.1 -# Optional: Point to a verilator binary compiled with LLVM -export VLT=/path/to/verilator-llvm/bin/verilator make VLT_USE_LLVM=ON bin/snitch_cluster.vlt ``` + +make DEBUG=ON axpy -j +``` + +For this to be possible, we require all software targets to have unique names, distinct from any other Make target. + +!!! warning + The RTL is not the only source which is generated from the configuration file. The software stack also depends on the configuration file. Make sure you always build the software with the same configuration of the hardware you are going to run it on. + +!!! 
info + The `DEBUG=ON` flag is used to tell the compiler to produce debugging symbols and disassemble the generated ELF binaries for inspection (`.dump` files in the build directories). Debugging symbols are required by the `annotate` target, showcased in the [Debugging and benchmarking](#debugging-and-benchmarking) section of this guide. + +!!! tip + On GVSOC, it is better to use OpenOCD semi-hosting to prevent putchar from disturbing the DRAMSys timing model. + +### Running a simulation + +Run one of the executables which was compiled in the previous step on your Snitch cluster simulator of choice: + +=== "Verilator" + + ```shell + bin/snitch_cluster.vlt sw/apps/blas/axpy/build/axpy.elf + ``` + +=== "Questa" + + ```shell + bin/snitch_cluster.vsim sw/apps/blas/axpy/build/axpy.elf + ``` + +=== "VCS" + + ```shell + bin/snitch_cluster.vcs sw/apps/blas/axpy/build/axpy.elf + ``` + +=== "Banshee" + + ```shell + banshee --no-opt-llvm --no-opt-jit --configuration src/banshee.yaml --trace sw/apps/blas/axpy/build/axpy.elf + ``` + +The simulator binaries can be invoked from any directory; just adapt the relative paths in the preceding commands accordingly, or use absolute paths. We refer to the working directory where the simulation is launched as the _simulation directory_. Within it, you will find several log files produced by the RTL simulation. + +!!! tip + If you don't want your log files to be overridden when you run another simulation, just create separate simulation directories for every simulation whose artifacts you want to preserve, and run the simulations therein. + +The previous commands will launch the simulation on the console. QuestaSim simulations can also be launched with the GUI, e.g. for waveform inspection. Just adapt the previous command to: + +```shell +bin/snitch_cluster.vsim.gui sw/apps/blas/axpy/build/axpy.elf +``` + +### Developing your first Snitch application + +In the following you will create your own AXPY kernel implementation as an example of how to develop software for Snitch. + +#### Writing the C Code + +Create a directory for your AXPY kernel: + +```bash +mkdir sw/apps/tutorial +``` + +And a `src` subdirectory to host your source code: + +```bash +mkdir sw/apps/tutorial/src +``` + +Here, create a new file named `tutorial.c` with the following contents: + +```C +#include "snrt.h" +#include "data.h" + +// Define your kernel +void axpy(uint32_t l, double a, double *x, double *y, double *z) { + for (uint32_t i = 0; i < l ; i++) { + z[i] = a * x[i] + y[i]; + } + snrt_fpu_fence(); +} + +int main() { + // Read the mcycle CSR (this is our way to mark/delimit a specific code region for benchmarking) + uint32_t start_cycle = snrt_mcycle(); + + // DM core does not participate in the computation + if(snrt_is_compute_core()) + axpy(L, a, x, y, z); + + // Read the mcycle CSR + uint32_t end_cycle = snrt_mcycle(); +} + +``` + +The [`snrt.h`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/target/snitch_cluster/sw/runtime/rtl/src/snrt.h) file implements the snRuntime API, a library of convenience functions to program Snitch-cluster-based systems, and it is automatically referenced by our compilation scripts. Documentation for the snRuntime can be found at the [Doxygen-generated pages](../doxygen/html/index.html). + +!!! note + The [snRuntime sources](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/sw/snRuntime) only define the snRuntime API, and provide a base implementation of a subset of functions. 
A complete implementation of the snRuntime for RTL simulation can be found under [`target/snitch_cluster/sw/runtime/rtl`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/target/snitch_cluster/sw/runtime/rtl). + +We will instead have to create the `data.h` file ourselves. Create a folder to host the data for your kernel to operate on: + +```bash +mkdir sw/apps/tutorial/data +``` + +Here, create a C file named `data.h` with the following contents: + +```C +uint32_t L = 16; + +double a = 2; + +double x[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}; + +double y[16] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + +double z[16]; + +``` + +In this file we hardcode the data to be used by the kernel. This data will be loaded in memory together with your application code. + +#### Compiling the C Code + +In your `tutorial` folder, create a new file named `app.mk` with the following contents: + +```make +APP = tutorial +SRCS = src/tutorial.c +INCDIRS = data + +include $(ROOT)/target/snitch_cluster/sw/apps/common.mk +``` + +This file will be included in the top-level Makefile, compiling your source code into an executable with the name provided in the `APP` variable. + +In order for the top-level Makefile to find your application, add your application's directory to the `APPS` variable in [`sw.mk`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/target/snitch_cluster/sw.mk): + +``` +APPS += sw/apps/tutorial +``` + +Now you can recompile the software, including your newly added tutorial application, as shown in section [Building the software](#building-the-software). + +!!! note + Only the software targets depending on the sources you have added/modified have been recompiled. + +!!! info + If you want to dig deeper into how our build system works and how these files were generated, you can start from the [top-level Makefile](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/target/snitch_cluster/Makefile) and work your way through the other Makefiles included within it. + +#### Running your application + +You can then run your application as shown in section [Running a simulation](#running-a-simulation). Make sure to pick up the right binary, i.e. `sw/apps/tutorial/build/tutorial.elf`. + +#### Generating input data + +In the `sw/apps/tutorial/build` directory, you will now find your `tutorial.elf` executable and some other files which were automatically generated to aid debugging. Open the disassembly `tutorial.dump` and search for `<x>`, `<y>` and `<z>`. You will see the addresses where the respective vectors defined in `data.h` have been allocated by the compiler. This file can also be very useful to see what assembly instructions your source code was compiled to, and correlate the traces (as we will see later) with the source code. + + +In general, you may want to randomly generate the data for your application. You may also want to test your kernel on different problem sizes, e.g. varying the length of the AXPY vectors, without having to manually rewrite the file. + +The approach we use is to generate the header file with a Python script. An input `.json` file can be used to configure the data generation, e.g. to set the length of the AXPY vectors. 
Have a look at the [`datagen.py`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/sw/blas/axpy/scripts/datagen.py) and [`params.json`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/sw/blas/axpy/data/params.json) files in our full-fledged [AXPY application](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/sw/blas/axpy/) as an example. As you can see, the data generation script reuses many convenience classes and functions from the [`data_utils`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/util/sim/data_utils.py) module. We advise you to do the same. Documentation for this module can be found at the [auto-generated pages](../rm/sim/data_utils.md). + +#### Verifying your application + +When developing an application, it is good practice to verify the results of your application against a golden model. The traditional approach is to generate expected results in your data generation script, dump these into the header file and extend your application to check its results against the expected results, _in simulation_! Every cycle spent on verification is simulated, and this may take a significant time for large designs. We refer to this approach as the _Built-in self-test (BIST)_ approach. + +A better alternative is to read out the results from your application at the end of the simulation, and compare them outside of the simulation. You may have a look at our AXPY's [`verify.py`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/sw/blas/axpy/scripts/verify.py) script as an example. This script can be used to verify the AXPY application by prepending it to the usual simulation command, as: + +```shell +../../sw/blas/axpy/scripts/verify.py bin/snitch_cluster.vlt sw/apps/blas/axpy/build/axpy.elf +``` + +You can test if the verification passed by checking that the exit code of the previous command is 0 (e.g. in a bash terminal): +```bash +echo $? +``` + +Again, most of the logic in the script is implemented in convenience classes and functions provided by the [`verif_utils`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/util/sim/verif_utils.py) module. Documentation for this module can be found at the [auto-generated pages](../rm/sim/verif_utils.md). + +!!! info + The `verif_utils` functions build upon a complex verification infrastructure, which uses inter-process communication (IPC) between the Python process and the simulation process to get the results of your application at the end of the simulation. If you want to dig deeper into how this framework is implemented, have a look at the [`SnitchSim.py`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/util/sim/SnitchSim.py) module and the IPC files within the [`test`](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/target/common/test) folder. + +### Debugging and benchmarking + +When you run the simulation, every core will log all the instructions it executes (along with additional information, such as the value of the registers before/after the instruction) in a trace file. The traces are located in the `logs` folder within the simulation directory. The traces are identified by their hart ID, that is a unique ID for every hardware thread (hart) in a RISC-V system (and since all our cores have a single thread that is a unique ID per core). + +The simulation logs the traces in a non-human readable format with `.dasm` extension. 
To convert these to a human-readable form run: + +```bash +make -j traces +``` + +If the simulation directory does not coincide with the current working directory, you will have to specify the path explicitly: + +```bash +make -j traces SIM_DIR= +``` + +Detailed information on how to interpret the generated traces can be found [here](trace_analysis.md). + +In addition to generating readable traces (`.txt` format), the above command also computes several performance metrics from the trace and appends them at the end of the trace. These can be collected into a single CSV file with the following target: + +```bash +make logs/perf.csv +# View the CSV file +libreoffice logs/perf.csv +``` + +In this file you can find the `X_tstart` and `X_tend` metrics. These are the cycles in which a particular code region `X` starts and ends, and can hence be used to profile your code. Code regions are defined by calls to `snrt_mcycle()`. Every call to this function defines two code regions: +- the code preceding the call, up to the previous `snrt_mcycle()` call or the start of the source file +- the code following the call, up to the next `snrt_mcycle()` call or the end of the source file + +The CSV file can be useful to automate collection and post-processing of benchmarking data. + +Finally, debugging your program from the trace alone can be quite tedious and time-consuming. You would have to manually understand which instructions in the trace correspond to which lines in your source code. Surely, you can help yourself with the disassembly. + +Alternatively, you can automatically annotate the traces with that information. With the following commands you can view the trace instructions side-by-side with the corresponding source code lines they were compiled from: + +```bash +make -j annotate +kompare -o logs/trace_hart_00000.diff +``` + +If you prefer to view this information in a regular text editor (e.g. for search), you can open the `logs/trace_hart_xxxxx.s` files. Here, the annotations are interleaved with the trace rather than being presented side-by-side. + +___Note:__ the `annotate` target uses the `addr2line` binutil behind the scenes, which needs debugging symbols to correlate instruction addresses with originating source code lines. The `DEBUG=ON` flag you specified when building the software is used to tell the compiler to produce debugging symbols when compiling your code._ + +The traces contain a lot of information in which we might not be interested at first. To simply visualize the runtime of the compute region in our code, first create a file named `layout.csv` in `sw/apps/tutorial` with the following contents: + +``` + , compute +"range(0,8)", 1 +8 , + +``` + +Then run the following commands: + +```bash +# Similar to logs/perf.csv but filters all but tstart and tend metrics +make logs/event.csv +# Labels, filters and reorders the event regions as specified by an application-specific layout file +../../util/trace/layout_events.py logs/event.csv sw/apps/tutorial/layout.csv -o logs/trace.csv +# Creates a trace file which can be visualized with Chrome's TraceViewer +../../util/trace/eventvis.py -o logs/trace.json logs/trace.csv +``` + +Go to `http://ui.perfetto.dev/`. Here you can load the `logs/trace.json` file and graphically view the runtime of the compute region in your code. To learn more about the layout file syntax and what the Python scripts do you can have a look at the description comment at the start of the scripts themselves. 
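+
+If you prefer to post-process these metrics programmatically rather than graphically, a short Python sketch along the following lines can compute region runtimes from the `logs/perf.csv` file introduced above. The assumption of one row per hart and of column names such as `1_tstart` and `1_tend` is for illustration only; check the header of your generated CSV:
+
+```python
+import csv
+
+# Assumed layout: one row per hart, with columns such as "1_tstart" and
+# "1_tend" holding the start and end cycle of code region 1.
+with open("logs/perf.csv") as f:
+    for i, row in enumerate(csv.DictReader(f)):
+        runtime = float(row["1_tend"]) - float(row["1_tstart"])
+        print(f"row {i}: region 1 ran for {runtime:.0f} cycles")
+```
+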
+ +__Great, but, have you noticed a problem?__ + +Look into `sw/apps/tutorial/build/tutorial.dump` and search for the address of the output variable `<z>`: + +``` +Disassembly of section .bss: + +80000960 <z>: + ... +``` + +Now grep this address in your traces: + +```bash +grep 80000960 logs/*.txt +... +``` + +It appears in every trace! All the cores issue a `fsd` (float store double) to this address. You are not parallelizing your kernel but executing it 8 times! + +Modify `sw/apps/tutorial/src/tutorial.c` to truly parallelize your kernel: + +```C +#include "snrt.h" +#include "data.h" + +// Define your kernel +void axpy(uint32_t l, double a, double *x, double *y, double *z) { + int core_idx = snrt_cluster_core_idx(); + int offset = core_idx * l; + + for (int i = 0; i < l; i++) { + z[offset] = a * x[offset] + y[offset]; + offset++; + } + snrt_fpu_fence(); +} + +int main() { + // Read the mcycle CSR (this is our way to mark/delimit a specific code region for benchmarking) + uint32_t start_cycle = snrt_mcycle(); + + // DM core does not participate in the computation + if(snrt_is_compute_core()) + axpy(L / snrt_cluster_compute_core_num(), a, x, y, z); + + // Read the mcycle CSR + uint32_t end_cycle = snrt_mcycle(); +} +``` + +Now re-run your kernel and compare the execution time of the compute region with the previous version. + +## Code Reuse + +As you may have noticed, there is a good deal of code which is independent of the hardware platform we execute our AXPY kernel on. This is true for the `data.h` file and possible data generation scripts. The Snitch AXPY kernel itself is not specific to the Snitch cluster, but can be ported to any platform which provides an implementation of the snRuntime API. An example is Occamy, with its own testbench and SW development environment. + +It is thus preferable to develop the data generation scripts and Snitch kernels in a shared location, from which multiple platforms can take and include the code. The `sw` directory in the root of this repository was created with this goal in mind. For the AXPY example, shared sources are hosted under the `sw/blas/axpy` directory. As an example of how these shared sources are used to build an AXPY application for a specific platform (in this case the standalone Snitch cluster) you can have a look at the `target/snitch_cluster/sw/apps/blas/axpy` directory. + +We recommend that you follow this approach also in your own developments for as much of the code as can be reused. diff --git a/mkdocs.yml b/mkdocs.yml index d13ad0eb1..1fe011338 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -4,9 +4,10 @@ site_name: Snitch Cluster theme: name: material - icon: repo: fontawesome/brands/github + features: + - content.code.copy repo_url: https://github.com/pulp-platform/snitch_cluster repo_name: pulp-platform/snitch_cluster @@ -16,10 +17,12 @@ markdown_extensions: - def_list - pymdownx.highlight - pymdownx.superfences - - pymdownx.tabbed + - pymdownx.tabbed: + alternate_style: true - pymdownx.emoji: emoji_index: !!python/name:material.extensions.emoji.twemoji emoji_generator: !!python/name:material.extensions.emoji.to_svg + - toc plugins: - include-markdown - mkdocstrings: diff --git a/target/snitch_cluster/README.md b/target/snitch_cluster/README.md index ce67ea9ec..f50adae0a 100644 --- a/target/snitch_cluster/README.md +++ b/target/snitch_cluster/README.md @@ -1,346 +1,9 @@ # Snitch cluster target The Snitch cluster target (`target/snitch_cluster`) is a simple RTL testbench -around a Snitch cluster. The cluster can be configured using a config file. 
By default, the config file which will be picked up is `target/snitch_cluster/cfg/default.hsjon`. - -The configuration parameters are documented using JSON schema. Documentation for the schema and available configuration options can be found in `docs/schema-doc/snitch_cluster/`). +around a Snitch cluster. The cluster testbench simulates an infinite memory. The RISC-V ELF file to be simulated is preloaded using RISC-V's Front-End Server (`fesvr`). -## Tutorial - -In the following tutorial you can assume the working directory to be `target/snitch_cluster`. All paths are to be assumed relative to this directory. Paths relative to the root of the repository are prefixed with a slash. - -### Building the hardware - -To compile the hardware for simulation run one of the following commands, depending on the desired simulator: - -```shell -# Verilator -make bin/snitch_cluster.vlt - -# Questa -make DEBUG=ON bin/snitch_cluster.vsim - -# VCS -make bin/snitch_cluster.vcs -``` - -These commands compile the RTL sources respectively in `work-vlt`, `work-vsim` and `work-vcs`. Additionally, common C++ testbench sources (e.g. the [frontend server (fesvr)](https://github.com/riscv-software-src/riscv-isa-sim)) are compiled under `work`. Each command will also generate a script or an executable (e.g. `bin/snitch_cluster.vsim`) which you can invoke to simulate the hardware. We will see how to do this in a later section. -The variable `DEBUG=ON` is used to preserve the visibility of all the internal signals during simulation. - -### Building the Banshee simulator -Instead of running an RTL simulation, you can use our instruction-accurate simulator called `banshee`. To install the simulator, please follow the instructions of the Banshee repository: [https://github.com/pulp-platform/banshee](https://github.com/pulp-platform/banshee). - -### Cluster configuration - -Note that the Snitch cluster RTL sources are partly automatically generated from a configuration file provided in `.hjson` format. Several RTL files are templated and use the `.hjson` configuration file to fill the template entries. An example is `/hw/snitch_cluster/src/snitch_cluster_wrapper.sv.tpl`. - -Under the `cfg` folder, different configurations are provided. The `cfg/default.hjson` configuration instantiates 8 compute cores + 1 DMA core in the cluster. If you need a specific configuration you can create your own configuration file. - -The command you executed previously automatically generated the templated RTL sources. It implicitly used the default configuration file. -To override the default configuration file, define the following variable when you invoke `make`: -```shell -make CFG_OVERRIDE=cfg/custom.hjson bin/snitch_cluster.vlt -``` - -___Note:__ whenever you override the configuration file on the `make` command-line, the configuration will be stored in the `cfg/lru.hjson` file. Successive invocations of `make` will automatically pick up the `cfg/lru.hjson` file. You can therefore omit the `CFG_OVERRIDE` definition in successive commands unless you want to override the least-recently used configuration._ - -Banshee uses also a cluster configuration file, however, that is given directly when simulating a specific binary with banshee with the help of `--configuration `. 
- -### Building the software - -To build all of the software for the Snitch cluster, run the following command: - -```bash -# for RTL simulation -make DEBUG=ON sw - -# for Banshee simulation (requires slightly different runtime) -make SELECT_RUNTIME=banshee DEBUG=ON sw - -# to use OpenOCD semi-hosting for putchar and termination -make DEBUG=ON OPENOCD_SEMIHOSTING=ON sw -``` - -The `sw` target first generates some C header files which depend on the hardware configuration. Hence, the need to generate the software for the same configuration as your hardware. Afterwards, it recursively invokes the `make` target in the `sw` subdirectory to build the apps/kernels which have been developed in that directory. - -The `DEBUG=ON` flag is used to tell the compiler to produce debugging symbols. It is necessary for the `annotate` target, showcased in the Debugging section of this guide, to work. - -The `SELECT_RUNTIME` flag is set by default to `rtl`. To build the software with the Banshee runtime, set the flag to `banshee`. - -___Note:__ the RTL is not the only source which is generated from the configuration file. The software stack also depends on the configuration file. Make sure you always build the software with the same configuration of the hardware you are going to run it on._ - -___Note:__ on GVSOC, it is better to use OpenOCD semi-hosting to prevent putchar from disturbing the DRAMSys timing model._ - -### Running a simulation - -Run one of the executables which was compiled in the previous step on your Snitch cluster simulator of choice: - -```shell -# Verilator -bin/snitch_cluster.vlt sw/apps/blas/axpy/build/axpy.elf - -# Questa -bin/snitch_cluster.vsim sw/apps/blas/axpy/build/axpy.elf - -# VCS -bin/snitch_cluster.vcs sw/apps/blas/axpy/build/axpy.elf - -# Banshee -banshee --no-opt-llvm --no-opt-jit --configuration src/banshee.yaml --trace sw/apps/blas/axpy/build/axpy.elf -``` - -The Snitch cluster simulator binaries can be invoked from any directory, just adapt the relative paths in the preceding commands accordingly, or use absolute paths. We refer to the working directory where the simulation is launched as the simulation directory. Within it, you will find several log files produced by the RTL simulation. - -The previous commands will launch the simulation on the console. QuestaSim simulations can also be launched with the QuestaSim GUI, by adapting the previous command to: - -```shell -# Questa -bin/snitch_cluster.vsim.gui sw/apps/blas/axpy/build/axpy.elf -``` - -For Banshee, you need to give a specific cluster configuration to the simulator with the flag `--configuration `. A default Snitch cluster configuration is given (`src/banshee.yaml`). The flag `--trace` enables the printing of the traces similar to the RTL simulation. -For more information and debug options, please have a look at the Banshee repository: [https://github.com/pulp-platform/banshee](https://github.com/pulp-platform/banshee). - -### Creating your first Snitch app - -In the following you will create your own AXPY kernel implementation as an example how to develop software for Snitch. 
- -#### Writing the C Code - -Create a directory for your AXPY kernel under `sw/`: - -```bash -mkdir sw/apps/axpy -``` - -And a `src` subdirectory to host your source code: - -```bash -mkdir sw/apps/axpy/src -``` - -Here, create a new file named `axpy.c` inside the `src` directory with the following contents: - -```C -#include "snrt.h" -#include "data.h" - -// Define your kernel -void axpy(uint32_t l, double a, double *x, double *y, double *z) { - for (uint32_t i = 0; i < l ; i++) { - z[i] = a * x[i] + y[i]; - } - snrt_fpu_fence(); -} - -int main() { - // Read the mcycle CSR (this is our way to mark/delimit a specific code region for benchmarking) - uint32_t start_cycle = snrt_mcycle(); - - // DM core does not participate in the computation - if(snrt_is_compute_core()) - axpy(L, a, x, y, z); - - // Read the mcycle CSR - uint32_t end_cycle = snrt_mcycle(); -} - -``` - -The `snrt.h` file implements the snRuntime API, a library of convenience functions to program Snitch cluster based systems. These sources are located under `target/snitch_cluster/sw/runtime/rtl` and are automatically referenced by our compilation scripts. - -___Note:__ Have a look at the files inside `sw/snRuntime` in the root of this repository to see what kind of functionality the snRuntime API defines. Note this is only an API, with some base implementations. The Snitch cluster implementation of the snRuntime for RTL simulation can be found under `target/snitch_cluster/sw/runtime/rtl`. It is automatically built and linked with user applications thanks to our compilation scripts._ - -We will have to instead create the `data.h` file ourselves. Create a `target/snitch_cluster/sw/apps/axpy/data` folder to host the data for your kernel to operate on: - -```bash -mkdir sw/apps/axpy/data -``` - -Here, create a C file named `data.h` with the following contents: - -```C -uint32_t L = 16; - -double a = 2; - -double x[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}; - -double y[16] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; - -double z[16]; - -``` - -In this file we hardcode the data to be used by the kernel. This data will be loaded in memory together with your application code. In general, to verify your code you may want to randomly generate the above data. You may also want to test your kernel on different problem sizes, e.g. varying the length of the vectors, without having to manually rewrite the file. This can be achieved by generating the data header file with a Python script. You may have a look at the `sw/blas/axpy/scripts/datagen.py` script in the root of this repository as an example. As you can see, it reuses many convenience classes and functions for data generation from the `data_utils` module. Documentation for this module can be found [here](https://pulp-platform.github.io/snitch_cluster/rm/sim/data_utils.html). - -#### Compiling the C Code - -In your `axpy` folder, create a new file named `Makefile` with the following contents: - -```make -APP = axpy -SRCS = src/axpy.c -INCDIRS = data - -include ../common.mk -``` - -This Makefile will be invoked recursively by the top-level Makefile, compiling your source code into an executable with the name provided in the `APP` variable. 
- -In order for the top-level Makefile to find your application, add your application's directory to the `APPS` variable in `sw.mk`: - -``` -APPS += sw/apps/axpy -``` - -Now you can recompile all software, including your newly added AXPY application: - -```shell -make DEBUG=ON sw -``` - -Note, only the targets depending on the sources you have added/modified will be recompiled. - -In the `sw/apps/axpy/build` directory, you will now find your `axpy.elf` executable and some other files which were automatically generated to aid debugging. Open `axpy.dump` and search for ``, `` and ``. You will see the addresses where the respective vectors defined in `data.h` have been allocated by the compiler. This file can also be very useful to see what assembly instructions your source code was compiled to, and correlate the traces (we will later see) with the source code. - -If you want to dig deeper into how our build system works and how these files were generated you can follow the recursive Makefile invocations starting from the `sw` target in `snitch_cluster/Makefile`. - -#### Run your application - -You can run your application in simulation as shown in the previous sections. Make sure to pick up the right binary, e.g.: - -```shell -bin/snitch_cluster.vsim sw/apps/axpy/build/axpy.elf -``` - -### Debugging and benchmarking - -When you run the simulation, every core will log all the instructions it executes (along with additional information, such as the value of the registers before/after the instruction) in a trace file. The traces are located in the `logs` folder within the simulation directory. The traces are identified by their hart ID, that is a unique ID for every hardware thread (hart) in a RISC-V system (and since all our cores have a single thread that is a unique ID per core). - -The simulation logs the traces in a non-human readable format with `.dasm` extension. To convert these to a human-readable form run: - -```bash -make -j traces -``` - -If the simulation directory does not coincide with the current working directory, you will have to specify the path explicitly: - -```bash -make -j traces SIM_DIR= -``` - -Detailed information on how to interpret the generated traces can be found [here](../../docs/ug/trace_analysis.md). - -In addition to generating readable traces (`.txt` format), the above command also computes several performance metrics from the trace and appends them at the end of the trace. These can be collected into a single CSV file with the following target: - -```bash -make logs/perf.csv -# View the CSV file -libreoffice logs/perf.csv -``` - -In this file you can find the `X_tstart` and `X_tend` metrics. These are the cycles in which a particular code region `X` starts and ends, and can hence be used to profile your code. Code regions are defined by calls to `snrt_mcycle()`. Every call to this function defines two code regions: -- the code preceding the call, up to the previous `snrt_mcycle()` call or the start of the source file -- the code following the call, up to the next `snrt_mcycle()` call or the end of the source file - -The CSV file can be useful to automate collection and post-processing of benchmarking data. - -Finally, debugging your program from the trace alone can be quite tedious and time-consuming. You would have to manually understand which instructions in the trace correspond to which lines in your source code. Surely, you can help yourself with the disassembly. - -Alternatively, you can automatically annotate the traces with that information. 
With the following commands you can view the trace instructions side-by-side with the corresponding source code lines they were compiled from: - -```bash -make -j annotate -kompare -o logs/trace_hart_00000.diff -``` - -If you prefer to view this information in a regular text editor (e.g. for search), you can open the `logs/trace_hart_xxxxx.s` files. Here, the annotations are interleaved with the trace rather than being presented side-by-side. - -___Note:__ the `annotate` target uses the `addr2line` binutil behind the scenes, which needs debugging symbols to correlate instruction addresses with originating source code lines. The `DEBUG=ON` flag you specified when building the software is used to tell the compiler to produce debugging symbols when compiling your code._ - -The traces contain a lot of information which we might not be interested at first. To simply visualize the runtime of the compute region in our code, first create a file named `layout.csv` in `sw/apps/axpy` with the following contents: - -``` - , compute -"range(0,8)", 1 -8 , - -``` - -Then run the following commands: - -```bash -# Similar to logs/perf.csv but filters all but tstart and tend metrics -make logs/event.csv -# Labels, filters and reorders the event regions as specified by an application-specific layout file -../../util/trace/layout_events.py logs/event.csv sw/apps/axpy/layout.csv -o logs/trace.csv -# Creates a trace file which can be visualized with Chrome's TraceViewer -../../util/trace/eventvis.py -o logs/trace.json logs/trace.csv -``` - -Go to `http://ui.perfetto.dev/`. Here you can load the `logs/trace.json` file and graphically view the runtime of the compute region in your code. To learn more about the layout file syntax and what the Python scripts do you can have a look at the description comment at the start of the scripts themselves. - -__Great, but, have you noticed a problem?__ - -Look into `sw/apps/axpy/build/axpy.dump` and search for the address of the output variable `` : - -``` -Disassembly of section .bss: - -80000960 : - ... -``` - -Now grep this address in your traces: - -```bash -grep 80000960 logs/*.txt -... -``` - -It appears in every trace! All the cores issue a `fsd` (float store double) to this address. You are not parallelizing your kernel but executing it 8 times! - -Modify `sw/apps/axpy/src/axpy.c` to truly parallelize your kernel: - -```C -#include "snrt.h" -#include "data.h" - -// Define your kernel -void axpy(uint32_t l, double a, double *x, double *y, double *z) { - int core_idx = snrt_cluster_core_idx(); - int offset = core_idx * l; - - for (int i = 0; i < l; i++) { - z[offset] = a * x[offset] + y[offset]; - offset++; - } - snrt_fpu_fence(); -} - -int main() { - // Read the mcycle CSR (this is our way to mark/delimit a specific code region for benchmarking) - uint32_t start_cycle = snrt_mcycle(); - - // DM core does not participate in the computation - if(snrt_is_compute_core()) - axpy(L / snrt_cluster_compute_core_num(), a, x, y, z); - - // Read the mcycle CSR - uint32_t end_cycle = snrt_mcycle(); -} -``` - -Now re-run your kernel and compare the execution time of the compute region with the previous version. - -## Code Reuse - -As you may have noticed, there is a good deal of code which is independent of the hardware platform we execute our AXPY kernel on. This is true for the `data.h` file and possible data generation scripts. 
The Snitch AXPY kernel itself is not specific to the Snitch cluster, but can be ported to any platform which provides an implementation of the snRuntime API. An example is Occamy, with its own testbench and SW development environment. - -It is thus preferable to develop the data generation scripts and Snitch kernels in a shared location, from which multiple platforms can take and include the code. The `sw` directory in the root of this repository was created with this goal in mind. For the AXPY example, shared sources are hosted under the `sw/blas/axpy` directory. As an example of how these shared sources are used to build an AXPY application for a specific platform (in this case the standalone Snitch cluster) you can have a look at the `target/snitch_cluster/sw/apps/blas/axpy`. - -We recommend that you follow this approach also in your own developments for as much of the code which can be reused. +You can find information on how to build and simulate the Snitch cluster in the dedicated [tutorial](https://pulp-platform.github.io/snitch_cluster/ug/tutorial.html). diff --git a/util/container/README.md b/util/container/README.md index 714e55a13..05a136926 100644 --- a/util/container/README.md +++ b/util/container/README.md @@ -10,7 +10,7 @@ There is a pre-built version of the container available online. This version is To download the container, first login to the GitHub container registry: ```shell -$ docker login ghcr.io +docker login ghcr.io ``` You will be asked for a username (your GitHub username). As a password you should use a @@ -19,16 +19,15 @@ that at least has package registry read permission. You can then install the container by running: ```shell -$ docker pull ghcr.io/pulp-platform/snitch_cluster:main +docker pull ghcr.io/pulp-platform/snitch_cluster:main ``` ### Build instructions -In case you cannot use the pre-built container, e.g. if you need to make changes to the Dockerfile, you can build the -container locally by running the following command in the root of the repository: +In case you cannot use the pre-built container, e.g. if you need to make changes to the Dockerfile, you can build the container locally by running the following command in the root of the repository: ```shell -$ sudo docker buildx build -t ghcr.io/pulp-platform/snitch_cluster:main -f util/container/Dockerfile . +sudo docker buildx build -t ghcr.io/pulp-platform/snitch_cluster:main -f util/container/Dockerfile . ``` ## Usage @@ -36,7 +35,7 @@ $ sudo docker buildx build -t ghcr.io/pulp-platform/snitch_cluster:main -f util/ To run the container in interactive mode: ```shell -$ docker run -it -v $REPO_TOP:/repo -w /repo ghcr.io/pulp-platform/snitch_cluster:main +docker run -it -v :/repo -w /repo ghcr.io/pulp-platform/snitch_cluster:main ``` ## Limitations