From 8e5f1c0323e3d135c11b1dd202fc58d70daed587 Mon Sep 17 00:00:00 2001 From: Arne Symons Date: Tue, 19 Nov 2024 22:22:17 +0100 Subject: [PATCH] add README to labs 1-3 and fix lab3 accelerators bandwidths --- lab1/README.md | 55 ++++++++ lab2/README.md | 67 ++++++++++ lab3/README.md | 168 +++++++++++++++++++++++++ lab3/inputs/hardware/accelerator1.yaml | 8 +- lab3/inputs/hardware/accelerator2.yaml | 6 +- lab3/inputs/hardware/accelerator3.yaml | 6 +- 6 files changed, 300 insertions(+), 10 deletions(-) create mode 100644 lab1/README.md create mode 100644 lab2/README.md create mode 100644 lab3/README.md diff --git a/lab1/README.md b/lab1/README.md new file mode 100644 index 00000000..ea7c459d --- /dev/null +++ b/lab1/README.md @@ -0,0 +1,55 @@ +# Lab 1: First Run of the ZigZag Framework + +## Objective +The goal of this lab is to perform the first run of the ZigZag framework. You will execute the first layer of ResNet-18 on a defined accelerator configuration with a constrained mapping. + +## Setup +1. Ensure you have installed the requirements in `requirements.txt`. +2. Make sure you are in the base directory, as `lab1/main.py` automatically inserts this into the PATH, which is needed for the ZigZag imports. + +## Inputs +There are three main inputs defined in the `inputs/` folder: +1. **Workload**: The first layer of ResNet-18 in ONNX format. The layer name is `Conv1`. You can use [Netron](https://netron.app) to visualize the model. +2. **Hardware**: A sample accelerator is encoded in `accelerator1.yaml`. This accelerator contains a 32x32 array of operational units with a hierarchy of memories attached, which store the different memory operands `I1`, `I2` and `O`. +3. **Mapping**: The mapping specifies for the `Conv1` layer both the spatial mapping and the temporal loop ordering. The spatial mapping links to the dimensions of the operational array defined in the hardware. The temporal loop ordering specifies the order of the loops from inner to outer. Additionally, the mapping specifies the operand links, which tie the layer operands to the memory operands (an illustrative sketch of such a mapping entry is given after the homework below). + +## Running the Experiment +Run the main file: + ``` + python lab1/main.py + ``` + +The mapping is fixed in both the spatial and temporal domains, resulting in a single cost model evaluation (CME). + +## Outputs +The results of the experiment will be saved in the `outputs/` folder. + +## Homework + +- Take a look inside the ZigZag API call in `zigzag/api.py`. Do you understand the meaning of all the defined stages and all arguments passed to these stages? +  > <details>
+ > <summary>Answer</summary> + > + > You can read more information on the different stages [here](https://kuleuven-micas.github.io/zigzag/stages.html). Each stage performs a different function, ranging from parsing inputs to generating temporal mappings to evaluating the cost model. Others filter multiple mappings to only keep the best one(s), or make sure results can be aggregated across multiple layers in a robust way. + > + > </details>
+ - How does the fixed temporal ordering in `lab1/inputs/mapping/mapping.yaml` match with the produced temporal mapping? What information did you not give as an input but was inferred by the framework? +  > <details>
+ > <summary>Answer</summary> + > + > The LOMA engine inside of the `TemporalMappingGeneratorStage` takes in the defined `temporal_ordering` and allocates the different temporal loops from inner to outer to the memories in the hierarchy. This is the extra information you see in the printed mapping: for every operand and every loop, it shows the memory level it was allocated to. + > + > </details>
+ - Analyze the fields of `lab1/outputs/tpu_like-resnet18_first_layer/Conv1_complete.json`. How much energy went to memory reads/writes versus operations? +  > <details>
+ > <summary>Answer</summary> + > + > The json contains the following fields: + > "operational_energy": 4720558.08 + > "memory_energy": 2637751874.296 + > + > As such, the memory reads/writes account for 99.8% of the total energy. Of course this value heavily depends on the defined `unit_energy` for operations and the defined read and write energy cost of the memories. + > + > </details>
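To make the mapping input described above more concrete, the sketch below shows what a fully constrained mapping entry for `Conv1` could look like. The key names (`spatial_mapping`, `temporal_ordering`, `memory_operand_links`) and the loop factors are illustrative assumptions based on the description in the Inputs section; the actual definition used in this lab is in `lab1/inputs/mapping/mapping.yaml`.

```yaml
# Hypothetical, fully fixed mapping for the Conv1 layer of ResNet-18
# (7x7 kernel, C=3 input channels, K=64 output channels, 112x112 outputs).
# Key names and loop factors are illustrative; check
# lab1/inputs/mapping/mapping.yaml for the definition actually used.
- name: Conv1
  spatial_mapping:            # layer dimension unrolled per operational array dimension
    D1:
      - K, 32                 # 32 output channels in parallel across D1
    D2:
      - C, 3                  # all 3 input channels in parallel across D2
  temporal_ordering:          # loop order from innermost to outermost
    - [FX, 7]
    - [FY, 7]
    - [OX, 112]
    - [OY, 112]
    - [K, 2]                  # remaining 64/32 = 2 output-channel iterations
  memory_operand_links:       # layer operand -> memory operand
    O: O
    W: I2
    I: I1
```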
diff --git a/lab2/README.md b/lab2/README.md new file mode 100644 index 00000000..c2aa42dd --- /dev/null +++ b/lab2/README.md @@ -0,0 +1,67 @@ +# Lab 2: Automating the temporal mapping + +## Objective +The goal of this lab is to have ZigZag generate multiple temporal mappings automatically, and only return the best one it found. + +Keep in mind that each evaluated mapping is unique in its loop ordering and memory allocation. With traditional simulation, this would take orders of magnitude longer, as it would require 1. encoding each mapping as a different control flow and 2. running cycle-accurate simulations to obtain the performance and switching activity. The trade-off is that the analytical cost model makes simplifying assumptions on both the hardware and the mapping of the workload onto its resources. + +## Setup +1. Ensure you have installed the requirements in `requirements.txt`. +2. Make sure you are in the base directory, as `lab2/main.py` automatically inserts this into the PATH, which is needed for the ZigZag imports. + +## Inputs +There are three main inputs defined in the `inputs/` folder: +1. **Workload**: _[Same as lab1]_ The first layer of ResNet-18 in ONNX format. The layer name is `Conv1`. You can use [Netron](https://netron.app) to visualize the model. +2. **Hardware**: _[Same as lab1]_ A sample accelerator is encoded in `accelerator1.yaml`. This accelerator contains a 32x32 array of operational units with a hierarchy of memories attached, which store the different memory operands `I1`, `I2` and `O`. +3. **Mapping**: The mapping specifies for the `Conv1` layer only the spatial mapping (an illustrative sketch is given after the homework below). The `TemporalMappingGeneratorStage` automatically detects that there is no user-defined temporal loop ordering and generates multiple temporal mappings to be evaluated by the cost model. + +## Running the Experiment +Run the main file: + ``` + python lab2/main.py + ``` + +As only the spatial mapping is fixed, there will be multiple cost model evaluations. Progress is shown through a bar, where the numbers on the right indicate the number of evaluated mappings and the total number of mappings to be evaluated. + +## Outputs +The results of the experiment will be saved in the `outputs/` folder. + +## Homework + +- What does the API call optimize for? Try changing this to a different valid criterion and analyze the impact on the performance. +  > <details>
+ > <summary>Answer</summary> + > + > The API call optimizes for minimal latency, defined through the `optimization_criterion` in the main file. + > + > Other valid criteria are `energy` and `EDP` (energy-delay product). A custom criterion requires manual implementation of a custom `Stage` which filters cost model evaluations to only return the one that optimizes the custom criterion. + > + > **Tip:** When trying different criteria, change the `experiment_id` to automatically save the results to a different folder and easily compare them. + > + > </details>
+ - How does the `TemporalMappingGeneratorStage` detect that there is no user-defined temporal loop ordering? +  > <details>
+ > <summary>Answer</summary> + > + > The `WorkloadFactory` checks for each layer if there is a user-defined temporal ordering defined in the mapping file. If so, it saves it as the `temporal_ordering` attribute of the layer. The `TemporalMappingGeneratorStage` gets this attribute and passes it to the underlying `LomaEngine`, which can be seen [here](https://github.com/KULeuven-MICAS/zigzag/blob/b8a523b10215eef8f82ad4eff3be9d17446457ed/zigzag/stages/mapping/temporal_mapping_generator_stage.py#L58). The engine is responsible for generating valid temporal mappings, i.e. mappings in which the different loops are allocated to memory levels, starting from the provided user-defined temporal ordering or any other constraints. + > + > </details>
+ - What is the difference in performance (latency) compared to the user-defined temporal ordering? +  > <details>
+ > <summary>Answer</summary> + > + > Compare the latency of the best mapping found here with the latency obtained in lab 1, where the temporal ordering was fixed by the user. Because the `TemporalMappingGeneratorStage` explores many loop orderings and only the best one with respect to the optimization criterion is returned, the automatically found mapping should perform at least as well as the user-defined ordering, and typically better. The exact numbers can be read from the results saved in the `outputs/` folder of both labs. + > + > </details>
+ - How would you modify the mapping file to also automatically optimize the spatial mapping? +  > <details>
+ > <summary>Answer</summary> + > + > Just like for the temporal ordering, you can simply remove the defined spatial mapping from the mapping file. Then, the `SpatialMappingGeneratorStage` will automatically generate a number of spatial mappings. For each generated spatial mapping, the same flow runs as before: multiple temporal mappings are evaluated and filtered to return the best one with respect to the optimization criterion. + > + > The standard number of spatial mappings evaluated is 3, which are those with the highest spatial utilization. This can be increased or reduced by passing a different `nb_spatial_mappings_generated` to the API call. + > + > </details>
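To see what fixing only the spatial mapping looks like in practice, here is a sketch of a lab-2-style mapping entry. The key names mirror the description above but are illustrative assumptions; compare with `lab2/inputs/mapping/mapping.yaml` for the real file used in this lab.

```yaml
# Illustrative mapping entry that fixes only the spatial mapping. Because no
# temporal_ordering is given, the TemporalMappingGeneratorStage generates and
# evaluates temporal mappings automatically. Key names and loop factors are
# assumptions; see lab2/inputs/mapping/mapping.yaml for the real definition.
- name: Conv1
  spatial_mapping:
    D1:
      - K, 32
    D2:
      - C, 3
  # temporal_ordering is intentionally omitted -> generated by ZigZag
  memory_operand_links:
    O: O
    W: I2
    I: I1
```

Removing the `spatial_mapping` entry as well would additionally trigger the `SpatialMappingGeneratorStage`, as discussed in the last homework answer.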
diff --git a/lab3/README.md b/lab3/README.md new file mode 100644 index 00000000..0c7a97df --- /dev/null +++ b/lab3/README.md @@ -0,0 +1,168 @@ +# Lab 3: Assess the impact of different accelerators + +## Objective +The goal of this lab is to get more familiar with the relationship between the hardware architecture and the spatial mapping in ZigZag. + +## Setup +1. Ensure you have installed the requirements in `requirements.txt`. +2. Make sure you are in the base directory, as `lab3/main.py` automatically inserts this into the PATH, which is needed for the ZigZag imports. + +## Inputs +There are three main inputs defined in the `inputs/` folder: +1. **Workload**: _[Same as lab1/2]_ The first layer of ResNet-18 in ONNX format. The layer name is `Conv1`. You can use [Netron](https://netron.app) to visualize the model. +2. **Hardware**: Three different accelerator architectures: `accelerator1.yaml`, `accelerator2.yaml` and `accelerator3.yaml`. All three contain the same total number of operational units, with similar memories organized in varying hierarchies. +3. **Mapping**: The mapping for the three different accelerator architectures. + +## Understanding the hardware architecture + +Before we run the experiment, let's dive into the definition of `accelerator1.yaml` in more detail. A hardware architecture contains two major components: the `operational_array` and the `memories`. + +### Operational array +The `operational_array` specifies an array of 'operational units', where we abstract away what the operation exactly is. It can represent a multiplication, a multiply-accumulate, a division, etc. All that matters for ZigZag is the energy cost of an operation (`unit_energy`) and the area of a unit (`unit_area`). An N-dimensional array is then constructed using the `dimensions` and `sizes` fields, where each dimension is denoted `Dx` with `x in [1, N]` and `sizes` is a list of equal length giving the size of each dimension. The total number of units in the array is thus the product of the entries in `sizes`. + +The reason we allow N-dimensional arrays is that this lets us flexibly interconnect the units to the lowest levels of the memory hierarchy, which we discuss next. + +### Memories + +The `memories` entry specifies a number of `MemoryLevel`s. Each level gets its name from the key of the entry and has various fields. Most fields are fairly self-explanatory; you can find more information [here](https://kuleuven-micas.github.io/zigzag/hardware.html#memory-instance). Here, we focus mostly on the `served_dimensions` attribute to understand its link with the spatial mapping. + +It is important to keep in mind that each `MemoryLevel` can consist of one or more `MemoryInstance`s, which are unrolled with a specific replication pattern. This pattern is encoded through the `served_dimensions` attribute. It specifies which dimensions of the `operational_array` a single instance of this level will serve. If the attribute is the empty list, this means that a single instance doesn't serve any dimensions, but rather is replicated alongside each unit. If there are dimensions listed, it means a single instance is interconnected to all units across those dimensions. Thus, the number of instances in a level always equals the product of the dimension sizes not present in `served_dimensions`. + +## Example `served_dimensions` for `accelerator1.yaml` + +Let's make this concrete using the `accelerator1.yaml` architecture description. First, we focus on the three lowest memory levels that are added.
These three levels each store one memory operand: `I1`, `I2` and `O`, respectively. + +**Note:** The drawings below are simplified to a 2x2 operational array. + +The `I1` lowest memory level looks as follows: +``` + Dimension + D2 + ◄─────────────────► +┌──────────┐ +│ ┼──────────┬──────────┐ +│ rf_1B_I1 │ │ │ +│ │ ▼ ▼ +└──────────┘ ┌──────┐ ┌──────┐ ▲ + │ OP │ │ OP │ │ + └──────┘ └──────┘ │ +┌──────────┐ │ +│ ┼──────────┬──────────┐ │ Dimension +│ rf_1B_I1 │ │ │ │ D1 +│ │ ▼ ▼ │ +└──────────┘ ┌──────┐ ┌──────┐ │ + │ OP │ │ OP │ │ + └──────┘ └──────┘ ▼ +``` + +As can be seen, each `rf_1B_I1` instance serves all operational units across dimension `D2`. There is thus 1 instance for this 2x2 example. + +The `I2` lowest memory level: +``` + Dimension + D2 + ◄──────────────────► +┌────────┐ ┌────────┐ +│ │ │ │ +│rf_1B_I2│ │rf_1B_I2│ +│ │ │ │ +└──┬─────┘ └──┬─────┘ + │ │ + ▼ ▼ + ┌──────┐ ┌──────┐ ▲ + │ OP │ │ OP │ │ + └──────┘ └──────┘ │ +┌────────┐ ┌────────┐ │ +│ │ │ │ │ +│rf_1B_I2│ │rf_1B_I2│ │ +│ │ │ │ │ Dimension +└──┬─────┘ └──┬─────┘ │ D1 + │ │ │ + ▼ ▼ │ + ┌──────┐ ┌──────┐ │ + │ OP │ │ OP │ │ + └──────┘ └──────┘ ▼ +``` + +Each `rf_1B_I2` serves a single operational unit. There are thus 4 instances for this 2x2 example. + +The `O` lowest memory level: +``` + Dimension + D2 +◄──────────────────► + + ┌──────┐ ┌──────┐ ▲ + │ OP │ │ OP │ │ + └───┬──┘ └────┬─┘ │ + └────┐ └───┐ │ Dimension + ┌──────┐ │ ┌──────┐ │ │ D1 + │ OP │ │ │ OP │ │ │ + └─┬────┘ │ └─┬────┘ │ ▼ + │ ┌────┘ │ ┌────┘ + ▼ ▼ ▼ ▼ +┌───────┐ ┌───────┐ +│ │ │ │ +│rf_4B_O│ │rf_4B_O│ +│ │ │ │ +└───────┘ └───────┘ +``` + +Each `rf_4B_O` serves the operational units across dimension `D1`. There are thus 2 instances for this 2x2 example. Note that while in reality this architecture would have an adder tree to sum up the outputs coming from different units in a column, this is abstracted out in the framework. + +The higher memory levels automatically connect to these lower memory levels. These higher memory levels typically aren't unrolled (although the representation allows it); as such, they contain more array dimensions in their `served_dimensions` attribute. + +## Relationship between memory interconnection and spatial mapping + +The interconnection pattern as explained above ties in closely with the potential spatial mappings. For example, the `rf_1B_I1` level, which through the mapping is linked to the `I` operand of the `Conv1` layer, has a read bandwidth of 8 bits. This means that within a clock cycle, only a single `I` element (assuming 8 bit precision) can be read out. Thus, the operational units across `D2` must all require the same input element, and as such only `irrelevant` dimensions of the `I` operand can be assigned to the `D2` dimension. The spatial mapping encoded in `inputs/mapping/accelerator1.yaml` unrolls the `K` dimension (output channels) across `D2`. + +## Running the Experiment +Run the main file: + ``` + python lab3/main.py + ``` + +ZigZag will optimize the temporal mapping for all three accelerator architectures with their defined spatial mappings. + +## Outputs +The results of the experiment will be saved in the `outputs/` folder. + +## Homework + +- Try drawing the lowest memory levels for the `accelerator2.yaml` architecture description. Which level has the most instances? +  > <details>
+ > <summary>Answer</summary> + > + > The output memory operand `O` has the most instances in the lowest level. Its `served_dimensions` attribute is empty, which means that there will be 32x32 instances of the output RF. This is typically referred to as an 'output-stationary' dataflow (in combination with output reuse in these RFs). + > + > </details>
+ - How do you define a memory level with only a single instance? +  > <details>
+ > <summary>Answer</summary> + > + > A memory level with a single instance is defined by specifying all dimensions of the operational array in the `served_dimensions` attribute (e.g. `[D1, D2]` for a 32x32 array). + > + > </details>
+ - Which accelerator architecture has the best latency? What causes the other ones to be worse? +  > <details>
+ > <summary>Answer</summary> + > + > `accelerator2` has the best latency. This is mainly because the output RF `rf_4B_O` is unrolled 32x32 times. This RF has a higher capacity than the other two RFs, and a higher bandwidth. Thus, output data can be reused longer in the array, which avoids memory stalls due to insufficient bandwidths at higher memory levels. This can be checked by looking at the `mac_utilization` field of the complete output json `Conv1_complete.json`. For `accelerator3`, for example, the `ideal` utilization (without taking memory stalls into account) is 67%, but when taking stalls into account, this drops to 31%. Meanwhile, the ideal utilization of `accelerator2` is 87% due to its better spatial mapping, and there are no memory stalls. + > + > </details>
+ - Increase the memory size of `rf_1B_I2` of `accelerator1.yaml`. Does the latency become better than that of `accelerator2.yaml`? +  > <details>
+ > <summary>Answer</summary> + > + > Increasing the size of this memory allows for more data reuse. At the baseline of 8 bits, the layer has a latency of `2.46e6` cycles. At 32 bits, this decreases to `1.85e6`, and at 1024 bits it further decreases to `1.23e6`. Increasing it more doesn't decrease the latency. This is due to the fundamental limitation of the spatial mapping: the `C` dimension only has a size of 3, which means the utilization can never become better than 3/32, which is roughly 9%. On the other hand, the `OX` and `K` dimensions are large enough to completely unroll across the operational array of `accelerator2.yaml`. + > </details>
+ - What is the spatial utilization of `accelerator2.yaml` and why? +  > <details>
+ > <summary>Answer</summary> + > + > The utilization, as mentioned in previous answers, is 87.5%. The reason it's not 100% is a mismatch between the operational array dimension `D1` and the layer dimension unrolled across it: `OX`. `OX` is 112, so the closest factor we can unroll spatially is 28, as opposed to 32. This is equivalent to a 'greedy' mapping strategy of 32, 32, 32 and a remainder of 16, where we would still need 4 temporal iterations. + > </details>
\ No newline at end of file diff --git a/lab3/inputs/hardware/accelerator1.yaml b/lab3/inputs/hardware/accelerator1.yaml index 895a9728..19c6b3ac 100644 --- a/lab3/inputs/hardware/accelerator1.yaml +++ b/lab3/inputs/hardware/accelerator1.yaml @@ -7,8 +7,8 @@ operational_array: sizes: [32, 32] memories: - rf_1B: - size: 8 + rf_1B_I2: + size: 16384 r_bw: 8 w_bw: 8 r_cost: 0.095 # TODO @@ -25,7 +25,7 @@ memories: tl: r_port_1 served_dimensions: [] # Fully unrolled over all multipliers - rf_1B: + rf_1B_I1: size: 8 r_bw: 8 w_bw: 8 @@ -43,7 +43,7 @@ memories: tl: r_port_1 served_dimensions: [D2] # One RF per column - rf_4B: + rf_4B_O: size: 32 r_bw: 32 w_bw: 32 diff --git a/lab3/inputs/hardware/accelerator2.yaml b/lab3/inputs/hardware/accelerator2.yaml index f80cee25..f39ab6f8 100644 --- a/lab3/inputs/hardware/accelerator2.yaml +++ b/lab3/inputs/hardware/accelerator2.yaml @@ -7,7 +7,7 @@ operational_array: sizes: [32, 32] memories: - rf_1B: + rf_1B_I2: size: 8 r_bw: 8 w_bw: 8 @@ -25,7 +25,7 @@ memories: tl: r_port_1 served_dimensions: [D1] # One per column - rf_1B: + rf_1B_I1: size: 8 r_bw: 8 w_bw: 8 @@ -43,7 +43,7 @@ memories: tl: r_port_1 served_dimensions: [D2] # One per row - rf_4B: + rf_4B_O: size: 32 r_bw: 32 w_bw: 32 diff --git a/lab3/inputs/hardware/accelerator3.yaml b/lab3/inputs/hardware/accelerator3.yaml index 1ac328e5..544f4e66 100644 --- a/lab3/inputs/hardware/accelerator3.yaml +++ b/lab3/inputs/hardware/accelerator3.yaml @@ -7,7 +7,7 @@ operational_array: sizes: [64, 4, 4] memories: - rf_1B: + rf_1B_I2: size: 8 r_bw: 8 w_bw: 8 @@ -25,7 +25,7 @@ memories: tl: r_port_1 served_dimensions: [] # One per PE - rf_1B: + rf_1B_I1: size: 8 r_bw: 8 w_bw: 8 @@ -43,7 +43,7 @@ memories: tl: r_port_1 served_dimensions: [] # One per PE - rf_4B: + rf_4B_O: size: 32 r_bw: 32 w_bw: 32
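The hardware diffs above rename the register files so that the memory operand each one stores is explicit in its name. As a quick reference for the lab 3 homework, the sketch below summarizes how `served_dimensions` determines the number of instances of a memory level. It loosely follows the field layout of `accelerator1.yaml`, but the cost values are placeholders and several fields of the real schema are omitted, so treat it as an illustration rather than one of the lab's architectures.

```yaml
# Illustrative memory level for a D1=32 x D2=32 operational array. The number
# of instances of a level equals the product of the array dimensions that are
# NOT listed in served_dimensions. Cost values are placeholders; several fields
# of the real schema (area, port counts, latency, ...) are omitted for brevity.
memories:
  rf_example:
    size: 8                       # bits
    r_bw: 8
    w_bw: 8
    r_cost: 0.01                  # placeholder, not a value from the labs
    w_cost: 0.01                  # placeholder, not a value from the labs
    operands: [I2]
    ports:
      - fh: w_port_1              # filled from the level above
        tl: r_port_1              # read towards the operational units
    # served_dimensions: []        -> 32 x 32 = 1024 instances (one per unit)
    # served_dimensions: [D2]      -> 32 instances (each serves all units along D2)
    # served_dimensions: [D1, D2]  -> 1 instance (serves the whole array)
    served_dimensions: []
```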