merge master
RobinGeens committed Dec 19, 2024
2 parents f7e73e6 + ed4e12e commit 7472e8d
Showing 28 changed files with 516 additions and 100 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -7,15 +7,15 @@ More information with respect to the capabilities of Stream can be found in the


## Install required packages:
-```
-> pip install -r requirements.txt
+```bash
+pip install -r requirements.txt
```

## The first run
-```
-> cd stream
-> python api.py
+```bash
+git checkout tutorial
+python lab1/main.py
```

## Documentation
-Documentation for Stream is underway!
+You can find extensive documentation of Stream [here](https://kuleuven-micas.github.io/stream/).
66 changes: 8 additions & 58 deletions docs/source/getting-started.rst
@@ -2,66 +2,16 @@
Getting Started
===============

-Stream allows you to run a design space exploration for both, traditional layer-by-layer processing as well as layer-fused processing of DNN workloads. The framework can be used to explore the performace of a workload on multi-core and single-core architectures.
+Tutorial
+--------

-In a first run, we are going to run ResNet-18 on quad-core architecture similar to a TPU like hardware [1]. We provide an `onnx <https://onnx.ai/>`_ model of this network in ``stream/inputs/examples/workload/resnet18.onnx`` and the HW architecture in ``stream/inputs/examples/hardware/TPU_like_quad_core.py``.
+The recommended way to get started with Stream is through the tutorial labs. You can find them in the `tutorial` branch of the repository. You should start with lab1 `here <https://github.com/KULeuven-MICAS/stream/tree/tutorial/lab1>`_.

-The onnx model has been shape inferred, which means that besied the input and output tensor shapes, all intermediate tensor shapes have been inferred, which is information required by Stream.
+Manual run
+----------

-.. warning::
-    ZigZag requires an inferred onnx model, as it needs to know the shapes of all intermediate tensors to correctly infer the layer shapes. You can find more information on how to infer an onnx model `here <https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md#running-shape-inference-on-an-onnx-model>`_.
+You can also run the main script directly. This will run a genetic algorithm on the inputs defined in the main file.

-Besides the workload and HW architecture, a mapping file must be provided which, as the name suggests, provides information about which layer can be mapped to which core in the hardware architecture. The mapping is provided in ``stream/inputs/examples/mapping/tpu_like_quad_core.py``.
+.. code-block:: bash
-The framework is generally ran through a main file which parses the provided inputs and contains the program flow through the stages defined in the main file.

-.. note::

-    You can find more information in the :doc:`stages` document.

-Layer-by-layer processing of workload
-=====================================

-Now, we would like to run the previously introduced workload in a layer-by-layer fashion, which means that one layer is exectued at once on a certain core and the next layer can only start as soon as all previous layers are completely done.

-For this we have to exectue

-.. code:: sh
-    python main_stream.py
-which parses the given workload, hw architecture and the corresponding mapping. Stream will now evaluate how efficently the workload can be executed on the given hardware with a layer-by-layer approach.

-Layer-fused processing of workload
-==================================

-In a second run, we would like to run the same workload on the same HW with the same mapping. The difference will be that a layer-fused approach is used instead of a layer-by-layer approach.

-For this we have to execute

-.. code:: sh
-    python main_stream_layer_splitting.py
-which starts another run of Stream. Now the given inputs are processed in a layer-fused approach which means that each layer is split in several smaller parts.

-Analyzing results
-=================

-During the run of each experiement, Streams saves the results in the ``outputs`` folder based on the paths provided in the ``main_stream.py`` and ``main_stream_layer_splitting.py`` files. In this folder, there will be four ``.png`` files. Two of them show the schedule of workload's layer on the different cores of the hw architecture (one file for the layer-by-layer approach and one file for the layer-fused approach). Besides this, the other two ``.png`` files show the memory utilization of the different cores in the system for the two different experiements. More explanation about the results can be found on the :doc:`outputs` page.

-[1] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
-S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
-C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V.
-Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho,
-D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski,
-A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy,
-J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin,
-G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan,
-R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
-N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani,
-C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing,
-M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan,
-R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter
-performance analysis of a tensor processing unit,” SIGARCH Comput.
-Archit. News, vol. 45, no. 2, p. 1–12, jun 2017.
+   $ python main_stream_ga.py
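
The new getting-started text says `main_stream_ga.py` runs a genetic algorithm over the inputs defined in the main file. Stream's own implementation is not part of this diff; purely as an illustration of what such a search involves, here is a self-contained sketch using DEAP (already listed in requirements.txt) that evolves a hypothetical layer-to-core assignment against a placeholder cost function:

```python
# Generic GA sketch with DEAP -- NOT Stream's implementation. It evolves an
# assignment of NUM_LAYERS layers onto NUM_CORES cores and scores candidates
# with a toy cost function standing in for Stream's latency/energy evaluation.
import random

from deap import algorithms, base, creator, tools

NUM_LAYERS = 10  # hypothetical workload size
NUM_CORES = 4    # hypothetical quad-core accelerator

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("core_id", random.randrange, NUM_CORES)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.core_id, n=NUM_LAYERS)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)


def evaluate(individual: list[int]) -> tuple[float]:
    # Toy cost: penalize mapping consecutive layers onto the same core.
    return (float(sum(a == b for a, b in zip(individual, individual[1:]))),)


toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutUniformInt, low=0, up=NUM_CORES - 1, indpb=0.1)
toolbox.register("select", tools.selTournament, tournsize=3)

population = toolbox.population(n=50)
algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2, ngen=20, verbose=False)
print("best layer-to-core assignment:", tools.selBest(population, k=1)[0])
```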
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
-zigzag-dse==3.7.2
+zigzag-dse==3.8.0
rtree
deap
matplotlib
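The only dependency change here is the `zigzag-dse` pin moving from 3.7.2 to 3.8.0. A standard-library one-liner to check which version an environment actually has installed:

```python
from importlib.metadata import version

print(version("zigzag-dse"))  # expected to print 3.8.0 once this commit's requirements are installed
```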
4 changes: 2 additions & 2 deletions stream/cost_model/communication_manager.py
@@ -1,6 +1,6 @@
import itertools
from math import ceil
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING

from zigzag.datatypes import Constants, MemoryOperand

@@ -112,7 +112,7 @@ def get_shortest_paths(self):
return shortest_paths

def get_links_for_all_core_pairs(self):
-communication_links: dict[tuple[Core, Core], Any] = {}
+communication_links: dict[tuple[Core, Core], "CommunicationLink"] = {}
for pair, path in self.shortest_paths.items():
traversed_edges = [(i, j) for i, j in zip(path, path[1:])]
communication_links[pair] = [
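The two hunks above replace the loose `Any` annotation with a precise `"CommunicationLink"` forward reference, which is also why only `TYPE_CHECKING` is still imported from `typing`. A minimal, self-contained sketch of that typing pattern (module and class names are hypothetical stand-ins, not Stream's actual layout):

```python
# Sketch of the TYPE_CHECKING + forward-reference pattern used in the diff.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers (mypy, pyright), never at runtime,
    # so an import that would otherwise be circular or heavy costs nothing.
    from mynoc.links import CommunicationLink  # hypothetical module path


def build_link_table(core_pairs: list[tuple[str, str]]) -> dict[tuple[str, str], "CommunicationLink | None"]:
    # The annotation stays an unevaluated string at runtime, so the class does
    # not need to be importable here, yet type checkers see the precise type.
    table: dict[tuple[str, str], "CommunicationLink | None"] = {}
    for pair in core_pairs:
        table[pair] = None  # a real implementation would look up the link object
    return table


print(build_link_table([("core_0", "core_1")]))
```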
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/eyeriss_like.yaml
@@ -112,8 +112,8 @@ memories:
served_dimensions: [D1, D2]

operational_array:
-multiplier_energy: 0.5 # pJ
-multiplier_area: 0.1 # unit
+unit_energy: 0.5 # pJ
+unit_area: 0.1 # unit
dimensions: [D1, D2]
sizes: [14, 12]

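The `multiplier_energy`/`multiplier_area` keys become `unit_energy`/`unit_area` here and in the other core descriptions touched by this commit. If you maintain older core YAMLs of your own, a small normalization pass like the sketch below keeps them loadable (PyYAML; the helper itself is hypothetical and not part of Stream):

```python
import yaml  # PyYAML

# Legacy operational_array key names mapped to the names used after this commit.
_RENAMES = {"multiplier_energy": "unit_energy", "multiplier_area": "unit_area"}


def load_core_description(path: str) -> dict:
    """Load a core YAML and translate legacy operational_array keys in place."""
    with open(path) as f:
        core = yaml.safe_load(f)
    array = core.get("operational_array", {})
    for old_key, new_key in _RENAMES.items():
        if old_key in array and new_key not in array:
            array[new_key] = array.pop(old_key)
    return core


# Example (path from this repository's layout):
# core = load_core_description("stream/inputs/examples/hardware/cores/eyeriss_like.yaml")
# print(core["operational_array"]["unit_energy"])
```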
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/fusemax_array.yaml
@@ -86,8 +86,8 @@ memories:
served_dimensions: [D1, D2]

operational_array:
-multiplier_energy: 1.5
-multiplier_area: 1 # unit
+unit_energy: 1.5
+unit_area: 1 # unit
dimensions: [D1, D2]
sizes: [256, 256]

4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/fusemax_dram.yaml
@@ -28,7 +28,7 @@ memories:

operational_array:
input_precision: [0, 0]
-multiplier_energy: 0
-multiplier_area: 0
+unit_energy: 0
+unit_area: 0
dimensions: [D1, D2]
sizes: [0, 0]
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/fusemax_vec.yaml
@@ -87,7 +87,7 @@ memories:

operational_array:
input_precision: [8, 8]
-multiplier_energy: 0.1 # pJ
-multiplier_area: 0.01 # unit
+unit_energy: 0.1 # pJ
+unit_area: 0.01 # unit
dimensions: [D1, D2]
sizes: [256, 1]
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/meta_prototype.yaml
@@ -119,7 +119,7 @@ memories:
served_dimensions: [D1, D2, D3, D4]

multipliers:
-multiplier_energy: 0.04 # pJ
-multiplier_area: 1 # unit
+unit_energy: 0.04 # pJ
+unit_area: 1 # unit
dimensions: [D1, D2, D3, D4]
sizes: [32, 2, 4, 4]
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/offchip.yaml
@@ -25,7 +25,7 @@ memories:
served_dimensions: [D1, D2]

operational_array:
-multiplier_energy: 0
-multiplier_area: 0
+unit_energy: 0
+unit_area: 0
dimensions: [D1, D2]
sizes: [0, 0]
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/pooling.yaml
@@ -26,8 +26,8 @@ memories:


operational_array:
-multiplier_energy: 0.1 # pJ
-multiplier_area: 0.01 # unit
+unit_energy: 0.1 # pJ
+unit_area: 0.01 # unit
dimensions: [D1, D2, D3]
sizes: [3, 3, 8]

114 changes: 114 additions & 0 deletions stream/inputs/examples/hardware/cores/simba_chiplet.yaml
@@ -0,0 +1,114 @@
name: simba_chiplet

memories:

  weight_registers:
    size: 512 # 8 word-bits * 64 cluster_size
    r_bw: 8
    w_bw: 8
    r_cost: 0.08 # TODO
    w_cost: 0.08 # TODO
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [I2] # Weights
    ports:
      - fh: w_port_1
        tl: r_port_1
    served_dimensions: []

  weight_buffer:
    size: 32768 # 4096 depth * 8 width
    r_bw: 64 # 8 bits/bank * 8 banks
    w_bw: 64
    r_cost: 0.5 # TODO
    w_cost: 0.5
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [I2] # Weights
    ports:
      - fh: w_port_1
        tl: r_port_1
    served_dimensions: [D3, D4]

  accumulation_buffer:
    size: 3072 # 128 depth * 24 width
    r_bw: 192 # partial sums are 24 bits * 8 units reading in parallel
    w_bw: 192
    r_cost: 0.1 # TODO
    w_cost: 0.1
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [O] # Partial sums
    ports:
      - fh: w_port_1
        tl: r_port_1
        fl: w_port_1
        th: r_port_1
    served_dimensions: [D3, D4]

  input_buffer:
    size: 524288 # 8192 depth * 64 width
    r_bw: 64
    w_bw: 64
    r_cost: 7 # TODO
    w_cost: 7 # TODO
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [I1] # Input activations
    ports:
      - fh: w_port_1
        tl: r_port_1
    served_dimensions: [D3, D4]

  global_buffer:
    size: 2097152 # 2048 depth * 256 width * 4 banks
    r_bw: 1024 # 256 bits width * 4 banks
    w_bw: 1024
    r_cost: 10 # Example cost, refine with more details
    w_cost: 10
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [I1, I2, O] # Input activations, weights, partial sums
    ports:
      - fh: w_port_1
        tl: r_port_1
      - fh: w_port_1
        tl: r_port_1
      - fh: w_port_1
        tl: r_port_1
        fl: w_port_1
        th: r_port_1
    served_dimensions: [D1, D2, D3, D4]


operational_array:
  unit_energy: 0.04 # Refine with more accurate data if available
  unit_area: 1 # unit
  # D1/2 = 4x4 PE array. Each PE has 8 vector MACS (D3) that process 8 elements (D4) in parallel
  dimensions: [D1, D2, D3, D4]
  sizes: [4, 4, 8, 8]

dataflows:
  D1:
    - K, 4
  D2:
    - C, 4
  D3:
    - K, 8
  D4:
    - C, 8
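
As a quick sanity check on the new chiplet description: the four operational-array dimensions multiply out to 1024 MAC units. A back-of-the-envelope sketch (assuming one MAC per unit per cycle; these are not Stream outputs):

```python
# Values read directly from the simba_chiplet YAML above.
sizes = [4, 4, 8, 8]   # D1 x D2 PE array, D3 vector MACs, D4 elements per MAC
unit_energy_pj = 0.04  # unit_energy, pJ per MAC

n_macs = 1
for s in sizes:
    n_macs *= s

print(f"MAC units per chiplet:       {n_macs}")                                    # 1024
print(f"Peak MACs per cycle:         {n_macs}")                                    # 1 MAC/unit/cycle assumed
print(f"Compute energy at full load: {n_macs * unit_energy_pj:.2f} pJ per cycle")  # 40.96
```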
31 changes: 31 additions & 0 deletions stream/inputs/examples/hardware/cores/simba_offchip.yaml
@@ -0,0 +1,31 @@
name: simba_offchip

memories:
  dram:
    size: 10000000000
    r_bw: 64
    w_bw: 64
    r_cost: 100
    w_cost: 100
    area: 0
    r_port: 0
    w_port: 0
    rw_port: 1
    latency: 1
    operands: [I1, I2, O]
    ports:
      - fh: rw_port_1
        tl: rw_port_1
      - fh: rw_port_1
        tl: rw_port_1
      - fh: rw_port_1
        tl: rw_port_1
        fl: rw_port_1
        th: rw_port_1
    served_dimensions: [D1, D2]

operational_array:
  unit_energy: 0
  unit_area: 0
  dimensions: [D1, D2]
  sizes: [0, 0]
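
Comparing the two new Simba descriptions gives a feel for the on-chip versus off-chip access cost they encode. The sketch below assumes `r_cost` is the energy of one full-width read in the same pJ units used by the commented core YAMLs (the new files carry no unit comments, so this is an assumption):

```python
# r_cost and r_bw taken from the YAMLs above.
dram_cost_pj, dram_bw_bits = 100, 64  # simba_offchip: dram
gb_cost_pj, gb_bw_bits = 10, 1024     # simba_chiplet: global_buffer

dram_pj_per_bit = dram_cost_pj / dram_bw_bits  # ~1.56 pJ/bit
gb_pj_per_bit = gb_cost_pj / gb_bw_bits        # ~0.0098 pJ/bit

print(f"DRAM read:          {dram_pj_per_bit:.3f} pJ/bit")
print(f"Global buffer read: {gb_pj_per_bit:.4f} pJ/bit")
print(f"Ratio:              {dram_pj_per_bit / gb_pj_per_bit:.0f}x")  # ~160x
```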
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/simd.yaml
@@ -25,8 +25,8 @@ memories:
served_dimensions: [D1]

operational_array:
-multiplier_energy: 0.1 # pJ
-multiplier_area: 0.01 # unit
+unit_energy: 0.1 # pJ
+unit_area: 0.01 # unit
dimensions: [D1]
sizes: [64]

4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/tpu_like.yaml
@@ -64,8 +64,8 @@ memories:
served_dimensions: [D1, D2]

operational_array:
-multiplier_energy: 0.04 # pJ
-multiplier_area: 1 # unit
+unit_energy: 0.04 # pJ
+unit_area: 1 # unit
dimensions: [D1, D2]
sizes: [32, 32]

