merge master
RobinGeens committed Dec 19, 2024
2 parents f7e73e6 + ed4e12e commit 7472e8d
Showing 28 changed files with 516 additions and 100 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -7,15 +7,15 @@ More information with respect to the capabilities of Stream can be found in the


## Install required packages:
-```
-> pip install -r requirements.txt
+```bash
+pip install -r requirements.txt
```

## The first run
-```
-> cd stream
-> python api.py
+```bash
+git checkout tutorial
+python lab1/main.py
```

## Documentation
-Documentation for Stream is underway!
+You can find extensive documentation of Stream [here](https://kuleuven-micas.github.io/stream/).
66 changes: 8 additions & 58 deletions docs/source/getting-started.rst
@@ -2,66 +2,16 @@
Getting Started
===============

-Stream allows you to run a design space exploration for both, traditional layer-by-layer processing as well as layer-fused processing of DNN workloads. The framework can be used to explore the performace of a workload on multi-core and single-core architectures.
+Tutorial
+--------

-In a first run, we are going to run ResNet-18 on quad-core architecture similar to a TPU like hardware [1]. We provide an `onnx <https://onnx.ai/>`_ model of this network in ``stream/inputs/examples/workload/resnet18.onnx`` and the HW architecture in ``stream/inputs/examples/hardware/TPU_like_quad_core.py``.
+The recommended way to get started with Stream is through the tutorial labs. You can find them in the `tutorial` branch of the repository. You should start with lab1 `here <https://github.com/KULeuven-MICAS/stream/tree/tutorial/lab1>`_.

-The onnx model has been shape inferred, which means that besied the input and output tensor shapes, all intermediate tensor shapes have been inferred, which is information required by Stream.
+Manual run
+----------

-.. warning::
-    ZigZag requires an inferred onnx model, as it needs to know the shapes of all intermediate tensors to correctly infer the layer shapes. You can find more information on how to infer an onnx model `here <https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md#running-shape-inference-on-an-onnx-model>`_.
+You can also run the main script directly. This will run a genetic algorithm on the inputs defined in the main file.

-Besides the workload and HW architecture, a mapping file must be provided which, as the name suggests, provides information about which layer can be mapped to which core in the hardware architecture. The mapping is provided in ``stream/inputs/examples/mapping/tpu_like_quad_core.py``.
+.. code-block:: bash
-The framework is generally ran through a main file which parses the provided inputs and contains the program flow through the stages defined in the main file.

-.. note::

-    You can find more information in the :doc:`stages` document.

-Layer-by-layer processing of workload
-=====================================

-Now, we would like to run the previously introduced workload in a layer-by-layer fashion, which means that one layer is exectued at once on a certain core and the next layer can only start as soon as all previous layers are completely done.

-For this we have to exectue

-.. code:: sh
-    python main_stream.py
-which parses the given workload, hw architecture and the corresponding mapping. Stream will now evaluate how efficently the workload can be executed on the given hardware with a layer-by-layer approach.

-Layer-fused processing of workload
-==================================

-In a second run, we would like to run the same workload on the same HW with the same mapping. The difference will be that a layer-fused approach is used instead of a layer-by-layer approach.

-For this we have to execute

-.. code:: sh
-    python main_stream_layer_splitting.py
-which starts another run of Stream. Now the given inputs are processed in a layer-fused approach which means that each layer is split in several smaller parts.

-Analyzing results
-=================

-During the run of each experiement, Streams saves the results in the ``outputs`` folder based on the paths provided in the ``main_stream.py`` and ``main_stream_layer_splitting.py`` files. In this folder, there will be four ``.png`` files. Two of them show the schedule of workload's layer on the different cores of the hw architecture (one file for the layer-by-layer approach and one file for the layer-fused approach). Besides this, the other two ``.png`` files show the memory utilization of the different cores in the system for the two different experiements. More explanation about the results can be found on the :doc:`outputs` page.

-[1] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
-S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
-C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V.
-Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho,
-D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski,
-A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy,
-J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin,
-G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan,
-R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
-N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani,
-C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing,
-M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan,
-R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter
-performance analysis of a tensor processing unit,” SIGARCH Comput.
-Archit. News, vol. 45, no. 2, p. 1–12, jun 2017.
+   $ python main_stream_ga.py
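
The new getting-started text says `main_stream_ga.py` runs a genetic algorithm over the inputs defined in the main file. Stream's own implementation is not part of this diff; purely as an illustration of what such a search involves, here is a self-contained sketch using DEAP (already listed in requirements.txt) that evolves a hypothetical layer-to-core assignment against a placeholder cost function:

```python
# Generic GA sketch with DEAP -- NOT Stream's implementation. It evolves an
# assignment of NUM_LAYERS layers onto NUM_CORES cores and scores candidates
# with a toy cost function standing in for Stream's latency/energy evaluation.
import random

from deap import algorithms, base, creator, tools

NUM_LAYERS = 10  # hypothetical workload size
NUM_CORES = 4    # hypothetical quad-core accelerator

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("core_id", random.randrange, NUM_CORES)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.core_id, n=NUM_LAYERS)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)


def evaluate(individual: list[int]) -> tuple[float]:
    # Toy cost: penalize mapping consecutive layers onto the same core.
    return (float(sum(a == b for a, b in zip(individual, individual[1:]))),)


toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutUniformInt, low=0, up=NUM_CORES - 1, indpb=0.1)
toolbox.register("select", tools.selTournament, tournsize=3)

population = toolbox.population(n=50)
algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2, ngen=20, verbose=False)
print("best layer-to-core assignment:", tools.selBest(population, k=1)[0])
```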
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
-zigzag-dse==3.7.2
+zigzag-dse==3.8.0
rtree
deap
matplotlib
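The only dependency change here is the `zigzag-dse` pin moving from 3.7.2 to 3.8.0. A standard-library one-liner to check which version an environment actually has installed:

```python
from importlib.metadata import version

print(version("zigzag-dse"))  # expected to print 3.8.0 once this commit's requirements are installed
```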
4 changes: 2 additions & 2 deletions stream/cost_model/communication_manager.py
@@ -1,6 +1,6 @@
import itertools
from math import ceil
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING

from zigzag.datatypes import Constants, MemoryOperand

@@ -112,7 +112,7 @@ def get_shortest_paths(self):
return shortest_paths

def get_links_for_all_core_pairs(self):
-communication_links: dict[tuple[Core, Core], Any] = {}
+communication_links: dict[tuple[Core, Core], "CommunicationLink"] = {}
for pair, path in self.shortest_paths.items():
traversed_edges = [(i, j) for i, j in zip(path, path[1:])]
communication_links[pair] = [
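The two hunks above replace the loose `Any` annotation with a precise `"CommunicationLink"` forward reference, which is also why only `TYPE_CHECKING` is still imported from `typing`. A minimal, self-contained sketch of that typing pattern (module and class names are hypothetical stand-ins, not Stream's actual layout):

```python
# Sketch of the TYPE_CHECKING + forward-reference pattern used in the diff.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers (mypy, pyright), never at runtime,
    # so an import that would otherwise be circular or heavy costs nothing.
    from mynoc.links import CommunicationLink  # hypothetical module path


def build_link_table(core_pairs: list[tuple[str, str]]) -> dict[tuple[str, str], "CommunicationLink | None"]:
    # The annotation stays an unevaluated string at runtime, so the class does
    # not need to be importable here, yet type checkers see the precise type.
    table: dict[tuple[str, str], "CommunicationLink | None"] = {}
    for pair in core_pairs:
        table[pair] = None  # a real implementation would look up the link object
    return table


print(build_link_table([("core_0", "core_1")]))
```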
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/eyeriss_like.yaml
@@ -112,8 +112,8 @@ memories:
served_dimensions: [D1, D2]

operational_array:
-multiplier_energy: 0.5 # pJ
-multiplier_area: 0.1 # unit
+unit_energy: 0.5 # pJ
+unit_area: 0.1 # unit
dimensions: [D1, D2]
sizes: [14, 12]

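The `multiplier_energy`/`multiplier_area` keys become `unit_energy`/`unit_area` here and in the other core descriptions touched by this commit. If you maintain older core YAMLs of your own, a small normalization pass like the sketch below keeps them loadable (PyYAML; the helper itself is hypothetical and not part of Stream):

```python
import yaml  # PyYAML

# Legacy operational_array key names mapped to the names used after this commit.
_RENAMES = {"multiplier_energy": "unit_energy", "multiplier_area": "unit_area"}


def load_core_description(path: str) -> dict:
    """Load a core YAML and translate legacy operational_array keys in place."""
    with open(path) as f:
        core = yaml.safe_load(f)
    array = core.get("operational_array", {})
    for old_key, new_key in _RENAMES.items():
        if old_key in array and new_key not in array:
            array[new_key] = array.pop(old_key)
    return core


# Example (path from this repository's layout):
# core = load_core_description("stream/inputs/examples/hardware/cores/eyeriss_like.yaml")
# print(core["operational_array"]["unit_energy"])
```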
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/fusemax_array.yaml
@@ -86,8 +86,8 @@ memories:
served_dimensions: [D1, D2]

operational_array:
-multiplier_energy: 1.5
-multiplier_area: 1 # unit
+unit_energy: 1.5
+unit_area: 1 # unit
dimensions: [D1, D2]
sizes: [256, 256]

4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/fusemax_dram.yaml
@@ -28,7 +28,7 @@ memories:

operational_array:
input_precision: [0, 0]
-multiplier_energy: 0
-multiplier_area: 0
+unit_energy: 0
+unit_area: 0
dimensions: [D1, D2]
sizes: [0, 0]
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/fusemax_vec.yaml
@@ -87,7 +87,7 @@ memories:

operational_array:
input_precision: [8, 8]
-multiplier_energy: 0.1 # pJ
-multiplier_area: 0.01 # unit
+unit_energy: 0.1 # pJ
+unit_area: 0.01 # unit
dimensions: [D1, D2]
sizes: [256, 1]
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/meta_prototype.yaml
@@ -119,7 +119,7 @@ memories:
served_dimensions: [D1, D2, D3, D4]

multipliers:
-multiplier_energy: 0.04 # pJ
-multiplier_area: 1 # unit
+unit_energy: 0.04 # pJ
+unit_area: 1 # unit
dimensions: [D1, D2, D3, D4]
sizes: [32, 2, 4, 4]
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/offchip.yaml
@@ -25,7 +25,7 @@ memories:
served_dimensions: [D1, D2]

operational_array:
-multiplier_energy: 0
-multiplier_area: 0
+unit_energy: 0
+unit_area: 0
dimensions: [D1, D2]
sizes: [0, 0]
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/pooling.yaml
@@ -26,8 +26,8 @@ memories:


operational_array:
-multiplier_energy: 0.1 # pJ
-multiplier_area: 0.01 # unit
+unit_energy: 0.1 # pJ
+unit_area: 0.01 # unit
dimensions: [D1, D2, D3]
sizes: [3, 3, 8]

114 changes: 114 additions & 0 deletions stream/inputs/examples/hardware/cores/simba_chiplet.yaml
@@ -0,0 +1,114 @@
name: simba_chiplet

memories:

  weight_registers:
    size: 512 # 8 word-bits * 64 cluster_size
    r_bw: 8
    w_bw: 8
    r_cost: 0.08 # TODO
    w_cost: 0.08 # TODO
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [I2] # Weights
    ports:
      - fh: w_port_1
        tl: r_port_1
    served_dimensions: []

  weight_buffer:
    size: 32768 # 4096 depth * 8 width
    r_bw: 64 # 8 bits/bank * 8 banks
    w_bw: 64
    r_cost: 0.5 # TODO
    w_cost: 0.5
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [I2] # Weights
    ports:
      - fh: w_port_1
        tl: r_port_1
    served_dimensions: [D3, D4]

  accumulation_buffer:
    size: 3072 # 128 depth * 24 width
    r_bw: 192 # partial sums are 24 bits * 8 units reading in parallel
    w_bw: 192
    r_cost: 0.1 # TODO
    w_cost: 0.1
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [O] # Partial sums
    ports:
      - fh: w_port_1
        tl: r_port_1
        fl: w_port_1
        th: r_port_1
    served_dimensions: [D3, D4]

  input_buffer:
    size: 524288 # 8192 depth * 64 width
    r_bw: 64
    w_bw: 64
    r_cost: 7 # TODO
    w_cost: 7 # TODO
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [I1] # Input activations
    ports:
      - fh: w_port_1
        tl: r_port_1
    served_dimensions: [D3, D4]

  global_buffer:
    size: 2097152 # 2048 depth * 256 width * 4 banks
    r_bw: 1024 # 256 bits width * 4 banks
    w_bw: 1024
    r_cost: 10 # Example cost, refine with more details
    w_cost: 10
    area: 0
    r_port: 1
    w_port: 1
    rw_port: 0
    latency: 1
    operands: [I1, I2, O] # Input activations, weights, partial sums
    ports:
      - fh: w_port_1
        tl: r_port_1
      - fh: w_port_1
        tl: r_port_1
      - fh: w_port_1
        tl: r_port_1
        fl: w_port_1
        th: r_port_1
    served_dimensions: [D1, D2, D3, D4]


operational_array:
  unit_energy: 0.04 # Refine with more accurate data if available
  unit_area: 1 # unit
  # D1/2 = 4x4 PE array. Each PE has 8 vector MACS (D3) that process 8 elements (D4) in parallel
  dimensions: [D1, D2, D3, D4]
  sizes: [4, 4, 8, 8]

dataflows:
  D1:
    - K, 4
  D2:
    - C, 4
  D3:
    - K, 8
  D4:
    - C, 8
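
As a quick sanity check on the new chiplet description: the four operational-array dimensions multiply out to 1024 MAC units. A back-of-the-envelope sketch (assuming one MAC per unit per cycle; these are not Stream outputs):

```python
# Values read directly from the simba_chiplet YAML above.
sizes = [4, 4, 8, 8]   # D1 x D2 PE array, D3 vector MACs, D4 elements per MAC
unit_energy_pj = 0.04  # unit_energy, pJ per MAC

n_macs = 1
for s in sizes:
    n_macs *= s

print(f"MAC units per chiplet:       {n_macs}")                                    # 1024
print(f"Peak MACs per cycle:         {n_macs}")                                    # 1 MAC/unit/cycle assumed
print(f"Compute energy at full load: {n_macs * unit_energy_pj:.2f} pJ per cycle")  # 40.96
```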
31 changes: 31 additions & 0 deletions stream/inputs/examples/hardware/cores/simba_offchip.yaml
@@ -0,0 +1,31 @@
name: simba_offchip

memories:
  dram:
    size: 10000000000
    r_bw: 64
    w_bw: 64
    r_cost: 100
    w_cost: 100
    area: 0
    r_port: 0
    w_port: 0
    rw_port: 1
    latency: 1
    operands: [I1, I2, O]
    ports:
      - fh: rw_port_1
        tl: rw_port_1
      - fh: rw_port_1
        tl: rw_port_1
      - fh: rw_port_1
        tl: rw_port_1
        fl: rw_port_1
        th: rw_port_1
    served_dimensions: [D1, D2]

operational_array:
  unit_energy: 0
  unit_area: 0
  dimensions: [D1, D2]
  sizes: [0, 0]
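
Comparing the two new Simba descriptions gives a feel for the on-chip versus off-chip access cost they encode. The sketch below assumes `r_cost` is the energy of one full-width read in the same pJ units used by the commented core YAMLs (the new files carry no unit comments, so this is an assumption):

```python
# r_cost and r_bw taken from the YAMLs above.
dram_cost_pj, dram_bw_bits = 100, 64  # simba_offchip: dram
gb_cost_pj, gb_bw_bits = 10, 1024     # simba_chiplet: global_buffer

dram_pj_per_bit = dram_cost_pj / dram_bw_bits  # ~1.56 pJ/bit
gb_pj_per_bit = gb_cost_pj / gb_bw_bits        # ~0.0098 pJ/bit

print(f"DRAM read:          {dram_pj_per_bit:.3f} pJ/bit")
print(f"Global buffer read: {gb_pj_per_bit:.4f} pJ/bit")
print(f"Ratio:              {dram_pj_per_bit / gb_pj_per_bit:.0f}x")  # ~160x
```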
4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/simd.yaml
@@ -25,8 +25,8 @@ memories:
served_dimensions: [D1]

operational_array:
-multiplier_energy: 0.1 # pJ
-multiplier_area: 0.01 # unit
+unit_energy: 0.1 # pJ
+unit_area: 0.01 # unit
dimensions: [D1]
sizes: [64]

4 changes: 2 additions & 2 deletions stream/inputs/examples/hardware/cores/tpu_like.yaml
@@ -64,8 +64,8 @@ memories:
served_dimensions: [D1, D2]

operational_array:
-multiplier_energy: 0.04 # pJ
-multiplier_area: 1 # unit
+unit_energy: 0.04 # pJ
+unit_area: 1 # unit
dimensions: [D1, D2]
sizes: [32, 32]

