Add neureka support (#2)
* Add neureka support

* Fix NE16 weight rolling, which was unpacking to single bits instead of 8-bit values

* Fix Arpan's name in the contributors

* Add multi-accelerator support and neureka as a target

* Skip invalid tests

* Add readme per accelerator

* Remove ne16 input dim 2 stride calculation

* Move common validators to NnxTest

* Rename test/<acc>.py to test/<acc>MemoryLayout.py

* Extract functional model from test gen

* Add isort

* Add without norm_quant

* Remove -flto

* Move stride shift to ne16_task_set_dims_stride2x2 function

* Set quantMode in *_set_bits function

* Fix output d0 stride

* Fixes to strides and stride2x2

Fixed stride2x2 mode for 32-bit output.
Changed the meaning of a stride from the number of elements in a
dimension to the number of bytes between consecutive elements in that
dimension; this also required renaming the dimensions.

* Rename divnceil and remainder, and add nnx_ prefix

* Add citation

* Add sdk and compiler commit hashes

* Change task size to a define

* Remove __PLATFORM__ check from the library since it's pulp-sdk specific

* Replace channel and bits with w_in_stride in task_set_ptrs
lukamac authored Feb 1, 2024
1 parent b3ede52 commit 9afe077
Showing 62 changed files with 2,827 additions and 1,519 deletions.
24 changes: 20 additions & 4 deletions .gitlab-ci.yml
@@ -20,25 +20,41 @@ stages:
   - lint
   - test
 
-format_python:
+python_format:
   stage: lint
   tags:
     - python-lint
   script:
     - black --check .
 
-static_check_python:
+python_sort_imports:
+  stage: lint
+  tags:
+    - python-lint
+  script:
+    - isort --check test
+
+python_static_check:
   stage: lint
   tags:
     - python-lint
   script:
     - pyright .
 
-run_test0:
+run_ne16_test:
   stage: test
   tags:
     - gap9-sdk
   artifacts:
     untracked: true
   script:
-    - cd test && pytest test.py --test-dir tests --recursive
+    - cd test && pytest test.py --test-dir tests --recursive -A ne16
+
+run_neureka_test:
+  stage: test
+  tags:
+    - siracusa-sdk
+  artifacts:
+    untracked: true
+  script:
+    - cd test && pytest test.py --test-dir tests --recursive -A neureka
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,25 @@
# Changelog

## [Unreleased]

### Added

- N-EUREKA accelerator support: 3x3, 1x1, and 3x3 depthwise convolution kernels
- Support for kernels without normalization and quantization for NE16
- isort check
- publication citation

### Changed

- `ne16_task_init` got split into smaller parts: `ne16_task_init`, `ne16_task_set_op_to_conv`, `ne16_task_set_weight_offset`, `ne16_task_set_bits`, `ne16_task_set_norm_quant`
- strides in `ne16_task_set_strides`, `ne16_task_set_dims`, and `ne16_task_set_ptrs` are now expressed in bytes between consecutive elements in that dimension
- `ne16_task_queue_size` is now `NE16_TASK_QUEUE_SIZE`

### Removed

- `k_in_stride`, `w_in_stride`, `k_out_stride`, and `w_out_stride` from `ne16_nnx_dispatch_stride2x2`
- `mode` attribute from `ne16_quant_t` structure

## [0.3.0] - 2024-01-14

### Added
80 changes: 37 additions & 43 deletions README.md
@@ -39,51 +39,22 @@ _Note: The accelerator can provide additional helper functions if needed._
 ## Accelerators
-### NE16
-Github repo [link](https://github.com/pulp-platform/ne16).
-#### Implemented features
-- [x] Convolution w/ kernel shape 1x1
-- [x] Convolution w/ kernel shape 3x3
-- [x] Depthwise convolution w/ kernel shape 3x3
-- [x] Stride 1x1
-- [x] Stride 2x2
-- [ ] Normalization and quantization
-  - [x] With
-  - [ ] Without
-  - [x] Relu (w/ and w/o)
-  - [x] Bias (w/ and w/o)
-  - [ ] Per-channel shift
-  - [x] Per-layer shift
-  - [ ] Rounding
-- [ ] Input type
-  - [x] uint8
-  - [ ] uint16
-- [ ] Output type
-  - [x] int8
-  - [x] uint8 (only w/ Relu)
-  - [ ] int32
-  - [ ] uint32 (only w/ Relu)
-- [ ] Scale type
-  - [x] uint8
-  - [ ] uint16
-  - [ ] uint32
-- [x] Bias type
-  - [x] int32
-- [ ] Weight type
-  - [x] int8
-  - [ ] int2-7
-### Neureka
-**Untested and considered broken.**
+- [NE16](ne16/README.md)
+- [Neureka](neureka/README.md)
 ## Testing
 You can find information about testing in the dedicated [README](test/README.md).
+### Environment
+The library was tested with the following pairs of SDKs and compilers:
+
+| SDK | SDK Commit Hash | Compiler | Compiler Commit Hash |
+| --- | --------------- | -------- | -------------------- |
+| gap\_sdk (obtainable from GreenWaves Technologies) | 90df4ce219 | [gap\_gnu\_toolchain](https://github.com/GreenWaves-Technologies/gap_gnu_toolchain) | 360fd4f9d6 |
+| [pulp-sdk](https://github.com/Scheremo/pulp-sdk) | c216298881 | [pulp-riscv-gnu-toolchain](https://github.com/GreenWaves-Technologies/gap_gnu_toolchain) | 9938bd8fcf (release v1.0.16) |
 ## Contributing
 Bug reports and feature requests should be reported through issues.

@@ -93,15 +64,38 @@ All the development should be done through forks and merged onto the `dev` branch.
 The library will follow the [Semantic Versioning](https://semver.org/).
-## Citing
-*TBA*
+## Publication
+<details>
+<summary>If you use PULP-NNX in your work, you can cite us:</summary>
+
+```
+@inproceedings{10.1145/3607889.3609092,
+author = {Macan, Luka and Burrello, Alessio and Benini, Luca and Conti, Francesco},
+title = {WIP: Automatic DNN Deployment on Heterogeneous Platforms: the GAP9 Case Study},
+year = {2024},
+isbn = {9798400702907},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3607889.3609092},
+doi = {10.1145/3607889.3609092},
+abstract = {Emerging Artificial-Intelligence-enabled System-on-Chips (AI-SoCs) combine a flexible microcontroller with parallel Digital Signal Processors (DSP) and heterogeneous acceleration capabilities. In this Work-in-Progress paper, we focus on the GAP9 RISC-V SoC as a case study to show how the open-source DORY Deep Neural Network (DNN) tool flow can be extended for heterogeneous acceleration by fine grained interleaving of a dedicated Neural Engine and a cluster of RISC-V cores. Our results show that up to 91\% of the peak accelerator throughput can be extracted in end-to-end execution of benchmarks based on MobileNet-V1 and V2.},
+booktitle = {Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems},
+pages = {9–10},
+numpages = {2},
+keywords = {TinyML, MCUs, deep learning, HW accelerators},
+location = {Hamburg, Germany},
+series = {CASES '23 Companion}
+}
+```
+
+</details>
 ## Contributors
 * Luka Macan <[[email protected]](mailto:[email protected])>
 * Francesco Conti <[[email protected]](mailto:[email protected])>
-* Arpan Prasad <[[email protected]](mailto:[email protected])>
+* Arpan Suravi Prasad <[[email protected]](mailto:[email protected])>
 ## License
15 changes: 7 additions & 8 deletions inc/pulp_nnx_ne16.h
@@ -43,7 +43,8 @@ void ne16_nnx_dispatch_wait(ne16_dev_t *dev);
 /** ne16_nnx_dispatch
  *
  * Dispatch a task to the accelerator.
- * Fails with return code 1 if the task cannot be dispatched. Otherwise returns 0.
+ * Fails with return code 1 if the task cannot be dispatched. Otherwise returns
+ * 0.
  */
 int ne16_nnx_dispatch(ne16_dev_t *dev, ne16_task_t *task);
 
@@ -59,7 +60,6 @@ int ne16_nnx_resolve_check(ne16_dev_t *dev, ne16_task_t *task);
  */
 void ne16_nnx_resolve_wait(ne16_dev_t *dev, ne16_task_t *task);
 
-
 /* Additional helper functions */
 
 /** ne16_nnx_dispatch_stride2x2
@@ -69,9 +69,8 @@ void ne16_nnx_resolve_wait(ne16_dev_t *dev, ne16_task_t *task);
  * tile the tile to the subtile's spatial dimensions (in this case 3x3 output).
  * Works only if the k_out is divisible by 2.
  */
-void ne16_nnx_dispatch_stride2x2(
-    ne16_dev_t *dev, ne16_task_t *task, const uint32_t w_in, const uint32_t k_in,
-    const uint32_t w_in_stride, const uint32_t k_in_stride,
-    const uint32_t h_out, const uint32_t w_out, const uint32_t k_out,
-    const uint32_t w_out_stride, const uint32_t k_out_stride,
-    const uint8_t h_ker, const uint8_t w_ker);
+void ne16_nnx_dispatch_stride2x2(ne16_dev_t *dev, ne16_task_t *task,
+                                 const uint32_t w_in, const uint32_t k_in,
+                                 const uint32_t h_out, const uint32_t w_out,
+                                 const uint32_t k_out, const uint8_t h_ker,
+                                 const uint8_t w_ker);
61 changes: 61 additions & 0 deletions inc/pulp_nnx_neureka.h
@@ -0,0 +1,61 @@
/*
 * Luka Macan <[email protected]>
 *
 * Copyright 2023 ETH Zurich and University of Bologna
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * SPDX-License-Identifier: Apache-2.0
 */

#include "neureka.h"
#include "neureka_siracusa_bsp.h"
#include "neureka_task.h"
#include <stdint.h>

/* PULP-NNX interface */

void neureka_nnx_init(neureka_dev_t *dev, neureka_siracusa_conf_t *conf);
void neureka_nnx_term(neureka_dev_t *dev);

/** neureka_nnx_dispatch_check
 *
 * Check whether you can dispatch to the accelerator.
 */
int neureka_nnx_dispatch_check(neureka_dev_t *dev);

/** neureka_nnx_dispatch_wait
 *
 * Block until you can dispatch to the accelerator.
 */
void neureka_nnx_dispatch_wait(neureka_dev_t *dev);

/** neureka_nnx_dispatch
 *
 * Dispatch a task to the accelerator.
 * Fails with return code 1 if the task cannot be dispatched. Otherwise returns
 * 0.
 */
int neureka_nnx_dispatch(neureka_dev_t *dev, neureka_task_t *task);

/** neureka_nnx_resolve_check
 *
 * Check whether the task has been resolved.
 */
int neureka_nnx_resolve_check(neureka_dev_t *dev, neureka_task_t *task);

/** neureka_nnx_resolve_wait
 *
 * Block until you can resolve the task.
 */
void neureka_nnx_resolve_wait(neureka_dev_t *dev, neureka_task_t *task);
36 changes: 36 additions & 0 deletions ne16/README.md
@@ -0,0 +1,36 @@
# NE16

## Docs

- Github repo [link](https://github.com/pulp-platform/ne16).

## Implemented features

- [x] Convolution w/ kernel shape 1x1
- [x] Convolution w/ kernel shape 3x3
- [x] Depthwise convolution w/ kernel shape 3x3
- [x] Stride 2x2
- [ ] Normalization and quantization
  - [x] With
  - [x] Without
  - [x] Relu (w/ and w/o)
  - [x] Bias (w/ and w/o)
  - [ ] Per-channel shift
  - [x] Per-layer shift
  - [ ] Rounding
- [ ] Input type
  - [x] uint8
  - [ ] uint16
- [ ] Output type
  - [x] int8
  - [x] uint8 (only w/ Relu)
  - [x] int32
- [ ] Scale type
  - [x] uint8
  - [ ] uint16
  - [ ] uint32
- [x] Bias type
  - [x] int32
- [ ] Weight type
  - [x] int8
  - [ ] int2-7
2 changes: 0 additions & 2 deletions ne16/hal/ne16.c
@@ -23,8 +23,6 @@
 #define NE16_STATUS_EMPTY (0x000)
 #define NE16_STATUS_FULL (0x101)
 
-inline int ne16_task_queue_size(ne16_dev_t *dev) { return 2; }
-
 inline int ne16_task_queue_tasks_in_flight(ne16_dev_t *dev) {
   uint32_t status = hwpe_task_queue_status(&dev->hwpe_dev);
   return (status & 0x1) + ((status >> 8) & 0x1);
3 changes: 2 additions & 1 deletion ne16/hal/ne16.h
@@ -24,11 +24,12 @@
 #include "hwpe.h"
 #include <stdint.h>
 
+#define NE16_TASK_QUEUE_SIZE (2)
+
 typedef struct ne16_dev_t {
   hwpe_dev_t hwpe_dev; /* Implements the HWPE device interface */
 } ne16_dev_t;
 
-int ne16_task_queue_size(ne16_dev_t *dev);
 int ne16_task_queue_tasks_in_flight(ne16_dev_t *dev);
 int ne16_task_queue_empty(ne16_dev_t *dev);
 int ne16_task_queue_full(ne16_dev_t *dev);