Commit: Add Ascend NPU build documentation
TylunasLi committed Aug 11, 2024 · 1 parent 948da1e · commit a7b237b
Showing 1 changed file with 139 additions and 0 deletions: docs/ascend_npu.md
# Compiling for Ascend NPU

## Understanding Ascend NPU Products

When first encountering Huawei's Ascend product line, it is important to untangle its confusing version and numbering schemes.

The table below summarizes the most common Ascend products and their corresponding chip models:

| Series | Product | Chip series | Memory | Memory bandwidth |
| ------------------------------- | ------------------------------------------------------------- | ------------ | ---------------- | ---------------- |
| Atlas 200/500 inference products | Atlas 200 acceleration module / Atlas 500 edge station | _Ascend ?_ | 8 GB | 51.2 GB/s |
| Atlas 200/500 A2 inference products | Atlas 200I A2 acceleration module / Atlas 500 A2 edge station | _Ascend ?_ | 12 GB | 51.2 GB/s |
| Atlas 300 inference products | Atlas 300I | Ascend 310 | 32 GB (8 × 4 GB) | 204.8 GB/s |
| Atlas training series | Atlas 800T training server / Atlas 300T | Ascend 910A | 32 GB | HBM, > 640 GB/s |
| Atlas A2 training series | Atlas 800T A2 training server | Ascend 910B | 32 GB / 64 GB | HBM, > 640 GB/s |
| Atlas inference series | Atlas 300I Pro | Ascend 310P | 24 GB | 204.8 GB/s |
| | Atlas 300I Duo | Ascend 310P | 48 GB / 96 GB | 408 GB/s |

Note that the first-generation Ascend 310 / 910A products ship in different variants by host architecture: model numbers 3000 / 9000 are for ARM64 hosts, while 3010 / 9010 are for x86 hosts, and the two are not interchangeable. For example, the Atlas 800 inference server (model 3000) is an ARM-based inference server, and only the model-3000 variant of the Atlas 300I card can be installed in it.

Running the `npu-smi info` command shows the device's SoC information. SoC names map to chip series as follows:

| Chip series | SoC name (`Name`) |
| ----------- | --------------------------------------------- |
| Ascend 310 | 310B1 |
| Ascend 910A | 910A **910B** 910ProA **910ProB** 910PremiumA |
| Ascend 310P | 310P1 310P3 |
| Ascend 910B | 910B1 910B2 910B2C 910B3 910B4 |

Note that the 910B and 910ProB SoC names still refer to Ascend 910A chips, while 910B3 and 910B4 are Ascend 910B variants with some compute units disabled.
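When scripting around this mapping, the SoC name can be pulled out of `npu-smi info` output. The helper below is a minimal sketch, not part of any official tooling; the column layout is assumed from the sample output shown later in this document and may differ across driver versions.

```shell
# Hypothetical helper: extract the SoC name (the "Name" column) from
# `npu-smi info` output. The column layout is assumed from the sample
# output in this document and may vary between driver versions.
soc_name() {
  awk '/^[|] [0-9]+ +[0-9A-Z]+ +[|]/ { print $3; exit }'
}

# On a real system: npu-smi info | soc_name
# Simulated here with one line in the documented format:
printf '| 0     910B4           | OK            |\n' | soc_name   # prints 910B4
```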

## Environment Setup

### Installing the Driver and CANN

* Install the NPU driver:
  See the official documentation: [310P](https://www.hiascend.com/document/detail/zh/quick-installation/24.0.RC1/quickinstg/800I_A2/quickinstg_800I_A2_0007.html) [910A](https://www.hiascend.com/document/detail/zh/quick-installation/23.0.RC2/quickinstg/800_9000/quickinstg_800_9000_0007.html) [910B](https://www.hiascend.com/document/detail/zh/quick-installation/24.0.RC1/quickinstg_train/800_9000A2/quickinstg_800_9000A2_0007.html)

* Install CANN:

  First install the matching Python version (CANN < 7.0 requires Python 3.7; CANN 7.0 and later requires Python 3.9 / 3.10).

  See the official documentation: [310P](https://www.hiascend.com/document/detail/zh/quick-installation/24.0.RC1/quickinstg/800_3000/quickinstg_800_3000_0021.html) [910A](https://www.hiascend.com/document/detail/zh/quick-installation/23.0.RC2/quickinstg/800_9000/quickinstg_800_9000_0018.html) [910B](https://www.hiascend.com/document/detail/zh/quick-installation/24.0.RC1/quickinstg_train/800_9000A2/quickinstg_800_9000A2_0020.html)

In addition, with CANN 6.0 and later you must install the precompiled operator package (kernels) matching your NPU:

```shell
# adjust the 910 prefix to match your chip series
chmod a+x Ascend-cann-kernels-910*_*_linux.run
./Ascend-cann-kernels-910*_*_linux.run --install --quiet
```
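The kernels package filename encodes the chip series. As an illustration, a script could choose the package prefix from the SoC name reported by `npu-smi info`; `kernels_prefix` below is a hypothetical helper, and the mapping is an assumption derived from the SoC table earlier in this document, so verify it against the actual package names published for your CANN version.

```shell
# Hypothetical helper: map an SoC name (from `npu-smi info`) to a kernels
# package prefix. The mapping is an assumption based on the SoC table in
# this document; verify against the packages published for your CANN version.
kernels_prefix() {
  case "$1" in
    910A|910B|910ProA|910ProB|910PremiumA) echo "Ascend-cann-kernels-910" ;;
    910B1|910B2|910B2C|910B3|910B4)        echo "Ascend-cann-kernels-910b" ;;
    310P*)                                 echo "Ascend-cann-kernels-310p" ;;
    *) echo "unknown SoC: $1" >&2; return 1 ;;
  esac
}

kernels_prefix 910B4   # prints Ascend-cann-kernels-910b
```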

### Using a Docker Image

For Ascend 910A, refer to the PaddleCustomDevice images, for example:

* `paddlepaddle/paddle:latest-dev-cann5.0.2.alpha005-gcc82-aarch64`
* `paddlepaddle/paddle:latest-dev-cann5.0.2.alpha005-gcc82-x86_64`
* `registry.baidubce.com/device/paddle-npu:cann601-ubuntu18-aarch64-gcc82`
* `registry.baidubce.com/device/paddle-npu:cann601-ubuntu18-x86_64-gcc82`
* `registry.baidubce.com/device/paddle-npu:cann701-ubuntu20-aarch64-gcc84-py39`
* `registry.baidubce.com/device/paddle-npu:cann701-ubuntu20-x86_64-gcc84-py39`

For Ascend 910B, refer to the PaddleCustomDevice images, for example:
```shell
docker pull registry.baidubce.com/device/paddle-npu:cann80RC1-ubuntu20-aarch64-gcc84-py39
```

After cloning this repository with `git clone`, start the container with the following command:

```shell
docker run -it \
    -v ${PWD}:/workspace \
    --network host -u root --name fastllm_builder \
    --device=/dev/davinci0 \
    -v /etc/localtime:/etc/localtime \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /var/log/npu/:/usr/slog \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    -v /dev/shm/:/dev/shm/ \
    ${IMAGE_NAME} /bin/bash
```
where `IMAGE_NAME` is the name of the image you are using.
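Once inside the container, you can quickly confirm that the NPU device nodes passed in via the `--device` flags above are visible. This is a minimal sketch; the device paths are the ones from the `docker run` command, and `check_npu_devices` is a hypothetical helper.

```shell
# Check that the NPU device nodes from the `docker run --device` flags
# above are visible inside the container.
check_npu_devices() {
  for dev in /dev/davinci0 /dev/davinci_manager /dev/devmm_svm /dev/hisi_hdc; do
    if [ -e "$dev" ]; then
      echo "found: $dev"
    else
      echo "missing: $dev"
    fi
  done
}
check_npu_devices
```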

## Compiling

Once inside the environment, source the following script to set up environment variables:
```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
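To confirm the script took effect, you can check one of the variables it exports. Assumption: recent CANN releases export `ASCEND_TOOLKIT_HOME` from `set_env.sh`; older releases may use different variable names.

```shell
# Sanity check: set_env.sh in recent CANN releases exports ASCEND_TOOLKIT_HOME
# (an assumption; older releases may differ). Prints a fallback message if
# the CANN environment has not been sourced.
echo "${ASCEND_TOOLKIT_HOME:-CANN environment not set}"
```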

If you are in a Docker container, verify that the NPU is accessible in the current environment:
```shell
npu-smi info
```
Output similar to the following indicates the NPU is working normally:
```
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1 Version: 24.1.rc1 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B4 | OK | 81.1 37 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 2773 / 32768 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 0 |
+===========================+===============+====================================================+
```
Output similar to the following indicates the NPU is occupied by another process:
```
DrvMngGetConsoleLogLevel failed. (g_conLogLevel=3)
dcmi model initialized failed, because the device is used. ret is -8020
```

You can then run the build:
```shell
mkdir build-ascend/ && cd build-ascend/
cmake .. -DUSE_ASCEND_NPU=ON # enable NPU support
make -j
```

## Running the Demo Programs

We assume you have already obtained a model named `model.flm` (see [Obtaining Models](#模型获取); first-time users can download a pre-converted model).

After compilation, the following demos are available in the build directory:

```shell
# run these from the fastllm/build directory

# command-line chat program with typewriter-style output (Linux only)
./main -p model.flm

# simple web UI with streaming output and dynamic batching; supports concurrent access
./webui -p model.flm --port 1234

# Python command-line chat demo, demonstrating model creation and streaming dialogue
python tools/cli_demo.py -p model.flm

# simple Python web UI; install streamlit-chat first
streamlit run tools/web_demo.py model.flm
```

For more features and interfaces, see the [detailed documentation](../README.md).
