Translating "translate perf_infer_gpu_multi.md" to Chinese (#35271)
add "translate perf_infer_gpu_multi"
HMJ0628 authored Dec 16, 2024
1 parent 22834ee commit 886f690
Showing 2 changed files with 70 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/source/zh/_toctree.yml
@@ -69,6 +69,8 @@
title: Fully Sharded Data Parallel
- local: perf_train_special
title: PyTorch training on Apple silicon
- local: perf_infer_gpu_multi
title: Multi-GPU inference
- local: perf_train_cpu
title: Efficient training on CPU
- local: perf_hardware
68 changes: 68 additions & 0 deletions docs/source/zh/perf_infer_gpu_multi.md
@@ -0,0 +1,68 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# Multi-GPU inference

Some models now have built-in support for **tensor parallelism** (TP), implemented with PyTorch. Tensor parallelism shards the model across multiple GPUs, which makes it possible to fit larger models and parallelizes computations such as matrix multiplication.
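
For intuition, here is a toy, single-process sketch (not part of the Transformers API; the tensor shapes are arbitrary) of how a column-wise split of one linear layer's weight parallelizes the matrix multiplication across devices:

```python
import torch

# Toy illustration on one process: split a linear layer's weight by output
# columns, compute each shard's slice of the matmul, then gather the results.
x = torch.randn(2, 8)              # activations
w = torch.randn(8, 16)             # full weight matrix of a linear layer

w_shard0, w_shard1 = w.chunk(2, dim=1)   # each shard would live on its own GPU
y_shard0 = x @ w_shard0                  # computed on GPU 0 in a real TP setup
y_shard1 = x @ w_shard1                  # computed on GPU 1 in a real TP setup

# Concatenating the shards reproduces the full, unsharded result
assert torch.allclose(torch.cat([y_shard0, y_shard1], dim=1), x @ w)
```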

To enable tensor parallelism, simply pass `tp_plan="auto"` when calling [`~AutoModelForCausalLM.from_pretrained`]:

```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Initialize the distributed environment
rank = int(os.environ["RANK"])
device = torch.device(f"cuda:{rank}")
torch.distributed.init_process_group("nccl", device_id=device)

# Load the model with tensor parallelism enabled
model = AutoModelForCausalLM.from_pretrained(
model_id,
tp_plan="auto",
)

# Prepare the input tokens
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Can I help"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Run the forward pass; the computation is distributed across GPUs
outputs = model(inputs)
```

You can launch the script above with `torchrun`, which runs multiple processes and automatically maps each process to a GPU:

```
torchrun --nproc-per-node 4 demo.py
```
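
If you want to generate text rather than run a single forward pass, the same tensor-parallel model works with `generate` as well. A minimal sketch that continues the script above (the decoding settings here are illustrative, not prescribed):

```python
# Continues the script above: generate with the tensor-parallel model
generated = model.generate(inputs, max_new_tokens=32, do_sample=False)

# Decode and print only on rank 0 to avoid duplicated output from every process
if rank == 0:
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
```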

Currently, PyTorch tensor parallelism supports the following models:
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)

You can request tensor parallelism support for another model by opening a GitHub issue or pull request.

### Expected speedups

Tensor parallelism can deliver significant speedups for inference, especially for inputs with large batch sizes or long sequences.
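
To check the speedup on your own hardware, one straightforward approach is to time the forward pass directly. A minimal sketch reusing the model and inputs from the script above (the warmup and iteration counts are arbitrary choices):

```python
import time

# Warm up so CUDA kernels and communication are initialized before timing
with torch.no_grad():
    for _ in range(3):
        model(inputs)
torch.cuda.synchronize(device)

# Time several forward passes and report the average latency
start = time.perf_counter()
with torch.no_grad():
    for _ in range(10):
        model(inputs)
torch.cuda.synchronize(device)

if rank == 0:
    print(f"Average forward latency: {(time.perf_counter() - start) / 10 * 1000:.1f} ms")
```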

Below is the expected speedup for a single forward pass on the [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel) model with a sequence length of 512, across different batch sizes:

<div style="text-align: center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Meta-Llama-3-8B-Instruct, seqlen = 512, python, w_ compile.png">
</div>
