Commit
Merge branch 'main' of github.com:torchpipe/torchpipe.github.io
张仕洋 committed Feb 27, 2024
2 parents 88690c7 + 5cb0be4 commit 8c91e8a
Showing 3 changed files with 23 additions and 23 deletions.
40 changes: 20 additions & 20 deletions docs/quick_start_new_user.md
@@ -77,7 +77,7 @@ img = self.precls_trans(cv2.resize(cv2.cvtColor(img, cv2.COLOR_BGR2RGB), (224,224)))
```py
# Assumed to be imported/defined earlier in the original service code:
# torch, torch2trt, resnet50, self.fp16, self.cls_trt_max_batchsize
input_shape = torch.ones((1, 3, 224, 224)).cuda()
self.classification_engine = torch2trt(resnet50, [input_shape],
                                       fp16_mode=self.fp16,
                                       max_batch_size=self.cls_trt_max_batchsize,
                                       )
```
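
For reference, the engine returned by `torch2trt` is invoked like an ordinary `torch.nn.Module`. Below is a minimal, hypothetical usage sketch; `classification_engine` refers to the object built above, and the dummy batch merely stands in for a real preprocessed image tensor.

```py
import torch

# Hypothetical usage sketch (not from the original service code):
# a torch2trt engine is called like a regular torch.nn.Module on CUDA tensors.
dummy_batch = torch.rand(1, 3, 224, 224).cuda()   # stands in for a preprocessed image batch
with torch.no_grad():
    logits = classification_engine(dummy_batch)   # engine built by torch2trt above
    pred = logits.argmax(dim=1)
```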
@@ -116,8 +116,8 @@ config = torchpipe.parse_toml("resnet50.toml")
self.classification_engine = pipe(config)

self.classification_engine(bin_data)


if TASK_RESULT_KEY not in bin_data.keys():
print("error decode")
return results
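
To make the calling convention above concrete, here is a minimal, self-contained sketch of the dict-based call; the file paths and the `"node_name"` entry key are illustrative assumptions rather than values taken from the original service code.

```py
import torchpipe
from torchpipe import pipe, TASK_DATA_KEY, TASK_RESULT_KEY

# Hypothetical standalone sketch of the call pattern shown above;
# "resnet50.toml" and "example.jpg" are placeholder paths.
config = torchpipe.parse_toml("resnet50.toml")
engine = pipe(config)

with open("example.jpg", "rb") as f:
    raw_jpg = f.read()

# The input dict is updated in place by the pipeline; for multi-node
# configurations the entry node is assumed to be selected via "node_name".
bin_data = {TASK_DATA_KEY: raw_jpg, "node_name": "cpu_decoder"}
engine(bin_data)

if TASK_RESULT_KEY in bin_data:
    result = bin_data[TASK_RESULT_KEY]
else:
    print("error decode")
```
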
@@ -133,9 +133,9 @@ The contents of the toml file are as follows:

```toml
# Scheduler parameters
batching_timeout = 5
instance_num = 8
precision = "fp16"

## Data decoding
#
@@ -145,11 +145,11 @@ precision = "fp16"
# The original decoding output format was BGR
# The DecodeMat backend also defaults to outputting in BGR format
# Since decoding is done on the CPU, DecodeMat is used
# After each node is completed, the name of the next node needs to be
# appended via `next`; otherwise the node is assumed to be the last one
#
[cpu_decoder]
backend = "DecodeMat"
next = "cpu_posdecoder"

## Preprocessing: resize, cvtColorMat
@@ -160,11 +160,11 @@ next = "cpu_posdecoder"
# Note:
# The original preprocessing order was resize, cv2.COLOR_BGR2RGB,
# then Normalize.
# However, the normalization step is now integrated into the model
# processing (the [resnet50] node), so the output of the preprocessing
# in this node matches the original preprocessing result without
# normalization.
# After each node is completed, the name of the next node needs to be
# appended via `next`; otherwise the node is assumed to be the last one.
#
[cpu_posdecoder]
@@ -183,23 +183,23 @@ next = "resnet50"
#
# This corresponds to 3.1(3) TensorRT acceleration and 3.1(2) Normalize
# Note:
# There's a slight difference from the original method of generating
# engines online. Here, the model first needs to be converted to ONNX
# format.
#
# For the conversion method, see [Converting Torch to ONNX].
#
[resnet50]
backend = "SyncTensor[TensorrtTensor]"
min = 1
max = 4
instance_num = 4
model = "/your/model/path/resnet50.onnx"

mean="123.675, 116.28, 103.53" # 255*"0.485, 0.456, 0.406"
std="58.395, 57.120, 57.375" # 255*"0.229, 0.224, 0.225"

# TensorrtTensor
"model::cache"="/your/model/path/resnet50.trt" # or resnet50.trt.encrypted

```
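
The ONNX conversion referenced above is not shown in this file; the following is a minimal sketch based on the standard `torch.onnx.export` API, with a dynamic batch axis so that the generated engine can serve batch sizes between `min = 1` and `max = 4`. The output path and opset version are illustrative assumptions.

```py
import torch
import torchvision

# Hypothetical export sketch (paths and opset are placeholders, not from the original docs).
model = torchvision.models.resnet50(pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "resnet50.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
    # Dynamic batch dimension so TensorrtTensor can build an engine for batch sizes 1..4.
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```
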
@@ -221,7 +221,7 @@ std="58.395, 57.120, 57.375" # 255*"0.229, 0.224, 0.225"

The specific test code can be found at [client_qps.py](https://github.com/torchpipe/torchpipe/blob/develop/examples/resnet50_thrift/client_qps.py)

With the same Thrift service interface, testing on a machine with an NVIDIA 3080 GPU and a 36-core CPU, at a concurrency of 10, we obtained the following results:

- throughput:

@@ -233,7 +233,7 @@ With the same Thrift service interface, testing on a machine with NVIDIA 3080 GPU
- response time:

| Methods | TP50 | TP99 |
| :-: | :-: | :-: |
| Pure TensorRT | 26.74 | 35.24 |
| Using TorchPipe | 8.89 | 14.28 |
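
For orientation, the sketch below illustrates how QPS / TP50 / TP99 figures of this kind are typically collected with a concurrent client. It is a generic local harness, not the referenced `client_qps.py`; `call_service` is a placeholder for the actual Thrift RPC.

```py
import time
from concurrent.futures import ThreadPoolExecutor

def call_service(payload: bytes) -> None:
    """Placeholder for the real Thrift RPC used in client_qps.py."""
    time.sleep(0.01)  # simulate a ~10 ms round trip

def benchmark(payload: bytes, concurrency: int = 10, requests: int = 1000) -> None:
    latencies = []  # list.append is thread-safe in CPython

    def one_call(_) -> None:
        start = time.perf_counter()
        call_service(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_call, range(requests)))
    elapsed = time.perf_counter() - t0

    latencies.sort()
    tp50 = latencies[int(0.50 * len(latencies))]
    tp99 = latencies[min(int(0.99 * len(latencies)), len(latencies) - 1)]
    print(f"QPS={requests / elapsed:.1f}  TP50={tp50:.2f} ms  TP99={tp99:.2f} ms")

benchmark(b"fake-jpeg-bytes")
```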

@@ -11,13 +11,13 @@ type: explainer

There are several industry practices in this area, such as [triton inference server](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models), [Alimama high_service](https://mp.weixin.qq.com/s/Fd2GNXqO3wl3FrA7Wli3jA), and [Meituan's practice on optimizing the deployment architecture of vision GPU inference services](https://zhuanlan.zhihu.com/p/605094862).

A common complaint users have about Triton Inference Server is that, in a system where multiple nodes are interwoven, a large amount of business logic has to be implemented on the client side and invoked on the server via RPC, which is cumbersome; and for performance, one is forced to resort to unconventional means such as shared GPU memory, ensembles, and [Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting).

To solve this problem, TorchPipe goes deep into PyTorch's C++ computation backend and CUDA stream management, and applies domain-specific-language modeling to multi-node graphs; externally it exposes thread-safe function interfaces facing the PyTorch frontend, and internally it provides fine-grained, user-facing backend extensions.


![jpg](.././static/images/EngineFlow-light.png)
<center>torchpipe framework diagram</center>

**Key features of the torchpipe framework:**
- Near-optimal performance (peak throughput / TP99) from a business perspective, reducing the widespread negative optimizations and performance losses between nodes.
2 changes: 1 addition & 1 deletion i18n/zh/docusaurus-plugin-content-docs/current/welcome.md
@@ -7,7 +7,7 @@ type: explainer
# Welcome to the torchpipe documentation!
torchpipe is a multi-instance pipeline-parallel library that works independently between low-level acceleration libraries (such as tensorrt, opencv, CVCUDA, torchscript) and RPC frameworks (such as thrift, gRPC), maximizing service throughput while meeting latency requirements.

The whole solution combines concurrency safety and full-pipeline scheduling, supports NVIDIA hardware platforms, and balances development efficiency with performance gains.

Externally, torchpipe provides thread-safe function interfaces facing the PyTorch frontend; internally, it provides fine-grained, user-facing backend extensions.

