TorchServe v0.7.0 Release Notes
This is the release of TorchServe v0.7.0.
New Examples
- HF + Better Transformer integration #2002 @HamidShojanazeri
Better Transformer / Flash Attention & xFormers memory-efficient attention provide out-of-the-box performance with major speedups for PyTorch Transformer encoders. This has been integrated into the TorchServe HF Transformer example; please read more about this integration here.
The main speedups in Better Transformer come from exploiting sparsity on padded inputs and from kernel fusions. As a result, you will see the biggest gains on larger workloads, such as sequences with longer padding and larger batch sizes.
In our benchmarks on P3 instances with 4 V100 GPUs, using TorchServe benchmarking workloads, throughput improved significantly at large batch sizes: a 45.5% increase with batch size 8, 50.8% with batch size 16, 45.2% with batch size 32, 47.2% with batch size 64, and 17.2% with batch size 4. These numbers can vary with your workload (batch size, padding percentage) and your hardware. Please see additional benchmarks in the blog post.
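As a rough illustration of the idea behind the integration, the sketch below converts a HuggingFace encoder to Better Transformer via the `optimum` library before running inference. The model name and the use of `BetterTransformer.transform` are illustrative assumptions; the actual handler in the HF Transformer example may wire this up differently.

```python
# Minimal sketch (not the exact TorchServe handler code): convert a HF encoder
# to its Better Transformer (fastpath) equivalent, then run batched inference.
import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Swap the encoder layers for their Better Transformer equivalents.
model = BetterTransformer.transform(model)

with torch.inference_mode():
    # Padded batches are where the fastpath's sparsity exploitation pays off.
    inputs = tokenizer(
        ["hello world", "a longer sentence that forces padding in the batch"],
        padding=True,
        return_tensors="pt",
    )
    outputs = model(**inputs)
```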
- torch.compile() support #1960 @msaroufim
We've added experimental support for PyTorch 2.0, i.e. torch.compile(), within TorchServe. To use it, supply a compile.json file when archiving your model to specify which backend you want. We've also enabled mode=reduce-overhead by default, which is ideally suited for the smaller batch sizes that are common in inference. For now we recommend GPUs with tensor cores, such as A10G or A100, since you're likely to see the greatest speedups there.
On training we've seen speedups ranging from 30% to 2x (https://pytorch.org/get-started/pytorch-2.0/), but we haven't run any performance benchmarks for inference yet. Until then we recommend you continue leveraging other runtimes like TensorRT or IPEX for accelerated inference, which we highlight in our performance_guide.md.
There are a few important caveats when using torch.compile: changes in batch size cause recompilation, so keep the batch size small and fixed; there is additional startup overhead because the model must be compiled first; and you'll likely still see the largest speedups with TensorRT.
However, we hope that adding this support will make it easier for you to benchmark and try out PT 2.0. Learn more here https://github.com/pytorch/serve/tree/master/examples/pt2
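For context, the sketch below shows the plain torch.compile() call that this feature performs for you under the hood. The toy model and input shapes are illustrative assumptions; in TorchServe the backend and mode come from compile.json rather than hand-written code.

```python
# Minimal sketch of compiling a model for inference with PyTorch 2.0.
import torch
import torch.nn as nn

# Illustrative model; in TorchServe this would be the archived model loaded by the handler.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# mode="reduce-overhead" (the default enabled here) targets the small batch sizes
# typical of inference; the backend defaults to "inductor" when unspecified.
compiled_model = torch.compile(model, mode="reduce-overhead")

with torch.inference_mode():
    # Keep the batch size fixed: changing it triggers recompilation.
    x = torch.randn(4, 128)
    y = compiled_model(x)
```

In the served setup this compilation happens when the model is loaded, which is where the extra startup overhead mentioned above comes from.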
Dependency Upgrades
- Support Python 3.10 #2031 @agunapal
- Support PyTorch 1.13 and Cuda 11.7 #1980 @agunapal
- Update docker default from Ubuntu 18.04 to Ubuntu 20.04 (LTS) #1970 @LuigiCerone
Improvements
- KServe upgrade to 0.9 #1860 @jagadeesh
- Added pyyaml for Python venv #2014 @lxning
- Added HF BERT Better Transformer benchmark #2024 @lxning
Documentation
Platform Support
Ubuntu 16.04, Ubuntu 18.04, macOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.
GPU Support
Torch 1.13 + CUDA 11.7
Torch 1.11 + CUDA 10.2, 11.3, 11.6
Torch 1.9.0 + CUDA 11.1
Torch 1.8.1 + CUDA 9.2