Release TorchServe v0.8.0 Release Notes · pytorch/serve

This is the release of TorchServe v0.8.0.

New Features

Supported large model inference in distributed environment #2193 #2320 #2209 #2215 #2310 #2218 @lxning @HamidShojanazeri

TorchServe added the deep integration to support large model inference. It provides PyTorch native large model inference solution by integrating PiPPy. It also provides the flexibility and extensibility to support other popular libraries such as Microsoft Deepspeed, and HuggingFace Accelerate.

Supported streaming response for GRPC #2186 and HTTP #2233 @lxning

To improve UX in Generative AI inference, TorchServe allows for sending intermediate token response to client side by supporting GRPC server side streaming and HTTP 1.1 chunked encoding .

Supported PyTorch/XLA on GPU and TPU #2182 @morgandu

By leveraging torch.compile it's now possible to run torchserve using XLA which is optimized for both GPU and TPU deployments.

Implemented New Metrics platform #2199 #2190 #2165 @namannandan @lxning

TorchServe fully supports metrics in Prometheus mode or Log mode. Both frontend and backend metrics can be configured in a central metrics YAML file.

Supported map based model config YAML file. #2193 @lxning

Added config-file option for model config to model archiver tool. Users is able to flexibly define customized parameters in this YAML file, and easily access them in backend handler via variable context.model_yaml_config. This new feature also made TorchServe easily support the other new features and enhancements.

Refactored PT2.0 support #2222 @msaroufim

We've refactored our model optimization utilities, improved logging to help debug compilation issues. We've also now deprecated compile.json in favor of using the new YAML config format, follow our guide here to learn more https://github.com/pytorch/serve/blob/master/examples/pt2/README.md the main difference is while archiving a model instead of passing in compile.json via --extra-files we can pass in a --config-file model_config.yaml

Supported user specified gpu deviceIds for a model #2193 @lxning

By default, TorchServe uses a round-robin algorithm to assign GPUs to a worker on a host. Starting from v0.8.0, TorchServe allows users to define deviceIds in the model_config.yaml. to assign GPUs to a model.

Supported cpu model on a GPU host #2193 @lxning

TorchServe supports hybrid mode on a GPU host. Users are able to define deviceType in model config YAML file to deploy a model on CPU of a GPU host.

Supported Client Timeout #2267 @lxning

TorchServe allows users to define clientTimeoutInMills in a model config YAML file. TorchServe calculates the expired timestamp of an incoming inference request if clientTimeoutInMills is set, and drops the request once it is expired.

Updated ping endpoint default behavior #2254 @lxning

Supported maxRetryTimeoutInSec, which defines the max maximum time window of recovering a dead backend worker of a model, in model config YAML file. The default value is 5 min. Users are able to adjust it in model config YAML file. The ping endpoint returns 200 if all models have enough healthy workers (ie, equal or larger the minWorkers); otherwise returns 500.

New Examples

Example of Pippy onboarding Open platform framework for distributed model inference #2215 @HamidShojanazeri
Example of DeepSpeed onboarding Open platform framework for distributed model inference #2218 @lxning
Example of Stable diffusion v2 #2009 @jagadeeshi2i

Improvements

Upgraded to PyTorch 2.0 #2194 @agunapal
Enabled Core pinning in CPU nightly benchmark #2166 #2237 @min-jean-cho

TorchServe can be used with Intel® Extension for PyTorch* to give performance boost on Intel hardware. Intel® Extension for PyTorch* is a Python package extending PyTorch with up-to-date features optimizations that take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI), Intel® Advanced Matrix Extensions (Intel® AMX), and more.

Enabling core pinning in TorchServe CPU nightly benchmark shows significant performance speedup. This feature is implemented via a script under PyTorch Xeon backend, initiated from Intel® Extension for PyTorch*. To try out core pinning on your workload, add cpu_launcher_enable=true in config.properties.

To try out more optimizations with Intel® Extension for PyTorch*, install Intel® Extension for PyTorch* and add ipex_enable=true in config.properties.

Added Neuron nightly benchmark dashboard #2171 #2167 @namannandan
Enabled torch.compile support for torch 2.0.0 pre-release #2256 @morgandu
Fixed torch.compile mac regression test #2250 @msaroufim
Added configuration option to disable system metrics #2104 @namannandan
Added regression test cases for SageMaker MME contract #2200 @agunapal

In case of OOM , return error code 507 instead of generic code 503

Fixed Error thrown in KServe while loading multi-models #2235 @jagadeeshi2i
Added Docker CI for TorchServe #2226 @fabridamicelli
Change docker image release from dev to production #2227 @agunapal
Supported building docker images with specified Python version #2154 @agunapal
Model archiver optimizations:

a). Added wildcard file search in model archiver --extra-file #2142 @gustavhartz
b). Added zip-store option to model archiver tool #2196 @mreso
c). Made model archiver tests runnable from any directory #2191 @mreso
d). Supported tgz format model decompression in TorchServe frontend #2214 @lxning

Enabled batch processing in example scripted tokenizer #2130 @mreso
Made handler tests callable with pytest #2173 @mreso
Refactored sanity tests #2219 @mreso
Improved benchmark tool #2228 and added auto-validation # 2144 #2157 @agunapal

Automatically flag deviation of metrics from the average of last 30 runs

Added notification for CI jobs' (benchmark, regression test) failure @agunapal
Updated CI to run on ubuntu 20.04 #2153 @agunapal
Added github code scanning codeql.yml #2149 @msaroufim
freeze pynvml version to avoid crash in nvgpu #2138 @mreso
Made pre-commit usage clearer in error message #2241 and upgraded isort version #2132 @msaroufim

Dependency Upgrades

Documentation

Nvidia MPS integration study #2205 @mreso

This study compares TPS b/w TorchServe with Nvidia MPS enabled and TorchServe without Nvidia MPS enabled on P3 and G4. It can help to the decision in enabling MPS for your deployment or not.

Updated TorchServe page on pytorch.org #2243 @agunapal
Lint fixed broken windows Conda link #2240 @msaroufim
Corrected example PT2 doc #2244 @samils7
Fixed regex error in Configuration.md #2172 @mpoemsl
Fixed dead Kubectl links #2160 @msaroufim
Updated model file docs in example doc #2148 @tmc
Example for serving TorchServe using docker #2118 @agunapal
Updated walmart blog link #2117 @agunapal

Platform Support

Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04 MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.0.4). TorchServe now requires Python 3.8 and above, and JDK17.

GPU Support

Torch 2.0.0 + Cuda 11.7, 11.8
Torch 1.13 + Cuda 11.7, 11.8
Torch 1.11 + Cuda 10.2, 11.3, 11.6
Torch 1.9.0 + Cuda 11.1
Torch 1.8.1 + Cuda 9.2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TorchServe v0.8.0 Release Notes