TorchServe v0.8.0 Release Notes
This is the release of TorchServe v0.8.0.
New Features
- Supported large model inference in distributed environment #2193 #2320 #2209 #2215 #2310 #2218 @lxning @HamidShojanazeri
TorchServe added deep integration to support large model inference. It provides a PyTorch-native large model inference solution by integrating PiPPy, and offers the flexibility and extensibility to support other popular libraries such as Microsoft DeepSpeed and Hugging Face Accelerate.
To improve the user experience of generative AI inference, TorchServe allows sending intermediate token responses to the client side by supporting gRPC server-side streaming and HTTP 1.1 chunked encoding.
By leveraging torch.compile, it is now possible to run TorchServe with XLA, which is optimized for both GPU and TPU deployments.
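For reference, the PiPPy path is driven from the model config YAML. Below is a minimal sketch modeled on the Hugging Face PiPPy sample under examples/large_models; the field names and values are illustrative and may differ from the shipped example:

```yaml
# model_config.yaml -- illustrative sketch for PiPPy-based pipeline parallelism
minWorkers: 1
maxWorkers: 1
responseTimeout: 120
parallelType: "pp"        # pipeline parallelism via PiPPy
deviceType: "gpu"
torchrun:
  nproc-per-node: 4       # number of GPUs participating in the pipeline
pippy:
  chunks: 1               # micro-batch chunks pushed through the pipeline
  input_names: ["input_ids"]
  model_type: "HF"
```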
- Implemented New Metrics platform #2199 #2190 #2165 @namannandan @lxning
TorchServe fully supports metrics in Prometheus mode or Log mode. Both frontend and backend metrics can be configured in a central metrics YAML file.
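For illustration, the central metrics YAML file takes roughly the following shape (a minimal sketch based on the default metrics configuration; the metric names and units shown are examples only):

```yaml
# metrics.yaml -- minimal sketch of the central metrics configuration
dimensions:
  - &model_name "ModelName"
  - &level "Level"

ts_metrics:               # frontend metrics
  counter:
    - name: Requests2XX
      unit: Count
      dimensions: [*level, "Hostname"]

model_metrics:            # backend metrics
  gauge:
    - name: HandlerTime
      unit: ms
      dimensions: [*model_name, *level, "Hostname"]
```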
- Added a config-file option for model config to the model archiver tool. Users can flexibly define customized parameters in this YAML file and easily access them in the backend handler via the variable context.model_yaml_config. This new feature also makes it easier for TorchServe to support other new features and enhancements.
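As a sketch, user-defined parameters can sit next to the standard per-model settings; the keys under handler below are arbitrary placeholders, and a custom handler would read them from the dict exposed as context.model_yaml_config:

```yaml
# model_config.yaml -- sketch; keys under "handler" are user-defined placeholders
minWorkers: 1
maxWorkers: 2
handler:
  max_new_tokens: 128     # read in the handler as, e.g.,
  temperature: 0.8        #   context.model_yaml_config["handler"]["max_new_tokens"]
```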
- Refactored PT2.0 support #2222 @msaroufim
We've refactored our model optimization utilities and improved logging to help debug compilation issues. We've also deprecated compile.json in favor of the new YAML config format; follow our guide at https://github.com/pytorch/serve/blob/master/examples/pt2/README.md to learn more. The main difference is that when archiving a model, instead of passing in compile.json via --extra-files, we now pass in --config-file model_config.yaml.
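For example, a model config YAML carrying the torch.compile setting might look like this minimal sketch (see the PT2 README linked above for the authoritative syntax); it is then passed to torch-model-archiver via --config-file instead of shipping compile.json through --extra-files:

```yaml
# model_config.yaml -- sketch of the PT2 option that replaces compile.json
pt2: "inductor"   # TorchDynamo/torch.compile backend to use
```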
- By default, TorchServe uses a round-robin algorithm to assign GPUs to a worker on a host. Starting with v0.8.0, TorchServe allows users to define deviceIds in model_config.yaml to assign specific GPUs to a model.
- TorchServe supports hybrid mode on a GPU host. Users can define deviceType in the model config YAML file to deploy a model on the CPU of a GPU host.
- TorchServe allows users to define clientTimeoutInMills in the model config YAML file. If clientTimeoutInMills is set, TorchServe calculates the expiration timestamp of each incoming inference request and drops the request once it expires.
- Supported maxRetryTimeoutInSec, which defines the maximum time window for recovering a dead backend worker of a model, in the model config YAML file. The default value is 5 minutes and can be adjusted per model. The ping endpoint returns 200 if all models have enough healthy workers (i.e., equal to or more than minWorkers); otherwise it returns 500. These per-model settings are sketched together below.
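Taken together, the new per-model settings can be combined in a single model_config.yaml; the values below are illustrative:

```yaml
# model_config.yaml -- illustrative values for the new per-model settings
minWorkers: 2
maxWorkers: 4
deviceType: "gpu"            # or "cpu" to keep this model on CPU on a GPU host
deviceIds: [0, 1]            # pin this model's workers to specific GPUs
clientTimeoutInMills: 30000  # drop requests that have waited longer than 30 s
maxRetryTimeoutInSec: 300    # max window for recovering a dead backend worker
```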
New Examples
- Example of PiPPy onboarding the open platform framework for distributed model inference #2215 @HamidShojanazeri
- Example of DeepSpeed onboarding the open platform framework for distributed model inference #2218 @lxning
Improvements
- Enabled core pinning in CPU nightly benchmark #2166 #2237 @min-jean-cho
TorchServe can be used with Intel® Extension for PyTorch* to give a performance boost on Intel hardware. Intel® Extension for PyTorch* is a Python package that extends PyTorch with up-to-date feature optimizations taking advantage of AVX-512 Vector Neural Network Instructions (AVX-512 VNNI), Intel® Advanced Matrix Extensions (Intel® AMX), and more.
Enabling core pinning in the TorchServe CPU nightly benchmark shows a significant performance speedup. This feature is implemented via a script under the PyTorch Xeon backend, initiated from Intel® Extension for PyTorch*. To try out core pinning on your workload, add cpu_launcher_enable=true to config.properties. To try out more optimizations with Intel® Extension for PyTorch*, install it and also add ipex_enable=true to config.properties, as shown below.
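The corresponding config.properties entries are simply the two flags mentioned above:

```properties
# config.properties
# enable core pinning via the CPU launcher script
cpu_launcher_enable=true
# enable Intel(R) Extension for PyTorch* optimizations (requires the package to be installed)
ipex_enable=true
```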
- Added Neuron nightly benchmark dashboard #2171 #2167 @namannandan
- Enabled torch.compile support for torch 2.0.0 pre-release #2256 @morgandu
- Fixed torch.compile mac regression test #2250 @msaroufim
- Added configuration option to disable system metrics #2104 @namannandan
- Added regression test cases for SageMaker MME contract #2200 @agunapal
In case of OOM, error code 507 is returned instead of the generic code 503.
- Fixed error thrown in KServe while loading multi-models #2235 @jagadeeshi2i
- Added Docker CI for TorchServe #2226 @fabridamicelli
- Changed docker image release from dev to production #2227 @agunapal
- Supported building docker images with a specified Python version #2154 @agunapal
- Model archiver optimizations:
  - Added wildcard file search in model archiver --extra-files #2142 @gustavhartz
  - Added zip-store option to model archiver tool #2196 @mreso
  - Made model archiver tests runnable from any directory #2191 @mreso
  - Supported tgz format model decompression in TorchServe frontend #2214 @lxning
- Enabled batch processing in example scripted tokenizer #2130 @mreso
- Made handler tests callable with pytest #2173 @mreso
- Refactored sanity tests #2219 @mreso
- Improved benchmark tool #2228 and added auto-validation #2144 #2157 @agunapal
Automatically flags deviations of metrics from the average of the last 30 runs.
- Added notifications for CI job (benchmark, regression test) failures @agunapal
- Updated CI to run on Ubuntu 20.04 #2153 @agunapal
- Added GitHub code scanning via codeql.yml #2149 @msaroufim
- Froze pynvml version to avoid crash in nvgpu #2138 @mreso
- Made pre-commit usage clearer in error message #2241 and upgraded isort version #2132 @msaroufim
Dependency Upgrades
Documentation
- Added a study comparing TPS between TorchServe with NVIDIA MPS enabled and TorchServe without MPS on P3 and G4 instances; it can help in deciding whether to enable MPS for your deployment.
- Updated TorchServe page on pytorch.org #2243 @agunapal
- Fixed lint and a broken Windows Conda link #2240 @msaroufim
- Corrected example PT2 doc #2244 @samils7
- Fixed regex error in Configuration.md #2172 @mpoemsl
- Fixed dead Kubectl links #2160 @msaroufim
- Updated model file docs in example doc #2148 @tmc
- Added an example for serving TorchServe using Docker #2118 @agunapal
- Updated Walmart blog link #2117 @agunapal
Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, macOS 10.14+, Windows 10 Pro, Windows Server 2019, and Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.0.4). TorchServe now requires Python 3.8 and above, and JDK 17.
GPU Support
Torch 2.0.0 + CUDA 11.7, 11.8
Torch 1.13 + CUDA 11.7, 11.8
Torch 1.11 + CUDA 10.2, 11.3, 11.6
Torch 1.9.0 + CUDA 11.1
Torch 1.8.1 + CUDA 9.2