Releases: AmusementClub/vs-mlrt
v15.7: latest TensorRT libraries, ONNX Runtime and MIGraphX interface improvements
TRT
- Upgraded to TensorRT 10.7.0.
ORT_DML
- Fixed blank output for the first returned frame, reported by @Mr-Z-2697 in #117
MIGX
ORT_COREML
General
- Upgraded to CUDA 12.6.3.
vsmlrt.py
- Added support for RIFE v4.26 heavy model.
Full Changelog: v15.6...v15.7
v15.6: latest TensorRT and OpenVINO libraries
TRT
- Upgraded to TensorRT 10.6.0.
OV
- Upgraded to OpenVINO 2024.5.0, which adds support for Xe2 GPU and NPU 4 on Lunar Lake.
MIGX
- Fixed a missing precision check.
General
- Upgraded to CUDA 12.6.2.
vsmlrt.py
- Added support for RIFE v4.25 lite and heavy models.
Full Changelog: v15.5...v15.6
v15.5: latest TensorRT library, CoreML backend
TRT
- Upgraded to TensorRT 10.5.0.
- Volta GPUs (TITAN V, V100) are no longer supported.
ORT
- Fixed macOS CoreML support for vsort by @yuygfgg in #106. This pull request also added the `ORT_COREML` backend to vsmlrt.py.
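A minimal usage sketch of the new backend, assuming a CoreML-capable Mac; the model path is illustrative, not a bundled file:

```python
import vapoursynth as vs
from vsmlrt import inference, Backend

core = vs.core
src = core.std.BlankClip(format=vs.RGBS, width=1280, height=720)  # placeholder input

# Route inference through ONNX Runtime's CoreML execution provider.
flt = inference(src, "waifu2x/upconv_7_anime_style_art_rgb.onnx",  # illustrative model path
                backend=Backend.ORT_COREML())
```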
General
- Upgraded to CUDA 12.6.1.
vsmlrt.py
- Added support for RIFE v4.25 and v4.26 models.
- Added automatic batch inference support via the `batch_size` option in `inference()` and `flexible_inference()`, which may improve device utilization for inference on small inputs using some small models (see the sketch after this list).
  - On the one hand, batching improves utilization by creating more work for each kernel invocation and reducing the quantization inefficiency of kernel tiles in bulk parallelism. It also reduces the average kernel launch and synchronization overhead per unit of work.
  - On the other hand, batching causes cache misses and inserts bubbles into the pipeline, which may degrade performance.

  This feature requires flexible output support starting with vs-mlrt v15 and is inspired by styler00dollar/VSGAN-tensorrt-docker@ac47012.
  Note that not all onnx models are supported.
  - Future RIFE v2 models will be fixed to support batch inference.
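A hedged sketch of opting into batching, mirroring the benchmark configuration below; the model path is illustrative:

```python
import vapoursynth as vs
from vsmlrt import inference, Backend

core = vs.core
src = core.std.BlankClip(format=vs.RGBS, width=720, height=480)  # matches the benchmark input

flt = inference(
    src, "realesrgan-compact.onnx",  # illustrative model path
    backend=Backend.TRT(fp16=True, use_cuda_graph=True),
    batch_size=2,  # two frames per inference call; needs flexible output (vs-mlrt >= v15)
)
```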
benchmark:
- NVIDIA GeForce RTX 4090
- driver 560.94
- Windows Server 2019
- python 3.12.6, vapoursynth-classic R57.A10, vs-mlrt v15.4
- input: 720x480 RGBS
- backend:
TRT(fp16=True, use_cuda_graph=True)
Measurements: FPS / Device Memory (MB)
model | batch 1 | batch 2 |
---|---|---|
realesrgan compact (stream 1) | 73.01 / 708 | 138.68 / 950 |
realesrgan compact (streams 2) | 107.81 / 914 | 263.87 / 1347 |
realesrgan compact (streams 3) | 108.30 / 1128 | 348.23 / 1738 |
realesrgan ultracompact (stream 1) | 99.43 / 702 | 165.52 / 950 |
realesrgan ultracompact (streams 2) | 184.48 / 908 | 302.56 / 1344 |
realesrgan ultracompact (streams 3) | 184.69 / 1114 | 458.18 / 1738 |
Full Changelog: v15.4...v15.5
v15.4: latest TensorRT library
TRT
- Upgraded to TensorRT 10.4.0.
General
- Upgraded to CUDA 12.6.0.
vsmlrt.py
- Added support for Ani4K-v2 model by @srk24 in #105
- Added support for RIFE v4.23 and v4.24 models.
- Added the `max_tactics` option to the `TRT` backend, which can reduce engine build time by limiting the number of tactics to time (see the sketch after this list).
  - By default, TensorRT determines the number of tactics based on its own heuristic.
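A hedged sketch of capping tactic timing; the value and the model path are illustrative:

```python
import vapoursynth as vs
from vsmlrt import inference, Backend

core = vs.core
src = core.std.BlankClip(format=vs.RGBS, width=1280, height=720)  # placeholder input

# Time at most 4 tactics per layer instead of TensorRT's heuristic count;
# engines build faster, possibly at some runtime performance cost.
backend = Backend.TRT(fp16=True, max_tactics=4)
flt = inference(src, "dpir/drunet_color.onnx", backend=backend)  # illustrative model path
```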
Batch Inference (Preview)
The latest vsmlrt.py (not in v15.4) provides experimental support for batch inference via the `batch_size` option in `inference()` and `flexible_inference()`, which may improve device utilization for inference on small inputs using some small models.
This feature requires flexible output support starting with vs-mlrt v15 and is inspired by styler00dollar/VSGAN-tensorrt-docker@ac47012.
Note that not all onnx models are supported.
Preliminary benchmark:
- NVIDIA GeForce RTX 4090
- driver 560.94
- Windows Server 2019
- python 3.12.6, vapoursynth-classic R57.A10
- input: 720x480 RGBS
- backend:
TRT(fp16=True, use_cuda_graph=True)
Measurements: FPS / Device Memory (MB)
model | batch 1 | batch 2 |
---|---|---|
realesrgan compact (stream 1) | 73.01 / 708 | 138.68 / 950 |
realesrgan compact (streams 2) | 107.81 / 914 | 263.87 / 1347 |
realesrgan compact (streams 3) | 108.30 / 1128 | 348.23 / 1738 |
realesrgan ultracompact (stream 1) | 99.43 / 702 | 165.52 / 950 |
realesrgan ultracompact (streams 2) | 184.48 / 908 | 302.56 / 1344 |
realesrgan ultracompact (streams 3) | 184.69 / 1114 | 458.18 / 1738 |
Full Changelog: v15.3...v15.4
v15.3: MIGraphX on Windows
MIGX
- Added experimental MIGraphX support on Windows. MIGraphX is AMD's graph optimization engine for accelerating machine learning model inference. Supported GPU architectures include:
- gfx1030: Radeon RX 6950 XT, Radeon RX 6900 XT, Radeon RX 6800 XT, Radeon RX 6800, ...
- gfx1100: Radeon RX 7900 XTX, Radeon RX 7900 XT, ...
- gfx1101: Radeon RX 7700 XT, ...
- gfx1102: Radeon RX 7600
Relevant archives include:
- `vsmlrt-windows-x64-migraphx.<version>.7z`: the all-in-one archive. Contains the `vsmlrt.py` Python wrapper, some built-in ONNX models, and the `vsmigx`/`vsov`/`vsort`/`vsncnn` plugins and runtime. Supports the `MIGX`/`ORT_CPU`/`ORT_DML`/`OV_CPU`/`OV_GPU`/`OV_NPU`/`NCNN_VK` backends.
- `VSMIGX-Windows-x64.<version>.7z`: contains the `vsmigx.dll` plugin only.
- `vsmlrt-hip.<version>.7z`: contains the HIP and MIGraphX runtime only.
The MIGraphX runtime in this release uses HIP 6.1.2 and MIGraphX 2.11 (9cf49f9). Note that Windows support has not been officially announced by AMD.
Known limitation
- The `MIGX` backend in the vsmlrt.py wrapper does not support device selection and will always use the default device (`device_id=0`).
General
vsmlrt.py
- Added support for RIFE v4.22 (lite) models.
Full Changelog: v15.2...v15.3
v15.2: latest TensorRT library
TRT
- Upgraded to TensorRT 10.3.0.
- Fixed the performance regression of RIFE and SAFA models that started with vs-mlrt v14.test4. This version may still be slightly slower than vs-mlrt v14.test3 under some conditions, however.
General
- Upgraded to CUDA 12.5.1.
vsmlrt.py
- Added support for RIFE v4.19 ~ v4.21 models.
- Added support for ArtCNN R8F64 (chroma) models.
- Deprecated ArtCNN C4F32 models at the developer's request; compatibility at the vsmlrt.py level remains guaranteed.
Full Changelog: v15.1...v15.2
v15.1: latest TensorRT library
TRT
- Upgraded to TensorRT 10.2.0.
- Added a TensorRT release package (`vsmlrt-windows-x64-tensorrt`). #102

  This package is a strict subset of the CUDA release package, with the cuDNN and cuBLAS libraries and support for the `ORT_CUDA` backend removed. It supports the `TRT`, `OV_*`, `ORT_CPU`, `ORT_DML` and `NCNN_VK` backends.
known issue
- According to the documentation:

  > There is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2.

  This affects RIFE and SAFA models. vs-mlrt v14.test3 is the latest release that is not affected. This will be fixed in the next release by TensorRT 10.3.0.
General
- Upgraded to CUDA 12.5.0.
vsmlrt.py
- Added support for RIFE v4.17 lite and v4.18 models.
Full Changelog: v15...v15.1
v15: latest TensorRT library
General
plugins
- Added the `flexible_output_prop` parameter for flexible output:

  Traditionally, all plugins could only support onnx models with one or three output channels, due to a vapoursynth limitation. By using the new flexible output feature, plugins can support onnx models with an arbitrary number of output planes.

  ```python
  from typing import TypedDict

  class Output(TypedDict):
      clip: vs.VideoNode
      num_planes: int

  prop = "planes"  # arbitrary non-empty string

  output = core.ov.Model(src, network_path, flexible_output_prop=prop)  # type: Output

  clip = output["clip"]
  num_planes = output["num_planes"]

  output_planes = [
      clip.std.PropToClip(prop=f"{prop}{i}")
      for i in range(num_planes)
  ]  # type: list[vs.VideoNode]
  ```
This feature is supported by all plugins starting with vs-mlrt v15.
vsmlrt.py
- Added support for RIFE v4.17 models.
- Added support for ArtCNN models optimised for anime content. The chroma variants are not supported on previous versions of vs-mlrt, because they require the flexible output feature.
- Added the `flexible_inference` function for flexible output. The sample above simplifies to:

  ```python
  output_planes = flexible_inference(src, network_path)  # type: list[vs.VideoNode]
  ```
TRT
- Upgraded to TensorRT 10.1.0.
known issue
- According to the documentation:

  > There is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2.

  This affects RIFE and SAFA models. vs-mlrt v14.test3 is the latest release that is not affected.
Community contributions
- Fix `multiple flexible_output_prop keyword argument` error by @LightArrowsEXE in #97
- Fix missing spaces in exceptions by @LightArrowsEXE in #98
Full Changelog: v14...v15
v14: latest libraries
Compared to the previous stable (v13.2) release:
General
- External models are no longer packaged.
vsmlrt.py
- Plugin invocation order in the `get_plugin_path()` function is sorted to reduce memory consumption.
- Added support for RIFE v4.7 ~ v4.16 (lite, ensemble) models.
- Added support for SCUNet models for image denoising.
TRT
plugin and runtime libraries
- Upgraded to TensorRT 10.0.1.
- Maxwell and Pascal GPUs are no longer supported. Other backends still support these GPUs.
- Reduced GPU memory usage for dynamically shaped engines when the actual tile size is smaller than the maximum tile size set during engine building.
- Reduced engine build time.
- Added long path support for engines on Windows.
- cuDNN is no longer a strict runtime dependency.
vsmlrt.py
- The cuDNN tactic is no longer enabled by default.
- TF32 acceleration is disabled by default.
- The maximum workspace is set to `None`, i.e. the total memory size of the GPU.
- Added parameters `builder_optimization_level`, `max_aux_streams`, `bf16` (#64), `custom_env`, `custom_args`, `short_path` and `engine_folder` (#90); see the sketch after this list:
  - `builder_optimization_level`: "adjust how long TensorRT should spend searching for tactics with potentially better performance" link
  - `max_aux_streams`: within-inference multi-streaming; "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." link
  - `bf16`: "TensorRT supports the bfloat16 (brain float) floating point format on NVIDIA Ampere and later architectures ... Note that not all layers support bfloat16." link
  - `custom_env`, `custom_args`: custom environment variables and arguments for the trtexec engine build.
  - `short_path`: whether to shorten the engine name.
    - On Windows, this can be useful for addressing the maximum path length limitation, and it is enabled by default.
  - `engine_folder`: specifies a custom directory for engines.
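A hedged sketch combining several of these parameters; the values, paths and model are illustrative, not recommendations:

```python
import vapoursynth as vs
from vsmlrt import inference, Backend

core = vs.core
src = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)  # placeholder input

backend = Backend.TRT(
    fp16=True,
    builder_optimization_level=3,    # longer tactic search may yield faster engines
    max_aux_streams=0,               # disable within-inference multi-streaming
    bf16=False,                      # bfloat16 kernels (Ampere and later only)
    short_path=True,                 # shorten engine names (default on Windows)
    engine_folder="D:/trt-engines",  # custom directory for built engines
)
flt = inference(src, "dpir/drunet_color.onnx", backend=backend)  # illustrative model path
```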
known issues
- According to the documentation:

  > There is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2.

  This affects RIFE and SAFA models.
- trtexec may report errors like:

  ```
  [E] Error[9]: Skipping tactic 0xded5318b4a444b84 due to exception Cask convolution execution
  [E] Error[2]: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
  ```

  This issue has been submitted to NVIDIA.
ORT
- Upgraded to ONNX Runtime v1.18.0.
interface
- The `ORT_*` backends now support fp16 I/O. The semantics of the `fp16` flag in these backends are as follows (see the sketch after this list):
  - Enabling `fp16` will use a built-in quantization that converts an fp32 onnx to an fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the `output_format` option (0 = fp32, 1 = fp16).
  - Disabling `fp16` will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
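A hedged sketch of the first case (fp16 enabled with half-precision I/O); the model path is illustrative, and `output_format` is assumed to be accepted by the backend constructor as described above:

```python
import vapoursynth as vs
from vsmlrt import inference, Backend

core = vs.core

# RGBH (half-precision) input: with fp16=True the built-in quantization
# generates an fp16 onnx with fp16 input; output_format=1 requests fp16 output.
src16 = core.std.BlankClip(format=vs.RGBH, width=1920, height=1080)  # placeholder input
flt = inference(
    src16, "waifu2x/upconv_7_anime_style_art_rgb.onnx",  # illustrative model path
    backend=Backend.ORT_CUDA(fp16=True, output_format=1),
)
```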
CUDA
- Reduced execution overhead.
- Added support for TF32 acceleration. This is disabled by default.
- Added an experimental `prefer_nhwc` flag to reduce the number of layout transformations when using tensor cores. This is disabled by default (see the sketch below).
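A one-line hedged sketch, assuming the flag is exposed on the backend constructor:

```python
from vsmlrt import Backend

backend = Backend.ORT_CUDA(fp16=True, prefer_nhwc=True)  # experimental NHWC layout preference
```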
OV
- Upgraded to OpenVINO 2024.2.0.
- Added an experimental `OV_NPU` backend for Intel NPUs.
MIGX
- Added support for MIGraphX backend for AMD GPUs. Currently this backend is Linux only.
Community contributions
- `scripts/vsmlrt.py`: update esrgan janai models by @hooke007 in #53
- `scripts/vsmlrt.py`: add more esrgan janai models by @hooke007 in #82
- `vsmigx`: allow fp16 input & output by @abihf in #86
- `scripts/vsmlrt.py`: fix fp16 precision issues of RIFE v2 representations by @charlessuh in #66 (comment)
Benchmark
NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8
1920x1080 RGBS, TRT backend, CUDA graphs enabled, fp16
Measurements: FPS / Device Memory (MB)
model | 1 stream | 2 streams | 3 streams |
---|---|---|---|
dpir color | 10.99 / 1715.172 | 11.62 / 3048.540 | 11.64 / 4381.912 |
waifu2x upconv_7_{anime_style_art_rgb, photo} | 22.38 / 2016.352 | 32.66 / 3734.880 | 32.54 / 5453.404 |
waifu2x cunet / cugan | 12.41 / 4359.284 | 15.53 / 8363.392 | 15.47 / 12367.504 |
waifu2x swin_unet | 3.80 / 7304.332 | 4.06 / 14392.408 | 4.06 / 21276.380 |
real-esrgan (v2/v3, xsx2) | 16.65 / 955.480 | 22.53 / 1645.904 | 22.49 / 2336.324 |
scunet color | 4.20 / 2847.708 | 4.33 / 6646.884 | 4.33 / 9792.736 |
Also check benchmarks from previous pre-releases v14.test4 (NVIDIA RTX 2080 Ti/3090/4090 GPUs) and v14.test3 (NVIDIA RTX 4090 and AMD RX 7900 XTX GPUs).
This release uses CUDA 12.4.1, cuDNN 8.9.7, TensorRT 10.0.1, ONNX Runtime v1.18.0, OpenVINO 2024.2.0 and ncnn 20220915 b16f8ca.
Full Changelog: v13.2...v14
v14.test4: latest TensorRT and ONNX Runtime libraries
This is a preview release for TensorRT 10.0.0, following the v14.test, v14.test2 and v14.test3 releases.
- The `TRT` backend no longer supports Maxwell and Pascal GPUs. Other backends still support these GPUs. As with those releases, the current release requires driver version >= 525.
- Added support for SwinIR models for image restoration, which are only supported by the `TRT` backend and the `ORT_CPU` backend from vs-mlrt v14.test4 or later. SwinIR-M and SwinIR-L models exhibit a precision issue with the fp16 implementation; this is under investigation.
- Added support for SCUNet models for image denoising, which are only supported by the `TRT` backend and the `ORT_CPU` backend from vs-mlrt v14.test4 or later.
- Added the `engine_folder` argument to the `TRT` backend in vsmlrt.py to specify a custom directory for engines.
- Starting with this pre-release, for dynamically shaped engines, the trt runtime allocates gpu memory based on the actual tile size, whereas in previous releases, the runtime would have to allocate gpu memory based on the maximum tile size set at engine compile time. This feature requires TensorRT 10 or later.
- The `ORT_*` backends now support fp16 I/O. The semantics of the `fp16` flag are as follows:
  - Enabling `fp16` will use a built-in quantization that converts an fp32 onnx to an fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the `output_format` option (0 = fp32, 1 = fp16).
  - Disabling `fp16` will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
- Reduced the overhead of the `ORT_CUDA` backend.
- Added support for TF32 acceleration to the `ORT_CUDA` backend. Disabled by default.
- Added an experimental `prefer_nhwc` flag to the `ORT_CUDA` backend to reduce the number of layout transformations when using tensor cores.
- For production use of the `TRT` backend, continue to use vsmlrt v13.2. For RIFE and SAFA acceleration on the `TRT` backend, continue to use any old release.
- Also check the release notes of the previous pre-releases.
benchmark 1
- RTX 4090
- processor clock @ 2520 MHz
- Intel Icelake server @ 2100 MHz
- Driver 551.86
- Windows 10 21H2 (19044.1415)
- TensorRT 10.0.0
- VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3
1920x1080 rgbs, CUDA graphs enabled, fp16
Measurements: FPS / Device Memory (MB)
general
model | 1 stream | 2 streams | 3 streams |
---|---|---|---|
dpir gray | 22.05 / 1818.796 | 25.30 / 3111.114 | 25.33 / 4403.488 |
dpir color | 18.30 / 1851.632 | 25.13 / 3176.808 | 25.17 / 4501.984 |
waifu2x upconv_7_{anime_style_art_rgb, photo} | 20.45 / 2148.716 | 41.22 / 3867.240 | 61.21 / 5585.764 |
waifu2x upresnet10 | 17.91 / 1716.588 | 34.53 / 2941.540 | 42.33 / 4166.492 |
waifu2x cunet / cugan | 13.89 / 4391.292 | 25.74 / 8346.248 | 25.96 / 12301.202 |
waifu2x swin_unet | 4.62 / 7436.692 | 5.43 / 14426.812 | 5.43 / 21412.840 |
real-esrgan (v2/v3, xsx2) | 17.06 / 1087.844 | 33.41 / 1778.264 | 38.26 / 2468.684 |
scunet gray | 5.29 / 3590.320 | 5.40 / 6678.768 | 5.40 / 9767.208 |
scunet color | 5.13 / 3555.568 | 5.48 / 6611.308 | 5.47 / 9667.048 |
swinir-s (2x, color) | 1.63 / 15897.048 | N/A | N/A |
swinir-m* (2x, color, 720p) | 1.05 / 11305.268 | N/A | N/A |
swinir-l* (4x, color, 720p) | 0.61 / 15391.316 | N/A | N/A |
*: swinir-m and swinir-l exhibit precision issues.
rife
v2, fp16 i/o
version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
---|---|---|---|---|---|
v4.4-v4.5 | 136.92/778.432 | 273.80/1149.204 | 414.80/1522.028 | 553.70/1892.796 | 574.31/2263.568 |
v4.6 | 136.01/800.960 | 275.26/1192.212 | 411.01/1585.516 | 544.30/1979.764 | 550.01/2368.020 |
v4.7-v4.9 | 98.20/1302.724 | 195.78/2187.548 | 210.12/3074.420 | 210.45/3957.196 | 210.66/4844.068 |
v4.10-v4.15 | 84.41/1595.592 | 160.93/2773.280 | 161.96/3953.020 | 162.04/5132.760 | 162.07/6310.448 |
{v4.12, v4.13, v4.15, v4.16}_lite | 93.39/1333.444 | 187.32/2255.132 | 197.71/3178.872 | 198.01/4098.508 | 197.95/5022.248 |
v4.14 lite | 81.83/1595.292 | 153.40/2779.424 | 154.19/3963.260 | 154.28/5149.140 | 154.30/6332.980 |
benchmark 2
NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | ORT_CUDA NCHW | ORT_CUDA NHWC | ORT_DML |
---|---|---|---|
dpir color | 4.54 / 2573.3 | 5.98 / 2470.9 | 8.45 / 2364.5 |
dpir color (2 streams) | 4.66 / 4854.9 | 6.30 / 4680.8 | 9.48 / 4630.9 |
waifu2x upconv7 | 10.98 / 5432.5 | 3.18 / 3017.8 | 12.48 / 4493.0 |
waifu2x upconv7 (2 streams) | 14.96 / 10397.1 | 3.25 / 5780.9 | 21.72 / 8891.7 |
waifu2x cunet / cugan | 4.70 / 7955.6 | 4.49 / 6290.6 | OOM |
waifu2x cunet / cugan (2 streams) | 5.11 / 15721.9 | 4.78 / 12312.0 | OOM |
waifu2x swin_unet_art | 2.98 / 23518.5 | 3.05 / 22812.0 | N/A |
realesrgan | 8.99 / 1647.7 | 11.20 / 1127.5 | 11.99 / 1346.6 |
realesrgan (2 streams) | 10.69 / 3034.5 | 13.58 / 1994.1 | 17.34 / 2601.6 |
rife v4.4 (1920x1088) | 61.42 / 1100.9 | 56.02 / 1162.3 | 44.73 / 882.4 |
rife v4.4 (1920x1088, 2 streams) | 106.48 / 1953.4 | 92.88 / 2071.9 | 68.80 / 1670.7 |
scunet color | N/A | N/A | N/A |
benchmark 3
NVIDIA GeForce RTX 2080 Ti, 4352 shaders @ 1700 MHz, driver 552.22, Windows 10 LTSC 21H2 (19044.1415), Python 3.11.9, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | TRT | ORT_CUDA | ORT_DML | ORT_CUDA NHWC |
---|---|---|---|---|
dpir color (1 stream) | 7.08 / 1899 | 3.10 / 2602 | 4.99 / 2341 | 4.26 / 2411 |
dpir color (2 streams) | 8.06 / 3376 | 3.30 / 5016 | 5.85 / 4619 | 4.74 / 4650 |
waifu2x upconv7 (1 stream) | 11.47 / 2014 | 7.01 / 4949 | 7.45 / 4501 | 1.59 / 2923 |
waifu2x upconv7 (2 streams) | 21.44 / 3782 | 10.11 / 9732 | 13.23 / 8940 | 1.77 / 5674 |
waifu2x cunet / cugan (1 stream) | 7.41 / 4664 | 3.10 / 10067 | OOM | 0.77 / 6188 |
waifu2x cunet / cugan (2 streams) | 10.92 / 8863 | OOM | OOM | OOM |
waifu2x swin_unet_art (1 stream) | 2.35 / 7234 | OOM | N/A | OOM |
waifu2x swin_unet_art (2 streams) | OOM | OOM | N/A | OOM |
realesrgan (1 stream) | 8.66 / 1268 | 5.33 / 1545 | 6.39 / 1316 | 6.96 / 1033 |
realesrgan (2 streams) | 13.20 / 2166 | 7.78 / 2932 | 10.22 / 2571 | 10.25 / 1895 |
rife v4.4 (1920x1088, fp16 i/o, 1 stream) | 64.97 / 609 | 46.60 / 967 | 32.18 / 723 | 48.5... |