
v14.test4: latest TensorRT and ONNX Runtime libraries

Pre-release

github-actions released this 27 Mar 03:27 · 199 commits to master since this release

This is a preview release for TensorRT 10.0.0, following the v14.test, v14.test2 and v14.test3 releases.

  • The TRT backend no longer supports Maxwell and Pascal GPUs. Other backends still support these GPUs. As with those releases, this release requires driver version >= 525.

  • Added support for SwinIR models for image restoration, which are only supported by the TRT backend and the ORT_CPU backend from vs-mlrt v14.test4 or later. SwinIR-M and SwinIR-L models exhibit precision issues with the fp16 implementation; this is under investigation.

  • Added support for SCUNet models for image denoising, which are only supported by the TRT backend and the ORT_CPU backend from vs-mlrt v14.test4 or later.

  • Added an engine_folder argument to the TRT backend in vsmlrt.py to specify a custom directory for compiled engines (see the sketch after this list).

  • Starting with this pre-release, for dynamically shaped engines the TRT runtime allocates GPU memory based on the actual tile size, whereas in previous releases the runtime had to allocate GPU memory based on the maximum tile size set at engine compile time. This feature requires TensorRT 10 or later.

  • The ORT_* backends now support fp16 I/O. The semantics of the fp16 flag are as follows (see the sketch after this list):

    • Enabling fp16 uses a built-in quantization pass that converts an fp32 onnx to an fp16 onnx. If the input video is in a half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the output_format option (0 = fp32, 1 = fp16).
    • Disabling fp16 skips the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will still be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
  • Reduced the overhead of the ORT_CUDA backend.

  • Added support for TF32 acceleration to the ORT_CUDA backend. Disabled by default (see the sketch after this list).

  • Added an experimental prefer_nhwc flag to the ORT_CUDA backend to reduce the number of layout transformations when using tensor cores (see the same sketch after this list).

  • For production use of the TRT backend, continue to use vsmlrt v13.2. For RIFE and SAFA acceleration on the TRT backend, continue to use any older release.

  • Also check the release notes of the previous pre-releases.
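A minimal sketch of the new engine_folder argument, assuming a typical vsmlrt.py workflow; the model, the clip source, and the other Backend.TRT parameters shown here are illustrative rather than prescribed by this release:

```python
import vapoursynth as vs
from vsmlrt import Backend, DPIR, DPIRModel

core = vs.core

# Placeholder source; any RGBS clip works here.
clip = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)

backend = Backend.TRT(
    fp16=True,
    num_streams=2,                    # pre-existing parameter, shown for context
    engine_folder=r"D:\trt_engines",  # new in v14.test4: custom directory for compiled engines
)

flt = DPIR(clip, strength=5, model=DPIRModel.drunet_color, backend=backend)
flt.set_output()
```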
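A sketch of the fp16 I/O semantics described above, assuming output_format is accepted as a constructor option of the ORT_* backends; the DPIR model choice is only for illustration:

```python
import vapoursynth as vs
from vsmlrt import Backend, DPIR, DPIRModel

core = vs.core

# fp16 enabled: the fp32 onnx is quantized to fp16. With a half-precision (RGBH)
# input clip, the generated onnx takes fp16 input; output_format selects
# fp32 (0) or fp16 (1) output.
clip_h = core.std.BlankClip(format=vs.RGBH, width=1920, height=1080)
flt_h = DPIR(
    clip_h, strength=5, model=DPIRModel.drunet_color,
    backend=Backend.ORT_CUDA(fp16=True, output_format=1),
)

# fp16 disabled: no built-in quantization. If the onnx itself computes in fp16,
# the input clip must match the onnx input format, and the output format is
# inferred from the onnx.
clip_s = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)
flt_s = DPIR(
    clip_s, strength=5, model=DPIRModel.drunet_color,
    backend=Backend.ORT_CUDA(fp16=False),
)
```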
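A sketch of the new ORT_CUDA options. prefer_nhwc is the flag name given above; the tf32 keyword and the waifu2x invocation are assumptions for illustration (the release only states that TF32 support was added and is disabled by default):

```python
import vapoursynth as vs
from vsmlrt import Backend, Waifu2x, Waifu2xModel

core = vs.core

clip = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)

backend = Backend.ORT_CUDA(
    fp16=True,
    tf32=True,         # assumed flag name: opt in to TF32 tensor-core math
    prefer_nhwc=True,  # experimental: fewer NCHW <-> NHWC layout transformations
)

sr = Waifu2x(clip, noise=-1, scale=2,
             model=Waifu2xModel.upconv_7_anime_style_art_rgb, backend=backend)
sr.set_output()
```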


benchmark 1

previous benchmark

  • RTX 4090
    • processor clock @ 2520 MHz
  • Intel Ice Lake server @ 2100 MHz
  • Driver 551.86
  • Windows 10 21H2 (19044.1415)
  • TensorRT 10.0.0
  • VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3

1920x1080 rgbs, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

general

| model | 1 stream | 2 streams | 3 streams |
| --- | --- | --- | --- |
| dpir gray | 22.05 / 1818.796 | 25.30 / 3111.114 | 25.33 / 4403.488 |
| dpir color | 18.30 / 1851.632 | 25.13 / 3176.808 | 25.17 / 4501.984 |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 20.45 / 2148.716 | 41.22 / 3867.240 | 61.21 / 5585.764 |
| waifu2x upresnet10 | 17.91 / 1716.588 | 34.53 / 2941.540 | 42.33 / 4166.492 |
| waifu2x cunet / cugan | 13.89 / 4391.292 | 25.74 / 8346.248 | 25.96 / 12301.202 |
| waifu2x swin_unet | 4.62 / 7436.692 | 5.43 / 14426.812 | 5.43 / 21412.840 |
| real-esrgan (v2/v3, xsx2) | 17.06 / 1087.844 | 33.41 / 1778.264 | 38.26 / 2468.684 |
| scunet gray | 5.29 / 3590.320 | 5.40 / 6678.768 | 5.40 / 9767.208 |
| scunet color | 5.13 / 3555.568 | 5.48 / 6611.308 | 5.47 / 9667.048 |
| swinir-s (2x, color) | 1.63 / 15897.048 | N/A | N/A |
| swinir-m* (2x, color, 720p) | 1.05 / 11305.268 | N/A | N/A |
| swinir-l* (4x, color, 720p) | 0.61 / 15391.316 | N/A | N/A |

*: swinir-m and swinir-l exhibit precision issues.

rife

v2, fp16 i/o

| version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
| --- | --- | --- | --- | --- | --- |
| v4.4-v4.5 | 136.92/778.432 | 273.80/1149.204 | 414.80/1522.028 | 553.70/1892.796 | 574.31/2263.568 |
| v4.6 | 136.01/800.960 | 275.26/1192.212 | 411.01/1585.516 | 544.30/1979.764 | 550.01/2368.020 |
| v4.7-v4.9 | 98.20/1302.724 | 195.78/2187.548 | 210.12/3074.420 | 210.45/3957.196 | 210.66/4844.068 |
| v4.10-v4.15 | 84.41/1595.592 | 160.93/2773.280 | 161.96/3953.020 | 162.04/5132.760 | 162.07/6310.448 |
| {v4.12, v4.13, v4.15, v4.16}_lite | 93.39/1333.444 | 187.32/2255.132 | 197.71/3178.872 | 198.01/4098.508 | 197.95/5022.248 |
| v4.14 lite | 81.83/1595.292 | 153.40/2779.424 | 154.19/3963.260 | 154.28/5149.140 | 154.30/6332.980 |

benchmark 2

previous benchmark

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | ORT_CUDA NCHW | ORT_CUDA NHWC | ORT_DML |
| --- | --- | --- | --- |
| dpir color | 4.54 / 2573.3 | 5.98 / 2470.9 | 8.45 / 2364.5 |
| dpir color (2 streams) | 4.66 / 4854.9 | 6.30 / 4680.8 | 9.48 / 4630.9 |
| waifu2x upconv7 | 10.98 / 5432.5 | 3.18 / 3017.8 | 12.48 / 4493.0 |
| waifu2x upconv7 (2 streams) | 14.96 / 10397.1 | 3.25 / 5780.9 | 21.72 / 8891.7 |
| waifu2x cunet / cugan | 4.70 / 7955.6 | 4.49 / 6290.6 | OOM |
| waifu2x cunet / cugan (2 streams) | 5.11 / 15721.9 | 4.78 / 12312.0 | OOM |
| waifu2x swin_unet_art | 2.98 / 23518.5 | 3.05 / 22812.0 | N/A |
| realesrgan | 8.99 / 1647.7 | 11.20 / 1127.5 | 11.99 / 1346.6 |
| realesrgan (2 streams) | 10.69 / 3034.5 | 13.58 / 1994.1 | 17.34 / 2601.6 |
| rife v4.4 (1920x1088) | 61.42 / 1100.9 | 56.02 / 1162.3 | 44.73 / 882.4 |
| rife v4.4 (1920x1088, 2 streams) | 106.48 / 1953.4 | 92.88 / 2071.9 | 68.80 / 1670.7 |
| scunet color | N/A | N/A | N/A |

benchmark 3

NVIDIA GeForce RTX 2080 Ti, 4352 shaders @ 1700 MHz, driver 552.22, Windows 10 LTSC 21H2 (19044.1415), Python 3.11.9, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | TRT | ORT_CUDA | ORT_DML | ORT_CUDA NHWC |
| --- | --- | --- | --- | --- |
| dpir color (1 stream) | 7.08 / 1899 | 3.10 / 2602 | 4.99 / 2341 | 4.26 / 2411 |
| dpir color (2 streams) | 8.06 / 3376 | 3.30 / 5016 | 5.85 / 4619 | 4.74 / 4650 |
| waifu2x upconv7 (1 stream) | 11.47 / 2014 | 7.01 / 4949 | 7.45 / 4501 | 1.59 / 2923 |
| waifu2x upconv7 (2 streams) | 21.44 / 3782 | 10.11 / 9732 | 13.23 / 8940 | 1.77 / 5674 |
| waifu2x cunet / cugan (1 stream) | 7.41 / 4664 | 3.10 / 10067 | OOM | 0.77 / 6188 |
| waifu2x cunet / cugan (2 streams) | 10.92 / 8863 | OOM | OOM | OOM |
| waifu2x swin_unet_art (1 stream) | 2.35 / 7234 | OOM | N/A | OOM |
| waifu2x swin_unet_art (2 streams) | OOM | OOM | N/A | OOM |
| realesrgan (1 stream) | 8.66 / 1268 | 5.33 / 1545 | 6.39 / 1316 | 6.96 / 1033 |
| realesrgan (2 streams) | 13.20 / 2166 | 7.78 / 2932 | 10.22 / 2571 | 10.25 / 1895 |
| rife v4.4 (1920x1088, fp16 i/o, 1 stream) | 64.97 / 609 | 46.60 / 967 | 32.18 / 723 | 48.55 / 1014 |
| rife v4.4 (1920x1088, fp16 i/o, 2 streams) | 127.38 / 1027 | 69.77 / 1868 | 51.01 / 1385 | 76.10 / 1054 |
| scunet color (1 stream) | 2.73 / 3829 | N/A | N/A | N/A |
| scunet color (2 streams) | 2.85 / 7165 | N/A | N/A | N/A |

Version information:

  • This pre-release uses TensorRT 10.0.0 + CUDA 12.4.0 + cuDNN 8.9.7 + ONNX Runtime 1.18, which requires a minimum driver version of 525 and is compatible with 16/20 series and newer GPUs. Engine compilation time is reduced by up to 40%, but the runtime performance of RIFE models is up to 30% worse, with nearly doubled GPU memory usage.
  • vsmlrt.py in all branches can be used interchangeably.