waifu2x

Waifu2x is a well-known image super-resolution neural network for anime-style art.

Link:

Models

Includes all known publicly available waifu2x models:

anime_style_art: requires pre-scaled input for the scale2.0x variant
- noise1 noise2 noise3 scale2.0x
anime_style_art_rgb: requires pre-scaled input for the scale2.0x variant
- noise0 noise1 noise2 noise3 scale2.0x
photo: requires pre-scaled input for the scale2.0x variant
- noise0 noise1 noise2 noise3 scale2.0x
ukbench: requires pre-scaled input
- scale2.0x
upconv_7_anime_style_art_rgb
- scale2.0x noise3_scale2.0x noise2_scale2.0x noise1_scale2.0x noise0_scale2.0x
upconv_7_photo
- scale2.0x noise0_scale2.0x noise1_scale2.0x noise2_scale2.0x noise3_scale2.0x
cunet: tile size (block_w and block_h) must be multiples of 4.
- noise0 noise1 noise2 noise3
- scale2.0x
- noise0_scale2.0x noise1_scale2.0x noise2_scale2.0x noise3_scale2.0x
upresnet10
- scale2.0x
- noise0_scale2.0x noise1_scale2.0x noise2_scale2.0x noise3_scale2.0x
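The pre-scaling rules in the list above can be captured in a small lookup. This is a minimal sketch: `VARIANTS`, `PRESCALE_FAMILIES`, and `requires_prescale` are hypothetical names used for illustration and are not part of vsmlrt.

```python
# Variant names per model family, transcribed from the list above.
NOISE_LEVELS = ("noise0", "noise1", "noise2", "noise3")
COMBINED = tuple(f"{n}_scale2.0x" for n in NOISE_LEVELS)

VARIANTS = {
    "anime_style_art": ("noise1", "noise2", "noise3", "scale2.0x"),
    "anime_style_art_rgb": NOISE_LEVELS + ("scale2.0x",),
    "photo": NOISE_LEVELS + ("scale2.0x",),
    "ukbench": ("scale2.0x",),
    "upconv_7_anime_style_art_rgb": ("scale2.0x",) + COMBINED,
    "upconv_7_photo": ("scale2.0x",) + COMBINED,
    "cunet": NOISE_LEVELS + ("scale2.0x",) + COMBINED,
    "upresnet10": ("scale2.0x",) + COMBINED,
}

# Families whose scale2.0x variant expects an already-upscaled input.
PRESCALE_FAMILIES = frozenset(
    {"anime_style_art", "anime_style_art_rgb", "photo", "ukbench"}
)


def requires_prescale(family: str, variant: str) -> bool:
    """Return True if the input must be upscaled 2x before inference."""
    if family not in VARIANTS:
        raise ValueError(f"unknown model family: {family!r}")
    if variant not in VARIANTS[family]:
        raise ValueError(f"{family!r} has no variant {variant!r}")
    return family in PRESCALE_FAMILIES and variant == "scale2.0x"
```

For example, `requires_prescale("photo", "scale2.0x")` is true, while the same variant of `cunet` or `upresnet10` upscales on its own.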

`vsmlrt.py` wrapper Usage

To simplify usage, we provide a Python wrapper module, vsmlrt, that exposes the full functionality of waifu2x-caffe through a more Pythonic interface:

import vapoursynth as vs
from vapoursynth import core
from vsmlrt import Waifu2x, Waifu2xModel, Backend

src = core.std.BlankClip(format=vs.RGBS)

# backend could be:
#  - CPU Backend.OV_CPU(): the recommended CPU backend; generally faster than ORT-CPU.
#  - CPU Backend.ORT_CPU(num_streams=1, verbosity=2): vs-ort cpu backend.
#  - GPU Backend.ORT_CUDA(device_id=0, cudnn_benchmark=True, num_streams=1, verbosity=2)
#     - use device_id to select device
#     - set cudnn_benchmark=False to reduce script reload latency when debugging, at the cost of a slight throughput penalty.
#  - GPU Backend.TRT(fp16=True, device_id=0, num_streams=1): TensorRT runtime, the fastest NV GPU runtime.
flt = Waifu2x(src, noise=-1, scale=2, model=Waifu2xModel.upconv_7_anime_style_art_rgb, backend=Backend.ORT_CUDA())
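The `noise` and `scale` arguments in the call above correspond to the variant names in the Models list. The sketch below shows one plausible mapping. `variant_name` is a hypothetical helper, not a vsmlrt function, and reading `noise=-1` as "no denoising" is an assumption inferred from the example call rather than taken from the wrapper's source.

```python
def variant_name(noise: int, scale: int) -> str:
    """Map (noise, scale) onto a variant name from the Models list.

    Assumption: noise=-1 disables denoising; scale is 1 (denoise only) or 2.
    """
    if noise not in (-1, 0, 1, 2, 3):
        raise ValueError("noise must be -1, 0, 1, 2 or 3")
    if scale not in (1, 2):
        raise ValueError("scale must be 1 or 2")
    if noise == -1 and scale == 1:
        raise ValueError("noise=-1 with scale=1 leaves nothing to do")
    if noise == -1:
        return "scale2.0x"          # upscale only
    if scale == 1:
        return f"noise{noise}"      # denoise only
    return f"noise{noise}_scale2.0x"  # denoise and upscale
```

Not every family ships every variant: per the Models list, the upconv_7 families and upresnet10 have no denoise-only variants, and the older families have no combined ones.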

Raw Model Usage

This section is mostly for reference, as the recommended approach is to use the vsmlrt.py wrapper.

src = core.std.BlankClip(width=1920, height=1080, format=vs.RGBS)
flt = core.ov.Model(src, "upconv_7_anime_style_art_rgb_scale2.0x.onnx")

The anime_style_art, anime_style_art_rgb, photo, and ukbench models do not include built-in upscaling. Therefore, you need to upscale the input 2x using Catmull-Rom (bicubic(b=0, c=0.5)) before feeding it to these models:

src = core.std.BlankClip(width=1920, height=1080, format=vs.RGBS)
flt = core.ov.Model(src.fmtc.resample(scale=2, kernel="bicubic", a1=0, a2=0.5), "anime_style_art_rgb_scale2.0x.onnx")
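To make the Catmull-Rom setting concrete, the sketch below evaluates the standard two-parameter (Mitchell-Netravali) bicubic kernel with b=0 and c=0.5, the values passed as `a1` and `a2` above. `bicubic_weight` is an illustrative name. This is plain Python that shows the kernel shape only and is not how fmtc implements resampling.

```python
def bicubic_weight(x: float, b: float = 0.0, c: float = 0.5) -> float:
    """Two-parameter bicubic kernel; b=0, c=0.5 gives Catmull-Rom."""
    x = abs(x)
    if x < 1.0:
        return ((12 - 9 * b - 6 * c) * x**3
                + (-18 + 12 * b + 6 * c) * x**2
                + (6 - 2 * b)) / 6
    if x < 2.0:
        return ((-b - 6 * c) * x**3
                + (6 * b + 30 * c) * x**2
                + (-12 * b - 48 * c) * x
                + (8 * b + 24 * c)) / 6
    return 0.0  # the kernel has a support of two pixels on each side


# Weights of the four nearest source pixels for a sample located
# exactly halfway between two of them.
taps = [bicubic_weight(d) for d in (-1.5, -0.5, 0.5, 1.5)]
```

The kernel is 1 at distance 0 and 0 at distances 1 and 2, so it interpolates the source pixels exactly, and the four taps sum to 1.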

Notes

cunet networks work best when the tile size (block_w/block_h) is in the range 60-150 and a multiple of 4.
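The two conditions above are easy to check before building a script. A minimal sketch follows; the helper names are hypothetical, and the 60-150 bounds are simply the range quoted in the note.

```python
def is_valid_cunet_tile(size: int) -> bool:
    """Hard requirement from the Models list: a multiple of 4."""
    return size > 0 and size % 4 == 0


def recommended_cunet_tiles(lo: int = 60, hi: int = 150) -> list[int]:
    """Tile sizes that are multiples of 4 within the suggested range."""
    return [s for s in range(lo, hi + 1) if is_valid_cunet_tile(s)]


tiles = recommended_cunet_tiles()
```

With the default bounds, the candidates run from 60 to 148 in steps of 4.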

Benchmarking

Measurements: FPS / Device Memory (MB)

Device memory:

CPU: private memory including VapourSynth
GPU: device memory including context
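Each cell in the tables below follows the "FPS / Device Memory (MB)" format, optionally followed by a parenthesised note such as "(540p patch)", or is "N/A". For anyone post-processing the numbers, a small parser might look like the sketch below. `Measurement` and `parse_cell` are illustrative names, not part of any tool mentioned here.

```python
import re
from typing import NamedTuple, Optional


class Measurement(NamedTuple):
    fps: float
    memory_mb: int
    note: Optional[str]  # e.g. "540p patch", or None


_CELL = re.compile(r"^\s*([\d.]+)\s*/\s*(\d+)\s*(?:\((.+)\))?\s*$")


def parse_cell(cell: str) -> Optional[Measurement]:
    """Parse one benchmark cell; return None for 'N/A'."""
    if cell.strip() == "N/A":
        return None
    m = _CELL.match(cell)
    if m is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return Measurement(float(m.group(1)), int(m.group(2)), m.group(3))


# RTX 3090, upconv7, [1] trt: FP32 vs FP16 cells from the tables below.
fp32 = parse_cell("7.22 / 5694")
fp16 = parse_cell("13.4 / 4652")
speedup = fp16.fps / fp32.fps
```

Here `speedup` comes out at roughly 1.86, i.e. FP16 nearly doubles TensorRT throughput for that configuration.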

RTX 3090

Software: VapourSynth R57, Windows 10 LTSC 2021, Graphics Driver 511.23.

Input size: 1920x1080

Backends

[1] vs-mlrt v6
[2] vapoursynth-waifu2x-ncnn-vulkan r4
[3] vs-mlrt v8 (driver 511.79)

Performance

FP32

Model	[1] ort-cuda	[1] trt	[2] vulkan (540p patch)	[3] ort-cuda	[3] trt	[3] trt (no tf32)
upconv7	6.12 / 6592	7.22 / 5694	2.83 / 10578	7.24 / 6408	7.99 / 5761	7.86 / 5785
upresnet10	4.72 / 5820	N/A	N/A	5.79 / 5634	N/A	N/A
cunet	2.70 / 18624	N/A	0.71 / 15082	3.28 / 18435	N/A	N/A

FP16

Model	[1] ort-cuda	[1] trt	[1] trt (2 streams)	[2] vulkan	[3] ort-cuda	[3] trt	[3] trt (2 streams)
upconv7	7.64 / 6204	13.4 / 4652	25.4 / 7852	4.20 / 20750	10.6 / 5764	16.2 / 2385	30.1 / 4096
upresnet10	6.38 / 5818	N/A	N/A	N/A	8.15 / 5632	N/A	N/A
cunet	3.55 / 10172	N/A	N/A	0.91 / 7696 (540p patch)	4.53 / 9983	N/A	N/A

RTX 2080 Ti

Software: VapourSynth R57, Windows 10 LTSC 2021, Graphics Driver 511.23.

Input size: 1920x1080

Backends

[1] vs-mlrt v6
[2] VapourSynth-Waifu2x-caffe r14
[3] vapoursynth-waifu2x-ncnn-vulkan r4

Performance

FP32

Model	[1] ort-cuda	[1] trt	[2] caffe (540p patch)	[3] vulkan (540p patch)
upconv7	4.36 / 5922	4.73 / 5072	1.08 / 3159	1.40 / 10568
upresnet10	3.31 / 5150	N/A	1.03 / 7280	N/A
cunet	1.77 / 5170 (540p patch)	N/A	0.73 / 6957 (360p patch)	0.60 / 6992 (360p patch)

FP16

Model	[1] ort-cuda	[1] trt	[1] trt (2 streams)	[3] vulkan (540p patch)
upconv7	5.84 / 5278	11.9 / 3055	19.2 / 5263	2.60 / 5438
upresnet10	5.14 / 5148	N/A	N/A	N/A
cunet	1.64 / 9502	N/A	N/A	0.88 / 7686

Tesla V100

Software: VapourSynth R57, Windows Server 2019, Graphics Driver 511.23.

Input size: 1920x1080

Backends

[1] vs-mlrt v6
[2] VapourSynth-Waifu2x-caffe r14
[3] vapoursynth-waifu2x-ncnn-vulkan r4, Graphics Driver 471.68

Performance

FP32

Model	[1] ort-cuda	[1] trt	[1] trt (2 streams)	[2] caffe (540p patch)	[3] vulkan (540p patch)
upconv7	5.98 / 5065	6.60 / 5033	8.43 / 9253	1.63 / 3248	1.67 / 11197
upresnet10	4.36 / 5061	N/A	N/A	1.54 / 7232	N/A
cunet	2.58 / 9155	N/A	N/A	1.11 / 11657	0.53 / 15705

FP16

Model	[1] ort-cuda	[1] trt	[1] trt (2 streams)	[3] vulkan
upconv7	10.4 / 5189	13.8 / 3041	26.2 / 5253	3.97 / 21369
upresnet10	6.43 / 5059	N/A	N/A	N/A
cunet	4.10 / 9535	N/A	N/A	0.86 / 29848

Tesla A10

Software: VapourSynth R57, Windows Server 2019, Graphics Driver 511.23, GPU clocks locked at max frequency.

Input size: 1920x1080

Backends

[1] vs-mlrt v6
[2] vapoursynth-waifu2x-ncnn-vulkan r4, Graphics Driver 471.68

Performance

FP32

Model	[1] ort-cuda	[1] trt	[1] trt (2 streams)	[2] vulkan (540p patch)
upconv7	6.94 / 9765	7.83 / 5511	8.61 / 9731	1.63 / 10892
upresnet10	3.90 / 5665	N/A	N/A	N/A
cunet	2.20 / 18469	N/A	N/A	0.53 / 15397

FP16

Model	[1] ort-cuda	[1] trt	[1] trt (2 streams)	[2] vulkan
upconv7	9.66 / 6049	16.1 / 3501	19.9 / 5701	3.03 / 21075
upresnet10	6.53 / 5663	N/A	N/A	N/A
cunet	3.26 / 10017	N/A	N/A	0.78 / 8011 (540p patch)

Tesla A10G

Software: VapourSynth R58, Windows Server 2022, Graphics Driver 511.65, GPU clocks locked at max frequency.

Input size: 1920x1080

Backends

[1] vs-mlrt v8

Performance

FP32

Model	[1] trt
upconv7	7.20 / 5668

FP16

Model	[1] trt	[1] trt (2 streams)
upconv7	16.4 / 2255	22.2 / 3981

Tesla A100 (PCIe, 40 GB)

Software: VapourSynth R57, Windows Server 2019, Graphics Driver 511.23.

Input size: 1920x1080

Backends

[1] vs-mlrt v6

Performance

FP32

Model	[1] ort-cuda	[1] trt	[1] trt (2 streams)
upconv7	17.3 / 9827	20.0 / 5713	27.2 / 10051
upresnet10	N/A	N/A	N/A
cunet	N/A	N/A	N/A

FP16

Model	[1] ort-cuda	[1] trt	[1] trt (2 streams)
upconv7	18.3 / 6111	32.8 / 4539	57.3 / 7719
upresnet10	N/A	N/A	N/A
cunet	N/A	N/A	N/A

Tesla A100 (SXM4, 80 GB)

Software: VapourSynth R57-A4, Windows Server 2022, Graphics Driver 516.94.

Input size: 1920x1080

Backends

[1] vs-mlrt v9

Performance

FP16

Model	[1] trt	[1] trt (2 streams)
upconv7	30.4 / 2359	57.4 / 4037
cunet	19.4 / 4647	26.9 / 8558

Icelake Server

Hardware: Xeon Icelake Server 32C64T @2.90 GHz

Software: VapourSynth R57, Windows Server 2019.

Input size: 1920x1080

Backends

[1] vs-mlrt v6
[2] VapourSynth-Waifu2x-w2xc r8

Performance

FP32

Model	[1] ov-cpu	[2] w2xc
upconv7	1.22 / 18750	N/A
upresnet10	1.40 / 18278	N/A
cunet	0.65 / 22447	N/A
anime rgb	0.69 / 34619	0.26 / 7895

EPYC Milan

Hardware: EPYC Milan 32C64T @2.55 GHz

Software: VapourSynth R57, Windows Server 2019.

Input size: 1920x1080

Backends

[1] vs-mlrt v6
[2] VapourSynth-Waifu2x-w2xc r8

Performance

FP32

Model	[1] ov-cpu	[2] w2xc
upconv7	0.36 / 19583	N/A
upresnet10	0.35 / 18694	N/A
cunet	0.20 / 21644	N/A
anime rgb	0.20 / 34619	0.28 / 5398
