
ONNX improvements (-62% in full-precision model size, 2.7x faster load and execution, quantizations) #73

Open · wants to merge 3 commits into main
Conversation

xenova commented Dec 14, 2024

This PR improves the moonshine-onnx package in the following ways:

  • Significant reductions in model size / downloads for both tiny and base models, achieved by ensuring tied weights are not duplicated and by merging the decoders (with and without past key values) into a single model, without any loss in precision.

    • tiny: -61.8% (285MB → 109MB)
    • base: -57.6% (583MB → 247MB)
  • New quantizations (including 4-bit and 8-bit), further reducing the size of the models with minimal differences in output. Note that the q4 quantizations only target MatMul ops, which is why the q4 files are larger than the q8 ones (a rough quantization sketch follows at the end of this list).

    • tiny: 55.1MB at 4-bit quantization, 28MB at 8-bit quantization. Sample outputs:
      fp32: ['Ever tried ever failed, no matter try again, fail again, fail better.']
      q4: ['Ever tried, ever failed, no matter, try again, fail again, fail better.']
      q8: ['Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.']
      
    • base: 98MB at 4-bit quantization, 63MB at 8-bit quantization
      fp32: ['Ever tried ever failed, no matter try again fail again fail better.']
      q4: ['Ever tried ever failed, no matter try again, fail again, fail better.']
      q8 decoder, fp32 encoder: ['Ever tried ever failed, no matter try again fail again fail better.']
      
      (q8 encoder in last case produces poor results)
  • Improved loading and execution times, as benchmarked with the following code. Note that these benchmarks do not include downloading time, only loading times (i.e., the models were already downloaded).

    import moonshine_onnx as moonshine
    import time

    # Time 10 consecutive end-to-end transcriptions of the bundled sample clip.
    for i in range(10):
        start_time = time.time()
        output = moonshine.transcribe(moonshine.ASSETS_DIR / 'beckett.wav', 'moonshine/tiny')
        end_time = time.time()

        print(f"Execution time: {end_time - start_time} seconds")
    • tiny: (screenshot of benchmark output)

    • base: (screenshot of benchmark output)
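As a rough illustration of the 8-bit quantization step mentioned above: the models in this PR were produced with a custom Optimum build, so the following is only a sketch with hypothetical file names, using ONNX Runtime's stock dynamic quantization.

# Minimal sketch: 8-bit dynamic quantization of an exported ONNX file with
# ONNX Runtime's standard tooling. The q4 variants additionally restrict
# quantization to MatMul ops, as noted above. File names are hypothetical.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="decoder_model_merged.onnx",     # hypothetical fp32 merged decoder
    model_output="decoder_model_merged_q8.onnx",
    weight_type=QuantType.QInt8,                 # 8-bit integer weights
)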


xenova commented Dec 14, 2024

The differences become more apparent if we separate the loading and execution:

| Model | Old Load Time (s) | New Load Time (s) | Load Time Reduction (%) | Old Run Time (s) | New Run Time (s) | Run Time Reduction (%) |
|-------|-------------------|-------------------|-------------------------|------------------|------------------|------------------------|
| Tiny  | 1.594 | 1.090 | 31.6 | 0.346 | 0.252 | 27.1 |
| Base  | 2.442 | 1.333 | 45.4 | 0.605 | 0.465 | 23.1 |

Benchmarking code

import time
import moonshine_onnx as moonshine
from moonshine_onnx.model import MoonshineOnnxModel
from moonshine_onnx.transcribe import load_audio

audio = load_audio(moonshine.ASSETS_DIR / 'beckett.wav')

# Time model construction separately from inference.
load_start_time = time.time()
model = MoonshineOnnxModel(model_name='moonshine/base')
load_end_time = time.time()
print(f"Model load time: {load_end_time - load_start_time} seconds")

# Time 10 generation runs on the same pre-loaded audio.
for i in range(10):
    start_time = time.time()
    tokens = model.generate(audio)
    end_time = time.time()

    print(f"Run #{i+1}: {end_time - start_time} seconds")

Raw data

Tiny

Old

Model load time: 1.5940916538238525 seconds
Run #1: 0.3504812717437744 seconds
Run #2: 0.3556952476501465 seconds
Run #3: 0.46249866485595703 seconds
Run #4: 0.3608577251434326 seconds
Run #5: 0.29972147941589355 seconds
Run #6: 0.3081827163696289 seconds
Run #7: 0.33364224433898926 seconds
Run #8: 0.3344881534576416 seconds
Run #9: 0.3328516483306885 seconds
Run #10: 0.31997060775756836 seconds

New

Model load time: 1.0903031826019287 seconds
Run #1: 0.22372031211853027 seconds
Run #2: 0.2659788131713867 seconds
Run #3: 0.2293243408203125 seconds
Run #4: 0.2531099319458008 seconds
Run #5: 0.23910117149353027 seconds
Run #6: 0.2526216506958008 seconds
Run #7: 0.230133056640625 seconds
Run #8: 0.28861451148986816 seconds
Run #9: 0.23595857620239258 seconds
Run #10: 0.3022029399871826 seconds

Base

Old

Model load time: 2.4421446323394775 seconds
Run #1: 0.6086058616638184 seconds
Run #2: 0.5442285537719727 seconds
Run #3: 0.609248161315918 seconds
Run #4: 0.6299099922180176 seconds
Run #5: 0.6160895824432373 seconds
Run #6: 0.57456374168396 seconds
Run #7: 0.6728155612945557 seconds
Run #8: 0.5604102611541748 seconds
Run #9: 0.6053454875946045 seconds
Run #10: 0.625809907913208 seconds

New

Model load time: 1.333482027053833 seconds
Run #1: 0.43103766441345215 seconds
Run #2: 0.4757063388824463 seconds
Run #3: 0.4413950443267822 seconds
Run #4: 0.44200587272644043 seconds
Run #5: 0.4499983787536621 seconds
Run #6: 0.5306398868560791 seconds
Run #7: 0.47008252143859863 seconds
Run #8: 0.481827974319458 seconds
Run #9: 0.47646117210388184 seconds
Run #10: 0.45035600662231445 seconds


keveman commented Dec 15, 2024

Hi @xenova, this is so amazing! Thanks for the PR.
Any chance you can share the scripts used for generating the ONNX files?


xenova commented Dec 15, 2024

Absolutely! It's using a custom dev build of Optimum, which I'll publish soon. It's very similar to the whisper conversion config.

Will do later today 🔥


keveman commented Dec 16, 2024

@xenova OK to merge this, but I will be really grateful for the code to generate the ONNX files.


xenova commented Dec 16, 2024

Sure! Just a reminder that these are all on dev branches still, and will be ready for use when huggingface/transformers#34784 is merged.

Here are the steps to convert:

  1. Install the dev branch of Optimum:
pip install --upgrade git+https://github.com/huggingface/optimum.git@add-moonshine-onnx
  2. Install the moonshine dev branch of transformers:
pip install --upgrade git+https://github.com/eustlb/transformers.git@add-moonshine
  3. Convert the model to ONNX:
optimum-cli export onnx -m Xenova/moonshine-tiny-hf ./output/
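As a quick sanity check that the export worked, the resulting files can be loaded with plain onnxruntime. The file names below assume the usual Optimum export layout; adjust them to whatever the exporter actually writes to ./output/.

# Load each exported file and print its input names (file names may differ).
import onnxruntime as ort

for name in ("encoder_model.onnx", "decoder_model_merged.onnx"):
    session = ort.InferenceSession(f"./output/{name}", providers=["CPUExecutionProvider"])
    print(name, [inp.name for inp in session.get_inputs()])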

Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization, if you'd like. (I can join the org, move, then leave, or you can simply clone the model yourself).


keveman commented Dec 16, 2024

> Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization, if you'd like. (I can join the org, move, then leave, or you can simply clone the model yourself).

Sent you an invite to join the usefulsensors org on HF, please move it there.


xenova commented Dec 16, 2024

> Sent you an invite to join the usefulsensors org on HF, please move it there.

Requested to join 👍 (I didn't see an invite yet. Username is Xenova)

@petewarden

Thanks so much for this @xenova, this is extremely useful!

I'm actually working on quantization of these models too. So far I've found that running the default ONNX Runtime quantize_dynamic() process causes a big hit to accuracy, so I'm going to be digging a bit deeper when I get time. I'm using LibriSpeech English clean as my test set, and while I'm hoping to get the script properly added to the Moonshine repo soon, here's a gist of it in case it's useful for your work: https://gist.github.com/petewarden/09a17d2ded03d24e445c7e7681517ee9

You run it like:

py .\librispeech_wer.py --models_dir "C:\Users\pete\projects\models\xenova\tiny\quantized" --model_name "moonshine/tiny"

If you tell me which versions of the files you recommend I should be using (beyond the original float32 versions, which I've confirmed suffer no accuracy loss, as expected), I'll generate some accuracy numbers for those on my end. So far I've got 30.9% WER for the tiny _quantized variant; I'll keep working through the others.
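For anyone reproducing these numbers, WER here is the standard word error rate over reference/hypothesis transcript pairs. A minimal illustration of the metric using the jiwer package (the linked gist does the full LibriSpeech evaluation; this just shows the computation on one toy pair):

# Toy example of the WER metric: fraction of substituted/inserted/deleted
# words relative to the reference transcript.
import jiwer

reference = "ever tried ever failed no matter try again fail again fail better"
hypothesis = "ever tried ever failed no matter try again fail again failed better"
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")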

Thanks again for this work, I know it will be helpful to a lot of people.


xenova commented Dec 18, 2024

I'm particularly interested in the _q4 variants, as these are very fast on WebGPU, so doing some evals on those would be great! Also, using the _quantized (a.k.a. q8) variant for the encoder can cause some issues, so maybe some hybrid testing (i.e., fp32 for encoder, q8 or q4 for decoder)?
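A hybrid setup like that could look roughly like the following with plain onnxruntime sessions; the file names are hypothetical stand-ins for the exported variants, and the moonshine_onnx wrapper may expose this differently.

# Sketch: fp32 encoder paired with a quantized merged decoder. The generation
# loop would then run the encoder once and the decoder autoregressively.
import onnxruntime as ort

providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession("encoder_model.onnx", providers=providers)            # fp32
decoder = ort.InferenceSession("decoder_model_merged_q4.onnx", providers=providers)  # q4 MatMuls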

The fp16 models are currently broken (a weird subgraph issue I'm trying to figure out) and we're looking into fixing that 🫡

@petewarden

Great, thanks @xenova! The base _quantized WER I get is 16.64%; I'll try your suggestion of float encoders and quantized decoders.

Since you're targeting the web, and presumably file size is a big factor for you too, you might be interested in some experiments I've been doing with pure weight quantization, and reducing compressed file sizes for Moonshine: https://github.com/usefulsensors/onnx_shrink_ray


xenova commented Dec 18, 2024

I'll check out that repo! Regarding file size, remember to deduplicate the tied weights (duplicated weights significantly increase the file size).

For example, at fp32, the tiny model is 109.1MB (30.9MB + 78.2MB) and the fp32 base model is 246.8MB (166MB + 80.8MB).
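As a quick check for this, duplicated (tied) weights can be spotted by hashing initializer bytes with the onnx package. A rough sketch, with a hypothetical path, assuming the weights are stored in raw_data (the common case for exported models):

# Group initializers by a hash of their raw tensor bytes; tied weights that were
# exported twice show up as groups with more than one name.
import hashlib
from collections import defaultdict

import onnx

model = onnx.load("decoder_model_merged.onnx")  # hypothetical path
groups = defaultdict(list)
for init in model.graph.initializer:
    if not init.raw_data:  # skip tensors stored in typed fields or externally
        continue
    groups[hashlib.sha256(init.raw_data).hexdigest()].append(init.name)

for names in groups.values():
    if len(names) > 1:
        print("duplicated tensor data:", names)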

@petewarden

> remember to deduplicate the tied weights (duplicated weights significantly increase the file size).

Definitely, I'll be trying the weight-only quantization on your float merged decoder models; it should help a lot.

I see 4.55% WER for tiny using the float encoder and q8 decoder, so you're right that the accuracy issues seem to be on the encoder side. I'm trying a float encoder and q4 decoder now and will let you know what I find.

Hopefully, if I do some layer-by-layer comparisons between the float encoder and the quantized version, I can identify the problematic ops and exclude them from quantization, but I might not get to that for a few days.
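If specific encoder nodes do turn out to be the culprits, ONNX Runtime's dynamic quantization can skip them by name. A sketch along those lines, where the node names are placeholders rather than actual Moonshine graph names:

# Quantize the encoder to 8-bit while leaving accuracy-sensitive nodes in float.
# The excluded names below are placeholders for whatever the comparison finds.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder_model.onnx",
    model_output="encoder_model_q8_partial.onnx",
    weight_type=QuantType.QInt8,
    nodes_to_exclude=["/layers.0/self_attn/MatMul", "/layers.0/final_layer_norm/Mul"],
)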

@petewarden

Tiny float encoder and q4 decoder gives a 4.84% WER, so the accuracy holds up well.

I did try my quantization approach to shrink the merged files, but ran into a bug in my code, so they actually came out larger! I'll get back to that when I get a chance, but for now I'll prioritize figuring out why the encoder doesn't work well with activations quantized.
