
ONNX improvements (-62% in full-precision model size, 2.7x faster load and execution, quantizations) #73

Open · wants to merge 3 commits into main
Conversation

xenova commented Dec 14, 2024

This PR improves the moonshine-onnx package in the following ways:

  • Significant reductions in model size / downloads for both tiny and base models, achieved by ensuring tied weights are not duplicated and by merging the decoders (with and without past key values) into a single model, without any loss in precision.

    • tiny: -61.8% (285MB → 109MB)
    • base: -57.6% (583MB → 247MB)
  • New quantizations (including 4-bit and 8-bit), further reducing the size of the models with minimal differences in output. Note that the q4 quantizations only target MatMul ops, which is why the q4 files are larger than the q8 ones (a rough quantization sketch follows at the end of this list).

    • tiny: 55.1MB at 4-bit quantization, 28MB at 8-bit quantization. Sample outputs:
      fp32: ['Ever tried ever failed, no matter try again, fail again, fail better.']
      q4: ['Ever tried, ever failed, no matter, try again, fail again, fail better.']
      q8: ['Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.']
      
    • base: 98MB at 4-bit quantization, 63MB at 8-bit quantization
      fp32: ['Ever tried ever failed, no matter try again fail again fail better.']
      q4: ['Ever tried ever failed, no matter try again, fail again, fail better.']
      q8 decoder, fp32 encoder: ['Ever tried ever failed, no matter try again fail again fail better.']
      
      (q8 encoder in last case produces poor results)
  • Improved loading and execution times, as benchmarked with the following code. Note that these benchmarks do not include downloading time, only loading times (i.e., the models were already downloaded).

    import moonshine_onnx as moonshine
    import time

    # Time 10 consecutive end-to-end transcriptions of the bundled sample clip.
    for i in range(10):
        start_time = time.time()
        output = moonshine.transcribe(moonshine.ASSETS_DIR / 'beckett.wav', 'moonshine/tiny')
        end_time = time.time()

        print(f"Execution time: {end_time - start_time} seconds")
    • tiny: (screenshot of benchmark output)

    • base: (screenshot of benchmark output)
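As a rough illustration of the 8-bit quantization step mentioned above: the models in this PR were produced with a custom Optimum build, so the following is only a sketch with hypothetical file names, using ONNX Runtime's stock dynamic quantization.

# Minimal sketch: 8-bit dynamic quantization of an exported ONNX file with
# ONNX Runtime's standard tooling. The q4 variants additionally restrict
# quantization to MatMul ops, as noted above. File names are hypothetical.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="decoder_model_merged.onnx",     # hypothetical fp32 merged decoder
    model_output="decoder_model_merged_q8.onnx",
    weight_type=QuantType.QInt8,                 # 8-bit integer weights
)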


xenova commented Dec 14, 2024

The differences become more apparent if we separate the loading and execution:

| Model | Old Load Time (s) | New Load Time (s) | Load Time Reduction (%) | Old Run Time (s) | New Run Time (s) | Run Time Reduction (%) |
|-------|-------------------|-------------------|-------------------------|------------------|------------------|------------------------|
| Tiny  | 1.594 | 1.090 | 31.6 | 0.346 | 0.252 | 27.1 |
| Base  | 2.442 | 1.333 | 45.4 | 0.605 | 0.465 | 23.1 |

Benchmarking code

import time
import moonshine_onnx as moonshine
from moonshine_onnx.model import MoonshineOnnxModel
from moonshine_onnx.transcribe import load_audio

audio = load_audio(moonshine.ASSETS_DIR / 'beckett.wav')

# Time model construction separately from inference.
load_start_time = time.time()
model = MoonshineOnnxModel(model_name='moonshine/base')
load_end_time = time.time()
print(f"Model load time: {load_end_time - load_start_time} seconds")

# Time 10 generation runs on the same pre-loaded audio.
for i in range(10):
    start_time = time.time()
    tokens = model.generate(audio)
    end_time = time.time()

    print(f"Run #{i+1}: {end_time - start_time} seconds")

Raw data

Tiny

Old

Model load time: 1.5940916538238525 seconds
Run #1: 0.3504812717437744 seconds
Run #2: 0.3556952476501465 seconds
Run #3: 0.46249866485595703 seconds
Run #4: 0.3608577251434326 seconds
Run #5: 0.29972147941589355 seconds
Run #6: 0.3081827163696289 seconds
Run #7: 0.33364224433898926 seconds
Run #8: 0.3344881534576416 seconds
Run #9: 0.3328516483306885 seconds
Run #10: 0.31997060775756836 seconds

New

Model load time: 1.0903031826019287 seconds
Run #1: 0.22372031211853027 seconds
Run #2: 0.2659788131713867 seconds
Run #3: 0.2293243408203125 seconds
Run #4: 0.2531099319458008 seconds
Run #5: 0.23910117149353027 seconds
Run #6: 0.2526216506958008 seconds
Run #7: 0.230133056640625 seconds
Run #8: 0.28861451148986816 seconds
Run #9: 0.23595857620239258 seconds
Run #10: 0.3022029399871826 seconds

Base

Old

Model load time: 2.4421446323394775 seconds
Run #1: 0.6086058616638184 seconds
Run #2: 0.5442285537719727 seconds
Run #3: 0.609248161315918 seconds
Run #4: 0.6299099922180176 seconds
Run #5: 0.6160895824432373 seconds
Run #6: 0.57456374168396 seconds
Run #7: 0.6728155612945557 seconds
Run #8: 0.5604102611541748 seconds
Run #9: 0.6053454875946045 seconds
Run #10: 0.625809907913208 seconds

New

Model load time: 1.333482027053833 seconds
Run #1: 0.43103766441345215 seconds
Run #2: 0.4757063388824463 seconds
Run #3: 0.4413950443267822 seconds
Run #4: 0.44200587272644043 seconds
Run #5: 0.4499983787536621 seconds
Run #6: 0.5306398868560791 seconds
Run #7: 0.47008252143859863 seconds
Run #8: 0.481827974319458 seconds
Run #9: 0.47646117210388184 seconds
Run #10: 0.45035600662231445 seconds


keveman commented Dec 15, 2024

Hi @xenova, this is so amazing! Thanks for the PR.
Any chance you can share the scripts used for generating the ONNX files?


xenova commented Dec 15, 2024

Absolutely! It's using a custom dev build of Optimum, which I'll publish soon. It's very similar to the whisper conversion config.

Will do later today 🔥


keveman commented Dec 16, 2024

@xenova OK to merge this, but I will be really grateful for the code to generate the ONNX files.


xenova commented Dec 16, 2024

Sure! Just a reminder that these are all on dev branches still, and will be ready for use when huggingface/transformers#34784 is merged.

Here are the steps to convert:

  1. Install the dev branch of Optimum:
pip install --upgrade git+https://github.com/huggingface/optimum.git@add-moonshine-onnx
  2. Install the moonshine dev branch of transformers:
pip install --upgrade git+https://github.com/eustlb/transformers.git@add-moonshine
  3. Convert the model to ONNX:
optimum-cli export onnx -m Xenova/moonshine-tiny-hf ./output/
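As a quick sanity check that the export worked, the resulting files can be loaded with plain onnxruntime. The file names below assume the usual Optimum export layout; adjust them to whatever the exporter actually writes to ./output/.

# Load each exported file and print its input names (file names may differ).
import onnxruntime as ort

for name in ("encoder_model.onnx", "decoder_model_merged.onnx"):
    session = ort.InferenceSession(f"./output/{name}", providers=["CPUExecutionProvider"])
    print(name, [inp.name for inp in session.get_inputs()])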

Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization, if you'd like. (I can join the org, move, then leave, or you can simply clone the model yourself).


keveman commented Dec 16, 2024

> Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization, if you'd like. (I can join the org, move, then leave, or you can simply clone the model yourself).

Sent you an invite to join the usefulsensors org on HF, please move it there.


xenova commented Dec 16, 2024

> Sent you an invite to join the usefulsensors org on HF, please move it there.

Requested to join 👍 (I didn't see an invite yet. Username is Xenova)

@petewarden

Thanks so much for this @xenova, this is extremely useful!

I'm actually working on quantization of these models too. So far I've found that running the default ONNX Runtime quantize_dynamic() process causes a big hit to accuracy, so I'm going to be digging a bit deeper when I get time. I'm using LibriSpeech English clean as my test set, and while I'm hoping to get the script properly added to the Moonshine repo soon, here's a gist of it in case it's useful for your work: https://gist.github.com/petewarden/09a17d2ded03d24e445c7e7681517ee9

You run it like:

py .\librispeech_wer.py --models_dir "C:\Users\pete\projects\models\xenova\tiny\quantized" --model_name "moonshine/tiny"

If you tell me which versions of the files you recommend I should be using (beyond the original float32 versions, which I've confirmed suffer no accuracy loss, as expected), I'll generate some accuracy numbers for those on my end. So far I've got 30.9% WER for the tiny _quantized variant; I'll keep working through the others.
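For anyone reproducing these numbers, WER here is the standard word error rate over reference/hypothesis transcript pairs. A minimal illustration of the metric using the jiwer package (the linked gist does the full LibriSpeech evaluation; this just shows the computation on one toy pair):

# Toy example of the WER metric: fraction of substituted/inserted/deleted
# words relative to the reference transcript.
import jiwer

reference = "ever tried ever failed no matter try again fail again fail better"
hypothesis = "ever tried ever failed no matter try again fail again failed better"
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")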

Thanks again for this work, I know it will be helpful to a lot of people.


xenova commented Dec 18, 2024

I'm particularly interested in the _q4 variants, as these are very fast on WebGPU, so doing some evals on those would be great! Also, using the _quantized (a.k.a. q8) variant for the encoder can cause some issues, so maybe some hybrid testing (i.e., fp32 for encoder, q8 or q4 for decoder)?
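A hybrid setup like that could look roughly like the following with plain onnxruntime sessions; the file names are hypothetical stand-ins for the exported variants, and the moonshine_onnx wrapper may expose this differently.

# Sketch: fp32 encoder paired with a quantized merged decoder. The generation
# loop would then run the encoder once and the decoder autoregressively.
import onnxruntime as ort

providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession("encoder_model.onnx", providers=providers)            # fp32
decoder = ort.InferenceSession("decoder_model_merged_q4.onnx", providers=providers)  # q4 MatMuls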

The fp16 models are currently broken (a weird subgraph issue I'm trying to figure out) and we're looking into fixing that 🫡

@petewarden

Great, thanks @xenova! The base _quantized WER I get is 16.64%; I'll try your suggestion of float encoders and quantized decoders.

Since you're targeting the web, and presumably file size is a big factor for you too, you might be interested in some experiments I've been doing with pure weight quantization, and reducing compressed file sizes for Moonshine: https://github.com/usefulsensors/onnx_shrink_ray


xenova commented Dec 18, 2024

I'll check out that repo! Regarding file size, remember to deduplicate the tied weights (duplicated weights significantly increase the file size).

For example, at fp32, the tiny model is 109.1MB (30.9MB + 78.2MB) and the fp32 base model is 246.8MB (166MB + 80.8MB).
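As a quick check for this, duplicated (tied) weights can be spotted by hashing initializer bytes with the onnx package. A rough sketch, with a hypothetical path, assuming the weights are stored in raw_data (the common case for exported models):

# Group initializers by a hash of their raw tensor bytes; tied weights that were
# exported twice show up as groups with more than one name.
import hashlib
from collections import defaultdict

import onnx

model = onnx.load("decoder_model_merged.onnx")  # hypothetical path
groups = defaultdict(list)
for init in model.graph.initializer:
    if not init.raw_data:  # skip tensors stored in typed fields or externally
        continue
    groups[hashlib.sha256(init.raw_data).hexdigest()].append(init.name)

for names in groups.values():
    if len(names) > 1:
        print("duplicated tensor data:", names)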

@petewarden

> remember to deduplicate the tied weights (duplicated weights significantly increase the file size).

Definitely, I'll be trying the weight-only quantization on your float merged decoder models; it should help a lot.

I see 4.55% WER for tiny using the float encoder and q8 decoder, so you're right that the accuracy issues seem to be on the encoder side. I'm trying a float encoder and q4 decoder now and will let you know what I find.

Hopefully, if I do some layer-by-layer comparisons between the float encoder and the quantized version, I can identify the problematic ops and exclude them from quantization, but I might not get to that for a few days.
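If specific encoder nodes do turn out to be the culprits, ONNX Runtime's dynamic quantization can skip them by name. A sketch along those lines, where the node names are placeholders rather than actual Moonshine graph names:

# Quantize the encoder to 8-bit while leaving accuracy-sensitive nodes in float.
# The excluded names below are placeholders for whatever the comparison finds.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder_model.onnx",
    model_output="encoder_model_q8_partial.onnx",
    weight_type=QuantType.QInt8,
    nodes_to_exclude=["/layers.0/self_attn/MatMul", "/layers.0/final_layer_norm/Mul"],
)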

@petewarden

Tiny float encoder and q4 decoder gives a 4.84% WER, so the accuracy holds up well.

I did try my quantization approach to shrink the merged files, but ran into a bug in my code, so they actually came out larger! I'll get back to that when I get a chance, but for now I'll prioritize figuring out why the encoder doesn't work well with activations quantized.
