ONNX improvements (-62% in full-precision model size, 2.7x faster load and execution, quantizations) #73
base: main
Conversation
The differences become more apparent if we separate the loading and execution:
Benchmarking code:

```python
import time

import moonshine_onnx as moonshine
from moonshine_onnx.model import MoonshineOnnxModel
from moonshine_onnx.transcribe import load_audio

audio = load_audio(moonshine.ASSETS_DIR / 'beckett.wav')

load_start_time = time.time()
model = MoonshineOnnxModel(model_name='moonshine/base')
load_end_time = time.time()
print(f"Model load time: {load_end_time - load_start_time} seconds")

for i in range(10):
    start_time = time.time()
    tokens = model.generate(audio)
    end_time = time.time()
    print(f"Run #{i+1}: {end_time - start_time} seconds")
```

Raw data: tiny (old vs. new) and base (old vs. new) timing tables omitted.
Hi @xenova, this is so amazing! Thanks for the PR.
Absolutely! It's using a custom dev build of Optimum, which I'll publish soon. It's very similar to the Whisper conversion config. Will do later today 🔥
@xenova OK to merge this, but I'd be really grateful for the code to generate the ONNX files.
Sure! Just a reminder that these are all still on dev branches, and will be ready for use when huggingface/transformers#34784 is merged. Here are the steps to convert:
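Roughly, the export goes through Optimum's ONNX exporter; a minimal sketch (the model id and task name are assumptions, and it only works once the dev branches above are installed):

```python
# Sketch only: assumes the transformers/optimum dev branches with Moonshine support.
from optimum.exporters.onnx import main_export

main_export(
    "UsefulSensors/moonshine-tiny",       # assumed Hugging Face model id
    output="moonshine-tiny-onnx",         # directory for the exported .onnx files
    task="automatic-speech-recognition",  # assumed task name
)
```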
Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization, if you'd like. (I can join the org, move them, then leave, or you can simply clone the models yourself.)
Sent you an invite to join the usefulsensors org on HF; please move it there.
Requested to join 👍 (I didn't see an invite yet. Username is Xenova.)
Thanks so much for this @xenova, this is extremely useful! I'm actually working on quantization of these models too. So far I've found that running the default ONNX Runtime quantization hurts accuracy, so I've been measuring accuracy with a LibriSpeech WER script. You run it like:

```
py .\librispeech_wer.py --models_dir "C:\Users\pete\projects\models\xenova\tiny\quantized" --model_name "moonshine/tiny"
```

If you tell me which versions of the files you recommend I should be using (beyond the original float32 versions, which I've confirmed suffer no accuracy loss, as expected), I'll generate some accuracy numbers for those on my end. So far I've got 30.9% WER for the tiny _quantized model. Thanks again for this work, I know it will be helpful to a lot of people.
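For reference, by "default ONNX Runtime" quantization I mean dynamic int8 quantization along the lines of this sketch (paths are placeholders):

```python
# Minimal sketch of ONNX Runtime's dynamic int8 quantization (paths are placeholders).
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "encoder_model.onnx",            # input float32 model
    "encoder_model_quantized.onnx",  # output model with int8 weights
    weight_type=QuantType.QInt8,
)
```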
I'm particularly interested in the accuracy numbers with a float encoder and a quantized decoder; I suspect the accuracy issues come from quantizing the encoder. The fp16 models are currently broken (a weird subgraph issue I'm trying to figure out) and we're looking into fixing that 🫡
Great, thanks @xenova! The base _quantized WER I get is 16.64%; I'll try your suggestion of float encoders and quantized decoders. Since you're targeting the web, and presumably file size is a big factor for you too, you might be interested in some experiments I've been doing with pure weight quantization, and reducing compressed file sizes for Moonshine: https://github.com/usefulsensors/onnx_shrink_ray
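The basic idea, sketched very roughly below (this is not onnx_shrink_ray's actual code, just the concept): round-trip each large float32 initializer through int8, so the stored weights take far fewer distinct values and compress much better, while the graph still runs in float32.

```python
# Conceptual sketch of weight-only "shrink for compression". Paths are placeholders;
# this is not the onnx_shrink_ray implementation.
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("encoder_model.onnx")
for init in model.graph.initializer:
    w = numpy_helper.to_array(init)
    if w.dtype == np.float32 and w.size > 1024:
        scale = np.abs(w).max() / 127.0
        if scale == 0.0:
            continue
        q = np.round(w / scale).astype(np.int8)
        # Dequantized copy with only 256 distinct levels, so it compresses well.
        w_rt = (q.astype(np.float32) * scale).astype(np.float32)
        init.CopyFrom(numpy_helper.from_array(w_rt, init.name))
onnx.save(model, "encoder_model_shrunk.onnx")
```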
Definitely, I'll be trying the weight-only quantization on your float merged decoder models; it should help a lot. I see 4.55% WER for tiny using the float encoder and q8 decoder, so you're right that the accuracy issues seem to be on the encoder side. I'm trying a float encoder and q4 decoder now and will let you know what I find. Hopefully, if I do some layer-by-layer comparisons between the float encoder and the quantized version, I can identify the problematic ops and exclude them from quantization, but I might not get to that for a few days.
The tiny float encoder and q4 decoder give a 4.84% WER, so the accuracy holds up well. I did try my quantization approach to shrink the merged files, but ran into a bug in my code, so they actually came out larger! I'll get back to that when I get a chance, but for now I'll prioritize figuring out why the encoder doesn't work well with activations quantized.
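For when I get to the layer-by-layer comparison, the plan is roughly this kind of selective quantization (the node name below is a placeholder, not a real Moonshine node name):

```python
# Sketch: once problematic nodes are identified, exclude them from quantization.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "encoder_model.onnx",
    "encoder_model_q8_selective.onnx",
    weight_type=QuantType.QInt8,
    nodes_to_exclude=["/encoder/layers.0/attention/MatMul"],  # placeholder node name
)
```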
This PR improves the moonshine-onnx package in the following ways:
- Significant reductions in model size / downloads for both the tiny and base models, by ensuring tied weights are not duplicated and by merging the decoders (with and without past key values) into a single model, without any loss in precision.
- New quantizations (including 4-bit and 8-bit), further reducing the size of the models with minimal differences in output. Note that the q4 quantizations only target MatMul ops, which is why their size is larger than the q8 quantization. (A rough loading sketch is included at the end of this description.)
- Improved loading and execution times, as benchmarked with the code shown earlier in this thread. Note that these benchmarks do not include download time, only load time (i.e., the models were already downloaded).
tiny: (old vs. new timing results table omitted)
base: (old vs. new timing results table omitted)
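As a rough illustration of how the variants combine (file names are assumptions based on the quantization suffixes discussed in the thread), a full-precision encoder can be paired with a quantized merged decoder:

```python
# Minimal sketch (file names are assumptions): pair a float32 encoder with a
# quantized merged decoder, as discussed in the conversation above.
import onnxruntime as ort

encoder = ort.InferenceSession("encoder_model.onnx")            # float32 encoder
decoder = ort.InferenceSession("decoder_model_merged_q4.onnx")  # 4-bit merged decoder
```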