performance expectations #4

chadkirby · 2024-04-18T18:44:41Z

First, thanks for putting this project together!

I modified examples/basic/index.html to use a more capable model: https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/resolve/main/gemma-2b-it-q4_k_m.gguf, which is 1.5 GB.

Using LM Studio on my laptop (with GPU Acceleration disabled), I get roughly 25 tokens per second from gemma-2b-it-q4_k_m.gguf.

Running examples/basic/index.html in Chrome 124 on my laptop, I get roughly 6-7 tokens per second from gemma-2b-it-q4_k_m.gguf. (Similar performance in Edge 123.)

Generally, the wasm bindings seem roughly 3-4x slower than native. Is that more or less expected? Are there any wllama knobs I can twiddle to improve performance?


ngxson · 2024-04-19T04:27:39Z

It is expected, since WebAssembly SIMD only supports the equivalent of AVX instructions, not AVX2. This should be the biggest impact on performance at the moment.

Another issue is that we're using emscripten's non-native exception handler, which maintains support for older browsers but comes with a small performance cost. We may move to the native exception handler in the future.

Edit: It seems that most mainstream browser versions already support native wasm exceptions (see here), so it's safe to enable them. Support will be added in the next build of wllama.

ngxson · 2024-04-21T14:03:06Z

v1.6.0 now uses the native exception handler via -fwasm-exceptions. Here is the matrix for browser support: https://webassembly.org/features/

iSuslov · 2024-05-12T12:53:43Z

Hey @chadkirby, out of curiosity, have you tried the latest version with the native exception handler?

chadkirby · 2024-05-12T15:22:02Z

> Hey @chadkirby, out of curiosity, have you tried on latest version with native exception handler?

I did. IIRC, I saw a modest performance improvement, but wasm speed was still roughly 3x slower than native.

felladrin · 2024-05-17T21:10:27Z

One important consideration is that certain browsers, such as Brave, may alter the value of navigator.hardwareConcurrency to prevent fingerprinting.

Reference: Fingerprinting 2.0: hardwareConcurrency brave/brave-browser#10808

As a result, it is possible that the browser was utilizing only 2 threads, leading to slow inference.

Using 8 threads has resulted in satisfactory performance for the Phi-3 model:

minisearch-phi-3-wllama.mp4
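To make the point about `navigator.hardwareConcurrency` concrete, here is a minimal TypeScript sketch of choosing a thread count without trusting the browser-reported value blindly. `pickThreadCount` and its `maxThreads` cap are hypothetical names invented for this illustration; they are not part of wllama. How the result is handed to wllama (for example through a thread-count option in the model load config) is an assumption to check against the wllama documentation.

```typescript
// Hypothetical helper (not part of wllama): decide how many threads to request.
// `reported` is whatever the browser claims via navigator.hardwareConcurrency,
// which privacy-focused browsers such as Brave may deliberately lower.
// `override` lets the user or app force a value when the reported one looks wrong.
function pickThreadCount(
  reported: number | undefined,
  override?: number,
  maxThreads: number = 8, // illustrative cap, chosen only for this sketch
): number {
  // An explicit, valid override always wins over the reported value.
  if (override !== undefined && Number.isInteger(override) && override > 0) {
    return override;
  }
  // Fall back to a single thread when the value is missing or nonsensical.
  const detected =
    reported !== undefined && Number.isFinite(reported) && reported >= 1
      ? Math.floor(reported)
      : 1;
  return Math.min(detected, maxThreads);
}

// Read the reported value defensively so the snippet also runs outside a browser.
const reportedCores = (
  globalThis as { navigator?: { hardwareConcurrency?: number } }
).navigator?.hardwareConcurrency;

// Force 8 threads, as in the comment above, regardless of what was reported.
const threads = pickThreadCount(reportedCores, 8);
console.log(`reported=${reportedCores}, requesting ${threads} threads`);
```

Logging both the reported and the requested values makes it easy to spot a browser that under-reports cores. Note that WebAssembly threads generally require the page to be cross-origin isolated, so a larger thread count only helps when the multi-threaded build is actually in use.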


ngxson pinned this issue May 21, 2024
