WebGPU Support #11

varunneal · 2023-07-08T01:10:41Z

GPU Acceleration of transformers is possible, but it is hacky.

Requires an unmerged-pr version of transformers.js that relies on a patched version of onnxruntime-node.

Xenova plans on merging this PR only after onnxruntime has official support for GPU Acceleration. In the meantime, this change could be implemented, potentially as an advanced "experimental" feature.

do-me · 2023-07-09T14:26:18Z

This is really cool and paves the way for LLMs running in the browser!

I had this idea in my head for a while now: we already have a kind of (primitive) vector DB (just a JSON) and the small model for embeddings. If we added a LLM for Q&A/ text generation based solely on the infos in the text this would be huge!

I already talked to the folks from Qdrant on their discord server if they'd be interested in providing a JS/webassembly version of their Rust-based vector DB (as they developed plenty of optimizations) but for the moment they have other priorities. Still, they said they might go for it at some point.

Anyway, I think this would make for an interesting POC to explore this. About the idea to integrate it directly, until it's officially supported, we could maybe detect the web-GPU support automatically and simply load the right version? Or does the web-GPU version also support CPU?

P.S. There would be so much fun in it for NLP with LLMs if for example we'd created an image of all leitmotifs in the text or some kind of text summary image or similar for a visual understanding of text...

lizozom · 2023-07-12T10:44:53Z

I am working on a similar effort myself, lets cooperate!

More specifically, I wanted to use this project as a basis for an SDK that allows one to run semantic search on their own website's content.

do-me · 2023-07-12T12:40:41Z

Sounds great!
It's also on the feature/idea list of the readme.md that this repo could become a browser plugin for FF or Chrome. Of course, it would need a leaner GUI.

I was thinking of some kind of bar integrated on top of a webpage like Algolia / lunr etc. do. Good example: on mkdocs material homepage:

(By the way, I also had ideas for integrating semantic search in mkdocs, but I'm lacking the time atm...)

What about your idea?

(We're kind of drifting away from this issue's topic, let's move to discussions: #15)

do-me · 2024-05-15T11:10:17Z

We're finally getting closer to WebGPU support: huggingface/transformers.js#545
It's already usable in the dev branch. I'm really excited about this as people are reporting accelerations of factor 20x-100x!

In my case (M3 Max) I'm getting a massive inferencing speedup of 32x-46x. See for yourself: https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark

Even with cheap integrated graphics (Intel "integrated" GPUs like UHD or Iris Xe) I get a 4x-8x boost. So literally everyone would see massive speed gains!

This is the most notably performance improvement I see atm, hence referencing #49.

I hope that transformers.js will allow for some kind of automatic setting where WebGPU is used if available but else falls back to plain CPU.

varunneal · 2024-05-19T19:35:37Z

Speedup is about 10x for me on an M1. Definitely huge. Not sure how embeddings will compare to inference in terms of GPU optimization but I think there is huge room for parallelization.

do-me · 2024-08-28T11:14:41Z

Transformers.js and WebGPU

Folks, it's finally here 🥹
https://huggingface.co/posts/Xenova/681836693682285

However, afaik there is no docs for v3 yet. I tried updating SemanticFinder with v3 and running some quick tests, but failed.

npm uninstall @xenova/transformers then npm install @huggingface/transformers
Replace import statements in semantic.js and worker.js to import { stuff } from '@huggingface/transformers';
Set a WebGPU compatible model (not sure whether all are compatible by default?) like: <option selected value="Xenova/all-MiniLM-L12-v2">Xenova/all-MiniLM-L12-v2 | 💾133MB | 66.7MB | 34MB 📥2 ❤️3</option> in index.html
Change the extractor pipeline and use e.g. like this:

Unfortunately still throws some errors, but I'd say it's better to wait for the official v3 docs. Also it's in alpha at the moment, so errors pretty much expected.

varunneal · 2024-08-28T15:48:48Z

exciting news!

gdmcdonald · 2024-08-28T23:55:32Z

@do-me I think also have to change the quantized:true flag to dtype:"f32" for unquantized or dtype:"f16" or "q8" ...etc for quantized.

await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
    device: 'webgpu',
    dtype: 'fp32', // or 'fp16'
});

examples
huggingface/transformers.js#784
also
huggingface/transformers.js#894

do-me · 2024-08-29T07:38:56Z

@gdmcdonald thanks for the references. Note that in these examples, they used the "old" packages from @Xenova/transformers, here I tried with the new @huggingface/transformers

On my screenshot above you can see that dtype was set automatically, so apparently that's not the problem.

Rather the problems seems to stem from worker.js:94 An error occurred during model execution: "Error: Session already started"., an error I don't understand as in the code there is only one session created.

We tightly built the core embedding logic around the old version of transformers.js with callbacks etc. so I guess there is some compatibility problem with the new logic or simply a bug in @huggingface/transformers.

When I manage to find some time, I will try with the v3 branch in @Xenova/transformers again. If someone else wants to try and create a PR helping hands are always welcome :)

gdmcdonald · 2024-08-29T09:54:45Z

Ah ok. I was using @huggingface/transformers v3 as well and I ran into the same issue you did
worker.js:94 An error occurred during model execution: "Error: Session already started". I just assumed I had too many webgpu tabs open. Apologies for the spam!

do-me · 2024-08-29T14:33:44Z

Found a bug with webgpu (wasm works fine): huggingface/transformers.js#909

The problem is calling the extractor two consecutive times. The first time works (for the query embedding) but the second time fails (for chunk embeddings).

do-me · 2024-09-04T16:26:55Z

Folks, it's here! 🥳
I added webgpu support in the new branch and it's fast!

There was a simple problem in the old code where I would call Promise.all() for parallel execution which was nonsense, more detail about this here: huggingface/transformers.js#909

I needed to modify this code in f148689. Main changes were in index.js.

It's really fast! On my system it indexes the whole bible in like 3mins with a small model like Xenova/all-MiniLM-L6-v2 when before with wasm it would take like 30-40 mins.

Not all models are supported, so we should go down that rabbit hole and see whether we can somehow filter the models in index.html for the webgpu branch.
Also, the newer @huggingface/transformers version starting with v.0.10 have some kind of bug so I needed to hardcode version 0.9.

I was trying to set up a Github action for the new webgpu branch so it would build the webgpu version and push it to gh-pages in a /webgpu dir but somehow there were errors I couldn't follow up on so far. It somehow overwrote the files in the main directory and did not create the /webgpu dir. You can see my old trials in the history. If someone wants to give a hand it would be highly appreciated :)

Anyway, I'm really excited about this change!

varunneal · 2024-09-04T17:59:48Z

Fantastic news! Just played around and it's working well on my M1. Will followup to see if I can help with errors.

do-me · 2024-09-07T15:33:12Z

Finally managed to come up with the correct GitHub Action.

You can now find the WebGPU app here: https://do-me.github.io/SemanticFinder/webgpu/
Normal page uses wasm: https://do-me.github.io/SemanticFinder/

do-me · 2024-09-08T16:55:58Z

According to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/ you usually get even better speed-ups when processing in batches. At the moment, the naive logic in SemanticFinder just processes one single chunk at a time which might cause a major bottleneck. Will look into this.

varunneal · 2024-09-09T18:28:27Z

@do-me can you tell me how to update the github action as well for my fork of semantic-finder? ty

do-me · 2024-09-09T20:32:49Z

It's easy, you simply didn't clone/check out the webgpu branch yet (if I see correctly). If you add the branch to your repo it will work.

do-me · 2024-09-11T15:23:24Z

According to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/ you usually get even better speed-ups when processing in batches. At the moment, the naive logic in SemanticFinder just processes one single chunk at a time which might cause a major bottleneck. Will look into this.

Batch size changes everything. It gives me insane speed-ups of more than factor 20x

I created a small app based on one of the first versions of SemanticFinder for testing the batch size. In my tests, a chunk size of around 180 chunks per extractor() (inference) call gives me best results.

Play with it here: https://geo.rocks/semanticfinder-webgpu/.
See also: https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/discussions/103

The current logic in SemanticFinder is more complex than this minimal app, so it takes more time to update everything. Could use a hand here as I probably won't find time until next week.

varunneal · 2024-09-11T15:56:15Z

Will look into adding it if I get a chance this week.

do-me changed the title ~~(Eventual) GPU Acceleration~~ WebGPU Support Sep 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WebGPU Support #11

WebGPU Support #11

varunneal commented Jul 8, 2023

do-me commented Jul 9, 2023

lizozom commented Jul 12, 2023 •

edited

Loading

do-me commented Jul 12, 2023 •

edited

Loading

do-me commented May 15, 2024

varunneal commented May 19, 2024

do-me commented Aug 28, 2024

varunneal commented Aug 28, 2024

gdmcdonald commented Aug 28, 2024 •

edited

Loading

do-me commented Aug 29, 2024

gdmcdonald commented Aug 29, 2024

do-me commented Aug 29, 2024

do-me commented Sep 4, 2024 •

edited

Loading

varunneal commented Sep 4, 2024

do-me commented Sep 7, 2024

do-me commented Sep 8, 2024

varunneal commented Sep 9, 2024

do-me commented Sep 9, 2024

do-me commented Sep 11, 2024 •

edited

Loading

varunneal commented Sep 11, 2024

WebGPU Support #11

WebGPU Support #11

Comments

varunneal commented Jul 8, 2023

do-me commented Jul 9, 2023

lizozom commented Jul 12, 2023 • edited Loading

do-me commented Jul 12, 2023 • edited Loading

do-me commented May 15, 2024

varunneal commented May 19, 2024

do-me commented Aug 28, 2024

Transformers.js and WebGPU

varunneal commented Aug 28, 2024

gdmcdonald commented Aug 28, 2024 • edited Loading

do-me commented Aug 29, 2024

gdmcdonald commented Aug 29, 2024

do-me commented Aug 29, 2024

do-me commented Sep 4, 2024 • edited Loading

varunneal commented Sep 4, 2024

do-me commented Sep 7, 2024

do-me commented Sep 8, 2024

varunneal commented Sep 9, 2024

do-me commented Sep 9, 2024

do-me commented Sep 11, 2024 • edited Loading

varunneal commented Sep 11, 2024

lizozom commented Jul 12, 2023 •

edited

Loading

do-me commented Jul 12, 2023 •

edited

Loading

gdmcdonald commented Aug 28, 2024 •

edited

Loading

do-me commented Sep 4, 2024 •

edited

Loading

do-me commented Sep 11, 2024 •

edited

Loading