WebGPU Support #11

Open

varunneal opened this issue Jul 8, 2023 · 19 comments

Comments

@varunneal
Collaborator

GPU acceleration of transformers is possible, but it is hacky.

It requires an unmerged PR version of transformers.js that relies on a patched version of onnxruntime-node.

Xenova plans on merging this PR only after onnxruntime has official support for GPU acceleration. In the meantime, this change could be implemented, potentially as an advanced "experimental" feature.

@do-me
Owner

do-me commented Jul 9, 2023

This is really cool and paves the way for LLMs running in the browser!

I've had this idea in my head for a while now: we already have a kind of (primitive) vector DB (just a JSON) and a small model for embeddings. If we added an LLM for Q&A / text generation based solely on the information in the text, this would be huge!

I already asked the folks from Qdrant on their Discord server whether they'd be interested in providing a JS/WebAssembly version of their Rust-based vector DB (as they have developed plenty of optimizations), but for the moment they have other priorities. Still, they said they might go for it at some point.

Anyway, I think this would make for an interesting POC to explore. As for integrating it directly: until it's officially supported, we could maybe detect WebGPU support automatically and simply load the right version. Or does the WebGPU version also support CPU?
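
A rough sketch of what such automatic detection could look like in the browser (this is an illustration, not code from this repo; the log messages are placeholders):

async function hasWebGPU() {
  // navigator.gpu only exists in browsers with WebGPU enabled
  if (!('gpu' in navigator)) return false;
  try {
    // The API can be exposed while no adapter is actually available
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}

const useWebGPU = await hasWebGPU();
console.log(useWebGPU ? 'Loading the WebGPU build' : 'Falling back to the CPU/WASM build');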

P.S. There would be so much fun in NLP with LLMs if, for example, we created an image of all leitmotifs in a text, or some kind of text summary image, for a visual understanding of the text...

@lizozom
Contributor

lizozom commented Jul 12, 2023

I am working on a similar effort myself, let's cooperate!

More specifically, I wanted to use this project as the basis for an SDK that lets anyone run semantic search on their own website's content.

@do-me
Owner

do-me commented Jul 12, 2023

Sounds great! 
It's also on the feature/idea list in the README.md that this repo could become a browser plugin for Firefox or Chrome. Of course, it would need a leaner GUI.

I was thinking of some kind of search bar integrated at the top of a webpage, like Algolia / lunr etc. do. A good example is the MkDocs Material homepage:

[screenshots of the MkDocs Material search bar]

(By the way, I also had ideas for integrating semantic search into MkDocs, but I'm lacking the time at the moment...)

What about your idea?

(We're kind of drifting away from this issue's topic, let's move to discussions: #15)

@do-me
Owner

do-me commented May 15, 2024

We're finally getting closer to WebGPU support: huggingface/transformers.js#545
It's already usable in the dev branch. I'm really excited about this, as people are reporting speedups of 20x-100x!

In my case (M3 Max) I'm getting a massive inference speedup of 32x-46x. See for yourself: https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark

Even with cheap integrated graphics (Intel "integrated" GPUs like UHD or Iris Xe) I get a 4x-8x boost. So literally everyone would see massive speed gains!

This is the most notable performance improvement I see at the moment, hence referencing #49.

I hope that transformers.js will allow for some kind of automatic setting where WebGPU is used if available and otherwise falls back to plain CPU.

@varunneal
Collaborator Author

Speedup is about 10x for me on an M1. Definitely huge. Not sure how embeddings will compare to inference in terms of GPU optimization but I think there is huge room for parallelization.

@do-me
Owner

do-me commented Aug 28, 2024

Transformers.js and WebGPU

Folks, it's finally here 🥹
https://huggingface.co/posts/Xenova/681836693682285

However, afaik there are no docs for v3 yet. I tried updating SemanticFinder to v3 and running some quick tests, but failed.

  1. npm uninstall @xenova/transformers, then npm install @huggingface/transformers
  2. Replace the import statements in semantic.js and worker.js with import { ... } from '@huggingface/transformers';
  3. Set a WebGPU-compatible model (not sure whether all are compatible by default?) like <option selected value="Xenova/all-MiniLM-L12-v2">Xenova/all-MiniLM-L12-v2 | 💾133MB | 66.7MB | 34MB 📥2 ❤️3</option> in index.html
  4. Change the extractor pipeline and use it e.g. like this:

[screenshot of the updated extractor pipeline code]
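
Since the screenshot isn't reproduced here, a rough sketch of what step 4 might look like; the model name and options are assumptions based on the (then undocumented) v3 alpha API, not the exact code from the screenshot:

import { pipeline } from '@huggingface/transformers';

// Request the WebGPU backend when creating the feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L12-v2', {
    device: 'webgpu',
});

// Mean-pooled, normalized sentence embedding
const output = await extractor('Hello world', { pooling: 'mean', normalize: true });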

Unfortunately it still throws some errors, but I'd say it's better to wait for the official v3 docs. Also, it's in alpha at the moment, so errors are pretty much expected.

@varunneal
Collaborator Author

exciting news!

@gdmcdonald
Contributor

gdmcdonald commented Aug 28, 2024

@do-me I think you also have to change the quantized: true flag to dtype: "fp32" for unquantized, or dtype: "fp16", "q8", etc. for quantized.

import { pipeline } from '@huggingface/transformers';

const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
    device: 'webgpu', // run inference on the GPU via WebGPU
    dtype: 'fp32',    // or 'fp16' / 'q8' for quantized weights
});

Examples:
huggingface/transformers.js#784
huggingface/transformers.js#894

@do-me
Owner

do-me commented Aug 29, 2024

@gdmcdonald thanks for the references. Note that in those examples they used the "old" package @xenova/transformers; here I tried with the new @huggingface/transformers.

In my screenshot above you can see that dtype was set automatically, so apparently that's not the problem.

Rather, the problem seems to stem from worker.js:94 An error occurred during model execution: "Error: Session already started"., an error I don't understand, as only one session is created in the code.

We built the core embedding logic tightly around the old version of transformers.js, with callbacks etc., so I guess there is some compatibility problem with the new logic or simply a bug in @huggingface/transformers.

When I manage to find some time, I will try the v3 branch of @xenova/transformers again. If someone else wants to give it a try and create a PR, helping hands are always welcome :)

@gdmcdonald
Contributor

Ah ok. I was using @huggingface/transformers v3 as well and ran into the same issue you did:
worker.js:94 An error occurred during model execution: "Error: Session already started". I just assumed I had too many WebGPU tabs open. Apologies for the spam!

@do-me
Owner

do-me commented Aug 29, 2024

Found a bug with webgpu (wasm works fine): huggingface/transformers.js#909

The problem is calling the extractor two consecutive times: the first call works (for the query embedding), but the second one fails (for the chunk embeddings).

@do-me
Owner

do-me commented Sep 4, 2024

Folks, it's here! 🥳
I added webgpu support in the new branch and it's fast!

There was a simple problem in the old code where I would call Promise.all() for parallel execution, which was nonsense; more detail about this here: huggingface/transformers.js#909

I needed to modify this code in f148689. The main changes were in index.js.
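
To illustrate the kind of change (a sketch only, not the exact diff from f148689; variable names are illustrative): concurrent runs on the same session appear to be what triggers the "Session already started" error on WebGPU, so the parallel calls are awaited one after another instead:

// Before (fails on WebGPU with "Session already started"):
// const embeddings = await Promise.all(chunks.map(chunk => extractor(chunk, opts)));

// After: await each inference call sequentially
const embeddings = [];
for (const chunk of chunks) {
    embeddings.push(await extractor(chunk, { pooling: 'mean', normalize: true }));
}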

It's really fast! On my system it indexes the whole Bible in about 3 minutes with a small model like Xenova/all-MiniLM-L6-v2, whereas before with wasm it would take 30-40 minutes.

Not all models are supported, so we should go down that rabbit hole and see whether we can somehow filter the models in index.html for the webgpu branch.
Also, the newer @huggingface/transformers versions starting with v0.10 have some kind of bug, so I needed to hardcode version 0.9.

I was trying to set up a GitHub Action for the new webgpu branch so it would build the webgpu version and push it to gh-pages in a /webgpu dir, but there were errors I couldn't follow up on so far: it somehow overwrote the files in the main directory and did not create the /webgpu dir. You can see my old attempts in the history. If someone wants to give a hand, it would be highly appreciated :)

Anyway, I'm really excited about this change!

@varunneal
Collaborator Author

Fantastic news! Just played around with it and it's working well on my M1. Will follow up to see if I can help with the errors.

@do-me
Owner

do-me commented Sep 7, 2024

Finally managed to come up with the correct GitHub Action.

@do-me
Owner

do-me commented Sep 8, 2024

According to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/ you usually get even better speed-ups when processing in batches. At the moment, the naive logic in SemanticFinder just processes one single chunk at a time which might cause a major bottleneck. Will look into this.

@varunneal
Collaborator Author

@do-me can you tell me how to update the github action as well for my fork of semantic-finder? ty

@do-me
Owner

do-me commented Sep 9, 2024

It's easy: you simply haven't cloned/checked out the webgpu branch yet (if I see correctly). If you add the branch to your repo, it will work.
[screenshot]

@do-me
Owner

do-me commented Sep 11, 2024

> According to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/ you usually get even better speed-ups when processing in batches. At the moment, the naive logic in SemanticFinder just processes one single chunk at a time which might cause a major bottleneck. Will look into this.

Batch size changes everything. It gives me insane speed-ups of more than 20x.

[screenshots of batch-size benchmark results]

I created a small app based on one of the first versions of SemanticFinder for testing the batch size. In my tests, a batch size of around 180 chunks per extractor() (inference) call gives the best results.

Play with it here: https://geo.rocks/semanticfinder-webgpu/.
See also: https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/discussions/103

The current logic in SemanticFinder is more complex than this minimal app, so it takes more time to update everything. I could use a hand here, as I probably won't find time until next week.
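
For anyone picking this up, a rough sketch of the batching pattern (variable names and options are illustrative, not the actual SemanticFinder code; 180 chunks per call per the tests above):

const BATCH_SIZE = 180;
const embeddings = [];
for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    // One extractor() call embeds the whole batch on the GPU
    const output = await extractor(batch, { pooling: 'mean', normalize: true });
    embeddings.push(...output.tolist());
}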

@varunneal
Collaborator Author

Will look into adding it if I get a chance this week.

do-me changed the title from "(Eventual) GPU Acceleration" to "WebGPU Support" on Sep 12, 2024