[bounty] CPU inference support, Mac M1/M2 inference support #77
Comments
/bounty $2000 |
💎 $2,000 bounty created by olegklimov
/attempt #77 Options |
/attempt #77 Options |
Note: The user @Akshay-Patel-dev is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @Akshay-Patel-dev will complete the issue first, and be awarded the bounty. We recommend discussing with @Akshay-Patel-dev and potentially collaborating on the same solution versus creating an alternate solution. |
You can start by installing it and trying it out. But unless you are already familiar with CPU inference libraries and LLMs in general, it might take you quite a long time to research. |
I forked the project and performed the steps in the contributing.md file, but I'm getting errors and am unable to run it locally. |
I added this, because the error I encountered says it has to be added. |
CPU project names: ggml, ctransformers |
/attempt #77 I've got a preliminary version working with ctransformers. I can have a codellama FIM 7B demo up and running soon. Options |
Note: The user @shobhit9957 is already attempting to complete issue #77 and claim the bounty. If you attempt to complete the same issue, there is a chance that @shobhit9957 will complete the issue first, and be awarded the bounty. We recommend discussing with @shobhit9957 and potentially collaborating on the same solution versus creating an alternate solution. |
An interesting link: Example of GGUFs of all sizes: |
If this is still open, I might try it out. Would the bounty claim still count for model conversion to GGUF format? I understand it's first come, first served. I'm just wondering if you're looking for a conversion script or if you just want general CPU support. Quantization is a bit different from CPU inference and I'm just looking for clarity on the scope. If you just want quantization, then I can look into creating a conversion script and I'll submit an attempt if I get it working and this is still open. |
Someone is trying the heavy lifting here: ggerganov/llama.cpp#3061 |
Yes, I saw that. That's why I'm asking. I know that in order to do it, one would need to use the GGUF library to convert the tensors. It would require a custom script, like the others that already exist in the llama.cpp repository. Your original request was in reference to the |
@teleprint-me We are moving away from server-side scratchpads, in favor of client-side scratchpads. The plugins that can do it should land next week or a week after. There still has to be a script that takes the tasks to do, using In short, the requirement "Script similar to inference_hf.py" can now read "Script similar to inference_hf.py, but only /v1/completions needs to work". Script to test:
Stream and not stream should work, CPU output should be the same as current GPU output -- sounds like a well defined criterion. |
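A minimal sketch of how the "stream and not stream should work" criterion could be checked. The thread only names the /v1/completions endpoint; the response shapes used here (an OpenAI-style `choices[0].text` field, `data:` lines, and a `[DONE]` terminator) are assumptions, and the sample responses are hypothetical stand-ins for what a server would return.

```python
import json


def join_streamed_text(sse_lines):
    """Concatenate the text pieces of a streamed completion.

    Assumes OpenAI-style server-sent events: each event is a line
    'data: <json>' and the stream ends with 'data: [DONE]'.
    """
    pieces = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        pieces.append(chunk["choices"][0]["text"])
    return "".join(pieces)


def outputs_match(non_stream_body, sse_lines):
    """True when the streamed text equals the non-streamed text."""
    full_text = json.loads(non_stream_body)["choices"][0]["text"]
    return full_text == join_streamed_text(sse_lines)


# Hypothetical captured responses for the same prompt and settings.
non_stream = json.dumps({"choices": [{"text": "return a + b"}]})
stream = [
    'data: {"choices": [{"text": "return"}]}',
    "",
    'data: {"choices": [{"text": " a + b"}]}',
    "data: [DONE]",
]

print(outputs_match(non_stream, stream))
```

The same comparison could be applied between a CPU run and a GPU run, provided sampling is deterministic (for example greedy decoding); otherwise identical output is not expected.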
That's exactly what I was looking for, thank you for the update. I'll be reviewing the other open bounties in the coming days as well. Currently, I'm setting up a custom OS for my new workstation and finalizing the prototype interface for my personal assistant. If I make significant progress that aligns with the criteria for any of the outstanding bounties, I'll submit an attempt and, if appropriate, a subsequent PR. Given that I'm working against a deadline, I'm highly motivated to contribute efficiently and effectively. |
/attempt #77 Options |
💡 @ds5t5 submitted a pull request that claims the bounty. You can visit your org dashboard to reward. |
Testing this:
I see speed:
|
Xeon 5315Y
M1 doesn't depend on threads. |
First token, 551 prompt:
I'd say that's the main problem for adoption of this. 551-token prompt isn't even that big, normally we have about 1950 tokens. |
I tried Starcoder 1b, converted by TabbyML: https://huggingface.co/TabbyML/StarCoder-1B/tree/main/ggml
|
@olegklimov I think it has to do with the conversion process. They're looking into it. Typically the smaller models are much faster in llama.cpp. |
Try the 4-bit model, you should see a performance boost compared to the 16-bit model.
4-bit
llama_print_timings: load time = 45.88 ms
llama_print_timings: sample time = 3.91 ms / 300 runs ( 0.01 ms per token, 76706.72 tokens per second)
llama_print_timings: prompt eval time = 56.82 ms / 9 tokens ( 6.31 ms per token, 158.38 tokens per second)
llama_print_timings: eval time = 6762.85 ms / 299 runs ( 22.62 ms per token, 44.21 tokens per second)
llama_print_timings: total time = 6933.22 ms
8-bit
llama_print_timings: load time = 71.79 ms
llama_print_timings: sample time = 3.72 ms / 300 runs ( 0.01 ms per token, 80623.49 tokens per second)
llama_print_timings: prompt eval time = 54.23 ms / 9 tokens ( 6.03 ms per token, 165.94 tokens per second)
llama_print_timings: eval time = 11387.12 ms / 299 runs ( 38.08 ms per token, 26.26 tokens per second)
llama_print_timings: total time = 11553.91 ms
16-bit
llama_print_timings: load time = 5828.46 ms
llama_print_timings: sample time = 4.17 ms / 300 runs ( 0.01 ms per token, 71856.29 tokens per second)
llama_print_timings: prompt eval time = 72.36 ms / 9 tokens ( 8.04 ms per token, 124.38 tokens per second)
llama_print_timings: eval time = 20573.06 ms / 299 runs ( 68.81 ms per token, 14.53 tokens per second)
llama_print_timings: total time = 20760.76 ms
The 16-bit and 32-bit converted tensor formats will perform about the same on lower-end hardware. Also, llama.cpp is still working on its FIM implementation. Quants range from 2-bit to 16-bit and support k-bit implementations, in case you aren't too familiar with the library or quant types. |
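To compare the quantizations above, the tokens-per-second figure can be recomputed from the raw time and run count in a llama_print_timings line. The following is a small sketch; the regular expression and helper name are illustrative and only target the line format quoted in this thread.

```python
import re

# Matches e.g. "llama_print_timings: eval time = 6762.85 ms / 299 runs (...)".
# The "/ N runs|tokens" part is optional because load/total lines omit it.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?:\s+/\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timing(line):
    """Return (name, milliseconds, tokens_per_second or None)."""
    match = TIMING_RE.search(line)
    if match is None:
        return None
    ms = float(match.group("ms"))
    count = match.group("count")
    tps = int(count) / (ms / 1000.0) if count else None
    return match.group("name"), ms, tps


q4 = parse_timing(
    "llama_print_timings: eval time = 6762.85 ms / 299 runs "
    "( 22.62 ms per token, 44.21 tokens per second)"
)
f16 = parse_timing(
    "llama_print_timings: eval time = 20573.06 ms / 299 runs "
    "( 68.81 ms per token, 14.53 tokens per second)"
)

print(q4[0], round(q4[2], 2))                  # generation speed, 4-bit
print(f16[0], round(f16[2], 2))                # generation speed, 16-bit
print("speedup:", round(f16[1] / q4[1], 2))    # 4-bit vs 16-bit eval time
```

On these numbers the 4-bit model generates roughly three times faster than the 16-bit one. Note that the prompt here is only 9 tokens, so these runs say little about prefill time for long prompts.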
OK it works nicely! So all the credit goes to @ds5t5, right? |
@teleprint-me oh I see you've converted the 1.6b model in several quantizations, thank you for that! (I thought your tests were for llama, the name is confusing) |
@ds5t5 Hi there! We are going to slightly change modelling and weights respectively at the HF. The changes will include:
Guess we need to update ggerganov/llama.cpp#3329 as well |
thanks. let me know when it is ready for model weight. i will rebase my llama.cpp PR to the latest branch of llama.cpp. |
@JegernOUTT can i ask why we decided to make the weight change? it seems not quite aligned with other popular models. they (falcon, llama) usually keep mlp.linear_1 and mlp.linear_3 separately. while for attention, it is usually qkv or q/k/v. only the original gpt2 model uses kv as one. |
@ds5t5 We are using different inference backends in |
@JegernOUTT it seems like the latest push breaks the
|
@ds5t5 what problem do you have? |
nvm. i removed my cache and it works |
I'm working on a mod to get HF refact model to run on CPU since I don't have a working GPU backend at the moment. Not too many changes either and I just need to get the server running. Also working on a refact template for llama-cpp-python for inference in refact, so it would just be plug in and play. This wouldn't work until @ds5t5's downstream changes make it into llama-cpp-python though. Hopefully I'll have it done by the end of this weekend. |
@teleprint-me We were thinking more along the lines of bundling llama.cpp with our rust binary, linked together. The rust binary is shipped with our next-gen plugins, such as VS Code. This might allow for a much lower cost of installation for the end user: no docker, nothing to install, no strange packages in local python, nothing to run separately or care about. The largest problem is prompt prefill, about 4 seconds for 2048 tokens, on Apple M1. That's a bit too long for interactive use. So I asked in llama.cpp what people think about an architecture more suitable for CPU or M1, here ggerganov/llama.cpp#3395 . We can train a new model so it prefills the prompt faster, we have the data and the GPUs! Or maybe M2 will fix the speed 😂 (I didn't try yet). |
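As a rough back-of-the-envelope check of the prefill figure above: 4 seconds for 2048 tokens is about 512 tokens per second. The sketch below extrapolates that to the roughly 1950-token prompts mentioned earlier in the thread. It assumes prefill time scales linearly with prompt length, which is a simplification, and the function names are illustrative.

```python
def prefill_rate(tokens, seconds):
    """Prompt-processing throughput in tokens per second."""
    return tokens / seconds


def estimated_prefill_seconds(prompt_tokens, rate):
    """Estimated wait before the first token, assuming linear scaling."""
    return prompt_tokens / rate


# Figure quoted in the thread for Apple M1: about 4 s for 2048 tokens.
rate = prefill_rate(2048, 4.0)
print(rate)

# Typical prompt size mentioned in the thread: about 1950 tokens.
print(round(estimated_prefill_seconds(1950, rate), 2))
```

Under that assumption a typical prompt would still wait close to four seconds for its first token, which is consistent with the concern about interactive use.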
Alright |
i have updated the converter in the PR in llama.cpp based on the latest revision in huggingface hub. It looks like the llama.cpp community wants to wait for a few PRs to be merged before Refact PR is officially merged. i see another 5-10% performance boost after my change to the latest commit of llama.cpp. @olegklimov |
@ds5t5: Your claim has been rewarded! We'll notify you once it is processed. |
🎉🎈 @ds5t5 has been awarded $2,000! 🎈🎊 |
The docker line in the readme doesn't work for Mac/CPU, any chance to get an update on how to run it on Mac arm? |
any updates? |
Yes, we'll release bring-your-own-key in a few days |
Bring your own key is there, but the docker container still doesn't work on an M1. |
You are right, it doesn't. Other servers do though, you can help us if you test it! |
There are several projects aiming to make inference on CPU efficient.
The first part is research:
inference_hf.py
does it (needs a callback that streams output and allows stopping),
Please finish the first part, get a "go-ahead" for the second part.
The second part is implementation:
inference_hf.py