
Architecture: what if I want to optimize for llama.cpp? #3390

Closed
olegklimov opened this issue Sep 29, 2023 · 0 comments


We have our model converted to gguf with quantization, shout out to @teleprint-me and @ds5t5.

But it's still slow; our problem is the prompt. Prefill speed is about 500 tokens/s (Apple M1), which is way too slow for practical use. For fill-in-the-middle code completion, the user would have to wait 4 seconds for a typical 2000-token context.

We train our own models, so the question is: what if we change the architecture? What is the bottleneck for prefill? How do we make it 5-10x faster, besides making the network smaller?
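
As a rough illustration of the numbers above, here is a small back-of-the-envelope sketch (Python; the 2000-token context and 500 tok/s prefill figures are the ones quoted above, and the 5-10x range is just the speedup mentioned in the question, used here for illustration):

```python
# Back-of-the-envelope prefill latency for fill-in-the-middle completion.
# Figures from the issue: ~500 tok/s prefill on Apple M1, ~2000-token context.

def prefill_latency(context_tokens: int, prefill_tps: float) -> float:
    """Seconds the user waits before the first completion token appears."""
    return context_tokens / prefill_tps

context = 2000       # typical FIM context size, in tokens
current_tps = 500    # measured prefill speed on Apple M1

print(f"current:   {prefill_latency(context, current_tps):.1f} s")  # ~4.0 s

# What a 5-10x prefill speedup would mean for the same context:
for speedup in (5, 10):
    t = prefill_latency(context, current_tps * speedup)
    print(f"{speedup}x faster: {t:.2f} s")  # 0.80 s and 0.40 s respectively
```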

Repository owner locked and limited conversation to collaborators Sep 29, 2023
@staviq staviq converted this issue into discussion #3395 Sep 29, 2023

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
