
Architecture: what if I want to optimize for llama.cpp? #3390

Closed
olegklimov opened this issue Sep 29, 2023 · 0 comments


We have our model converted to gguf with quantization, shout out to @teleprint-me and @ds5t5.

But it's still slow; our problem is the prompt. Prefill speed is about 500 tokens/s (Apple M1), which is way too slow for practical use. For fill-in-the-middle code completion, the user would have to wait 4 seconds for a typical 2000-token context.

We train our own models, so the question is: what if we change the architecture? What is the bottleneck for prefill? How do we make it 5-10x faster, besides making the network smaller?
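
As a rough illustration of the numbers above, here is a small back-of-the-envelope sketch (Python; the 2000-token context and 500 tok/s prefill figures are the ones quoted above, and the 5-10x range is just the speedup mentioned in the question, used here for illustration):

```python
# Back-of-the-envelope prefill latency for fill-in-the-middle completion.
# Figures from the issue: ~500 tok/s prefill on Apple M1, ~2000-token context.

def prefill_latency(context_tokens: int, prefill_tps: float) -> float:
    """Seconds the user waits before the first completion token appears."""
    return context_tokens / prefill_tps

context = 2000       # typical FIM context size, in tokens
current_tps = 500    # measured prefill speed on Apple M1

print(f"current:   {prefill_latency(context, current_tps):.1f} s")  # ~4.0 s

# What a 5-10x prefill speedup would mean for the same context:
for speedup in (5, 10):
    t = prefill_latency(context, current_tps * speedup)
    print(f"{speedup}x faster: {t:.2f} s")  # 0.80 s and 0.40 s respectively
```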

Repository owner locked and limited conversation to collaborators Sep 29, 2023
@staviq staviq converted this issue into discussion #3395 Sep 29, 2023

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
