Replies: 5 comments 1 reply
-
Indeed, #9492 seems related. Back then, I was able to reproduce the issue but didn't find what was causing the slowdown. If you could pinpoint the exact commit at which this starts happening, it would be very helpful.
-
No problem, will work on that tomorrow.
-
Well, I worked on it a little today and figured out a few things. When I compiled with make, it would work every time with no delays, at least up until make was deprecated sometime in November. So I switched to cmake, and the delays started. Which got me thinking: I went back to the version I was originally using, compiled it with cmake, and that commit was now delaying too. I don't know enough about make and cmake to be of much help in that area, but I can try any flags you want me to. The cmake builds started showing increased delay times in June.

I'm still going to work on finding the actual commit where a decent jump in time happens, but for now I'll just post what I have so far and maybe it will help. All of the tests used the same model, Meta-Llama-3.1-8B-Instruct-Q8_0.gguf. I'm running a 4090 24 GB, Ryzen 7 5800X, 128 GB RAM, Linux Mint 21.3, Linux 5.15.0-130-generic. I'm not that familiar with GitHub, so I looked up how to build an older commit and found this; hope it's right.
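For reference, the usual "check out and build an older commit" sequence looks something like the sketch below. It is demonstrated on a throwaway repo (so it runs anywhere); the cmake invocations in the comments are the common llama.cpp ones and are an assumption here — the exact flags vary between versions.

```shell
set -e
# Throwaway repo standing in for a llama.cpp checkout
tmp=$(mktemp -d); cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name you
echo v1 > file; git add file; git commit -qm "older commit"
old=$(git rev-parse HEAD)        # in llama.cpp: a hash taken from `git log --oneline`
echo v2 > file; git add file; git commit -qm "latest commit"

git checkout -q "$old"           # detached HEAD at the older commit
# In llama.cpp you would now rebuild from scratch, e.g.:
#   rm -rf build
#   cmake -B build -DGGML_CUDA=ON
#   cmake --build build --config Release -j
building=$(git log -1 --format=%s)
echo "now building: $building"
```

Deleting the `build` directory between checkouts matters, since stale CMake caches can mix object files from different commits.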
Make deprecated.
I didn't add the timer until I had already seen a gradual increase in delay times and had no way to measure it. So I didn't get times for the other results, but they were slow enough to tell. I will work more on this in my spare time. Thanks for helping me with this.
-
Ok, I found one jump in delay on June 20th.
-
The first jump happens on June 5.
Both jumps reference MMQ. There is still at least one more jump, because the latest commit's time is around 5.5 seconds. The prompt I used is just "hello". Later commits' compile times are long, so it will take a while to find them. I went through and looked for references to MMQ in later commits, but there are many.
-
I'm developing a front-end for myself for Llama.cpp (and others) that lets me switch models dynamically. I've created a custom process server that runs in the background and uses execve() to launch any program on demand; currently the server runs on the local machine. When my front end needs model X, it instructs the backend server to launch llama-server with that model's arguments. If another model Y is required later, it signals the server to terminate the current llama-server instance (model X) and load model Y instead. This setup had been working well until I recently updated Llama.cpp by redownloading and recompiling the latest version (4393 (d79d8f3)).
Now I'm encountering an issue where the first POST request to /completion takes up to 15 seconds to start inference, judging by GPU utilization. Subsequent requests are much faster. The delay only occurs on the initial POST after starting llama-server via execve().
Here's a sample log for the first POST:
And for the second POST:
When I run the same command directly from the command line, there's no such delay:
Command used:
Sample log from command line:
I'm not sure what's causing this discrepancy when running via execve(). I suspect it might be related to environment variables or some change in the latest version of the server, or more than likely something stupid I'm doing. Any ideas on how to resolve this? I switch models often, and that delay is killing me. Would this have anything to do with prompt caching? I've tried turning it off with --no-kv-offload, but that did not help.
On the old version (3772 (23e0d70)), running through execve() on the first POST:
Don't know if this will be helpful at all, but the code the server uses to run llama-server is below.
Any help or ideas would be appreciated.
edit to add:
I found this; it seems kind of related: #9492. I missed it when I searched before.