Replies: 5 comments 1 reply
-
Indeed, #9492 seems related. Back then, I was able to reproduce the issue but didn't find what was causing the slowdown. If you could pinpoint the exact commit at which this starts happening, it would be very helpful.
-
No problem, will work on that tomorrow.
-
Well, I worked on it a little today and figured out a few things. When I compiled with make, it would work every time with no delays, at least up until make was deprecated sometime in November. So I switched to cmake, and the delays started. Which got me thinking: I went back to the version I was originally using, compiled it with cmake, and that commit was now delaying too. I don't know enough about make and cmake to be of much help in that area, but I can try any flags you want me to. The cmake builds started showing increased delay times in June.

I'm still going to work on finding the actual commit where a decent jump in time happens, but for now I'll just post what I have so far and maybe it will help. All of the tests used the same model, Meta-Llama-3.1-8B-Instruct-Q8_0.gguf. I'm running a 4090 24 GB, Ryzen 7 5800X, 128 GB RAM, Linux Mint 21.3, Linux 5.15.0-130-generic. I'm not that familiar with GitHub, so I looked up how to build an older commit and found this; hope it's right.
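For reference, the usual "check out and build an older commit" sequence looks something like the sketch below. It is demonstrated on a throwaway repo (so it runs anywhere); the cmake invocations in the comments are the common llama.cpp ones and are an assumption here — the exact flags vary between versions.

```shell
set -e
# Throwaway repo standing in for a llama.cpp checkout
tmp=$(mktemp -d); cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name you
echo v1 > file; git add file; git commit -qm "older commit"
old=$(git rev-parse HEAD)        # in llama.cpp: a hash taken from `git log --oneline`
echo v2 > file; git add file; git commit -qm "latest commit"

git checkout -q "$old"           # detached HEAD at the older commit
# In llama.cpp you would now rebuild from scratch, e.g.:
#   rm -rf build
#   cmake -B build -DGGML_CUDA=ON
#   cmake --build build --config Release -j
building=$(git log -1 --format=%s)
echo "now building: $building"
```

Deleting the `build` directory between checkouts matters, since stale CMake caches can mix object files from different commits.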
Make deprecated.
I didn't add the timer until I had already seen a gradual increase in delay times and had no way to measure it. So I didn't get times for the other results, but they were slow enough to tell. I will work more on this in my spare time. Thanks for helping me with this.
-
Ok, I found one jump in delay on June 20th.
-
The first jump happens on June 5.
Both jumps reference MMQ. There is still at least one more jump, because the latest commit's time is around 5.5 seconds. The prompt I used is just "hello". Later commits' compile times are long, so it will take a while to find them. I went through and looked for references to MMQ in later commits, but there are many.
-
I'm developing a front-end for myself for Llama.cpp (and others) that lets me switch models dynamically. I've created a custom process server that runs in the background and uses execve() to launch any program on demand; currently the server runs on the local machine. When my front end needs model X, it instructs the backend server to launch llama-server with that model's arguments. If another model Y is required later, it signals the server to terminate the current llama-server instance (model X) and load model Y instead. This setup had been working well until I recently updated Llama.cpp by redownloading and recompiling the latest version (4393 (d79d8f3)).
Now I'm encountering an issue where the first POST request to /completion takes up to 15 seconds to start inference, judging by GPU utilization. Subsequent requests are much faster. The delay only occurs on the initial POST after starting llama-server via execve().
Here's a sample log for the first POST:
And for the second POST:
When I run the same command directly from the command line, there's no such delay:
Command used:
Sample log from command line:
I'm not sure what's causing this discrepancy when running via execve(). I suspect it might be related to environment variables or some change in the latest version of the server, or more than likely something stupid I'm doing. Any ideas on how to resolve this? I switch models often, and that delay is killing me. Would this have anything to do with prompt caching? I've tried turning it off with --no-kv-offload, but that did not help.
On the old version (3772 (23e0d70)), running through execve() on the first POST:
Don't know if this will be helpful at all, but the code the server uses to run llama-server is below.
Any help or ideas would be appreciated.
edit to add:
I found this; it seems kind of related: #9492. I missed it when I searched before.