It seems there is a problem with memory allocation when processing longer prompts. I used a prompt of around 3500 tokens in LLMEval, and while the prompt is being processed the process initially uses up to 12.5 GB of memory. About 5 GB of that is the model weights, which is fine, but the extra ~7 GB doesn't seem normal: for a 3500-token prompt that works out to roughly 2 MB per token! Memory usage drops back to about 6 GB once the prompt-processing phase is done. The issue gets worse when the full context is used (it goes up to ~25 GB).
I don't have this issue with llama.cpp, since it only allocates the memory required for the weights plus a little extra for the calculations.
Configuring the memory and cache limits also doesn't help; the process throws before the prompt is processed.
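For reference, this is roughly the kind of configuration I mean (a minimal sketch assuming the `GPU.set(cacheLimit:)` / `GPU.set(memoryLimit:)` APIs in MLX Swift; the limit values are only illustrative):

```swift
import MLX

// Illustrative values only -- tune for the target device.
// Cap the buffer cache so freed buffers are returned instead of pooled.
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)          // ~20 MB cache

// Cap total GPU memory; assumed API, values are an example.
MLX.GPU.set(memoryLimit: 8 * 1024 * 1024 * 1024)   // ~8 GB limit
```

Even with limits like these in place, the prompt-processing phase still exceeds the budget and throws.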
This issue hinders running and developing applications on devices with less than 32 GB of RAM.
@davidkoski I tried that already and it doesn't help. The initial jump in memory appears only while the prompt is being processed; once tokens are generated one by one, memory usage is back to normal.
The memory needed for long prompts scales with the square of the prompt length. So in your case, roughly 3500 * 3500 * num_heads * 2 bytes would be used for the attention scores with a prompt length of 3500.
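A quick back-of-the-envelope check of that formula (a sketch only; `numHeads = 32` is an assumed example value, not taken from a specific model):

```swift
// Rough estimate of one attention-score allocation for a long prompt,
// following the L^2 * num_heads * 2-byte (fp16) formula above.
let promptLength = 3_500
let numHeads = 32          // assumed example value
let bytesPerElement = 2    // fp16

let scoreBytes = promptLength * promptLength * numHeads * bytesPerElement
print(Double(scoreBytes) / 1_073_741_824, "GB")  // ≈ 0.73 GB for one score matrix
```

If allocations like this are live for several layers at once during prompt processing, a multi-gigabyte spike seems plausible.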
What were you running when it jumped to 12GB?
Also, #93 should bring LLMEval up to parity with our Python counterpart, which can handle much longer prompts with lower memory use.