Can't run llama3.1-70b at full context #2301
Comments
65k starts to work, gets closer, but even that fails!
gives:
2? Seems some bad math going on. |
Only 32k actually started:
|
TGI does not support it right now; updates are so slow. |
Same problem on llama3.1-70b unquantized on 8x A6000: ..anything above ..After warmup, VRAM usage drops to 21 GB per GPU and it works fine (but with 384 GB of VRAM total you'd think 128k context should be possible):

sudo docker run --rm --name meta-llama_Meta-Llama-3.1-70B-Instruct \
  --gpus all \
  --shm-size 4g \
  -p 7861:80 \
  --ipc host \
  -v $HOME/.cache:/.cache/ \
  -v $HOME/.cache/huggingface/hub/:/data \
  -e VALIDATION_WORKERS=15 \
  -e FLASH_DECODING=1 \
  ghcr.io/huggingface/text-generation-inference:sha-db7e043 \
  --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
  --hostname 0.0.0.0 \
  --num-shard 8 \
  --max-total-tokens 42508 \
  --max-input-tokens 40460 \
  --max-batch-size 1 \
  --cuda-graphs 1

output:
..then when I set
..something about the load & warmup using more VRAM per GPU than it should, when context is large? |
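For what it's worth, a rough KV-cache estimate backs up that expectation. Assuming Llama 3.1 70B's published config (80 layers, 8 KV heads via GQA, head dim 128) and an fp16 cache, a full 128k context needs about 40 GiB of KV cache on top of roughly 140 GB of fp16 weights, well under 384 GB total. A quick sketch of the arithmetic:

# Napkin math (assumptions: 80 layers, 8 KV heads, head_dim 128, fp16 = 2 bytes per value)
# per-token KV bytes = 2 (K and V) * 80 * 8 * 128 * 2 = 327680  (320 KiB per token)
echo $(( 327680 * 131072 / 1073741824 ))   # 128k tokens -> 40 GiB of KV cache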
Having the same issue. I run into OOM errors even when running Llama 3.1 8B with a 128k context on 2x 80GB A100s. Feels like something in the prefill is taking up more VRAM than it should. |
Facing a similar issue. I'm using 4x A100 80GB, but it throws the same error when I try to set the context length to more than 40k. Is there any fix for this? |
Same issue. Commenting for visibility. |
Same here for 3.1-70b. Just adding that I'm using AWQ and can only run something like ~23k tokens on 2x A6000 Ada (96 GB total VRAM), while with vLLM I can run the full 128k with no issue. |
same issue on 4xA100 80gb |
I can't fit this model with 128K either; something is not playing nice here. (Tested vLLM with 128K, no problem.) |
The automatic inference of max-batch-prefill-tokens during the warmup phase exceeds the available VRAM, and there seems to be no easy way to control that automatic estimation. |
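If the automatic estimate is the problem, one possible workaround (a sketch, not verified on this exact setup) is to pin --max-batch-prefill-tokens yourself so warmup never tries to prefill more than that in one go; depending on the TGI version it may need to be at least --max-input-tokens:

# Sketch: pin the prefill budget instead of letting warmup estimate it.
sudo docker run --rm --gpus all --shm-size 4g -p 7861:80 \
  -v $HOME/.cache/huggingface/hub/:/data \
  ghcr.io/huggingface/text-generation-inference:sha-db7e043 \
  --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
  --num-shard 8 \
  --max-input-tokens 40460 \
  --max-total-tokens 42508 \
  --max-batch-prefill-tokens 40960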
Same issue on 4xA5000 (with Marlin FP8 quantization). |
Hi everyone 👋 Sorry for such a late reply. It seems that vLLM forces a prefix chunk of 32k (which TGI doesn't) which causes the discrepancy. |
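For comparison, the vLLM behaviour being referred to is its chunked-prefill path, which splits a long prompt into fixed-size chunks instead of prefilling it in one allocation. A minimal sketch (flag values illustrative, assuming a recent vLLM release):

# Sketch: vLLM serving Llama 3.1 70B with chunked prefill, so a 128k prompt
# is prefilled in ~32k-token chunks rather than all at once.
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 32768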
Any update on the timing around this? |
@chuddlestonCBANC it's in the works 🙌 |
@ErikKaum After #2402 got merged, I still can't fit Llama 3.1 on my 4x A6000. The log says that prefix caching is active:
But even with only 16k input and 32k total tokens I get a CUDA out-of-memory error.
This is my docker compose file:

services:
  tgi-llama3.1-70b:
    # image: ghcr.io/huggingface/text-generation-inference
    build:
      context: .
      dockerfile: Dockerfile
    restart: always
    shm_size: 64g
    env_file: .env
    environment:
      TRUST_REMOTE_CODE: true
      MODEL_ID: meta-llama/Meta-Llama-3.1-70B-Instruct
      HUGGINGFACE_HUB_CACHE: /data
      MAX_TOTAL_TOKENS: 32768
      MAX_INPUT_TOKENS: 16384
      MAX_STOP_SEQUENCES: 5
      USE_PREFIX_CACHING: true
      FLASH_INFER: true
    volumes:
      - /data/huggingface/hub/:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [ gpu ]

And the Dockerfile:

FROM ghcr.io/huggingface/text-generation-inference
RUN pip install --no-cache-dir flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
ENTRYPOINT ["/tgi-entrypoint.sh"] |
@ErikKaum any plans to look at the OOM issues with large contexts? I still get the OOM (mentioned above) regardless of prefix caching on the latest Docker images it seems. |
@ErikKaum @freegheist Yeah, I was evaluating this and trying to do napkin math for GPU memory. I am unable to run Llama 3.1 8B even at 64k on an A100. This sheet from Meta seems to imply 128k should only take 16 GB of VRAM. |
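That 16 GB figure matches a straightforward KV-cache estimate, assuming Llama 3.1 8B's config (32 layers, 8 KV heads, head dim 128) and fp16 cache entries, which suggests the cache itself is not what exhausts an 80 GB A100:

# per-token KV bytes = 2 (K and V) * 32 * 8 * 128 * 2 = 131072  (128 KiB per token)
echo $(( 131072 * 131072 / 1073741824 ))   # 128k tokens -> 16 GiB of KV cache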
Hi @freegheist 👋 Sorry for being unclear, so the PR was about prefix caching but we still need the prefix chunking in. We've had some issues with it so it's been a back and forth. Can't promise when it's in but we're working hard to get it out 🤞 |
Same issue. Commenting for increasing the priority. |
Same issue here. |
Same issue. |
Same issue |
next stop, vllm! |
Same issue! |
Is there a fix planned for this? I'm still unable to increase the context length to more than 40k. Or is there a workaround to increase the context length? |
Hi @2016bgeyer 👋 Sorry for not updating here. But I can confirm that with TGI version
I leave all the other fields undefined so that TGI auto-selects them to max out the hardware. So it's not fully up to the 128k context, but close enough IMO. And you'd get up to 128k with more VRAM or more aggressive quantization. Hopefully this helps 👍 |
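In that spirit, a minimal launch that fixes only the model and sharding and leaves the token limits for TGI to derive from free VRAM might look like the sketch below (image tag, port, and GPU count are placeholders):

# Sketch: set only what must be set; the launcher estimates max-input-tokens,
# max-total-tokens, and max-batch-prefill-tokens from available memory at warmup.
docker run --rm --gpus all --shm-size 4g -p 8080:80 \
  -v $HOME/.cache/huggingface/hub/:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
  --num-shard 4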
Fantastic, thank you for the update! In the future, is there any chance you guys could track and link issues in your MRs a bit more, at least when multiple people have been blocked by an issue? Thanks! |
System Info
TGI version: 2.2.0
Information
Tasks
Reproduction
On 4*H100:
get:
vLLM works fine without errors.
Expected behavior
Able to launch and use the model without errors, like vLLM.