Can't run llama3.1-70b at full context #2301

Open

pseudotensor opened this issue Jul 24, 2024 · 30 comments

@pseudotensor

System Info

2.2.0

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

On 4*H100:

docker stop llama31-70b-tgi ; docker remove llama31-70b-tgi
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 131072 \
             --max-total-tokens 139264 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt

I get:

RuntimeError: Not enough memory to handle 131122 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

vLLM works fine without errors.

Expected behavior

Able to launch and use without error, like vLLM.

@pseudotensor
Author

pseudotensor commented Jul 24, 2024

A 65k context gets closer to working, but even that fails!

docker stop llama31-70b-tgi ; docker remove llama31-70b-tgi
source ~/h2ogpt_ops/gr_exports.sh
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 66560 \
             --max-total-tokens 74752 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt

gives:

RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2024-07-24T17:32:16.553191Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1101, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1504, in generate_token
    prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.91 GiB. GPU  has a total capacity of 79.33 GiB of which 1.41 GiB is free. Process 1404711 has 77.91 GiB memory in use. Of the allocated memory 76.30 GiB is allocated by PyTorch, and 27.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2024-07-24T17:32:16.689631Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.699306Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.702328Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.728006Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED

2 prefill tokens? It seems like some bad math is going on.
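
For reference, the 15.91 GiB allocation in the traceback above is almost exactly the size of a full-vocabulary logits tensor over every prefill token. A rough back-of-the-envelope, assuming Llama 3.1's 128256-token vocabulary and 2-byte bf16 activations (not taken from the TGI source):

# Rough estimate of the warmup prefill-logits allocation.
# Assumptions: Llama 3.1 vocab size 128256, bf16 activations (2 bytes/element).
prefill_tokens = 66_610       # max_prefill_tokens from the warmup log above
vocab_size = 128_256
bytes_per_element = 2         # bf16

logits_bytes = prefill_tokens * vocab_size * bytes_per_element
print(f"{logits_bytes / 2**30:.2f} GiB")   # ~15.91 GiB, matching the OOM message

If that reading is right, warmup materializes logits (plus their log_softmax) for the entire prefill at once, which would blow up long before the KV cache itself is the limit.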

@pseudotensor
Author

Only a 32k context actually started:

docker stop llama31-70b-tgi ; docker remove llama31-70b-tgi
source ~/h2ogpt_ops/gr_exports.sh
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 32768 \
             --max-total-tokens 40960 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt
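
Once that 32k container is up, a quick smoke test (a minimal sketch, assuming the 5005:80 port mapping from the command above and TGI's standard /generate endpoint):

# Minimal smoke test against the 32k container launched above.
# Assumes the 5005:80 port mapping from the docker run command.
import requests

resp = requests.post(
    "http://localhost:5005/generate",
    json={
        "inputs": "How long is the Llama 3.1 context window?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])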

@coderchem

TGI does not support it right now; updates are so slow.

@freegheist

freegheist commented Jul 25, 2024

Same problem on llama3.1-70b unquantized on 8xA6000:

Anything above --max-input-tokens=38412 causes OOM (each GPU goes to 36GB of VRAM used out of 48GB total during load, then the OOM happens during the warmup phase in the v2.2.0 Docker image; smaller values scrape through).

After warmup, VRAM usage drops to 21GB per GPU and it works fine (but with 384GB of VRAM total, you'd think a 128k context should be possible):

sudo docker run --rm --name meta-llama_Meta-Llama-3.1-70B-Instruct \
   --gpus all \
   --shm-size 4g \
   -p 7861:80 \
   --ipc host \
   -v $HOME/.cache:/.cache/ \
   -v $HOME/.cache/huggingface/hub/:/data \
   -e VALIDATION_WORKERS=15 \
   -e FLASH_DECODING=1 \
   ghcr.io/huggingface/text-generation-inference:sha-db7e043 \
   --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
   --hostname 0.0.0.0 \
   --num-shard 8 \
   --max-total-tokens 42508 \
   --max-input-tokens 40460 \
   --max-batch-size 1 \
   --cuda-graphs 1

output:

2024-07-25T09:49:25.860540Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 40460
2024-07-25T09:49:25.860548Z  INFO text_generation_launcher: Sharding model on 8 processes
...
2024-07-25T09:50:57.501322Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-07-25T09:51:41.686292Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1101, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1504, in generate_token
    prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.67 GiB. GPU  has a total capacity of 47.44 GiB of which 9.30 GiB is free. Process 1462226 has 38.13 GiB memory in use. Of the allocated memory 37.41 GiB is allocated by PyTorch, and 276.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1103, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 1 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

Then, when I set --max-input-tokens=38412 and --max-total-tokens=42508, it connects, but I'm not sure where it gets this max batch total tokens value of 69888 from:

2024-07-25T10:08:45.810270Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-07-25T10:09:29.103273Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1]
2024-07-25T10:09:29.651189Z  INFO text_generation_router::server: router/src/server.rs:1599: Using scheduler V3
2024-07-25T10:09:29.651204Z  INFO text_generation_router::server: router/src/server.rs:1651: Setting max batch total tokens to 69888
2024-07-25T10:09:30.484670Z  INFO text_generation_router::server: router/src/server.rs:1889: Connected

Something about the load & warmup seems to use more VRAM per GPU than it should when the context is large?
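
If it helps with the napkin math: the 9.67 GiB warmup allocation above is consistent with a full-vocabulary logits tensor over the whole prefill. A sketch of the arithmetic, assuming Llama 3.1's 128256-token vocabulary, bf16, and the 70B architecture (80 layers, 8 KV heads, head_dim 128); this is not taken from the TGI source:

# Back-of-the-envelope for the warmup OOM above.
# Assumptions: Llama 3.1 vocab 128256, bf16 (2 bytes/element).
max_prefill_tokens = 40_460   # default max_batch_prefill_tokens in this run
vocab_size = 128_256

logits_gib = max_prefill_tokens * vocab_size * 2 / 2**30
print(f"prefill logits: {logits_gib:.2f} GiB")            # ~9.67 GiB

# Per-token KV cache for Llama-3.1-70B in bf16, summed across all shards
# (80 layers x 8 KV heads x head_dim 128 x K and V x 2 bytes):
kv_bytes_per_token = 80 * 8 * 128 * 2 * 2
print(f"KV cache: {kv_bytes_per_token / 1024:.0f} KiB/token")   # 320 KiB/token

So the transient prefill-logits peak, not the KV cache, looks like what eats the VRAM during warmup; and if TGI sizes max batch total tokens from whatever is free around that peak, that could also explain the surprisingly small 69888 (a guess, not verified against the TGI source).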

@Jason-CKY
Contributor

Having the same issue. I run into OOM errors even when running Llama 3.1 8B with a 128k context on 2x 80GB A100. It feels like something in the prefill is taking up extra VRAM.

@rishu931997

Facing a similar issue. I'm using 4x A100 80GB, but it throws the same error when I try to set the context length to more than 40k. Is there any fix for this?

@nrepesh

nrepesh commented Jul 26, 2024

Same issue. Commenting for visibility.

@mjsteele12

Same here for 3.1-70B. Just adding that I'm using AWQ and can only run something like ~23k tokens on 2x A6000 Ada (96GB total VRAM), while with vLLM I can run the full 128k with no issue.

@weihanfeng

Same issue on 4x A100 80GB.

@maziyarpanahi
Contributor

I can't fit this model at 128K either; something is not playing nice here. (I tested vLLM with 128K, no problem.)

@badrisnps

The automatic inference of max-batch-prefill-tokens during the warmup phase exceeds the available VRAM, and there seems to be no easy way to control that automatic estimation.

@localmind-ai

Same issue on 4xA5000 (with Marlin FP8 quantization).

@ErikKaum
Member

ErikKaum commented Aug 8, 2024

Hi everyone 👋

Sorry for such a late reply.
Thanks for reporting this issue and bringing it to our attention. We're currently rewriting a bunch of things and a fix for this is among those 👍

It seems that vLLM forces a prefix chunk of 32k (which TGI doesn't), and that causes the discrepancy.
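
To illustrate why chunking helps (this is just the shape of the idea, not TGI's or vLLM's actual code; model, new_kv_cache, and forward below are hypothetical placeholders): processing the prompt in fixed-size chunks bounds transient per-token intermediates such as the logits tensor, while the KV cache still grows to the full prompt length.

# Illustration only: why a fixed prefill/prefix chunk bounds peak memory.
# `model`, `new_kv_cache`, and `forward` are hypothetical placeholders.
CHUNK = 32_768  # e.g. a 32k chunk

def chunked_prefill(model, prompt_ids, chunk=CHUNK):
    """Feed the prompt in chunks: intermediates scale with `chunk`,
    not with the full prompt length; the KV cache still ends up
    covering the whole prompt."""
    kv_cache = model.new_kv_cache()
    logits = None
    for start in range(0, len(prompt_ids), chunk):
        piece = prompt_ids[start:start + chunk]
        # Only the last chunk's logits are needed to sample the first new token.
        logits = model.forward(piece, kv_cache=kv_cache)
    return logits, kv_cache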

@chuddlestonCBANC

Any update on the timing around this?

@ErikKaum
Member

@chuddlestonCBANC it's in the works 🙌
#2402

@raimannma

raimannma commented Aug 20, 2024

@ErikKaum After #2402 got merged, I still can't fit Llama 3.1 on my 4x A6000.

The log says that prefix caching is active:

tgi-llama3.1-70b-1  | 2024-08-20T11:52:04.171854Z  INFO text_generation_launcher: Using prefix caching = True
tgi-llama3.1-70b-1  | 2024-08-20T11:52:04.171911Z  INFO text_generation_launcher: Using Attention = flashinfer

But even with only 16k input and 32k total tokens I get a CUDA out-of-memory error.
With vLLM I can get an 80k-token context length on the same server.

tgi-llama3.1-70b-1  | Traceback (most recent call last):
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1251, in warmup
tgi-llama3.1-70b-1  |     _, batch, _ = self.generate_token(batch)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llama3.1-70b-1  |     return func(*args, **kwds)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1693, in generate_token
tgi-llama3.1-70b-1  |     prefill_logprobs_tensor = torch.log_softmax(out, -1)
tgi-llama3.1-70b-1  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.93 GiB. GPU 2 has a total capacity of 47.53 GiB of which 1.18 GiB is free. Process 790346 has 46.33 GiB memory in use. Of the allocated memory 45.87 GiB is allocated by PyTorch, and 30.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
tgi-llama3.1-70b-1  | 
tgi-llama3.1-70b-1  | The above exception was the direct cause of the following exception:
tgi-llama3.1-70b-1  | 
tgi-llama3.1-70b-1  | Traceback (most recent call last):
tgi-llama3.1-70b-1  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-llama3.1-70b-1  |     sys.exit(app())
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llama3.1-70b-1  |     return get_command(self)(*args, **kwargs)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llama3.1-70b-1  |     return self.main(*args, **kwargs)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llama3.1-70b-1  |     return _main(
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llama3.1-70b-1  |     rv = self.invoke(ctx)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llama3.1-70b-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llama3.1-70b-1  |     return ctx.invoke(self.callback, **ctx.params)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llama3.1-70b-1  |     return __callback(*args, **kwargs)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llama3.1-70b-1  |     return callback(**use_params)  # type: ignore
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 109, in serve
tgi-llama3.1-70b-1  |     server.serve(
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 274, in serve
tgi-llama3.1-70b-1  |     asyncio.run(
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llama3.1-70b-1  |     return loop.run_until_complete(main)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llama3.1-70b-1  |     self.run_forever()
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llama3.1-70b-1  |     self._run_once()
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llama3.1-70b-1  |     handle._run()
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llama3.1-70b-1  |     self._context.run(self._callback, *self._args)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llama3.1-70b-1  |     return await self.intercept(
tgi-llama3.1-70b-1  | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llama3.1-70b-1  |     return await response
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
tgi-llama3.1-70b-1  |     raise error
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
tgi-llama3.1-70b-1  |     return await behavior(request_or_iterator, context)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Warmup
tgi-llama3.1-70b-1  |     max_supported_total_tokens = self.model.warmup(batch)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1253, in warmup
tgi-llama3.1-70b-1  |     raise RuntimeError(
tgi-llama3.1-70b-1  | RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

This is my docker compose file:

services:
  tgi-llama3.1-70b:
#    image: ghcr.io/huggingface/text-generation-inference
    build:
      context: .
      dockerfile: Dockerfile
    restart: always
    shm_size: 64g
    env_file: .env
    environment:
      TRUST_REMOTE_CODE: true
      MODEL_ID: meta-llama/Meta-Llama-3.1-70B-Instruct
      HUGGINGFACE_HUB_CACHE: /data
      MAX_TOTAL_TOKENS: 32768
      MAX_INPUT_TOKENS: 16384
      MAX_STOP_SEQUENCES: 5
      USE_PREFIX_CACHING: true
      FLASH_INFER: true
    volumes:
      - /data/huggingface/hub/:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [ gpu ]

And the Dockerfile:

FROM ghcr.io/huggingface/text-generation-inference

RUN pip install --no-cache-dir flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4

ENTRYPOINT ["/tgi-entrypoint.sh"]

@freegheist

@ErikKaum any plans to look at the OOM issues with large contexts? I still get the OOM (mentioned above) on the latest Docker images regardless of prefix caching, it seems.

@dacox

dacox commented Sep 5, 2024

@ErikKaum @freegheist Yeah, I was evaluating this and trying to do napkin math for GPU memory.

I am unable to run Llama 3.1-8B even at 64k on an A100.

This sheet from Meta seems to imply that 128k should only take 16GB of VRAM.
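
For what it's worth, 16GB is about what the KV cache alone works out to for Llama-3.1-8B at 128k in bf16, so the sheet may be counting only the KV cache and leaving out the ~16GB of bf16 weights plus the transient prefill-logits peak discussed above. Napkin math, assuming 32 layers, 8 KV heads, and head_dim 128:

# KV cache for Llama-3.1-8B at 128k context in bf16.
# Assumptions: 32 layers, 8 KV heads, head_dim 128, 2 bytes/element.
layers, kv_heads, head_dim = 32, 8, 128
context = 131_072

kv_bytes = layers * kv_heads * head_dim * 2 * 2 * context   # K and V, bf16
print(f"KV cache @ 128k: {kv_bytes / 2**30:.1f} GiB")        # ~16 GiB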

@ErikKaum
Member

ErikKaum commented Sep 6, 2024

Hi @freegheist 👋

Sorry for being unclear: the PR was about prefix caching, but we still need to get prefix chunking in. We've had some issues with it, so it's been a bit of a back and forth.

Can't promise when it lands, but we're working hard to get it out 🤞

@imran3180

Same issue. Commenting to increase the priority.

@osmalpkoras

Same issue here.

@giladd123

Same issue.

@Simon-Stone

Same issue

@cancelself

I can't fit this model at 128K either; something is not playing nice here. (I tested vLLM with 128K, no problem.)

Next stop, vLLM!

@imran3180

@ErikKaum @drbh @Narsil Since many people are running into the same problem, is there any plan to prioritize this bug?

@nimishbongale

Same issue!

@rishu931997

Is there a fix planned for this? I'm still unable to increase the context length to more than 40k. Or is there a workaround to increase the context length?

@2016bgeyer

@ErikKaum @Narsil

Was this issue addressed in TGI v3 or in any of the following PRs?
#2673
#2797
#2808

Many people are blocked by this issue, and if it has been resolved, that would be really good to track here.

@ErikKaum
Member

Hi @2016bgeyer 👋

Sorry for not updating here. But I can confirm that with TGI version 3.0.1 I ran this setup and it worked:

I leave all the other fields undefined so that TGI auto-selects them to max out the hardware. So it's not fully up to the 128k context, but close enough IMO. And you'd get up to the full 128k with more VRAM or more aggressive quantization.

Hopefully this helps 👍

@2016bgeyer

Fantastic, thank you for the update!

In the future, is there any chance you guys could track and link issues in your PRs a bit more, at least when multiple people have been blocked by an issue? Thanks!
