Can't run llama3.1-70b at full context #2301

Open

pseudotensor opened this issue Jul 24, 2024 · 30 comments

@pseudotensor

System Info

2.2.0

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

On 4*H100:

docker stop llama31-70b-tgi ; docker remove llama31-70b-tgi
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 131072 \
             --max-total-tokens 139264 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt

I get:

RuntimeError: Not enough memory to handle 131122 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

vLLM works fine without errors.

Expected behavior

Able to launch and use without error, like vLLM.

@pseudotensor
Author

pseudotensor commented Jul 24, 2024

A 65k context gets closer to working, but even that fails!

docker stop llama31-70b-tgi ; docker remove llama31-70b-tgi
source ~/h2ogpt_ops/gr_exports.sh
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 66560 \
             --max-total-tokens 74752 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt

gives:

RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2024-07-24T17:32:16.553191Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1101, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1504, in generate_token
    prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.91 GiB. GPU  has a total capacity of 79.33 GiB of which 1.41 GiB is free. Process 1404711 has 77.91 GiB memory in use. Of the allocated memory 76.30 GiB is allocated by PyTorch, and 27.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2024-07-24T17:32:16.689631Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.699306Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.702328Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.728006Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED

2 prefill tokens? It seems like some bad math is going on.
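
For reference, the 15.91 GiB allocation in the traceback above is almost exactly the size of a full-vocabulary logits tensor over every prefill token. A rough back-of-the-envelope, assuming Llama 3.1's 128256-token vocabulary and 2-byte bf16 activations (not taken from the TGI source):

# Rough estimate of the warmup prefill-logits allocation.
# Assumptions: Llama 3.1 vocab size 128256, bf16 activations (2 bytes/element).
prefill_tokens = 66_610       # max_prefill_tokens from the warmup log above
vocab_size = 128_256
bytes_per_element = 2         # bf16

logits_bytes = prefill_tokens * vocab_size * bytes_per_element
print(f"{logits_bytes / 2**30:.2f} GiB")   # ~15.91 GiB, matching the OOM message

If that reading is right, warmup materializes logits (plus their log_softmax) for the entire prefill at once, which would blow up long before the KV cache itself is the limit.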

@pseudotensor
Author

Only a 32k context actually started:

docker stop llama31-70b-tgi ; docker remove llama31-70b-tgi
source ~/h2ogpt_ops/gr_exports.sh
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 32768 \
             --max-total-tokens 40960 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt
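
Once that 32k container is up, a quick smoke test (a minimal sketch, assuming the 5005:80 port mapping from the command above and TGI's standard /generate endpoint):

# Minimal smoke test against the 32k container launched above.
# Assumes the 5005:80 port mapping from the docker run command.
import requests

resp = requests.post(
    "http://localhost:5005/generate",
    json={
        "inputs": "How long is the Llama 3.1 context window?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])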

@coderchem

TGI does not support it right now; updates are so slow.

@freegheist

freegheist commented Jul 25, 2024

Same problem on llama3.1-70b unquantized on 8xA6000:

Anything above --max-input-tokens=38412 causes OOM (each GPU goes to 36GB of VRAM used out of 48GB total during load, then the OOM happens during the warmup phase in the v2.2.0 Docker image; smaller values scrape through).

After warmup, VRAM usage drops to 21GB per GPU and it works fine (but with 384GB of VRAM total, you'd think a 128k context should be possible):

sudo docker run --rm --name meta-llama_Meta-Llama-3.1-70B-Instruct \
   --gpus all \
   --shm-size 4g \
   -p 7861:80 \
   --ipc host \
   -v $HOME/.cache:/.cache/ \
   -v $HOME/.cache/huggingface/hub/:/data \
   -e VALIDATION_WORKERS=15 \
   -e FLASH_DECODING=1 \
   ghcr.io/huggingface/text-generation-inference:sha-db7e043 \
   --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
   --hostname 0.0.0.0 \
   --num-shard 8 \
   --max-total-tokens 42508 \
   --max-input-tokens 40460 \
   --max-batch-size 1 \
   --cuda-graphs 1

output:

2024-07-25T09:49:25.860540Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 40460
2024-07-25T09:49:25.860548Z  INFO text_generation_launcher: Sharding model on 8 processes
...
2024-07-25T09:50:57.501322Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-07-25T09:51:41.686292Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1101, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1504, in generate_token
    prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.67 GiB. GPU  has a total capacity of 47.44 GiB of which 9.30 GiB is free. Process 1462226 has 38.13 GiB memory in use. Of the allocated memory 37.41 GiB is allocated by PyTorch, and 276.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1103, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 1 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

Then, when I set --max-input-tokens=38412 and --max-total-tokens=42508, it connects, but I'm not sure where it gets this max batch total tokens value of 69888 from:

2024-07-25T10:08:45.810270Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-07-25T10:09:29.103273Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1]
2024-07-25T10:09:29.651189Z  INFO text_generation_router::server: router/src/server.rs:1599: Using scheduler V3
2024-07-25T10:09:29.651204Z  INFO text_generation_router::server: router/src/server.rs:1651: Setting max batch total tokens to 69888
2024-07-25T10:09:30.484670Z  INFO text_generation_router::server: router/src/server.rs:1889: Connected

Something about the load & warmup seems to use more VRAM per GPU than it should when the context is large?
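
If it helps with the napkin math: the 9.67 GiB warmup allocation above is consistent with a full-vocabulary logits tensor over the whole prefill. A sketch of the arithmetic, assuming Llama 3.1's 128256-token vocabulary, bf16, and the 70B architecture (80 layers, 8 KV heads, head_dim 128); this is not taken from the TGI source:

# Back-of-the-envelope for the warmup OOM above.
# Assumptions: Llama 3.1 vocab 128256, bf16 (2 bytes/element).
max_prefill_tokens = 40_460   # default max_batch_prefill_tokens in this run
vocab_size = 128_256

logits_gib = max_prefill_tokens * vocab_size * 2 / 2**30
print(f"prefill logits: {logits_gib:.2f} GiB")            # ~9.67 GiB

# Per-token KV cache for Llama-3.1-70B in bf16, summed across all shards
# (80 layers x 8 KV heads x head_dim 128 x K and V x 2 bytes):
kv_bytes_per_token = 80 * 8 * 128 * 2 * 2
print(f"KV cache: {kv_bytes_per_token / 1024:.0f} KiB/token")   # 320 KiB/token

So the transient prefill-logits peak, not the KV cache, looks like what eats the VRAM during warmup; and if TGI sizes max batch total tokens from whatever is free around that peak, that could also explain the surprisingly small 69888 (a guess, not verified against the TGI source).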

@Jason-CKY
Contributor

Having the same issue. I run into OOM errors even when running Llama 3.1 8B with a 128k context on 2x 80GB A100. It feels like something in the prefill is taking up extra VRAM.

@rishu931997

Facing a similar issue. I'm using 4x A100 80GB, but it throws the same error when I try to set the context length to more than 40k. Is there any fix for this?

@nrepesh

nrepesh commented Jul 26, 2024

Same issue. Commenting for visibility.

@mjsteele12

Same here for 3.1-70B. Just adding that I'm using AWQ and can only run something like ~23k tokens on 2x A6000 Ada (96GB total VRAM), while with vLLM I can run the full 128k with no issue.

@weihanfeng

Same issue on 4x A100 80GB.

@maziyarpanahi
Contributor

I can't fit this model at 128K either; something is not playing nice here. (I tested vLLM with 128K, no problem.)

@badrisnps

The automatic inference of max-batch-prefill-tokens during the warmup phase exceeds the available VRAM, and there seems to be no easy way to control that automatic estimation.

@localmind-ai

Same issue on 4xA5000 (with Marlin FP8 quantization).

@ErikKaum
Member

ErikKaum commented Aug 8, 2024

Hi everyone 👋

Sorry for such a late reply.
Thanks for reporting this issue and bringing it to our attention. We're currently rewriting a bunch of things and a fix for this is among those 👍

It seems that vLLM forces a prefix chunk of 32k (which TGI doesn't), and that causes the discrepancy.
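
To illustrate why chunking helps (this is just the shape of the idea, not TGI's or vLLM's actual code; model, new_kv_cache, and forward below are hypothetical placeholders): processing the prompt in fixed-size chunks bounds transient per-token intermediates such as the logits tensor, while the KV cache still grows to the full prompt length.

# Illustration only: why a fixed prefill/prefix chunk bounds peak memory.
# `model`, `new_kv_cache`, and `forward` are hypothetical placeholders.
CHUNK = 32_768  # e.g. a 32k chunk

def chunked_prefill(model, prompt_ids, chunk=CHUNK):
    """Feed the prompt in chunks: intermediates scale with `chunk`,
    not with the full prompt length; the KV cache still ends up
    covering the whole prompt."""
    kv_cache = model.new_kv_cache()
    logits = None
    for start in range(0, len(prompt_ids), chunk):
        piece = prompt_ids[start:start + chunk]
        # Only the last chunk's logits are needed to sample the first new token.
        logits = model.forward(piece, kv_cache=kv_cache)
    return logits, kv_cache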

@chuddlestonCBANC

Any update on the timing around this?

@ErikKaum
Member

@chuddlestonCBANC it's in the works 🙌
#2402

@raimannma

raimannma commented Aug 20, 2024

@ErikKaum After #2402 got merged, I still can't fit Llama 3.1 on my 4x A6000.

The log says that prefix caching is active:

tgi-llama3.1-70b-1  | 2024-08-20T11:52:04.171854Z  INFO text_generation_launcher: Using prefix caching = True
tgi-llama3.1-70b-1  | 2024-08-20T11:52:04.171911Z  INFO text_generation_launcher: Using Attention = flashinfer

But even with only 16k input and 32k total tokens I get a CUDA out-of-memory error.
With vLLM I can get an 80k-token context length on the same server.

tgi-llama3.1-70b-1  | Traceback (most recent call last):
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1251, in warmup
tgi-llama3.1-70b-1  |     _, batch, _ = self.generate_token(batch)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llama3.1-70b-1  |     return func(*args, **kwds)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1693, in generate_token
tgi-llama3.1-70b-1  |     prefill_logprobs_tensor = torch.log_softmax(out, -1)
tgi-llama3.1-70b-1  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.93 GiB. GPU 2 has a total capacity of 47.53 GiB of which 1.18 GiB is free. Process 790346 has 46.33 GiB memory in use. Of the allocated memory 45.87 GiB is allocated by PyTorch, and 30.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
tgi-llama3.1-70b-1  | 
tgi-llama3.1-70b-1  | The above exception was the direct cause of the following exception:
tgi-llama3.1-70b-1  | 
tgi-llama3.1-70b-1  | Traceback (most recent call last):
tgi-llama3.1-70b-1  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-llama3.1-70b-1  |     sys.exit(app())
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llama3.1-70b-1  |     return get_command(self)(*args, **kwargs)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llama3.1-70b-1  |     return self.main(*args, **kwargs)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llama3.1-70b-1  |     return _main(
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llama3.1-70b-1  |     rv = self.invoke(ctx)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llama3.1-70b-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llama3.1-70b-1  |     return ctx.invoke(self.callback, **ctx.params)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llama3.1-70b-1  |     return __callback(*args, **kwargs)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llama3.1-70b-1  |     return callback(**use_params)  # type: ignore
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 109, in serve
tgi-llama3.1-70b-1  |     server.serve(
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 274, in serve
tgi-llama3.1-70b-1  |     asyncio.run(
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llama3.1-70b-1  |     return loop.run_until_complete(main)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llama3.1-70b-1  |     self.run_forever()
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llama3.1-70b-1  |     self._run_once()
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llama3.1-70b-1  |     handle._run()
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llama3.1-70b-1  |     self._context.run(self._callback, *self._args)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llama3.1-70b-1  |     return await self.intercept(
tgi-llama3.1-70b-1  | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llama3.1-70b-1  |     return await response
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
tgi-llama3.1-70b-1  |     raise error
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
tgi-llama3.1-70b-1  |     return await behavior(request_or_iterator, context)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Warmup
tgi-llama3.1-70b-1  |     max_supported_total_tokens = self.model.warmup(batch)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1253, in warmup
tgi-llama3.1-70b-1  |     raise RuntimeError(
tgi-llama3.1-70b-1  | RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

This is my docker compose file:

services:
  tgi-llama3.1-70b:
#    image: ghcr.io/huggingface/text-generation-inference
    build:
      context: .
      dockerfile: Dockerfile
    restart: always
    shm_size: 64g
    env_file: .env
    environment:
      TRUST_REMOTE_CODE: true
      MODEL_ID: meta-llama/Meta-Llama-3.1-70B-Instruct
      HUGGINGFACE_HUB_CACHE: /data
      MAX_TOTAL_TOKENS: 32768
      MAX_INPUT_TOKENS: 16384
      MAX_STOP_SEQUENCES: 5
      USE_PREFIX_CACHING: true
      FLASH_INFER: true
    volumes:
      - /data/huggingface/hub/:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [ gpu ]

And the Dockerfile:

FROM ghcr.io/huggingface/text-generation-inference

RUN pip install --no-cache-dir flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4

ENTRYPOINT ["/tgi-entrypoint.sh"]

@freegheist

@ErikKaum any plans to look at the OOM issues with large contexts? I still get the OOM (mentioned above) on the latest Docker images regardless of prefix caching, it seems.

@dacox

dacox commented Sep 5, 2024

@ErikKaum @freegheist Yeah, I was evaluating this and trying to do napkin math for GPU memory.

I am unable to run Llama 3.1-8B even at 64k on an A100.

This sheet from Meta seems to imply that 128k should only take 16GB of VRAM.
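
For what it's worth, 16GB is about what the KV cache alone works out to for Llama-3.1-8B at 128k in bf16, so the sheet may be counting only the KV cache and leaving out the ~16GB of bf16 weights plus the transient prefill-logits peak discussed above. Napkin math, assuming 32 layers, 8 KV heads, and head_dim 128:

# KV cache for Llama-3.1-8B at 128k context in bf16.
# Assumptions: 32 layers, 8 KV heads, head_dim 128, 2 bytes/element.
layers, kv_heads, head_dim = 32, 8, 128
context = 131_072

kv_bytes = layers * kv_heads * head_dim * 2 * 2 * context   # K and V, bf16
print(f"KV cache @ 128k: {kv_bytes / 2**30:.1f} GiB")        # ~16 GiB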

@ErikKaum
Member

ErikKaum commented Sep 6, 2024

Hi @freegheist 👋

Sorry for being unclear: the PR was about prefix caching, but we still need to get prefix chunking in. We've had some issues with it, so it's been a bit of a back and forth.

Can't promise when it lands, but we're working hard to get it out 🤞

@imran3180

Same issue. Commenting to increase the priority.

@osmalpkoras

Same issue here.

@giladd123

Same issue.

@Simon-Stone

Same issue

@cancelself

I can't fit this model at 128K either; something is not playing nice here. (I tested vLLM with 128K, no problem.)

Next stop, vLLM!

@imran3180

@ErikKaum @drbh @Narsil Since many people are running into the same problem, is there any plan to prioritize this bug?

@nimishbongale

Same issue!

@rishu931997

Is there a fix planned for this? I'm still unable to increase the context length to more than 40k. Or is there a workaround to increase the context length?

@2016bgeyer

@ErikKaum @Narsil

Was this issue addressed in TGI v3 or in any of the following PRs?
#2673
#2797
#2808

Many people are blocked by this issue, and if it has been resolved, that would be really good to track here.

@ErikKaum
Member

Hi @2016bgeyer 👋

Sorry for not updating here. But I can confirm that with TGI version 3.0.1 I ran this setup and it worked:

I leave all the other fields undefined so that TGI auto-selects them to max out the hardware. So it's not fully up to the 128k context, but close enough IMO. And you'd get up to the full 128k with more VRAM or more aggressive quantization.

Hopefully this helps 👍

@2016bgeyer

Fantastic, thank you for the update!

In the future, is there any chance you guys could track and link issues in your PRs a bit more, at least when multiple people have been blocked by an issue? Thanks!
