
Error "Failed to buffer the request body: length limit exceeded" when supplying base64 encoded images greater than 1MB in prompt #1802

Closed
akowalsk opened this issue Apr 24, 2024 · 11 comments


@akowalsk
Contributor

System Info

text-generation-launcher --env

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 2d0a7173d4891e7cd5f9b77f8e0987b82a339e51
Docker label: sha-2d0a717
nvidia-smi:
Wed Apr 24 19:58:49 2024
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
   |  0%   23C    P8             16W /  350W |    8450MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   1  NVIDIA GeForce RTX 3090        On  |   00000000:21:00.0 Off |                  N/A |
   |  0%   25C    P8             20W /  350W |    8418MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+

model info

{
  "model_id": "/opt/ml/checkpoint/llava-v1.6-mistral-7b-hf",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 24576,
  "max_total_tokens": 32768,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 65536,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "max_client_batch_size": 4,
  "version": "2.0.1",
  "sha": "2d0a7173d4891e7cd5f9b77f8e0987b82a339e51",
  "docker_label": "sha-2d0a717"
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Use an image that is larger than 1 MB, and set IMAGE_PATH and API_ENDPOINT appropriately:

from PIL import Image
import requests
import base64
from io import BytesIO

# load the image from disk
image = Image.open(IMAGE_PATH)

# convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="PNG")  # use the appropriate format (e.g. JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode("utf-8")

# embed the image in the prompt as a markdown image with a data URI
image_string = f"data:image/png;base64,{base64_image}"
query = "Describe the image?"
prompt = f"[INST] ![]({image_string})\n{query} [/INST]"

headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
}

payload = {"inputs": prompt}
response = requests.post(f"{API_ENDPOINT}/generate", headers=headers, json=payload)
try:
    print(response.json())
except ValueError:  # response body was not JSON (e.g. a plain-text error)
    print(response.text)

This will print: Failed to buffer the request body: length limit exceeded

With an image smaller than 1 MB, it generates correctly.
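Note that the request body is noticeably larger than the image file on disk: the repro re-encodes the image as PNG (which can grow the file) and base64 then inflates it by roughly 4/3. A quick check of the actual body size, reusing the payload dict from the snippet above, makes this visible:

import json

body_bytes = json.dumps(payload).encode("utf-8")
print(f"request body size: {len(body_bytes) / (1024 * 1024):.2f} MiB")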

Expected behavior

It should generate text for the image as long as it fits within the model's context. Judging by the error text and its similarity to tokio-rs/axum#1652, this appears to be related to Axum's default request body size limit.
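Until the server-side limit is configurable, a possible client-side workaround is to shrink the image before embedding it. The sketch below is only an illustration, not project code: the 1,000,000-byte target and the JPEG quality/resize steps are arbitrary assumptions, and IMAGE_PATH is the same placeholder as in the repro above.

from io import BytesIO
import base64

from PIL import Image

def image_to_base64_under_limit(path, max_bytes=1_000_000):
    """Shrink and re-encode an image until its base64 form fits under max_bytes."""
    image = Image.open(path).convert("RGB")
    quality = 90
    while True:
        buffer = BytesIO()
        image.save(buffer, format="JPEG", quality=quality)
        encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
        if len(encoded) <= max_bytes:
            return encoded
        if quality > 40:
            quality -= 10  # try stronger JPEG compression first
        else:
            # then halve the resolution and start over at high quality
            image = image.resize((image.width // 2, image.height // 2))
            quality = 90

image_string = f"data:image/jpeg;base64,{image_to_base64_under_limit(IMAGE_PATH)}"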

@ktrapeznikov

Likely related to #1777.

@akowalsk
Contributor Author

I've also encountered that problem, but the length-limit-exceeded error also occurs with the idefics-9b-instruct model. That model works with images of varying dimensions, but it still fails when the image is large (over 1 MB).


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 26, 2024
@akowalsk
Contributor Author

I will revalidate on the latest TGI version shortly.

@github-actions github-actions bot removed the Stale label May 31, 2024
@akowalsk
Contributor Author

I tried this again with the latest version, using the idefics2-8b-chatty model instead of the llava model, and the issue persists.

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.78.0
Commit sha: f426a3398d12808f20c101487329e563d32bfbaf
Docker label: sha-f426a33
nvidia-smi:
Fri Jun 21 20:35:18 2024
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
   |  0%   30C    P8             18W /  350W |   15380MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   1  NVIDIA GeForce RTX 3090        On  |   00000000:21:00.0 Off |                  N/A |
   |  0%   30C    P8             22W /  350W |   15380MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+

   +-----------------------------------------------------------------------------------------+
   | Processes:                                                                              |
   |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
   |        ID   ID                                                               Usage      |
   |=========================================================================================|
   +-----------------------------------------------------------------------------------------+

model info

{
	"model_id": "/opt/ml/checkpoint/idefics2-8b-chatty",
	"model_sha": null,
	"model_dtype": "torch.float16",
	"model_device_type": "cuda",
	"model_pipeline_tag": null,
	"max_concurrent_requests": 128,
	"max_best_of": 2,
	"max_stop_sequences": 4,
	"max_input_length": 24576,
	"max_total_tokens": 32768,
	"waiting_served_ratio": 0.3,
	"max_batch_total_tokens": 192080,
	"max_waiting_tokens": 20,
	"max_batch_size": null,
	"validation_workers": 2,
	"max_client_batch_size": 4,
	"router": "text-generation-router",
	"version": "2.0.4",
	"sha": "f426a3398d12808f20c101487329e563d32bfbaf",
	"docker_label": "sha-f426a33"
}


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 22, 2024
@akowalsk
Contributor Author

I tried to replicate this on the latest TGI version (2.2) and ended up with a different error:

{"timestamp":"2024-07-25T17:50:30.156102Z","level":"ERROR","message":"Server error: 'Tensor' object has no attribute 'input_lengths'","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":46,"span":{"size":1,"name":"decode"},"spans":[{"batch_size":1,"name":"batch"},{"name":"decode"},{"size":1,"name":"decode"},{"size":1,"name":"decode"}]}
{"timestamp":"2024-07-25T17:50:30.149213Z","level":"ERROR","fields":{"message":"Method Decode encountered an error.\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 309, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 723, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 193, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 692, in wrapper\n return callback(**use_params)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 118, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 297, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n File \"/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py\", line 165, in invoke_intercept_method\n return await self.intercept(\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py\", line 21, in intercept\n return await response\n File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 120, in _unary_interceptor\n raise error\n File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 111, in _unary_interceptor\n return await behavior(request_or_iterator, context)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 183, in Decode\n generations, next_batch, timings = self.model.generate_token(batch)\n File \"/opt/conda/lib/python3.10/contextlib.py\", line 79, in inner\n return func(*args, **kwds)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 1376, in generate_token\n out, speculative_logits = self.forward(batch, adapter_data)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py\", line 351, in forward\n logits, speculative_logits = self.model.forward(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py\", line 824, in forward\n hidden_states = self.text_model.model(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in 
_wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 447, in forward\n hidden_states, residual = layer(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 372, in forward\n attn_output = self.self_attn(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 235, in forward\n attn_output = paged_attention(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/attention/cuda.py\", line 116, in paged_attention\n input_lengths = seqlen.input_lengths\nAttributeError: 'Tensor' object has no attribute 'input_lengths'"},"target":"text_generation_launcher"}

@github-actions github-actions bot removed the Stale label Jul 26, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 25, 2024
@akowalsk
Contributor Author

Still experiencing the issue.

@github-actions github-actions bot removed the Stale label Aug 27, 2024
@giladd123

giladd123 commented Sep 19, 2024

Also experiencing this issue when running with this model.

@akowalsk
Contributor Author

The addition of the --payload-limit option in TGI 3.0 fixed this issue.
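For anyone landing here later: my understanding is that the flag takes a byte count on the launcher command line (and is passed through the same way when appended to the Docker run command). The value below is only an example, not a recommended setting:

text-generation-launcher --payload-limit 10000000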
