feat: supports openai chat completions API #1427
Conversation
Fixes #735
Nice!
A few comments here and there but nothing big.
e7a6fb2 to 583e08b
**commits above are from a rebase to resolve merge conflicts
Pre-emptive LGTM so you can merge once you use `prefill.len()` instead of the validated length.
It's more correct (since it goes through the real python tokenizer used by the model) and avoids creating that weird duplication in the struct (which I'm going to add anyway here: https://github.com/huggingface/text-generation-inference/pull/1436/files, but for different reasons).
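For illustration only, a minimal Python sketch (hypothetical names, not the router's Rust code) of what deriving the usage counts from the returned prefill looks like, rather than carrying a separately validated length on the struct:

```python
# Minimal sketch (hypothetical names): build the usage block from the prefill
# returned by the model's real tokenizer rather than a separate validated length.
def build_usage(prefill_tokens: list, generated_tokens: list) -> dict:
    prompt_tokens = len(prefill_tokens)        # i.e. prefill.len() on the Rust side
    completion_tokens = len(generated_tokens)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```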
prefer PR from original repo rather than fork to run CI #1408
db3e152 to 4555e87
**commits above are from a rebase to resolve merge conflicts
update: the latest commit adds support for loading the config from a local file. In order to support as many configurations as possible, a new CLI argument was added: `cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0 --tokenizer-config-path ./valid_tokenizer_config.json`

Possible Config States
Would it be possible to configure the "route" for that new chat completion API via an environment variable/CLI arg? I am asking since SageMaker only exposes one route. cc @jeffboudier
@drbh we can do this in a follow-up. We're talking about changing this: https://github.com/huggingface/text-generation-inference/blob/main/router/src/server.rs#L697 through a CLI argument (both in launcher and router I think).
I have the same issue with the quantized deepseek model.
Hi @paulcx and @UniverseFly, unfortunately I cannot reproduce this issue. I'm using the following command

```bash
text-generation-launcher \
    --model-id TheBloke/deepseek-coder-6.7B-instruct-AWQ \
    --quantize awq \
    --tokenizer-config-path ~/deepseek-coder-tokenizer-config.json
```

where

```json
{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": "<|begin▁of▁sentence|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|EOT|>",
  "legacy": true,
  "model_max_length": 16384,
  "pad_token": "<|end▁of▁sentence|>",
  "sp_model_kwargs": {},
  "unk_token": null,
  "tokenizer_class": "LlamaTokenizerFast",
  "chat_template": "{%- set found_item = false -%}\n{%- for message in messages -%}\n    {%- if message['role'] == 'system' -%}\n        {%- set found_item = true -%}\n    {%- endif -%}\n{%- endfor -%}\n{%- if not found_item -%}\n{{'You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\\n'}}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'system' %}\n{{ message['content'] }}\n    {%- else %}\n        {%- if message['role'] == 'user' %}\n{{'### Instruction:\\n' + message['content'] + '\\n'}}\n        {%- else %}\n{{'### Response:\\n' + message['content'] + '\\n<|EOT|>\\n'}}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{{'### Response:\\n'}}\n"
}
```

request

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": false, "max_tokens": 20, "seed": 0 }' \
    -H 'Content-Type: application/json'
```

response

```json
{"id":"","object":"text_completion","created":1706571866,"model":"TheBloke/deepseek-coder-6.7B-instruct-AWQ","system_fingerprint":"1.4.0-native","choices":[{"index":0,"message":{"role":"assistant","content":"As an AI Programming Assistant, I primarily focus on providing information and answering questions related to computer science"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":89,"completion_tokens":20,"total_tokens":109}}
```

Additionally, the following will fail to load the tokenizer_config from the hub as expected, because the tokens specified in the config are not strings as mentioned above. link to config

```bash
text-generation-launcher \
    --model-id TheBloke/deepseek-coder-6.7B-instruct-AWQ \
    --quantize awq
```

I hope this information is helpful! Thanks!
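For reference, a rough sketch of the difference being described: on the hub, special tokens in tokenizer_config.json can be serialized as objects rather than plain strings, so a loader that only accepts strings fails. One illustrative way to normalize them in Python (not the actual router code):

```python
import json

def normalize_token(value):
    """Accept either a plain string ("<|EOT|>") or an AddedToken-style dict."""
    if isinstance(value, dict):
        return value.get("content")
    return value

with open("tokenizer_config.json") as f:
    config = json.load(f)

for key in ("bos_token", "eos_token", "pad_token", "unk_token"):
    if key in config:
        config[key] = normalize_token(config[key])

print(config.get("eos_token"))
```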
@drbh I loaded the model from the local directory in the launcher arguments (`--model-id /data/deepseek-coder-6.7B-instruct-AWQ --quantize awq`). In this instance, the model directory is "/data/deepseek-coder-6.7B-instruct-AWQ" within the docker env, and I'm curious whether the launcher code will automatically locate the tokenizer_config.json in this directory by default, or something else. Could this be the cause of the error mentioned?
@drbh thanks for your response. I was trying the v1.4.0 docker image. Unfortunately I do not have the privilege to install the required dependencies to build the
Oh I made a silly mistake.. Since I was using docker, the
This tiny PR just prints the parsing error when a tokenizer config fails to load. This is helpful when a chat_template won't load due to formatting issues. huggingface#1427 (comment)
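In the same spirit, a tiny Python illustration of surfacing the parse error instead of silently ignoring the config (hypothetical; the actual change lives in the Rust router):

```python
import json

try:
    with open("tokenizer_config.json") as f:
        tokenizer_config = json.load(f)
except (OSError, json.JSONDecodeError) as err:
    # Print the underlying parsing error so formatting issues in chat_template
    # are visible instead of the config silently being dropped.
    print(f"Unable to parse tokenizer config: {err}")
    tokenizer_config = None
```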
This PR adds the `tokenizer-config-path` to the launcher and passes it to the router. Fixes: huggingface#1427
@drbh Would you please verify the issue with the local-model launcher args?
I commented about the local model's tokenizer config not being found here: #1427 (comment). I think that the server/launcher should look in the model directory instead of the working directory (or in addition to the working directory).
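A rough sketch of the lookup order being suggested (illustrative Python, not the launcher's actual logic): prefer a tokenizer_config.json sitting next to the model weights, then fall back to the working directory.

```python
from pathlib import Path

def find_tokenizer_config(model_id: str):
    """Prefer a tokenizer_config.json next to the model weights, then fall back
    to the current working directory."""
    candidates = [
        Path(model_id) / "tokenizer_config.json",  # e.g. /data/deepseek-coder-6.7B-instruct-AWQ
        Path.cwd() / "tokenizer_config.json",
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None
```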
Hi @paulcx, I believe this PR will resolve your issue: #1518. Originally the code was looking for the tokenizer_config in the local directory, but now it should correctly check the same directory as the model files. Would you be able to try the latest changes and let me know if that resolves it for you? Thanks!
Cool, I can confirm that PR solved the issue. @drbh BTW: 1. do you think it would be better to put the chat_template into the /info API? 2. So far I cannot debug the chat_template and cannot see whether it works as expected.
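On point 2, one way to debug a chat_template outside of TGI is to render it directly with Jinja2 against sample messages. This is a standalone sketch; TGI's own rendering may differ in details (e.g. which variables it passes in), and templates using non-standard constructs like `raise_exception` will need extra handling.

```python
import json
from jinja2 import Environment

with open("tokenizer_config.json") as f:
    config = json.load(f)

# Render the chat_template against a sample conversation to inspect the prompt.
template = Environment().from_string(config["chat_template"])
prompt = template.render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    bos_token=config.get("bos_token"),
    eos_token=config.get("eos_token"),
    add_generation_prompt=True,
)
print(prompt)
```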
@drbh I attempted to switch several plugins that support the OpenAI API (such as genieAI and sider) over to the TGI API, but the result was that the conversations could not be stopped. Specifically, the TGI OpenAI API (stream mode) seems to end differently compared to the output (length?) from the OpenAI API. I used to build a custom OpenAI API server and yield ServerSentEvent("[DONE]") at the end of the stream output. Not sure if this helps?
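For context, many OpenAI-style clients stop reading only when they see a final `data: [DONE]` event after the last chunk. A minimal client-side sketch (assuming the `requests` library and a locally running TGI; illustrative, not taken from any plugin's code) that stops on that sentinel:

```python
import json
import requests

resp = requests.post(
    "http://localhost:3000/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "stream": True,
        "max_tokens": 20,
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":          # the sentinel many OpenAI clients stop on
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```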
Fix a small inconsistency compared to OpenAI's chat-completion behavior (introduced in #1427 cc @drbh). When using `stream=True`, each chunk has an `index` value in `ChatCompletionChoice`. This index is not meant to be the index of the generated token but the index of the choice, which is always 0 (since TGI always returns a single choice).

See https://platform.openai.com/docs/api-reference/chat/object:

> index _integer_
> The index of the choice in the list of choices.

---

So instead of

```js
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":1,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":2,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":3,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]}
```

it should return

```js
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]}
```

**EDIT:** I also edited ToolCall.index to always be `0` (instead of the generated token index), but for this one I'm actually unsure. It might be the index of the tool in the array of tools? OpenAI's documentation doesn't provide any information about it:

> index _integer_

---

I also noticed that in OpenAI's example, the last chunk doesn't have a delta and is the only one that has a `finish_reason`. TGI is slightly different since the last chunk has both the last delta (i.e. the last generated token) and the finish reason. I don't think this is worth fixing since it is not a requirement according to the docs/specs (at least not that I know of).
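To illustrate why the index matters: OpenAI-style clients typically bucket streamed deltas by `choices[i].index`, so a monotonically increasing index makes each token look like a new choice. A small Python sketch of that accumulation (assuming the chunks above have already been parsed into dicts):

```python
from collections import defaultdict

def accumulate(chunks):
    """Group streamed deltas by choices[i].index, as OpenAI-style clients do."""
    texts = defaultdict(str)
    for chunk in chunks:
        for choice in chunk["choices"]:
            texts[choice["index"]] += choice["delta"].get("content", "")
    return dict(texts)

# With index fixed at 0, the three deltas above accumulate into {0: "I'm"};
# with indices 1, 2, 3 they would wrongly appear as three separate choices.
```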
This PR adds support to make TGI a drop-in replacement for OpenAI clients by exposing the same HTTP interface.

Notes
- TGI inits a single model at startup, so the `model` field is unused in HTTP requests.
- `max_tokens` and `stream` should work as expected, but other params may be unimplemented or not supported.

General approach
- fetch the `tokenizer_config` at startup from the hub
- pass `tokenizer_config` into `Infer` so we have it at request time
- use the `chat_template` on the config to format the chat request
- parse the jinja template and render the chat string
- pass inputs into the existing generate function
- wrap the generation output in the expected structure before returning

How to test

Streaming curl

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "tgi", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is deep learning?" } ], "stream": true, "max_tokens": 20 }' \
    -H 'Content-Type: application/json'
```

It is also possible to use the `openai` python library and change the base url

🌊 STREAMING REQUEST

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

for message in chat_completion:
    print(message)
```

🚗 SYNCHRONOUS REQUEST

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=False
)

print(chat_completion)
```

How to run dev

```bash
cd text-generation-inference/server
MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 text-generation-server serve --trust-remote-code gpt2
```

***note: many of the existing `chat_templates` use non-standard `jinja` (i.e. adding a `raise` to the template) which will throw an error when parsing; hence using `upstage/SOLAR-10.7B-Instruct-v1.0` since it has a valid template

```bash
cd text-generation-inference/router
cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0
```

trigger

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": true, "max_tokens": 20, "logprobs": true }' \
    -H 'Content-Type: application/json'
```

^ supports `stream: true` and `stream: false` requests
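A compact sketch of the general approach described above (illustrative client-side Python, not the router's Rust implementation; it uses TGI's existing `/generate` endpoint and hardcodes `finish_reason` for brevity):

```python
import requests
from jinja2 import Environment

def chat_completion(base_url, tokenizer_config, messages, max_tokens=20):
    # 1) render the chat_template from the tokenizer config into a single prompt
    template = Environment().from_string(tokenizer_config["chat_template"])
    inputs = template.render(messages=messages, add_generation_prompt=True,
                             bos_token=tokenizer_config.get("bos_token"),
                             eos_token=tokenizer_config.get("eos_token"))
    # 2) pass the rendered prompt to the existing generate endpoint
    resp = requests.post(
        f"{base_url}/generate",
        json={"inputs": inputs, "parameters": {"max_new_tokens": max_tokens}},
    ).json()
    # 3) wrap the generation output in the chat-completion shape before returning
    return {
        "object": "text_completion",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": resp["generated_text"]},
            "finish_reason": "length",  # hardcoded here for brevity
        }],
    }
```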
…pass temp and top-k from API (huggingface#1470)

This PR makes some minor tweaks to the new OpenAI-compatible chat endpoint huggingface#1427 in `GenerateParameters`:
- Disables `decoder_input_details` when streaming is enabled. This was causing all streaming chat requests to fail before, since [`decoder_input_details` == true is not enabled when streaming tokens](https://github.com/huggingface/text-generation-inference/blob/98e5faff9daec6170cc2b0f963f2d73cf846b341/router/src/validation.rs#L406).
- Passes through the `temperature` and `top_p` hyperparameters from the API request to `GenerateParameters`.

## Testing

```bash
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{ "model": "", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is deep learning?" } ], "stream": true, "max_tokens": 20 }' \
    -H 'Content-Type: application/json'
```

Should work correctly. Currently, the most recent release from `main` returns an error:

```
data:{"error":"Input validation error: `decoder_input_details` == true is not supported when streaming tokens","error_type":"validation"}
```

It's my first time contributing to this project, so I could be missing something. Would especially appreciate @drbh's eyes on this one.
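A hedged sketch of the parameter mapping this PR describes (Python pseudostructure; the field names mirror `GenerateParameters`, but this is not the actual Rust code):

```python
def to_generate_parameters(req: dict) -> dict:
    """Map OpenAI-style chat request fields onto TGI's GenerateParameters."""
    return {
        "max_new_tokens": req.get("max_tokens", 20),
        "temperature": req.get("temperature"),   # now passed through from the API request
        "top_p": req.get("top_p"),               # now passed through from the API request
        "seed": req.get("seed"),
        # Must be disabled when streaming: validation rejects
        # `decoder_input_details` == true while streaming tokens.
        "decoder_input_details": not req.get("stream", False),
    }
```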
Hi @drbh