
feat: supports openai chat completions API #1427

Merged
merged 10 commits · Jan 16, 2024

Conversation

drbh
Collaborator

@drbh drbh commented Jan 10, 2024

This PR adds support to make TGI a drop-in replacement for OpenAI clients by exposing the same HTTP interface.

Notes

  • TGI inits a single model at startup so the model field is unused in HTTP requests.
  • max_tokens and stream should work as expected, but other params may be unimplemented or unsupported

General approach

  • fetch the tokenizer_config at startup from the hub
  • pass tokenizer_config into Infer so we have it at request time
  • use the chat_template on the config to format chat request
  • parse the Jinja template and render the chat string (see the sketch after this list)
  • pass inputs into existing generate function
  • wrap generation output in expected structure before returning
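
For illustration, here is a minimal Python sketch of the template-rendering step using `jinja2` (the router itself does this in Rust; the template string below is an invented example, not any particular model's):

```python
from jinja2 import BaseLoader, Environment

# Illustrative chat_template in the style found in tokenizer_config.json;
# real templates vary per model.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>\n{{ message['content'] }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"},
]

# Parse the template and render the chat string; the rendered prompt is what
# gets passed as `inputs` to the existing generate path.
template = Environment(loader=BaseLoader()).from_string(CHAT_TEMPLATE)
prompt = template.render(messages=messages)
print(prompt)
```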

How to test

Streaming curl

curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'

It is also possible to use the `openai` Python library and change the base URL

🌊 STREAMING REQUEST

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)

# ChatCompletionChunk(id='', choices=[Choice(delta=ChoiceDelta(content=' that', function_call=None, role='assistant', tool_calls=None), finish_reason=None, index=2, logprobs=None)], created=1704486761, model='', object='text_completion', system_fingerprint='')
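
If you want the assembled text rather than the raw chunks, the streamed deltas can be concatenated client-side; a small sketch using the same client setup as above:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

stream = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# accumulate the per-chunk deltas into the full completion text
text = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content is not None:
        text += delta.content
print(text)
```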

🚗 SYNCHRONOUS REQUEST

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=False
)

print(chat_completion)
# ChatCompletion(id='', choices=[Choice(finish_reason=None, index=0, logprobs=None, message=ChatCompletionMessage(content='\nDeep learning is a new field of research that has been gaining traction in the last ...', role='assistant', function_call=None, tool_calls=None))], created=1704486762, model='', object='text_completion', system_fingerprint='', usage=CompletionUsage(completion_tokens=100, prompt_tokens=76, total_tokens=176))

How to run dev

cd text-generation-inference/server
MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 text-generation-server serve --trust-remote-code gpt2

***Note: many of the existing chat_templates use non-standard Jinja (i.e. adding a raise to the template), which will throw an error when parsing; hence we use upstage/SOLAR-10.7B-Instruct-v1.0 since it has a valid template

cd text-generation-inference/router
cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0
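
To check ahead of time whether a model's chat_template at least parses, one option is to load it with `transformers` and run it through plain `jinja2` (a rough sketch; the Rust-side parser may still reject constructs that Python's jinja2 accepts):

```python
from jinja2 import BaseLoader, Environment, TemplateSyntaxError
from transformers import AutoTokenizer

def check_chat_template(model_id: str) -> None:
    """Report whether a model's chat_template parses with plain Jinja2."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    template = getattr(tokenizer, "chat_template", None)
    if template is None:
        print(f"{model_id}: no chat_template in tokenizer_config")
        return
    try:
        Environment(loader=BaseLoader()).parse(template)
        print(f"{model_id}: chat_template parses")
    except TemplateSyntaxError as err:
        print(f"{model_id}: chat_template failed to parse: {err}")

check_chat_template("upstage/SOLAR-10.7B-Instruct-v1.0")
```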

trigger

curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": true, "max_tokens": 20, "logprobs": true }' \
    -H 'Content-Type: application/json'

^ supports stream: true and stream: false requests

@Narsil
Collaborator

Narsil commented Jan 10, 2024

Fixes #735

Member

@OlivierDehaene OlivierDehaene left a comment


Nice!
A few comments here and there but nothing big.

@drbh
Collaborator Author

drbh commented Jan 10, 2024

**commits above are from a rebase to resolve merge conflicts

@drbh drbh requested a review from OlivierDehaene January 10, 2024 19:20
Narsil previously approved these changes Jan 11, 2024
Collaborator

@Narsil Narsil left a comment


Pre-emptive LGTM so you can merge once you've used prefill.len() instead of the validated length.

It's more correct (since it goes through the real python tokenizer used by the model) and avoids creating that weird duplication in the struct (that I'm going to add anyway here: https://github.com/huggingface/text-generation-inference/pull/1436/files but for different reasons)

@drbh drbh force-pushed the support-chat-completions-endpoint branch from db3e152 to 4555e87 Compare January 11, 2024 18:52
@drbh
Collaborator Author

drbh commented Jan 11, 2024

**commits above are from a rebase to resolve merge conflicts

@drbh
Collaborator Author

drbh commented Jan 15, 2024

Update: the latest commit adds support for loading the config from a local file. In order to support as many configurations as possible, a new CLI argument was added: --tokenizer-config-path. If this argument is supplied we load the config from that file; otherwise we fall back to looking for tokenizer_config.json in the working directory, or attempt to fetch it from the HF Hub.

cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0 --tokenizer-config-path ./valid_tokenizer_config.json

Possible Config States

| Condition | Local Model Available | Local Config Available | Revision Specified | Action for Tokenizer/Model Info | Action for Tokenizer Config |
|---|---|---|---|---|---|
| 1 | Yes | Yes | - | Load locally | Load locally |
| 2 | Yes | No | - | Load locally | Load from API |
| 3 | No | Yes | Yes | Load from API | Load locally |
| 4 | No | No | Yes | Load from API | Load from API |
| 5 | Yes | Yes | Yes | Load locally | Load locally |
| 6 | Yes | No | Yes | Load locally | Load from API |
| 7 | No | Yes | No | Load from API | Load locally |
| 8 | No | No | No | Load from API | Load from API |
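
For illustration only, the fallback order described above corresponds roughly to logic like this (a Python sketch of the intent, not the router's actual Rust code; names and paths are assumptions):

```python
import os
from typing import Optional

def resolve_tokenizer_config(model_id: str, tokenizer_config_path: Optional[str]) -> str:
    """Sketch of the lookup order: explicit CLI path, then a local config, then the Hub."""
    if tokenizer_config_path is not None:
        # --tokenizer-config-path always wins when supplied
        return tokenizer_config_path
    local_config = os.path.join(os.getcwd(), "tokenizer_config.json")
    if os.path.isfile(local_config):
        # local tokenizer_config.json found (as described in this comment;
        # later in the thread this lookup moves to the model directory)
        return local_config
    # otherwise fetch tokenizer_config.json from the Hugging Face Hub (API)
    return f"hub://{model_id}/tokenizer_config.json"

print(resolve_tokenizer_config("upstage/SOLAR-10.7B-Instruct-v1.0", None))
```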

@philschmid
Member

philschmid commented Jan 15, 2024

Would it be possible to configure the "route" for the new chat completion API via an environment variable/CLI arg? I am asking since SageMaker only exposes one route, /invocations. By using an env var, users could enable the chat API when they want to use it, e.g. we could have --oai-chat-route /invocations.

cc @jeffboudier

@Narsil
Collaborator

Narsil commented Jan 15, 2024

@drbh we can do this in a follow-up. We're talking about changing this: https://github.com/huggingface/text-generation-inference/blob/main/router/src/server.rs#L697 through a CLI argument (both in the launcher and the router, I think).

@drbh drbh requested review from Narsil and OlivierDehaene January 15, 2024 17:39
@paulcx

paulcx commented Jan 29, 2024

@drbh I can confirm the local tokenizer_config file is found when setting tokenizer-config-path in the launcher args in #1495. I'm wondering if something is wrong with the default path that causes the "Could not find tokenizer config locally and no revision specified" error?

@UniverseFly

Have the same issue with the quantized deepseek model

@drbh
Collaborator Author

drbh commented Jan 29, 2024

Hi @paulcx and @UniverseFly, unfortunately I cannot reproduce this issue on main.

I'm using the following command

text-generation-launcher \
  --model-id TheBloke/deepseek-coder-6.7B-instruct-AWQ \
  --quantize awq \
  --tokenizer-config-path ~/deepseek-coder-tokenizer-config.json

where the contents of deepseek-coder-tokenizer-config.json are

{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": "<|begin▁of▁sentence|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|EOT|>",
  "legacy": true,
  "model_max_length": 16384,
  "pad_token": "<|end▁of▁sentence|>",
  "sp_model_kwargs": {},
  "unk_token": null,
  "tokenizer_class": "LlamaTokenizerFast",
  "chat_template": "{%- set found_item = false -%}\n{%- for message in messages -%}\n    {%- if message['role'] == 'system' -%}\n        {%- set found_item = true -%}\n    {%- endif -%}\n{%- endfor -%}\n{%- if not found_item -%}\n{{'You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\\n'}}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'system' %}\n{{ message['content'] }}\n    {%- else %}\n        {%- if message['role'] == 'user' %}\n{{'### Instruction:\\n' + message['content'] + '\\n'}}\n        {%- else %}\n{{'### Response:\\n' + message['content'] + '\\n<|EOT|>\\n'}}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{{'### Response:\\n'}}\n"
}

request

curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": false, "max_tokens": 20, "seed": 0 }' \
    -H 'Content-Type: application/json'

response

{"id":"","object":"text_completion","created":1706571866,"model":"TheBloke/deepseek-coder-6.7B-instruct-AWQ","system_fingerprint":"1.4.0-native","choices":[{"index":0,"message":{"role":"assistant","content":"As an AI Programming Assistant, I primarily focus on providing information and answering questions related to computer science"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":89,"completion_tokens":20,"total_tokens":109}}

Additionally, the following will fail to load the tokenizer_config from the hub, as expected, because the tokens specified in the hub config are not plain strings, as mentioned above.

link to config

 text-generation-launcher \
  --model-id TheBloke/deepseek-coder-6.7B-instruct-AWQ \
  --quantize awq 

I hope this information is helpful! Thanks!

@paulcx

paulcx commented Jan 30, 2024

@drbh I loaded the model from a local directory in the launcher arguments (--model-id /data/deepseek-coder-6.7B-instruct-AWQ --quantize awq). In this instance the model directory is "/data/deepseek-coder-6.7B-instruct-AWQ" inside the Docker environment, and I'm curious whether the launcher will automatically locate the tokenizer_config.json in this directory by default. Could this be the cause of the error mentioned?

@UniverseFly

@drbh thanks for your response. I was trying the v1.4.0 Docker image. Unfortunately I do not have the privileges to install the required dependencies to build the main branch on my server, but I am trying to build its Docker image and will report back. Is it expected that this doesn't work on v1.4.0?

@UniverseFly

> @drbh thanks for your response. I was trying the v1.4.0 Docker image. Unfortunately I do not have the privileges to install the required dependencies to build the main branch on my server, but I am trying to build its Docker image and will report back. Is it expected that this doesn't work on v1.4.0?

Oh, I made a silly mistake. Since I was using Docker, the --tokenizer-config-path should point to the path inside the container. After I corrected the path it worked! Thank you @drbh

helena-intel pushed a commit to helena-intel/text-generation-inference-hf that referenced this pull request Feb 1, 2024
This tiny PR just prints the parsing error when a tokenizer config fails
to load.

This is helpful when a chat_template won't load due to formatting issues
huggingface#1427 (comment)
helena-intel pushed a commit to helena-intel/text-generation-inference-hf that referenced this pull request Feb 1, 2024
This PR adds the `tokenizer-config-path` to the launcher and passes it
to the router

Fixes:
huggingface#1427
@paulcx

paulcx commented Feb 2, 2024

> @drbh I loaded the model from a local directory in the launcher arguments (--model-id /data/deepseek-coder-6.7B-instruct-AWQ --quantize awq). In this instance the model directory is "/data/deepseek-coder-6.7B-instruct-AWQ" inside the Docker environment, and I'm curious whether the launcher will automatically locate the tokenizer_config.json in this directory by default. Could this be the cause of the error mentioned?

@drbh Would you please verify the issue for local model launcher args?

@spew

spew commented Feb 2, 2024

I commented about the local model's tokenizer config not being found here: #1427 (comment)

I think that the server / launcher should look in the model directory instead of the working directory (or in addition to the working directory).

@drbh
Collaborator Author

drbh commented Feb 2, 2024

Hi @paulcx I believe this PR will resolve your issue: #1518

Originally the code was looking for the tokenizer_config in the local directory, but now it should correctly check the same directory as the model files.

Would you be able to try the latest changes and let me know if that resolves it for you? Thanks!

@paulcx

paulcx commented Feb 2, 2024

> Hi @paulcx I believe this PR will resolve your issue: #1518
>
> Originally the code was looking for the tokenizer_config in the local directory, but now it should correctly check the same directory as the model files.
>
> Would you be able to try the latest changes and let me know if that resolves it for you? Thanks!

Cool, I can confirm that PR solved the issue. @drbh

BTW: 1. Do you think it would be better to expose the chat_template in the /info API? 2. So far I cannot debug the chat_template and cannot see whether it works as expected.

@paulcx

paulcx commented Feb 4, 2024

@drbh I attempted to switch several plugins that support the OpenAI API (such as genieAI and sider) to the TGI API, but the conversations could not be stopped. Specifically, the TGI OpenAI API (stream mode) seems to end the stream differently from the OpenAI API. I used to build a custom OpenAI-compatible API server and yield ServerSentEvent("[DONE]") at the end of the stream output. Not sure if this helps?
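
For context, OpenAI's streaming API terminates the event stream with a literal `data: [DONE]` sentinel, which is what many of these clients key off to stop reading. A minimal sketch of that framing (plain strings, no particular server framework assumed):

```python
import json

def sse_events(chunks):
    """Yield OpenAI-style server-sent events and finish with the [DONE] sentinel."""
    for chunk in chunks:
        yield f"data: {json.dumps(chunk)}\n\n"
    # OpenAI clients stop reading the stream when they see this sentinel
    yield "data: [DONE]\n\n"

for event in sse_events([{"choices": [{"index": 0, "delta": {"content": "Hi"}}]}]):
    print(event, end="")
```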

drbh pushed a commit that referenced this pull request Mar 16, 2024
Fix a small inconsistency compared to OpenAI's chat-completion behavior (introduced in #1427, cc @drbh). When using `stream=True`, each chunk has an `index` value in `ChatCompletionChoice`. This index is not meant to be the index of the generated token but the index of the choice, which is always 0 (since TGI always returns a single choice).

See https://platform.openai.com/docs/api-reference/chat/object:
> index _integer_
> The index of the choice in the list of choices.

---

So instead of 

```js
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":1,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]}
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":2,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]}
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":3,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]}
```

it should return
```js
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]}
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]}
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]}
```

**EDIT:** I also edited ToolCall.index to be always `0` (instead of the
generated token index) but for this one I'm actually unsure. It might be
the index of the tool in the array of tools? OpenAI's documentation
doesn't provide any information about it:
> index _integer_

---

I also noticed that in OpenAI's example, the last chunk doesn't have a
delta and is the only one that has a `finish_reason` returning. TGI is
slightly different since the last chunk has both the last delta (i.e.
the last generated token) + the finish reason. I don't think this is
worth fixing since it is not a requirement according to the docs/specs
(at least not that I know of).
cr313 added a commit to cr313/text-generation-inference-load-test that referenced this pull request Apr 19, 2024
cr313 added a commit to cr313/text-generation-inference-load-test that referenced this pull request Apr 19, 2024
cr313 added a commit to cr313/text-generation-inference-load-test that referenced this pull request Apr 19, 2024
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024
…pass temp and top-k from API (huggingface#1470)

This PR makes some minor tweaks to the new OpenAI-compatible chat
endpoint huggingface#1427 in `GenerateParameters`:
- Disables `decoder_input_details` when streaming is enabled. This was
causing all streaming chat requests to fail before, since
[`decoder_input_details`==true is not enabled when streaming
tokens](https://github.com/huggingface/text-generation-inference/blob/98e5faff9daec6170cc2b0f963f2d73cf846b341/router/src/validation.rs#L406).
- Passes through `temperature` and `top_p` hyperparameters from the API
request to `GenerateParameters`

## Testing

```bash
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true, 
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```

Should work correctly. Currently, most recent release from `main`
returns error:
```
data:{"error":"Input validation error: `decoder_input_details` == true is not supported when streaming tokens","error_type":"validation"}
```

It's my first time contributing to this project, so I could be missing
something. Would especially appreciate @drbh's eyes on this one
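
For reference, with that change the usual OpenAI client parameters can simply be forwarded; a short sketch against a TGI instance on localhost:8080 as in the curl above (the sampling values themselves are arbitrary examples):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

# temperature and top_p are forwarded to TGI's GenerateParameters
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=20,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```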
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024
@dongs0104
Contributor

Hi @drbh
Thanks for this PR!
When I use the Llama 3 instruct model on TGI, the chat_template inside the tokenizer contains the BOS token twice, am I right? It is hard to debug in the Rust code; I am not good at Rust, sorry ;(

huggingface/trl#1114
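
One way to inspect this without touching the Rust side is to render the template with `transformers` and count BOS occurrences; a rough sketch (the model id is just an example and the repo is gated, so it needs access):

```python
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example id, gated repo
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Hello"}]

# Render the chat template to a string, then tokenize it as the server would.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(prompt, add_special_tokens=True)["input_ids"]

# If the template already emits the BOS text and the tokenizer also prepends
# BOS, the token will appear twice at the start of the encoded prompt.
print("BOS in rendered text:", prompt.count(tokenizer.bos_token))
print("BOS in first tokens:", input_ids[:4].count(tokenizer.bos_token_id))
```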

alfredgui2 pushed a commit to mlsys-io/kv.run that referenced this pull request Jul 6, 2024
alfredgui2 pushed a commit to mlsys-io/kv.run that referenced this pull request Jul 6, 2024
alfredgui2 pushed a commit to mlsys-io/kv.run that referenced this pull request Jul 6, 2024