feat: supports openai chat completions API #1427
Conversation
Fixes #735
Nice!
A few comments here and there but nothing big.
e7a6fb2 to 583e08b
**commits above are from a rebase to resolve merge conflicts
Pre-emptive LGTM so you can merge once you use `prefill.len()` instead of the validated length.
It's more correct (since it goes through the real python tokenizer used by the model) and avoids creating that weird duplication in the struct (which I'm going to add anyway here: https://github.com/huggingface/text-generation-inference/pull/1436/files, but for different reasons).
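For illustration only, a minimal Python sketch (hypothetical names, not the router's Rust code) of what deriving the usage counts from the returned prefill looks like, rather than carrying a separately validated length on the struct:

```python
# Minimal sketch (hypothetical names): build the usage block from the prefill
# returned by the model's real tokenizer rather than a separate validated length.
def build_usage(prefill_tokens: list, generated_tokens: list) -> dict:
    prompt_tokens = len(prefill_tokens)        # i.e. prefill.len() on the Rust side
    completion_tokens = len(generated_tokens)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```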
prefer PR from original repo rather than fork to run CI #1408
db3e152 to 4555e87
**commits above are from a rebase to resolve merge conflicts
update: the latest commit adds support for loading the config from a local file. In order to support as many configurations as possible, a new CLI argument was added: `cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0 --tokenizer-config-path ./valid_tokenizer_config.json`

Possible Config States
Would it be possible to configure the "route" for that new chat completion API via an environment variable/CLI arg? I am asking since SageMaker only exposes one route. cc @jeffboudier
@drbh we can do this in a follow-up. We're talking about changing this: https://github.com/huggingface/text-generation-inference/blob/main/router/src/server.rs#L697 through a CLI argument (both in launcher and router I think).
I have the same issue with the quantized deepseek model.
Hi @paulcx and @UniverseFly, unfortunately I cannot reproduce this issue. I'm using the following command

```bash
text-generation-launcher \
    --model-id TheBloke/deepseek-coder-6.7B-instruct-AWQ \
    --quantize awq \
    --tokenizer-config-path ~/deepseek-coder-tokenizer-config.json
```

where

```json
{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": "<|begin▁of▁sentence|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|EOT|>",
  "legacy": true,
  "model_max_length": 16384,
  "pad_token": "<|end▁of▁sentence|>",
  "sp_model_kwargs": {},
  "unk_token": null,
  "tokenizer_class": "LlamaTokenizerFast",
  "chat_template": "{%- set found_item = false -%}\n{%- for message in messages -%}\n    {%- if message['role'] == 'system' -%}\n        {%- set found_item = true -%}\n    {%- endif -%}\n{%- endfor -%}\n{%- if not found_item -%}\n{{'You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\\n'}}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'system' %}\n{{ message['content'] }}\n    {%- else %}\n        {%- if message['role'] == 'user' %}\n{{'### Instruction:\\n' + message['content'] + '\\n'}}\n        {%- else %}\n{{'### Response:\\n' + message['content'] + '\\n<|EOT|>\\n'}}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{{'### Response:\\n'}}\n"
}
```

request

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": false, "max_tokens": 20, "seed": 0 }' \
    -H 'Content-Type: application/json'
```

response

```json
{"id":"","object":"text_completion","created":1706571866,"model":"TheBloke/deepseek-coder-6.7B-instruct-AWQ","system_fingerprint":"1.4.0-native","choices":[{"index":0,"message":{"role":"assistant","content":"As an AI Programming Assistant, I primarily focus on providing information and answering questions related to computer science"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":89,"completion_tokens":20,"total_tokens":109}}
```

Additionally, the following will fail to load the tokenizer_config from the hub as expected, because the tokens specified in the config are not strings as mentioned above. link to config

```bash
text-generation-launcher \
    --model-id TheBloke/deepseek-coder-6.7B-instruct-AWQ \
    --quantize awq
```

I hope this information is helpful! Thanks!
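For reference, a rough sketch of the difference being described: on the hub, special tokens in tokenizer_config.json can be serialized as objects rather than plain strings, so a loader that only accepts strings fails. One illustrative way to normalize them in Python (not the actual router code):

```python
import json

def normalize_token(value):
    """Accept either a plain string ("<|EOT|>") or an AddedToken-style dict."""
    if isinstance(value, dict):
        return value.get("content")
    return value

with open("tokenizer_config.json") as f:
    config = json.load(f)

for key in ("bos_token", "eos_token", "pad_token", "unk_token"):
    if key in config:
        config[key] = normalize_token(config[key])

print(config.get("eos_token"))
```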
@drbh I loaded the model from the local directory in the launcher arguments (`--model-id /data/deepseek-coder-6.7B-instruct-AWQ --quantize awq`). In this instance, the model directory is "/data/deepseek-coder-6.7B-instruct-AWQ" within the docker env, and I'm curious whether the launcher code will automatically locate the tokenizer_config.json in this directory by default, or something else. Could this be the cause of the error mentioned?
@drbh thanks for your response. I was trying the v1.4.0 docker image. Unfortunately I do not have the privilege to install the required dependencies to build the
Oh I made a silly mistake.. Since I was using docker, the
This tiny PR just prints the parsing error when a tokenizer config fails to load. This is helpful when a chat_template won't load due to formatting issues. huggingface#1427 (comment)
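In the same spirit, a tiny Python illustration of surfacing the parse error instead of silently ignoring the config (hypothetical; the actual change lives in the Rust router):

```python
import json

try:
    with open("tokenizer_config.json") as f:
        tokenizer_config = json.load(f)
except (OSError, json.JSONDecodeError) as err:
    # Print the underlying parsing error so formatting issues in chat_template
    # are visible instead of the config silently being dropped.
    print(f"Unable to parse tokenizer config: {err}")
    tokenizer_config = None
```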
This PR adds the `tokenizer-config-path` to the launcher and passes it to the router. Fixes: huggingface#1427
@drbh Would you please verify the issue with the local-model launcher args?
I commented about the local model's tokenizer config not being found here: #1427 (comment). I think that the server/launcher should look in the model directory instead of the working directory (or in addition to the working directory).
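A rough sketch of the lookup order being suggested (illustrative Python, not the launcher's actual logic): prefer a tokenizer_config.json sitting next to the model weights, then fall back to the working directory.

```python
from pathlib import Path

def find_tokenizer_config(model_id: str):
    """Prefer a tokenizer_config.json next to the model weights, then fall back
    to the current working directory."""
    candidates = [
        Path(model_id) / "tokenizer_config.json",  # e.g. /data/deepseek-coder-6.7B-instruct-AWQ
        Path.cwd() / "tokenizer_config.json",
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None
```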
Hi @paulcx, I believe this PR will resolve your issue: #1518. Originally the code was looking for the tokenizer_config in the local directory, but now it should correctly check the same directory as the model files. Would you be able to try the latest changes and let me know if that resolves it for you? Thanks!
Cool, I can confirm that PR solved the issue. @drbh BTW: 1. do you think it would be better to put the chat_template into the /info API? 2. So far I cannot debug the chat_template and cannot see whether it works as expected.
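On point 2, one way to debug a chat_template outside of TGI is to render it directly with Jinja2 against sample messages. This is a standalone sketch; TGI's own rendering may differ in details (e.g. which variables it passes in), and templates using non-standard constructs like `raise_exception` will need extra handling.

```python
import json
from jinja2 import Environment

with open("tokenizer_config.json") as f:
    config = json.load(f)

# Render the chat_template against a sample conversation to inspect the prompt.
template = Environment().from_string(config["chat_template"])
prompt = template.render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    bos_token=config.get("bos_token"),
    eos_token=config.get("eos_token"),
    add_generation_prompt=True,
)
print(prompt)
```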
@drbh I attempted to switch several plugins that support the OpenAI API (such as genieAI and sider) over to the TGI API, but the result was that the conversations could not be stopped. Specifically, the TGI OpenAI API (stream mode) seems to end differently compared to the output (length?) from the OpenAI API. I used to build a custom OpenAI API server and yield ServerSentEvent("[DONE]") at the end of the stream output. Not sure if this helps?
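For context, many OpenAI-style clients stop reading only when they see a final `data: [DONE]` event after the last chunk. A minimal client-side sketch (assuming the `requests` library and a locally running TGI; illustrative, not taken from any plugin's code) that stops on that sentinel:

```python
import json
import requests

resp = requests.post(
    "http://localhost:3000/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "stream": True,
        "max_tokens": 20,
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":          # the sentinel many OpenAI clients stop on
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```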
Fix a small inconsistency compared to OpenAI's chat-completion behavior (introduced in #1427 cc @drbh). When using `stream=True`, each chunk has an `index` value in `ChatCompletionChoice`. This index is not meant to be the index of the generated token but the index of the choice, which is always 0 (since TGI always returns a single choice).

See https://platform.openai.com/docs/api-reference/chat/object:

> index _integer_
> The index of the choice in the list of choices.

---

So instead of

```js
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":1,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":2,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":3,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]}
```

it should return

```js
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]}
```

**EDIT:** I also edited ToolCall.index to always be `0` (instead of the generated token index), but for this one I'm actually unsure. It might be the index of the tool in the array of tools? OpenAI's documentation doesn't provide any information about it:

> index _integer_

---

I also noticed that in OpenAI's example, the last chunk doesn't have a delta and is the only one that has a `finish_reason`. TGI is slightly different since the last chunk has both the last delta (i.e. the last generated token) and the finish reason. I don't think this is worth fixing since it is not a requirement according to the docs/specs (at least not that I know of).
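To illustrate why the index matters: OpenAI-style clients typically bucket streamed deltas by `choices[i].index`, so a monotonically increasing index makes each token look like a new choice. A small Python sketch of that accumulation (assuming the chunks above have already been parsed into dicts):

```python
from collections import defaultdict

def accumulate(chunks):
    """Group streamed deltas by choices[i].index, as OpenAI-style clients do."""
    texts = defaultdict(str)
    for chunk in chunks:
        for choice in chunk["choices"]:
            texts[choice["index"]] += choice["delta"].get("content", "")
    return dict(texts)

# With index fixed at 0, the three deltas above accumulate into {0: "I'm"};
# with indices 1, 2, 3 they would wrongly appear as three separate choices.
```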
This PR adds support to make TGI a drop-in replacement for OpenAI clients by exposing the same HTTP interface.

Notes
- TGI inits a single model at startup, so the `model` field is unused in HTTP requests.
- `max_tokens` and `stream` should work as expected, but other params may be unimplemented or not supported.

General approach
- fetch the `tokenizer_config` at startup from the hub
- pass `tokenizer_config` into `Infer` so we have it at request time
- use the `chat_template` on the config to format the chat request
- parse the jinja template and render the chat string
- pass inputs into the existing generate function
- wrap the generation output in the expected structure before returning

How to test

Streaming curl

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "tgi", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is deep learning?" } ], "stream": true, "max_tokens": 20 }' \
    -H 'Content-Type: application/json'
```

It is also possible to use the `openai` python library and change the base url

🌊 STREAMING REQUEST

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

for message in chat_completion:
    print(message)
```

🚗 SYNCHRONOUS REQUEST

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=False
)

print(chat_completion)
```

How to run dev

```bash
cd text-generation-inference/server
MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 text-generation-server serve --trust-remote-code gpt2
```

***note: many of the existing `chat_templates` use non-standard `jinja` (i.e. adding a `raise` to the template) which will throw an error when parsing; hence using `upstage/SOLAR-10.7B-Instruct-v1.0` since it has a valid template

```bash
cd text-generation-inference/router
cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0
```

trigger

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": true, "max_tokens": 20, "logprobs": true }' \
    -H 'Content-Type: application/json'
```

^ supports `stream: true` and `stream: false` requests
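A compact sketch of the general approach described above (illustrative client-side Python, not the router's Rust implementation; it uses TGI's existing `/generate` endpoint and hardcodes `finish_reason` for brevity):

```python
import requests
from jinja2 import Environment

def chat_completion(base_url, tokenizer_config, messages, max_tokens=20):
    # 1) render the chat_template from the tokenizer config into a single prompt
    template = Environment().from_string(tokenizer_config["chat_template"])
    inputs = template.render(messages=messages, add_generation_prompt=True,
                             bos_token=tokenizer_config.get("bos_token"),
                             eos_token=tokenizer_config.get("eos_token"))
    # 2) pass the rendered prompt to the existing generate endpoint
    resp = requests.post(
        f"{base_url}/generate",
        json={"inputs": inputs, "parameters": {"max_new_tokens": max_tokens}},
    ).json()
    # 3) wrap the generation output in the chat-completion shape before returning
    return {
        "object": "text_completion",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": resp["generated_text"]},
            "finish_reason": "length",  # hardcoded here for brevity
        }],
    }
```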
…pass temp and top-k from API (huggingface#1470)

This PR makes some minor tweaks to the new OpenAI-compatible chat endpoint huggingface#1427 in `GenerateParameters`:
- Disables `decoder_input_details` when streaming is enabled. This was causing all streaming chat requests to fail before, since [`decoder_input_details` == true is not enabled when streaming tokens](https://github.com/huggingface/text-generation-inference/blob/98e5faff9daec6170cc2b0f963f2d73cf846b341/router/src/validation.rs#L406).
- Passes through the `temperature` and `top_p` hyperparameters from the API request to `GenerateParameters`.

## Testing

```bash
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{ "model": "", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is deep learning?" } ], "stream": true, "max_tokens": 20 }' \
    -H 'Content-Type: application/json'
```

Should work correctly. Currently, the most recent release from `main` returns an error:

```
data:{"error":"Input validation error: `decoder_input_details` == true is not supported when streaming tokens","error_type":"validation"}
```

It's my first time contributing to this project, so I could be missing something. Would especially appreciate @drbh's eyes on this one.
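A hedged sketch of the parameter mapping this PR describes (Python pseudostructure; the field names mirror `GenerateParameters`, but this is not the actual Rust code):

```python
def to_generate_parameters(req: dict) -> dict:
    """Map OpenAI-style chat request fields onto TGI's GenerateParameters."""
    return {
        "max_new_tokens": req.get("max_tokens", 20),
        "temperature": req.get("temperature"),   # now passed through from the API request
        "top_p": req.get("top_p"),               # now passed through from the API request
        "seed": req.get("seed"),
        # Must be disabled when streaming: validation rejects
        # `decoder_input_details` == true while streaming tokens.
        "decoder_input_details": not req.get("stream", False),
    }
```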
Hi @drbh