
feat: supports openai chat completions API #1408

Closed
drbh wants to merge 9 commits

Conversation

drbh (Collaborator) commented Jan 5, 2024

This PR makes TGI a drop-in replacement for OpenAI clients by exposing the same HTTP interface.

Notes

  • TGI initializes a single model at startup, so the model field in HTTP requests is ignored.
  • max_tokens and stream should work as expected, but other parameters may be unimplemented or unsupported.

General approach

  • fetch the tokenizer_config from the Hub at startup
  • pass the tokenizer_config into Infer so it is available at request time
  • use the chat_template from the config to format the chat request
  • parse the Jinja template and render the chat string (see the sketch after this list)
  • pass the rendered inputs into the existing generate function
  • wrap the generation output in the expected response structure before returning
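
The router does this rendering in Rust; as a rough illustration only, here is the same idea in Python with jinja2. The template string below is a hypothetical stand-in for whatever chat_template a model actually ships in its tokenizer_config.json.

from jinja2 import Template

# hypothetical chat_template; real ones come from tokenizer_config.json
chat_template = (
    "{% for message in messages %}"
    "<|{{ message.role }}|>\n{{ message.content }}\n"
    "{% endfor %}"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"},
]

# render the chat messages into a single prompt string;
# this string is what gets passed to the existing generate function
prompt = Template(chat_template).render(messages=messages)
print(prompt)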

How to test

Streaming curl

curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'

It is also possible to use the openai Python library and point its base_url at TGI:

🌊 STREAMING REQUEST

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)

# ChatCompletionChunk(id='', choices=[Choice(delta=ChoiceDelta(content=' that', function_call=None, role='assistant', tool_calls=None), finish_reason=None, index=2, logprobs=None)], created=1704486761, model='', object='text_completion', system_fingerprint='')
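
As a follow-up to the loop above, the streamed deltas can also be accumulated into the full reply. This is a minimal sketch that reuses the client from above, opens a fresh stream (the previous one has already been consumed), and assumes the chunks follow the OpenAI shape shown in the sample output (choices[0].delta.content):

# open a fresh stream with the same client
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    stream=True
)

# accumulate the incremental content deltas into the complete reply
reply = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        reply += delta
print(reply)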

🚗 SYNCHRONOUS REQUEST

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not needed for a local LLM"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=False
)

print(chat_completion)
# ChatCompletion(id='', choices=[Choice(finish_reason=None, index=0, logprobs=None, message=ChatCompletionMessage(content='\nDeep learning is a new field of research that has been gaining traction in the last ...', role='assistant', function_call=None, tool_calls=None))], created=1704486762, model='', object='text_completion', system_fingerprint='', usage=CompletionUsage(completion_tokens=100, prompt_tokens=76, total_tokens=176))
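
To pull out just the generated text, read chat_completion.choices[0].message.content (visible in the repr above).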

drbh commented Jan 5, 2024

@Narsil and @OlivierDehaene please let me know if any changes should be made!

drbh requested a review from OlivierDehaene on January 8, 2024
drbh requested a review from OlivierDehaene on January 9, 2024
drbh added a commit that referenced this pull request Jan 10, 2024
prefer PR from original repo rather than fork to run CI #1408
Michellehbn (Member) commented:

Noting we can probably close #735 when this is done!

drbh commented Jan 10, 2024

Closing in favor of the CI-enabled PR #1427.

drbh closed this on Jan 10, 2024
drbh added a commit that referenced this pull request Jan 11, 2024
prefer PR from original repo rather than fork to run CI #1408