feat: accept list as prompt and use first string #1702
Conversation
router/src/lib.rs
Outdated
```rust
match value {
    Value::String(s) => Ok(s),
    Value::Array(arr) => arr
        .first()
```
If we're not treating the array properly (as multiple queries), I suggest we just don't support it.
I don't think we want to support arrays (it was done this way for a long time in pipelines and created so many headaches that it's not worth it).
If we still want that exact functionality, we need to YELL if the array contains more than one element (instead of silently ignoring it).
It would be pretty easy to support arrays like we do in TEI. Just push all the requests onto the internal queue and wait.
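For reference, a minimal sketch of one way to accept either form, using an untagged serde enum instead of matching on a raw `serde_json::Value`. The type and field names are illustrative, not the PR's actual code, and it assumes `serde` (with the derive feature) and `serde_json` as dependencies:

```rust
use serde::Deserialize;

// Hypothetical prompt type: deserializes from either a JSON string or a JSON
// array of strings. With serde's `untagged` representation, the first variant
// that matches the input shape is used.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum Prompt {
    Single(String),
    Batch(Vec<String>),
}

// Hypothetical request wrapper, standing in for the real CompletionRequest.
#[derive(Debug, Deserialize)]
struct CompletionRequestSketch {
    prompt: Prompt,
}

fn main() {
    let single: CompletionRequestSketch =
        serde_json::from_str(r#"{ "prompt": "Say this is a test" }"#).unwrap();
    let batch: CompletionRequestSketch =
        serde_json::from_str(r#"{ "prompt": ["What color is the sky?", "Is water wet?"] }"#)
            .unwrap();

    // The handler can then decide whether to reject batches with more than
    // one element or fan them out as separate internal requests.
    println!("{:?}", single.prompt);
    println!("{:?}", batch.prompt);
}
```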
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 02414b5 to b08038c
notes

example requests:

**streaming with openai**

```python
from openai import OpenAI

YOUR_TOKEN = "YOUR_API_KEY"

# Initialize the client, pointing it to one of the available models
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key=YOUR_TOKEN,
)

completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=["Say", "this", "is", "a", "test"],
    echo=True,
    n=1,
    stream=True,
    max_tokens=10,
)

for chunk in completion:
    print(chunk)

# Completion(id='', choices=[CompletionChoice(finish_reason='', index=4, logprobs=None, text=' =')], created=1712722135, model='google/gemma-7b', object='text_completion', system_fingerprint='1.4.5-native', usage=None)
# Completion(id='', choices=[CompletionChoice(finish_reason='', index=4, logprobs=None, text=' ')], created=1712722135, model='google/gemma-7b', object='text_completion', system_fingerprint='1.4.5-native', usage=None)
# Completion(id='', choices=[CompletionChoice(finish_reason='', index=4, logprobs=None, text='1')], created=1712722136, model='google/gemma-7b', object='text_completion', system_fingerprint='1.4.5-native', usage=None)
# Completion(id='', choices=[CompletionChoice(finish_reason='', index=4, logprobs=None, text='0')], created=1712722136, model='google/gemma-7b', object='text_completion', system_fingerprint='1.4.5-native', usage=None)
# ...
```

**with aiohttp (streaming)**

```python
from aiohttp import ClientSession
import json
import asyncio

base_url = "http://localhost:3000"

request = {
    "model": "tgi",
    "prompt": [
        "What color is the sky?",
        "Is water wet?",
        "What is the capital of France?",
        "def mai",
    ],
    "max_tokens": 10,
    "seed": 0,
    "stream": True,
}

url = f"{base_url}/v1/completions"


async def main():
    async with ClientSession() as session:
        async with session.post(url, json=request) as response:
            async for chunk in response.content.iter_any():
                chunk = chunk.decode().split("\n\n")
                chunk = [c.replace("data:", "") for c in chunk]
                chunk = [c for c in chunk if c]
                chunk = [json.loads(c) for c in chunk]
                for c in chunk:
                    print(c)


asyncio.run(main())

# {'id': '', 'object': 'text_completion', 'created': 1712863765, 'choices': [{'index': 1, 'text': ' a', 'logprobs': None, 'finish_reason': ''}], 'model': 'google/gemma-7b', 'system_fingerprint': '1.4.5-native'}
# {'id': '', 'object': 'text_completion', 'created': 1712863765, 'choices': [{'index': 2, 'text': ' Paris', 'logprobs': None, 'finish_reason': ''}], 'model': 'google/gemma-7b', 'system_fingerprint': '1.4.5-native'}
# {'id': '', 'object': 'text_completion', 'created': 1712863765, 'choices': [{'index': 3, 'text': 'nic', 'logprobs': None, 'finish_reason': ''}], 'model': 'google/gemma-7b', 'system_fingerprint': '1.4.5-native'}
# {'id': '', 'object': 'text_completion', 'created': 1712863765, 'choices': [{'index': 0, 'text': ' blue', 'logprobs': None, 'finish_reason': ''}], 'model': 'google/gemma-7b', 'system_fingerprint': '1.4.5-native'}
# {'id': '', 'object': 'text_completion', 'created': 1712863765, 'choices': [{'index': 1, 'text': ' liquid', 'logprobs': None, 'finish_reason': ''}], 'model': 'google/gemma-7b', 'system_fingerprint': '1.4.5-native'}
```

**sync with requests (non streaming)**

```python
import requests

base_url = "http://localhost:3000"

response = requests.post(
    f"{base_url}/v1/completions",
    json={
        "model": "tgi",
        "prompt": ["Say", "this", "is", "a", "test"],
        "max_tokens": 2,
        "seed": 0,
    },
    stream=False,
)
response = response.json()
print(response)

# {'id': '', 'object': 'text_completion', 'created': 1712722405, 'model': 'google/gemma-7b', 'system_fingerprint': '1.4.5-native', 'choices': [{'index': 0, 'text': " you'", 'logprobs': None, 'finish_reason': 'length'}, {'index': 1, 'text': ' the sequence', 'logprobs': None, 'finish_reason': 'length'}, {'index': 2, 'text': '_cases', 'logprobs': None, 'finish_reason': 'length'}, {'index': 3, 'text': '.\n\n', 'logprobs': None, 'finish_reason': 'length'}, {'index': 4, 'text': '. ', 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 10, 'completion_tokens': 10, 'total_tokens': 20}}
```
**note** the client library intentionally does not include a
```python
json={
    "model": "tgi",
    "prompt": ["Say", "this", "is", "a", "test"],
    "max_tokens": 5,
```
Can we use different numbers than 5 in both dimensions? It makes it hard to understand what is what.
What happens if the lengths of the completions vary?
Can you make the prompts of various sizes too?
Does that mean that both queries have to wait on each other to send back chunks to the client?
**updates**
Tests are updated with `"max_tokens": 10` and four prompts of varying lengths.
There is no way to specify a different `max_tokens` per prompt in the OpenAI API; the same value is applied to each prompt.
With the recent change, responses do not need to wait on each other and are interleaved; responses can complete at different times (chunks with that index simply stop being emitted). See the sketch below.
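To illustrate the interleaving described above, here is a minimal, self-contained Rust sketch (not the router's actual code) that merges several per-prompt streams with `futures::stream::select_all`, so chunks are forwarded as soon as any prompt produces one and each stream finishes on its own schedule. It assumes the `futures` and `tokio` crates:

```rust
use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    // Hypothetical per-prompt token streams, tagged with their choice index;
    // in the real router these would be the internal generate streams for
    // each prompt in the array.
    let per_prompt = vec![
        stream::iter(vec![(0usize, " blue"), (0, ".")]).boxed(),
        stream::iter(vec![(1usize, " yes"), (1, ","), (1, " mostly")]).boxed(),
        stream::iter(vec![(2usize, " Paris")]).boxed(),
    ];

    // select_all polls every stream and yields items in whatever order they
    // become ready, so no prompt waits on another; when a stream ends, its
    // index simply stops appearing in the merged output.
    let mut merged = stream::select_all(per_prompt);
    while let Some((index, token)) = merged.next().await {
        println!("choice {index}: {token:?}");
    }
}
```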
router/src/server.rs
Outdated
```rust
let mut x_compute_type = "unknown".to_string();
let mut x_compute_characters = 0u32;
let mut x_accel_buffering = "no".to_string();
```
Not a big fan of mutables here. Not sure I have an easy better way atm.
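One possible direction, as a hedged sketch only (not the router's code): derive all three header values from the first streamed event in a single expression so that no `let mut` bindings are needed. The event type and defaults below are made up for illustration:

```rust
// Hypothetical metadata carried by the first streamed event.
struct FirstEvent {
    compute_type: Option<String>,
    compute_characters: u32,
}

// All three header values are produced once, immutably, from one match.
fn header_values(first: Option<&FirstEvent>) -> (String, u32, String) {
    match first {
        Some(event) => (
            event
                .compute_type
                .clone()
                .unwrap_or_else(|| "unknown".to_string()),
            event.compute_characters,
            "no".to_string(),
        ),
        None => ("unknown".to_string(), 0, "no".to_string()),
    }
}

fn main() {
    let event = FirstEvent {
        compute_type: Some("gpu+optimized".to_string()),
        compute_characters: 42,
    };
    let (x_compute_type, x_compute_characters, x_accel_buffering) = header_values(Some(&event));
    println!("{x_compute_type} {x_compute_characters} {x_accel_buffering}");
}
```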
If a user kills the connection, make sure the inference is not running in the background
The logs are rather poor compared to the regular endpoints:

```
2024-04-16T10:42:49.931556Z INFO text_generation_router::server: router/src/server.rs:500: Success
```

vs

```
2024-04-16T10:42:56.302342Z INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(10), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="429.831681ms" validation_time="217.73µs" queue_time="64.823µs" inference_time="429.549248ms" time_per_token="42.954924ms" seed="None"}: text_generation_router::server: router/src/server.rs:500: Success
```
Should be good after rebase.
Force-pushed from 46d97d8 to 52d234f
Failing client tests do not seem related to these changes and are resolved here: #1751
Yeah, it's a bit strange that the same logging line produces more output in one case. Any ideas on how to have it emit the same output?
LGTM
Should be about the span capture.
Force-pushed from 35626b9 to 4fec982
Logs are now bubbled up to the calling function and output the same information as of this change:
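For context, a minimal sketch of the span-capture behaviour being discussed (using the `tracing` and `tracing-subscriber` crates; the function and field names are illustrative, not the router's actual code): events emitted inside an `#[instrument]`-ed function carry that span's recorded fields, which is why the `generate_stream` log line shows the full parameter list while a bare `info!` outside such a span does not.

```rust
use tracing::{info, instrument};

// Hypothetical request parameters recorded on the span.
struct Request {
    max_new_tokens: u32,
    seed: Option<u64>,
}

// The span created by #[instrument] records these fields; any event logged
// inside the function is printed with them attached.
#[instrument(skip_all, fields(max_new_tokens = req.max_new_tokens, seed = ?req.seed))]
fn handle(req: &Request) {
    info!("Success");
}

fn main() {
    tracing_subscriber::fmt().init();
    let req = Request {
        max_new_tokens: 10,
        seed: None,
    };
    handle(&req);
    // Prints something like:
    // INFO handle{max_new_tokens=10 seed=None}: Success
}
```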
LGTM, very nice PR in the end.
This PR allows the `CompletionRequest.prompt` to be sent as a string or an array of strings. When an array is sent, the first value will be used if it's a string; otherwise, the corresponding error will be thrown.

Fixes: #1690

Similar to: https://github.com/vllm-project/vllm/pull/323/files