
Add input and output tokens to response #41

Open · kebe7jun wants to merge 1 commit into main

Conversation

@kebe7jun commented May 16, 2024

@kebe7jun (Author) commented:

Can someone please take a look at this PR and give me some suggestions or feedback?

@oandreeva-nv (Collaborator) commented:

Hi @kebe7jun, could you please let me know if you've already signed a CLA?

Additionally, could you please let me know the motivation behind these changes?

@kebe7jun (Author) commented Jul 31, 2024

> Hi @kebe7jun, could you please let me know if you've already signed a CLA? Additionally, could you please let me know the motivation behind these changes?

I have signed the CLA. (In fact, I have had several PRs merged into triton-inference-server, but every time I am asked whether I have signed the CLA. 🤣)

Triton is excellent software. We have integrated it into our system and now need to provide external services.
However, we currently cannot track how many resources are actually consumed by user requests (everyone is billed by token count).
So we hope that when responding to a request, Triton can include the specific input/output token counts, to make our accounting easier.

@oandreeva-nv (Collaborator) commented:

Thanks @kebe7jun for confirming and clarifying. Regarding the CLA, it is our policy to make sure that external contributions are merged only with the CLA signed and received on our end. Feel free to note in advance that the CLA is signed on any future PRs.

@oandreeva-nv (Collaborator) commented:

One question: is this something that could be sent to a metrics endpoint, or do you need this info associated with each request?

@kebe7jun (Author) commented Aug 1, 2024

This information is generally associated with users, and we need to be able to count each user's usage. However, Triton does not have the concept of a user, so accounting is usually implemented at the gateway.
A metrics endpoint alone therefore cannot meet this requirement.
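
For illustration, here is a hypothetical sketch of how a gateway could read such per-request token counts with tritonclient's gRPC API. The model name "vllm_model" and the tensor name "output_tokens" are assumptions, since only "input_tokens" appears explicitly in the snippet further below:

# Hypothetical gateway-side sketch: reading the proposed token-count outputs
# via tritonclient's gRPC API. "vllm_model" and "output_tokens" are assumed
# names for illustration only.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# The Triton vLLM backend accepts a BYTES "text_input" tensor.
text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.array(["Hello, Triton!"], dtype=object))

result = client.infer(model_name="vllm_model", inputs=[text_input])

prompt_tokens = int(result.as_numpy("input_tokens"))       # proposed by this PR
completion_tokens = int(result.as_numpy("output_tokens"))  # assumed tensor name
print(f"usage for this request: {prompt_tokens} in, {completion_tokens} out")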

@gcalmettes commented:
Hello, confirming the usefulness of this feature; we'd love to see it added to the Triton vLLM backend!
Thanks @kebe7jun for it.

# Expose the prompt token count as an additional "input_tokens" output tensor.
triton_input_tokens_tensor = pb_utils.Tensor(
    "input_tokens",
    np.asarray(len(vllm_output.prompt_token_ids), dtype=self.input_tokens_dtype),
)
return pb_utils.InferenceResponse(
    output_tensors=[triton_output_tensor, triton_tokens_tensor, triton_input_tokens_tensor]
)
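
For the new tensor to be visible to clients, it would presumably also have to be declared in the model's config.pbtxt (an assumption on my part; the data type is whatever self.input_tokens_dtype maps to, shown here as int32):

output [
  {
    name: "input_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]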

This would also need to be done for the create_stream_response function, so the usage outputs are also present in stream responses.
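
A minimal sketch of what that could look like, assuming create_stream_response builds its pb_utils.InferenceResponse the same way as the non-streaming path (this is illustrative, not the PR's actual diff):

# Hypothetical tail of create_stream_response, mirroring the change above.
# triton_output_tensor and triton_tokens_tensor are assumed to be built
# earlier in the function, exactly as in the non-streaming response path.
triton_input_tokens_tensor = pb_utils.Tensor(
    "input_tokens",
    np.asarray(len(vllm_output.prompt_token_ids), dtype=self.input_tokens_dtype),
)
return pb_utils.InferenceResponse(
    output_tensors=[
        triton_output_tensor,
        triton_tokens_tensor,
        triton_input_tokens_tensor,
    ]
)

Since vLLM includes prompt_token_ids on every streamed RequestOutput, attaching the count to each chunk should let a client read usage from the final chunk alone.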
