Add input and output tokens to response #41
base: main
Conversation
Can someone please take a look at this PR and give me some suggestions or feedback?
Force-pushed from 96cb28e to 1a3989e
I have signed the CLA. (In fact, I have merged several PRs in triton-inference-server, but I am asked every time whether I have signed the CLA. 🤣) Triton is excellent software. We have integrated it into our system and now need to provide external services.
Thanks @kebe7jun for the confirmation and clarifications. Regarding the CLA, it is our policy to make sure that external contributions are only merged once the CLA is signed and received on our end. Feel free to note in advance that the CLA is signed on any future PRs.
One question: is this something that could be sent to a metrics endpoint, or do you need this info associated with the request?
This information is generally associated with individual users, and we need to be able to count each user's usage. However, Triton has no concept of users, so the accounting is usually implemented at the gateway; a sketch of how a gateway-side client could read these counts follows below.
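For illustration, a minimal gateway-side sketch. The server address, model name, and input tensor name are assumptions, and it presumes the "input_tokens" / "output_tokens" tensors added by this PR are declared as outputs in the model's config.pbtxt:

```python
# Minimal gateway-side sketch (assumptions: server at localhost:8001, model
# named "vllm_model", input tensor "text_input"; tensor names mirror this PR).
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

# Build the text input as a BYTES tensor.
text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.asarray(["Hello, Triton!"], dtype=object))

result = client.infer(
    model_name="vllm_model",
    inputs=[text_input],
    outputs=[
        grpcclient.InferRequestedOutput("text_output"),
        grpcclient.InferRequestedOutput("input_tokens"),
        grpcclient.InferRequestedOutput("output_tokens"),
    ],
)

# The gateway can attribute these counts to the authenticated user.
prompt_tokens = int(result.as_numpy("input_tokens"))
completion_tokens = int(result.as_numpy("output_tokens"))
print(f"user usage: prompt={prompt_tokens}, completion={completion_tokens}")
```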
Hello, confirming the usefulness of this feature; we'd love to see it added to the triton vllm backend!
# Attach the prompt-token count as an extra output tensor; triton_output_tensor
# and triton_tokens_tensor are built earlier in create_response.
triton_input_tokens_tensor = pb_utils.Tensor(
    "input_tokens",
    np.asarray(len(vllm_output.prompt_token_ids), dtype=self.input_tokens_dtype),
)
return pb_utils.InferenceResponse(
    output_tensors=[triton_output_tensor, triton_tokens_tensor, triton_input_tokens_tensor]
)
This would also need to be done in the create_stream_response function, so the usage outputs are also present in streaming responses, like the usage object in OpenAI's API: https://platform.openai.com/docs/api-reference/chat/object#chat/object-usage (see the sketch below).
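For concreteness, a minimal sketch of that change, assuming create_stream_response slices per-chunk delta text the way the upstream streaming path does; the signature and the attribute names (self.output_dtype, self.output_tokens_dtype, self.input_tokens_dtype) are assumptions mirroring the non-streaming change above:

```python
import numpy as np
import triton_python_backend_utils as pb_utils

def create_stream_response(self, vllm_output, previous_outputs_lengths):
    """Build one streaming chunk, attaching usage tensors to every chunk."""
    # Delta text generated since the previous chunk.
    text_outputs = [
        output.text[prev_len:]
        for output, prev_len in zip(vllm_output.outputs, previous_outputs_lengths)
    ]
    triton_output_tensor = pb_utils.Tensor(
        "text_output", np.asarray(text_outputs, dtype=self.output_dtype)
    )
    # Cumulative completion-token count so far, analogous to OpenAI's
    # usage.completion_tokens.
    num_output_tokens = sum(len(o.token_ids) for o in vllm_output.outputs)
    triton_tokens_tensor = pb_utils.Tensor(
        "output_tokens", np.asarray(num_output_tokens, dtype=self.output_tokens_dtype)
    )
    # Prompt-token count, constant across chunks of the same request
    # (analogous to usage.prompt_tokens).
    triton_input_tokens_tensor = pb_utils.Tensor(
        "input_tokens",
        np.asarray(len(vllm_output.prompt_token_ids), dtype=self.input_tokens_dtype),
    )
    return pb_utils.InferenceResponse(
        output_tensors=[
            triton_output_tensor,
            triton_tokens_tensor,
            triton_input_tokens_tensor,
        ]
    )
```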