
Add input and output tokens to response #41

Open · kebe7jun wants to merge 1 commit into main

Conversation

@kebe7jun commented May 16, 2024

@kebe7jun (Author) commented:

Can someone please take a look at this PR and give me some suggestions or feedback?

@oandreeva-nv (Collaborator) commented:

Hi @kebe7jun, could you please let me know if you've already signed a CLA?

Additionally, could you please let me know the motivation behind these changes?

@kebe7jun (Author) commented Jul 31, 2024

> Hi @kebe7jun, could you please let me know if you've already signed a CLA? Additionally, could you please let me know the motivation behind these changes?

I have signed the CLA. (In fact, I have had several PRs merged into triton-inference-server, but every time I am asked whether I have signed the CLA. 🤣)

Triton is excellent software. We have integrated it into our system and now need to provide external services.
However, we currently cannot track how many resources are actually consumed by user requests (everyone is billed by token count).
So we hope that when responding to a request, Triton can include the specific input/output token counts, to make our accounting easier.

@oandreeva-nv (Collaborator) commented:

Thanks @kebe7jun for confirming and clarifying. Regarding the CLA, it is our policy to make sure that external contributions are merged only with the CLA signed and received on our end. Feel free to note in advance that the CLA is signed on any future PRs.

@oandreeva-nv (Collaborator) commented:

One question: is this something that could be sent to a metrics endpoint, or do you need this info associated with each request?

@kebe7jun (Author) commented Aug 1, 2024

This information is generally associated with users, and we need to be able to count each user's usage. However, Triton does not have the concept of a user, so accounting is usually implemented at the gateway.
A metrics endpoint alone therefore cannot meet this requirement.
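
For illustration, here is a hypothetical sketch of how a gateway could read such per-request token counts with tritonclient's gRPC API. The model name "vllm_model" and the tensor name "output_tokens" are assumptions, since only "input_tokens" appears explicitly in the snippet further below:

# Hypothetical gateway-side sketch: reading the proposed token-count outputs
# via tritonclient's gRPC API. "vllm_model" and "output_tokens" are assumed
# names for illustration only.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# The Triton vLLM backend accepts a BYTES "text_input" tensor.
text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.array(["Hello, Triton!"], dtype=object))

result = client.infer(model_name="vllm_model", inputs=[text_input])

prompt_tokens = int(result.as_numpy("input_tokens"))       # proposed by this PR
completion_tokens = int(result.as_numpy("output_tokens"))  # assumed tensor name
print(f"usage for this request: {prompt_tokens} in, {completion_tokens} out")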

@gcalmettes commented:
Hello, confirming the usefulness of this feature; we'd love to see it added to the Triton vLLM backend!
Thanks @kebe7jun for it.

# Expose the prompt token count as an additional "input_tokens" output tensor.
triton_input_tokens_tensor = pb_utils.Tensor(
    "input_tokens",
    np.asarray(len(vllm_output.prompt_token_ids), dtype=self.input_tokens_dtype),
)
return pb_utils.InferenceResponse(
    output_tensors=[triton_output_tensor, triton_tokens_tensor, triton_input_tokens_tensor]
)
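
For the new tensor to be visible to clients, it would presumably also have to be declared in the model's config.pbtxt (an assumption on my part; the data type is whatever self.input_tokens_dtype maps to, shown here as int32):

output [
  {
    name: "input_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]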

This would also need to be done for the create_stream_response function, so the usage outputs are also present in stream responses.
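
A minimal sketch of what that could look like, assuming create_stream_response builds its pb_utils.InferenceResponse the same way as the non-streaming path (this is illustrative, not the PR's actual diff):

# Hypothetical tail of create_stream_response, mirroring the change above.
# triton_output_tensor and triton_tokens_tensor are assumed to be built
# earlier in the function, exactly as in the non-streaming response path.
triton_input_tokens_tensor = pb_utils.Tensor(
    "input_tokens",
    np.asarray(len(vllm_output.prompt_token_ids), dtype=self.input_tokens_dtype),
)
return pb_utils.InferenceResponse(
    output_tensors=[
        triton_output_tensor,
        triton_tokens_tensor,
        triton_input_tokens_tensor,
    ]
)

Since vLLM includes prompt_token_ids on every streamed RequestOutput, attaching the count to each chunk should let a client read usage from the final chunk alone.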
