Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

usage stats and crash reports #2220

Merged
merged 27 commits into from
Jul 19, 2024
Merged

usage stats and crash reports #2220

merged 27 commits into from
Jul 19, 2024

Conversation

ErikKaum
Copy link
Member

@ErikKaum ErikKaum commented Jul 11, 2024

What does this PR do?

High level:

  • adds usage stats to TGI
  • implementation in router + docs
  • send 2 events: one on start and one on stop (which might potentially be a crash)
  • omit sensitive data
  • device info from cuda or xpu, tpu deferred for now due to not having similarly a straightforward way of getting device info
  • only gather stats when TGI is run inside of docker
  • default in and opt-out with --disable-usage-stats flag, can also partially opt out with flag --disable-crash-reports which only allows sending of usage stats.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

@ErikKaum ErikKaum requested a review from OlivierDehaene July 11, 2024 13:15
@ErikKaum ErikKaum force-pushed the feature/usage-stats branch from 9e22816 to 0c16204 Compare July 15, 2024 08:45
Copy link
Member

@OlivierDehaene OlivierDehaene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!
Can you also send error events on InferError::Generation?
It's deeper in the stack but these errors are super important to catch.

router/src/main.rs Outdated Show resolved Hide resolved
router/src/usage_stats.rs Outdated Show resolved Hide resolved
Copy link
Member

@Hugoch Hugoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool 👏 !!
I think that container detection needs to be improved as I think it only works in specific Docker envs and not with all container engines.

router/src/main.rs Outdated Show resolved Hide resolved
router/src/main.rs Outdated Show resolved Hide resolved
router/src/main.rs Outdated Show resolved Hide resolved
router/src/usage_stats.rs Outdated Show resolved Hide resolved
router/src/usage_stats.rs Outdated Show resolved Hide resolved
docs/source/usage_statistics.md Outdated Show resolved Hide resolved
@ErikKaum
Copy link
Member Author

Notes from offline discussion with @OlivierDehaene

  • we decided to defer the implementation of bubbling up the InferError::Generation
  • reason is that Olivier is writing an overhaul of the scheduler + backend which anyways allows the bubbling of errors

--> better to get this forward now and not to block on things
--> once the rewrite of the backend is in, we can add the InferError::Generation to the usage stats collection

Copy link
Member

@Hugoch Hugoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool! 👍
Tried e2e telemetry and a header is missing.

router/src/usage_stats.rs Show resolved Hide resolved
@ErikKaum ErikKaum requested a review from Hugoch July 19, 2024 10:06
Copy link
Member

@Hugoch Hugoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great job 🥳

@ErikKaum ErikKaum merged commit 4c19593 into main Jul 19, 2024
9 checks passed
@ErikKaum ErikKaum deleted the feature/usage-stats branch July 19, 2024 14:17
ErikKaum added a commit that referenced this pull request Jul 25, 2024
* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of hole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <[email protected]>

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <[email protected]>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs

Co-authored-by: Hugo Larcher <[email protected]>

* error reason can't be in nested json

* cargo fmt

---------

Co-authored-by: Hugo Larcher <[email protected]>
Co-authored-by: Erik Kaunismäki <[email protected]>
ErikKaum added a commit that referenced this pull request Jul 26, 2024
* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of hole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <[email protected]>

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <[email protected]>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs

Co-authored-by: Hugo Larcher <[email protected]>

* error reason can't be in nested json

* cargo fmt

---------

Co-authored-by: Hugo Larcher <[email protected]>
Co-authored-by: Erik Kaunismäki <[email protected]>
yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Sep 26, 2024
* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of hole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <[email protected]>

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <[email protected]>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs

Co-authored-by: Hugo Larcher <[email protected]>

* error reason can't be in nested json

* cargo fmt

---------

Co-authored-by: Hugo Larcher <[email protected]>
Co-authored-by: Erik Kaunismäki <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants