feat(module: eks-monitoring) Add NVIDIA gpu monitoring dashboards #257

Merged Jan 24, 2024 (11 commits)

@lewinkedrs (Contributor) commented Jan 17, 2024

What does this PR do?

This adds support for existing clusters that run gpu-operator. The NVIDIA DCGM dashboards are deployed and connected to AMP.

Closes #233

@lewinkedrs temporarily deployed to Observability Test, January 17, 2024 17:20, with GitHub Actions (inactive)
@lewinkedrs temporarily deployed to Observability Test, January 17, 2024 17:50, with GitHub Actions (inactive)
@lewinkedrs temporarily deployed to Observability Test, January 17, 2024 18:00, with GitHub Actions (inactive)
@lewinkedrs requested a review from @bonclay7, January 17, 2024 22:26

@bonclay7 (Member) left a comment

Nice! Please see my comments. The next step is to set up NVIDIA to actually test metrics collection.

docs/eks/gpumon.md (outdated, resolved)
docs/eks/gpumon.md (outdated, resolved)
docs/eks/gpumon.md (outdated, resolved)
@lewinkedrs temporarily deployed to Observability Test, January 19, 2024 17:33, with GitHub Actions (inactive)
@lewinkedrs temporarily deployed to Observability Test, January 19, 2024 17:57, with GitHub Actions (inactive)
@bonclay7 temporarily deployed to Observability Test, January 19, 2024 19:57, with GitHub Actions (inactive)
@lewinkedrs temporarily deployed to Observability Test, January 19, 2024 21:04, with GitHub Actions (inactive)
@bonclay7 temporarily deployed to Observability Test, January 23, 2024 15:28, with GitHub Actions (inactive)

@bonclay7 (Member) left a comment

LGTM

@bonclay7 temporarily deployed to Observability Test, January 24, 2024 16:58, with GitHub Actions (inactive)
@bonclay7 changed the title from "gpu dashboards" to "feat(module: eks-monitoring) Add NVIDIA gpu monitoring dashboards", Jan 24, 2024
@bonclay7 merged commit ada16d5 into main, Jan 24, 2024
36 checks passed
@bonclay7 deleted the gpu branch, January 24, 2024 17:09
Linked issue: [FEATURE] Example that shows how to configure ADOT, AMP and AMG for NVIDIA GPU Operator
2 participants