[Draft PR][Don't merge until upgrade release rolls out] Fix telemetry spike issue in telegraf removal #926

Closed
wants to merge 34 commits into from

Conversation

Sohamdg081992
Contributor

PR Description

This PR fixes the telemetry spike that appeared after telegraf was removed. It fixes a bug in the ticker used for telemetry aggregation and adds mutex locks so that metric values are read correctly.

test cluster

test image: 6.8.14-fixhispiketelemetryNew-06-24-2024-c6cbed86

AI resource

AI query:

customMetrics
| where customDimensions contains "testrecalertssohamksm"
| extend agentversion = tostring(customDimensions.agentversion)
| where agentversion !contains "win"
| where customDimensions.agentversion contains "6.8.14-fixhispiketelemetryNew-06-24-2024-c6cbed86"
| extend agentversion = strcat(agentversion, "/", name)
| summarize count() by bin(timestamp, 5m), agentversion
| render timechart

The screenshot below shows that telemetry volume has dropped with the ticker fix. The spike occurred on the following metrics, which use the ticker: otelcollector_cpu_usage_050, otelcollector_cpu_usage_095, metricsextension_cpu_usage_050, metricsextension_cpu_usage_095, metricsextension_memory_rss_050, metricsextension_memory_rss_095, otelcollector_memory_rss_050, otelcollector_memory_rss_095.
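For readers unfamiliar with the metric names, here is a rough sketch of how such values could be derived. It assumes the `_050` and `_095` suffixes denote the 50th and 95th percentiles of samples collected between ticks, which the PR text does not state. The `percentile` helper is hypothetical and uses the nearest-rank method; the agent may compute these values differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of values. It is a hypothetical helper, not code from the PR.
// An empty input yields 0.
func percentile(values []float64, p float64) float64 {
	if len(values) == 0 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)

	// Nearest-rank: the smallest rank whose share of samples is >= p%.
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Made-up CPU usage samples for one aggregation interval.
	cpu := []float64{5, 1, 3, 2, 4}
	fmt.Println("p50:", percentile(cpu, 50))
	fmt.Println("p95:", percentile(cpu, 95))
}
```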

image

Memory usage of the pods is no longer high.

image

New Feature Checklist

  • List telemetry added about the feature.
  • Link to the one-pager about the feature.
  • List any tasks necessary for release (3P docs, AKS RP chart changes, etc.) after merging the PR.
  • Attach results of scale and perf testing.

Tests Checklist

  • Have end-to-end Ginkgo tests been run on your cluster and passed? To bootstrap your cluster to run the tests, follow these instructions.
    • Labels used when running the tests on your cluster:
      • operator
      • windows
      • arm64
      • arc-extension
      • fips
  • Have new tests been added? For features, have tests been added for this feature? For fixes, is there a test that could have caught this issue and could validate that the fix works?

Sohamdg081992 requested a review from a team as a code owner on June 25, 2024 18:46
Sohamdg081992 changed the title from "Fixhispiketelemetry new" to "[Draft PR][Don't merge] Fix telemetry spike issue in telegraf removal" on Jun 25, 2024
Sohamdg081992 changed the title from "[Draft PR][Don't merge] Fix telemetry spike issue in telegraf removal" to "[Draft PR][Don't merge until upgrade release rolls out] Fix telemetry spike issue in telegraf removal" on Jun 25, 2024

github-actions bot commented Jul 3, 2024

This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

This PR was closed because it has been stalled for 12 days with no activity.

github-actions bot commented Aug 6, 2024

This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

This PR was closed because it has been stalled for 12 days with no activity.

github-actions bot closed this on Aug 12, 2024
2 participants