
[kube-prometheus-stack] Feat: Improve cAdvisor metrics scrape #5063

Merged: 1 commit into main on Dec 16, 2024

Conversation

@SuperQ (Contributor) commented Dec 15, 2024

What this PR does / why we need it

Improve the collection of cAdvisor metrics by adjusting the scrape interval to match the kubelet hardcoded minimum housekeeping interval.

  • Set the cAdvisor and resource scrape intervals to 10s by default.
  • Use the user-configured kubelet metrics interval if it is not the default.
  • Enforce honorTimestamps for cAdvisor and resource if timestamp staleness is enabled (a values.yaml sketch follows below).
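
As a rough illustration only (not the chart's literal schema), these options might be expressed in values.yaml along the following lines. Key names other than kubelet.serviceMonitor.trackTimestampsStaleness, which is referenced later in this thread, are assumptions:

```yaml
# Illustrative sketch only; key names besides trackTimestampsStaleness
# are assumptions, not the chart's confirmed values schema.
kubelet:
  serviceMonitor:
    # Match the kubelet's hardcoded 10s minimum housekeeping interval
    # for the cAdvisor and resource endpoints, unless overridden.
    cAdvisorInterval: 10s        # assumed key name
    resourceInterval: 10s        # assumed key name
    # When timestamp staleness tracking is on, also honor cAdvisor's
    # own collection timestamps.
    trackTimestampsStaleness: true
    honorTimestamps: true
```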

Which issue this PR fixes

(optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged)

  • fixes #

Special notes for your reviewer

Checklist

  • DCO signed
  • Chart Version bumped
  • Title of the PR starts with chart name (e.g. [prometheus-couchdb-exporter])

@SuperQ (Contributor, Author) commented Dec 15, 2024

Follow-up to #5050 (CC @Breee).

For reference, this is the hardcoded housekeeping interval.

Note, this will likely cause PrometheusDuplicateTimestamps to fire for most users. We should probably remove/disable this alert.

We could disable it by default in the values.yaml, or we could remove it from the upstream kube-prometheus alert mixin.
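
If the values.yaml route were chosen, a minimal sketch, assuming the chart's defaultRules.disabled map can be keyed by alert name as it is for other bundled rules, could look like this:

```yaml
# Sketch only: assumes defaultRules.disabled accepts alert names,
# as it does for other rules shipped with the chart.
defaultRules:
  disabled:
    PrometheusDuplicateTimestamps: true
```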

@jkroepke (Member) commented

Hey @SuperQ

I feel the information about trackTimestampsStaleness and cAdvisor is interesting for some users (and very important for maintainers).

WDYT about leaving a note or a link next to trackTimestampsStaleness that explains this a little? The link to the Slack discussion will be unavailable in 90-180 days; one or two sentences are fine.


Note, this will likely cause PrometheusDuplicateTimestamps to fire for most users. We should probably remove/disable this alert.

Does this also produce a lot of Error on ingesting samples with different value but same timestamp warnings in the Prometheus logs? If so, I see some incoming GitHub issues where users may complain about this. :/

From my point of view, it should be sufficient to exclude the kubelet scrape job from the PrometheusDuplicateTimestamps alert, or am I wrong?

@SuperQ (Contributor, Author) commented Dec 16, 2024

Does this include a lot of warnings Error on ingesting samples with different value but same timestamp

It shouldn't. This is more related to ingesting the same timestamps more than once. Maybe this is a Prometheus metric "bug", in that we're counting the same condition under multiple reasons. I haven't investigated the code for this.

@SuperQ (Contributor, Author) commented Dec 16, 2024

Sure, so here are the details on trackTimestampsStaleness.

Normal metrics exposed to Prometheus do not include timestamps. Metrics are supposed to be produced "on demand", so the data timestamp is the same as the scrape timestamp.

cAdvisor does not follow this best practice. Instead, cAdvisor has its own internal data-gathering loop and exposes timestamps on container_... metrics for the exact time at which the data was collected.

Prometheus scrapes have millisecond time accuracy and are intentionally splayed over time to avoid correlated network traffic. But user queries do have a correlated timestamp, and systems like Grafana will snap the query to a round number like 00:30:00. This means that when you query at 00:30:00 but the metric you want was scraped at 00:29:55.231, Prometheus has to "look back" some amount of time to find that sample.

This lookback defaults to 5 minutes in order to handle a variety of scrape interval use cases. The downside to lookback is that if a metric is removed from an endpoint, or a target is down, Prometheus would continue to show data for up to 5 minutes after it last really existed. This was confusing for users.
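
For reference, this window is Prometheus's query lookback delta. A sketch of the corresponding setting on the Prometheus CRD itself (shown directly on the CRD rather than through this chart's values, and assuming the operator's spec.query.lookbackDelta field):

```yaml
# Sketch of the Prometheus CRD's query settings; 5m is the default
# lookback window described above.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  query:
    lookbackDelta: 5m
```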

In Prometheus 2.0 we improved on this situation by introducing a staleness handler. When a metric/label combination is removed from a scrape, it is marked as "stale" in the TSDB.

For historical reasons, Prometheus handles metrics that include the optional timestamp field without this staleness marker.

Because cAdvisor has its own internal data-gathering loop, these explicit timestamps make the data more accurate, since many seconds can pass between when the data is collected and when Prometheus scrapes it.

But now, without the staleness markers, Prometheus will continue to apply the lookback of up to 5 minutes to find metrics. This can be very confusing for users, as they will "see" cAdvisor metrics that don't exist anymore. The common example is metrics like container_memory_rss: for 5 minutes after a Pod is deleted, users will still see its memory RSS series.

The new track-timestamps-staleness feature was introduced to fix this behavior. Now, with trackTimestampsStaleness enabled, metrics like container_memory_rss will correctly be removed when the Pod is removed.

The previous workaround was to disable "honorTimestamps" for cAdvisor. This worked to fix the staleness handling issue, but it causes the data to be slightly inaccurate, as cAdvisor's internal loop is not in sync with Prometheus. So in order to fully enable trackTimestampsStaleness, we also need to force-enable the honorTimestamps feature.

The second half of this change has to do with the fact that cAdvisor does not have a strict timer for internal data collection. It uses variable timing, from 10 to 15 seconds, in Kubernetes. So in order to capture every sample produced by cAdvisor, we need to scrape slightly faster than the minimum housekeeping interval.
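
Put together, the generated cAdvisor endpoint would roughly look like the ServiceMonitor snippet below. This is a sketch, not the chart's literal template output; selector, TLS, and relabeling details are omitted or assumed:

```yaml
# Rough sketch of the resulting cAdvisor endpoint; real output also
# carries TLS config, bearer token, and relabelings.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kubelet   # assumed label
  endpoints:
    - port: https-metrics
      scheme: https
      path: /metrics/cadvisor
      interval: 10s                   # kubelet's minimum housekeeping interval
      honorTimestamps: true           # keep cAdvisor's own collection timestamps
      trackTimestampsStaleness: true  # mark series stale when they disappear
```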

@SuperQ SuperQ force-pushed the superq/cadvisor_improvement branch from 2adb9e3 to 23c1934 Compare December 16, 2024 14:27
@jkroepke (Member) left a review comment:

I have some questions here, and I guess one Go template condition is set on the wrong endpoint.

Thanks for the detailed comment. The link should be added to kubelet.serviceMonitor.trackTimestampsStaleness in the values.yaml!
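
Something along these lines next to the setting would capture it; the wording below is only a suggestion, not the chart's actual documentation:

```yaml
kubelet:
  serviceMonitor:
    ## cAdvisor exposes its own collection timestamps on container_* metrics.
    ## Without staleness tracking, Prometheus keeps returning deleted series
    ## (for example container_memory_rss) for up to the 5m query lookback.
    ## See the track_timestamps_staleness scrape option in the Prometheus docs.
    trackTimestampsStaleness: true
```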

@SuperQ SuperQ force-pushed the superq/cadvisor_improvement branch from 23c1934 to 5727923 Compare December 16, 2024 20:03
Improve the collection of cAdvisor metrics by adjusting the scrape
interval to match the kubelet hardcoded minimum housekeeping interval.
* Set the cAdvisor and resource scrape intervals to 10s by default.
* Use the user-configured kubelet metrics interval if it is not the default.
* Enforce honorTimestamps for cAdvisor and resource if timestamp
  staleness is enabled.

Signed-off-by: SuperQ <[email protected]>
@SuperQ SuperQ force-pushed the superq/cadvisor_improvement branch from 5727923 to eece396 Compare December 16, 2024 20:07
@SuperQ SuperQ marked this pull request as ready for review December 16, 2024 20:07
@SuperQ SuperQ requested a review from jkroepke December 16, 2024 20:07
@jkroepke (Member) commented

@SuperQ what about the PrometheusDuplicateTimestamps alert that you mentioned earlier? Should it be covered in this PR or in a follow-up, if users complain about it?

@SuperQ (Contributor, Author) commented Dec 16, 2024

I've been testing the 10s interval on one of my k3s test clusters all day and haven't been able to reproduce the duplicate timestamp issue there. Perhaps there is something abnormal about my production setup. Maybe it's a cAdvisor-under-heavy-load problem? I'm not sure at this point.

@jkroepke (Member) left a comment:

Thanks a lot!

@jkroepke jkroepke merged commit 22d1213 into main Dec 16, 2024
5 checks passed
@jkroepke jkroepke deleted the superq/cadvisor_improvement branch December 16, 2024 21:26
@festeveira (Contributor) commented Dec 18, 2024

Hello, after these changes I started repeatedly getting log messages like the following one, for every kubelet target, in the Prometheus container:

time=2024-12-18T17:08:26.335Z level=WARN source=scrape.go:1877 msg="Error on ingesting out-of-order samples" component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-agent-kube-prom-kubelet/1 target=https://10.134.60.34:10250/metrics/cadvisor num_dropped=2646

My Prometheus setup (multiple agents and one receiver) is configured with a scrapeInterval of 1m. If I set the kubelet scrape interval to 1m as well, the messages disappear. Furthermore, when scraping at 10s, data points in Grafana stay 1m apart (querying the remote receiver Prometheus, not the Prometheus scraping these metrics). However, the graphs produced by Prometheus have gaps in them for cAdvisor metrics.

Are these log messages expected? Thank you in advance.
