Runaway CPU usage #352
I've not experienced this type of growth with the k8s agent but have seen it with the collector and other k8s deployments. I've never seen this pattern actually hit the limits of the deployment, though. How does the growth compare to its requests/limits? Is it ever crashing?
Given that the freshly started agent needs almost no CPU (and given the overall nature of what the agent does), this behaviour is suspect irrespective of any limits set.
As explained in the description, this agent is deployed using all the defaults from the helm chart, whatever they might be. Assuming that the pod would recycle on a crash, it doesn't seem like it crashed in the last 28 days.

The other thing I'm noticing is that there are bursts of extremely high latency reported for the kubernetes-logs dataset.

I should also have noted in the original report that this cluster is running on EKS with relatively low overall load. Here's the overall node CPU usage for the cluster; the correlation between the agent CPU use, when the second node (and hence the second agent) came online, and when I restarted it is clear to see.
@DavidS-om I will keep investigating this, but note that the helm chart sets no default requests/limits: https://github.com/honeycombio/helm-charts/blob/0cba10473077edc7fbf56d3259ac6d135a67e4cb/charts/honeycomb/values.yaml#LL92-L98C19. Best practice is to set those values.
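For reference, a minimal sketch of what such an override might look like in a values file. Where exactly the `resources` key sits in the Honeycomb chart is an assumption here, and the numbers are placeholders to tune for your workload; only the block's shape follows the standard Kubernetes schema.

```yaml
# values.yaml override -- a sketch only; the key path within the chart is an
# assumption and the figures are placeholders, not recommendations.
resources:
  requests:
    cpu: 100m      # baseline CPU the scheduler reserves for the agent
    memory: 128Mi
  limits:
    cpu: 500m      # CPU use beyond this is throttled, not killed
    memory: 256Mi  # exceeding this gets the container OOM-killed
```

Applied with something like `helm upgrade <release> honeycomb/honeycomb -f values.yaml` (release and chart names assumed).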
@TylerHelmuth I can see how that would keep this from impacting the rest of my cluster, but assuming that k8s only throttles, rather than kills, pods for hitting CPU limits, that would likely just mean I'm not going to get (timely) logs and metrics from the agent. That reminded me to check the event latency of the metrics dataset, and something curious is happening there: since restarting the pod last week, latency for the metrics dataset has also gone through the roof.
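One way to read that trade-off, sketched below with placeholder values: a CPU limit overrun only leads to throttling (and the delayed logs/metrics described above), while a memory limit overrun OOM-kills the container, so some setups give the agent a CPU request without a CPU limit and rely on the memory limit to protect the node.

```yaml
# Sketch of a "burstable CPU" variant -- placeholder values, not a recommendation.
resources:
  requests:
    cpu: 100m      # reserved for the agent; it may burst above this if the node has headroom
    memory: 128Mi
  limits:
    memory: 256Mi  # memory is still capped; overrunning it OOM-kills the pod
    # no cpu limit: avoids CFS throttling and the resulting telemetry lag
```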
Looks like this was resolved.
Versions
Steps to reproduce
Additional context
Here you can see the CPU usage of the two Honeycomb agents (one per node) over the last 28 days; it is steadily increasing:
whereas the volume of collected logs is not growing:
Deleting the agent's pod and letting k8s redeploy a new one restarts the growth at 0.
There is nothing obvious in the logs.
This is a development cluster that is not seeing a lot of log traffic, but a lot of k8s traffic (e.g. pods getting replaced; see deployment markers).