run-away cpu usage #352

Closed
DavidS-ovm opened this issue Mar 17, 2023 · 6 comments

@DavidS-ovm

Versions

Image:          honeycombio/honeycomb-kubernetes-agent:2.6.0
Image ID:       docker.io/honeycombio/honeycomb-kubernetes-agent@sha256:1f68553ba8db5c86a48355f288a97485905d75bf81c919064c6fc864316ba182

Steps to reproduce

  1. Deploy the agent through the helm chart:
resource "helm_release" "honeycomb" {
  name       = "honeycomb"
  repository = "https://honeycombio.github.io/helm-charts"
  chart      = "honeycomb"
  timeout    = 60
  values = [yamlencode({
    honeycomb = { apiKey = var.honeycomb_api_key },
    metrics = {
      enabled      = true
      interval     = 1 * 60 * 1000 * 1000 * 1000 # nanoseconds
      clusterName  = "k8s-${var.terraform_env_name}"
      metricGroups = ["node", "pod", "volume"]
    },
    watchers = [ /* 10 watchers */ ]
  })]
}
  2. Wait
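
The `interval` value in the configuration above is expressed in nanoseconds, which makes the literal hard to read at a glance. As a minimal sketch, the same arithmetic can be spelled out in Terraform locals. The names `ns_per_second` and `metrics_interval_ns` are hypothetical and not part of the original configuration; only the arithmetic comes from the report.

```hcl
locals {
  # Hypothetical helper names, introduced here for illustration only.
  ns_per_second = 1000 * 1000 * 1000

  # 1 minute = 60 seconds = 60000000000 nanoseconds,
  # the same value as 1 * 60 * 1000 * 1000 * 1000 in the report.
  metrics_interval_ns = 1 * 60 * local.ns_per_second
}
```

With such a local in place, the `metrics` block could reference `local.metrics_interval_ns` instead of repeating the multiplication inline.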

Additional context

Here you can see the cpu usage of the two honeycomb agents (one per node) over the last 28 days. As you can see, the cpu usage is steadily increasing:
[Screenshot: 2023-03-17_09MS+0100_1794x888]

whereas the actual collected logs are not growing in volume:
[Screenshot: 2023-03-17_09MS+0100_1804x877]

Deleting the agent's pod and letting k8s redeploy a new one restarts the growth from 0.

There is nothing obvious in the logs.

This is a development cluster that is not seeing a lot of log traffic, but a lot of k8s traffic (i.e. pods getting replaced, see deployment markers).

@TylerHelmuth
Contributor

TylerHelmuth commented Mar 17, 2023

I've not experienced this type of growth with the k8s agent but have seen it with the collector and other k8s deployments. I've never seen this pattern actually hit the limits of the deployment though.

How does the growth compare to its requests/limits? Is it ever crashing?

@DavidS-ovm
Author

> I've not experienced this type of growth with the k8s agent but have seen it with the collector and other k8s deployments. I've never seen this pattern actually hit the limits of the deployment though.

Given that a freshly started agent needs almost no CPU (and given the overall nature of what the agent does), this behaviour is suspect, irrespective of any limits set.

> How does the growth compare to its requests/limits? Is it ever crashing?

As explained in the description, this agent is deployed using all the defaults from the helm chart, whatever they might be. Assuming that the pod would recycle on a crash, it doesn't seem like it did crash in the last 28 days.

The other thing I'm noticing is that there are bursts of extremely high latency reported for the kubernetes-logs dataset:

[Screenshot: 2023-03-20_08MS+0100_774x591]

I should also have noted in the original report that this cluster is running on EKS with relatively low overall load. Here's the overall node CPU usage for the cluster:

[Screenshot: 2023-03-20_08MS+0100_1800x833]

The agent CPU use clearly correlates with two events: the second node (and hence the second agent) coming online, and my restart of the agent.

@TylerHelmuth
Contributor

@DavidS-ovm I will keep investigating this, but note that the helm chart sets no default requests/limits: https://github.com/honeycombio/helm-charts/blob/0cba10473077edc7fbf56d3259ac6d135a67e4cb/charts/honeycomb/values.yaml#LL92-L98C19. Best practice is to set those values.
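
As a hedged illustration of setting those values from the Terraform release used in the report, the sketch below adds requests and limits to the chart values. It assumes the chart accepts a top-level `resources` map in the usual Kubernetes shape, which should be checked against the linked `values.yaml`. The CPU and memory figures are placeholders, not recommendations from the maintainers.

```hcl
resource "helm_release" "honeycomb" {
  name       = "honeycomb"
  repository = "https://honeycombio.github.io/helm-charts"
  chart      = "honeycomb"

  values = [yamlencode({
    honeycomb = { apiKey = var.honeycomb_api_key }

    # Assumption: the chart reads a top-level `resources` map.
    # The numbers below are placeholders; size them for your own cluster.
    resources = {
      requests = { cpu = "100m", memory = "128Mi" }
      limits   = { cpu = "200m", memory = "256Mi" }
    }
  })]
}
```

In Kubernetes, a container that exceeds its CPU limit is throttled, while one that exceeds its memory limit is killed and restarted, so the two limits fail in different ways.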

TylerHelmuth added the "status: oncall" label (Flagged for awareness from Honeycomb Telemetry Oncall) on Mar 20, 2023
@DavidS-ovm
Author

@TylerHelmuth I can see how that would keep this from impacting the rest of my cluster, but assuming that k8s only throttles, not kills, pods for hitting limits, that'll likely just mean that I'm not gonna get (timely) logs and metrics from the agent.

That reminded me to check the event latency of the metrics dataset, and something curious is happening there:

[Screenshot: 2023-03-20_15MS+0100_717x448]

Since restarting the pod last week, latency for metrics has also gone through the roof.

@DavidS-ovm
Author

Since we stopped having a bunch of crash-looping pods in our test systems, the CPU and memory usage of the agent has remained flat at a very low level:

[Screenshot: 2023-03-27_16MS+0200_3724x1168]

kentquirk removed the "status: oncall" label (Flagged for awareness from Honeycomb Telemetry Oncall) on Mar 27, 2023
@MikeGoldsmith
Contributor

Looks like this was resolved.
