Runaway CPU usage #352
I've not experienced this type of growth with the k8s agent but have seen it with the collector and other k8s deployments. I've never seen this pattern actually hit the limits of the deployment, though. How does the growth compare to its requests/limits? Is it ever crashing?
Given that the freshly started agent needs almost no CPU (and given the overall nature of what the agent does), this behaviour is suspect irrespective of any limits set.
As explained in the description, this agent is deployed using all the defaults from the helm chart, whatever they might be. Assuming that the pod would recycle on a crash, it doesn't seem like it crashed in the last 28 days.

The other thing I'm noticing is that there are bursts of extremely high latency reported for the kubernetes-logs dataset.

I should also have noted in the original report that this cluster is running on EKS with relatively low overall load. Here's the overall node CPU usage for the cluster; the correlation between the agent CPU use, when the second node (and hence the second agent) came online, and when I restarted it is clear to see.
@DavidS-om I will keep investigating this, but note that the helm chart sets no default requests/limits: https://github.com/honeycombio/helm-charts/blob/0cba10473077edc7fbf56d3259ac6d135a67e4cb/charts/honeycomb/values.yaml#LL92-L98C19. Best practice is to set those values.
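For reference, a minimal sketch of what such an override might look like in a values file. Where exactly the `resources` key sits in the Honeycomb chart is an assumption here, and the numbers are placeholders to tune for your workload; only the block's shape follows the standard Kubernetes schema.

```yaml
# values.yaml override -- a sketch only; the key path within the chart is an
# assumption and the figures are placeholders, not recommendations.
resources:
  requests:
    cpu: 100m      # baseline CPU the scheduler reserves for the agent
    memory: 128Mi
  limits:
    cpu: 500m      # CPU use beyond this is throttled, not killed
    memory: 256Mi  # exceeding this gets the container OOM-killed
```

Applied with something like `helm upgrade <release> honeycomb/honeycomb -f values.yaml` (release and chart names assumed).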
@TylerHelmuth I can see how that would keep this from impacting the rest of my cluster, but assuming that k8s only throttles, rather than kills, pods for hitting CPU limits, that would likely just mean I'm not going to get (timely) logs and metrics from the agent. That reminded me to check the event latency of the metrics dataset, and something curious is happening there: since restarting the pod last week, latency for the metrics dataset has also gone through the roof.
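One way to read that trade-off, sketched below with placeholder values: a CPU limit overrun only leads to throttling (and the delayed logs/metrics described above), while a memory limit overrun OOM-kills the container, so some setups give the agent a CPU request without a CPU limit and rely on the memory limit to protect the node.

```yaml
# Sketch of a "burstable CPU" variant -- placeholder values, not a recommendation.
resources:
  requests:
    cpu: 100m      # reserved for the agent; it may burst above this if the node has headroom
    memory: 128Mi
  limits:
    memory: 256Mi  # memory is still capped; overrunning it OOM-kills the pod
    # no cpu limit: avoids CFS throttling and the resulting telemetry lag
```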
Looks like this was resolved.
Versions
Steps to reproduce
Additional context
Here you can see the CPU usage of the two Honeycomb agents (one per node) over the last 28 days; it is steadily increasing:
whereas the volume of collected logs is not growing:
Deleting the agent's pod and letting k8s redeploy a new one restarts the growth at 0.
There is nothing obvious in the logs.
This is a development cluster that is not seeing a lot of log traffic, but a lot of k8s traffic (e.g. pods getting replaced; see deployment markers).