Amon Agent Stops Sending Data #17
@jyksnw Can you check the log file? I think the default log level is INFO, which logs every request made along with a timestamp. It may be easier to debug if we know when the agent stopped sending data.
Here are the lines of the log from one of the servers (I have obfuscated the URI and API key).
I think I see the issue. It appears to be a two part issue:
Luckily this is easy to fix by creating the http.Client with a specified timeout. I can add this without any issue, but I wanted to know whether the timeout should be a configuration option or a statically set value (say 10 seconds). I am going to create a local build to test this theory, but after reading through our logs and looking into how http.Client handles requests, I am highly confident this was the cause of the issue.
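For reference, a client-level timeout of the kind described above could look like the minimal sketch below. The 10 second value comes from the suggestion in this comment; `newTimeoutClient` and `requestTimeout` are hypothetical names used for illustration and are not taken from the amonagent source.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// requestTimeout is an assumed value, based on the "say 10 seconds" suggestion.
const requestTimeout = 10 * time.Second

// newTimeoutClient returns an http.Client whose Timeout bounds the whole
// exchange: connecting, sending the request, and reading the response body.
// A zero Timeout (the default) means no limit at all.
func newTimeoutClient(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}

func main() {
	client := newTimeoutClient(requestTimeout)
	fmt.Println("client timeout:", client.Timeout)
}
```

Whether the value should be static or read from the agent's configuration is the open question raised above; the constructor takes the duration as a parameter so either choice would fit.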
Sorry, I didn't look into how the transport was constructed before commenting. It looks like a 10 second timeout is already being applied via the transport.
@jyksnw It could be a goroutine leak somewhere, although I do check for data races before releasing. One way to determine whether that is the case is to monitor the memory usage. What makes this one difficult to catch, I think, is that parts of this bug are hardware/distro related. I personally have 5 agents that have been running since last August.
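As a rough illustration of the monitoring idea above, the sketch below samples the goroutine count and heap usage from inside a Go process using the standard `runtime` package. `snapshot` and `takeSnapshot` are hypothetical names; nothing here is taken from amonagent, and in practice the same signal can be observed externally by watching the process's memory.

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot holds the two numbers worth tracking over time when a
// goroutine leak is suspected.
type snapshot struct {
	Goroutines int    // goroutines currently alive
	HeapAlloc  uint64 // bytes of allocated heap objects
}

// takeSnapshot reads the current goroutine count and heap usage.
func takeSnapshot() snapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return snapshot{Goroutines: runtime.NumGoroutine(), HeapAlloc: m.HeapAlloc}
}

func main() {
	s := takeSnapshot()
	fmt.Printf("goroutines=%d heap_alloc_bytes=%d\n", s.Goroutines, s.HeapAlloc)
}
```

Logging a snapshot like this at a fixed interval would show a leak as a goroutine count that keeps climbing instead of levelling off.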
We have 3 other servers running with a similar hardware/distro configuration that we haven't seen any issues on. CloudFlare has an excellent write-up and graph outlining Go's client connection sequence and where each of the various timeout settings comes into play.
So although a timeout is being set for ResponseHeaderTimeout, the request might not have reached that point and could still be stuck. There is a suggested Transport setup in the write-up that could be implemented. I will create a build for just these two servers with the suggested Transport setup and see if the issue presents itself again.
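To make the point about per-phase timeouts concrete, here is a minimal sketch of a Transport that bounds the dial, TLS handshake, and response-header phases separately, plus an overall client timeout. All durations and the names `newPhasedTransport` and `newPhasedClient` are assumptions for illustration; they are neither the exact values from the write-up nor the agent's actual configuration.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// Assumed values for illustration only.
const (
	dialTimeout           = 5 * time.Second
	tlsHandshakeTimeout   = 5 * time.Second
	responseHeaderTimeout = 10 * time.Second
	overallTimeout        = 30 * time.Second
)

// newPhasedTransport sets a separate limit for each phase of a request, so a
// stall before the response-header phase (for example while connecting) is
// also bounded.
func newPhasedTransport() *http.Transport {
	return &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   dialTimeout, // limit on establishing the TCP connection
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout:   tlsHandshakeTimeout,
		ResponseHeaderTimeout: responseHeaderTimeout, // after the request is fully written
		ExpectContinueTimeout: 1 * time.Second,
	}
}

// newPhasedClient adds an overall deadline on top of the per-phase limits.
func newPhasedClient() *http.Client {
	return &http.Client{Transport: newPhasedTransport(), Timeout: overallTimeout}
}

func main() {
	c := newPhasedClient()
	fmt.Println("overall timeout:", c.Timeout)
}
```

ResponseHeaderTimeout only starts counting once the request has been written, which is why it alone cannot catch a request that hangs earlier in the sequence.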
@jyksnw Thanks for sharing the guide. Yes, this could be the issue: amonagent does not have a cancel-request policy, just a timeout.
I have a local branch that implements more fine-grained timeouts and adds a cancel-request policy that currently cancels the request after a 10 second delay. I will test this against the two servers we have been having issues with, to see whether it solves the problem and whether it introduces any other potential issues.
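One way a cancel-request policy like the one described above can be expressed in current Go is with a context deadline attached to the request. The sketch below is not the code from the local branch; `fetchWithCancel`, `newSlowServer`, and the demo durations are hypothetical, and a local test server stands in for the real endpoint.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

const (
	cancelAfter = 10 * time.Second       // the delay mentioned above
	demoDelay   = 500 * time.Millisecond // how long the demo server stalls
	demoTimeout = 100 * time.Millisecond // short deadline to trigger a cancel
)

// fetchWithCancel performs a GET that is cancelled if it has not completed,
// body included, within timeout.
func fetchWithCancel(client *http.Client, url string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// newSlowServer returns a local test server that waits for delay before
// replying, or gives up early if the client goes away.
func newSlowServer(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
			fmt.Fprintln(w, "ok")
		case <-r.Context().Done():
		}
	}))
}

func main() {
	srv := newSlowServer(demoDelay)
	defer srv.Close()

	// The server stalls longer than the deadline, so the request is cancelled.
	_, err := fetchWithCancel(srv.Client(), srv.URL, demoTimeout)
	fmt.Println("short deadline error:", err)
}
```

Because the deadline covers the whole request rather than a single phase, a request stuck at any point in the connection sequence is abandoned once the delay expires.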
@jyksnw Cool. If it works, you can submit it as a pull request and I will merge it and push a new release of the agent with the fix.
Fix #17 - Amon Agent Stops Sending Data
I believe this might be similar to the issue/request brought up in #9.
We have 2 servers that randomly stop sending data (ironically I restart them around the same time and they seem to stop sending data around the same time). When I check the amonagent.log I can see that no new entries are logged starting around the time they stop sending data.
I am still in the process of trying to debug/troubleshoot what might be causing this. None of our other servers have this issue, which leads me to think that either a) some other process on these servers is interrupting amonagent or b) something about how these servers are configured (at the OS level) is causing the problem.
Some facts:
I am going to try updating them to the latest amonagent today to see if that helps resolve the issue.
I will gladly update this issue with any additional findings or potential patches if I am able to track this down.