
Amon Agent Stops Sending Data #17

Closed
jyksnw opened this issue Mar 22, 2017 · 9 comments · Fixed by #18

Comments

jyksnw (Contributor) commented Mar 22, 2017

I believe this might be similar to the issue/request brought up in #9.

We have 2 servers that randomly stop sending data (ironically, I restarted them around the same time and they seem to stop sending data around the same time). When I check amonagent.log I can see that no new entries are logged from around the time they stop sending data.

I am still in the process of trying to debug/troubleshoot what might be causing this. None of our other servers have this issue, which leads me to think that either a) some other process on these servers is interrupting amonagent, or b) something about how these servers are configured (at the OS level) is causing the problem.

Some facts:

  • both servers run Ubuntu 14.04 Server
  • we have not enabled any health checks or plugins on these two servers
  • a manual restart of the amonagent service resolves the issue
  • a status check on the amonagent service before restarting indicates that the agent is running
  • both are running amonagent 0.7.2

I am going to try updating them to the latest amonagent today to see if that helps resolve the issue.

I will gladly update this issue with any additional findings or potential patches if I am able to track this down.

martinrusev (Member) commented

@jyksnw Can you check the log file? I think the default log level is set to INFO, which logs every request made plus a timestamp. It might be easier to debug if we know when the agent stopped sending data.

jyksnw (Contributor, Author) commented Mar 22, 2017

@martinrusev

Here are the relevant log lines from one of the servers (I have obfuscated the URI and API key).

time="2017-03-21T17:44:05-04:00" level=info msg="Metrics collected (Interval:1m0s)\n"
time="2017-03-21T17:44:05-04:00" level=info msg="Sending data to https://server1.example.com/api/system/v2/?api_key=XXXXXXXXXX\n"
time="2017-03-22T06:25:11-04:00" level=info msg="Starting Amon Agent (Version: 0.7.2)\n"
time="2017-03-22T06:25:11-04:00" level=info msg="Agent Config: Interval:1m0s\n"
time="2017-03-22T06:25:19-04:00" level=info msg="Metrics collected (Interval:1m0s)\n"
time="2017-03-22T06:25:20-04:00" level=info msg="Sending data to https://server1.example.com/api/system/v2/?api_key=XXXXXXXXXX\n"

jyksnw (Contributor, Author) commented Mar 23, 2017

I think I see the issue. It appears to be a two-part problem:

  1. After searching the logs I found that named had logged a number of errors indicating that it couldn't resolve our Amon server's hostname.
  2. Go's http.Client defaults to a timeout of 0, which means no timeout. It appears that in the scenario above the call to SendData never returns, because the client is stuck trying to complete a connection to a hostname it can't resolve or reach.

Luckily this is easy to fix by creating the http.Client with a specified timeout. I can add this in without any issue but wanted to know if the timeout should be a configuration option or a statically set value (say 10 seconds).
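For illustration, a minimal sketch of what such a client could look like in Go (the package and variable names here are hypothetical, not taken from the amonagent source):

package collector

import (
	"net/http"
	"time"
)

// Sketch only: replaces the zero-value Timeout (no timeout at all) with a
// fixed 10-second cap covering dial, TLS handshake, response headers and
// body read. Whether this should come from the agent config instead is the
// open question above.
var httpClient = &http.Client{
	Timeout: 10 * time.Second,
}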

I am going to create a local build to test this theory out, but after reading through our logs and looking into the http.Client request handling I am highly confident this was the cause of this issue.

jyksnw (Contributor, Author) commented Mar 23, 2017

Sorry, I didn't look into how the transport was constructed before commenting. It looks like a 10-second timeout is already being applied via the transport.

martinrusev (Member) commented

@jyksnw It could be a goroutine leak somewhere, although I do check for data races before releasing. One way to determine if that is the case is to monitor the memory usage.

What makes this one difficult to catch, I think, is that parts of this bug appear to be hardware/distro related. I personally have 5 agents that have been running since last August.
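One hedged way to check for a goroutine leak is to log the goroutine count and heap usage over time and watch for steady growth. This standalone sketch (not part of amonagent) only illustrates the idea:

package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Log goroutine count and heap usage once a minute; a leak shows up
	// as monotonic growth over hours or days.
	for range time.Tick(1 * time.Minute) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("goroutines=%d heap_alloc=%dKB",
			runtime.NumGoroutine(), m.HeapAlloc/1024)
	}
}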

jyksnw (Contributor, Author) commented Mar 23, 2017

We have 3 other servers with a similar hardware/distro configuration on which we haven't seen any issues.

CloudFlare has an excellent write-up and diagram outlining Go's client connection sequence and where each of the various timeout settings comes into play.

[Diagram omitted: Go's client connection sequence and the timeouts that cover each phase]
Source - CloudFlare: The complete guide to Go net/http timeouts

So even though a timeout is being set for ResponseHeaderTimeout, the request might never reach that point and still be stuck. There is a suggested Transport setup in the write-up that could be implemented. I will create a build for just these two servers with the suggested Transport setup and see if the issue presents itself again.
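Roughly, the kind of Transport the write-up suggests looks like this; the values and function name are illustrative, not the ones from the eventual fix:

package collector

import (
	"net"
	"net/http"
	"time"
)

// newHTTPClient builds a client with per-phase timeouts so a request can
// fail fast whether it is stuck dialing/resolving DNS, handshaking TLS,
// or waiting for response headers.
func newHTTPClient() *http.Client {
	transport := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   5 * time.Second, // TCP connect (includes DNS lookup)
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout:   5 * time.Second,
		ResponseHeaderTimeout: 10 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second, // overall cap on the whole exchange
	}
}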

martinrusev (Member) commented

@jyksnw Thanks for sharing the guide. Yes, this could be the issue: the amonagent does not have a cancel request policy, just a timeout.

jyksnw (Contributor, Author) commented Mar 23, 2017

I have a local branch that implements more fine-grained timeouts along with a cancel request policy that cancels the request after a 10-second delay. I will test it against the two servers we have been having issues with, to see whether it solves the problem and whether it introduces any other issues.
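One way to express such a cancel policy in Go 1.7+ is a context deadline on each request. The sketch below only illustrates the approach; the function name and the exact mechanism in the actual branch may differ (see PR #18 for the real change):

package collector

import (
	"context"
	"io"
	"io/ioutil"
	"net/http"
	"time"
)

// sendData posts the collected payload and gives up entirely if the request
// has not completed within 10 seconds, no matter which phase it is stuck in.
func sendData(client *http.Client, url string, payload io.Reader) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	req, err := http.NewRequest("POST", url, payload)
	if err != nil {
		return err
	}
	resp, err := client.Do(req.WithContext(ctx))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	io.Copy(ioutil.Discard, resp.Body) // drain so the connection can be reused
	return nil
}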

martinrusev (Member) commented

@jyksnw Cool. If it works, you can submit it as a pull request and I will merge it and push a new release of the agent with the fix.

martinrusev added a commit that referenced this issue Mar 25, 2017
Fix #17 - Amon Agent Stops Sending Data