-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pulse-go doesn't recover from dropped amqp #7
Comments
Oh wow, I had no idea this is used anywhere in production. It wasn't really intended for production use, but that should have probably been written explicitly in the README. Presumably, somebody found it and started using it. Apologies about that. I wrote it back in 2015 before we even had generic-worker mostly as a command line utility during development for sniffing pulse messages to help troubleshoot issues. My memory is a little hazy about the internals, but yes, it seems sensible to try to reconnect if the connections drops, or to exit the process, rather than let it remain running but not working. |
By the way, if you are using it to listen to taskcluster exchanges, there are bindings available in the taskcluster go client, e.g. see this test example. In any case, if this really is to be used in production, we should certainly harden it, and make sure it is tested for robustness. |
Heh, it's being used by https://github.com/mozilla-services/cloudops-deployment-proxy for listening for changes to ci-admin/ci-configuration and then triggering Jenkins builds |
@jbuck if we have pulse-go exit on failure, can we then reattach with another process easily? If so, that seems like a fairly straightforward solution. Other options: pulse-go reattaches; we switch cloudops-deployment-proxy to the taskcluster go client. Between those two I'd probably look at taskcluster go client first, but I defer to the two of you :) |
I think I'd agree - explicitly exiting in pulse-go if the connection drops, together with a script like: #!/bin/bash
# if pulse-go dies, restart it - can happen if amqp connection drops
while true; do
pulse-go ..... || true
done |
it looks like we're using the go integration directly, not a separate process. If the go library killed the process though that'd be fine, we can just tell systemd to restart infinitely |
Do we replace the |
I think the cleanest would be for the library to signal that the connection dropped (e.g. by closing a channel), and the calling code to receive the signal and then it choose to panic or e.g. call os.Exit. The issue with having a panic directly inside the library code (where the issue is detected) is that other apps that import the library may not want their process to be terminated if the connection drops. @jbuck Can you link to the go code that imports the library? |
It looks like we drop a message in the logs:
pulse-go/pulse/pulse.go
Line 340 in 515edd0
This resulted in two instances of cloudops-jenkins not responding to hg.m.o events, which halts rolling out changes to the FirefoxCI tc cluster.
We were wondering if we could either
t
time orn
failed attempts or what.or both.
@petemoore any thoughts?
The text was updated successfully, but these errors were encountered: