pulse-go doesn't recover from dropped amqp #7

escapewindow · 2022-01-12T21:46:56Z

It looks like we drop a message in the logs:

Line 340 in 515edd0

fmt.Println("AMQP channel closed - has the connection dropped?")

This resulted in two instances of cloudops-jenkins not responding to hg.m.o events, which halts rolling out changes to the FirefoxCI tc cluster.

We were wondering if we could either:

- add louder notifications: a Slack alert, an email, or something else?
- auto-recover, whether that's killing pulse-go for a restart, killing the container for a restart, or reconnecting to amqp? I'm not sure if this would be on the first failure, after some time t, or after n failed attempts.

or both.

@petemoore any thoughts?


petemoore · 2022-01-13T13:49:46Z

Oh wow, I had no idea this is used anywhere in production. It wasn't really intended for production use, but that should probably have been written explicitly in the README. Presumably, somebody found it and started using it. Apologies about that. I wrote it back in 2015, before we even had generic-worker, mostly as a command line utility during development for sniffing pulse messages to help troubleshoot issues.

My memory is a little hazy about the internals, but yes, it seems sensible to try to reconnect if the connection drops, or to exit the process, rather than let it remain running but not working.

petemoore · 2022-01-13T13:53:18Z

By the way, if you are using it to listen to taskcluster exchanges, there are bindings available in the taskcluster go client, e.g. see this test example.

In any case, if this really is to be used in production, we should certainly harden it, and make sure it is tested for robustness.

jbuck · 2022-01-13T15:27:38Z

Heh, it's being used by https://github.com/mozilla-services/cloudops-deployment-proxy for listening for changes to ci-admin/ci-configuration and then triggering Jenkins builds

escapewindow · 2022-01-13T16:41:46Z

@jbuck if we have pulse-go exit on failure, can we then reattach with another process easily? If so, that seems like a fairly straightforward solution.

Other options: pulse-go reattaches; we switch cloudops-deployment-proxy to the taskcluster go client. Between those two I'd probably look at taskcluster go client first, but I defer to the two of you :)

petemoore · 2022-01-13T17:03:20Z

I think I'd agree - explicitly exiting in pulse-go if the connection drops, together with a script like:

#!/bin/bash

# if pulse-go dies, restart it - can happen if amqp connection drops
while true; do
  pulse-go ..... || true
done

jbuck · 2022-01-13T19:33:38Z

It looks like we're using the Go integration directly, not a separate process. If the Go library killed the process, though, that'd be fine; we can just tell systemd to restart it indefinitely.

escapewindow · 2022-01-13T19:48:48Z

Do we replace the fmt.Println() with a panic()? Would it make sense to have a counter+sleep and die on attempt 5 or something, so we avoid going into an infinite crash/restart loop?
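To make the counter+sleep idea concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: connectWithRetry, errUnreachable and the dial function are hypothetical names, not part of pulse-go, and the doubling delay is just one possible choice of sleep policy.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnreachable is a hypothetical error standing in for a failed AMQP dial.
var errUnreachable = errors.New("broker unreachable")

// connectWithRetry calls connect up to maxAttempts times, sleeping between
// failures and doubling the delay each time. It returns nil on the first
// success, or an error wrapping the last failure once attempts run out.
func connectWithRetry(connect func() error, maxAttempts int, delay time.Duration) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = connect(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d/%d failed: %v\n", attempt, maxAttempts, err)
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Hypothetical dial function that fails twice, then succeeds.
	dial := func() error {
		calls++
		if calls < 3 {
			return errUnreachable
		}
		return nil
	}
	if err := connectWithRetry(dial, 5, 10*time.Millisecond); err != nil {
		// Out of attempts: the caller would exit here and let a supervisor restart it.
		panic(err)
	}
	fmt.Printf("connected after %d attempts\n", calls)
}
```

Capping the attempts inside the process bounds the in-process retries, but a supervisor that restarts indefinitely could still loop, so a restart limit on that side may be needed as well.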

petemoore · 2022-01-14T08:54:47Z

I think the cleanest would be for the library to signal that the connection dropped (e.g. by closing a channel), and for the calling code to receive the signal and then choose whether to panic or, e.g., call os.Exit. The issue with having a panic directly inside the library code (where the issue is detected) is that other apps that import the library may not want their process to be terminated if the connection drops.
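A rough sketch of that idea, using only the standard library. The Conn type, its Closed method and markDropped are hypothetical names invented for illustration; they are not the current pulse-go API.

```go
package main

import (
	"fmt"
	"sync"
)

// Conn is a hypothetical stand-in for a pulse-go connection.
type Conn struct {
	closed chan struct{}
	once   sync.Once
}

// NewConn returns a Conn whose Closed channel is still open.
func NewConn() *Conn {
	return &Conn{closed: make(chan struct{})}
}

// Closed returns a channel that is closed when the connection drops.
func (c *Conn) Closed() <-chan struct{} {
	return c.closed
}

// markDropped is what the library would call at the point where it
// currently only prints "AMQP channel closed". sync.Once makes repeated
// calls safe, since closing a channel twice would panic.
func (c *Conn) markDropped() {
	c.once.Do(func() { close(c.closed) })
}

func main() {
	conn := NewConn()

	// Simulate the broker dropping the connection in the background.
	go conn.markDropped()

	// The caller, not the library, decides how to react to the drop.
	<-conn.Closed()
	fmt.Println("connection dropped; caller decides whether to exit or reconnect")
	// A caller running under a supervisor such as systemd might call
	// os.Exit(1) here so that it gets restarted.
}
```

Because a closed channel unblocks every receiver, several goroutines in the calling app could wait on Closed() and each react in its own way.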

@jbuck Can you link to the go code that imports the library?

jcristau · 2022-01-14T09:07:48Z

Seems to be here: https://github.com/mozilla-services/cloudops-deployment-proxy/blob/master/main.go#L107-L124
