Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client fails to resubscribe after cluster restart #22

Open
phijor opened this issue Apr 22, 2021 · 1 comment
Open

Client fails to resubscribe after cluster restart #22

phijor opened this issue Apr 22, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@phijor
Copy link
Collaborator

phijor commented Apr 22, 2021

During a cluster restart, clients failed to correctly obtain a new data queue, thus they reported a CRITICAL timeout.

Here's a log:

[2021-04-22 12:23:48,959] [INFO ] [metricq.agent       ] Connection closed: (320, 'CONNECTION_FORCED - Node was put into maintenance mode') (ConnectionClosed)
[2021-04-22 12:23:48,960] [INFO ] [aio_pika.robust_connection] Connection to amqps://user_xxx:******@host_yyy:5671/ closed. Reconnecting after 5 seconds.
[2021-04-22 12:23:51,960] [INFO ] [metricq.agent       ] Connection closed: 0 bytes read on a total of 1 expected bytes (ConnectionError)
[2021-04-22 12:23:51,961] [INFO ] [aio_pika.robust_connection] Connection to amqps://user_xxx:******@host_yyy:5671/data closed. Reconnecting after 5 seconds.
[2021-04-22 12:23:54,000] [INFO ] [metricq.agent       ] Reconnected to amqps://user_xxx:******@host_yyy:5671/
[2021-04-22 12:23:56,985] [INFO ] [metricq.agent       ] Reconnected to amqps://user_xxx:******@host_yyy:5671/data
[2021-04-22 12:23:56,985] [INFO ] [metricq.sink        ] Sink data connection (amqps://user_xxx:******@host_yyy:5671/data) reestablished!
[2021-04-22 12:23:56,985] [INFO ] [metricq.sink        ] Resubscribing to 3881 metric(s) with RPC parameters {'dataQueue': '<redacted>'}...
[2021-04-22 12:23:57,001] [INFO ] [metricq.agent       ] sending RPC sink.subscribe, ex: metricq.management, rk: sink.subscribe, ci: <redacted>, args: {"function": "sink.subscribe", "metrics": [...] }
[2021-04-22 12:24:25,498] [INFO ] [metricq.agent       ] received message from manager, correlation id: <redacted>, reply_to: None, length: 1291442
{"dataServerAddress": "amqps://host_yyy:5671/data", "dataQueue": "<redacted>", "metrics": { ... }
[2021-04-22 12:24:25,515] [INFO ] [metricq.agent       ] rpc completed in 28.5290430621244 s
[2021-04-22 12:31:06,112] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,114] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,114] [WARNING] [metricq_sink_nsca.check] Check 'Check-UVW': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,114] [WARNING] [metricq_sink_nsca.check] Check 'Check-UVW': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,114] [WARNING] [metricq_sink_nsca.check] Check 'Check-ABC': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,126] [WARNING] [metricq_sink_nsca.check] Check 'Check-UVW': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,126] [WARNING] [metricq_sink_nsca.check] Check 'Check-UVW': <redacted> timed out after 300.0s
@phijor phijor added the bug Something isn't working label Apr 22, 2021
@phijor phijor self-assigned this Apr 22, 2021
@phijor
Copy link
Collaborator Author

phijor commented Apr 23, 2021

It looks like another client ran into the same problem as well:

[2021-04-22 12:23:48,976] [INFO ] [metricq.agent       ] Connection closed: (320, 'CONNECTION_FORCED - Node was put into maintenance mode') (ConnectionClosed)
[2021-04-22 12:23:48,977] [INFO ] [aio_pika.robust_connection] Connection to amqps://nsca:******@<host>:5671/data closed. Reconnecting after 5 seconds.
[2021-04-22 12:23:49,038] [INFO ] [metricq.agent       ] Connection closed: (320, 'CONNECTION_FORCED - Node was put into maintenance mode') (ConnectionClosed)
[2021-04-22 12:23:49,039] [INFO ] [aio_pika.robust_connection] Connection to amqps://nsca:******@<host> closed. Reconnecting after 5 seconds.
[2021-04-22 12:23:54,450] [ERROR] [metricq.agent       ] Exception in event loop: Task was destroyed but it is pending!
[2021-04-22 12:23:54,452] [INFO ] [metricq.agent       ] Reconnected to amqps://nsca:******@<host>:5671/data
[2021-04-22 12:23:54,453] [INFO ] [metricq.sink        ] Sink data connection (amqps://nsca:******@<host>:5671/data) reestablished!
[2021-04-22 12:23:54,453] [INFO ] [metricq.sink        ] Resubscribing to 11 metric(s) with RPC parameters {'dataQueue': '<redacted>'}...
[2021-04-22 12:24:47,000] [WARNING] [metricq_sink_nsca.check] Check '<check redacted>': <redacted>.s0.dram.power.1Hz timed out after 900.0s
[2021-04-22 12:24:47,000] [WARNING] [metricq_sink_nsca.check] Check '<check redacted>': <redacted>.s1.dram.power.1Hz timed out after 900.0s
[2021-04-22 12:24:47,000] [WARNING] [metricq_sink_nsca.check] Check '<check redacted>': <redacted>.s1.package.power.1Hz timed out after 900.0s
[2021-04-22 12:24:47,000] [WARNING] [metricq_sink_nsca.check] Check '<check redacted>': <redacted>.s0.package.power.1Hz timed out after 900.0s
[2021-04-22 12:24:47,000] [WARNING] [metricq_sink_nsca.check] Check '<check redacted>': <redacted>.sum.power timed out after 900.0s
[2021-04-22 12:24:49,036] [WARNING] [metricq.connection_watchdog] Data connection failed to reconnect after 60 seconds
[2021-04-22 12:24:49,036] [INFO ] [metricq.data_client ] closing data channel and connection.
[2021-04-22 12:24:49,037] [WARNING] [metricq.connection_watchdog] Watchdog task data connection already done with result: None
[2021-04-22 12:24:49,038] [INFO ] [metricq.agent       ] Stopping Agent ReporterSink (Failed to reestablish data connection after 60 seconds)...
[2021-04-22 12:24:49,039] [INFO ] [metricq.agent       ] Closing management channel and connection...
Traceback (most recent call last):
  File "<venv>/lib/python3.9/site-packages/metricq/agent.py", line 251, in run
    self.event_loop.run_until_complete(wait_for_stop())
  File "<venv>/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "<venv>/lib/python3.9/site-packages/metricq/agent.py", line 247, in wait_for_stop
    return stopped_task.result()
  File "<venv>/lib/python3.9/site-packages/metricq/agent.py", line 497, in stopped
    await self._stop_future
metricq.exceptions.ReconnectTimeout: Failed to reestablish data connection after 60 seconds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<venv>/bin/metricq-sink-nsca", line 8, in <module>
    sys.exit(main())
  File "<venv>/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "<venv>/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "<venv>/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "<venv>/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "<venv>/lib/python3.9/site-packages/metricq_sink_nsca/main.py", line 58, in main
    reporter.run(cancel_on_exception=True)
  File "<venv>/lib/python3.9/site-packages/metricq/agent.py", line 254, in run
    self.event_loop.run_until_complete(self.event_loop.shutdown_asyncgens())
  File "<venv>/lib/python3.9/asyncio/base_events.py", line 640, in run_until_complete
    raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.

This one is differen though, as it seems to have failed to re-establish its data connection altogether.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant