Add resilience for heartbeats from unknown managers #3643

benclifford · 2024-10-22T18:54:53Z

Description

Make the interchange survive a heartbeat message from an unregistered manager.

#3262 and #3632 report situations where a heartbeat is received by the interchange after the heartbeat period has expired, and so the relevant manager has been unregistered. Before this PR, the interchange crashed when this happened. After this PR, it will log a warning.

This PR adds a test for such a heartbeat message as well as a couple of other bad protocol messages.
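The behavior change described above can be illustrated with a minimal sketch. This is not the code from `interchange.py`: the function name `handle_heartbeat`, the dict-of-dicts manager table, and the `last_heartbeat` key are all hypothetical and chosen only for illustration. The point is that a heartbeat from an identity that is no longer registered is logged and dropped rather than raising an exception.

```python
import logging
import time

logger = logging.getLogger(__name__)


def handle_heartbeat(managers, manager_id, now=None):
    """Record a heartbeat, tolerating managers that are no longer registered.

    `managers` is a hypothetical mapping of manager id -> state dict.
    Returns True if the heartbeat was recorded, False if it was ignored.
    """
    manager = managers.get(manager_id)
    if manager is None:
        # A late heartbeat can arrive after the manager was expired and
        # removed; warn and carry on instead of raising KeyError.
        logger.warning(
            "Received heartbeat from unregistered manager %r; ignoring",
            manager_id,
        )
        return False
    manager["last_heartbeat"] = time.time() if now is None else now
    return True
```

Using `dict.get` plus an explicit warning keeps the unknown-manager case visible in the logs without letting it terminate the message loop.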

Fixes

Fixes #3262, fixes #3632

Type of change

Bug fix


khk-globus

Looks on target. I recognize that at least the 3.12 tests aren't passing now, but this basic thrust looks good, with just a couple of non-obligatory inline suggestions.

parsl/executors/high_throughput/interchange.py

khk-globus · 2024-10-23T13:18:18Z

parsl/tests/test_scaling/test_missing_heartbeat_3262.py

+T_s = 1
+


Clever; remove some of the magic from the hard-coded values.

parsl/tests/test_scaling/test_missing_heartbeat_3262.py

khk-globus · 2024-10-23T13:21:32Z

parsl/tests/test_scaling/test_missing_heartbeat_3262.py

+        (task_port, result_port) = htex.command_client.run("WORKER_PORTS")
+
+        context = zmq.Context()
+        channel_timeout = 10000  # in milliseconds
+        task_channel = context.socket(zmq.DEALER)
+        task_channel.setsockopt(zmq.LINGER, 0)
+        task_channel.setsockopt(zmq.IDENTITY, b'testid')
+
+        task_channel.set_hwm(0)
+        task_channel.setsockopt(zmq.SNDTIMEO, channel_timeout)
+        task_channel.connect(f"tcp://localhost:{task_port}")
+
+        task_channel.send(msg)


This feels super boiler-plate-y, and I wonder if other tests might (or could be refactored to) make use of it. Good for this test, but I'm wondering if this might be wrapped up into a fixture.

(Maybe not, but that's where my mind is going at the moment.)

It's a repeat of the code in the process worker pool that initializes this same "tasks flowing from interchange to worker pool" zmq socket.

If refactoring, I think it would make sense to put that into its own module that is then usable by both the real process worker pool and tests that are mocking parts of the worker pool (what this test does) - along the lines of the zmq_pipes module at https://github.com/Parsl/parsl/blob/master/parsl/executors/high_throughput/zmq_pipes.py for the submit side.
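As a rough sketch of the refactoring floated in this thread, the socket setup from the diff could be wrapped in a small reusable helper. The name `task_channel`, its parameters, and its defaults are hypothetical and are not part of Parsl; the socket options simply mirror those in the test code quoted above. It assumes pyzmq is installed.

```python
from contextlib import contextmanager

import zmq


@contextmanager
def task_channel(task_port, identity=b"testid", host="localhost",
                 send_timeout_ms=10000):
    """Yield a DEALER socket connected to the interchange task port.

    Hypothetical helper: it applies the same options as the test in this
    PR and closes the socket and context on exit.
    """
    context = zmq.Context()
    sock = context.socket(zmq.DEALER)
    try:
        sock.setsockopt(zmq.LINGER, 0)           # do not block on close
        sock.setsockopt(zmq.IDENTITY, identity)  # manager identity seen by the peer
        sock.set_hwm(0)                          # no high-water-mark limit
        sock.setsockopt(zmq.SNDTIMEO, send_timeout_ms)
        sock.connect(f"tcp://{host}:{task_port}")
        yield sock
    finally:
        sock.close()
        context.term()
```

A test could then write `with task_channel(task_port) as ch: ch.send(msg)`, and a pytest fixture could wrap the same context manager if several tests end up needing it.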


benclifford added 2 commits October 22, 2024 13:06

test for 3262

3dff023

try out fix from @EricLee543

3844453

make my own test: start a block, suspend it until it disappears by heartbeat, then continue it so it sends a heartbeat. assert that interchange doesn't crash / can still run stuff?

khk-globus approved these changes Oct 23, 2024


benclifford added 4 commits October 25, 2024 10:30

Merge remote-tracking branch 'origin/master' into benc-heartbeat

3019071

fix log line x2: add substitution target and use a slightly better work…

7968acc

… for unregistered

move test to htex directory

9367134

rework comments

e891f0e

benclifford merged commit 6af844f into master Oct 25, 2024
7 checks passed

benclifford deleted the benc-heartbeat branch October 25, 2024 12:38
