Move FluxExecutor ZMQ into thread and explicitly clean it up #3517

benclifford · 2024-07-10T15:14:07Z

Prior to this PR, there were frequent hangs in CI at cleanup of the ZMQ objects used by the FluxExecutor. See issue #3484 for some more information.

This PR attempts to remove some dangerous behaviour there:

i) creation of ZMQ context and socket is moved into the thread which makes use of them - before this PR, the socket was created on the main thread and passed into the submission thread which uses it. This removes some thread safety issues where a socket cannot be safely moved between threads.

ii) ZMQ context and socket are now closed explicitly (using with-blocks) rather than leaving that to the garbage collector. In the hung tests, the ZMQ context was being garbage collected in the main thread, which is documented as being unsafe when open sockets belong to another thread (here, the submission thread).
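As a rough illustration of points i) and ii), here is a minimal sketch of the pattern using pyzmq. It is not the FluxExecutor code: the names `submission_loop`, `port_queue` and `stop`, the uppercase echo behaviour, and the loopback address are all assumptions made up for this example. The thread that uses the socket also creates it, and the with-block closes the socket and then terminates the context on that same thread.

```python
import queue
import threading

import zmq


def submission_loop(port_queue: queue.Queue, stop: threading.Event) -> None:
    """Hypothetical worker thread that owns its ZMQ context and socket."""
    # Context and socket are created inside the thread that uses them.
    # On exit the with-block closes the socket first, then terminates the
    # context, so nothing is left for the garbage collector to clean up.
    with zmq.Context() as context, context.socket(zmq.REP) as socket:
        port = socket.bind_to_random_port("tcp://127.0.0.1")
        port_queue.put(port)  # tell the main thread where to connect
        while not stop.is_set():
            # Poll with a timeout (milliseconds) so the stop flag is rechecked.
            if socket.poll(timeout=100):
                socket.send(socket.recv().upper())


def main() -> bytes:
    port_queue: queue.Queue = queue.Queue()
    stop = threading.Event()
    worker = threading.Thread(target=submission_loop, args=(port_queue, stop))
    worker.start()
    port = port_queue.get(timeout=5)

    # The main thread talks to the worker through its own context and socket;
    # no ZMQ object is ever passed between threads.
    with zmq.Context() as context, context.socket(zmq.REQ) as client:
        client.connect(f"tcp://127.0.0.1:{port}")
        client.send(b"ping")
        reply = client.recv()

    stop.set()
    worker.join(timeout=5)
    assert not worker.is_alive()
    return reply


if __name__ == "__main__":
    print(main())
```

Only plain Python objects (a port number and an event flag) cross the thread boundary here, which is the property the PR is aiming for.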

On my laptop I could see a hang in around 50% of test runs before this PR. After this PR, I have run about 100 iterations of the flux tests without seeing any hangs.

Fixes

Fixes #3484

Type of change

Bug fix

benclifford · 2024-07-10T15:14:18Z

cc @jameshcorbett

jameshcorbett

This looks like a big improvement, thank you so much for tracking this down!

jameshcorbett · 2024-07-10T16:52:54Z

There is another instance of ZMQ sockets not being cleaned up in flux-executor-related code. I'm happy to open this as a separate PR if you like.

diff --git a/parsl/executors/flux/flux_instance_manager.py b/parsl/executors/flux/flux_instance_manager.py
index 3d760bb5..e6111796 100644
--- a/parsl/executors/flux/flux_instance_manager.py
+++ b/parsl/executors/flux/flux_instance_manager.py
@@ -27,30 +27,29 @@ def main():
     parser.add_argument("hostname", help="hostname of the parent executor's socket")
     parser.add_argument("port", help="Port of the parent executor's socket")
     args = parser.parse_args()
-    context = zmq.Context()
-    socket = context.socket(zmq.REQ)
-    socket.connect(
-        args.protocol + "://" + gethostbyname(args.hostname) + ":" + args.port
-    )
-    # send the path to the ``flux.job`` package
-    socket.send(dirname(dirname(os.path.realpath(flux.__file__))).encode())
-    logging.debug("Flux package path sent.")
-    # collect the encapsulating Flux instance's URI
-    local_uri = flux.Flux().attr_get("local-uri")
-    hostname = gethostname()
-    if args.hostname == hostname:
-        flux_uri = local_uri
-    else:
-        flux_uri = "ssh://" + gethostname() + local_uri.replace("local://", "")
-    logging.debug("Flux URI is %s", flux_uri)
-    response = socket.recv()  # get acknowledgment
-    logging.debug("Received acknowledgment %s", response)
-    socket.send(flux_uri.encode())  # send URI
-    logging.debug("URI sent. Blocking for response...")
-    response = socket.recv()  # wait for shutdown message
-    logging.debug("Response %s received, draining flux jobs...", response)
-    flux.Flux().rpc("job-manager.drain").get()
-    logging.debug("Flux jobs drained, exiting.")
+    with zmq.Context() as context, context.socket(zmq.REQ) as socket:
+        socket.connect(
+            args.protocol + "://" + gethostbyname(args.hostname) + ":" + args.port
+        )
+        # send the path to the ``flux.job`` package
+        socket.send(dirname(dirname(os.path.realpath(flux.__file__))).encode())
+        logging.debug("Flux package path sent.")
+        # collect the encapsulating Flux instance's URI
+        local_uri = flux.Flux().attr_get("local-uri")
+        hostname = gethostname()
+        if args.hostname == hostname:
+            flux_uri = local_uri
+        else:
+            flux_uri = "ssh://" + gethostname() + local_uri.replace("local://", "")
+        logging.debug("Flux URI is %s", flux_uri)
+        response = socket.recv()  # get acknowledgment
+        logging.debug("Received acknowledgment %s", response)
+        socket.send(flux_uri.encode())  # send URI
+        logging.debug("URI sent. Blocking for response...")
+        response = socket.recv()  # wait for shutdown message
+        logging.debug("Response %s received, draining flux jobs...", response)
+        flux.Flux().rpc("job-manager.drain").get()
+        logging.debug("Flux jobs drained, exiting.")
 
 
 if __name__ == "__main__":

benclifford · 2024-07-10T16:58:37Z

@jameshcorbett yeah please make a PR for that.

The main problem that I encounter with ZMQ in Parsl is to do with multithreading (and perhaps multiprocessing fork), and I think that isn't an issue in flux_instance_manager.py.

khk-globus

Good find. Once identified, this change is a no-brainer.

Before this PR, this thread stays running forever. This *requires* the socket to be closed at exit, and this PR introduces code to do that.

Context: see recent flux PR for the same problem. Because stopping this thread is now allowing garbage collection to happen, it looks like? Or something similar... See PR #3517 for the same problem in Flux.

Counts on parsl/tests/test_monitoring/:
before this PR: 451 fds, 32 threads
after this PR: 48 fds, 1 thread

benclifford mentioned this pull request Jul 10, 2024

[not for merge] benc poking at CI hangs in flux test #3259

Closed

jameshcorbett approved these changes Jul 10, 2024

benclifford mentioned this pull request Jul 10, 2024

flux-in-parsl-CI testing is very hangy #3484

Closed

jameshcorbett mentioned this pull request Jul 10, 2024

flux: cleanup zmq context and socket #3518

Merged

Merge branch 'master' into benc-flux-hang-2

ba3450d

benclifford requested a review from yadudoc July 12, 2024 15:23

khk-globus approved these changes Jul 23, 2024

Merge branch 'master' into benc-flux-hang-2

c8e0014

benclifford merged commit 9798260 into master Jul 23, 2024
7 checks passed

benclifford deleted the benc-flux-hang-2 branch July 23, 2024 17:28
