Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify rpcq:start-server to support graceful shutdown of worker threads #75

Open
appleby opened this issue Jun 18, 2019 · 3 comments
Open
Assignees

Comments

@appleby
Copy link
Contributor

appleby commented Jun 18, 2019

As discussed in the comment thread for PR #74, it's worth investigating whether rcpq:start-server can be modified to support graceful shutdown of worker threads, rather than relying on bt:destory-thread for testing.

The reason for suggesting this change is that just calling bt:destroy-thread was causing test failures on CCL. Those failures were addressed by the changes in #74, but bt:destroy-thread is something of a nuclear option for killing threads, and the bordeaux-threads documentation for destroy-threads warns:

This should be used with caution: it is implementation-defined whether the thread runs cleanup forms or releases its locks first.

So we'd like to do graceful termination of server threads, if possible.

This work might dovetail nicely with the non-blocking start-server changes suggested in #61.

@appleby
Copy link
Contributor Author

appleby commented Jun 18, 2019

See also ecpeterson's comment regarding possibly trickiness of handling the pzmq:device call in rpcq:start-server.

@appleby appleby mentioned this issue Jun 18, 2019
@appleby appleby self-assigned this Jul 26, 2019
appleby added a commit that referenced this issue Jan 14, 2020
Add a (SLEEP 1) after every BT:DESTROY-THREAD call.

This is a work-around that enables switching over to EC2 runners for
the gitlab CI. Something about the EC2 environment was tickling a race
in our combination of using BT:DESTROY-THREAD + our reliance on PZMQ's
implicit global *DEFAULT-CONTEXT*.

Previously, each test called BT:DESTROY-THREAD to terminate it's
SERVER-THREAD, and when that sever-thread unwinds, it would terminate
each of of it's worker threads, then terminate the (implicit global)
PZQM:*DEFAULT-CONTEXT*. If teardown of the PZMQ/ZMQ context has not
fully completed by the time the next test runs, we might be in a
situation where PQZM:*DEFAULT-CONTEXT* is non-NIL, but in some zombie
half-dead state, yet the next test case will attempt to re-use it.

A proper fix for this likely entails some combination of

1) Explicitly creating our own PZMQ:CONTEXT in RPCQ:START-SERVER and
passing that down to worker threads.

2) Adding a graceful shutdown option to the RPCQ server to avoid the
need to BT:DESTROY-THREAD (see #75).
appleby added a commit that referenced this issue Jan 14, 2020
Add a (SLEEP 1) after every BT:DESTROY-THREAD call.

This is a work-around that enables switching over to EC2 runners for
the gitlab CI. Something about the EC2 environment was tickling a race
in our combination of using BT:DESTROY-THREAD + our reliance on PZMQ's
implicit global *DEFAULT-CONTEXT*.

Previously, each test called BT:DESTROY-THREAD to terminate it's
SERVER-THREAD, and when that sever-thread unwinds, it would terminate
each of of it's worker threads, then terminate the (implicit global)
PZQM:*DEFAULT-CONTEXT*. If teardown of the PZMQ/ZMQ context has not
fully completed by the time the next test runs, we might be in a
situation where PQZM:*DEFAULT-CONTEXT* is non-NIL, but in some zombie
half-dead state, yet the next test case will attempt to re-use it.

A proper fix for this likely entails some combination of

1) Explicitly creating our own PZMQ:CONTEXT in RPCQ:START-SERVER and
passing that down to worker threads.

2) Adding a graceful shutdown option to the RPCQ server to avoid the
need to BT:DESTROY-THREAD (see #75).
appleby added a commit that referenced this issue Jan 14, 2020
Add a (SLEEP 1) after every BT:DESTROY-THREAD call.

This is a work-around that enables switching over to EC2 runners for
the gitlab CI. Something about the EC2 environment was tickling a race
in our combination of using BT:DESTROY-THREAD + our reliance on PZMQ's
implicit global *DEFAULT-CONTEXT*.

A proper fix for this likely entails some combination of

1) Explicitly creating our own PZMQ:CONTEXT in RPCQ:START-SERVER and
passing that down to worker threads.

2) Adding a graceful shutdown option to the RPCQ server to avoid the
need to BT:DESTROY-THREAD (see #75).
@appleby
Copy link
Contributor Author

appleby commented Jan 14, 2020

Don't forget to cleanup the rpcq tests and replace the calls there to KILL-THREAD-SLOWLY will a graceful shutdown.

See @jmbr's comment here: #113 (comment)

stylewarning pushed a commit that referenced this issue Jan 14, 2020
Add a (SLEEP 1) after every BT:DESTROY-THREAD call.

This is a work-around that enables switching over to EC2 runners for
the gitlab CI. Something about the EC2 environment was tickling a race
in our combination of using BT:DESTROY-THREAD + our reliance on PZMQ's
implicit global *DEFAULT-CONTEXT*.

A proper fix for this likely entails some combination of

1) Explicitly creating our own PZMQ:CONTEXT in RPCQ:START-SERVER and
passing that down to worker threads.

2) Adding a graceful shutdown option to the RPCQ server to avoid the
need to BT:DESTROY-THREAD (see #75).
@appleby
Copy link
Contributor Author

appleby commented Jan 14, 2020

Also don't forget to make the corresponding change to remove KILL-THREAD-SLOWLY in the quilc rpcq tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant