
gui crash on server restart #16

Closed
hartytp opened this issue May 7, 2019 · 3 comments
hartytp commented May 7, 2019

If a server dies while the GUI client is mid-way through establishing a connection, the GUI sometimes crashes. This seems to be non-deterministic and is most likely a race condition somewhere.

It's a bit hard to debug, since I haven't been able to get a proper traceback.

It appears that a ConnectionResetError is raised in asyncio.windows_events.IocpProactor.recv's finish_recv. This leads to asyncio.Future.set_exception being called (via _OverlappedFuture); however, the future's state is already _CANCELLED, so an InvalidStateError is raised and the program terminates.
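For reference, the InvalidStateError itself can be reproduced in isolation, independent of quamash and the GUI code (a minimal sketch, not the actual IOCP code path):

```python
import asyncio

# A cancelled Future refuses set_exception(): this mirrors the failure
# mode where finish_recv() raises ConnectionResetError after the
# overlapped future has already been cancelled.
loop = asyncio.new_event_loop()
fut = loop.create_future()
fut.cancel()
try:
    fut.set_exception(ConnectionResetError())
except asyncio.InvalidStateError as e:
    print("InvalidStateError:", e)
loop.close()
```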

I did try hacking some print(traceback.print_exc()) calls into the local copy of asyncio in various places, but that always printed None; not totally sure why.
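(In hindsight the None is expected: traceback.print_exc() writes the traceback directly to sys.stderr and returns None, so wrapping it in print() always prints None on stdout. A minimal illustration:)

```python
import traceback

try:
    raise ValueError("boom")
except ValueError:
    # print_exc() writes the formatted traceback to sys.stderr and
    # returns None, so print() around it additionally prints "None".
    print(traceback.print_exc())
```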

As part of debugging this, I stripped the GUI down until the only asyncio tasks (aside from Qt) are hooking up some sync_struct Subscribers, with disconnect_cb hooked up to a trivial function that waits 10s before reconnecting (see below). I can't see any obvious way that my code could be responsible for these symptoms, so I suspect a race condition somewhere in asyncio or quamash. It probably doesn't happen on Linux...

    def subscriber_reconnect_cb(self, server, db):
        print("cb")
        print(traceback.print_exc())
        subscriber, fut = self.subscribers[server][db]
        try:
            fut = asyncio.ensure_future(
                subscriber_reconnect_coro(self, server, db))
        except Exception as e:
            print(e)
        self.subscribers[server][db] = subscriber, fut

    async def subscriber_reconnect_coro(self, server, db):
        try:
            if self.win.exit_request.is_set():
                for display in self.laser_displays:
                    display.wake_loop.set()
                return
        except Exception as e:
            print(e)

        logger.error("No connection to server '{}'".format(server))

        for _, display in self.laser_displays.items():
            if display.server == server:
                display.server = ""
                display.wake_loop.set()

        server_cfg = self.config["servers"][server]
        subscriber, fut = self.subscribers[server][db]

        try:
            await subscriber.close()
        except:
            pass

        subscriber.disconnect_cb = functools.partial(
            subscriber_reconnect_cb, self, server, db)

        while not self.win.exit_request.is_set():
            try:
                print("connecting")
                await subscriber.connect(server_cfg["host"],
                                         server_cfg["notify"])
                print("done!")
                logger.info("Reconnected to server '{}'".format(server))
                break
            except (ConnectionRefusedError, OSError, ConnectionResetError):
                pass
            except:
                logger.info("could not connect to '{}' retry in 10s..."
                            .format(server))
                await asyncio.sleep(10)
        print("cb complete!")

    for server, server_cfg in self.config["servers"].items():
        self.subscribers[server] = {}

        # ask the servers to keep us updated with changes to laser settings
        # (exposures, references, etc)
        subscriber = Subscriber(
            "laser_db",
            functools.partial(init_cb, self.laser_db),
            functools.partial(self.notifier_cb, "laser_db", server),
            disconnect_cb=functools.partial(
                subscriber_reconnect_cb, self, server, "laser_db"))
        self.subscribers[server]["laser_db"] = subscriber, None
        subscriber_reconnect_cb(self, server, "laser_db")

        # ask the servers to keep us updated with the latest frequency data
        subscriber = Subscriber(
            "freq_db",
            functools.partial(init_cb, self.freq_db),
            functools.partial(self.notifier_cb, "freq_db", server),
            disconnect_cb=functools.partial(
                subscriber_reconnect_cb, self, server, "freq_db"))
        self.subscribers[server]["freq_db"] = subscriber, None
        subscriber_reconnect_cb(self, server, "freq_db")

        # ask the servers to keep us updated with the latest osa traces
        subscriber = Subscriber(
            "osa_db",
            functools.partial(init_cb, self.osa_db),
            functools.partial(self.notifier_cb, "osa_db", server),
            disconnect_cb=functools.partial(
                subscriber_reconnect_cb, self, server, "osa_db"))
        self.subscribers[server]["osa_db"] = subscriber, None
        subscriber_reconnect_cb(self, server, "osa_db")

hartytp commented May 7, 2019

To further debug this, I uninstalled the conda version of quamash and installed a new version from source. NB the conda version packaged in Artiq (0.5.5) is really quite old.

After doing that, I stopped being able to reproduce this issue. Since it's probably a race, the symptoms going away doesn't necessarily mean the issue is resolved, but I don't see anything else I can do to debug it, and I haven't got any more time to sink into this now.

cc @cjbe @klickverbot


hartytp commented May 7, 2019

aah, I think this is harvimt/quamash#77

Optimistically closing on the assumption that it's now fixed by moving to our local conda packages. Will re-open if I see it again.

@hartytp hartytp closed this as completed May 7, 2019

hartytp commented May 7, 2019

cf harvimt/quamash#109
