Unable to restart train.py after crash due to thread not being closed properly #1

cmaclell · 2019-07-13T22:47:16Z

(migrated this issue over from the AL repo as it should be here instead)

Whenever an error causes AL to crash, I am unable to restart it. I have to manually kill all the Python processes running on my machine. Even then, I still get the error shown in the attached screenshot for 3-5 minutes afterwards.

Is this because we're doing some kind of special threading? Is this only an issue on Mac?

Flagging this here as a bug.

The main issue here appears to be that when AL crashes because an exception is raised, the threads it spawned are not properly closed before exiting.

As a short-term workaround, you can manually kill all related Python processes AND close the browser window that AL opened automatically, then wait about 30 seconds before running train.py again. It is annoying, but workable until a proper fix that closes the Python processes is implemented.

cmaclell · 2019-07-13T22:57:09Z

Currently processes are killed using the atexit module, which runs registered functions when the Python program terminates. However, its documentation notes the following edge case:

Note: The functions registered via this module are not called when the program is killed by a signal not handled by Python, when a Python fatal internal error is detected, or when os._exit() is called.
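The os._exit() case from that note is easy to reproduce. The sketch below is not taken from the AL code; `run_child`, `cleanup`, and the `CHILD` template are names made up for illustration. It runs two tiny child scripts and compares what they print: the atexit hook fires on sys.exit() but is skipped on os._exit().

```python
import subprocess
import sys

# Child script template: register an atexit hook, then exit in the given way.
CHILD = """
import atexit, os, sys

def cleanup():
    print("cleanup ran")

atexit.register(cleanup)
{exit_call}
"""

def run_child(exit_call):
    """Run the child script and return whatever it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD.format(exit_call=exit_call)],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()

normal_exit = run_child("sys.exit(0)")  # atexit hooks run
hard_exit = run_child("os._exit(0)")    # atexit hooks are skipped
print(repr(normal_exit), repr(hard_exit))
```

If the AL processes are being ended by an unhandled signal or a hard exit, the registered cleanup would be skipped in the same way. Whether that is what happens here is not confirmed.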

cmaclell · 2019-07-13T22:59:53Z

An alternative suggested on Stack Overflow (https://stackoverflow.com/questions/930519/how-to-run-one-last-function-before-getting-killed-in-python) is to use the signal module.

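import signal
import sys
import time

def clean(signum, frame):
    # Runs when one of the signals registered below is delivered.
    print("clean me")
    sys.exit(0)

# SIGBREAK exists only on Windows, so skip any name missing on this platform.
for name in ("SIGABRT", "SIGBREAK", "SIGILL", "SIGINT", "SIGSEGV", "SIGTERM"):
    if hasattr(signal, name):
        signal.signal(getattr(signal, name), clean)

time.sleep(10)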

cmaclell · 2019-07-14T15:17:45Z

So I tried replacing atexit with signal as specified above and it did not seem to fix the problem.

I'm not really sure what is causing the processes to stay up and the port to be blocked.
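One way to narrow this down is to check whether something is still accepting connections on the port after the crash. The helper below is a diagnostic sketch only; `port_in_use` is a hypothetical name, and the demo uses a throwaway listener on an OS-chosen port rather than the real AL port, which is not given in this thread.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0

# Self-contained demo: open a listener on an OS-chosen port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 asks the OS for any free port
listener.listen(1)
demo_port = listener.getsockname()[1]

busy_while_listening = port_in_use(demo_port)
listener.close()
busy_after_close = port_in_use(demo_port)
print(demo_port, busy_while_listening, busy_after_close)
```

If the probe reports the port as busy after train.py has crashed, a leftover process is still listening on it. If it reports the port as free while the restart still fails, the cause lies elsewhere.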

DannyWeitekamp · 2019-07-14T15:59:37Z

My workaround is to leave the Port id blank in net.conf, which forces it to search for an open port. Unfortunately, this leaves a bunch of running agents that need to be killed. The issue is that Django doesn't want to die. For the host server I have an explicit QUIT request, but there is no such thing in Django, because obviously the client shouldn't be able to kill the server. Unfortunately, it also doesn't seem to respect SIGTERM, which is really annoying. I've been trying to figure out how to do this right for a while.
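For a child that ignores SIGTERM, one common pattern is to ask politely first and force-kill after a grace period. The sketch below assumes the server is started with subprocess.Popen, which may not match how AL launches it; `stop_process`, `grace_seconds`, and the `STUBBORN_CHILD` stand-in script are hypothetical.

```python
import subprocess
import sys

# Stand-in for a server that ignores SIGTERM (hypothetical, not the AL agent).
STUBBORN_CHILD = """
import signal, time
signal.signal(signal.SIGTERM, signal.SIG_IGN)
print("ready", flush=True)
time.sleep(60)
"""

def stop_process(proc, grace_seconds=2.0):
    """Ask proc to exit; force-kill it if it is still alive after the grace period."""
    proc.terminate()                      # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL on POSIX, cannot be ignored
        proc.wait()
    return proc.returncode

child = subprocess.Popen(
    [sys.executable, "-c", STUBBORN_CHILD],
    stdout=subprocess.PIPE,
    text=True,
)
child.stdout.readline()   # wait until the child has installed its handler
exit_code = stop_process(child)
child.stdout.close()
print("child exited with", exit_code)
```

This only stops the process that was launched directly. If that process spawns children of its own, they would need the same treatment, for example by starting the server in its own process group and signalling the whole group.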

eharpste · 2020-06-05T18:36:05Z

@cmaclell is this still an issue?

cmaclell added the bug label · Jul 13, 2019
