Unable to restart train.py after crash due to thread not being closed properly #1

Open
cmaclell opened this issue Jul 13, 2019 · 5 comments
Labels
bug Something isn't working

Comments

@cmaclell
Collaborator

(migrated this issue over from the AL repo as it should be here instead)

Whenever an error causes AL to crash, I am unable to restart it. I have to manually kill all the Python processes running on my machine. Even then, I still get the error shown in the attached screenshot for 3-5 minutes afterwards.

Is this because we're doing some kind of special threading? Is this only an issue on macOS?

Flagging this here as a bug.

image

The main issue here appears to be that when AL crashes due to an exception being raised, whatever threads get spawned are not properly closed before exiting.

As a short-term workaround, it seems to be possible to manually kill all related Python processes AND close the browser window that automatically opened for AL, then wait maybe 30 seconds before running train.py again. Kind of annoying, but it is workable until a fix that properly closes the Python processes is implemented.
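One way to make sure spawned processes are closed even when an exception is raised is to wrap the work in try/finally. The following is a minimal sketch, not the actual train.py code: `stop_process` and `run_with_cleanup` are hypothetical names, and the sleeping child merely stands in for whatever processes AL starts.

```python
import subprocess
import sys


def stop_process(proc, timeout=5.0):
    """Ask a child process to exit, then force-kill it if it does not."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # the child ignored the request, so force it
        return proc.wait()


def run_with_cleanup(work):
    """Start a stand-in child, run work(child), and always stop the child."""
    # Hypothetical stand-in for a server or agent process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        work(child)
    finally:
        # Runs on normal return and when work() raises.
        stop_process(child)
    return child
```

The finally block covers ordinary exceptions only. It does not run if the parent itself is killed by an unhandled signal.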

@cmaclell cmaclell added the bug Something isn't working label Jul 13, 2019
@cmaclell
Collaborator Author

cmaclell commented Jul 13, 2019

Currently, processes are killed using the atexit module, which runs registered functions when the Python program terminates. However, it has the following edge case, as noted in the Python documentation:

Note: The functions registered via this module are not called when the program is killed by a signal not handled by Python, when a Python fatal internal error is detected, or when os._exit() is called.
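The edge case in that note can be observed directly. The sketch below is only an illustration (the `SCRIPT` string and the `cleanup_ran` helper are made up for this example): it runs a tiny script in a child interpreter and checks whether the atexit handler fired.

```python
import subprocess
import sys

# Tiny program that registers an atexit handler and then exits
# in the way selected by its first argument.
SCRIPT = """
import atexit, os, sys
atexit.register(lambda: print("cleanup ran"))
mode = sys.argv[1]
if mode == "exception":
    raise RuntimeError("crash")
if mode == "os_exit":
    os._exit(1)
"""


def cleanup_ran(mode):
    """Run SCRIPT in a child interpreter; report whether the handler fired."""
    result = subprocess.run(
        [sys.executable, "-c", SCRIPT, mode],
        capture_output=True,
        text=True,
    )
    return "cleanup ran" in result.stdout
```

With this helper, `cleanup_ran("normal")` and `cleanup_ran("exception")` report that the handler ran, while `cleanup_ran("os_exit")` reports that it did not. An uncaught exception alone therefore should not skip atexit handlers, but a process that is killed by a signal or that calls os._exit() will.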

@cmaclell
Collaborator Author

An alternative suggested on Stack Overflow (https://stackoverflow.com/questions/930519/how-to-run-one-last-function-before-getting-killed-in-python) is to use the signal module.

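import signal
import sys
import time

def clean(signum, frame):
    # Called when one of the registered signals arrives.
    print("clean me")
    sys.exit(0)

# SIGBREAK exists only on Windows, so skip any name this platform lacks.
names = ("SIGABRT", "SIGBREAK", "SIGILL", "SIGINT", "SIGSEGV", "SIGTERM")
for name in names:
    if hasattr(signal, name):
        signal.signal(getattr(signal, name), clean)

time.sleep(10)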

@cmaclell
Collaborator Author

So I tried replacing atexit with signal as specified above and it did not seem to fix the problem.

I'm not really sure what is causing the processes to stay up and the port to be blocked.
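To narrow down whether a leftover process is still holding the port, a quick check like the one below may help. This is a sketch with a hypothetical helper name, `port_in_use`. It only detects a live listener that accepts TCP connections, so it says nothing about other reasons a port might be unavailable.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If this returns True after every known Python process has been killed, something else is still listening on that port.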

@DannyWeitekamp
Collaborator

My solution to this is to leave the Port id blank in net.conf, which forces it to search for an open port. Unfortunately, this leaves a bunch of running agents that need to be killed. The issue is that Django doesn't want to die. For the host server I have an explicit QUIT request, but there is no such thing in Django, because obviously the client shouldn't be able to kill the server. Unfortunately, it also doesn't seem to respect SIGTERM, which is really annoying. I've been trying to figure out how to do this right for a while.
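For reference, a common way to search for an open port is to bind to port 0 and let the operating system choose one. This is a generic sketch and not necessarily how AL does it when the port is left blank. The function name `find_open_port` is hypothetical.

```python
import socket


def find_open_port(host="127.0.0.1"):
    """Ask the OS for a currently free TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the OS to pick any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

The socket is closed before the port number is returned, so another process could claim the port before the server binds to it. Having the server itself bind to port 0 and report the result avoids that race.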

@eharpste
Member

eharpste commented Jun 5, 2020

@cmaclell is this still an issue?

3 participants