Unable to restart train.py after crash due to thread not being closed properly #1
Comments
Currently, processes are killed using the atexit module, which runs registered functions when the Python program terminates. However, it has the following edge case:
|
An alternative suggested on Stack Overflow (https://stackoverflow.com/questions/930519/how-to-run-one-last-function-before-getting-killed-in-python) is to use the signal module.
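For reference, a minimal sketch of that idea, not the code actually used in this repo: a SIGTERM handler that converts the signal into a normal interpreter exit, so functions registered with atexit still run. The names `cleanup`, `handle_sigterm`, and `install_handlers` are hypothetical, and a real cleanup hook would terminate the spawned processes instead of recording a call.

```python
import atexit
import signal
import sys

cleanup_calls = []


def cleanup():
    """Hypothetical cleanup hook; a real one would stop the child processes."""
    cleanup_calls.append("cleanup")


def handle_sigterm(signum, frame):
    # Convert SIGTERM into a normal interpreter exit so atexit hooks still run.
    sys.exit(128 + signum)


def install_handlers():
    # signal.signal() may only be called from the main thread.
    atexit.register(cleanup)
    signal.signal(signal.SIGTERM, handle_sigterm)


if __name__ == "__main__":
    install_handlers()
```

Note that this cannot help with SIGKILL, which cannot be caught, or with a hard interpreter crash.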
|
So I tried replacing atexit with signal as specified above, and it did not seem to fix the problem. I'm not really sure what is causing the processes to stay up and the port to stay blocked.
My solution to this is to leave the Port id blank in net.conf. That forces it to search for an open port. Unfortunately, this solution leaves a bunch of running agents that need to be killed. The issue is that Django doesn't want to die. For the host server I have an explicit QUIT request, but there is no such thing in Django, because obviously the client shouldn't be able to kill the server. Unfortunately, it also doesn't seem to respect SIGTERM, which is really annoying. I've been trying to figure out how to do this right for a while.
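To illustrate what "search for an open port" can look like in general, here is a small sketch that asks the operating system for a free port by binding to port 0. This is an assumption about the technique, not a description of how the net.conf handling is implemented; `find_free_port` is a hypothetical helper.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the OS to pick any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(find_free_port())
```

The port is released when the socket closes, so another process could in principle grab it before the server binds to it. It also does nothing about the leftover agents still holding the old port.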
@cmaclell is this still an issue?
(migrated this issue over from the AL repo as it should be here instead)
Whenever there is some kind of error that causes AL to crash, then I am unable to restart it. I have to manually kill all the python processes running on my machine. Even then, I still get the error shown in the attached screenshot for 3-5 minutes afterwards.
Is this because we're doing some kind of special threading? Is this only an issue on Mac?
Flagging this here as a bug.
The main issue here appears to be that when AL crashes due to an exception being raised, whatever threads get spawned are not properly closed before exiting.
As a short-term fix, it seems to be possible to manually kill all related Python processes AND close the browser window that automatically opened for AL, then wait maybe 30 seconds before running train.py again. Kind of annoying, but the workaround is usable until a proper fix for closing the Python processes is implemented.
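One possible shape for such a fix, sketched under the assumption that the helpers are started as child processes with subprocess: wrap the main work in try/finally so the children are terminated even when an exception is raised. `run_with_children` and its arguments are hypothetical names, not part of train.py.

```python
import subprocess
import sys


def run_with_children(commands, body):
    """Start each command as a child process, run body(children), and
    always stop the children afterwards, even if body raises."""
    children = [subprocess.Popen(cmd) for cmd in commands]
    try:
        return body(children)
    finally:
        # Ask every child that is still running to exit (SIGTERM on POSIX).
        for child in children:
            if child.poll() is None:
                child.terminate()
        # Wait briefly, then force-kill anything that ignored the request.
        for child in children:
            try:
                child.wait(timeout=5)
            except subprocess.TimeoutExpired:
                child.kill()
                child.wait()


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    run_with_children([sleeper], lambda children: print(len(children)))
```

The fallback to `kill()` matters for servers that do not respond to SIGTERM, as reported for Django above.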