RuntimeError: DataLoader worker is killed by signal: Floating point exception. #42
Comments
Ooof; thank you for catching and reporting this. We have never seen this. A few questions to see whether anything is different between your setup and ours. Then, a reproducibility step:
Once I know whether it's the mini or full dataset and how many files you are using, I'll run the dataloader over the relevant files and see if we can find the file where this error occurs.
Sounds good. I am using the full dataset with num_files set to -1 (the entire dataset). I'll let you know when I have the file name.
Thanks for that info! @xiaomengy I'm going to write a quick script tomorrow that searches through the dataset, builds samples from the dataloader, and flags any files that throw an error. Would you be able to run it on the cluster and send me any file names it flags?
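For reference, a minimal sketch of what such a scanning script could look like, assuming a map-style dataset whose __getitem__ builds one sample per scenario file and that exposes a files list (both are assumptions, not the actual script from this thread). Note that a hard floating point exception raised inside the C++ extension would still kill the process rather than surface as a Python exception.

```python
import json
import traceback

def find_failing_files(dataset, out_path="failing_files.json"):
    """Build every sample and record which files raise Python-level errors."""
    failing = []
    files = getattr(dataset, "files", None)  # assumed attribute; adapt to the real dataset class
    for idx in range(len(dataset)):
        try:
            _ = dataset[idx]  # build the sample exactly as a dataloader worker would
        except Exception:
            failing.append({
                "index": idx,
                "file": files[idx] if files is not None else "<unknown>",
                "error": traceback.format_exc(),
            })
    with open(out_path, "w") as f:
        json.dump(failing, f, indent=2)
    return failing
```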
One last question: does it ever train to completion, or is this blocking you from completing any training run? Just trying to get a sense of how rare it is.
It does train to completion most often. It fails roughly 20% of the time.
Great, that's useful information.
It's weird. It's not caused by a specific file. Sometimes it iterates through all files with no issue, sometimes it crashes :(
Well, it's interesting: the error is raised in the call to distribution, so I'm wondering whether the model is actually producing a NaN in the step between passing the state through the head and passing that output to the MultivariateNormal distribution, rather than this being a file error. It seems to be complaining that a value and its transpose are not close. Since the model training runs serially, you could wrap that call in a try/except block with a breakpoint and see what is being passed in when the method errors. I'll try to help more, but I have yet to reproduce the issue on my local machine (admittedly, training is slow). It will be faster once I get access to a cluster again.
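As a concrete illustration of that suggestion, here is a sketch only, with placeholder names mean and cov for the head outputs rather than the repo's actual variables:

```python
import torch
from torch.distributions import MultivariateNormal

def build_dist_debug(mean, cov):
    """Construct the action distribution, dumping diagnostics if it fails."""
    try:
        return MultivariateNormal(mean, covariance_matrix=cov)
    except (RuntimeError, ValueError):
        print("mean contains NaN:", torch.isnan(mean).any().item())
        print("cov contains NaN:", torch.isnan(cov).any().item())
        # The "value and its transpose are not close" complaint points at an asymmetric matrix.
        print("max |cov - cov^T|:", (cov - cov.transpose(-1, -2)).abs().max().item())
        breakpoint()  # inspect the offending batch interactively
        raise
```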
Ah! One more thing that @nathanlct pointed out: are you using Discrete actions or Continuous actions? We've only extensively tested the discrete setting; perhaps the precision / covariance matrix is acting up.
I don't think it's related to the call to distribution. It happens for both action and position action spaces. When I just iterate through the dataset in a simple script, I am sometimes (but not always?) getting a floating point exception. I only have screenshots of the traceback (sorry).
Oh, that's super useful that you can reproduce it without the training! So it's in the worker or possibly in Nocturne itself...
This is the backtrace with gdb when it fails:
Does it tell you anything about the root cause?
I have enabled the debug option in setup.py. Now I am getting the following errors:
I am not a C++ wizard. Is it possible that those assertion errors lead to the floating point exception?
I think that's probably it; great job and thank you!! @xiaomengy (our C++ wizard), do you see how this error could occur? We could definitely use your insight here.
Here is a backtrace for the line segment error:
And for the polygon error:
Hey @BenQLange, just to give you an update: we're slightly backlogged, but Xiaomeng will take a look at this on Tuesday. Figured it was better to have a time than persistent uncertainty.
So floating point exceptions are not deterministic, but assertion errors are. I have identified invalid files in the training set:
Hopefully, that's the reason behind the floating point exception errors. I'll let you know after I run some more experiments. UPDATE: There are more failing files. I didn't iterate over time :(
Hi @BenQLange. Sorry for the late reply; I've been tied up with some other deadlines. I will take a detailed look later today and hopefully resolve it ASAP.
Small update: here are the configs I used to find the failing scenes listed above. Depending on the configs I get more or fewer assertion errors. In particular, I noticed it when changing the view angle. Hope that helps.
Thanks for finding those! We are still looking into it, but in the meantime, would including a try/except block in your code temporarily resolve this issue so that you aren't blocked? We should have a resolution soon.
I don't think we can write a try/except block for floating point exceptions or assertion errors. I tried, and it was still killing the worker and stopping the script. Instead, I iterated through the dataset with the above configs and created a dictionary of failing files (a bash script looping until it finished iterating through the dataset). For now, I just skip those files during training.
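A sketch of that skip-list workaround (the failing_files.json name and its layout are carried over from the scanning sketch above, not the author's actual files):

```python
import json
from pathlib import Path

def filter_scenarios(data_dir, blocklist_path="failing_files.json"):
    """Return scenario paths with previously flagged files removed."""
    with open(blocklist_path) as f:
        bad_names = {Path(entry["file"]).name for entry in json.load(f)}
    files = sorted(Path(data_dir).glob("*.json"))
    kept = [str(p) for p in files if p.name not in bad_names]
    print(f"Skipping {len(files) - len(kept)} flagged files out of {len(files)}")
    return kept
```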
The modified dataset resolves the assertion errors, but I am still experiencing floating point exceptions from time to time :(
Hmm, we are still looking into it. I just got a new laptop with enough space for the whole dataset, so hopefully I can reproduce your errors and help.
Are the errors on the files you listed deterministic? I've constructed the subset of files that you listed and looped the dataloader over them, but I am not seeing an error yet. Reproduction script for reference:
I see. Yes, the assertion errors are deterministic, but they only show up when nocturne is compiled with the debug flag on. Floating point exceptions are not deterministic, and I don't have a clear idea where they are coming from. I'll run your script later on my machine and let you know the outcome. EDIT: Got delayed. I'll run it today.
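For context on why the assertions only fire in a debug build: C++ assert() statements are compiled out whenever NDEBUG is defined, and release build configurations typically define it. A rough sketch of how a CMake-based extension build might toggle this (an assumption about the general pattern, not the actual nocturne setup.py):

```python
import os

# "Debug" keeps assert() active; "Release" normally defines NDEBUG and strips it,
# which is why the assertion failures above are invisible in a regular build.
# NOCTURNE_DEBUG is a hypothetical environment variable used for illustration.
build_type = "Debug" if os.environ.get("NOCTURNE_DEBUG") == "1" else "Release"
cmake_args = [f"-DCMAKE_BUILD_TYPE={build_type}"]
```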
Oh! Okay, let me throw on the debug flag and try again. Thanks for the suggestion.
Hi @BenQLange. Just to give you a progress update: it seems there is one vehicle/object with a negative length in tfrecord-00008-of-01000_364.json, which is at least the reason for the assertion failure. We are now investigating why such values exist and will try to find a way to handle these cases. The object in tfrecord-00008-of-01000_364.json has the shape "width": 4.4137163162231445, "length": -1.295910358428955.
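A quick way to check other scenario files for the same bad geometry, assuming the JSON keeps its objects under an "objects" key with scalar "width"/"length" fields as the quoted snippet suggests (adjust the keys if the layout differs):

```python
import json
from pathlib import Path

def find_negative_dimensions(data_dir):
    """Flag scenario files containing objects with a negative width or length."""
    flagged = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path) as f:
            scenario = json.load(f)
        for obj in scenario.get("objects", []):
            if obj.get("length", 0.0) < 0 or obj.get("width", 0.0) < 0:
                flagged.append((path.name, obj.get("width"), obj.get("length")))
    return flagged
```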
We're following up with Waymo here: waymo-research/waymo-open-dataset#542 and will hopefully find some resolution (though the floating point error is probably from a different source).
Thanks!
Operating system
Ubuntu 18.04
Bug description
When running the imitation learning baseline, I am sometimes getting a floating point exception. Unfortunately, it's not deterministic and I cannot always reproduce it; it just happens sometimes. Has anyone experienced this bug before?
Steps to reproduce
python examples/imitation_learning/train.py
Relevant log output