RuntimeError: DataLoader worker is killed by signal: Floating point exception. #42

Open
BenQLange opened this issue Jul 28, 2022 · 30 comments
Labels
bug Something isn't working

Comments

@BenQLange

BenQLange commented Jul 28, 2022

Operating system

Ubuntu 18.04

Bug description

When running the imitation learning baseline, I sometimes get a floating point exception. Unfortunately, it's not deterministic and I cannot always reproduce it. Has anyone experienced this bug before?

Steps to reproduce

python examples/imitation_learning/train.py

Relevant log output

Error executing job with overrides: ['device=cuda:1']
Traceback (most recent call last):
  File "scripts/train.py", line 204, in main
    dist = model.dist(states)
  File "/home/bernard.lange/imitation-learning-agents-research/./src/algos/imitation_learning/model.py", line 83, in dist
    return MultivariateNormal(
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/distributions/distribution.py", line 53, in __init__
    valid = constraint.check(value)
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/distributions/constraints.py", line 509, in check
    sym_check = super().check(value)
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/distributions/constraints.py", line 490, in check
    return torch.isclose(value, value.mT, atol=1e-6).all(-2).all(-1)
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7036) is killed by signal: Floating point exception.

ERROR: Unexpected floating-point exception encountered in worker.
@BenQLange BenQLange added the bug Something isn't working label Jul 28, 2022
@eugenevinitsky
Collaborator

eugenevinitsky commented Jul 28, 2022

Ooof; thank you for catching and reporting this. We have never seen this.

A few questions to see whether anything in your setup differs from ours.
Question 1: Are you training on the mini-dataset or the full dataset?
Question 2: Are you using all the files or a subset of them, i.e. are you modifying the value of num_files in the config?

Then, two reproducibility steps:

  1. Is there any chance you can get the dataloader to print the scenario_path when this happens? This is a value defined in _get_waymo_iterator here (https://github.com/facebookresearch/nocturne/blob/main/examples/imitation_learning/waymo_data_loader.py). Seeing it might help us find the right file to investigate faster.
  2. Could you also print the state and action values, on the off-chance that this reveals whether the problem is in a state or an action?
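One way to make the first step survive a hard crash is to write each path out and flush it before the file is loaded, so the last line in the log names the file being processed when the worker died. This is only a minimal sketch: `log_scenario_paths` is a hypothetical helper, not part of Nocturne, and it assumes the data loader iterates over a plain sequence of scenario paths.

```python
import sys


def log_scenario_paths(scenario_paths, stream=None):
    """Yield each path after writing it to `stream` and flushing.

    Hypothetical helper (not part of Nocturne). Flushing before the
    path is handed to the loader means the last logged line survives
    even if the process is killed by a signal while loading.
    """
    stream = sys.stderr if stream is None else stream
    for path in scenario_paths:
        stream.write(f"loading scenario: {path}\n")
        stream.flush()
        yield path
```

Wrapping the list of files with this generator before the loop that builds each scenario would leave the offending path as the final log line.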

@eugenevinitsky
Collaborator

Once I know if it's the mini or full-dataset and how many files you are using, I'll run the dataloader over the relevant files and see if we can find the file where this error occurs.

@BenQLange
Author

Sounds good. I am using the full dataset with num_files set to -1 (entire dataset). I'll let you know when I have the file name.

@eugenevinitsky
Collaborator

Thanks for that info!

@xiaomengy I'm going to write a quick script tomorrow to search through the dataset and build samples from the dataloader and return any files if they throw an error. Would you be able to run it on the cluster and send me any file-names it flags?

@eugenevinitsky
Collaborator

One last question, does it ever train to completion or is this blocking you from completing any training run? Just trying to get a sense of how rare it is.

@BenQLange
Author

It trains to completion most of the time. It fails roughly 20% of the time.

@eugenevinitsky
Collaborator

Great, that's useful information.

@BenQLange
Author

It's weird. It's not caused by a specific file. Sometimes it iterates through all files with no issue, sometimes it crashes :(

@eugenevinitsky
Collaborator

eugenevinitsky commented Jul 28, 2022

Well, it's interesting: the error is raised in the call to the distribution, so I'm wondering whether the model is producing a NaN somewhere between passing the state through the head and passing that output to the MultivariateNormal distribution, rather than this being a file error. It seems to be complaining that a value and its transpose are not close? Since the model training runs serially, you could put a breakpoint inside a try/except block and see what is being passed when that method errors.
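The try/except idea could look roughly like the sketch below. `call_with_inspection` is a hypothetical wrapper, not code from the repository. It only helps when the failure surfaces as a Python exception in the calling process; a signal that kills the process never reaches the `except` clause.

```python
import pdb
import sys


def call_with_inspection(fn, *args, debug=False, **kwargs):
    """Call fn(*args, **kwargs); on a Python exception, report the inputs.

    Hypothetical helper. With debug=True it opens a post-mortem pdb
    session at the frame that raised, then re-raises the exception.
    """
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        name = getattr(fn, "__name__", repr(fn))
        print(f"{name} raised {exc!r}", file=sys.stderr)
        for i, arg in enumerate(args):
            print(f"  arg[{i}] = {arg!r}", file=sys.stderr)
        if debug:
            pdb.post_mortem(exc.__traceback__)
        raise
```

In the training loop this would wrap the call from the traceback, for example `call_with_inspection(model.dist, states)`.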

I'll try to help more but I have yet to reproduce the issue on my local machine (admittedly, training is slow). Will be faster once I get access to a cluster again.

@eugenevinitsky
Collaborator

Ah! One more thing that @nathanlct pointed out: are you using discrete or continuous actions? We've only extensively tested the discrete setting, so perhaps the precision / covariance matrix is acting up.

@BenQLange
Author

BenQLange commented Jul 28, 2022

I don't think it's related to the call to the distribution. It happens for both the action and position action spaces. When I just iterate through the dataset in a simple script, I sometimes (but not always?) get a floating point exception. I only have screenshots of the traceback (sorry).
It's really confusing.

Screen Shot 2022-07-28 at 10 30 14 AM

@eugenevinitsky
Collaborator

Oh that's super useful that you can reproduce it without the training! So it's in the worker or possibly in Nocturne itself...
I'll try to reproduce it using the smaller dataset but otherwise it'll be a few days until my new laptop arrives and I can do some analysis on the full dataset, sorry!

@BenQLange changed the title from "RuntimeError: DataLoader worker (pid 7036) is killed by signal: Floating point exception." to "RuntimeError: DataLoader worker is killed by signal: Floating point exception." on Jul 28, 2022
@BenQLange
Author

BenQLange commented Jul 29, 2022

This is the backtrace with gdb when it fails:

Thread 1 "python" received signal SIGFPE, Arithmetic exception.
0x00007fff4a29cb85 in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
(gdb) bt
#0  0x00007fff4a29cb85 in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#1  0x00007fff4a2b08ab in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#2  0x00007fff4a2ab29a in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#3  0x00007fff4a2ad5dd in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#4  0x00007fff4a27dd2a in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#5  0x00007fff4a274a98 in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#6  0x00007fff4a269c5d in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#7  0x000055555569000e in cfunction_call_varargs ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:743
#8  0x000055555568513f in _PyObject_MakeTpCall ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:159
#9  0x00005555556bacba in _PyObject_Vectorcall (kwnames=0x0, nargsf=3, args=0x7fffffffd2d0, 
    callable=0x7fff4a6e83b0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:125
#10 method_vectorcall ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/classobject.c:89
#11 0x000055555568b20d in PyVectorcall_Call (kwargs=0x0, tuple=0x7fff4a50b840, 
    callable=0x7fffa512b7c0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:200
#12 PyObject_Call () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:228
#13 0x000055555562f9cb in slot_tp_init (self=0x7fff43099cf0, args=0x7fff4a50b840, kwds=0x0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/typeobject.c:6793
#14 0x000055555568ff27 in type_call ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/typeobject.c:994
#15 0x00007fffeec764b9 in pybind11_meta_call ()
   from /home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/lib/libtorch_python.so
#16 0x000055555568513f in _PyObject_MakeTpCall ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:159
#17 0x000055555572f89f in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, 
    args=0x55555856e4d8, callable=0x555558b66080)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:125
#18 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, 
    tstate=0x5555558f3ff0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#19 _PyEval_EvalFrameDefault ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3500
#20 0x00005555557210ff in PyEval_EvalFrameEx (throwflag=0, f=0x55555856e240)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#21 _PyEval_EvalCodeWithName ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#22 0x0000555555721bc4 in _PyFunction_Vectorcall ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:436
#23 0x000055555572b0bb in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, 
    args=0x7ffff6f6a5b8, callable=0x7ffff6fd0310)
---Type <return> to continue, or q <return> to quit---
   8/work/Include/cpython/abstract.h:127
#24 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555558f3ff0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#25 _PyEval_EvalFrameDefault () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3500
#26 0x0000555555720600 in PyEval_EvalFrameEx (throwflag=0, f=0x7ffff6f6a440)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#27 _PyEval_EvalCodeWithName () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#28 0x0000555555721eb3 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0, 
    args=0x0, locals=<optimized out>, globals=<optimized out>, _co=<optimized out>)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4327
#29 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:718
#30 0x0000555555796622 in run_eval_code_obj () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:1166
#31 0x00005555557a71d2 in run_mod () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:1188
#32 0x00005555557aa36b in pyrun_file () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:1085
#33 0x00005555557aa54f in pyrun_simple_file (flags=0x7fffffffdb08, closeit=1, filename=0x7ffff6e8b4b0, fp=0x55555596b500)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:439
#34 PyRun_SimpleFileExFlags () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:472
#35 0x00005555557aaa29 in pymain_run_file (cf=0x7fffffffdb08, config=0x5555558f3020)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:391
#36 pymain_run_python (exitcode=0x7fffffffdb00) at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:616
#37 Py_RunMain () at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:695
#38 0x00005555557aac29 in Py_BytesMain () at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:1127
#39 0x00007ffff703fc87 in __libc_start_main (main=0x55555565bea0 <main>, argc=2, argv=0x7fffffffdcf8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdce8) at ../csu/libc-start.c:310
#40 0x000055555574dad7 in _start ()

Does it tell you anything about the root cause?
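The gdb frames inside nocturne_cpp are unsymbolized (`??`), so they do not show which Python call was active. A complementary option is Python's standard `faulthandler` module, which dumps the Python-level traceback when the process receives SIGFPE, SIGABRT, SIGSEGV, SIGBUS, or SIGILL. The sketch below assumes a single-process script like the one run under gdb above; `enable_crash_tracebacks` is a hypothetical wrapper name.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Dump Python tracebacks for all threads on fatal signals.

    Hypothetical wrapper around the standard faulthandler module.
    `stream` must be a real file with a file descriptor and must stay
    open while the handler is enabled; it defaults to sys.stderr.
    """
    target = sys.stderr if stream is None else stream
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_tracebacks()` near the top of the script should be enough. The same effect is available without code changes by running `python -X faulthandler script.py`.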

@BenQLange
Author

I have enabled the debug option in setup.py. Now I am getting the following errors:

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41: nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const: Assertion `t >= 0.0f && t <= 1.0f' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67: nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&): Assertion `VerifyVerticesOrder()' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

I am not a C++ wizard. Is it possible that those assertion errors lead to the floating point exception?

@eugenevinitsky
Collaborator

eugenevinitsky commented Jul 30, 2022

I think that's probably it; great job and thank you!! @xiaomengy (our C++ wizard) do you see how this error could occur? We could definitely use your insight here

@BenQLange
Author

Here is a backtrace for the line segment error:

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41: nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const: Assertion `t >= 0.0f && t <= 1.0f' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff705e7f1 in __GI_abort () at abort.c:79
#2  0x00007ffff704e3fa in __assert_fail_base (fmt=0x7ffff71d56c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7fff4a2cf33a "t >= 0.0f && t <= 1.0f", 
    file=file@entry=0x7fff4a2cf2f0 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h", 
    line=line@entry=41, 
    function=function@entry=0x7fff4a2cf360 <nocturne::geometry::LineSegment::Point(float) const::__PRETTY_FUNCTION__> "nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const") at assert.c:92
#3  0x00007ffff704e472 in __GI___assert_fail (assertion=assertion@entry=0x7fff4a2cf33a "t >= 0.0f && t <= 1.0f", 
    file=file@entry=0x7fff4a2cf2f0 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h", 
    line=line@entry=41, 
    function=function@entry=0x7fff4a2cf360 <nocturne::geometry::LineSegment::Point(float) const::__PRETTY_FUNCTION__> "nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const") at assert.c:101
#4  0x00007fff4a2b10b4 in nocturne::geometry::LineSegment::Point (t=<optimized out>, this=<optimized out>)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41
#5  nocturne::(anonymous namespace)::VisibleObjectsImpl (objects=std::vector of length 16, capacity 32 = {...}, o=..., 
    points=std::vector of length 72, capacity 128 = {...}) at /home/bernard.lange/nocturne/nocturne/cpp/src/view_field.cc:84
#6  0x00007fff4a2b2507 in nocturne::ViewField::FilterVisibleObjects (this=this@entry=0x7fffffffcce0, 
    objects=std::vector of length 16, capacity 32 = {...}) at /home/bernard.lange/nocturne/nocturne/cpp/src/view_field.cc:156
#7  0x00007fff4a286541 in nocturne::Scenario::VisibleObjects (this=this@entry=0x55555d6cadc0, src=..., 
    view_dist=view_dist@entry=80, view_angle=view_angle@entry=2.09439516, head_angle=head_angle@entry=0)
    at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:362
#8  0x00007fff4a28852e in nocturne::Scenario::FlattenedVisibleState (this=0x55555d6cadc0, src=..., 
    view_dist=view_dist@entry=80, view_angle=view_angle@entry=2.09439516, head_angle=head_angle@entry=0)
    at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:508
#9  0x00007fff4a267742 in nocturne::<lambda(const nocturne::Scenario&, const nocturne::Object&, float, float, float)>::operator() (__closure=<optimized out>, head_angle=0, view_angle=2.09439516, view_dist=80, src=..., scenario=...)
    at /home/bernard.lange/nocturne/nocturne/pybind11/src/scenario.cc:73
#10 pybind11::detail::argument_loader<nocturne::Scenario const&, nocturne::Object const&, float, float, float>::call_impl<pybind11::array_t<float, 16>, nocturne::DefineScenario(pybind11::module&)::<lambda(const nocturne::Scenario&, const nocturne::Object&, float, float, float)>&, 0, 1, 2, 3, 4, pybind11::detail::void_type> (f=..., this=0x7fffffffd030)
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/cast.h:1418
#11 pybind11::detail::argument_loader<nocturne::Scenario const&, nocturne::Object const&, float, float, float>::call<pybind11::array_t<float, 16>, pybind11::detail::void_type, nocturne::DefineScenario(pybind11::module&)::<lambda(const nocturne::Scenario&, const nocturne::Object&, float, float, float)>&> (f=..., this=<optimized out>)
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/cast.h:1387
#12 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::operator() (__closure=0x0, call=...)
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/pybind11.h:249
#13 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::_FUN(pybind11::detail::function_call &) ()
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/pybind11.h:224
#14 0x00007fff4a243e49 in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7fff43096ec0, 
    kwargs_in=0x7fff4a503280) at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/pybind11.h:924
#15 0x000055555569000e in cfunction_call_varargs () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:743
#16 0x000055555568513f in _PyObject_MakeTpCall () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:159
#17 0x00005555556baca0 in _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x55555856ca40, 
    callable=0x7fff4a688f40) at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:125
#18 method_vectorcall () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/classobject.c:60
#19 0x000055555572beb0 in _PyObject_Vectorcall (kwnames=0x7ffff6e66280, nargsf=<optimized out>, args=<optimized out>, 
    callable=0x7fff53c96580) at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:127
#20 call_function (kwnames=0x7ffff6e66280, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=<optimized out>)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#21 _PyEval_EvalFrameDefault () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3515
#22 0x00005555557210ff in PyEval_EvalFrameEx (throwflag=0, f=0x55555856c7a0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#23 _PyEval_EvalCodeWithName () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#24 0x0000555555721bc4 in _PyFunction_Vectorcall () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:436
#25 0x000055555572b0bb in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff6f6a5b8, 
    callable=0x7ffff6fcf310) at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:127
#26 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555558f3ff0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#27 _PyEval_EvalFrameDefault () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3500
#28 0x0000555555720600 in PyEval_EvalFrameEx (throwflag=0, f=0x7ffff6f6a440)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#29 _PyEval_EvalCodeWithName () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#30 0x0000555555721eb3 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0, 

And for the polygon error:

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67: nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&): Assertion `VerifyVerticesOrder()' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff705e7f1 in __GI_abort () at abort.c:79
#2  0x00007ffff704e3fa in __assert_fail_base (
    fmt=0x7ffff71d56c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7fff4a2c88ad "VerifyVerticesOrder()", 
    file=file@entry=0x7fff4a2c8868 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h", line=line@entry=67, 
    function=function@entry=0x7fff4a2c88e0 <nocturne::geometry::ConvexPolygon::ConvexPolygon(std::initializer_list<nocturne::geometry::Vector2D> const&)::__PRETTY_FUNCTION__> "nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&)")
    at assert.c:92
#3  0x00007ffff704e472 in __GI___assert_fail (
    assertion=assertion@entry=0x7fff4a2c88ad "VerifyVerticesOrder()", 
    file=file@entry=0x7fff4a2c8868 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h", line=line@entry=67, 
    function=function@entry=0x7fff4a2c88e0 <nocturne::geometry::ConvexPolygon::ConvexPolygon(std::initializer_list<nocturne::geometry::Vector2D> const&)::__PRETTY_FUNCTION__> "nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&)")
    at assert.c:101
#4  0x00007fff4a280052 in nocturne::geometry::ConvexPolygon::ConvexPolygon (vertices=..., this=0x7fffffffc370)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67
#5  nocturne::Object::BoundingPolygon (this=<optimized out>)
    at /home/bernard.lange/nocturne/nocturne/cpp/src/object.cc:27
#6  0x00007fff4a2a810e in nocturne::ObjectBase::GetAABB (
    this=<optimized out>)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/object_base.h:66
#7  void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#1}::operator()(std::shared_ptr<nocturne::Object> const&) const (obj=..., 
    __closure=<synthetic pointer>)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/bvh.h:107
#8  nocturne::geometry::BVH::ResetImpl<std::shared_ptr<nocturne::Object>, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#1}, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#2}>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#1}, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#2}) (this=this@entry=0x55555a066ed8, 
    objects=std::vector of length 72, capacity 128 = {...}, 
    aabb_func=..., ptr_func=...)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/bvh.h:163
#9  0x00007fff4a28efe9 in nocturne::geometry::BVH::Reset<nocturne::Object> (objects=std::vector of length 72, capacity 128 = {...}, 
    this=0x55555a066ed8)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/bvh.h:105
#10 nocturne::Scenario::LoadObjects (this=this@entry=0x55555a066d00, 
    objects_json=...)
    at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:1127
#11 0x00007fff4a2919ad in nocturne::Scenario::LoadScenario (
    this=this@entry=0x55555a066d00, 
    scenario_path="/home/bernard.lange/nocturne_dataset/bernard_nocturne_dataset/val/tfrecord-00506-of-01000_353.json")
    at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:227
#12 0x00007fff4a26a264 in nocturne::Scenario::Scenario (
    this=0x55555a066d00, 
    scenario_path="/home/bernard.lange/nocturne_dataset/bernard_nocturne_dataset/val/tfrecord-00506-of-01000_353.json", config=std::unordered_map with 6 elements = {...})
    at /home/bernard.lange/nocturne/nocturne/cpp/include/scenario.h:100
#13 0x00007fff4a2766a0 in std::make_unique<nocturne::Scenario, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::variant<bool, long, float>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::variant<bool, long, float> > > > const&> () at /usr/include/c++/7/bits/unique_ptr.h:821
#14 nocturne::Simulation::Simulation (config=std::unordered_map with 6 elements = {...}, 
    scenario_path="/home/bernard.lange/nocturne_dataset/bernard_nocturne_dataset/val/tfrecord-00506-of-01000_353.json", 
    this=0x55555c020f20) at /home/bernard.lange/nocturne/nocturne/cpp/include/simulation.h:32
#15 pybind11::detail::initimpl::construct_or_initialize<nocturne::Simulation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::variant<bool, long, float>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::variant<bool, long, float> > > > const&, 0> ()
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/detail/init.h:73

@eugenevinitsky
Collaborator

Hey @BenQLange, just to give you an update we're slightly backlogged but Xiaomeng will take a look at this on Tuesday. Figured it was better to have a time than persistent uncertainty

@BenQLange
Author

BenQLange commented Aug 2, 2022

So floating point exceptions are not deterministic, but assertion errors are. I have identified invalid files in the training set:

array(['tfrecord-00008-of-01000_364.json',
       'tfrecord-00104-of-01000_303.json',
       'tfrecord-00128-of-01000_365.json',
       'tfrecord-00131-of-01000_86.json',
       'tfrecord-00214-of-01000_146.json',
       'tfrecord-00402-of-01000_57.json',
       'tfrecord-00506-of-01000_353.json',
       'tfrecord-00689-of-01000_184.json',
       'tfrecord-00811-of-01000_413.json',
       'tfrecord-00074-of-01000_192.json',
       'tfrecord-00090-of-01000_237.json',
       'tfrecord-00151-of-01000_418.json',
       'tfrecord-00179-of-01000_445.json',
       'tfrecord-00203-of-01000_466.json',
       'tfrecord-00206-of-01000_87.json',
       'tfrecord-00241-of-01000_464.json',
       'tfrecord-00247-of-01000_214.json',
       'tfrecord-00279-of-01000_81.json',
       'tfrecord-00298-of-01000_75.json',
       'tfrecord-00325-of-01000_483.json',
       'tfrecord-00343-of-01000_188.json',
       'tfrecord-00376-of-01000_41.json',
       'tfrecord-00396-of-01000_203.json',
       'tfrecord-00411-of-01000_295.json',
       'tfrecord-00431-of-01000_130.json',
       'tfrecord-00472-of-01000_85.json',
       'tfrecord-00483-of-01000_62.json',
       'tfrecord-00487-of-01000_377.json',
       'tfrecord-00532-of-01000_444.json',
       'tfrecord-00534-of-01000_37.json',
       'tfrecord-00564-of-01000_247.json',
       'tfrecord-00567-of-01000_34.json',
       'tfrecord-00570-of-01000_361.json',
       'tfrecord-00580-of-01000_420.json',
       'tfrecord-00616-of-01000_211.json',
       'tfrecord-00639-of-01000_188.json',
       'tfrecord-00653-of-01000_394.json',
       'tfrecord-00711-of-01000_490.json',
       'tfrecord-00735-of-01000_12.json',
       'tfrecord-00738-of-01000_388.json',
       'tfrecord-00754-of-01000_415.json',
       'tfrecord-00802-of-01000_74.json',
       'tfrecord-00805-of-01000_368.json',
       'tfrecord-00810-of-01000_6.json',
       'tfrecord-00829-of-01000_456.json',
       'tfrecord-00846-of-01000_330.json',
       'tfrecord-00863-of-01000_432.json',
       'tfrecord-00868-of-01000_297.json',
       'tfrecord-00869-of-01000_43.json',
       'tfrecord-00924-of-01000_471.json',
       'tfrecord-00937-of-01000_364.json',
       'tfrecord-00962-of-01000_378.json',
       'tfrecord-00984-of-01000_128.json'], dtype='<U32')

Hopefully, that's the reason behind floating point exception errors. I'll let you know after I run some more experiments.

UPDATE: There are more failing files. I didn't iterate over time :(

@xiaomengy
Contributor

Hi @BenQLange. Sorry for being late because of some other deadlines. I will take a detailed look later today and hopefully resolve it ASAP.

@BenQLange
Author

Small update: here are the configs I used to find the failing scenes listed above. Depending on the config, I get more or fewer assertion errors. In particular, I noticed this when changing the view angle. Hope that helps.

    import numpy as np  # needed for np.radians below

    # load dataloader config
    dataloader_config = {
        'tmin': 0,
        'tmax': 90,
        'view_dist': 80,
        'view_angle': np.radians(120),
        'dt': 0.1,
        'expert_action_bounds': None,
        'expert_position': True,
        'state_normalization': 100,
        'n_stacked_states': 5,
        'perturbations': False,
    }

    scenario_config = {
        'start_time': 0,
        'allow_non_vehicles': True,
        'spawn_invalid_objects': True,
        'max_visible_road_points': 500,
        'sample_every_n': 1,
        'road_edge_first': False,
    }

    tmin = dataloader_config.get('tmin', 0)
    tmax = dataloader_config.get('tmax', 90)
    view_dist = dataloader_config.get('view_dist', 80)
    view_angle = dataloader_config.get('view_angle', np.radians(120))
    dt = dataloader_config.get('dt', 0.1)
    expert_action_bounds = dataloader_config.get('expert_action_bounds',
                                                 [[-6, 6], [-0.7, 0.7]])
    expert_position = dataloader_config.get('expert_position', True)
    state_normalization = dataloader_config.get('state_normalization', 100)
    n_stacked_states = dataloader_config.get('n_stacked_states', 5)
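
One detail worth keeping in mind when comparing runs with these configs: `dict.get` only falls back to its default when the key is missing, so `'expert_action_bounds': None` yields `None` rather than `[[-6, 6], [-0.7, 0.7]]`. If `None` is meant to disable the bounds, that is already the intended behavior. The sketch below only illustrates the difference; `get_or_default` is a hypothetical helper, not part of nocturne.

```python
DEFAULT_BOUNDS = [[-6, 6], [-0.7, 0.7]]

config = {'expert_action_bounds': None}

# The key is present, so .get() returns the stored None, not the default.
bounds = config.get('expert_action_bounds', DEFAULT_BOUNDS)


def get_or_default(cfg, key, default):
    """Hypothetical helper: treat an explicit None like a missing key."""
    value = cfg.get(key)
    return default if value is None else value


fallback_bounds = get_or_default(config, 'expert_action_bounds', DEFAULT_BOUNDS)
```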

@eugenevinitsky
Collaborator

Thanks for finding those! We are still looking into it, but in the meantime, would wrapping the failing call in a try/except block temporarily resolve the issue so that you aren't blocked? We should have a resolution soon.

@BenQLange
Author

I don't think a try/except block can catch floating point exceptions or assertion errors. I tried, and the worker was still killed and the script stopped.

Instead, I iterated through the dataset with the above configs and created a dictionary of failing files (a bash script with a loop that reruns until it has iterated through the whole dataset). For now, I just skip those files during training.
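
Because the crash arrives as a signal that kills the process, one way to build such a list is to probe each file in a separate process and inspect its exit status. Below is a minimal sketch of that idea in Python rather than bash. It is not the script used here: `probe`, `find_failing`, and `DEMO_CHILD` are hypothetical names, and the child code only simulates a crash. In practice the child would load a single scene with the configs above.

```python
import signal
import subprocess
import sys


def probe(child_code, file_name, timeout=60.0):
    """Run child_code in a fresh interpreter with file_name as argv[1]."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, file_name],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exit code {proc.returncode}"


def find_failing(file_names, child_code):
    """Map each failing file name to a short description of how it failed."""
    failing = {}
    for name in file_names:
        status = probe(child_code, name)
        if status != "ok":
            failing[name] = status
    return failing


# Stand-in for the real per-file loading code (assumption): it raises
# SIGFPE in the child whenever the file name contains "bad".
DEMO_CHILD = (
    "import os, signal, sys\n"
    "if 'bad' in sys.argv[1]:\n"
    "    os.kill(os.getpid(), signal.SIGFPE)\n"
)

if __name__ == "__main__":
    print(find_failing(["good.json", "bad.json"], DEMO_CHILD))
```

The parent process survives each crash, so a single pass over the dataset is enough and no outer restart loop is needed.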

@BenQLange
Author

The modified dataset resolves the assertion errors, but I am still experiencing floating point exceptions from time to time :(

@eugenevinitsky
Collaborator

Hmm, we are still looking into it. I just got a new laptop with enough space for the whole dataset so hopefully I can reconstruct your errors and help.

@eugenevinitsky
Collaborator

eugenevinitsky commented Aug 9, 2022

Are the errors on the files you listed deterministic? I've constructed the subset of files that you have and looped the dataloader over them but am not seeing an error yet.

Reproduction script for reference

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""Imitation learning training script (behavioral cloning)."""
from datetime import datetime
from pathlib import Path
import pickle
import random
import json

import hydra
import numpy as np
import torch
from torch.utils.tensorboard import SummaryWriter
from torch.optim import Adam
from torch.utils.data import DataLoader
from tqdm import tqdm
import wandb

from examples.imitation_learning.model import ImitationAgent
from examples.imitation_learning.waymo_data_loader import WaymoDataset


def set_seed_everywhere(seed):
    """Ensure determinism."""
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)


@hydra.main(config_path="../../cfgs/imitation", config_name="config")
def main(args):
    """Train an IL model."""
    set_seed_everywhere(args.seed)
    expert_bounds = [[-6, 6], [-0.7, 0.7]]
        
    # load dataloader config
    dataloader_config = {
        'tmin': 0,
        'tmax': 90,
        'view_dist': 80,
        'view_angle': np.radians(120),
        'dt': 0.1,
        'expert_action_bounds': expert_bounds,
        'expert_position': False,
        'state_normalization': 100,
        'n_stacked_states': 5,
        'perturbations': False,
    }

    scenario_config = {
        'start_time': 0,
        'allow_non_vehicles': True,
        'spawn_invalid_objects': True,
        'max_visible_road_points': 500,
        'sample_every_n': 1,
        'road_edge_first': False,
    }
    
    dataset = WaymoDataset(
        data_path=args.path,
        file_limit=args.num_files,
        dataloader_config=dataloader_config,
        scenario_config=scenario_config,
    )
    data_loader = iter(
        DataLoader(
            dataset,
            batch_size=args.batch_size,
            num_workers=args.n_cpus,
            pin_memory=True,
        ))

    # create exp dir
    time_str = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
    exp_dir = Path.cwd() / Path('train_logs') / time_str
    exp_dir.mkdir(parents=True, exist_ok=True)

    # train loop
    for epoch in range(args.epochs):
        print(f'\nepoch {epoch+1}/{args.epochs}')
        n_samples = epoch * args.batch_size * (args.samples_per_epoch //
                                               args.batch_size)

        for i in tqdm(range(args.samples_per_epoch // args.batch_size),
                      unit='batch'):
            # get states and expert actions
            states, expert_actions = next(data_loader)


if __name__ == '__main__':
    main()

@BenQLange
Author

BenQLange commented Aug 9, 2022

I see. Yes, the assertion errors are deterministic, but they only show up when nocturne is compiled with the debug flag on. The floating point exceptions are not deterministic, and I don't have a clear idea where they are coming from. I'll run your script on my machine later and let you know the outcome.

EDIT: Got delayed. I'll run it today.

@eugenevinitsky
Collaborator

eugenevinitsky commented Aug 9, 2022

Oh! Okay, let me throw on the debug flag and try again. Thanks for the suggestion.

@xiaomengy
Contributor

Hi @BenQLange. Just to let you know about our progress: there is at least one vehicle/object with a negative length in tfrecord-00008-of-01000_364.json, which is at least one cause of the assertion failure. We are now investigating why such values occur and will try to find a way to handle these cases.

We found an object with the shape "width": 4.4137163162231445, "length": -1.295910358428955 in tfrecord-00008-of-01000_364.json
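
A scan for such entries could look like the sketch below. It deliberately assumes nothing about the scenario file layout beyond the `"width"` and `"length"` keys quoted above: it walks the whole JSON tree and reports every dictionary in which either value is non-positive. `find_bad_shapes` is a hypothetical helper, and the `"objects"` key in the example data is only for illustration.

```python
import json


def find_bad_shapes(node, path="$"):
    """Return (path, width, length) for every dict with a non-positive size."""
    bad = []
    if isinstance(node, dict):
        width, length = node.get("width"), node.get("length")
        if (isinstance(width, (int, float))
                and isinstance(length, (int, float))
                and (width <= 0 or length <= 0)):
            bad.append((path, width, length))
        for key, value in node.items():
            bad.extend(find_bad_shapes(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            bad.extend(find_bad_shapes(value, f"{path}[{index}]"))
    return bad


# Illustrative data only; the values are the ones reported above.
example_scene = json.loads(
    '{"objects": ['
    '{"width": 4.4137163162231445, "length": -1.295910358428955},'
    '{"width": 2.0, "length": 4.5}]}'
)
bad_shapes = find_bad_shapes(example_scene)
```

For a real file, the same function can be applied to the result of `json.load` on the opened scenario file.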

@eugenevinitsky
Collaborator

We're following up with Waymo in waymo-research/waymo-open-dataset#542 and will hopefully find a resolution (though the floating point error probably comes from a different source).

@BenQLange
Author

Thanks!
