New cosmics evio files crashes recon with FCAL_hits err (also hd_root) (130115 & 130116) #861
Big clue: hd_dump hd_rawdata_130116_000.evio did NOT crash on gluon47 (but it took a long time). The crashes were on ifarm and on a CMU node.
If it ran correctly but slowly elsewhere, then the crash you saw is probably just the thread timing out and being killed. I was able to run on the file without problems. On the ifarm I usually set JANA_CALIB_URL to mysql://[email protected]/ccdb, since hallddb is just super slow.
It totally crashed. Could you please try with -PPLUGINS=CDC_amp and then add -PTRKFIT:COSMICS=1? I wondered whether the tracking was causing issues, so I tried CDC_online instead and got a different error: 'rtvs' condition is not set for run 130116
This ran fine for me on the ifarm. Could you please try changing your JANA_CALIB_URL to what I suggested above? Alternatively, you can try setting THREAD_TIMEOUT and THREAD_TIMEOUT_FIRST_EVENT to some large value (>= 3600).
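As a rough illustration of the suggestion above, the following Python sketch assembles an hd_root command line with both timeouts raised. It assumes the two timeouts can be passed in the same -P<name>=<value> style used elsewhere in this thread for PLUGINS and TRKFIT:COSMICS; the helper names are hypothetical, and the calibration URL is left as a caller-supplied argument because the address is obscured in the text above.

```python
import os
import shutil
import subprocess


def build_hd_root_command(evio_file, plugins=("CDC_amp",), cosmics=True, timeout_s=3600):
    """Assemble an hd_root argument list (hypothetical helper, not part of halld_recon).

    Assumption: THREAD_TIMEOUT and THREAD_TIMEOUT_FIRST_EVENT are accepted as
    -P<name>=<value> parameters, like the other options quoted in this thread.
    """
    args = ["hd_root", evio_file]
    if plugins:
        args.append("-PPLUGINS=" + ",".join(plugins))
    if cosmics:
        args.append("-PTRKFIT:COSMICS=1")
    args.append(f"-PTHREAD_TIMEOUT={timeout_s}")
    args.append(f"-PTHREAD_TIMEOUT_FIRST_EVENT={timeout_s}")
    return args


def build_environment(calib_url=None):
    """Copy the current environment, optionally overriding JANA_CALIB_URL."""
    env = dict(os.environ)
    if calib_url:
        env["JANA_CALIB_URL"] = calib_url
    return env


def run_if_available(args, env):
    """Run the command only when hd_root is actually on PATH; otherwise return None."""
    if shutil.which(args[0]) is None:
        return None
    return subprocess.run(args, env=env, check=False).returncode


cmd = build_hd_root_command("hd_rawdata_130116_000.evio")
print(" ".join(cmd))
```

Printing the command first makes it easy to compare against what was typed by hand before anything is launched with run_if_available(cmd, build_environment(...)).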
Changing the ccdb link doesn't make any difference for me. Which version set are you using? I'm using version_jlab.xml, which is 5.21.0, from October. With the new ccdb URL, CDC_amp's error message is unchanged, but for CDC_online the error message is about trigger simulation. I expect it is linked to the plugin requesting a physics trigger.
`[[email protected]: /volatile/halld/home/njarvis ]> setenv JANA_CALIB_URL mysql://[email protected]/ccdb
JANA >>Created JCalibration object of type: JCalibrationCCDB
------------ Trigger Settings ---------------
----------- FCAL -----------
FCAL_CELL_THR = 65
----------- BCAL -----------
BCAL_CELL_THR = 20
Do not use RCDB for the trigger simulation.
Default (spring 2017) trigger settings are used
EVIO Processing rate = 0.885559 Hz
EVIO Statistics for hd_rawdata_130116_000.evio :
Nbad_blocks: 0
JANA >>Merging event reader thread ...
Closed ROOT file
`
Ah, ok, I'm running the current halld_recon master, which doesn't call that code. Not sure why the files aren't being saved in RCDB, but you should be able to skip this by specifying [...]. I can reproduce the FCAL calib call crash with that version of the code. I guess I'll have to build my own tag of this, since the standard builds don't have the debug symbols to say which line is causing the crash.
@sdobbs The DAQ still writes to RCDB v1. I also do not see the crash with the current master. I will build version_5.22.0.xml today, which should work.
This is running fine with the current master:
hd_root hd_rawdata_130116_000.evio -PPLUGINS=CDC_amp -PTRKFIT:COSMICS=1
I get the same complaints with -PTRIG:BYPASS=1, but the code keeps running, so that's good.
Oh right, yeah, I guess in that case one needs to change the RCDB environment variable to point to the v1 DB. OK, that sounds good about the new version - I don't know why the code should crash there, so it's probably some memory error upstream that was (hopefully) fixed.
I hope so too. Something that sneaks out quietly could sneak back in later. Thanks for your help.
This is fixed in the new version set version_5.22.0.xml.
(I was using 5.21.0 with evio files from 2017 yesterday without problems; presumably the issue is specific to newer data.)
I copied files from 2 recent cosmics runs into /volatile/halld/home/njarvis
FCAL is excluded from the readout. See rcdb.
hd_root and hd_dump crash with a message about FCALHits. See below for the complaints from hd_dump.
`===========================================================
There was a crash.
This is the entire stack trace of all threads:
Thread 7 (Thread 0x7fcd3f7fe640 (LWP 2304731) "hd_dump"):
#0 0x00007fcd55c8679a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1 0x00007fcd55c88fa0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2 0x00007fcd560d56b0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#3 0x00000000012aac6f in async_filebuf::readloop() ()
#4 0x00007fcd560dbad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#5 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#6 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6
Thread 6 (Thread 0x7fcd3ffff640 (LWP 2304729) "hd_dump"):
#0 0x00007fcd55c8679a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1 0x00007fcd55c89572 in pthread_cond_clockwait@GLIBC_2.30 () from /lib64/libc.so.6
#2 0x00000000012ae03b in DEVIOWorkerThread::PublishEvents() ()
#3 0x00000000012c4e47 in DEVIOWorkerThread::Run() ()
#4 0x00007fcd560dbad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#5 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#6 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6
Thread 5 (Thread 0x7fcd511fd640 (LWP 2304728) "hd_dump"):
#0 0x00007fcd55c8679a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1 0x00007fcd55c89572 in pthread_cond_clockwait@GLIBC_2.30 () from /lib64/libc.so.6
#2 0x00000000012ae03b in DEVIOWorkerThread::PublishEvents() ()
#3 0x00000000012c4e47 in DEVIOWorkerThread::Run() ()
#4 0x00007fcd560dbad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#5 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#6 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6
Thread 4 (Thread 0x7fcd519fe640 (LWP 2304727) "hd_dump"):
#0 0x00007fcd55cd4075 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007fcd55cd8c87 in nanosleep () from /lib64/libc.so.6
#2 0x0000000001296665 in JEventSource_EVIOpp::Dispatcher() ()
#3 0x00007fcd560dbad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#4 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#5 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6
Thread 3 (Thread 0x7fcd44d8e640 (LWP 2304726) "hd_dump"):
#0 0x00007fcd55cd8a3f in wait4 () from /lib64/libc.so.6
#1 0x00007fcd55c4b243 in do_system () from /lib64/libc.so.6
#2 0x00007fcd589e272c in TUnixSystem::StackTrace() () from /group/halld/Software/builds/Linux_Alma9-x86_64-gcc11.4.1/root/root-6.24.04/lib/libCore.so
#3 0x00007fcd589dfd65 in TUnixSystem::DispatchSignals(ESignals) () from /group/halld/Software/builds/Linux_Alma9-x86_64-gcc11.4.1/root/root-6.24.04/lib/libCore.so
#4 <signal handler called>
#5 0x0000000000eed2dd in DFCALHit_factory::FillCalibTable(std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >&, std::vector<double, std::allocator<double> > const&, DFCALGeometry const&) ()
#6 0x0000000000eee969 in DFCALHit_factory::brun(jana::JEventLoop*, int) ()
#7 0x00000000008bcb3d in jana::JFactory::Get(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&) ()
#8 0x00000000008cd07d in jana::JFactory* jana::JEventLoop::GetFromFactory(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) ()
#9 0x00000000008cd441 in jana::JFactory* jana::JEventLoop::Get(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&, char const*, bool) ()
#10 0x000000000144068f in DEventHitStatistics_factory::evnt(jana::JEventLoop*, unsigned long) ()
#11 0x00000000011b7410 in jana::JFactory::Get(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&) ()
#12 0x00000000011b7d1d in jana::JFactory* jana::JEventLoop::GetFromFactory(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) ()
#13 0x00000000011b8008 in jana::JFactory* jana::JEventLoop::Get(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&, char const*, bool) ()
#14 0x00000000011b85a5 in jana::JFactory::GetNrows(bool, bool) ()
#15 0x000000000073a784 in MyProcessor::evnt(jana::JEventLoop*, unsigned long) ()
#16 0x00000000014eff92 in jana::JEventLoop::OneEvent (this=0x7fcd40000b60) at src/JANA/JEventLoop.cc:693
#17 0x00000000014f05b4 in jana::JEventLoop::Loop (this=this@entry=0x7fcd40000b60) at src/JANA/JEventLoop.cc:496
#18 0x00000000014c54e5 in LaunchThread (arg=0x2ad8b70) at src/JANA/JApplication.cc:1382
#19 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#20 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6
Thread 2 (Thread 0x7fcd521ff640 (LWP 2304721) "hd_dump"):
#0 0x00007fcd55cd4075 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007fcd55cd8c87 in nanosleep () from /lib64/libc.so.6
#2 0x00007fcd55d04b29 in usleep () from /lib64/libc.so.6
#3 0x00000000014d809a in jana::JApplication::EventBufferThread (this=0x2ad8b70) at src/JANA/JApplication.cc:726
#4 0x00000000014d820a in LaunchEventBufferThread (arg=0x2ad8b70) at src/JANA/JApplication.cc:666
#5 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#6 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6
Thread 1 (Thread 0x7fcd52b5a4c0 (LWP 2304720) "hd_dump"):
#0 0x00007fcd55cd4075 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007fcd55cd8c87 in nanosleep () from /lib64/libc.so.6
#2 0x00000000014d2a0f in jana::JApplication::Run (this=0x2ad8b70, proc=, Nthreads=) at src/JANA/JApplication.cc:1613
#3 0x000000000072bc11 in main ()
The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum https://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
#5 0x0000000000eed2dd in DFCALHit_factory::FillCalibTable(std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >&, std::vector<double, std::allocator<double> > const&, DFCALGeometry const&) ()
#6 0x0000000000eee969 in DFCALHit_factory::brun(jana::JEventLoop*, int) ()
#7 0x00000000008bcb3d in jana::JFactory::Get(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&) ()
#8 0x00000000008cd07d in jana::JFactory* jana::JEventLoop::GetFromFactory(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) ()
#9 0x00000000008cd441 in jana::JFactory* jana::JEventLoop::Get(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&, char const*, bool) ()
#10 0x000000000144068f in DEventHitStatistics_factory::evnt(jana::JEventLoop*, unsigned long) ()
#11 0x00000000011b7410 in jana::JFactory::Get(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&) ()
#12 0x00000000011b7d1d in jana::JFactory* jana::JEventLoop::GetFromFactory(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) ()
#13 0x00000000011b8008 in jana::JFactory* jana::JEventLoop::Get(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&, char const*, bool) ()
#14 0x00000000011b85a5 in jana::JFactory::GetNrows(bool, bool) ()
#15 0x000000000073a784 in MyProcessor::evnt(jana::JEventLoop*, unsigned long) ()
#16 0x00000000014eff92 in jana::JEventLoop::OneEvent (this=0x7fcd40000b60) at src/JANA/JEventLoop.cc:693
#17 0x00000000014f05b4 in jana::JEventLoop::Loop (this=this@entry=0x7fcd40000b60) at src/JANA/JEventLoop.cc:496
#18 0x00000000014c54e5 in LaunchThread (arg=0x2ad8b70) at src/JANA/JApplication.cc:1382
#19 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#20 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6
===========================================================
JANA ERROR>> didn't sleep full 0.5 seconds!
`
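For a trace this long, it can help to pull out only the thread that ran ROOT's signal handler, since the frames just below the handler are where the crash happened. The following Python sketch does that for gdb-style output like the dump above. The function names are hypothetical, the sample text is abridged from the trace (arguments shortened to "..."), and the approach assumes the crashing thread is the one containing a TUnixSystem::DispatchSignals frame, as it is here.

```python
import re

THREAD_RE = re.compile(r"^Thread (\d+) \(")
FRAME_RE = re.compile(r"^#(\d+)\s*(?:0x[0-9a-fA-F]+ in )?(.*)$")


def split_threads(trace):
    """Group (frame number, frame text) pairs by thread number."""
    threads = {}
    current = None
    for raw in trace.splitlines():
        line = raw.strip().lstrip("`")
        # ROOT repeats the interesting frames after this sentence; skip them.
        if line.startswith("The lines below might hint"):
            current = None
            continue
        m = THREAD_RE.match(line)
        if m:
            current = int(m.group(1))
            threads[current] = []
            continue
        m = FRAME_RE.match(line)
        if m and current is not None:
            threads[current].append((int(m.group(1)), m.group(2).strip()))
    return threads


def crashing_frames(trace, marker="TUnixSystem::DispatchSignals"):
    """Return (thread number, frames below the signal handler), or (None, [])."""
    for thread, frames in split_threads(trace).items():
        for i, (_, text) in enumerate(frames):
            if marker in text:
                rest = frames[i + 1:]
                # Drop the handler frame itself, whether or not its label survived.
                if rest and rest[0][1] in ("", "<signal handler called>"):
                    rest = rest[1:]
                return thread, rest
    return None, []


# Abridged excerpt of the trace above, for illustration only.
SAMPLE = """\
Thread 4 (Thread 0x7fcd519fe640 (LWP 2304727) "hd_dump"):
#1 0x00007fcd55cd8c87 in nanosleep () from /lib64/libc.so.6
#2 0x0000000001296665 in JEventSource_EVIOpp::Dispatcher() ()
Thread 3 (Thread 0x7fcd44d8e640 (LWP 2304726) "hd_dump"):
#3 0x00007fcd589dfd65 in TUnixSystem::DispatchSignals(ESignals) ()
#4 <signal handler called>
#5 0x0000000000eed2dd in DFCALHit_factory::FillCalibTable(...) ()
#6 0x0000000000eee969 in DFCALHit_factory::brun(jana::JEventLoop*, int) ()
"""

thread, frames = crashing_frames(SAMPLE)
print(f"Crash in thread {thread}:")
for number, text in frames:
    print(f"#{number} {text}")
```

On the excerpt, this points at thread 3 with DFCALHit_factory::FillCalibTable as the first frame below the handler, which matches the frames ROOT itself repeats under "The lines below might hint at the cause of the crash."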