New cosmics evio files crash recon with FCAL_hits err (also hd_root) (130115 & 130116) #861

Open
nsjarvis opened this issue Dec 17, 2024 · 12 comments

@nsjarvis
Contributor

I copied files from two recent cosmics runs (130115 and 130116) into /volatile/halld/home/njarvis.
FCAL is excluded from the readout; see RCDB.

hd_root and hd_dump crash with a message about FCALHits. See below for the complaints from hd_dump.

`===========================================================
There was a crash.
This is the entire stack trace of all threads:

Thread 7 (Thread 0x7fcd3f7fe640 (LWP 2304731) "hd_dump"):
#0 0x00007fcd55c8679a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1 0x00007fcd55c88fa0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2 0x00007fcd560d56b0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#3 0x00000000012aac6f in async_filebuf::readloop() ()
#4 0x00007fcd560dbad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#5 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#6 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6

Thread 6 (Thread 0x7fcd3ffff640 (LWP 2304729) "hd_dump"):
#0 0x00007fcd55c8679a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1 0x00007fcd55c89572 in pthread_cond_clockwait@GLIBC_2.30 () from /lib64/libc.so.6
#2 0x00000000012ae03b in DEVIOWorkerThread::PublishEvents() ()
#3 0x00000000012c4e47 in DEVIOWorkerThread::Run() ()
#4 0x00007fcd560dbad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#5 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#6 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6

Thread 5 (Thread 0x7fcd511fd640 (LWP 2304728) "hd_dump"):
#0 0x00007fcd55c8679a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1 0x00007fcd55c89572 in pthread_cond_clockwait@GLIBC_2.30 () from /lib64/libc.so.6
#2 0x00000000012ae03b in DEVIOWorkerThread::PublishEvents() ()
#3 0x00000000012c4e47 in DEVIOWorkerThread::Run() ()
#4 0x00007fcd560dbad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#5 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#6 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6

Thread 4 (Thread 0x7fcd519fe640 (LWP 2304727) "hd_dump"):
#0 0x00007fcd55cd4075 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007fcd55cd8c87 in nanosleep () from /lib64/libc.so.6
#2 0x0000000001296665 in JEventSource_EVIOpp::Dispatcher() ()
#3 0x00007fcd560dbad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#4 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#5 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6

Thread 3 (Thread 0x7fcd44d8e640 (LWP 2304726) "hd_dump"):
#0 0x00007fcd55cd8a3f in wait4 () from /lib64/libc.so.6
#1 0x00007fcd55c4b243 in do_system () from /lib64/libc.so.6
#2 0x00007fcd589e272c in TUnixSystem::StackTrace() () from /group/halld/Software/builds/Linux_Alma9-x86_64-gcc11.4.1/root/root-6.24.04/lib/libCore.so
#3 0x00007fcd589dfd65 in TUnixSystem::DispatchSignals(ESignals) () from /group/halld/Software/builds/Linux_Alma9-x86_64-gcc11.4.1/root/root-6.24.04/lib/libCore.so
#4 <signal handler called>
#5 0x0000000000eed2dd in DFCALHit_factory::FillCalibTable(std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >&, std::vector<double, std::allocator<double> > const&, DFCALGeometry const&) ()
#6 0x0000000000eee969 in DFCALHit_factory::brun(jana::JEventLoop*, int) ()
#7 0x00000000008bcb3d in jana::JFactory::Get(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&) ()
#8 0x00000000008cd07d in jana::JFactory* jana::JEventLoop::GetFromFactory(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) ()
#9 0x00000000008cd441 in jana::JFactory* jana::JEventLoop::Get(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&, char const*, bool) ()
#10 0x000000000144068f in DEventHitStatistics_factory::evnt(jana::JEventLoop*, unsigned long) ()
#11 0x00000000011b7410 in jana::JFactory::Get(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&) ()
#12 0x00000000011b7d1d in jana::JFactory* jana::JEventLoop::GetFromFactory(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) ()
#13 0x00000000011b8008 in jana::JFactory* jana::JEventLoop::Get(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&, char const*, bool) ()
#14 0x00000000011b85a5 in jana::JFactory::GetNrows(bool, bool) ()
#15 0x000000000073a784 in MyProcessor::evnt(jana::JEventLoop*, unsigned long) ()
#16 0x00000000014eff92 in jana::JEventLoop::OneEvent (this=0x7fcd40000b60) at src/JANA/JEventLoop.cc:693
#17 0x00000000014f05b4 in jana::JEventLoop::Loop (this=this@entry=0x7fcd40000b60) at src/JANA/JEventLoop.cc:496
#18 0x00000000014c54e5 in LaunchThread (arg=0x2ad8b70) at src/JANA/JApplication.cc:1382
#19 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#20 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6

Thread 2 (Thread 0x7fcd521ff640 (LWP 2304721) "hd_dump"):
#0 0x00007fcd55cd4075 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007fcd55cd8c87 in nanosleep () from /lib64/libc.so.6
#2 0x00007fcd55d04b29 in usleep () from /lib64/libc.so.6
#3 0x00000000014d809a in jana::JApplication::EventBufferThread (this=0x2ad8b70) at src/JANA/JApplication.cc:726
#4 0x00000000014d820a in LaunchEventBufferThread (arg=0x2ad8b70) at src/JANA/JApplication.cc:666
#5 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#6 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6

Thread 1 (Thread 0x7fcd52b5a4c0 (LWP 2304720) "hd_dump"):
#0 0x00007fcd55cd4075 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007fcd55cd8c87 in nanosleep () from /lib64/libc.so.6
#2 0x00000000014d2a0f in jana::JApplication::Run (this=0x2ad8b70, proc=<optimized out>, Nthreads=<optimized out>) at src/JANA/JApplication.cc:1613
#3 0x000000000072bc11 in main ()

The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum https://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.

#5 0x0000000000eed2dd in DFCALHit_factory::FillCalibTable(std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >&, std::vector<double, std::allocator<double> > const&, DFCALGeometry const&) ()
#6 0x0000000000eee969 in DFCALHit_factory::brun(jana::JEventLoop*, int) ()
#7 0x00000000008bcb3d in jana::JFactory::Get(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&) ()
#8 0x00000000008cd07d in jana::JFactory* jana::JEventLoop::GetFromFactory(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) ()
#9 0x00000000008cd441 in jana::JFactory* jana::JEventLoop::Get(std::vector<DFCALHit const*, std::allocator<DFCALHit const*> >&, char const*, bool) ()
#10 0x000000000144068f in DEventHitStatistics_factory::evnt(jana::JEventLoop*, unsigned long) ()
#11 0x00000000011b7410 in jana::JFactory::Get(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&) ()
#12 0x00000000011b7d1d in jana::JFactory* jana::JEventLoop::GetFromFactory(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) ()
#13 0x00000000011b8008 in jana::JFactory* jana::JEventLoop::Get(std::vector<DEventHitStatistics const*, std::allocator<DEventHitStatistics const*> >&, char const*, bool) ()
#14 0x00000000011b85a5 in jana::JFactory::GetNrows(bool, bool) ()
#15 0x000000000073a784 in MyProcessor::evnt(jana::JEventLoop*, unsigned long) ()
#16 0x00000000014eff92 in jana::JEventLoop::OneEvent (this=0x7fcd40000b60) at src/JANA/JEventLoop.cc:693
#17 0x00000000014f05b4 in jana::JEventLoop::Loop (this=this@entry=0x7fcd40000b60) at src/JANA/JEventLoop.cc:496
#18 0x00000000014c54e5 in LaunchThread (arg=0x2ad8b70) at src/JANA/JApplication.cc:1382
#19 0x00007fcd55c89c02 in start_thread () from /lib64/libc.so.6
#20 0x00007fcd55d0ec40 in clone3 () from /lib64/libc.so.6

===========================================================

JANA ERROR>> didn't sleep full 0.5 seconds!
`
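
For anyone skimming a trace this long: ROOT repeats the crashing thread's frames after its "might hint at the cause" banner, so a small filter can pull out just that part. This is only a sketch; `crash_frames` is a hypothetical helper name, not something shipped with halld_recon or ROOT, and it assumes the banner text and the closing line of `=` characters look like they do above.

```shell
# Hypothetical helper (not part of halld_recon or ROOT): print only the
# stack frames that follow ROOT's "might hint at the cause" banner.
# Reads the files given as arguments, or stdin if none are given.
crash_frames() {
    awk '
        # Start collecting after the banner line.
        /The lines below might hint at the cause of the crash/ { on = 1; next }
        # A line made only of "=" closes the trace.
        /^=+$/ { on = 0 }
        # While collecting, keep lines that look like gdb frames ("#5 0x...").
        on && /^#[0-9]+ / { print }
    ' "$@"
}
```

It could be used as `hd_dump hd_rawdata_130116_000.evio 2>&1 | crash_frames | head -n 3` to show the innermost frames, which here would be the `DFCALHit_factory` calls.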

@nsjarvis
Contributor Author

Big clue: hd_dump hd_rawdata_130116_000.evio did NOT crash on gluon47 (but it took a long time). The crashes were on ifarm and on a CMU node.

@sdobbs
Contributor

sdobbs commented Dec 17, 2024

If it ran correctly but slowly elsewhere, then the crash you saw is probably just the thread timing out and being killed. I was able to run on the file without problems.

On the ifarm I usually set JANA_CALIB_URL to mysql://[email protected]/ccdb since hallddb is just super slow.

@nsjarvis
Contributor Author

It totally crashed. Please could you try with -PPLUGINS=CDC_amp and then add -PTRKFIT:COSMICS=1 ?

I wondered if the tracking was causing issues, so I then tried CDC_online instead and got a different error:

'rtvs' condition is not set for run 130116

@sdobbs
Contributor

sdobbs commented Dec 17, 2024

This ran fine for me on the ifarm. Could you please try changing your JANA_CALIB_URL to what I suggested above?

Alternatively, you can try setting THREAD_TIMEOUT and THREAD_TIMEOUT_FIRST_EVENT to some large value (>= 3600).
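
As a rough sketch of the timeout suggestion, the snippet below only composes the command line and prints it, so it can be checked before running. It assumes the two timeout parameters are passed with the same `-P` syntax used for `PLUGINS` elsewhere in this thread; the variable names are illustrative, and 3600 is simply the lower bound mentioned above.

```shell
# Sketch only: build (but do not run) an hd_root command line with longer
# thread timeouts. Assumes the timeouts use the same -PNAME=value syntax
# as -PPLUGINS; variable names here are illustrative.
timeout_s=3600
evio_file=hd_rawdata_130116_000.evio
cmd="hd_root $evio_file -PPLUGINS=CDC_amp -PTHREAD_TIMEOUT=$timeout_s -PTHREAD_TIMEOUT_FIRST_EVENT=$timeout_s"

# Print the command so it can be inspected, then pasted or eval'd by hand.
echo "$cmd"
```

If the job still dies well before the timeout, that would point away from a stalled thread and towards a real crash.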

@nsjarvis
Contributor Author

Changing the ccdb link doesn't make any difference for me. Which version set are you using? I'm using version_jlab.xml which is 5.21.0, from October.

With the new ccdb url, CDC_amp's error message is unchanged, but for CDC_online, the error message is about trigger simulation. I expect it was linked to the plugin requesting a physics trigger.

`[[email protected]: /volatile/halld/home/njarvis ]> setenv JANA_CALIB_URL mysql://[email protected]/ccdb
[[email protected]: /volatile/halld/home/njarvis ]> hd_root hd_rawdata_130116_000.evio -PPLUGINS=CDC_online
JANA >>OUTPUT_FILENAME: hd_root.root
JANA >>Initializing plugin "/group/halld/Software/builds/Linux_Alma9-x86_64-gcc11.4.1/halld_recon/halld_recon-4.51.0/Linux_Alma9-x86_64-gcc11.4.1/plugins/CDC_online.so" ...
Opened ROOT file "hd_root.root" ...
JANA >>Opening source "hd_rawdata_130116_000.evio" of type: EVIOpp - Reads EVIO formatted data from file or ET system
loading VERSION 3
JANA >>Control event: Prestart - Mon Dec 16 21:09:23 2024
JANA >>Launching threads .
JANA >>Control event: Go - Mon Dec 16 21:09:51 2024

JANA >>Created JCalibration object of type: JCalibrationCCDB
JANA >>Generated via: JCalibration using CCDB for MySQL and SQLite databases
JANA >>Run:130116
JANA >>URL: mysql://[email protected]/ccdb
JANA >>context: default
JANA >>comment: Default constants for analyzing data
JANA >>Creating DTranslationTable for run 130116
JANA >>Reading translation table from calib DB: Translation/DAQ2detector ...
JANA >>41791 channels defined in translation table(avg.: 0.0Hz)
JANA >>
JANA >> --- Configuration Parameters --
JANA >> PLUGINS = CDC_online
JANA >> THREAD_TIMEOUT = 30 seconds
JANA >> -------------------------------
'rtvs' condition is not set for run 130116

------------ Trigger Settings ---------------

----------- FCAL -----------

FCAL_CELL_THR = 65
FCAL_NSA = 10
FCAL_NSB = 3
FCAL_WINDOW = 10

----------- BCAL -----------

BCAL_CELL_THR = 20
BCAL_NSA = 19
BCAL_NSB = 3
BCAL_WINDOW = 20

Do not use RCDB for the trigger simulation. Default (spring 2017) trigger settings are used
JANA >>Creating DGeometry:
JANA >> Run requested:130116 found:130116
JANA >> Run validity range: 130116-130116
JANA >> URL="ccdb:///GEOMETRY/main_HDDS.xml" context="default"
JANA >> Type="JGeometryXML"
JANA >>Found 25 material maps in calib. DB
JANA >>Read in 25 material maps for run 130116 containing 76153 grid points total
JANA >>In DL1MCTrigger_factory_DATA, loading constants... 3.0Hz)
JANA >> Factory Call Stack
JANA >>============================
JANA >> DL1MCTrigger:DATA (brun) -- line:281 /u/group/halld/Software/builds/Linux_Alma9-x86_64-gcc11.4.1/jana/jana_0.8.2^ccdb2005/Linux_Alma9-x86_64-gcc11.4.1/include/JANA/JFactory.h
JANA >> DL1MCTrigger:DATA
JANA >> DTrigger: (evnt) -- line:299 /u/group/halld/Software/builds/Linux_Alma9-x86_64-gcc11.4.1/jana/jana_0.8.2^ccdb2005/Linux_Alma9-x86_64-gcc11.4.1/include/JANA/JFactory.h
JANA >> DTrigger
JANA >> JEventLoop:OneEvent (evnt) -- line:695 src/JANA/JEventLoop.cc
JANA >>----------------------------
src/JANA/JEventLoop.cc:698 EXCEPTION : std::exception
src/JANA/JApplication.cc:1386 EXCEPTION caught for thread 139646549874240 : std::exception
JANA ERROR>>
JANA ERROR>> Automatic relaunching of threads is disabled. If you wish to
JANA ERROR>> have the program relaunch a replacement thread when a stalled
JANA ERROR>> one is killed, set the JANA:MAX_RELAUNCH_THREADS configuration
JANA ERROR>> parameter to a value greater than zero. E.g.:
JANA ERROR>>
JANA ERROR>> jana -PJANA:MAX_RELAUNCH_THREADS=10
JANA ERROR>>
JANA ERROR>> The program will quit now.
JANA >>
JANA >>Telling all threads to quit ...
JANA >>Merging thread 0 (0x7f01feffd640) ...

EVIO Processing rate = 0.885559 Hz
NDISPATCHER_STALLED = 3127 (92.3%)
NPARSER_STALLED = 6221 (91.8%)
NEVENTBUFF_STALLED = 42 ( 1.2%)

EVIO Statistics for hd_rawdata_130116_000.evio :

Nblocks: 2
Nevents: 8
Nerrors: 0

Nbad_blocks: 0
Nbad_events: 0

JANA >>Merging event reader thread ...
JANA >> 4 events processed (14 events read) Average rate: 2.0Hz

Closed ROOT file
Exit code: 70
JANA >>Closing shared object handle 0 ...
`

@sdobbs
Contributor

sdobbs commented Dec 17, 2024

Ah, ok, I'm running the current halld_recon master, which doesn't call that code. Not sure why the files aren't being saved in RCDB, but you should be able to skip this by specifying -PTRIG:BYPASS=1

I can reproduce the FCAL calib call crash with that version of the code. I guess I'll have to build my own tag of this, since the standard builds don't have the debug symbols to say which line is causing the crash.

@aaust
Contributor

aaust commented Dec 17, 2024

@sdobbs The DAQ still writes to RCDB v1.

I also do not see the crash with the current master. I will build version_5.22.0.xml today, which should work.

@nsjarvis
Contributor Author

This is running fine with the current master: hd_root hd_rawdata_130116_000.evio -PPLUGINS=CDC_amp -PTRKFIT:COSMICS=1
apart from "Do not use RCDB for the trigger simulation. Default (spring 2017) trigger settings are used" and "'rtvs' condition is not set for run 130116".

I get the same complaints with -PTRIG:BYPASS=1, but the code keeps running, so that's good.
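
To tell these two harmless complaints apart from a real failure in a saved log, something like the following could help. It is a sketch under assumptions: `summarize_log` is a hypothetical helper, and it matches only the message strings quoted in this thread, so other failure modes would not be counted.

```shell
# Hypothetical log check: count the two known RCDB-related complaints
# separately from real failures (EXCEPTION lines or ROOT's crash banner).
summarize_log() {
    log=$1
    # grep -c prints the number of matching lines; "|| true" keeps a zero
    # count from being treated as an error when running under "set -e".
    warnings=$(grep -c -e "'rtvs' condition is not set" \
                       -e "Do not use RCDB for the trigger simulation" "$log" || true)
    failures=$(grep -c -e "EXCEPTION" -e "There was a crash" "$log" || true)
    echo "warnings=$warnings failures=$failures"
}
```

For example, after `hd_root hd_rawdata_130116_000.evio -PPLUGINS=CDC_amp -PTRKFIT:COSMICS=1 > run.log 2>&1`, calling `summarize_log run.log` would show whether anything beyond the two known complaints appeared.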

@sdobbs
Contributor

sdobbs commented Dec 17, 2024

Oh right, yeah, I guess in that case one needs to change the RCDB environment variable to point to the v1 DB.

OK, that sounds good about the new version - I don't know why the code should crash there, so it's probably some memory error upstream that was (hopefully) fixed.

@nsjarvis
Contributor Author

I hope so too. Something that sneaks out quietly could sneak back in later.

Thanks for your help.

@nsjarvis
Contributor Author

This is fixed in the new version set version_5.22.0.xml.

@nsjarvis
Contributor Author

(I was using 5.21.0 with evio files from 2017 yesterday without problems, so presumably the issue with it is specific to newer data.)
