Weird segfaults in PodioOutput #156

Closed
Zehvogel opened this issue Nov 1, 2023 · 9 comments · Fixed by key4hep/k4MarlinWrapper#157

@Zehvogel
Contributor

Zehvogel commented Nov 1, 2023

I have a bunch of files that I'm reconstructing, and for 3 out of roughly 200 of them Gaudi crashes with this segfault, always in the same event per file:

#6  0x00007fb30767e831 in podio::DatamodelDefinitionCollector::registerDatamodelDefinition(podio::CollectionBase const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/sw-nightlies.hsf.org/key4hep/releases/2023-10-21/x86_64-almalinux9-gcc11.3.1-opt/podio/e44af47e44b595439eb4a62ac9a6893bc46f9b9e_develop-reqsua/lib64/libpodio.so
#7  0x00007fb30dc4d660 in podio::ROOTFrameWriter::writeFrame(podio::Frame const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /cvmfs/sw-nightlies.hsf.org/key4hep/releases/2023-10-21/x86_64-almalinux9-gcc11.3.1-opt/podio/e44af47e44b595439eb4a62ac9a6893bc46f9b9e_develop-reqsua/lib64/libpodioRootIO.so
#8  0x00007fb2f96a2a46 in PodioOutput::execute() () from /cvmfs/sw-nightlies.hsf.org/key4hep/releases/2023-11-01/x86_64-almalinux9-gcc11.3.1-opt/k4fwcore/95931b5a9bfc48ef8d00163ace88e04cf4e005e0_develop-7bi526/lib/libk4FWCorePlugins.so
#9  0x00007fb30e61fb18 in Gaudi::Algorithm::sysExecute(EventContext const&) () from /cvmfs/sw-nightlies.hsf.org/key4hep/releases/2023-10-21/x86_64-almalinux9-gcc11.3.1-opt/gaudi/36.14-5mwxb4/lib/libGaudiKernel.so

I have found no obvious difference with these events compared to others in the same file and I am a bit lost on how to debug further.

Should be reproducible by

. /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh
git clone https://github.com/key4hep/CLDConfig.git
cd CLDConfig/CLDConfig
wget https://cernbox.cern.ch/remote.php/dav/public-files/oSYq2xtW01XEdTx/SIM_e-_10deg_1GeV_1000evt.slcio
k4run CLDReconstruction.py --inputFiles SIM_e-_10deg_1GeV_1000evt.slcio -n 1000 --outputBasename crash_test

The crash should happen at event 638. All three files are here: https://cernbox.cern.ch/s/oSYq2xtW01XEdTx. For pi-_89deg_1GeV it crashes at event 41 and for pi-_40deg_200GeV at event 914.

I am a bit suspicious of this piece of podio that gets called in the process and might return 0, but my podio knowledge is too limited to know whether that can happen.

@tmadlener
Contributor

From the stacktrace alone this looks like a collection is missing in the events where this is failing. Can you run with LCIO output and see if that is the case? (Not sure if we get all of them as REC and DST outputs might not write all the collections that are available).

If a collection is missing in the EventStore (i.e. the Frame backing it), then the Frame will try to get the collection from FrameData and would indeed come across doGet (via the public get function) and return nullptr (since the EmptyFrameData does not provide any data to construct one). Taking that into account, this does look like some processor not producing an output collection and that then trips up the output because it will not be converted.

If you are able to find out which collection is missing in some (rare) cases, the proper fix would be to make the processor always produce a collection, even if it is empty. The quick fix would be to use the PatchCollections processor to patch empty collections into the LCEvent before running the corresponding converter.
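
The quick fix amounts to making sure that every expected collection exists before conversion, even if it is empty. Below is a minimal Python sketch of that idea, using a plain dict as a stand-in for the event. The function `patch_missing_collections` and the dict layout are hypothetical illustrations and are not the configuration or API of the actual PatchCollections processor.

```python
from typing import Dict, List, Tuple

# Stand-in for an event: collection name -> (type name, list of elements).
# This layout is an assumption made for illustration only.
Event = Dict[str, Tuple[str, list]]


def patch_missing_collections(event: Event, expected: List[Tuple[str, str]]) -> List[str]:
    """Insert an empty collection for every expected (name, type) pair that is absent.

    Returns the names that had to be patched in, so that they can be logged.
    """
    patched = []
    for name, type_name in expected:
        if name not in event:
            event[name] = (type_name, [])
            patched.append(name)
    return patched


# Collection names taken from the thread; the contents are made up.
expected = [
    ("ECalBarrelCollection", "SimCalorimeterHit"),
    ("CalohitMCTruthLink", "LCRelation"),
]
event: Event = {"ECalBarrelCollection": ("SimCalorimeterHit", ["hit0", "hit1"])}
patched = patch_missing_collections(event, expected)
print("patched in:", patched)
```

Returning the patched names keeps the workaround visible in the logs, which helps when tracking down the processor that should have produced the collection in the first place.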

@Zehvogel
Contributor Author

Zehvogel commented Nov 1, 2023

This is for pi-_89deg_1GeV, i.e. 41 would be the interesting event. I (or better, my editor) count 71 collections for each of the events...

///////////////////////////////////
EVENT: 40
RUN: 0
DETECTOR: FCCee
COLLECTIONS: (see below)
///////////////////////////////////

---------------------------------------------------------------------------
COLLECTION NAME               COLLECTION TYPE          NUMBER OF ELEMENTS  
===========================================================================
BuildUpVertices               Vertex                           0
BuildUpVertices_RP            ReconstructedParticle            0
BuildUpVertices_V0            Vertex                           0
BuildUpVertices_V0_RP         ReconstructedParticle            0
CalohitMCTruthLink            LCRelation                      24
ClusterMCTruthLink            LCRelation                       1
DebugHits                     TrackerHitPlane                  0
ECALBarrel                    CalorimeterHit                  23
ECALEndcap                    CalorimeterHit                   0
ECalBarrelCollection          SimCalorimeterHit               43
ECalEndcapCollection          SimCalorimeterHit                7
EfficientMCParticles          MCParticle                       1
HCALBarrel                    CalorimeterHit                   1
HCALEndcap                    CalorimeterHit                   0
HCALOther                     CalorimeterHit                   0
HCalBarrelCollection          SimCalorimeterHit               11
HCalEndcapCollection          SimCalorimeterHit               13
HCalRingCollection            SimCalorimeterHit                0
ITrackerEndcapHits            TrackerHitPlane                  0
ITrackerHits                  TrackerHitPlane                  3
InefficientMCParticles        MCParticle                       0
InnerTrackerBarrelCollection  SimTrackerHit                    4
InnerTrackerBarrelHitsRelationsLCRelation                       3
InnerTrackerEndcapCollection  SimTrackerHit                    0
InnerTrackerEndcapHitsRelationsLCRelation                       0
LooseSelectedPandoraPFOs      ReconstructedParticle            1
LumiCalCollection             SimCalorimeterHit                0
LumiCalHits                   CalorimeterHit                   0
MCParticle                    MCParticle                       8
MCParticlesSkimmed            MCParticle                       1
MCPhysicsParticles            MCParticle                       8
MCTruthClusterLink            LCRelation                       1
MCTruthRecoLink               LCRelation                       1
MCTruthSiTracksLink           LCRelation                       1
MUON                          CalorimeterHit                   0
OTrackerEndcapHits            TrackerHitPlane                  1
OTrackerHits                  TrackerHitPlane                 19
OuterTrackerBarrelCollection  SimTrackerHit                   24
OuterTrackerBarrelHitsRelationsLCRelation                      19
OuterTrackerEndcapCollection  SimTrackerHit                    2
OuterTrackerEndcapHitsRelationsLCRelation                       1
PFOsFromJets                  ReconstructedParticle            1
PandoraClusters               Cluster                          1
PandoraPFOs                   ReconstructedParticle            1
PandoraStartVertices          Vertex                           1
PrimaryVertices               Vertex                           1
PrimaryVertices_RP            ReconstructedParticle            1
RecoMCTruthLink               LCRelation                       1
RefinedVertexJets             ReconstructedParticle            1
RefinedVertexJets_rel         LCRelation                       0
RefinedVertexJets_vtx         Vertex                           0
RefinedVertexJets_vtx_RP      ReconstructedParticle            0
RefinedVertices               Vertex                           0
RefinedVertices_RP            ReconstructedParticle            0
RelationCaloHit               LCRelation                      24
RelationMuonHit               LCRelation                       0
SelectedPandoraPFOs           ReconstructedParticle            1
SiTracks                      Track                            1
SiTracksCT                    Track                            1
SiTracksMCTruthLink           LCRelation                       1
SiTracks_Refitted             Track                            1
TightSelectedPandoraPFOs      ReconstructedParticle            1
VXDEndcapTrackerHitRelations  LCRelation                       0
VXDEndcapTrackerHits          TrackerHitPlane                  0
VXDTrackerHitRelations        LCRelation                       6
VXDTrackerHits                TrackerHitPlane                  6
VertexBarrelCollection        SimTrackerHit                    6
VertexEndcapCollection        SimTrackerHit                    0
VertexJets                    ReconstructedParticle            1
YokeBarrelCollection          SimCalorimeterHit                0
YokeEndcapCollection          SimCalorimeterHit                0
---------------------------------------------------------------------------



///////////////////////////////////
EVENT: 41
RUN: 0
DETECTOR: FCCee
COLLECTIONS: (see below)
///////////////////////////////////

---------------------------------------------------------------------------
COLLECTION NAME               COLLECTION TYPE          NUMBER OF ELEMENTS  
===========================================================================
BuildUpVertices               Vertex                           0
BuildUpVertices_RP            ReconstructedParticle            0
BuildUpVertices_V0            Vertex                           0
BuildUpVertices_V0_RP         ReconstructedParticle            0
CalohitMCTruthLink            LCRelation                       0
ClusterMCTruthLink            LCRelation                       0
DebugHits                     TrackerHitPlane                  0
ECALBarrel                    CalorimeterHit                   0
ECALEndcap                    CalorimeterHit                   0
ECalBarrelCollection          SimCalorimeterHit                0
ECalEndcapCollection          SimCalorimeterHit                0
EfficientMCParticles          MCParticle                       1
HCALBarrel                    CalorimeterHit                   0
HCALEndcap                    CalorimeterHit                   0
HCALOther                     CalorimeterHit                   0
HCalBarrelCollection          SimCalorimeterHit                0
HCalEndcapCollection          SimCalorimeterHit                0
HCalRingCollection            SimCalorimeterHit                0
ITrackerEndcapHits            TrackerHitPlane                  2
ITrackerHits                  TrackerHitPlane                 95
InefficientMCParticles        MCParticle                       0
InnerTrackerBarrelCollection  SimTrackerHit                   98
InnerTrackerBarrelHitsRelationsLCRelation                      95
InnerTrackerEndcapCollection  SimTrackerHit                    2
InnerTrackerEndcapHitsRelationsLCRelation                       2
LooseSelectedPandoraPFOs      ReconstructedParticle            1
LumiCalCollection             SimCalorimeterHit                0
LumiCalHits                   CalorimeterHit                   0
MCParticle                    MCParticle                      11
MCParticlesSkimmed            MCParticle                       3
MCPhysicsParticles            MCParticle                      11
MCTruthClusterLink            LCRelation                       0
MCTruthRecoLink               LCRelation                       1
MCTruthSiTracksLink           LCRelation                       1
MUON                          CalorimeterHit                   0
OTrackerEndcapHits            TrackerHitPlane                  0
OTrackerHits                  TrackerHitPlane                 82
OuterTrackerBarrelCollection  SimTrackerHit                   85
OuterTrackerBarrelHitsRelationsLCRelation                      82
OuterTrackerEndcapCollection  SimTrackerHit                    0
OuterTrackerEndcapHitsRelationsLCRelation                       0
PFOsFromJets                  ReconstructedParticle            1
PandoraClusters               Cluster                          0
PandoraPFOs                   ReconstructedParticle            1
PandoraStartVertices          Vertex                           1
PrimaryVertices               Vertex                           1
PrimaryVertices_RP            ReconstructedParticle            1
RecoMCTruthLink               LCRelation                       1
RefinedVertexJets             ReconstructedParticle            1
RefinedVertexJets_rel         LCRelation                       0
RefinedVertexJets_vtx         Vertex                           0
RefinedVertexJets_vtx_RP      ReconstructedParticle            0
RefinedVertices               Vertex                           0
RefinedVertices_RP            ReconstructedParticle            0
RelationCaloHit               LCRelation                       0
RelationMuonHit               LCRelation                       0
SelectedPandoraPFOs           ReconstructedParticle            1
SiTracks                      Track                            1
SiTracksCT                    Track                            1
SiTracksMCTruthLink           LCRelation                       1
SiTracks_Refitted             Track                            1
TightSelectedPandoraPFOs      ReconstructedParticle            0
VXDEndcapTrackerHitRelations  LCRelation                       0
VXDEndcapTrackerHits          TrackerHitPlane                  0
VXDTrackerHitRelations        LCRelation                       6
VXDTrackerHits                TrackerHitPlane                  6
VertexBarrelCollection        SimTrackerHit                    6
VertexEndcapCollection        SimTrackerHit                    0
VertexJets                    ReconstructedParticle            1
YokeBarrelCollection          SimCalorimeterHit                0
YokeEndcapCollection          SimCalorimeterHit                0
---------------------------------------------------------------------------



///////////////////////////////////
EVENT: 42
RUN: 0
DETECTOR: FCCee
COLLECTIONS: (see below)
///////////////////////////////////

---------------------------------------------------------------------------
COLLECTION NAME               COLLECTION TYPE          NUMBER OF ELEMENTS  
===========================================================================
BuildUpVertices               Vertex                           0
BuildUpVertices_RP            ReconstructedParticle            0
BuildUpVertices_V0            Vertex                           0
BuildUpVertices_V0_RP         ReconstructedParticle            0
CalohitMCTruthLink            LCRelation                      47
ClusterMCTruthLink            LCRelation                       1
DebugHits                     TrackerHitPlane                  0
ECALBarrel                    CalorimeterHit                  47
ECALEndcap                    CalorimeterHit                   0
ECalBarrelCollection          SimCalorimeterHit               55
ECalEndcapCollection          SimCalorimeterHit                0
EfficientMCParticles          MCParticle                       1
HCALBarrel                    CalorimeterHit                   0
HCALEndcap                    CalorimeterHit                   0
HCALOther                     CalorimeterHit                   0
HCalBarrelCollection          SimCalorimeterHit               14
HCalEndcapCollection          SimCalorimeterHit                0
HCalRingCollection            SimCalorimeterHit                0
ITrackerEndcapHits            TrackerHitPlane                  0
ITrackerHits                  TrackerHitPlane                  3
InefficientMCParticles        MCParticle                       0
InnerTrackerBarrelCollection  SimTrackerHit                    3
InnerTrackerBarrelHitsRelationsLCRelation                       3
InnerTrackerEndcapCollection  SimTrackerHit                    0
InnerTrackerEndcapHitsRelationsLCRelation                       0
LooseSelectedPandoraPFOs      ReconstructedParticle            1
LumiCalCollection             SimCalorimeterHit                0
LumiCalHits                   CalorimeterHit                   0
MCParticle                    MCParticle                       1
MCParticlesSkimmed            MCParticle                       1
MCPhysicsParticles            MCParticle                       1
MCTruthClusterLink            LCRelation                       1
MCTruthRecoLink               LCRelation                       1
MCTruthSiTracksLink           LCRelation                       1
MUON                          CalorimeterHit                   0
OTrackerEndcapHits            TrackerHitPlane                  0
OTrackerHits                  TrackerHitPlane                  3
OuterTrackerBarrelCollection  SimTrackerHit                    3
OuterTrackerBarrelHitsRelationsLCRelation                       3
OuterTrackerEndcapCollection  SimTrackerHit                    0
OuterTrackerEndcapHitsRelationsLCRelation                       0
PFOsFromJets                  ReconstructedParticle            1
PandoraClusters               Cluster                          1
PandoraPFOs                   ReconstructedParticle            1
PandoraStartVertices          Vertex                           1
PrimaryVertices               Vertex                           1
PrimaryVertices_RP            ReconstructedParticle            1
RecoMCTruthLink               LCRelation                       1
RefinedVertexJets             ReconstructedParticle            1
RefinedVertexJets_rel         LCRelation                       0
RefinedVertexJets_vtx         Vertex                           0
RefinedVertexJets_vtx_RP      ReconstructedParticle            0
RefinedVertices               Vertex                           0
RefinedVertices_RP            ReconstructedParticle            0
RelationCaloHit               LCRelation                      47
RelationMuonHit               LCRelation                       0
SelectedPandoraPFOs           ReconstructedParticle            1
SiTracks                      Track                            1
SiTracksCT                    Track                            1
SiTracksMCTruthLink           LCRelation                       1
SiTracks_Refitted             Track                            1
TightSelectedPandoraPFOs      ReconstructedParticle            1
VXDEndcapTrackerHitRelations  LCRelation                       0
VXDEndcapTrackerHits          TrackerHitPlane                  0
VXDTrackerHitRelations        LCRelation                       6
VXDTrackerHits                TrackerHitPlane                  6
VertexBarrelCollection        SimTrackerHit                    6
VertexEndcapCollection        SimTrackerHit                    0
VertexJets                    ReconstructedParticle            1
YokeBarrelCollection          SimCalorimeterHit                0
YokeEndcapCollection          SimCalorimeterHit                0
---------------------------------------------------------------------------
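
Counting rows by eye only shows that no collection is missing from the LCIO output. A small script can also show which collections changed between two events. The following is a sketch under the assumption that the dump has the layout shown above, where each data row ends in an integer element count. The helper names `parse_anajob_event` and `compare_events` are hypothetical. Because long collection names run into the type column, the name and type are kept together as one key instead of being split.

```python
from typing import Dict, List, Tuple


def parse_anajob_event(text: str) -> Dict[str, int]:
    """Map each 'name type' key of one collection table to its element count."""
    collections: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)
        # Data rows end in an integer count; table headers and rulers do not.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        # Skip metadata lines such as "EVENT: 40" or "RUN: 0".
        if parts[0].endswith(":"):
            continue
        key = " ".join(parts[0].split())
        collections[key] = int(parts[1])
    return collections


def compare_events(a: Dict[str, int], b: Dict[str, int]) -> Tuple[List[str], List[str], List[str]]:
    """Return (keys only in a, keys only in b, keys that are non-empty in a but empty in b)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    emptied = sorted(k for k in set(a) & set(b) if a[k] > 0 and b[k] == 0)
    return only_a, only_b, emptied


# A few rows copied from the event 40 and event 41 dumps above.
event_40 = """
EVENT: 40
ECALBarrel                    CalorimeterHit                  23
InnerTrackerBarrelHitsRelationsLCRelation                       3
PandoraClusters               Cluster                          1
"""
event_41 = """
EVENT: 41
ECALBarrel                    CalorimeterHit                   0
InnerTrackerBarrelHitsRelationsLCRelation                      95
PandoraClusters               Cluster                          0
"""
only_a, only_b, emptied = compare_events(parse_anajob_event(event_40), parse_anajob_event(event_41))
print(only_a, only_b, emptied)
```

On the excerpt above, no collection is missing from either event, but the calorimeter-related ones drop to zero elements in event 41.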

@Zehvogel
Contributor Author

Zehvogel commented Nov 1, 2023

REC should write everything, shouldn't it, given that nothing is dropped?

    Output_REC = MarlinProcessorWrapper("Output_REC")
    Output_REC.OutputLevel = WARNING
    Output_REC.ProcessorType = "LCIOOutputProcessor"
    Output_REC.Parameters = {
                             "DropCollectionNames": [],
                             "DropCollectionTypes": [],
                             "FullSubsetCollections": ["EfficientMCParticles", "InefficientMCParticles"],
                             "KeepCollectionNames": [],
                             "LCIOOutputFile": [f"{output_basename}_REC.slcio"],
                             "LCIOWriteMode": ["WRITE_NEW"]
                             }

@tmadlener
Contributor

Yes, I think it should write everything with this configuration. The anajob output indeed does not give us a lot here. Can you get the collection name for which the thing crashes from the debugger? I think we should have debug symbols, but the actual strings might be optimized out. Maybe running things inside gdb until they crash and then going to the writeFrame stack frame gives you a name(?). Otherwise we will have to get a less optimized build of podio and repeat.

@Zehvogel
Contributor Author

Zehvogel commented Nov 1, 2023

To quote gdb: Shared library is missing debugging information.

@Zehvogel
Contributor Author

Zehvogel commented Nov 2, 2023

Thankfully, registerDatamodelDefinition only takes two arguments (and one of them was a null pointer in this case), so it was somewhat "easy" to look at the disassembly and print the name of the missing collection:

(gdb) print *(char**)$rdx
$2 = 0x2a9e46c0 "ToolSvc.lcio2EDM4hep_CaloHitContributions"

Which is missing when there are no SimCalorimeterHits. Fixed by key4hep/k4MarlinWrapper#157.

I was additionally confused because only one of the three broken events has no SimCalorimeterHits in the sim file, but apparently for the other two they are deleted during reconstruction, possibly here. I am not sure how all calorimeter hits of a 200 GeV pion can be out of time, but that is not relevant for this issue.

@Zehvogel
Contributor Author

Zehvogel commented Nov 3, 2023

Urgh, I was quite sure that I had tested key4hep/k4MarlinWrapper#157 for all 3 files, but pi-_40deg_200GeV is still failing with the same error. I will report back with more details soon.

@Zehvogel
Contributor Author

Zehvogel commented Nov 3, 2023

OK, it's only the same stack trace, but this time it is caused by a missing CalohitMCTruthLink collection, which in turn is caused by ConformalTracking throwing a SkipEventException in the prior event (913).

Maybe PodioOutput should handle this a bit more gracefully...

@tmadlener
Contributor

Yeah, PodioOutput (or the underlying podio::ROOTFrameWriter) should definitely report before the crash.
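
As an illustration of what "report before the crash" could look like, here is a minimal Python sketch. It models the frame as a plain dict in which a missing collection yields `None`, mirroring the null pointer seen in the stack trace. The names `find_missing` and `write_frame` are hypothetical, and this is not the podio or k4FWCore API; an actual fix would live in the C++ code.

```python
from typing import Dict, Iterable, List, Optional


def find_missing(frame: Dict[str, Optional[list]], to_write: Iterable[str]) -> List[str]:
    """Return the names requested for writing that are absent (or None) in the frame."""
    return [name for name in to_write if frame.get(name) is None]


def write_frame(frame: Dict[str, Optional[list]], to_write: List[str]) -> Dict[str, int]:
    """Hypothetical writer that fails with a readable message instead of using None."""
    missing = find_missing(frame, to_write)
    if missing:
        raise RuntimeError("Cannot write frame, missing collections: " + ", ".join(missing))
    # Stand-in for the real write: report how many elements each collection holds.
    return {name: len(frame[name]) for name in to_write}


# Collection names taken from the thread; the contents are made up.
frame = {"SiTracks": ["track0"], "PandoraClusters": []}
to_write = ["SiTracks", "PandoraClusters", "CalohitMCTruthLink"]

try:
    write_frame(frame, to_write)
except RuntimeError as err:
    message = str(err)
    print(message)
```

An empty collection passes the check while an absent one does not, which is the distinction that mattered in this issue.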
