
Memory problem in ReReco-Run2024C-JetMET1 pilot #46901

Open

makortel opened this issue Dec 9, 2024 · 22 comments

Comments
@makortel
Contributor

makortel commented Dec 9, 2024

A pilot job of the Run2024C ReReco was killed because of using too much memory
https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2024C_JetMET1_pilot_241122_102431_7689/50660/DataProcessing/
The job was run in CMSSW_14_0_19.

This issue is about studying the memory behavior of the job above.

This problem is reported also in https://its.cern.ch/jira/browse/CMSCOMPPR-56784 and https://gitlab.cern.ch/groups/cms-ppd/-/epics/12.
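
For context, the per-event memory numbers in the comments below come from the SimpleMemoryCheck service. A minimal sketch of enabling it in the cmsRun configuration (the parameter names are quoted from memory and should be treated as assumptions, not a verbatim copy of this job's config):

    # Minimal sketch (untested): per-event memory printouts via SimpleMemoryCheck.
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("RERECO")  # placeholder process name

    process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
        ignoreTotal = cms.untracked.int32(1),         # skip the first event when reporting growth
        oncePerEventMode = cms.untracked.bool(True))  # print VSIZE/RSS after every event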

@makortel
Contributor Author

makortel commented Dec 9, 2024

assign core

@cmsbuild
Contributor

cmsbuild commented Dec 9, 2024

New categories assigned: core

@Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign. Thanks

@cmsbuild
Contributor

cmsbuild commented Dec 9, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Dec 9, 2024

A new Issue was created by @makortel.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor Author

makortel commented Dec 9, 2024

Plot from the SimpleMemoryCheck printouts:
[image]
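
The plot was produced by parsing the per-event "MemoryCheck:" lines out of the job log. A minimal sketch of such a script (not necessarily the exact script used here), assuming lines of the form "MemoryCheck: event : VSIZE <v> <dv> RSS <r> <dr>"; adjust the regular expression if the printout format differs:

    #!/usr/bin/env python3
    # Sketch: extract per-event RSS from SimpleMemoryCheck printouts and plot it.
    import re
    import sys
    import matplotlib.pyplot as plt

    # Assumed printout format; not taken verbatim from this job's log.
    pattern = re.compile(r"MemoryCheck: event : VSIZE ([\d.]+) \S+ RSS ([\d.]+)")

    rss = []
    with open(sys.argv[1]) as log:
        for line in log:
            m = pattern.search(line)
            if m:
                rss.append(float(m.group(2)))  # RSS in MB

    plt.plot(range(len(rss)), rss)
    plt.xlabel("event (order of printout)")
    plt.ylabel("RSS [MB]")
    plt.savefig("rss.png")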

@Dr15Jones
Contributor

Here are measurements using the PeriodicAllocMonitor, as well as sampling of the RSS, during the same job

[image]
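
The AllocMonitor services live in PerfTools/AllocMonitor and require the allocator hooks to be preloaded into cmsRun. Enabling the monitor looks roughly like the following sketch (the library and parameter names are from memory, so treat them as assumptions):

    # Sketch (untested): enabling PeriodicAllocMonitor in the job configuration.
    # The allocator hooks must be preloaded when starting the job, e.g.
    #   LD_PRELOAD=libPerfToolsAllocMonitorPreload.so cmsRun config.py
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("RERECO")  # placeholder process name

    process.add_(cms.Service("PeriodicAllocMonitor",
        fileName = cms.untracked.string("allocMonitor.log")))  # assumed parameter name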

@makortel
Contributor Author

makortel commented Dec 9, 2024

Some overall properties of the job:

  • 2638 modules are constructed, of which 469 are destructed soon after because nothing in the job consumes their data products, leaving 2169 modules in total
  • 12 instances of PoolOutputModule + 1 instance of DQMRootOutputModule
    • AOD, MiniAOD, NanoEDMAOD and 9 skims

@davidlange6
Contributor

Perhaps it's wiser to use 8 cores like the Tier-0 does? That would gain quite a lot of memory headroom per core.

@makortel
Contributor Author

makortel commented Dec 9, 2024

Perhaps it's wiser to use 8 cores like the Tier-0 does?

The failed job was configured to use 8 threads.

@davidlange6
Contributor

Whoops, I was fooled by some of the plots. Sorry for the noise.

@makortel
Contributor Author

makortel commented Dec 9, 2024

The first observation with @Dr15Jones was a rediscovery of #46526 (comment), which was fixed in #46543 (14_2_X) / #46567 (14_1_X). The fix is being backported to 14_0_X as part of #46903.

@srimanob
Contributor

srimanob commented Dec 10, 2024

[image: Mem_Plot]
Please ignore the x-axis for the moment; I just plotted in the order in which the RSS info is printed out. The blue curve (data.txt) covers 15000 events in total, while the green one (data_fix) is still ongoing (~6700th event in the plot).

I just made a quick plot from the printout of SimpleMemoryCheck. This PR (#46903) should help reduce the memory, but not the leak.

@Dr15Jones
Contributor

So I applied the backport to my build area and re-ran the PeriodicAllocMonitor job. Unfortunately, after nearly 6000 events processed overnight, the machine I was running on was rebooted for scheduled maintenance, so the comparison isn't complete. (Note that I ran the job single-threaded in order to avoid exceeding the VSize limit on the machine.)

[image]

The backport does show a much smaller outstanding allocation size than the original code. There might still be some upward trend even after the backport, but it is hard to tell with the sample I was able to collect.

@srimanob
Contributor

So I applied the backport to my build area and re-ran the PeriodicAllocMonitor job. […]

Hi @Dr15Jones,

Thanks.

Which backport are you testing? The PR I made, or also the other PRs that @makortel proposed to backport?

@Dr15Jones
Contributor

Which backport are you testing? The PR I made, or also the other PRs that @makortel proposed to backport?

The one you (@srimanob) made.

@srimanob
Contributor

srimanob commented Dec 10, 2024

Thanks @makortel for the script to make the plot.

I tried running with the backport PR; it shows that the PR helps reduce the memory.

Before the fix:
[image: memory-standard]

With the PR to fix the memory:
[image: memory-fix-check]

@Dr15Jones
Contributor

I was able to uncover a ~1k/event memory leak here:

    regionalMuonShowerCollection =
        new RegionalMuonShowerBxCollection();  // To avoid warning re uninitialised collection

This was found using the prototype ModuleEventAllocMonitor.

@Dr15Jones
Contributor

#46918 fixes the problem in master.

@Dr15Jones
Contributor

@srimanob would you like #46918 backported to CMSSW_14_0?

@Dr15Jones
Contributor

So after applying the backport #46903 and the memory leak fix #46918 (the latter having a much smaller effect), I see that the allocations (using the AllocMonitor system to record new/delete calls) show much more stable behavior:

[image]

and comparing the final results for RSS and allocations gives:
[image]

Here I'm just processing the first file in the job, and I'm reading that file locally.
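
As an aside, an RSS curve like the one above can be sampled externally by polling /proc; a minimal sketch (Linux only, and not the exact tool used in these measurements):

    #!/usr/bin/env python3
    # Sketch: sample the RSS of a running process once per second from /proc.
    import sys
    import time

    pid = sys.argv[1]
    while True:
        try:
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        print(time.time(), line.split()[1], "kB", flush=True)
                        break
        except FileNotFoundError:
            break  # the process has exited
        time.sleep(1)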

@srimanob
Contributor

Hi @Dr15Jones,
Thanks very much. As we still need to converge on the release, I would propose to backport #46918.

@srimanob
Contributor

Here is another backport I made, #46942. It includes:

  1. #46591: [14.1.X] use ReadPrescalesFromFile=False in the GenericTriggerEventFlag of a bunch of DQM modules
  2. #46574: Rework Muon candidate selection in few tracker ALCARECO
  3. #45357: Add GEN-SIM information to resonant di-muon TkAl ALCARECO producers
