Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge out lpc #91

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Merge out lpc #91

wants to merge 3 commits into from

Conversation

bpfrancis
Copy link
Contributor

This script is not very kind yet, and checks for very little problems. Simply for all datasets specified, it makes a tarball of all the hist_*.root files in the job directory, ships them off to condor, and merges them a few at a time. Then a second running of this script with -2 or --step2 will merge those mini-merged files into a single file.

It will not check that your input files are good, it will not check that they are all there. It will not yet tell you clearly if it encountered any problems. It won't care if you're close to hitting your quota on /uscms_data/ and this makes big tarball copies of all this histogram files. There's plenty of work remaining to make this a good script, but for now it accomplishes its task.

I've checked the output from this script and confirmed it's identical to the usual mergeOut.py results.

Workflow:

  1. mergeOut.py -l localConfig.py -w jobDirectory --skipMerging -- this still must be done because you need to check the skim outputs and job statuses, and make the resubmission scripts if need be. If this step finds a failed job, proceeding will merge only the partial results and the user needs to remember this.
  2. mergeOutLPC.py -l localConfig.py -w jobDirectory -- wait for all of these condor jobs to complete. This step may take ~2-3 minutes to create the hist_*.root tarballs.
  3. mergeOutLPC.py --step2 -l localConfig.py -w jobDirectory -- this will also take a few minutes.

Result of steps 1-3: a new file finished_<dataset>.root. For now this change of file name from <dataset>.root is intentional in case users have used the regular mergeOut.py and want to check this script for problems. You will also have all the intermediate steps miniMerge_<dataset>_<job index>.root and hist tarballs -- for now these are not deleted automatically while we make this script better at detecting errors.

Future improvements:

  1. Combine steps 1 and 2 so that this script also checks the job outputs before proceeding.
  2. Make step 3 multi-threaded over the different datasets, or also done within condor for each dataset.
  3. When successful, delete miniMerge files and other junk.
  4. Check the modification time of the histogram tarballs to figure out if they need to be remade. Right now it just skips remaking it if the file exists, which could lead to mistakes.
  5. Get rid of the Sumw2 errors everywhere.
  6. ...

@bpfrancis
Copy link
Contributor Author

Important to mention, I was able to merge 3 datasets with this in about 20-25 minutes compared to >24 hours with mergeOut.py.

Accidentally didn’t commit this
@aehart
Copy link
Contributor

aehart commented Jan 15, 2019

A few questions/comments before I look over this more carefully:

  • Just to be clear, are you saying that if there are failed jobs the output will differ from the usual calling of mergeOut.py?
  • Regarding the big tarballs: couldn't you remove the histogram files as you add them to the tarball? As long as you don't remove the tarball, all the files are still available. And needing to look at the histograms in a single job's output after merging is a rare enough use case, I think, to justify leaving them packed in the tarball.
  • It seems like this method should be faster, not just at the LPC, but at other clusters too. Is there anything specific to the LPC about it? If not, I would suggest renaming the script.

@bpfrancis
Copy link
Contributor Author

bpfrancis commented Jan 15, 2019

@aehart:

  • This looks at the condor_*.log files to determine what exists, so it should see everything that could possibly come from the job. If hist_N.root doesn't exist, the user gets a message about it and then it's completely ignored -- the same behavior as mergeOut.py. If hist_N.root exists but has some problem, is a zombie or something, this script makes no checks and would try to merge it. In this case the merging would quit early and the miniMerge file would actually be empty of all histograms:
    https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/UtilAlgos/bin/mergeTFileServiceHistograms.cpp#L121
    The folders would be created because ROOT auto-saves that sort of thing in an open TFile, but there's no call to Write() until the merging is done -- you'd see a miniMerge file with empty folders.
  • That's a good idea, I didn't consider leaving them there as the main storage. One dataset of mine goes from 7.3 GB across hist_*.root to 747 MB compressed. The script could add all hist_*.root to the tarball and remove them, so if a failed job comes in the script would add the new file in as well.
  • I haven't tried anywhere but LPC, but there's nothing really specific. I'll test it out.

@jalimena
Copy link
Contributor

This is all sounding good so far. I have no additional comments at the moment. Happy to take a look at the next commit.

@bpfrancis bpfrancis mentioned this pull request Jan 27, 2019
@bpfrancis
Copy link
Contributor Author

This PR needs to be made on top of #97, and so is a work in progress.

@jalimena
Copy link
Contributor

this PR is still a work in progress? not ready to merge yet?

@bpfrancis
Copy link
Contributor Author

@jalimena this is still in progress, it's been left alone for a while but is still important. I'll be returning to finish this off soon.

@jalimena
Copy link
Contributor

jalimena commented Jul 2, 2020

can we revisit this PR soon? I've occasionally killed LPC nodes by running the normal mergeOut.py interactively. (Can someone remind me why the mergeOut.py option to run on condor doesn't work on the LPC?)

In general, it might be nice to get some of these OSUT3Analysis issues completed before Run 3.

@aehart aehart removed their request for review July 2, 2020 14:11
@bpfrancis
Copy link
Contributor Author

We do need to revisit this, definitely. LPC merging is slow, and gets only worse as you grow larger hist_*.root outputs. I will look at this in the next few days and bring myself back up to speed, then comment again.

@jalimena
Copy link
Contributor

jalimena commented Jul 2, 2020

thanks a lot!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants