PicoProducer
This package provides tools to run post-processors on nanoAOD files. There are two modes:
- Skimming: Skim nanoAOD files by removing unneeded branches and bad data events (using data certification JSONs), and by adding things like JetMET corrections. The output still has the nanoAOD format. This step is optional, but it lets you run your analysis faster because the skimmed files are saved locally.
- Analysis: Analyze nanoAOD events by pre-selecting events and objects and constructing new variables.
The main analysis code is found in
python/analysis/
and instructions are in this wiki page. The output is a custom tree format we will refer to as pico.
A central script called pico.py
allows you to run both modes of nanoAOD processing,
either locally or on a batch system.
You can link several skimming or analysis modules to so-called channels.
You need to have CMSSW and NanoAODTools
installed,
see the Installation wiki page. Test the installation with
pico.py --help
If CMSSW is compiled correctly with scram b
, then the pico.py
script should have been
automatically copied from scripts/
to $CMSSW_BASE/bin/$SCRAM_ARCH
,
and should be available as a command via $PATH
.
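The same check can be scripted. The following is a minimal sketch that assumes Python 3 and uses only the standard library; `find_command` is a hypothetical helper name and not part of PicoProducer:

```python
import shutil

def find_command(name="pico.py"):
    """Return the full path of `name` if it is found via $PATH, else None."""
    return shutil.which(name)

path = find_command()
if path is None:
    # Not found: the script was probably not copied to $CMSSW_BASE/bin/$SCRAM_ARCH
    print("pico.py not found in $PATH; try recompiling with 'scram b'")
else:
    print("pico.py found at %s" % path)
```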
If you need to access DAS to get file lists of nanoAOD samples, make sure you have a GRID certificate installed, and set up a VOMS proxy with
voms-proxy-init -voms cms -valid 200:0
or use the script
source utils/setupVOMS.sh
Note: If you are on lxplus, you may need to globally define the location for your temporary VOMS proxy
by executing the following, and adding it to the shell startup script (e.g. .bashrc
for BASH):
export X509_USER_PROXY=~/.x509up_u`id -u`
Or whatever the equivalent is in other shells (setenv
, ...).
The user configuration is saved in config/config.json
. Check the contents with cat
, or use
pico.py list
You can manually edit the file, or set a variable with
pico.py set <variable> <value>
For example:
pico.py set batch HTCondor
pico.py set jobdir 'output/$ERA/$CHANNEL/$SAMPLE'
The configurable variables include:
- batch: Batch system to use (e.g. HTCondor).
- jobdir: Directory to output job configuration and log files (e.g. output/$ERA/$CHANNEL/$SAMPLE).
- outdir: Directory to copy the output pico files from analysis jobs.
- nanodir: Directory to store the output nanoAOD files from skimming jobs.
- picodir: Directory to store the hadd'ed pico file from analysis job output.
- nfilesperjob: Default number of files per job. This can be overridden per sample (see below).
- maxevtsperjob: Default limit on events processed per job. This overrides nfilesperjob and can be set per sample (see below).
- queue: Batch system queue ("job flavor" for HTCondor, "partition" for SLURM). For the default setting, look at the corresponding .sub or .sh batch configuration files in PicoProducer/python/batch.
- filelistdir: Directory to save the list of nanoAOD files to run on (e.g. samples/files/$ERA/$SAMPLE.txt).
- maxopenfiles: Maximum number of open files during the hadd step.
- ncores: Number of cores for counting events and validating files in parallel.
Defaults are given in config/config.json
.
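If you prefer to inspect the configuration from Python, something along these lines could work. This is only a sketch: it assumes that config/config.json is a plain JSON dictionary keyed by the variable names listed above, and `load_config` and `show_config` are hypothetical helpers, not functions of the framework.

```python
import json
import os

def load_config(path="config/config.json"):
    """Read the user configuration and return it as a dictionary."""
    with open(path) as file:
        return json.load(file)

def show_config(config, keys=("batch", "jobdir", "outdir", "nanodir", "picodir")):
    """Format selected settings, marking the ones that are not set."""
    lines = []
    for key in keys:
        lines.append("%-10s = %r" % (key, config.get(key, "<not set>")))
    return "\n".join(lines)

# Only print something if the file exists in the current directory
if os.path.isfile("config/config.json"):
    print(show_config(load_config()))
```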
Note that the directories can contain variables starting with $, like $ERA, $CHANNEL, $TAG, $SAMPLE, $GROUP and $DAS, to create a custom hierarchy and format.
The output directories nanodir
and picodir
can be special storage systems (e.g. on EOS, T2, T3, ...).
If they need special commands for accessing and writing, please see the instructions below.
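To illustrate how a directory pattern with variables such as $ERA or $SAMPLE turns into a concrete path, here is a small sketch based on Python's `string.Template`. It is not the framework's own implementation; `expand_path` is a hypothetical helper used only for illustration.

```python
from string import Template

def expand_path(template, **variables):
    """Replace $ERA, $CHANNEL, ... in a directory pattern.
    Variables that are not given are left untouched."""
    return Template(template).safe_substitute(**variables)

jobdir = expand_path("output/$ERA/$CHANNEL/$SAMPLE",
                     ERA="2016", CHANNEL="mutau", SAMPLE="DYJetsToLL_M-50")
print(jobdir)  # output/2016/mutau/DYJetsToLL_M-50
```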
Besides these variables, there are also the channels
and eras
dictionaries to link a channel short name to a skimming or analysis code, or an era (year) to a list of samples. This will be explained below.
The "skimming step" is optional. The input and output of skimming are both the nanoAOD format. You can use it for several things:
- Remove unneeded branches via keep 'n drop files.
- Remove bad data events using data certification JSONs.
- Add new branches, e.g. corrections and systematic variations like JetMET. See these modules, or these analysis examples.
- Pre-select events with a simple selection string, e.g. cut="HLT_IsoMu27 && Muon_pt>20 && Tau_pt>20".
- Save (reduced) nanoAOD files on a local storage system for faster file access, as connections to GRID files can be slow.
Skimming of nanoAOD files is done by post-processor scripts saved in python/processors/
.
An example is given by skimjob.py
.
You can link your own skimming script to a custom channel short name
pico.py channel skim skimjob.py
The skimming "channel name" can be whatever string you want,
but it should be unique, contain skim
to differentiate from analysis channels,
and you should avoid characters that are not safe for filenames, including :
and /
.
Extra options to the skimming script can be passed as well:
pico.py channel skimjec 'skimjob.py --jec-sys'
pico.py channel skimmutau 'skimjob.py --jec-sys --preselect "HLT_IsoMu27 && Muon_pt>20 && Tau_pt>20"'
This framework allows you to implement many analysis modules, called "channels"
(e.g. different final states like mutau or etau).
All analysis code should be saved in python/analysis/
or any subdirectory in that path.
A simple example of an analysis is given in ModuleMuTauSimple.py
,
a fuller example in ModuleMuTau.py
.
Detailed instructions can be found in the PicoProducer analysis wiki page.
The pico.py
script runs all analysis modules with the post-processor picojob.py
.
You can link any analysis module to a custom channel short name (e.g. mutau
):
pico.py channel mutau ModuleMuTauSimple
The channel short name can be whatever string you like (e.g. mt
, mymutau
, MuTau
, ...).
However, you should avoid characters that are not safe for filenames, including :
and /
,
and it should not contain skim
(reserved for skimming).
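The naming rules for skimming and analysis channels can be summarized in a short check. This is a sketch of the rules as stated in the text, not code from pico.py; `check_channel_name` is a hypothetical helper.

```python
def check_channel_name(name, skimming=False):
    """Return a list of problems with a channel short name (empty if it is fine)."""
    problems = []
    for char in ":/":
        if char in name:
            problems.append("contains %r, which is not safe for filenames" % char)
    if skimming and "skim" not in name:
        problems.append("skimming channel names should contain 'skim'")
    if not skimming and "skim" in name:
        problems.append("'skim' is reserved for skimming channels")
    return problems

print(check_channel_name("mutau"))                   # no problems
print(check_channel_name("skimjec", skimming=True))  # no problems
print(check_channel_name("mu/tau"))                  # unsafe character
```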
To include extra options to the module of a channel, do e.g.
pico.py channel mutau_TESUp ModuleMuTau 'tessys=Up'
pico.py channel mutau_TES1p03 ModuleMuTau 'tes=1.03'
The nanoAOD samples you would like to process should be specified in a list inside a Python file, which is stored in samples/
.
Each era (year) should be linked to a particular sample list by doing for example
pico.py era 2016 sample_2016.py
Such a python file must include a simple python list called samples
, which contains Sample
objects. For example,
samples = [
  Sample('DY','DYJetsToLL_M-50',
    "/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/RunIISummer16NanoAODv6-PUMoriond17_Nano25Oct2019_102X_mcRun2_asymptotic_v7_ext1-v1/NANOAODSIM",
    dtype='mc',store=None,url="root://cms-xrd-global.cern.ch/",opts='zpt=True',
  )
]
To distinguish real from simulated data samples, you either set the keyword dtype
,
or use the MC
and Data
subclasses instead.
The Sample
class takes at least three arguments:
- The first string is a user-chosen name to group samples together (e.g. 'DY', 'TT', 'VV', 'Data').
- The second is a custom, unique short name for the sample (e.g. 'DYJetsToLL_M-50', 'SingleMuon_Run2016C').
- The third and optionally additional arguments are the full DAS paths of the sample. Multiple DAS paths for the same sample can be used to combine extensions.
Other optional keyword arguments are
- dtype: Data type like 'mc', 'data' or 'embed'. As a shortcut you can use the subclasses MC and Data.
- store: Path where all nanoAOD files are stored (instead of being given by the DAS tool). Note that this path is used for both skimming and analysis jobs. This is useful if you have produced or skimmed your nanoAOD samples, and they are not available via DAS. The path may contain variables like $DAS for the full DAS dataset path, $GROUP for the group, and $SAMPLE for the sample short name.
- url: Redirector URL for the XRootD protocol, e.g. root://cms-xrd-global.cern.ch for DAS.
- files: Either a list of nanoAOD files, or the path of a text file with a list of nanoAOD files. This can speed things up if DAS is slow or unreliable, or if you want to avoid retrieving the files from a local storage element on the fly each time. Note that this list is used for both skimming and analysis jobs.
- nevents: The total number of nanoAOD events, which you can optionally compare to the number of processed events (with the --das flag). By default, it will be obtained from DAS, but it can be set by the user to speed things up, or in case the sample is not available on DAS.
- nfilesperjob: Number of files per job. If the sample is split into many small files, you can choose a larger nfilesperjob to reduce the number of short jobs. This overrides the default nfilesperjob in the configuration.
- maxevts: Maximum number of events per job. This will split large files into several jobs, to reduce the number of large jobs, and allow for short resubmission in case of failure. Small files will still be combined in one job as long as the sum of their events is below this maximum. This overrides the default maxevtsperjob in the configuration and any nfilesperjob settings. A good choice is between 50000 and 500000, depending on the queuing of the batch system, and how many large samples you want to run. Too many jobs may cause a large number of log files taking up space.
- blacklist: A list of files that you do not want to run on. This is useful if some files are corrupted.
- opts: Extra key-worded options (key=value) to be passed to the analysis modules. Can be a comma-separated string ('opt1=val1,opt2=val2') or a list of strings (['opt1=val1','opt2=val2']).
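Both accepted forms of opts carry the same information. The sketch below shows one way to normalize them into a dictionary; it is an illustration under the assumption that values are kept as strings, and `parse_opts` is a hypothetical helper rather than the parser used by the framework.

```python
def parse_opts(opts):
    """Convert 'opt1=val1,opt2=val2' or ['opt1=val1','opt2=val2']
    into a dictionary of strings."""
    if isinstance(opts, str):
        opts = opts.split(",")
    parsed = {}
    for item in opts or []:
        item = item.strip()
        if not item:
            continue  # ignore empty entries, e.g. from a trailing comma
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        parsed[key.strip()] = value.strip()
    return parsed

print(parse_opts("zpt=True,useT1=True"))
print(parse_opts(["zpt=True", "useT1=True"]))
```

Both calls give the same dictionary; converting 'True' or '1.03' into Python booleans or floats is left to the code that consumes the options.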
To get a file list for a particular sample in the sample list, you can use the get files
subcommand.
If you include --write
, the list will be written to a text file as defined by filelistdir
in the configuration:
pico.py get files -y 2016 -s DYJets --write --url
Pass the full path of this text file to the sample via files
.
It may contain variables, e.g. samples/files/$ERA/$SAMPLE.txt
.
If you like to split jobs based on events (maxevtsperjob
) instead of files, do
pico.py write -y 2016 -s DYJets --nevts
which will save the number of events per file as well. In this way the submission script does not have to open each file and get the number of nanoAOD events on the fly, which can take a long time. Sometimes GRID files are not available, and several retries are needed.
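Once the number of events per file is known, event-based splitting can be done without opening any file. The sketch below illustrates the behavior described for maxevts (large files are split over several jobs, small files are combined while their sum stays below the maximum). It is a simplified illustration, not the algorithm implemented in pico.py, and `split_jobs` is a hypothetical helper.

```python
def split_jobs(files, maxevts):
    """Group files into jobs of at most `maxevts` events.
    files: list of (filename, nevents) tuples.
    Returns a list of jobs; each job is a list of
    (filename, first_event, nevents) tuples."""
    jobs = []
    current, current_sum = [], 0
    for fname, nevts in files:
        if nevts > maxevts:
            # Large file: one job per slice of at most maxevts events
            first = 0
            while first < nevts:
                n = min(maxevts, nevts - first)
                jobs.append([(fname, first, n)])
                first += n
        else:
            # Small file: start a new job if adding it would exceed the limit
            if current and current_sum + nevts > maxevts:
                jobs.append(current)
                current, current_sum = [], 0
            current.append((fname, 0, nevts))
            current_sum += nevts
    if current:
        jobs.append(current)
    return jobs

files = [("a.root", 120000), ("b.root", 30000), ("c.root", 40000), ("d.root", 50000)]
for job in split_jobs(files, maxevts=100000):
    print(job)
```

With these made-up numbers, a.root is split into two jobs, b.root and c.root share one job, and d.root gets its own job.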
Note that a priori skimming and analysis channels use the same sample lists (and therefore the same nanoAOD files)
for the same era as specified in the configuration.
While skimming is an optional step, typically you first want to skim nanoAOD from existing files on the GRID (given by DAS)
and store them locally for faster and more reliable access.
To run on skimmed nanoAOD files, you need to change store
for each skimmed sample to point to the storage location.
If you have a text file with the file list, you also need to remember to remove or update this list.
A local run can be done as
pico.py run -y <era> -c <channel>
For example, to run the mutau
channel on a 2016
sample, do
pico.py run -y 2016 -c mutau
Use -m 1000
to limit the number of processed events.
By default, the output will be saved in a new directory called output/
.
Because mutau
is an analysis module, the output will be a root file that contains a tree called 'tree'
with a custom format defined in ModuleMuTauSimple.py
.
If you run a skimming channel, which must have skim
in the channel name, the output will be a nanoAOD file.
Automatically, the first file of the first sample in the era's list will be run, but you can
specify a sample that is available in the sample list linked to the era,
by passing a pattern to the -s
flag:
pico.py run -y 2016 -c mutau -s SingleMuon
pico.py run -y 2016 -c mutau -s 'DYJets*M-50'
pico.py run -y 2016 -c mutau -s 'DY[23]Jets*M-50'
Glob patterns like *
, ?
or […]
wildcards are allowed.
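To preview which sample names a pattern would select, you can experiment with Python's `fnmatch` module. This sketch assumes that a pattern may match any part of the sample name (so that SingleMuon selects SingleMuon_Run2016C); whether pico.py matches in exactly this way is an assumption, and `match_samples` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def match_samples(names, pattern):
    """Return the sample names that contain a match of the glob pattern."""
    return [name for name in names if fnmatchcase(name, "*%s*" % pattern)]

names = ["DYJetsToLL_M-50", "DY2JetsToLL_M-50", "DY3JetsToLL_M-50", "SingleMuon_Run2016C"]
print(match_samples(names, "SingleMuon"))
print(match_samples(names, "DYJets*M-50"))
print(match_samples(names, "DY[23]Jets*M-50"))
```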
Some modules allow extra options via keyword arguments. You can specify these using the --opts
flag:
pico.py run -y 2016 -c mutau -s 'DYJets*M-50' --opts tes=1.1
For all options, see
pico.py run --help
Once the module runs locally, everything is configured, and your batch system is set up, you can submit with
pico.py submit -y 2016 -c mutau
This will create the necessary output directories for job configuration (jobdir
)
and output (nanodir
for skimming, outdir
for analysis).
A JSON file is created to keep track of the job input and output.
Again, you can specify a sample by passing a glob pattern to -s
, or exclude patterns with -x
.
To give the output files a specific tag, use -t
.
If there are many small files, they can be combined with --filesperjob
,
or if there are a lot of large files, --maxevts
can be used to limit the number of events processed per job.
These parameters can also be set globally in the configuration, or for each sample individually in the sample list.
You can force the use of GRID files from DAS with --dasfiles
as opposed to nanoAOD files stored on a local storage system.
This is useful for the skimming step:
pico.py submit -y 2016 -c mutau --dasfiles
For all options with submission, do
pico.py submit --help
Check the job status with
pico.py status -y 2016 -c mutau
This will check which jobs are still running, and if the output files exist and are not corrupted.
You can skip the validation step and only look for missing files with --skipevts
to speed up the status check.
For skimming jobs, the nanoAOD output files should appear in nanodir
, and they are checked for having an Events
tree.
For analysis jobs, the pico output files should appear in outdir
, and they are checked for having a tree called tree
,
and a histogram called cutflow
.
To compare how many events were processed compared to the total available events in DAS (or defined in Sample
), use the --das
flag:
pico.py status -y 2016 -c mutau --das
If your jobs fail (status FAIL
or MISS
), please see
Why do my jobs fail ? in the FAQ below.
If jobs failed, you can resubmit with
pico.py resubmit -y 2016 -c mutau
This will resubmit files that are missing or corrupted, unless they are associated with a pending or running job.
In case the jobs take too long, you can specify a smaller number of files per job with --filesperjob
on the fly,
or use --split
to split the previous number.
Otherwise you can limit the number of events per job with --maxevts
if it was not already set in the first submission.
Use --skipevts
to speed up the resubmission by checking for missing files without opening,
and --checkqueue 1
to only check the batch system for pending or running jobs once.
ROOT files from analysis output can be hadd
'ed into one large pico file:
pico.py hadd -y 2016 -c mutau
The output file will be stored in picodir
.
This will not work for channels with skim
in the name,
as it is preferred to keep skimmed nanoAOD files split for further batch submission.
After a while, many small job configuration and log files can accumulate.
Remove leftover job output files and directories with the clean
subcommand
(or use -r
with hadd
):
pico.py clean -y 2016 -c mutau
If you trust the job output and hadd
step, you can do the clean step during hadd
:
pico.py hadd -y 2016 -c mutau --clean
Systematic variations can be run by passing extra keyword options via -E
, e.g.:
pico.py run -y 2016 -c mutau -s DY TT -E 'tes=1.03' -t _TES1p03
pico.py run -y 2016 -c mutau -s DY TT -E 'tes=0.97' -t _TES0p97
The keyword argument, e.g. tes
for tau energy scale,
must already be defined in the analysis module linked to the channel, see e.g.
here &
here
for ModuleMuTau
.
An extra tag should be added with -t
to avoid overwriting the nominal analysis output.
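When running many variations, it can help to generate the tags and commands in a loop so that each variation gets a unique tag. The following sketch only builds command strings in the form shown above; `tes_tag` and `variation_commands` are hypothetical helpers, and the tag format simply mimics _TES1p03.

```python
def tes_tag(tes):
    """Build a filename-safe tag from a TES value, e.g. 1.03 -> '_TES1p03'."""
    return "_TES" + ("%.2f" % tes).replace(".", "p")

def variation_commands(era, channel, samples, tes_values):
    """Return one 'pico.py run' command line per TES variation."""
    commands = []
    for tes in tes_values:
        commands.append("pico.py run -y %s -c %s -s %s -E 'tes=%s' -t %s" % (
            era, channel, " ".join(samples), tes, tes_tag(tes)))
    return commands

for command in variation_commands("2016", "mutau", ["DY", "TT"], [1.03, 0.97]):
    print(command)
```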
As a shortcut, you can define a new channel with the same module, but a different setting:
pico.py channel mutau_TES1p03 'ModuleMuTau tes=1.03'
pico.py channel mutau_TES0p97 'ModuleMuTau tes=0.97'
pico.py run -y 2016 -c mutau_TES1p03 -s DY TT
pico.py run -y 2016 -c mutau_TES0p97 -s DY TT
After you defined your systematic channels,
you can edit and use the vary.sh
script to quickly run multiple variations, e.g.
cp utils/vary.sh ./
vary.sh run -c mutau -y UL2017 -T
vary.sh submit -c mutau -y UL2017 -T
etc.
This framework might not work for your computing system... yet. It was created with a modular design in mind, meaning that users can add their own "plug-in" modules to make the scripts work with their own batch system and storage system. If you like to contribute, please make sure the changes run as expected, and then push the changes to a fork to make a pull request.
To plug in your own batch system, make a subclass of BatchSystem
,
overriding the abstract methods (e.g. submit
).
Your subclass has to be saved in a separate Python module in python/batch/
,
and the module's filename should be the same as the class.
See for example HTCondor.py
.
If you need extra (shell) scripts, leave them in python/batch
as well.
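The general shape of such a plug-in is sketched below. Note that the `BatchSystem` class here is a simplified stand-in written for this example; the real base class in python/batch/ may define different abstract methods and arguments, so check it before copying anything. The class name `MySLURM` and the keyword names `name`, `log` and `array` are hypothetical. The method only builds the command string instead of executing it.

```python
from abc import ABC, abstractmethod

class BatchSystem(ABC):
    """Simplified stand-in for the framework's base class (assumption)."""
    def __init__(self, verb=0):
        self.verb = verb
        self.system = self.__class__.__name__  # e.g. 'MySLURM'

    @abstractmethod
    def submit(self, script, **kwargs):
        """Submit a job script; must be overridden by each batch system."""

class MySLURM(BatchSystem):
    """Hypothetical plug-in; would be saved as python/batch/MySLURM.py."""
    def submit(self, script, **kwargs):
        command = ["sbatch"]
        if "name" in kwargs:
            command.append("--job-name=%s" % kwargs["name"])
        if "log" in kwargs:
            command.append("--output=%s" % kwargs["log"])
        if "array" in kwargs:
            command.append("--array=1-%d" % kwargs["array"])
        command.append(script)
        return " ".join(command)  # dry run: return instead of executing

batch = MySLURM()
print(batch.submit("submit_SLURM.sh", name="mutau", log="log/%x.%A.%a", array=10))
```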
Then you need to add your submit
command to the main_submit
function in
python/pico/job.py
,
where you define the script and some extra keyword options via jkwargs
, for example:
def main_submit(args):
  ...
  elif batch.system=='SLURM':
    script  = "python/batch/submit_SLURM.sh %s"%(joblist)
    logfile = os.path.join(logdir,"%x.%A.%a") # $JOBNAME.o$JOBID.$TASKID
    jkwargs.update({'log': logfile })
  ...
Test your implementation with this test script:
test/testBatch.py <batchsystem> -v1
Similarly for a storage element, subclass StorageSystem
in python/storage/
.
Have a look at T3_PSI
as an example of a subclass.
Currently, the code automatically assigns a path to a storage system, so you also need to
edit getstorage
in python/storage/utils.py
, e.g.
def getstorage(path,verb=0,ensure=False):
  ...
  elif path.startswith('/pnfs/psi.ch/'):
    from TauFW.PicoProducer.storage.T3_PSI import T3_PSI
    storage = T3_PSI(path,ensure=ensure,verb=verb)
  ...
  return storage
You can test your implementation with
test/testStorage.py <path> -v3
where you pass an absolute path on your storage system.
If you want, you can also add the path of your storage element to getsedir
in the same file.
This help function automatically sets the default paths for new users, based on the host and user name.
Detailed instructions to create an analysis module are provided in the PicoProducer wiki page.
If you want to share your analysis module (e.g. for TauPOG measurements),
please make a new directory in python/analysis
,
where you save your analysis modules with a Module
subclass. For example, reusing the full examples:
cd python/analysis
mkdir MuTauFakeRate
cp TreeProducerMuTau.py MuTauFakeRate/TreeProducerMuTau.py
cp ModuleMuTau.py MuTauFakeRate/ModuleMuTau.py
Rename the module class to the filename (in this example ModuleMuTau
stays the same) and
edit the pre-selection and tree format in these modules to your liking.
Test run as
pico.py channel mutau python/analysis/MuTauFakeRate/ModuleMuTau.py
pico.py run -c mutau -y 2018
Here are some frequently asked questions and hints during troubleshooting.
- Is the skimming step required ?
- How do I make my own analysis module ?
- What should be the format of my "pico" analysis ntuples ?
- How do I plot my analysis output ?
- Why do I get an ImportError: No module named TauPOG.TauIDSFs.TauIDSFTool error message ?
- Why do I get a no branch named ... error message ?
- Why do I get a no branch named MET_pt_nom error message ?
- Why do my jobs fail ?
- Why do my jobs take so long ?
Is the skimming step required ?
No. It is optional, but recommended if you do not have the nanoAOD files stored locally. Skimming is meant to reduce the file size by removing unneeded branches and/or events, and to store the nanoAOD files locally for faster access.
You can also use the skimming step to add JEC corrections and systematic variations, or other neat stuff.
How do I make my own analysis module ?
Examples and instructions are provided in the README
in python/analysis
.
The simplest one is ModuleMuTauSimple.py
.
This creates a new flat tree as output.
Alternatively, you can do the analysis with nanoAOD as output, following
these examples.
What should be the format of my "pico" analysis ntuples ?
Whatever you like. It can be a flat tree, it can be just histograms; it can even be nanoAOD again. It really depends on how you like to do your analysis.
How do I plot my analysis output ?
See the instructions in the Plotter
package.
To interface with your analysis tuples, use the Plotter.sample.Sample
class.
For a full example, see these instructions.
Why do I get an ImportError: No module named TauPOG.TauIDSFs.TauIDSFTool error message ?
Please install TauIDSFs
as instructed here.
Why do I get a no branch named ... error message ?
Because the branch is not available in your nanoAOD input file.
Please check the code of your module in python/analysis
.
Documentation of the available branches can also be found
here,
although this page can be outdated for your version of nanoAOD.
The same information is saved in the ROOT file itself; open the input nanoAOD with ROOT
and look for the available branches with a TBrowser
or the TTree::Print
function, e.g.
Events->Print("Tau_*")
The exact definition of the default branches can be found in
CMSSW/PhysicsTools/NanoAOD
.
Why do I get a no branch named MET_pt_nom error message ?
By default, the full tau pair analysis modules assume that the JEC variations are available.
If you did not run JEC variations in the skimming step, please disable it via the -E
option,
pico.py run -c mutau -y 2018 -E jec=False
or fix it once and for all via
pico.py channel mutau 'ModuleMuTau jec=False'
pico.py run -c mutau -y 2018
Alternatively, you could edit the module file locally, setting the
hardcoded default
to False
.
At some point the jet/MET correction tools in nanoAOD-tools
were updated to include T1 smearing,
and now corrected MET branches are called MET_T1_nom
, etc.
If you have nanoAOD files with this new correction method, please use useT1=True
. Specify it during running:
pico.py run -c mutau -y 2018 -E useT1=True
or change it in the channel:
pico.py channel mutau 'ModuleMuTau useT1=True'
pico.py run -c mutau -y 2018
or set the
hardcoded default
to True
,
or add this option to all the new samples in the sample list:
M('DY','DYJetsToLL_M-50',
"/DYJetsToLL_M-50_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer19UL16NanoAODAPVv2-106X_mcRun2_asymptotic_preVFP_v9-v1/NANOAODSIM",
store=storage,url=url,files=filelist,opts=['zpt=True','useT1=True']),
Why do my jobs fail ?
First make sure it runs locally with pico.py run
.
If it runs locally and your jobs still fail,
you can find out the reason by looking into the job log files.
You can find them in jobdir
, which by default is set to output/$ERA/$CHANNEL/$SAMPLE
,
or via
pico.py status -c mutau -y 2018 --log
Make sure that your VOMS proxy is valid,
voms-proxy-init -voms cms -valid 200:0
and that your batch system can access the ROOT input files and the TauFW working directory. Double check the configuration of the output directories with
pico.py list
or during submission, set the verbosity high
pico.py submit -c mutau -y 2018 -v2
If the jobs were terminated because of time limitations,
you can edit the default batch submission files in python/batch/
,
or pass a time option to the submission command via --time
, e.g. for 10 hours:
pico.py resubmit -c mutau -y 2018 --time 10:00:00
Other options specific to your batch system can be added via -B
.
If you are on CERN's lxplus
, you may need to globally define the location for your temporary VOMS proxy
by executing the following, and adding it to the shell startup script (e.g. .bashrc
for BASH):
export X509_USER_PROXY=~/.x509up_u`id -u`
Or whatever the equivalent is in other shells (setenv
, ...).
If you use HTCondor, double check the actual maximum run time for any job via the following command,
where jobid is the "clusterId" and taskid is the "procId":
condor_q -long <jobid>.<taskid>
Why do my jobs take so long ?
Make sure that nfilesperjob
in the configuration is small enough,
or split large files by limiting the number of events per job with --maxevts
.
Furthermore, open file connections to GRID can often be slow.
One immediate solution to this is to pass the "prefetch" option, --prefetch
,
which first copies the input file to the local working directory,
and removes it at the end of the job:
pico.py submit -c mutau -y 2018 --prefetch
If you repeatedly run on the same nanoAOD files that are stored on the GRID, consider doing a skimming step to reduce their file size and save them on a local storage system for faster connection.