HTCondor configuration & singularity for SLC7/CentOS7 compatibility #66

Open
IzaakWN opened this issue Apr 19, 2024 · 2 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

Collaborator

IzaakWN commented Apr 19, 2024

Issue: Environment not set

Since March, HTCondor jobs on lxplus do not have the CMSSW environment set correctly, nor JOBID or TASKID as defined in submit_HTCondor.sub. This causes the following error and subsequent job failure:

Traceback (most recent call last):
  File "/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8_g-2/src/TauFW/PicoProducer/python/processors/picojob.py", line 8, in <module>
    from PhysicsTools.NanoAODTools.postprocessing.framework.postprocessor import PostProcessor
  File "/usr/lib64/python3.6/site-packages/ROOT/_facade.py", line 150, in _importhook
    return _orig_ihook(name, *args, **kwds)
ModuleNotFoundError: No module named 'PhysicsTools'

Our hacky workaround was to hardcode our individual CMSSW_BASE path in the executable submit_HTCondor.sh script and do cmsenv...

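The cause appears to be that newer HTCondor versions use a "new syntax" for the environment command (documented here), in which the whole value is enclosed in double quotes and the variables are separated by spaces instead of semicolons. So we simply have to change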

getenv                = true
environment           = JOBID=$(ClusterId);TASKID=$(ProcId)

to

getenv                = true
environment           = "JOBID=$(ClusterId) TASKID=$(ProcId)"

I'll make a PR with a patch asap.
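
To catch this kind of failure early, the job script could check that the expected variables actually arrived before doing anything else. The following is only a minimal sketch: `require_env` is a hypothetical helper that is not part of TauFW, and the list of variables to check is an assumption based on the names mentioned above.

```shell
# Hypothetical helper (not part of TauFW): return non-zero if any of the
# named environment variables is unset or empty.
# Only pass trusted variable names, since they are expanded with eval.
require_env() {
  _missing=0
  for _name in "$@"; do
    # Look up the value of the variable whose name is stored in $_name
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "ERROR: environment variable $_name is not set" >&2
      _missing=1
    fi
  done
  return "$_missing"
}
```

Near the top of a script such as submit_HTCondor.sh this could be used as `require_env JOBID TASKID || exit 1`, so that a broken submit file shows up as a clear error message rather than as a ModuleNotFoundError later on.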

Issue: SLC7/CC7/CentOS7 compatibility on lxplus

CERN's lxplus is phasing out CentOS7 by end of June 2024 (see this announcement and this page).

If we want to keep using CMSSW 11 or 12 on an SLC7 architecture, we have to use a singularity container, both on lxplus user nodes and in HTCondor jobs, see this page:

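CMSSW_BASE="/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8" # top of the release area, i.e. the directory containing src/
cmssw-el7 --env "CMSSW_BASE=$CMSSW_BASE" # set up singularity & pass environment variable
# the commands below are run in the shell inside the container
cd $CMSSW_BASE/src
cmsenv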

I'll add this in a future PR as well, and update the instructions in the documentation...

IzaakWN added the bug and enhancement labels on Apr 19, 2024
Collaborator Author

IzaakWN commented Apr 24, 2024

The environment issue in lxplus HTCondor should be solved with PR #67. It should also be possible now to ask jobs to be run in a singularity container.

One issue, however, is that if you work in a singularity, you lose the ability to submit jobs (condor_submit cannot be found anymore). We need to find a workaround for this... :( It would mean people have to exit the singularity to submit jobs.

However, if you have a CMSSW 11 or 12 setup built for SLC7/CC7 inside a cmssw-cc7 singularity on an lxplus EL9 node, C++ libraries like ROOT will stop working once you exit the singularity, and so we run into a new compatibility issue... Currently,

  • pico.py submit needs both ROOT and condor_submit, and
  • pico.py status needs both ROOT and condor_q...

This means that when using a singularity, we need to prepare jobs inside the singularity, and then exit it to submit them (e.g. as a simple shell script with all the condor_submit commands).
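
The "simple shell script" idea could look roughly like the sketch below. `write_submit_script` is a hypothetical helper, not something pico.py provides, and it assumes the submit files have already been prepared inside the singularity and that their paths contain no single quotes.

```shell
# Hypothetical helper: write an executable script that calls condor_submit
# once for each given submit file.
# Usage: write_submit_script OUTPUT_SCRIPT SUBMIT_FILE...
write_submit_script() {
  _out="$1"; shift
  {
    printf '%s\n' '#!/bin/bash'
    printf '%s\n' '# Run this outside the container, where condor_submit is available.'
    printf '%s\n' 'command -v condor_submit >/dev/null || { echo "condor_submit not found" >&2; exit 1; }'
    for _sub in "$@"; do
      # One condor_submit call per submit file
      printf "condor_submit '%s'\n" "$_sub"
    done
  } > "$_out"
  chmod +x "$_out"
}
```

Inside the singularity one would call, for example, `write_submit_script submit_all.sh jobs/*.sub` (the file names here are placeholders), then exit the singularity and run `./submit_all.sh` on the lxplus node.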

Collaborator Author

IzaakWN commented May 24, 2024

It now seems possible to submit HTCondor jobs inside (SLC7/CC7) singularities, with the following instructions:
https://gitlab.cern.ch/cms-cat/cmssw-lxplus/-/tree/master

#!/bin/bash
export APPTAINER_BINDPATH=/afs,/cvmfs,/cvmfs/grid.cern.ch/etc/grid-security:/etc/grid-security,/cvmfs/grid.cern.ch/etc/grid-security/vomses:/etc/vomses,/eos,/etc/pki/ca-trust,/etc/tnsnames.ora,/run/user,/tmp,/var/run/user,/etc/sysconfig,/etc:/orig/etc
# look up the name of the schedd currently assigned to this user
schedd=$(myschedd show -j | jq -r .currentschedd)

apptainer -s exec /cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus:latest/ sh -c "source /app/setupCondor.sh && export _condor_SCHEDD_HOST=$schedd && export _condor_SCHEDD_NAME=$schedd && export _condor_CREDD_HOST=$schedd && /bin/bash  "

and, to set the schedd by hand inside the container (replace bigbirdXY with the name of your schedd):

export _condor_SCHEDD_HOST=bigbirdXY.cern.ch
export _condor_SCHEDD_NAME=bigbirdXY.cern.ch
export _condor_CREDD_HOST=bigbirdXY.cern.ch

In HTCondor submit files (e.g. submit_HTCondor.sub):

MY.SingularityImage = "/cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus:latest/"

Also see general instructions for using containers with lxplus's HTCondor system: https://batchdocs.web.cern.ch/containers/index.html
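
Putting the pieces from this thread together, a submit file could combine the new environment syntax with the container image. This is only a sketch under assumptions: `write_sub_file` is a hypothetical helper, and the real submit_HTCondor.sub contains more settings (arguments, log files, etc.) that are not shown here.

```shell
# Hypothetical helper: write a minimal submit file that uses the quoted
# "new syntax" for the environment and requests the EL7 container image.
# The quoted heredoc delimiter keeps $(ClusterId) and $(ProcId) literal,
# so that HTCondor (and not the shell) expands them.
write_sub_file() {
  cat > "$1" <<'EOF'
executable            = submit_HTCondor.sh
getenv                = true
environment           = "JOBID=$(ClusterId) TASKID=$(ProcId)"
MY.SingularityImage   = "/cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus:latest/"
queue
EOF
}
```

For example, `write_sub_file test.sub` followed by `condor_submit test.sub` from a shell where condor_submit is available.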
