BigJob on Open Science Grid
Here are the required steps to run BigJob scripts on OSG via the gateway host provided by OSG.
Log in to the XSEDE-OSG gateway node via gsissh (details are explained here). Your standard XSEDE X.509 credentials should work:
gsissh osg-xsede.grid.iu.edu
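gsissh authenticates with your X.509 proxy certificate. If the login fails, check on the machine you are connecting from that a valid proxy exists and has not expired (grid-proxy-info is part of the standard Globus tools):
grid-proxy-info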
Bootstrap the BigJob software stack (/home/oweidner/software/ is readable for everyone in the xsede group):
[you@osg-xsede:~]$ source /home/oweidner/software/env.sh
 ______________________________________
/ SAGA/BigJob Environment Bootstrapped \
|                                      |
|  - Python version: 2.7.3             |
|  - SAGA version  : 1.6.1             |
\  - BigJob version : 0.4.89           /
 --------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
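As a quick sanity check, you can verify that the bootstrapped Python can import the stack. This assumes the saga-python bindings are importable as saga, as in the examples below:
[you@osg-xsede:~]$ python -c "import saga; import bigjob; print 'modules OK'"
modules OK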
Submit a simple test job using the SAGA command-line tools:
saga-job submit condor://localhost /bin/date
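The same submission can be done programmatically. A minimal sketch, assuming the standard saga-python job API from the bootstrapped environment:
import saga

# connect to the local Condor scheduler -- same URL as the CLI example
js = saga.job.Service("condor://localhost")

# describe the test job
jd = saga.job.Description()
jd.executable = "/bin/date"

# create and submit the job
job = js.create_job(jd)
job.run()
print "Job submitted: " + str(job.id)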
You can watch the queue status of your job using the condor_q tool:
-- Submitter: osg-xsede.grid.iu.edu : <129.79.53.21:39607> : osg-xsede.grid.iu.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
603979.0 oweidner 8/9 21:52 0+00:00:00 R 0 0.1 date
Once the job has finished, you should receive an email from Condor that hopefully looks something like this:
This is an automated email from the Condor system
on machine "osg-xsede.grid.iu.edu". Do not reply.
Condor job 603979.0
/bin/date
has exited normally with status 0
If everything has worked so far, you can now run a simple BigJob script. Cut and paste the following script into a .py file:
import os
import time
import sys

from bigjob import bigjob, subjob, description

COORDINATION_URL = "redis://gw68.quarry.iu.teragrid.org:2525"

def main():
    project = "TG-MCB123456"   # <-- Put your XSEDE allocation here
    queue = None               # Condor has no queue concept
    userproxy = None           # use the default proxy certificate
    walltime = 10
    processes_per_node = 1
    number_of_processes = 1
    workingdirectory = os.path.join(os.getcwd(), "agent")
    lrms_url = "condor://localhost"

    ######################################################################
    # Start the Pilot Job (BigJob)
    bj_filetransfers = ["/etc/motd > motd"]

    print "Start Pilot Job/BigJob at: " + lrms_url
    bj = bigjob(COORDINATION_URL)
    bj.start_pilot_job(lrms_url,
                       None,    # use the default BigJob agent
                       number_of_processes,
                       queue,
                       project,
                       workingdirectory,
                       userproxy,
                       walltime,
                       processes_per_node,
                       bj_filetransfers)

    print "Pilot Job/BigJob URL: " + bj.pilot_url + " State: " + str(bj.get_state())

    ######################################################################
    # Submit SubJob through BigJob
    jd = description()
    jd.executable = "/bin/cat"
    jd.number_of_processes = "1"
    jd.spmd_variation = "single"
    jd.arguments = ["motd"]
    jd.output = "stdout.txt"
    jd.error = "stderr.txt"

    sj = subjob()
    sj.submit_job(bj.pilot_url, jd)

    ######################################################################
    # Busy wait for completion
    while 1:
        state = str(sj.get_state())
        bj_state = bj.get_state()
        print "bj state: " + str(bj_state) + " state: " + state
        if state == "Failed" or state == "Done":
            break
        time.sleep(2)

    ######################################################################
    # Cleanup - stop BigJob
    bj.cancel()
    #time.sleep(30)

if __name__ == "__main__":
    main()
When you execute the file, you should see output similar to the following:
[you@osg-xsede:~]$ python example_condor_single.py
Start Pilot Job/BigJob at: condor://localhost
Pilot Job/BigJob URL: bigjob:bj-44f6be2a-e31a-11e1-a8fa-d4bed9aefe00:localhost State: Unknown
bj state: Unknown state: Unknown
bj state: Unknown state: Unknown
# possibly a lot of 'Unknown', depending on how busy the Condor pool is
Requirements:
- Access to a Condor pool, e.g. OSG:
  - An OSG account with access to the Renci gateway/portal machine (VO: Engage).
  - Generate a VOMS proxy on the Renci gateway machine (similar to a Globus proxy) with the required certificates:
    $ voms-proxy-init -voms Engage
  - Please refer to the VOMS/OSG/Engage documentation (the URLs provided during registration) for more details.
- A working SAGA and BigJob installation:
  - SAGA >= 1.6 (https://svn.cct.lsu.edu/repos/saga-projects/extenci/osg_howto/HOWTO)
  - SAGA Condor adaptor
  - SAGA Python bindings
  - Python 2.7.x
  - BigJob >= 0.4.40
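If you prefer a private installation over the bootstrapped stack, BigJob can be installed from PyPI into a virtualenv. A sketch, assuming a working Python 2.7 with virtualenv and pip on the gateway node:
virtualenv $HOME/.bigjob/python
source $HOME/.bigjob/python/bin/activate
pip install bigjob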
In general, SAGA C++ is not available on OSG Condor resources. Thus, it is recommended to utilize the Redis coordination backend.
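With Redis as the coordination backend, the COORDINATION_URL passed to the bigjob() constructor simply points at a Redis server. The first form below is the host used in the example above; the password-protected variant is a hypothetical illustration:
COORDINATION_URL = "redis://gw68.quarry.iu.teragrid.org:2525"
# with a password-protected Redis server (hypothetical host and credentials):
# COORDINATION_URL = "redis://mypassword@myhost.example.org:6379"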
BigJob supports Condor as a resource manager. To submit a pilot to the Condor vanilla universe, use the following URL string:
lrms_url = "condor://localhost"
An example can be found in example_condor_single.py, shown above. The same URL is used to submit a pilot to the Condor GlideinWMS.
BJ/Condor supports pilot-level file transfers:
bj_filetransfers = ["/path/to/test.txt > test.txt"]

bj.start_pilot_job(lrms_url,
                   None,
                   number_of_processes,
                   queue,
                   project,
                   workingdirectory,
                   userproxy,
                   walltime,
                   processes_per_node,
                   bj_filetransfers)
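Each entry follows the local_source > remote_name convention. Several files can be staged by listing one directive per file (the paths below are placeholders):
bj_filetransfers = ["/path/to/input.dat > input.dat",
                    "/path/to/params.cfg > params.cfg"]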
Sub-Job file transfers are not supported.
Working directory handling: jd.working_directory refers to a local directory where the output of the BJ agent will be stored (or moved to). The job itself is executed in Condor's default directory, i.e. the $_CONDOR_SCRATCH_DIR.
Currently, the transfer of the output files is not working properly. Staging the files takes some time after job termination. The output is currently zipped into a file called output.tar.gz, which is placed in the directory in which the BJ script is executed.
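Until output staging works reliably, the archive can simply be unpacked by hand in the submission directory:
tar xzf output.tar.gz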
Please make sure that your OSG account is correctly set up and that you can submit simple jobs. The OSG documentation has many examples of how to submit jobs to OSG via Condor-G.
- Condor-G relies on Globus for job submission. In order to use BigJob with Condor-G, a valid proxy certificate is required. On OSG this can be generated using the following command:
  voms-proxy-init -voms Engage
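  You can verify the generated proxy (remaining lifetime and VO attributes) with:
  voms-proxy-info -all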
- The SAGA Condor adaptor uses the following URL convention for Condor-G resources:
  condorg://brgw1.renci.org:2119/jobmanager-pbs
  After submission you can monitor the state of the pilot using condor_q -globus.
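  The Condor-G endpoint is then used exactly like the local condor:// URL when starting a pilot. A minimal sketch, reusing the parameter names from the example script above:
  lrms_url = "condorg://brgw1.renci.org:2119/jobmanager-pbs"
  bj = bigjob(COORDINATION_URL)
  bj.start_pilot_job(lrms_url,
                     None,    # use the default BigJob agent
                     number_of_processes,
                     queue,
                     project,
                     workingdirectory,
                     userproxy,
                     walltime,
                     processes_per_node,
                     None)    # no pilot-level file transfers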
- Create a subjob description:
  # Submit SubJob through BigJob
  jd = description()
  jd.executable = "/bin/hostname"
  jd.number_of_processes = "1"
  jd.spmd_variation = "single"
  jd.arguments = [""]
  jd.output = "stdout.txt"
  jd.error = "stderr.txt"

  sj = subjob()
  sj.submit_job(bj.pilot_url, jd)

  Please refer to the example example_condorg_single.py for details.
- How can I monitor my job? You can monitor your job using the following commands:
  condor_q -globus
  condor_q -better-analyze <jobid>
- Older BigJob version is used by the agent: in certain cases, the BigJob agent picks up an older, previously installed BJ version (please check the agent trace for this). This issue can be resolved by submitting a Condor job that deletes the older version (make sure to replace the GT2 endpoint in your script):
  Universe      = grid
  grid_resource = gt2 brgw1.renci.org:/jobmanager-pbs
  Executable    = /bin/rm
  Arguments     = -rf ~/.bigjob
  Output        = job_test.output
  Error         = job_test.error
  Log           = job_test.log
  Queue
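  Save the description to a file (e.g. cleanup.sub, a name chosen here for illustration) and submit it with the standard Condor tools:
  condor_submit cleanup.sub
  condor_q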