
BigJob on Open Science Grid


Via the XSEDE-OSG Gateway

Here are the required steps to run BigJob scripts on OSG via the gateway host provided by OSG.

1. Login

Log in to the XSEDE-OSG gateway node via gsissh (details are explained here). Your standard XSEDE X.509 credentials should work:

gsissh osg-xsede.grid.iu.edu

2. Setup Environment

Bootstrap the BigJob software stack (/home/oweidner/software/ is readable for everyone in the xsede group):

[you@osg-xsede:~]$ source /home/oweidner/software/env.sh 
 _______________________________________ 
/ SAGA/BigJob Environment Bootstrapped  \
|                                       |
|  - Python version: 2.7.3              |
|  - SAGA version : 1.6.1               |
\  - BigJob version : 0.4.89            /
 --------------------------------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

3. Test SAGA

Submit a simple test job using the SAGA command-line tools:

saga-job submit condor://localhost /bin/date

You can watch the queue status of your job using the condor_q tool:

-- Submitter: osg-xsede.grid.iu.edu : <129.79.53.21:39607> : osg-xsede.grid.iu.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
603979.0   oweidner        8/9  21:52   0+00:00:00 R  0   0.1  date              

Once the job has finished, you should receive an email from Condor that looks something like this:

This is an automated email from the Condor system
on machine "osg-xsede.grid.iu.edu".  Do not reply.

Condor job 603979.0
	/bin/date
has exited normally with status 0

4. Run BigJob

If everything has worked so far, you can now run a simple BigJob script. Cut and paste the following script into a .py file:

import os
import time
import sys

from bigjob import bigjob, subjob, description

COORDINATION_URL = "redis://gw68.quarry.iu.teragrid.org:2525"

def main():

    project             = "TG-MCB123456" # <-- Put your XSEDE allocation here
    queue               = None           # use the default Condor queue
    userproxy           = None           # use the default X.509 proxy
    walltime            = 10
    processes_per_node  = 1
    number_of_processes = 1
    workingdirectory    = os.path.join(os.getcwd(), "agent")

    lrms_url = "condor://localhost"

    ##########################################################################################

    bj_filetransfers = ["/etc/motd > motd"]
  
    print "Starting Pilot Job at: " + lrms_url
    bj = bigjob(COORDINATION_URL)
    bj.start_pilot_job( lrms_url,
                        None,
                        number_of_processes,
                        queue,
                        project,
                        workingdirectory,
                        userproxy,
                        walltime,
                        processes_per_node,
                        bj_filetransfers)
    
    print "Pilot Job URL: " + bj.pilot_url + " State: " + str(bj.get_state())

    ##########################################################################################
    # Submit SubJob through BigJob
    jd = description()

    jd.executable          = "/bin/cat"
    jd.number_of_processes = "1"
    jd.spmd_variation      = "single"
    jd.arguments           = ["motd"]
    jd.output              = "stdout.txt"
    jd.error               = "stderr.txt"    

    sj = subjob()
    sj.submit_job(bj.pilot_url, jd)
    
    #########################################
    # busy wait for completion
    while 1:
        state = str(sj.get_state())
        bj_state = bj.get_state()
        print "bj state: " + str(bj_state) + " state: " + state
        if(state=="Failed" or state=="Done"):
            break
        time.sleep(2)

    ##########################################################################################
    # Cleanup - stop BigJob
    bj.cancel()
    #time.sleep(30)

if __name__ == "__main__":
    main()

When you execute the file, you should see output similar to the following:

[you@osg-xsede:~]$ python example_condor_single.py 
Start Pilot Job/BigJob at: condor://localhost
Pilot Job/BigJob URL: bigjob:bj-44f6be2a-e31a-11e1-a8fa-d4bed9aefe00:localhost State: Unknown
bj state: Unknown state: Unknown
bj state: Unknown state: Unknown
# possibly a lot of 'Unknown', depending on how busy the Condor pool is

From Scratch (Developer Notes)

Requirements:

  • Access to a Condor pool, e.g. OSG:

    • An OSG account with access to the RENCI gateway/portal machine (VO: Engage).

    • Generate a VOMS proxy on the RENCI gateway machine (similar to a Globus proxy) with the required certificates:

      $ voms-proxy-init -voms Engage

  • For more details, please refer to the VOMS/OSG/Engage documentation linked in the URLs provided during registration.

  • A working SAGA and BigJob installation

  • Python 2.7.x

  • BigJob >= 0.4.40

SAGA C++ is generally not available on OSG Condor resources, so it is recommended to use the Redis coordination backend.
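
For reference, the coordination backend is selected via the COORDINATION_URL that is passed to the bigjob constructor. Below is a minimal sketch: the first URL is the Redis instance used in the example script above; the commented-out, password-protected form is an assumption about the common redis://:password@host:port notation and would need to point at your own Redis server.

from bigjob import bigjob

# Redis instance from the example script above
COORDINATION_URL = "redis://gw68.quarry.iu.teragrid.org:2525"

# Assumption: a password-protected Redis server can be addressed as
# redis://:password@hostname:port -- replace host and credentials with your own
# COORDINATION_URL = "redis://:mypassword@my-redis-host.example.org:6379"

bj = bigjob(COORDINATION_URL)
print "Using coordination backend: " + COORDINATION_URL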

Condor

BigJob supports Condor as a resource manager. To submit a pilot to the Condor vanilla universe, use the following URL string:

lrms_url = "condor://localhost"

An example can be found here. The same URL is used to submit a pilot to the Condor GlideInWMS.

File Staging

BJ/Condor supports pilot-level file transfers:

bj_filetransfers = ["/path/to/test.txt > test.txt"]
bj.start_pilot_job( lrms_url,
                    None,
                    number_of_processes,
                    queue,
                    project,
                    workingdirectory,
                    userproxy,
                    walltime,
                    processes_per_node,
                    bj_filetransfers)

Sub-Job file transfers are not supported.

Working Directory Handling

jd.working_directory refers to a local directory where the output of the BJ agent will be stored (or moved to). The job itself is executed in Condor's default directory, i.e. $_CONDOR_SCRATCH_DIR.

Limitations:

Currently, the transfer of output files does not work properly. The staging of files requires some time after job termination. The output is currently packed into a file called output.tar.gz, which is placed in the directory in which the BJ script is executed.
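
As a small workaround sketch (the archive name comes from the note above; the extraction target directory is an illustrative assumption), the archive can be unpacked with Python's tarfile module once staging has completed:

import os
import tarfile

# output.tar.gz is written to the directory from which the BJ script was run
archive = os.path.join(os.getcwd(), "output.tar.gz")

if os.path.exists(archive):
    tar = tarfile.open(archive, "r:gz")
    tar.extractall(path="bj_output")  # hypothetical target directory
    tar.close()
    print "Extracted " + archive + " to ./bj_output"
else:
    print "output.tar.gz not found - staging may still be in progress"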

Condor-G

Please make sure that your OSG account is correctly set up and that you can submit simple jobs. The OSG documentation has many examples of how to submit jobs to OSG via Condor-G.

  1. Condor-G relies on Globus for job submission. In order to use BigJob with Condor-G, a valid proxy certificate is required. On OSG this can be generated using the following command:

    voms-proxy-init -voms Engage

  2. The SAGA Condor adaptor uses the following URL convention for Condor-G resources (a pilot-start sketch using this URL follows after this list):

    condorg://brgw1.renci.org:2119/jobmanager-pbs

After submission, you can monitor the state of the pilot using condor_q -globus.

  3. Create a subjob description:

     #Submit SubJob through BigJob
     jd = description()
     jd.executable = "/bin/hostname"
     jd.number_of_processes = "1"
     jd.spmd_variation = "single"
     jd.arguments = [""]        
     jd.output = "stdout.txt"
     jd.error = "stderr.txt"
    
     sj = subjob()
     sj.submit_job(bj.pilot_url, jd)
    

Please refer to the example: example_condorg_single.py for details.
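
Putting the pieces together, below is a hedged sketch of starting a Condor-G pilot using the URL convention above. The call mirrors the start_pilot_job invocation from the Condor example earlier on this page; the gatekeeper endpoint is the one shown above, while the allocation, walltime, and coordination URL are placeholders taken from the earlier example and may need adjusting:

import os
from bigjob import bigjob

COORDINATION_URL = "redis://gw68.quarry.iu.teragrid.org:2525"

lrms_url            = "condorg://brgw1.renci.org:2119/jobmanager-pbs"
queue               = None
userproxy           = None            # use the VOMS proxy generated above
project             = "TG-MCB123456"  # <-- Put your allocation here
walltime            = 10
processes_per_node  = 1
number_of_processes = 1
workingdirectory    = os.path.join(os.getcwd(), "agent")

bj = bigjob(COORDINATION_URL)
bj.start_pilot_job( lrms_url,
                    None,
                    number_of_processes,
                    queue,
                    project,
                    workingdirectory,
                    userproxy,
                    walltime,
                    processes_per_node,
                    None)             # no pilot-level file transfers

print "Pilot Job URL: " + bj.pilot_url + " State: " + str(bj.get_state())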

Troubleshooting

  1. How can I monitor my job? You can monitor your job using the following commands:

     condor_q -globus
    
     condor_q -better-analyze <jobid>
    
  2. Older BigJob version is used by the agent: In certain cases, the BigJob agent picks up an older, previously installed BJ version (check the agent trace for this). This issue can be resolved by submitting a Condor job that deletes the older version (make sure to replace the GT2 endpoint in your script):

     Universe        = grid
     grid_resource = gt2 brgw1.renci.org:/jobmanager-pbs
     Executable      = /bin/rm
     Arguments       = -rf ~/.bigjob
     Output          = job_test.output
     Error           = job_test.error
     Log             = job_test.log
    