Application Execution and Examples
This guide provides a conceptual overview of BigJob and detailed information on how to use BigJob for developing workflows.
1. Do you need to execute a large number of compute tasks on a busy HPC cluster? BigJob helps you avoid the queue waiting time that each task would otherwise incur when submitted individually through the traditional scheduling system.
2. Are you designing workflows? BigJob decouples task submission from resource assignment.
Before starting development, please make sure that BigJob is installed and loads successfully. Successful execution of the '''import bigjob''' statement in a Python shell indicates that BigJob is installed and loaded correctly.
(python)-bash-3.2$ python
Python 2.7.1 (r271:86832, Jun 13 2011, 12:48:51)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bigjob
01/15/2012 10:05:23 AM - bigjob - DEBUG - Loading BigJob version: 0.4.23
01/15/2012 10:05:23 AM - bigjob - DEBUG - read configfile: /N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.4.23-py2.7.egg/bigjob/../bigjob.conf
01/15/2012 10:05:23 AM - bigjob - DEBUG - Using SAGA C++/Python.
Familiarity with the following terms will help you understand how BigJob works (a short sketch after this list shows how these concepts map onto the BigJob API).
1. Application - A program that specifies the HPC resources to be used to execute a set of tasks and defines the dependencies between those tasks.
2. Sub-Job - A single task, described by its executable, the environment variables required to run it, the number of processes required, the arguments to the executable, the SPMD variation (serial vs. MPI), an output file, and an error file.
3. BigJob-Manager - Stores the Sub-Job information and orchestrates the interactions between the manager and the BigJob-Agents.
4. BigJob-Agent - One agent is launched for each HPC resource specified. When the resource becomes available, the agent becomes active, pulls the stored Sub-Job information, and executes the Sub-Jobs on that resource.
5. Coordination system - A database used by the BigJob-Manager to store Sub-Job information and to orchestrate the BigJob-Agents. Active agents use it to pull the Sub-Job information they execute on the HPC resources.
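The following is a minimal sketch of how these concepts map onto the BigJob many-job API used throughout this guide. It only illustrates the moving parts; the import path (bigjob_dynamic.many_job) and the fork://localhost resource URL are assumptions that may differ for your BigJob version and resource, and each call is explained in the steps below.

from bigjob_dynamic.many_job import *   # import path may differ between BigJob versions
import os

# Coordination system: a database (Advert service or Redis) used by the
# BigJob-Manager and the BigJob-Agents to exchange Sub-Job information.
COORDINATION_URL = "redis://localhost:6379"

# Application: this script; it declares the HPC resources to be used.
resource_list = [{"resource_url": "fork://localhost",      # assumed local resource
                  "number_of_processes": "1",
                  "processes_per_node": "1",
                  "working_directory": os.getcwd() + "/agent",
                  "walltime": 10}]

# BigJob-Manager: stores Sub-Job information and launches one BigJob-Agent
# per resource in resource_list.
mjs = many_job_service(resource_list, COORDINATION_URL)

# Sub-Job: a single task; it is executed by an agent once the resource is active.
jd = description()
jd.executable = "/bin/date"
jd.number_of_processes = "1"
subjob = mjs.create_job(jd)
subjob.run()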
a. Identify the coordination system to be used. The SAGA Advert service or Redis (refer to FAQ 6) can be used as the coordination system. Specify a suitable COORDINATION_URL in the example scripts, as shown below.
Advert Service:
COORDINATION_URL = "advert://localhost/?dbtype=sqlite3" # uses sqlite3 database as coordination system. Works only on localhost.
COORDINATION_URL = "advert://SAGA:[email protected]:8080/?dbtype=postgresql" #uses PostGRESQL database on #advert.cct.lsu.edu at port 8080 as coordination system. SAGA & SAGA_client are user id and password for the database.
Redis:
COORDINATION_URL = "redis://localhost:6379" # uses redis database as coordination system.
COORDINATION_URL = "redis://cyder.cct.lsu.edu:2525" # uses redis database on cyder.cct.lsu.edu at port 2525 as coordination system.
b. Identify the HPC clusters to be used and specify their resource specifications: resource URL, number of nodes, processes per node, wall time, queue, allocation information, and working directory (where the BigJob agent executes). The resource_url depends on the type of adaptor suitable for that infrastructure. To scale to multiple HPC clusters, simply append additional resource specifications to the resource_list object (see the sketch after the table below). Please make sure password-less access is enabled when remote jobs are submitted (see https://github.com/saga-project/BigJob/wiki/Configuration-of-SSH-for-Password-less-Authentication).
example:
resource_list.append( { "resource_url" : "pbs-ssh://eric1.loni.org", "processes_per_node":"4", "number_of_processes" : "4", "allocation" : "TG-12321" , "queue" : "workq", "working_directory": (os.getcwd() + "/agent"), "walltime":10 } )
Please use a suitable resource URL based on the table below.
| Infrastructure | Supported Adaptors | Description | Information |
|---|---|---|---|
| LONI | GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (grid-proxy-init) before executing the BigJob application. Example usage of URL: gram:eric1.loni.org/jobmanager-pbs | Suggested |
| | fork | Submits jobs only on the localhost head node. Password-less login to localhost is required. Example usage: fork:localhost | |
| | ssh | Submits jobs on the target machine's head node. Password-less login to the target machine is required. Example usage: ssh:eric1.loni.org | |
| | pbs-ssh | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: pbs-ssh:eric1.loni.org | Doesn't work since ssh adaptors are not available on LONI |
| XSEDE | GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage of URL: gram:gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge | Suggested. Please find the Globus resource URLs of XSEDE machines at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
| | fork | Submits jobs only on the localhost head node. Password-less login to localhost is required. Example usage: fork:localhost | |
| | ssh | Submits jobs on the target machine's head node. Password-less login to the target machine is required. Example usage: ssh:eric1.loni.org | |
| | pbs-ssh | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: pbs-ssh:eric1.loni.org | Not suitable for HPC resources using the SGE scheduling system |
| FutureGrid | pbs-ssh | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: pbs-ssh:sierra.futuregrid.org | Suggested |
| | fork | Submits jobs only on the localhost head node. Password-less login to localhost is required. Example usage: fork:localhost | |
| | ssh | Submits jobs on the target machine's head node. Password-less login to the target machine is required. Example usage: ssh:sierra.futuregrid.org | |
| | PBSPro | Submits jobs to the local machine's scheduling system. Example usage: pbspro:localhost | |
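To illustrate the resource URLs from the table and the scaling note in step b, the sketch below appends two resource specifications to resource_list. The host names, allocation, queue, and walltime values are placeholders; replace them with values valid for your accounts.

import os

resource_list = []

# A LONI cluster reached through its PBS scheduler over SSH (pbs-ssh adaptor):
resource_list.append({"resource_url": "pbs-ssh://eric1.loni.org",
                      "processes_per_node": "4",
                      "number_of_processes": "8",
                      "allocation": "TG-12321",            # placeholder allocation
                      "queue": "workq",
                      "working_directory": os.getcwd() + "/agent",
                      "walltime": 10})

# Scaling out: a second cluster (FutureGrid) is added simply by appending
# another specification; the BigJob-Manager starts one agent per entry.
resource_list.append({"resource_url": "pbs-ssh://sierra.futuregrid.org",
                      "processes_per_node": "8",
                      "number_of_processes": "8",
                      "allocation": "TG-12321",            # placeholder allocation
                      "queue": "batch",                    # placeholder queue name
                      "working_directory": os.getcwd() + "/agent",
                      "walltime": 10})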
c. Start the BigJob agents on the HPC resources, passing the resource list and the coordination URL as parameters.
example:
mjs = many_job_service(resource_list, COORDINATION_URL)
d. Create Sub-Jobs with their specifications: the executable, the environment variables required to execute the Sub-Job, the arguments to the executable, the number of processes required, the SPMD variation (serial vs. MPI), the output file, and the error file.
example:
jd = description()
jd.executable = "/bin/cat"                       # Specify the executable name with an absolute path.
jd.number_of_processes = "1"                     # Specify the number of processes required for the Sub-Job.
jd.environment = ["k=123","HPATH=/home/usrk/"]   # Specify environment variables required by the Sub-Job.
jd.spmd_variation = "single"                     # Specify the SPMD variation (single or mpi).
jd.arguments = ["text.txt"]                      # Specify the arguments to the executable.
jd.working_directory = "/home/pmanth2"           # Specify the location where the Sub-Job executes.
jd.output = "stdout-" + str(i) + ".txt"          # Specify the Sub-Job output file name (i is the loop index when creating multiple Sub-Jobs).
jd.error = "stderr-" + str(i) + ".txt"           # Specify the Sub-Job error file name.
subjob = mjs.create_job(jd)                      # Create the Sub-Job with the given job description.
subjob.run()                                     # Submit the Sub-Job.
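Once all Sub-Jobs have been submitted, an application typically waits for them to finish and then shuts down the agents. The sketch below is modeled on the bundled examples; it assumes the Sub-Job objects returned by mjs.create_job() have been collected in a list named subjobs, and that get_state() reports terminal states as "Done" or "Failed".

import time

# Poll until every submitted Sub-Job has reached a terminal state.
while True:
    finished = 0
    for sj in subjobs:                      # subjobs: list of objects returned by mjs.create_job()
        state = str(sj.get_state())         # e.g. "New", "Running", "Done", "Failed"
        if state in ("Done", "Failed"):
            finished += 1
    if finished == len(subjobs):
        break
    time.sleep(5)                           # poll every few seconds

mjs.cancel()                                # shut down the BigJob agents and free the resources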
The following BigJob examples can be used to submit local/remote jobs and can serve as building blocks for developing applications. They can be downloaded from https://github.com/drelu/BigJob/tree/master/examples.
Example running a single BigJob and a single Sub-Job on localhost: https://raw.github.com/drelu/BigJob/master/examples/example_local_single.py
Example running a single BigJob and multiple Sub-Jobs on localhost: https://raw.github.com/drelu/BigJob/master/examples/example_local_multiple.py
Example running multiple BigJobs and executing Sub-Jobs on multiple/distributed resources: https://raw.github.com/drelu/BigJob/master/examples/example_manyjob_local.py
Run the BigJob example script:
python <example script>
Log & error files are directed to the working directory mentioned in the resource URL and Sub-Job specifications. A Guide for debugging can be found at: Debugging.