BigJob Tutorial Part 3: Simple Ensemble Example
This page is part of the BigJob Tutorial.
The example below submits N jobs using a SAGA Pilot-Job. It demonstrates the mapping of a simple echo job using all of the parameters of a Compute Unit Description.
What types of workflows would this be useful for? Many jobs using the same executable.
One of the features of BigJob is application-level programmability: many of the parameters in each script are customizable and configurable by the user. For the purposes of this tutorial, we would like to draw your attention to a few important parameters that will prevent this script from running if they are not modified. For a fuller description of the configurable parameters, please see the API documentation.
The code below uses `fork://localhost` as the `service_url`. The service URL tells BigJob what type of queueing system or middleware you want to use and where it is located. `localhost` can be changed to a machine-specific URL, for example `sge://lonestar.tacc.utexas.edu`. The following table explains the middleware supported on XSEDE and FutureGrid, and an example of changing the service URL is shown after the table. Note: you WILL have to edit the examples for your own middleware or queueing system.
| Supported Adaptors | Description | Information |
|---|---|---|
| fork | Submits jobs only on the localhost head node. Password-less login to localhost is required. Example usage: `fork://localhost` | |
| SSH | Submits jobs on the target machine's head node. Password-less login to the target machine is required. Example usage: `ssh://eric1.loni.org` | Allows jobs to be submitted to a remote host via SSH |
| PBS | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: remote (over SSH): `pbs+ssh://eric1.loni.org` or local: `pbs://localhost` | Interfaces with a PBS, PBS Pro or TORQUE scheduler locally or remotely via SSH |
| SGE | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: remote (over SSH): `sge+ssh://lonestar.tacc.utexas.edu` or local: `sge://localhost` | Interfaces with a Sun Grid Engine (SGE) scheduler locally or remotely via SSH |
| GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: `gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge` | The Globus resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
| Torque+GSISSH | Submits jobs using GSISSH. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: `xt5torque+gsissh://gsissh.kraken.nics.xsede.org` | The GSISSH resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
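For example, to run the same ensemble on a remote SGE machine such as Lonestar rather than on your local machine, only the `service_url` in the pilot description needs to change. The sketch below reuses the values from the tutorial script; the allocation and queue shown are placeholders you must replace with your own:

```python
pilot_compute_description = {
    "service_url": "sge+ssh://lonestar.tacc.utexas.edu",   # was fork://localhost
    "number_of_processes": 12,                              # cores per node on the target machine
    "allocation": "XSEDE12-SAGA",                           # replace with your project's allocation
    "queue": "development",                                  # replace with a queue you may submit to
    "working_directory": os.getenv("HOME") + "/agent",
    "walltime": 10
}
```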
When using these scripts on XSEDE, the `allocation` parameter must be changed from `XSEDE12-SAGA` to your project's allocation number. This parameter may not be necessary if you are using your local cluster.
The `number_of_processes` parameter refers to the number of cores used. If your machine does not have 12 cores per node, you will have to change this parameter; on a laptop, for example, `number_of_processes` might be 2 or 4.
The `queue` parameter refers to the name of the queue on the submission machine. It may not be necessary on your local laptop, but a machine such as Lonestar has several queues within SGE, so you must specify whether you wish to submit to the "development" queue or some other queue.
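Putting these three parameters together, a pilot description adapted for a local laptop might look like the following sketch. The `fork` adaptor needs no allocation or queue, and the core count is reduced; adjust the values for your own machine:

```python
pilot_compute_description = {
    "service_url": "fork://localhost",                 # run on the local machine
    "number_of_processes": 2,                           # e.g. 2 cores on a laptop
    "working_directory": os.getenv("HOME") + "/agent",
    "walltime": 10
    # "allocation" and "queue" are omitted: they are not needed for fork://localhost
}
```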
In your $HOME directory, open a new file simple_ensembles.py with your favorite editor (e.g., vim) and paste the following content:
```python
import os
import time
import sys
from pilot import PilotComputeService, ComputeDataService, State

### This is the number of jobs you want to run
NUMBER_JOBS = 4
COORDINATION_URL = "redis://[email protected]:6379"

if __name__ == "__main__":

    pilot_compute_service = PilotComputeService(COORDINATION_URL)

    pilot_compute_description = { "service_url": "fork://localhost",
                                  "number_of_processes": 12,
                                  "allocation": "XSEDE12-SAGA",
                                  "queue": "development",
                                  "working_directory": os.getenv("HOME") + "/agent",
                                  "walltime": 10
                                }

    pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description)

    compute_data_service = ComputeDataService()
    compute_data_service.add_pilot_compute_service(pilot_compute_service)

    print ("Finished Pilot-Job setup. Submitting compute units")

    # submit compute units
    for i in range(NUMBER_JOBS):
        compute_unit_description = {
                "executable": "/bin/echo",
                "arguments": ["Hello", "$ENV1", "$ENV2"],
                "environment": ['ENV1=env_arg1', 'ENV2=env_arg2'],
                "number_of_processes": 4,
                "spmd_variation": "mpi",
                "output": "stdout.txt",
                "error": "stderr.txt",
        }
        compute_data_service.submit_compute_unit(compute_unit_description)

    print ("Waiting for compute units to complete")
    compute_data_service.wait()

    print ("Terminate Pilot Jobs")
    compute_data_service.cancel()
    pilot_compute_service.cancel()
```
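If you would rather track each job individually instead of blocking on `compute_data_service.wait()`, the sketch below assumes that `submit_compute_unit` returns a compute unit object with a `get_state()` method and that `State.Done` marks a finished unit, as in the BigJob Pilot API; treat it as an illustrative variation, not part of the tutorial script:

```python
# Alternative to compute_data_service.wait(): poll each compute unit yourself.
compute_units = []
for i in range(NUMBER_JOBS):
    cu = compute_data_service.submit_compute_unit(compute_unit_description)
    compute_units.append(cu)

while True:
    states = [cu.get_state() for cu in compute_units]
    print("Compute unit states: %s" % states)
    if all(state == State.Done for state in states):
        break
    time.sleep(5)
```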
Execute the script using the command
python simple_ensembles.py
If you run the script, what do you get? You will have to go into the working directory (which is $HOME/agent in this case), then into the directory named after the pilot service, and finally into the compute unit directories associated with that pilot service; each compute unit directory holds the stdout.txt and stderr.txt of one job.
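The pilot-service and compute-unit directory names are generated at run time, so an easy way to locate the output is to search the agent directory for the stdout.txt files. A minimal sketch in Python, assuming the default $HOME/agent working directory from the script above:

```python
import os

# Walk the pilot working directory and print every stdout.txt produced by the ensemble.
agent_dir = os.path.join(os.getenv("HOME"), "agent")
for root, dirs, files in os.walk(agent_dir):
    if "stdout.txt" in files:
        path = os.path.join(root, "stdout.txt")
        print("=== %s ===" % path)
        with open(path) as f:
            print(f.read())
```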