BigJob Tutorial Part 3: Simple Ensemble Example
This page is part of the BigJob Tutorial.
The example below submits N jobs using SAGA Pilot-Job. It demonstrates how to map a simple echo job onto a pilot using all of the parameters of a Compute Unit Description.
What types of workflows would this be useful for? Many jobs using the same executable.
The code below uses `fork://localhost` as the `service_url`. The service URL specifies which queueing system or middleware you want to use and where it runs. `localhost` can be changed to a machine-specific URL, for example `sge+ssh://lonestar.tacc.utexas.edu`. The following table explains the supported middleware on XSEDE and FutureGrid. Note: You WILL have to edit the examples for your personal middleware or queueing system.
Supported Adaptors | Description | Information |
---|---|---|
fork | Submit jobs only on localhost head node. Password-less login to localhost is required. Example usage: fork://localhost | |
SSH | Submit jobs on target machine's head node. Password-less login to target machine is required. Example usage: ssh://eric1.loni.org | Submits jobs to a remote host via SSH |
PBS | Submit jobs to target machine's scheduling system. Password-less login to target machine is required. Example usage: Remote (over SSH): pbs+ssh://eric1.loni.org or Local: pbs://localhost | Interfaces with a PBS, PBS Pro or TORQUE scheduler locally or remotely via SSH |
SGE | Submit jobs to target machine's scheduling system. Password-less login to target machine is required. Example usage: Remote (over SSH): sge+ssh://lonestar.tacc.utexas.edu or Local: sge://localhost | Interfaces with a Sun Grid Engine (SGE) scheduler locally or remotely via SSH |
GRAM | Submit jobs using Globus. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge | The Globus resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
Torque+GSISSH | Submit jobs using gsissh. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: xt5torque+gsissh://gsissh.kraken.nics.xsede.org | The GSISSH resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
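
To make the structure of these URLs concrete, the following sketch splits a service URL into its parts using only the Python 3 standard library. It is not part of BigJob: `describe_service_url` is a hypothetical helper, and the reading of a scheme such as `pbs+ssh` as "adaptor, then transport" is an assumption based on the table above.

```python
from urllib.parse import urlparse


def describe_service_url(service_url):
    """Split a service URL into adaptor, transport, host and port.

    Hypothetical helper for illustration only; BigJob does its own parsing.
    """
    parsed = urlparse(service_url)
    # Assumption: in a scheme like "pbs+ssh", the part before "+" names the
    # adaptor and the part after it names the transport used to reach it.
    adaptor, _, transport = parsed.scheme.partition("+")
    return {
        "adaptor": adaptor,
        "transport": transport or None,  # None when the scheme has no "+"
        "host": parsed.hostname,
        "port": parsed.port,             # None when the URL gives no port
    }


if __name__ == "__main__":
    for url in ("fork://localhost",
                "sge+ssh://lonestar.tacc.utexas.edu",
                "gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge"):
        print(url, "->", describe_service_url(url))
```

A check like this can help catch a mistyped scheme or host before the URL is handed to the pilot description.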
When using these scripts on XSEDE, the `allocation` parameter must be changed from `XSEDE12-SAGA` to your project's allocation number. This parameter may not be necessary if you are using your local cluster.
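
One way to handle the optional allocation is to build the pilot description in a small function that only adds `allocation` and `queue` when they are given. This is a sketch under assumptions: `make_pilot_description` is a hypothetical helper, `MY-ALLOCATION` is a placeholder, and whether your local cluster accepts a description without these keys depends on its configuration.

```python
import os


def make_pilot_description(service_url, allocation=None, queue=None,
                           number_of_processes=12, walltime=10):
    """Build a pilot description dict (hypothetical helper, not a BigJob API)."""
    description = {
        "service_url": service_url,
        "number_of_processes": number_of_processes,
        "working_directory": os.path.join(os.path.expanduser("~"), "agent"),
        "walltime": walltime,
    }
    # Only include the scheduler-specific keys when a value was supplied.
    if allocation is not None:
        description["allocation"] = allocation
    if queue is not None:
        description["queue"] = queue
    return description


if __name__ == "__main__":
    # Local run: no allocation or queue.
    print(make_pilot_description("fork://localhost"))
    # Remote run: "MY-ALLOCATION" is a placeholder for your project's number.
    print(make_pilot_description("sge+ssh://lonestar.tacc.utexas.edu",
                                 allocation="MY-ALLOCATION",
                                 queue="development"))
```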
In your $HOME directory, open a new file simple_ensembles.py with your favorite editor (e.g., vim) and paste the following content:
import os
import time
import sys
from pilot import PilotComputeService, ComputeDataService, State

### This is the number of jobs you want to run
NUMBER_JOBS = 4
# Redis server used for coordination
COORDINATION_URL = "redis://[email protected]:6379"

if __name__ == "__main__":

    # Start the pilot on the resource named by service_url
    pilot_compute_service = PilotComputeService(COORDINATION_URL)
    pilot_compute_description = { "service_url": "fork://localhost",
                                  "number_of_processes": 12,
                                  "allocation": "XSEDE12-SAGA",
                                  "queue": "development",
                                  "working_directory": os.getenv("HOME")+"/agent",
                                  "walltime": 10
                                }
    pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description)

    compute_data_service = ComputeDataService()
    compute_data_service.add_pilot_compute_service(pilot_compute_service)

    print ("Finished Pilot-Job setup. Submitting compute units")

    # submit compute units
    for i in range(NUMBER_JOBS):
        compute_unit_description = {
            "executable": "/bin/echo",
            "arguments": ["Hello","$ENV1","$ENV2"],
            "environment": ['ENV1=env_arg1','ENV2=env_arg2'],
            "number_of_processes": 4,
            "spmd_variation": "mpi",
            "output": "stdout.txt",
            "error": "stderr.txt",
        }
        compute_data_service.submit_compute_unit(compute_unit_description)

    print ("Waiting for compute units to complete")
    compute_data_service.wait()

    print ("Terminate Pilot Jobs")
    compute_data_service.cancel()
    pilot_compute_service.cancel()
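
In the script above every compute unit receives the same arguments. As an optional variation, an ensemble often runs the same executable with a different argument per job. The sketch below only builds the description dictionaries and does not call BigJob; `build_compute_unit_descriptions` is a hypothetical helper, and passing the job index as an argument is an illustrative choice rather than something the tutorial script does.

```python
def build_compute_unit_descriptions(number_jobs):
    """Return one compute unit description per job (hypothetical helper)."""
    descriptions = []
    for i in range(number_jobs):
        descriptions.append({
            "executable": "/bin/echo",
            # Same executable for every job; only the last argument changes.
            "arguments": ["Hello", "job", str(i)],
            "number_of_processes": 4,
            "spmd_variation": "mpi",
            "output": "stdout.txt",
            "error": "stderr.txt",
        })
    return descriptions


if __name__ == "__main__":
    for description in build_compute_unit_descriptions(4):
        print(description["executable"], " ".join(description["arguments"]))
```

Each dictionary could then be passed to `compute_data_service.submit_compute_unit(...)` inside the submission loop in place of the fixed description.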
Save the file and execute the script with the following command:
python simple_ensembles.py
If you run the script, what do you get? To see the results, go into the working directory ($HOME/agent in this case), then into the directory named after the pilot-service, and then into the compute unit directories associated with that pilot-service.
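
Rather than opening each compute unit directory by hand, the output files can be gathered with a short script. This is a sketch using only the Python standard library: `collect_outputs` is a hypothetical helper, and it makes no assumption about how the pilot and compute unit directories are named. It simply walks the working directory and reads every file called `stdout.txt`, the name given in the `output` field above.

```python
import os


def collect_outputs(working_directory, filename="stdout.txt"):
    """Map each matching file path under working_directory to its contents."""
    outputs = {}
    for dirpath, _dirnames, filenames in os.walk(working_directory):
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path) as handle:
                outputs[path] = handle.read()
    return outputs


if __name__ == "__main__":
    # Same working directory as in the pilot description: $HOME/agent
    agent_dir = os.path.join(os.path.expanduser("~"), "agent")
    for path, text in sorted(collect_outputs(agent_dir).items()):
        print(path)
        print(text)
```

Passing `filename="stderr.txt"` collects the error files in the same way, which can be useful when a compute unit produced no output.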