BigJob Tutorial Part 3: Simple Ensemble Example

melrom edited this page Sep 18, 2012 · 16 revisions

This page is part of the BigJob Tutorial.

Overview

The example below submits N jobs using SAGA Pilot-Job. It demonstrates how to map a simple echo job onto a Compute Unit Description, using all of the description's parameters.

What types of workflows would this be useful for? Many jobs using the same executable.
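
As a rough illustration of "many jobs using the same executable", the sketch below builds one description per job and varies only the arguments. The dictionary keys are a subset of those used in the sample script on this page; the helper name make_echo_descriptions and the choice of arguments are assumptions made for illustration, not part of the BigJob API.

```python
def make_echo_descriptions(number_jobs):
    """Build one compute unit description per job (hypothetical helper)."""
    descriptions = []
    for i in range(number_jobs):
        descriptions.append({
            "executable": "/bin/echo",              # same executable for every job
            "arguments": ["Hello", "job", str(i)],  # only the arguments differ
            "number_of_processes": 1,
            "output": "stdout.txt",
            "error": "stderr.txt",
        })
    return descriptions


if __name__ == "__main__":
    for description in make_echo_descriptions(4):
        print(description["executable"], description["arguments"])
```

Each dictionary could then be passed to compute_data_service.submit_compute_unit, as the sample script on this page does inside its loop.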

Important Configurable Parameters

One of the features of BigJob is application-level programmability: many of the parameters in each script are customizable and configurable. For the purposes of this tutorial, we would like to draw your attention to a few important parameters that may prevent this script from running if not modified. For a fuller understanding of the configurable parameters, please view the API documentation.

service_url

The code below uses fork://localhost as the service_url. The service URL specifies which queueing system or middleware you want to use and where it is located. localhost can be changed to a machine-specific URL, for example: sge://lonestar.tacc.utexas.edu. The following table explains the supported middleware on XSEDE and FutureGrid. Note: You WILL have to edit the examples for your personal middleware or queueing system.

| Supported Adaptors | Description | Information |
|---|---|---|
| fork | Submit jobs only on localhost head node. Password-less login to localhost is required. Example usage: fork://localhost | |
| SSH | Submit jobs on target machine's head node. Password-less login to target machine is required. Example usage: ssh://eric1.loni.org | Allows to submit jobs to a remote host via SSH |
| PBS | Submit jobs to target machine's scheduling system. Password-less login to target machine is required. Example usage: Remote (over SSH): pbs+ssh://eric1.loni.org or Local: pbs://localhost | Interfaces with a PBS, PBS Pro or TORQUE scheduler locally or remotely via SSH |
| SGE | Submit jobs to target machine's scheduling system. Password-less login to target machine is required. Example usage: Remote (over SSH): sge+ssh://lonestar.tacc.utexas.edu or Local: sge://localhost | Interfaces with a Sun Grid Engine (SGE) scheduler locally or remotely via SSH |
| GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate grid proxy (myproxy-logon) before executing the BigJob application. Example usage of URL: gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge | Please find the globus resource URLs of XSEDE machines at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
| Torque+GSISSH | Submit jobs using gsissh. Globus certificates are required. Initiate grid proxy (myproxy-logon) before executing the BigJob application. Example usage of URL: xt5torque+gsissh://gsissh.kraken.nics.xsede.org | Please find the GSISSH resource URLs of XSEDE machines at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
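
To make the structure of these URLs concrete, the following sketch splits a service URL into the adaptor, the optional transport after the "+", and the host. It relies only on Python 3's standard urllib.parse module; the helper name split_service_url is an assumption for illustration, and this is not necessarily how BigJob itself interprets the URL.

```python
from urllib.parse import urlparse


def split_service_url(service_url):
    """Return (adaptor, transport, host) for a service URL (illustrative helper)."""
    parsed = urlparse(service_url)
    # A scheme such as "pbs+ssh" names the adaptor and the remote transport;
    # a plain scheme such as "fork" has no transport part.
    adaptor, _, transport = parsed.scheme.partition("+")
    return adaptor, transport or None, parsed.hostname


if __name__ == "__main__":
    for url in ("fork://localhost",
                "pbs+ssh://eric1.loni.org",
                "sge+ssh://lonestar.tacc.utexas.edu"):
        print(url, "->", split_service_url(url))
```

A check like this can help catch a mistyped scheme before the pilot is submitted.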

allocation

When using these scripts on XSEDE, the allocation parameter must be changed from XSEDE12-SAGA (the placeholder used in the sample script) to your project's allocation number. This parameter may not be necessary if you are using your local cluster.

number_of_processes

This refers to the number of cores the pilot uses. The sample script requests 12; if your machine does not have 12 cores per node, you will have to change this parameter. For example, if you are using your laptop, number_of_processes might be 2 or 4.

queue

This refers to the name of the queue on the submission machine. This may not be necessary for your local laptop, but a machine such as Lonestar has different queues within SGE. You must specify if you wish to submit to the "development" queue or some other queue.
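
The parameters above can be gathered in one place. The sketch below assembles a pilot description for either a laptop or a cluster, using the same keys as the sample script on this page. The helper name make_pilot_description is hypothetical, and leaving out queue and allocation when they are not needed is an assumption based on the notes above, not confirmed BigJob behavior.

```python
import multiprocessing
import os


def make_pilot_description(service_url, number_of_processes=None,
                           queue=None, allocation=None, walltime=10):
    """Assemble a pilot description dictionary (hypothetical helper)."""
    description = {
        "service_url": service_url,
        # Default to the local core count, e.g. 2 or 4 on a laptop.
        "number_of_processes": number_of_processes or multiprocessing.cpu_count(),
        "working_directory": os.path.join(os.path.expanduser("~"), "agent"),
        "walltime": walltime,
    }
    # Assumption: queue and allocation may be omitted when they do not apply.
    if queue is not None:
        description["queue"] = queue
    if allocation is not None:
        description["allocation"] = allocation
    return description


if __name__ == "__main__":
    laptop = make_pilot_description("fork://localhost")
    cluster = make_pilot_description("sge+ssh://lonestar.tacc.utexas.edu",
                                     number_of_processes=12,
                                     queue="development",
                                     allocation="XSEDE12-SAGA")  # placeholder value
    print(laptop)
    print(cluster)
```

The resulting dictionary has the same shape as pilot_compute_description in the sample script, so it could be passed to create_pilot in the same way.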

Sample Script

In your $HOME directory, open a new file simple_ensembles.py with your favorite editor (e.g., vim) and paste the following content:

import os
import time
import sys
from pilot import PilotComputeService, ComputeDataService, State
	
# Number of jobs (compute units) you want to run
NUMBER_JOBS = 4
# URL of the Redis coordination server; change this to point at your own server
COORDINATION_URL = "redis://[email protected]:6379"

if __name__ == "__main__":

    pilot_compute_service = PilotComputeService(COORDINATION_URL)

    pilot_compute_description = { "service_url": "fork://localhost",
                                  "number_of_processes": 12,
                                  "allocation": "XSEDE12-SAGA",
                                  "queue": "development",                                      
                                  "working_directory": os.getenv("HOME")+"/agent",
                                  "walltime":10
                                }

    pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description)

    compute_data_service = ComputeDataService()
    compute_data_service.add_pilot_compute_service(pilot_compute_service)

    print ("Finished Pilot-Job setup. Submitting compute units")

    # submit compute units
    for i in range(NUMBER_JOBS):
        compute_unit_description = {
                "executable": "/bin/echo",
                "arguments": ["Hello","$ENV1","$ENV2"],
                "environment": ['ENV1=env_arg1','ENV2=env_arg2'],
                "number_of_processes": 4,            
                "spmd_variation":"mpi",
                "output": "stdout.txt",
                "error": "stderr.txt",
                }    
        compute_data_service.submit_compute_unit(compute_unit_description)

    print ("Waiting for compute units to complete")
    compute_data_service.wait()

    print ("Terminate Pilot Jobs")
    compute_data_service.cancel()    
    pilot_compute_service.cancel()

Execute the script with the following command:

python simple_ensembles.py

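If you run the script, what do you get? The terminal shows only the script's progress messages. To see the output of the jobs, you will have to go into the working directory (which is $HOME/agent in this case), then the directory named after the pilot-service, and then the compute unit directories associated with that pilot-service. The sample script directs each compute unit's output to stdout.txt and its errors to stderr.txt.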
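
Rather than browsing those directories by hand, you could gather the results with a short script. The sketch below walks the working directory and reads every stdout.txt it finds. It makes no assumption about how the pilot and compute unit directories are named; the helper name collect_outputs is hypothetical, and the default file name is taken from the "output" entry in the sample script.

```python
import os


def collect_outputs(working_directory, filename="stdout.txt"):
    """Map each directory that contains `filename` to that file's contents."""
    outputs = {}
    for directory, _subdirs, files in os.walk(working_directory):
        if filename in files:
            with open(os.path.join(directory, filename)) as handle:
                outputs[directory] = handle.read()
    return outputs


if __name__ == "__main__":
    agent_directory = os.path.join(os.path.expanduser("~"), "agent")
    for directory, text in sorted(collect_outputs(agent_directory).items()):
        print(directory)
        print(text)
```

If the working directory does not exist yet, the function simply returns an empty dictionary.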