BigJob Tutorial Part 3: Simple Ensemble Example

drelu edited this page Jan 31, 2013 · 16 revisions

This page is part of the BigJob Tutorial.

Overview

You might be wondering how to create your own BigJob script or how BigJob can be useful for your needs.

The first example, below, submits N jobs using BigJob. This is very useful if you are running many jobs using the same executable. Rather than submit each job individually to the queuing system and then wait for every job to become active and complete, you submit just one 'Big' job that reserves the number of cores needed to run all of your jobs. When this BigJob becomes active, your jobs are pulled by BigJob from the Redis server and executed.

The example below demonstrates the mapping of a simple job (the executable is /bin/echo) using all of the parameters of a Compute Unit Description.

Important Configurable Parameters

One of the features of BigJob is its application-level programmability: many of the parameters in each script are customizable and configurable. For the purposes of this tutorial, we would like to draw your attention to a few important parameters that may prevent this script from running if not modified. For a more thorough understanding of the configurable parameters, please see the API documentation.

service_url

The code below uses fork://localhost as the service_url. The service URL communicates what type of queueing system or middleware you want to use and where it is. localhost can be changed to a machine-specific URL, for example: sge://lonestar.tacc.utexas.edu. The following table explains the supported middleware on XSEDE and FutureGrid. Note: You WILL have to edit the examples for your personal middleware or queueing system.

| Supported Adaptor | Description | Example Usage | Information |
|---|---|---|---|
| fork | Submits jobs only on the localhost head node. Password-less login to localhost is required. | `fork://localhost` | |
| SSH | Submits jobs on the target machine's head node. Password-less login to the target machine is required. | `ssh://eric1.loni.org` | Allows submitting jobs to a remote host via SSH |
| PBS | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. | Remote (over SSH): `pbs+ssh://eric1.loni.org`; local: `pbs://localhost` | Interfaces with a PBS, PBS Pro, or TORQUE scheduler locally or remotely via SSH |
| SGE | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. | Remote (over SSH): `sge+ssh://lonestar.tacc.utexas.edu`; local: `sge://localhost` | Interfaces with a Sun Grid Engine (SGE) scheduler locally or remotely via SSH |
| GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (`myproxy-logon`) before executing the BigJob application. | `gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge` | The Globus resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
| Torque+GSISSH | Submits jobs using GSISSH. Globus certificates are required. Initiate a grid proxy (`myproxy-logon`) before executing the BigJob application. | `xt5torque+gsissh://gsissh.kraken.nics.xsede.org` | The GSISSH resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
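The adaptor choice maps directly onto the service_url string in the pilot description. The sketch below collects the example URLs from the table above; the hostnames are illustrative, not required endpoints, and you would substitute your own machine:

```python
# Example service_url values, one per adaptor from the table above.
# Hostnames are illustrative examples only.
SERVICE_URL_EXAMPLES = {
    "fork": "fork://localhost",
    "ssh": "ssh://eric1.loni.org",
    "pbs": "pbs+ssh://eric1.loni.org",            # local variant: pbs://localhost
    "sge": "sge+ssh://lonestar.tacc.utexas.edu",  # local variant: sge://localhost
}

# Pick the adaptor that matches your site and drop it into the description:
pilot_compute_description = {"service_url": SERVICE_URL_EXAMPLES["fork"]}
print(pilot_compute_description["service_url"])
```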

allocation

When using these scripts on XSEDE, the allocation parameter must be changed from XSEDE12-SAGA to your project's allocation number. This parameter may not be necessary if you are using your local cluster.

number_of_processes

This refers to the number of cores used. If your machine does not have 12 cores per node, you will have to change this parameter. For example, if you are using your laptop, number_of_processes might be 2 or 4.

queue

This refers to the name of the queue on the submission machine. This may not be necessary on your local laptop, but a machine such as Lonestar has several queues within SGE, so you must specify whether you wish to submit to the "development" queue or another queue.
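Putting these three site-specific parameters together, here is a hedged sketch of two pilot descriptions: one for a laptop run and one for a hypothetical Lonestar run. The allocation string is a placeholder you must replace with your own:

```python
# Laptop: fork adaptor, no scheduler, so allocation and queue are omitted.
laptop_description = {
    "service_url": "fork://localhost",
    "number_of_processes": 2,   # match the number of cores you actually have
}

# Lonestar (XSEDE): SGE adaptor; that machine has 12 cores per node.
lonestar_description = {
    "service_url": "sge+ssh://lonestar.tacc.utexas.edu",
    "number_of_processes": 12,
    "allocation": "YOUR-ALLOCATION",  # placeholder: your project's allocation number
    "queue": "development",           # or another SGE queue on the machine
}
```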

Sample Script

In your $HOME directory, open a new file simple_ensembles.py with your favorite editor (e.g., vim) and paste the following content:

import os
import time
import sys
from pilot import PilotComputeService, ComputeDataService, State
### This is the number of jobs you want to run
NUMBER_JOBS=4
COORDINATION_URL = "redis://localhost"

if __name__ == "__main__":

    pilot_compute_service = PilotComputeService(COORDINATION_URL)

    pilot_compute_description = { "service_url": "fork://localhost",
                                  "number_of_processes": 12,
                                  "allocation": "XSEDE12-SAGA",
                                  "queue": "development",                                      
                                  "working_directory": os.getenv("HOME")+"/agent",
                                  "walltime":10
                                }

    pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description)

    compute_data_service = ComputeDataService()
    compute_data_service.add_pilot_compute_service(pilot_compute_service)

    print ("Finished Pilot-Job setup. Submitting compute units")

    # submit compute units
    for i in range(NUMBER_JOBS):
        compute_unit_description = {
                "executable": "/bin/echo",
                "arguments": ["Hello","$ENV1","$ENV2"],
                "environment": ['ENV1=env_arg1','ENV2=env_arg2'],
                "number_of_processes": 1,
                "spmd_variation": "single",  # /bin/echo is a serial executable, not MPI
                "output": "stdout.txt",
                "error": "stderr.txt",
                }    
        compute_data_service.submit_compute_unit(compute_unit_description)

    print ("Waiting for compute units to complete")
    compute_data_service.wait()

    print ("Terminate Pilot Jobs")
    compute_data_service.cancel()    
    pilot_compute_service.cancel()

Execute the script with the command:

python simple_ensembles.py

Where is my output?

Go into the working directory (in this case, $HOME/agent). You should see a directory named after the pilot-service that starts with bj- and is followed by a unique identifier for that BigJob. If you cd into that directory, you will see the compute unit directories. These directories start with sj- and are followed by a unique identifier. If you cd into one of these directories, you will find a stdout.txt and stderr.txt file. stdout.txt should contain the results of the /bin/echo job. Please note that the names of stdout and stderr are configurable in the ComputeUnitDescription.
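To collect all of the outputs at once, the following helper is a small sketch; it assumes the working_directory from the script above ($HOME/agent) and the bj-*/sj-* directory naming described here:

```python
import glob
import os

def collect_stdout(agent_dir):
    """Return {path: contents} for every compute unit's stdout.txt
    found under agent_dir/bj-*/sj-*/."""
    pattern = os.path.join(agent_dir, "bj-*", "sj-*", "stdout.txt")
    return {p: open(p).read().strip() for p in sorted(glob.glob(pattern))}

# Print each compute unit's output next to its path.
for path, text in collect_stdout(os.path.expanduser("~/agent")).items():
    print(path, "->", text)
```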


Back: [Tutorial Home](BigJob Tutorial)    Next: BigJob Tutorial Part 4: Mandelbrot Example