
BigJob on XSEDE

melrom edited this page Sep 20, 2012 · 21 revisions

Introduction to BigJob

BigJob, a SAGA-based Pilot-Job, is a general-purpose Pilot-Job framework. Pilot-Jobs support the use of container jobs with sophisticated workflow management to coordinate the launch and interaction of the actual computational tasks within the container. This decouples workload submission from resource assignment, allowing a flexible execution model that enables the distributed scale-out of applications on multiple, possibly heterogeneous resources. It allows jobs to be executed without having to queue each individual job.

Additional information about BigJob can be found on the website: http://saga-project.github.com/BigJob/. Comprehensive API documentation is available at http://saga-project.github.com/BigJob/apidoc/.

Below are descriptions of the two most important constructs used to build workflows with the Pilot-API.

Pilot Description

A Pilot Description defines the resource specification for managing jobs on that resource. The following attributes need to be provided:

  • service_url - specifies the SAGA Bliss job adaptor and the hostname of the resource on which jobs can be executed. For remote hosts, password-less login needs to be enabled.
  • number_of_processes - specifies the total number of processes that need to be allocated to run the jobs.
  • queue - specifies the job queue to be used.
  • allocation - specifies the project allocation to be charged for the consumed compute time.
  • working_directory - specifies the directory in which the Pilot-Job agent executes.
  • walltime - specifies the number of minutes for which the resources are requested.
  • file_transfer - specifies the files that need to be transferred in order to execute the jobs successfully. Generally, files common to all the jobs should be listed here.
import os

pilot_compute_description = []
pilot_compute_description.append({ "service_url": "sge+ssh://localhost",
                                   "number_of_processes": 12,
                                   "allocation": "XSEDE12-SAGA",
                                   "queue": "development",
                                   "working_directory": os.getenv("HOME")+"/agent",
                                   "walltime": 10
                                 })
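As a sanity check, the required attributes can be verified before a description is handed to BigJob. The following is a minimal pure-Python sketch; the `validate_pilot_description` helper and its set of mandatory keys are assumptions for illustration, not part of the BigJob API:

```python
import os

# Keys this sketch treats as mandatory for a pilot description
# (assumption derived from the attribute list above).
REQUIRED_KEYS = {"service_url", "number_of_processes",
                 "working_directory", "walltime"}

def validate_pilot_description(description):
    """Return a sorted list of required keys missing from the description."""
    return sorted(REQUIRED_KEYS - set(description))

description = {
    "service_url": "sge+ssh://localhost",
    "number_of_processes": 12,
    "queue": "development",
    "working_directory": os.getenv("HOME", "/tmp") + "/agent",
    "walltime": 10,
}

print(validate_pilot_description(description))          # []
print(validate_pilot_description({"queue": "normal"}))  # lists the missing keys
```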

Compute Unit Description

The Compute Unit Description allows the user to specify the actual job parameters and data needed to execute the job.

  • executable - specifies the executable to be run.
  • arguments - specifies the list of arguments to be passed to the executable.
  • environment - specifies the list of environment variables to be set for successful job execution.
  • working_directory - specifies the directory in which the job executes. If not specified, the Pilot-Job creates a default directory.
  • number_of_processes - specifies the number of processes to be assigned to the job execution.
  • spmd_variation - specifies the type of job. By default, it is a single (non-MPI) job.
  • output - specifies the file in which the standard output of the job execution is stored.
  • error - specifies the file in which the standard error of the job execution is stored.
  • file_transfer - specifies the files that need to be transferred in order to execute the job successfully. Generally, files specific to the job should be listed here.
compute_unit_description = { "executable": "/bin/echo",
                             "arguments": ["Hello","$ENV1","$ENV2"],
                             "environment": ['ENV1=env_arg1','ENV2=env_arg2'],
                             "number_of_processes": 4,            
                             "spmd_variation":"mpi",
                             "output": "stdout.txt",
                             "error": "stderr.txt"
                           }    
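To illustrate how arguments and environment interact, the sketch below mimics in plain Python (no BigJob required) the variable expansion the job above would see at run time; the expansion logic is a deliberate simplification for illustration only:

```python
compute_unit_description = {
    "executable": "/bin/echo",
    "arguments": ["Hello", "$ENV1", "$ENV2"],
    "environment": ["ENV1=env_arg1", "ENV2=env_arg2"],
}

# Build a lookup table from the "NAME=value" environment entries.
env = dict(pair.split("=", 1) for pair in compute_unit_description["environment"])

# Substitute $NAME arguments with their values, as the shell would.
expanded = [env.get(arg[1:], arg) if arg.startswith("$") else arg
            for arg in compute_unit_description["arguments"]]

print(" ".join(expanded))  # Hello env_arg1 env_arg2
```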

Environment Setup and Installation

BigJob uses SAGA-Bliss to connect to different grid middleware. SAGA-Bliss is installed automatically by BigJob. For more information on SAGA-Bliss, please read the SAGA Tutorial.

Bootstrap your Local Python Environment

We do not want to use the system Python installation on XSEDE, because it is not uniform across all machines. Instead, you need a place where you can install BigJob locally. A small tool called virtualenv allows you to create a local Python environment that behaves exactly like the global one, with the only difference that you have write access to it. To create your local Python environment, run the following command (you can install virtualenv on most systems via apt-get, yum, etc.):

virtualenv $HOME/.bigjob

If you don't have virtualenv installed and you don't have root access to your machine, you can use the following script instead:

curl --insecure -s https://raw.github.com/pypa/virtualenv/master/virtualenv.py | python - $HOME/.bigjob

Activate your Local Python Environment

You need to activate your Python environment in order to make it work. Run the command below. It will temporarily modify your PYTHONPATH so that it points to $HOME/.bigjob/lib/python2.7/site-packages/ instead of the system site-packages directory:

source $HOME/.bigjob/bin/activate

Activating the virtualenv is very important. If you don't activate your virtual Python environment, the rest of this tutorial will not work. You can usually tell that your environment is activated properly if your bash command-line prompt starts with (.bigjob).

Install BigJob

After your virtual environment is active, you are ready to install BigJob. BigJob is available via PyPI and can be installed using pip by typing:

pip install BigJob

To make sure that your installation works, run the following command to check if the BigJob module can be imported by the interpreter:

python -c "import pilot; print pilot.version"

Create BigJob Agent Directory

Prior to running these examples, you will need to create a directory called 'agent' in the same location that you are running your scripts from. BigJob uses this as its working directory. For example, you might create the agent directory in the $HOME directory by typing:

mkdir $HOME/agent

Note: It is good practice to run your scripts out of $SCRATCH or $WORK. In this case, you would type mkdir $SCRATCH/<anySubDirectoryOfYourChoice>/agent and run your script from $SCRATCH/<anySubDirectoryOfYourChoice>. $HOME should only be used for the tutorial scripts!

Configure SSH Keys

If you are planning to submit from one resource to another, you must have SSH password-less login enabled from the submitting resource to the target resource. This is achieved by placing the public key of the submitting resource in the authorized_keys file on the target machine. Please see our guide to configuring SSH Password-Less Login.

Examples of when you would need password-less login: (1) you want to submit from your local machine to an XSEDE resource, (2) you want to submit from one XSEDE resource to another, (3) you want to submit from your local cluster to external clusters, etc.

Supported Adaptors on XSEDE

The following describes which adaptors are supported on XSEDE and which machines use which adaptors (e.g. Lonestar uses the SGE batch queuing system, Kraken uses Torque for job submission). Reference this table to find out how to edit the service_url in the pilot_compute_description as shown above.

| Adaptor | Description | Information | Machine(s) |
| --- | --- | --- | --- |
| fork | Submit jobs only on the localhost head node. Password-less login to localhost is required. Example usage: fork://localhost | | localhost |
| SSH | Submit jobs on the target machine's head node. Password-less login to the target machine is required. Example usage: ssh://eric1.loni.org | Allows jobs to be submitted to a remote host via SSH | localhost |
| PBS | Submit jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: remote (over SSH): pbs+ssh://eric1.loni.org, or local: pbs://localhost | Interfaces with a PBS, PBS Pro, or TORQUE scheduler locally or remotely via SSH | Trestles, Kraken |
| SGE | Submit jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: remote (over SSH): sge+ssh://lonestar.tacc.utexas.edu, or local: sge://localhost | Interfaces with a Sun Grid Engine (SGE) scheduler locally or remotely via SSH | Lonestar, Ranger |
| GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge | The Globus resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html | Kraken, Lonestar, Ranger, Trestles |
| Torque+GSISSH | Submit jobs using gsissh. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: xt5torque+gsissh://gsissh.kraken.nics.xsede.org | The GSISSH resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html | Kraken |

Note: For Lonestar/Ranger, sge+ssh:// is required when the target machine is not the same machine the script is being run on. For example, if you run the script on Lonestar but wish to submit jobs to Ranger, you must use sge+ssh://; if you are running the script on Lonestar and submitting jobs to Lonestar only, you can use sge://. The same is true for PBS. Also note that to run from your local laptop to Lonestar, you would use sge+ssh://, and you must have SSH keys configured.
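The rule in this note can be sketched as a small helper; service_url_for is a hypothetical function for illustration, not part of the BigJob API:

```python
def service_url_for(scheduler, target_host, local_host):
    """Pick the plain scheme when submitting to the machine the script
    runs on, and the +ssh variant for a remote target (which requires
    password-less SSH to be configured)."""
    if target_host == local_host:
        return "%s://localhost" % scheduler
    return "%s+ssh://%s" % (scheduler, target_host)

# Script running on Lonestar, submitting to Lonestar itself:
print(service_url_for("sge", "lonestar.tacc.utexas.edu",
                      "lonestar.tacc.utexas.edu"))  # sge://localhost

# Script running on a local laptop, submitting to Lonestar:
print(service_url_for("sge", "lonestar.tacc.utexas.edu",
                      "mylaptop"))  # sge+ssh://lonestar.tacc.utexas.edu
```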

Additional Examples

You can work through examples for running the BigJob code by visiting the comprehensive BigJob tutorial.