BigJob on XSEDE
BigJob, a SAGA-based Pilot-Job, is a general-purpose Pilot-Job framework. Pilot-Jobs decouple workload submission from resource assignment, allowing flexible and dynamic execution of tasks. This has important consequences, ranging from something as simple as the ability to execute multiple jobs without queuing each one individually, to advanced usage scenarios such as executing tasks with complex and dynamic dependencies. The task execution model supported by Pilot-Jobs enables the distributed scale-out of applications on multiple, possibly heterogeneous, resources.
Additional information about BigJob can be found on the website: http://saga-project.github.com/BigJob/. We recommend you work through the [BigJob Tutorial](https://github.com/saga-project/BigJob/wiki/BigJob-Tutorial). Comprehensive API documentation is available at http://saga-project.github.com/BigJob/apidoc/.
BigJob uses SAGA-Python to connect to different grid middleware. SAGA-Python is installed automatically by BigJob. Although you should not need to know about SAGA-Python to use BigJob, for more information on SAGA-Python, please read the SAGA Tutorial.
BigJob uses a Redis server for coordination and task management. Redis is the most stable and fastest backend (requires Python >2.5) and the recommended way of using BigJob. Redis can easily be run in user space. It can be downloaded at http://redis.io/download (just ~500 KB). Once you have downloaded it, start a Redis server on the machine of your choice:
```
$ redis-server
[489] 13 Sep 10:11:28 # Warning: no config file specified, using the default config. In order to specify a config file use 'redis-server /path/to/redis.conf'
[489] 13 Sep 10:11:28 * Server started, Redis version 2.2.12
[489] 13 Sep 10:11:28 * The server is now ready to accept connections on port 6379
[489] 13 Sep 10:11:28 - 0 clients connected (0 slaves), 922160 bytes in use
```
Then set the COORDINATION_URL parameter in the example to the Redis endpoint of your Redis installation, e.g.:

```
redis://<hostname>:6379
```
You can install redis on a persistent server and use this server as your coordination server.
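The coordination endpoint is an ordinary URL, so you can sanity-check its shape before passing it to BigJob with Python's standard library. This is just an illustrative sketch; the hostname below is a placeholder for your Redis host:

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# Placeholder coordination endpoint -- substitute your Redis hostname.
coordination_url = "redis://my-redis-host.example.org:6379"

parts = urlparse(coordination_url)
print(parts.scheme)    # redis
print(parts.hostname)  # my-redis-host.example.org
print(parts.port)      # 6379
```

If the scheme is not `redis` or the port is missing, BigJob will not be able to reach your coordination server.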
We do not want to use the system Python installation on XSEDE, because it is not uniform across all machines. Instead, you need a place where you can install BigJob locally. A small tool called virtualenv allows you to create a local Python software repository that behaves exactly like the global Python repository, with the only difference that you have write access to it. To create your local Python environment, run the following command (you can install virtualenv on most systems via apt-get or yum, etc.):
```
virtualenv $HOME/.bigjob
```
If you don't have virtualenv installed and you don't have root access to your machine, you can use the following script instead:
```
curl --insecure -s https://raw.github.com/pypa/virtualenv/master/virtualenv.py | python - $HOME/.bigjob
```
You need to activate your Python environment in order to make it work. Run the command below. It will temporarily modify your PYTHONPATH so that it points to $HOME/.bigjob/lib/python2.7/site-packages/ instead of the system site-packages directory:

```
source $HOME/.bigjob/bin/activate
```

Activating the virtualenv is very important. If you don't activate your virtual Python environment, the rest of this installation will not work. You can usually tell that your environment is activated properly if your bash command-line prompt starts with (.bigjob).
After your virtual environment is active, you are ready to install BigJob. BigJob is available via PyPI and can be installed by typing:

```
easy_install BigJob
```
To make sure that your installation works, run the following command to check if the BigJob module can be imported by the interpreter:
```
python -c "import pilot; print pilot.version"
```
The expected output should give the date and time, followed by "bigjob - INFO - Loading BigJob version: #.#.#", where #.#.# corresponds to the actual BigJob version you are using.
BigJob requires Redis for communication between the BigJob manager and agents. It is recommended to set up your own Redis in user-space. On Linux you can simply download Redis (version 2.2.x is recommended) and install it using the following commands:

```
tar -xzvf redis-2.2.12.tar.gz
cd redis-2.2.12
make
make install   # only if you have root access
```

Start Redis (the redis-server executable is located in the src/ directory):

```
cd src
./redis-server
```
It is recommended to set a password for your Redis server. Otherwise, other users will be able to access and manipulate the data stored in your Redis server.
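A minimal sketch of password protection using the standard Redis requirepass directive; the password below is a placeholder:

```
# redis.conf -- require clients to authenticate before issuing commands
requirepass mysecretpassword
```

With a password set, the coordination URL typically takes a form like redis://mysecretpassword@&lt;hostname&gt;:6379; the exact URL syntax accepted is an assumption based on common Redis URL conventions, so check the BigJob documentation for your version.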
Prior to running these examples, you will need to create a directory called 'agent' in the same location that you are running your scripts from. BigJob uses this as its working directory. For example, you might create the agent directory in $HOME by typing:

```
mkdir $HOME/agent
```
Note: It is good practice to run your scripts out of $SCRATCH or $WORK. In this case, you would type mkdir $SCRATCH/<anySubDirectoryOfYourChoice>/agent and run your script from $SCRATCH/<anySubDirectoryOfYourChoice>. $HOME should only be used for the tutorial scripts!
If you are planning to submit from one resource to another, you must have SSH password-less login enabled to the submitting resource. This is achieved by placing the public key from one resource into the authorized_keys file on the target machine. Please see our guide to configuring SSH Password-Less Login.
Examples of when you would need password-less login: (1) you want to submit from your local machine to an XSEDE resource, (2) you want to submit from one XSEDE resource to another, (3) you want to submit from your local cluster to external clusters, etc.
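The setup can be sketched with standard OpenSSH tools; the remote username and hostname below are placeholders that you would replace with your own:

```shell
# Generate an RSA keypair (no passphrase) if one does not already exist.
if [ ! -f "$HOME/.ssh/id_rsa" ]; then
    mkdir -p "$HOME/.ssh"
    ssh-keygen -q -t rsa -N "" -f "$HOME/.ssh/id_rsa"
fi

# Append the public key to the target machine's authorized_keys file
# (username@target.host is a placeholder -- substitute your target machine):
# ssh-copy-id username@target.host
```

After this, `ssh username@target.host` should log you in without prompting for a password.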
Below are descriptions of two important constructs used to build workflows with the Pilot-API.
The Pilot-API provides different entities for managing pilots, compute units, and data units. The most important are Pilot Descriptions - for describing pilots - and Compute Unit Descriptions - for describing computational tasks.
The Pilot Description defines the resource specification for managing jobs on a resource. The following resource specifications need to be provided:
- service_url - specifies the SAGA Bliss job adaptor and the resource hostname on which jobs can be executed. For remote hosts, password-less login needs to be enabled.
- number_of_processes - specifies the total number of processes to be allocated for running the jobs.
- allocation - specifies your allocation number on XSEDE.
- queue - specifies the job queue to be used.
- working_directory - specifies the directory in which the Pilot-Job agent executes.
- walltime - specifies the number of minutes for which the resources are requested.
- file_transfer - specifies the files that need to be transferred in order to execute the jobs successfully. Generally, files common to all the jobs should be listed here.
```python
pilot_compute_description.append({
    "service_url": "sge+ssh://localhost",
    "number_of_processes": 12,
    "allocation": "XSEDE12-SAGA",
    "queue": "development",
    "working_directory": os.getenv("HOME") + "/agent",
    "walltime": 10
})
```
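Since the description is a plain Python dictionary, you can assemble it and sanity-check the fields listed above before handing it to BigJob. This sketch is only an illustration and does not require BigJob to be installed; the values mirror the example above:

```python
import os

pilot_compute_description = []
pilot_compute_description.append({
    "service_url": "sge+ssh://localhost",
    "number_of_processes": 12,
    "allocation": "XSEDE12-SAGA",
    "queue": "development",
    "working_directory": os.getenv("HOME", "/tmp") + "/agent",
    "walltime": 10,
})

# Simple sanity check: fields the examples in this guide always set.
required = {"service_url", "number_of_processes", "working_directory", "walltime"}
for desc in pilot_compute_description:
    missing = required - set(desc)
    if missing:
        raise ValueError("missing fields: %s" % sorted(missing))
print("description looks complete")
```

Catching a missing or misspelled field here is much cheaper than discovering it after the pilot has been submitted to the batch queue.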
The Compute Unit Description allows the user to specify the actual job parameters and the data needed to execute the job.
- executable - specifies the executable.
- arguments - specifies the list of arguments to be passed to the executable.
- environment - specifies the list of environment variables to be set for successful job execution.
- working_directory - specifies the directory in which the job executes. If not specified, the Pilot-Job creates a default directory.
- number_of_processes - specifies the number of processes to be assigned to the job execution.
- spmd_variation - specifies the type of job. By default it is a single job.
- output - specifies the file in which the standard output of the job execution is stored.
- error - specifies the file in which the standard error of the job execution is stored.
- file_transfer - specifies the files that need to be transferred in order to execute the job successfully. Generally, files specific to the job should be listed here.
```python
compute_unit_description = {
    "executable": "/bin/echo",
    "arguments": ["Hello", "$ENV1", "$ENV2"],
    "environment": ["ENV1=env_arg1", "ENV2=env_arg2"],
    "number_of_processes": 4,
    "spmd_variation": "mpi",
    "output": "stdout.txt",
    "error": "stderr.txt"
}
```
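To see how the environment entries combine with the arguments, you can emulate the executed command locally with the standard library. This is only an illustration of the variable expansion - it is not how BigJob actually launches units on the resource:

```python
import os
import subprocess

compute_unit_description = {
    "executable": "/bin/echo",
    "arguments": ["Hello", "$ENV1", "$ENV2"],
    "environment": ["ENV1=env_arg1", "ENV2=env_arg2"],
}

# Build the environment the unit would see: inherit the current one,
# then apply the KEY=VALUE entries from the description.
env = dict(os.environ)
for entry in compute_unit_description["environment"]:
    key, value = entry.split("=", 1)
    env[key] = value

# $ENV1/$ENV2 are expanded by the shell on the execution host; emulate
# that here by running the command through a shell.
command = " ".join([compute_unit_description["executable"]]
                   + compute_unit_description["arguments"])
result = subprocess.run(command, shell=True, env=env,
                        capture_output=True, text=True)
print(result.stdout.strip())  # Hello env_arg1 env_arg2
```

The stdout and stderr of the real unit would land in the files named by the output and error fields.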
The following table describes which middleware plugins are supported on XSEDE and which machines use which middleware (e.g., Lonestar uses the SGE batch queuing system). Reference this table to find out how to edit the service_url in the pilot_compute_description as shown above.
| Supported Middleware Plugins | Description | Information | Machine |
|---|---|---|---|
| fork | Submit jobs only on the localhost head node. Password-less login to localhost is required. Example usage: fork://localhost | | localhost |
| SSH | Submit jobs on the target machine's head node. Password-less login to the target machine is required. Example usage: ssh://eric1.loni.org | Allows submitting jobs to a remote host via SSH | localhost |
| PBS | Submit jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: Remote (over SSH): pbs+ssh://eric1.loni.org or Local: pbs://localhost | Interfaces with a PBS, PBS Pro or TORQUE scheduler locally or remotely via SSH | Trestles, Kraken |
| SGE | Submit jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: Remote (over SSH): sge+ssh://lonestar.tacc.utexas.edu or Local: sge://localhost | Interfaces with a Sun Grid Engine (SGE) scheduler locally or remotely via SSH | Lonestar, Ranger |
| GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge | Please find the Globus resource URLs of XSEDE machines at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html | Kraken, Lonestar, Ranger, Trestles |
| Torque+GSISSH | Submit jobs using gsissh. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: xt5torque+gsissh://gsissh.kraken.nics.xsede.org | Please find the GSISSH resource URLs of XSEDE machines at | Kraken |
Note: For Lonestar/Ranger, sge+ssh:// is required when the target machine is not the same as the machine that the script is being run on. For example, if you run the script on Lonestar but wish to submit jobs TO Ranger, you must use sge+ssh://, but if you are running the script ON Lonestar and submitting jobs TO Lonestar only, you can use sge://. The same is true with PBS. Also note that to run from your LOCAL laptop TO Lonestar, you would use sge+ssh:// and you MUST have SSH keys configured.
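The rule above can be sketched as a small helper function; the function name and the hostnames are placeholders used only for illustration:

```python
def make_service_url(scheduler, target_host, local=False):
    """Compose a BigJob service_url.

    scheduler   -- batch-system adaptor name from the table, e.g. "sge" or "pbs"
    target_host -- hostname of the machine that should run the pilot
    local       -- True when the script runs on the target machine itself
    """
    if local:
        # Same machine: talk to the scheduler directly, no SSH hop.
        return "%s://localhost" % scheduler
    # Different machine: tunnel the scheduler interaction over SSH.
    return "%s+ssh://%s" % (scheduler, target_host)

print(make_service_url("sge", "lonestar.tacc.utexas.edu"))        # sge+ssh://lonestar.tacc.utexas.edu
print(make_service_url("sge", "lonestar.tacc.utexas.edu", True))  # sge://localhost
```
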
You can work through examples of running BigJob code by visiting the comprehensive BigJob tutorial.
- [BigJob Tutorial](https://github.com/saga-project/BigJob/wiki/BigJob-Tutorial) - Learn more about BigJob and work through examples
Check the [Frequently Asked Questions](https://github.com/saga-project/BigJob/wiki/Frequently-Asked-Questions).
For questions and comments, please join the bigjob-users group.