
BigJob on XSEDE


Introduction to BigJob

BigJob, a SAGA-based Pilot-Job, is a general-purpose Pilot-Job framework. Pilot-Jobs decouple workload submission from resource assignment, allowing flexible and dynamic execution of tasks. This has important consequences, ranging from something as simple as the ability to execute multiple jobs without queuing each one individually, to advanced usage scenarios such as executing tasks with complex and dynamic dependencies. The task execution model supported by Pilot-Jobs enables the distributed scale-out of applications on multiple, possibly heterogeneous resources.

Additional information about BigJob can be found on the website: http://saga-project.github.com/BigJob/. We recommend you work through the [BigJob Tutorial](https://github.com/saga-project/BigJob/wiki/BigJob-Tutorial). Comprehensive API documentation is available at http://saga-project.github.com/BigJob/apidoc/.

Below are descriptions of two important constructs used to build workflows with the Pilot-API.

Pilot Description

The Pilot Description defines the resource specification used to manage jobs on a resource. The following attributes need to be provided:

  • service_url - specifies the SAGA Bliss job adaptor and the hostname of the resource on which jobs can be executed. For remote hosts, password-less login needs to be enabled.
  • number_of_processes - specifies the total number of processes that need to be allocated to run the jobs.
  • allocation - specifies your allocation number on XSEDE.
  • queue - specifies the job queue to be used.
  • working_directory - specifies the directory in which the Pilot-Job agent executes.
  • walltime - specifies the number of minutes for which the resources are requested.
  • file_transfer - specifies the files that need to be transferred in order to execute the jobs successfully. Generally, files common to all the jobs should be listed here.
import os

pilot_compute_description = []
pilot_compute_description.append({ "service_url": "sge+ssh://localhost",
                                   "number_of_processes": 12,
                                   "allocation": "XSEDE12-SAGA",
                                   "queue": "development",
                                   "working_directory": os.getenv("HOME")+"/agent",
                                   "walltime": 10
                                 })
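The description is then handed to the Pilot API roughly as in the BigJob Tutorial. The snippet below is a minimal sketch, not part of the original example; it assumes a Redis server is running at COORDINATION_URL (see the Redis Server section below).

from pilot import PilotComputeService

COORDINATION_URL = "redis://localhost:6379"   # endpoint of your Redis server

# Start the Pilot-Job service and launch one pilot per description in the list.
pilot_compute_service = PilotComputeService(coordination_url=COORDINATION_URL)
for description in pilot_compute_description:
    pilot_compute_service.create_pilot(pilot_compute_description=description)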

Compute Unit Description

The Compute Unit Description allows the user to specify the actual job parameters and data needed to execute the job.

  • executable - specifies the executable to run.
  • arguments - specifies the list of arguments to be passed to the executable.
  • environment - specifies the list of environment variables to be set for the job execution.
  • working_directory - specifies the directory in which the job executes. If not specified, the Pilot-Job creates a default directory.
  • number_of_processes - specifies the number of processes to be assigned to the job.
  • spmd_variation - specifies the type of job. By default it is a single (non-MPI) job.
  • output - specifies the file in which the standard output of the job is stored.
  • error - specifies the file in which the standard error of the job is stored.
  • file_transfer - specifies the files that need to be transferred in order to execute the job successfully. Generally, files specific to the job should be listed here.
compute_unit_description = { "executable": "/bin/echo",
                             "arguments": ["Hello","$ENV1","$ENV2"],
                             "environment": ['ENV1=env_arg1','ENV2=env_arg2'],
                             "number_of_processes": 4,            
                             "spmd_variation":"mpi",
                             "output": "stdout.txt",
                             "error": "stderr.txt"
                           }    
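Once a pilot is running, the compute unit description is submitted through a ComputeDataService. The snippet below is a minimal sketch along the lines of the BigJob Tutorial, assuming the pilot_compute_service created in the sketch above.

from pilot import ComputeDataService

# Attach the compute data service to the pilot service and submit the unit.
compute_data_service = ComputeDataService()
compute_data_service.add_pilot_compute_service(pilot_compute_service)

compute_unit = compute_data_service.submit_compute_unit(compute_unit_description)

compute_data_service.wait()     # block until all submitted units have finished
compute_data_service.cancel()   # shut down the pilot(s) and release the resources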

Environment Setup and Installation

BigJob uses SAGA-Python to connect to different grid middleware. SAGA-Python is installed automatically by BigJob. Although you should not need to know about SAGA-Python to use BigJob, for more information on SAGA-Python, please read the SAGA Tutorial.

Redis Server

BigJob uses a Redis server for coordination and task management. Redis is the most stable and fastest coordination backend (it requires Python > 2.5) and the recommended way of using BigJob. Redis can easily be run in user space. It can be downloaded at http://redis.io/download (just ~500 KB). Once you have downloaded it, start a Redis server on the machine of your choice:

$ redis-server 
[489] 13 Sep 10:11:28 # Warning: no config file specified, using the default config. In order to specify a config file use 'redis-server /path/to/redis.conf'
[489] 13 Sep 10:11:28 * Server started, Redis version 2.2.12
[489] 13 Sep 10:11:28 * The server is now ready to accept connections on port 6379
[489] 13 Sep 10:11:28 - 0 clients connected (0 slaves), 922160 bytes in use

Then set the COORDINATION_URL parameter in the example to the Redis endpoint of your Redis installation, e.g.

redis://localhost:6379 

You can install Redis on a persistent server and use it as your coordination server.
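To verify that the endpoint in your COORDINATION_URL is reachable, you can optionally ping it with the redis Python client. This is just a sketch; adjust host, port, and any password to match your server.

import redis

# Connectivity check against the coordination server (values are examples).
connection = redis.Redis(host="localhost", port=6379)
print(connection.ping())   # prints True if the Redis server answers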

Bootstrap your Local Python Environment

We do not want to use the system Python installation on XSEDE, because it is not uniform across all machines. Instead, you need a place where you can install BigJob locally. A small tool called virtualenv allows you to create a local Python software repository that behaves exactly like the global Python repository, with the only difference being that you have write access to it. To create your local Python environment, run the following command (you can install virtualenv on most systems via apt-get or yum, etc.):

virtualenv $HOME/.bigjob

If you don't have virtualenv installed and you don't have root access to your machine, you can use the following script instead:

curl --insecure -s https://raw.github.com/pypa/virtualenv/master/virtualenv.py | python - $HOME/.bigjob

Activate your Local Python Environment

You need to activate your Python environment in order to make it work. Run the command below. It will temporarily modify your PYTHONPATH so that it points to $HOME/.bigjob/lib/python2.7/site-packages/ instead of the system site-packages directory:

source $HOME/.bigjob/bin/activate

Activating the virtualenv is very important. If you don't activate your virtual Python environment, the rest of this installation will not work. You can usually tell that your environment is activated properly if your bash command-line prompt starts with (.bigjob).

Install BigJob

After your virtual environment is active, you are ready to install BigJob. BigJob is available via PyPI and can be installed with easy_install (or pip) by typing:

easy_install BigJob

To make sure that your installation works, run the following command to check if the BigJob module can be imported by the interpreter:

python -c "import pilot; print pilot.version"

The expected output should give the date and time, followed by "bigjob - INFO - Loading BigJob version: #.#.#", where #.#.# corresponds to the actual BigJob version you are using.

Create BigJob Agent Directory

Prior to running these examples, you will need to create a directory called 'agent' in the same location that you are running your scripts from. BigJob uses this as its working directory. For example, you might create the agent directory in the $HOME directory by typing:

mkdir $HOME/agent

Note: It is good practice to run your scripts out of $SCRATCH or $WORK. In this case, you would type mkdir $SCRATCH/<anySubDirectoryOfYourChoice>/agent and run your script from $SCRATCH/<anySubDirectoryOfYourChoice>. $HOME should only be used for the tutorial scripts!

Configure SSH Keys

If you are planning to submit from one resource to another, you must have SSH password-less login enabled from the submitting resource to the target resource. This is achieved by placing the public key of the submitting resource in the authorized_keys file on the target machine. Please see our guide to configuring SSH Password-Less Login.

Examples of when you would need password-less login: (1) you want to submit from your local machine to an XSEDE resource, (2) you want to submit from one XSEDE resource to another, or (3) you want to submit from your local cluster to external clusters.

Supported Middleware Plugins on XSEDE

The following describes which middleware plugins are supported on XSEDE and which machines use which middleware (e.g. Lonestar uses the SGE batch queuing system, Kraken uses Torque for job submission). Use this table to determine how to set the service_url in the pilot_compute_description shown above.

| Supported Middleware Plugin | Description | Information | Machine |
|---|---|---|---|
| fork | Submits jobs only on the localhost head node. Password-less login to localhost is required. Example usage: fork://localhost | | localhost |
| SSH | Submits jobs on the target machine's head node. Password-less login to the target machine is required. Example usage: ssh://eric1.loni.org | Allows jobs to be submitted to a remote host via SSH | localhost |
| PBS | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: remote (over SSH) pbs+ssh://eric1.loni.org, or local pbs://localhost | Interfaces with a PBS, PBS Pro, or TORQUE scheduler locally or remotely via SSH | Trestles, Kraken |
| SGE | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: remote (over SSH) sge+ssh://lonestar.tacc.utexas.edu, or local sge://localhost | Interfaces with a Sun Grid Engine (SGE) scheduler locally or remotely via SSH | Lonestar, Ranger |
| GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge | The Globus resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html | Kraken, Lonestar, Ranger, Trestles |
| Torque+GSISSH | Submits jobs using gsissh. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: xt5torque+gsissh://gsissh.kraken.nics.xsede.org | The GSISSH resource URLs of XSEDE machines are listed at http://www.xsede.org | Kraken |

Note: For Lonestar/Ranger, sge+ssh:// is required when the target machine is not the same as the machine the script is being run on. For example, if you run the script on Lonestar but wish to submit jobs to Ranger, you must use sge+ssh://; if you are running the script on Lonestar and submitting jobs to Lonestar only, you can use sge://. The same is true for PBS. Also note that to submit from your local laptop to Lonestar, you would use sge+ssh:// and you must have SSH keys configured.
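The sketch below (not part of the original page) shows how the URLs from this table plug into the pilot_compute_description from earlier; only the service_url values are taken from the table, while allocation and queue remain placeholders you must replace with your own.

# Running the script on Lonestar and submitting to Lonestar's own SGE scheduler:
pilot_compute_description.append({ "service_url": "sge://localhost",
                                   "number_of_processes": 12,
                                   "allocation": "XSEDE12-SAGA",
                                   "queue": "development",
                                   "working_directory": os.getenv("HOME")+"/agent",
                                   "walltime": 10
                                 })

# Submitting to Lonestar from another machine (SSH keys required):
#   "service_url": "sge+ssh://lonestar.tacc.utexas.edu"

# Submitting to Kraken via Torque+GSISSH (grid proxy required):
#   "service_url": "xt5torque+gsissh://gsissh.kraken.nics.xsede.org"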

BigJob Script Examples

You can work through examples for running the BigJob code by visiting the comprehensive BigJob tutorial.

Where to Get Help?

Check the [Frequently Asked Questions](https://github.com/saga-project/BigJob/wiki/Frequently-Asked-Questions).

For questions and comments, please join the bigjob-users Google Group.

***