BigJob on XSEDE
BigJob, a SAGA-based Pilot-Job, is a general-purpose Pilot-Job framework. Pilot-Jobs decouple workload submission from resource assignment, allowing flexible and dynamic execution of tasks. This has important consequences, ranging from something as simple as the ability to execute multiple jobs without queuing each one individually, to advanced usage scenarios such as executing tasks with complex and dynamic dependencies. The task execution model supported by Pilot-Jobs enables the distributed scale-out of applications on multiple, possibly heterogeneous, resources.
Additional information about BigJob can be found on the website: http://saga-project.github.com/BigJob/. We recommend you work through the [BigJob Tutorial](https://github.com/saga-project/BigJob/wiki/BigJob-Tutorial). Comprehensive API documentation is available at http://saga-project.github.com/BigJob/apidoc/.
BigJob uses SAGA-Python to connect to different grid middleware. SAGA-Python is installed automatically by BigJob. Although you should not need to know about SAGA-Python to use BigJob, for more information on SAGA-Python, please read the SAGA Tutorial.
BigJob uses a Redis server for coordination and task management. Redis is the most stable and fastest backend (requires Python >2.5) and the recommended way of using BigJob. Redis can easily be run in user space. It can be downloaded at http://redis.io/download (just ~500 KB). Once you have downloaded it, start a Redis server on the machine of your choice:
```
$ redis-server
[489] 13 Sep 10:11:28 # Warning: no config file specified, using the default config. In order to specify a config file use 'redis-server /path/to/redis.conf'
[489] 13 Sep 10:11:28 * Server started, Redis version 2.2.12
[489] 13 Sep 10:11:28 * The server is now ready to accept connections on port 6379
[489] 13 Sep 10:11:28 - 0 clients connected (0 slaves), 922160 bytes in use
```
Then set the COORDINATION_URL parameter in the example to the Redis endpoint of your Redis installation, e.g.:

```
redis://<hostname>:6379
```
You can install redis on a persistent server and use this server as your coordination server.
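The coordination endpoint is an ordinary URL, so you can sanity-check its shape before passing it to BigJob with Python's standard library. This is just an illustrative sketch; the hostname below is a placeholder for your Redis host:

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# Placeholder coordination endpoint -- substitute your Redis hostname.
coordination_url = "redis://my-redis-host.example.org:6379"

parts = urlparse(coordination_url)
print(parts.scheme)    # redis
print(parts.hostname)  # my-redis-host.example.org
print(parts.port)      # 6379
```

If the scheme is not `redis` or the port is missing, BigJob will not be able to reach your coordination server.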
We do not want to use the system Python installation on XSEDE, because it is not uniform across all machines. Instead, you need a place where you can install BigJob locally. A small tool called virtualenv allows you to create a local Python software repository that behaves exactly like the global Python repository, with the only difference that you have write access to it. To create your local Python environment, run the following command (you can install virtualenv on most systems via apt-get or yum, etc.):
```
virtualenv $HOME/.bigjob
```
If you don't have virtualenv installed and you don't have root access to your machine, you can use the following script instead:
```
curl --insecure -s https://raw.github.com/pypa/virtualenv/master/virtualenv.py | python - $HOME/.bigjob
```
You need to activate your Python environment in order to make it work. Run the command below. It will temporarily modify your PYTHONPATH so that it points to $HOME/.bigjob/lib/python2.7/site-packages/ instead of the system site-packages directory:

```
source $HOME/.bigjob/bin/activate
```

Activating the virtualenv is very important. If you don't activate your virtual Python environment, the rest of this installation will not work. You can usually tell that your environment is activated properly if your bash command-line prompt starts with (.bigjob).
After your virtual environment is active, you are ready to install BigJob. BigJob is available via PyPI and can be installed by typing:

```
easy_install BigJob
```
To make sure that your installation works, run the following command to check if the BigJob module can be imported by the interpreter:
```
python -c "import pilot; print pilot.version"
```
The expected output should give the date and time, followed by "bigjob - INFO - Loading BigJob version: #.#.#", where #.#.# corresponds to the actual BigJob version you are using.
BigJob requires Redis for communication between the BigJob manager and agents. It is recommended to set up your own Redis in user-space. On Linux you can simply download Redis (version 2.2.x is recommended) and install it using the following commands:

```
tar -xzvf redis-2.2.12.tar.gz
cd redis-2.2.12
make
make install   # only if you have root access
```

Start Redis (the redis-server executable is located in the src/ directory):

```
cd src
./redis-server
```
It is recommended to set a password for your Redis server. Otherwise, other users will be able to access and manipulate the data stored in your Redis server.
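A minimal sketch of password protection using the standard Redis requirepass directive; the password below is a placeholder:

```
# redis.conf -- require clients to authenticate before issuing commands
requirepass mysecretpassword
```

With a password set, the coordination URL typically takes a form like redis://mysecretpassword@&lt;hostname&gt;:6379; the exact URL syntax accepted is an assumption based on common Redis URL conventions, so check the BigJob documentation for your version.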
Prior to running these examples, you will need to create a directory called 'agent' in the same location that you are running your scripts from. BigJob uses this as its working directory. For example, you might create the agent directory in $HOME by typing:

```
mkdir $HOME/agent
```
Note: It is good practice to run your scripts out of $SCRATCH or $WORK. In this case, you would type mkdir $SCRATCH/<anySubDirectoryOfYourChoice>/agent and run your script from $SCRATCH/<anySubDirectoryOfYourChoice>. $HOME should only be used for the tutorial scripts!
If you are planning to submit from one resource to another, you must have SSH password-less login enabled to the submitting resource. This is achieved by placing the public key from one resource into the authorized_keys file on the target machine. Please see our guide to configuring SSH Password-Less Login.
Examples of when you would need password-less login: (1) you want to submit from your local machine to an XSEDE resource, (2) you want to submit from one XSEDE resource to another, (3) you want to submit from your local cluster to external clusters, etc.
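The setup can be sketched with standard OpenSSH tools; the remote username and hostname below are placeholders that you would replace with your own:

```shell
# Generate an RSA keypair (no passphrase) if one does not already exist.
if [ ! -f "$HOME/.ssh/id_rsa" ]; then
    mkdir -p "$HOME/.ssh"
    ssh-keygen -q -t rsa -N "" -f "$HOME/.ssh/id_rsa"
fi

# Append the public key to the target machine's authorized_keys file
# (username@target.host is a placeholder -- substitute your target machine):
# ssh-copy-id username@target.host
```

After this, `ssh username@target.host` should log you in without prompting for a password.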
Below are descriptions of two important constructs used to build workflows with the Pilot-API.
The Pilot-API provides different entities for managing pilots, compute units, and data units. The most important are Pilot Descriptions - for describing pilots - and Compute Unit Descriptions - for describing computational tasks.
The Pilot Description defines the resource specification for managing jobs on a resource. The following resource specifications need to be provided:
- service_url - specifies the SAGA Bliss job adaptor and the resource hostname on which jobs can be executed. For remote hosts, password-less login needs to be enabled.
- number_of_processes - specifies the total number of processes to be allocated for running the jobs.
- allocation - specifies your allocation number on XSEDE.
- queue - specifies the job queue to be used.
- working_directory - specifies the directory in which the Pilot-Job agent executes.
- walltime - specifies the number of minutes for which the resources are requested.
- file_transfer - specifies the files that need to be transferred in order to execute the jobs successfully. Generally, files common to all the jobs should be listed here.
```python
pilot_compute_description.append({
    "service_url": "sge+ssh://localhost",
    "number_of_processes": 12,
    "allocation": "XSEDE12-SAGA",
    "queue": "development",
    "working_directory": os.getenv("HOME") + "/agent",
    "walltime": 10
})
```
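Since the description is a plain Python dictionary, you can assemble it and sanity-check the fields listed above before handing it to BigJob. This sketch is only an illustration and does not require BigJob to be installed; the values mirror the example above:

```python
import os

pilot_compute_description = []
pilot_compute_description.append({
    "service_url": "sge+ssh://localhost",
    "number_of_processes": 12,
    "allocation": "XSEDE12-SAGA",
    "queue": "development",
    "working_directory": os.getenv("HOME", "/tmp") + "/agent",
    "walltime": 10,
})

# Simple sanity check: fields the examples in this guide always set.
required = {"service_url", "number_of_processes", "working_directory", "walltime"}
for desc in pilot_compute_description:
    missing = required - set(desc)
    if missing:
        raise ValueError("missing fields: %s" % sorted(missing))
print("description looks complete")
```

Catching a missing or misspelled field here is much cheaper than discovering it after the pilot has been submitted to the batch queue.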
The Compute Unit Description allows the user to specify the actual job parameters and the data needed to execute the job.
- executable - specifies the executable.
- arguments - specifies the list of arguments to be passed to the executable.
- environment - specifies the list of environment variables to be set for successful job execution.
- working_directory - specifies the directory in which the job executes. If not specified, the Pilot-Job creates a default directory.
- number_of_processes - specifies the number of processes to be assigned to the job execution.
- spmd_variation - specifies the type of job. By default it is a single job.
- output - specifies the file in which the standard output of the job execution is stored.
- error - specifies the file in which the standard error of the job execution is stored.
- file_transfer - specifies the files that need to be transferred in order to execute the job successfully. Generally, files specific to the job should be listed here.
```python
compute_unit_description = {
    "executable": "/bin/echo",
    "arguments": ["Hello", "$ENV1", "$ENV2"],
    "environment": ["ENV1=env_arg1", "ENV2=env_arg2"],
    "number_of_processes": 4,
    "spmd_variation": "mpi",
    "output": "stdout.txt",
    "error": "stderr.txt"
}
```
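To see how the environment entries combine with the arguments, you can emulate the executed command locally with the standard library. This is only an illustration of the variable expansion - it is not how BigJob actually launches units on the resource:

```python
import os
import subprocess

compute_unit_description = {
    "executable": "/bin/echo",
    "arguments": ["Hello", "$ENV1", "$ENV2"],
    "environment": ["ENV1=env_arg1", "ENV2=env_arg2"],
}

# Build the environment the unit would see: inherit the current one,
# then apply the KEY=VALUE entries from the description.
env = dict(os.environ)
for entry in compute_unit_description["environment"]:
    key, value = entry.split("=", 1)
    env[key] = value

# $ENV1/$ENV2 are expanded by the shell on the execution host; emulate
# that here by running the command through a shell.
command = " ".join([compute_unit_description["executable"]]
                   + compute_unit_description["arguments"])
result = subprocess.run(command, shell=True, env=env,
                        capture_output=True, text=True)
print(result.stdout.strip())  # Hello env_arg1 env_arg2
```

The stdout and stderr of the real unit would land in the files named by the output and error fields.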
The following table describes which middleware plugins are supported on XSEDE and which machines use which middleware (e.g., Lonestar uses the SGE batch queuing system). Reference this table to find out how to edit the service_url in the pilot_compute_description as shown above.
| Supported Middleware Plugins | Description | Information | Machine |
|---|---|---|---|
| fork | Submit jobs only on the localhost head node. Password-less login to localhost is required. Example usage: fork://localhost | | localhost |
| SSH | Submit jobs on the target machine's head node. Password-less login to the target machine is required. Example usage: ssh://eric1.loni.org | Allows submitting jobs to a remote host via SSH | localhost |
| PBS | Submit jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: Remote (over SSH): pbs+ssh://eric1.loni.org or Local: pbs://localhost | Interfaces with a PBS, PBS Pro or TORQUE scheduler locally or remotely via SSH | Trestles, Kraken |
| SGE | Submit jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example usage: Remote (over SSH): sge+ssh://lonestar.tacc.utexas.edu or Local: sge://localhost | Interfaces with a Sun Grid Engine (SGE) scheduler locally or remotely via SSH | Lonestar, Ranger |
| GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge | Please find the Globus resource URLs of XSEDE machines at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html | Kraken, Lonestar, Ranger, Trestles |
| Torque+GSISSH | Submit jobs using gsissh. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example usage: xt5torque+gsissh://gsissh.kraken.nics.xsede.org | Please find the GSISSH resource URLs of XSEDE machines at | Kraken |
Note: For Lonestar/Ranger, sge+ssh:// is required when the target machine is not the same as the machine that the script is being run on. For example, if you run the script on Lonestar but wish to submit jobs TO Ranger, you must use sge+ssh://, but if you are running the script ON Lonestar and submitting jobs TO Lonestar only, you can use sge://. The same is true with PBS. Also note that to run from your LOCAL laptop TO Lonestar, you would use sge+ssh:// and you MUST have SSH keys configured.
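The rule above can be sketched as a small helper function; the function name and the hostnames are placeholders used only for illustration:

```python
def make_service_url(scheduler, target_host, local=False):
    """Compose a BigJob service_url.

    scheduler   -- batch-system adaptor name from the table, e.g. "sge" or "pbs"
    target_host -- hostname of the machine that should run the pilot
    local       -- True when the script runs on the target machine itself
    """
    if local:
        # Same machine: talk to the scheduler directly, no SSH hop.
        return "%s://localhost" % scheduler
    # Different machine: tunnel the scheduler interaction over SSH.
    return "%s+ssh://%s" % (scheduler, target_host)

print(make_service_url("sge", "lonestar.tacc.utexas.edu"))        # sge+ssh://lonestar.tacc.utexas.edu
print(make_service_url("sge", "lonestar.tacc.utexas.edu", True))  # sge://localhost
```
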
You can work through examples of running BigJob code by visiting the comprehensive BigJob tutorial.
- [BigJob Tutorial](https://github.com/saga-project/BigJob/wiki/BigJob-Tutorial) - Learn more about BigJob and work through examples
Check the [Frequently Asked Questions](https://github.com/saga-project/BigJob/wiki/Frequently-Asked-Questions).
For questions and comments, please join the bigjob-users group.