BigJob on XSEDE
BigJob, a SAGA-based Pilot-Job, is a general purpose Pilot-Job framework. Pilot-Jobs support the use of container jobs with sophisticated workflow management to coordinate the launch and interaction of actual computational tasks within the container. This results in the decoupling of workload submission from resource assignment, allowing a flexible execution model that enables the distributed scale-out of applications on multiple and possibly heterogeneous resources. It allows the execution of jobs without the necessity to queue each individual job.
Additional information about BigJob can be found on the website: http://saga-project.github.com/BigJob/. Comprehensive API documentation is available at http://saga-project.github.com/BigJob/apidoc/.
Below are the descriptions of two important constructs used to build workflows using Pilot-API.
The Pilot Compute Description defines the resources on which the jobs are managed. The following resource specifications need to be provided:
- service_url - specifies the SAGA-Bliss job adaptor and the hostname of the resource on which jobs can be executed. For remote hosts, password-less login needs to be enabled.
- number_of_processes - specifies the total number of processes that need to be allocated to run the jobs.
- queue - specifies the job queue to be used.
- working_directory - specifies the directory in which the Pilot-Job agent executes.
- walltime - specifies the number of minutes for which the resources are requested.
- file_transfer - specifies the files that need to be transferred in order to execute the jobs successfully. Generally files common to all the jobs need to be listed here.
pilot_compute_description.append({
    "service_url": "sge+ssh://localhost",
    "number_of_processes": 12,
    "allocation": "XSEDE12-SAGA",
    "queue": "development",
    "working_directory": os.getenv("HOME") + "/agent",
    "walltime": 10
})
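As a minimal sketch of how these fields fit together, the following builds the description as a plain Python dictionary and checks it before it is handed to BigJob. The helper name `check_pilot_description` and the choice of required keys are assumptions made for illustration; they are not part of the Pilot-API.

```python
import os

# Keys treated as mandatory in this sketch (an assumption, not a BigJob rule).
REQUIRED_KEYS = ("service_url", "number_of_processes", "working_directory", "walltime")


def check_pilot_description(description):
    """Return a list of problems found in a pilot compute description dict.

    Hypothetical helper: BigJob itself does not provide this function.
    """
    problems = []
    for key in REQUIRED_KEYS:
        if key not in description:
            problems.append("missing key: " + key)
    # A service_url combines an adaptor scheme and a hostname.
    if "://" not in description.get("service_url", ""):
        problems.append("service_url should look like <adaptor>://<host>")
    for key in ("number_of_processes", "walltime"):
        value = description.get(key)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(key + " should be a positive integer")
    return problems


pilot_compute_description = {
    "service_url": "sge+ssh://localhost",
    "number_of_processes": 12,
    "allocation": "XSEDE12-SAGA",
    "queue": "development",
    "working_directory": os.path.join(os.getenv("HOME", "."), "agent"),
    "walltime": 10,  # minutes
}

problems = check_pilot_description(pilot_compute_description)
```

An empty `problems` list means the dictionary passed these basic checks; it does not guarantee that the queue or allocation exists on the target machine.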
The Compute Unit Description allows the user to specify the actual job parameters and data needed to execute the job.
- executable - specifies the executable.
- arguments - specifies the list of arguments to be passed to executable.
- environment - specifies the list of environment variables to be set for the job to execute successfully.
- working_directory - specifies the directory in which the job has to execute. If not specified, Pilot-Job creates a default directory.
- number_of_processes - specifies the number of processes to be assigned for the job execution.
- spmd_variation - specifies the type of job. The default is a single job.
- output - specifies the file in which the standard output of the job execution is stored.
- error - specifies the file in which the standard error of the job execution is stored.
- file_transfer - specifies the files that need to be transferred in order to execute the job successfully. Generally files specific to the job need to be listed here.
compute_unit_description = {
    "executable": "/bin/echo",
    "arguments": ["Hello", "$ENV1", "$ENV2"],
    "environment": ['ENV1=env_arg1', 'ENV2=env_arg2'],
    "number_of_processes": 4,
    "spmd_variation": "mpi",
    "output": "stdout.txt",
    "error": "stderr.txt"
}
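To see how the `environment` entries relate to the `$ENV1` and `$ENV2` placeholders in `arguments`, the sketch below expands them locally with Python's `string.Template`. It only mimics shell-style substitution for preview purposes; the function name `preview_arguments` is hypothetical, and how the expansion actually happens on the remote resource is an assumption not confirmed by this page.

```python
from string import Template


def preview_arguments(description):
    """Expand $VARS in the arguments using the description's environment list.

    Hypothetical helper for local inspection only; not part of the Pilot-API.
    """
    # Each environment entry has the form NAME=value.
    env = dict(item.split("=", 1) for item in description.get("environment", []))
    # safe_substitute leaves unknown variables untouched instead of raising.
    return [Template(arg).safe_substitute(env) for arg in description.get("arguments", [])]


compute_unit_description = {
    "executable": "/bin/echo",
    "arguments": ["Hello", "$ENV1", "$ENV2"],
    "environment": ["ENV1=env_arg1", "ENV2=env_arg2"],
    "number_of_processes": 4,
    "spmd_variation": "mpi",
    "output": "stdout.txt",
    "error": "stderr.txt",
}

preview = preview_arguments(compute_unit_description)
```

Here `preview` holds the argument list with both placeholders replaced by the values from the `environment` entries.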
BigJob uses SAGA-Bliss to connect to different grid middleware. SAGA-Bliss is installed automatically by BigJob. For more information on SAGA-Bliss, please read the SAGA Tutorial.
We do not want to use the system Python installation on XSEDE, because it is not uniform across all machines. Instead, you need a place where you can install BigJob locally. A small tool called virtualenv allows you to create a local Python software repository that behaves exactly like the global Python repository, with the only difference being that you have write access to it. To create your local Python environment, run the following command (you can install virtualenv on most systems via apt-get, yum, etc.):
virtualenv $HOME/.bigjob
If you don't have virtualenv installed and you don't have root access to your machine, you can use the following script instead:
curl --insecure -s https://raw.github.com/pypa/virtualenv/master/virtualenv.py | python - $HOME/.bigjob
You need to activate your Python environment in order to make it work. Run the command below. It will temporarily modify your PATH so that the Python interpreter in $HOME/.bigjob/bin is used, and packages are looked up in $HOME/.bigjob/lib/python2.7/site-packages/ instead of the system site-packages directory:
source $HOME/.bigjob/bin/activate
Activating the virtualenv is very important. If you don't activate your virtual Python environment, the rest of this tutorial will not work. You can usually tell that your environment is activated properly if your bash command-line prompt starts with (.bigjob).
After your virtual environment is active, you are ready to install BigJob. BigJob is available via PyPI and can be installed using pip by typing:
pip install BigJob
To make sure that your installation works, run the following command to check if the BigJob module can be imported by the interpreter:
python -c "import pilot; print pilot.version"
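The same check can be done from a script, together with a test for whether a virtual environment is active. This is a minimal sketch using only the standard library; the function names `in_virtualenv` and `module_available` are illustrative and not part of BigJob.

```python
import importlib
import sys


def in_virtualenv():
    """True if the interpreter runs inside a virtualenv or venv."""
    # Older virtualenv releases set sys.real_prefix; newer ones and venv
    # make sys.base_prefix differ from sys.prefix.
    return hasattr(sys, "real_prefix") or getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def module_available(name):
    """True if the named module can be imported by this interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


# "pilot" is the module name used in the import check in this tutorial.
status = {"virtualenv": in_virtualenv(), "pilot": module_available("pilot")}
```

If `status["pilot"]` is false while `status["virtualenv"]` is true, BigJob was most likely installed into a different Python environment than the one currently active.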
Prior to running these examples, you will need to create a directory called 'agent' in the directory from which you run your scripts. BigJob uses this as its working directory. For example, you might create the agent directory in the $HOME directory by typing:
mkdir $HOME/agent
Note: It is good practice to run your scripts out of $SCRATCH or $WORK. In this case, you would type mkdir $SCRATCH/<anySubDirectoryOfYourChoice>/agent and run your script from $SCRATCH/<anySubDirectoryOfYourChoice>. $HOME should only be used for the tutorial scripts!
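The directory setup can also be done from Python at the start of a script. The sketch below prefers $SCRATCH, then $WORK, then $HOME, and creates the agent directory if it does not exist yet. Both function names and the preference order are assumptions made for illustration.

```python
import os


def pick_base_dir(environ=os.environ):
    """Return the first of $SCRATCH, $WORK, $HOME that is set (assumed order)."""
    for name in ("SCRATCH", "WORK", "HOME"):
        value = environ.get(name)
        if value:
            return value
    # Fall back to the current directory if none of the variables is set.
    return os.getcwd()


def make_agent_dir(run_dir):
    """Create <run_dir>/agent if needed and return its path."""
    agent = os.path.join(run_dir, "agent")
    if not os.path.isdir(agent):
        os.makedirs(agent)
    return agent
```

A script could then call `make_agent_dir(os.path.join(pick_base_dir(), "mysubdir"))`, where `mysubdir` is a placeholder for a subdirectory of your choice, and pass the returned path as `working_directory`.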
If you are planning to submit from one resource to another, you must have password-less SSH login enabled from the submitting resource to the target resource. This is achieved by adding your public key from the submitting resource to the authorized_keys file on the target machine. Please see our guide to configuring SSH Password-Less Login.
Examples of when you would need password-less login: (1) you want to submit from your local machine to an XSEDE resource, (2) you want to submit from one XSEDE resource to another, (3) you want to submit from your local cluster to external clusters.
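Before submitting, it can help to verify that password-less login really works. The sketch below builds an `ssh` command with `BatchMode=yes`, which makes OpenSSH fail instead of prompting for a password, and runs a no-op command on the target. The function names are hypothetical, and the sketch assumes an OpenSSH client is available as `ssh` on the submitting machine.

```python
import subprocess


def ssh_check_command(host, user=None):
    """Build an ssh command that succeeds only without a password prompt."""
    target = host if user is None else user + "@" + host
    # BatchMode=yes disables interactive password prompts;
    # ConnectTimeout avoids hanging on unreachable hosts.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", target, "true"]


def passwordless_login_works(host, user=None):
    """Run the check; True if ssh exits with status 0."""
    return subprocess.call(ssh_check_command(host, user)) == 0
```

For example, `passwordless_login_works("lonestar.tacc.utexas.edu")` would attempt the connection with the current user name.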
The following table describes which adaptors are supported on XSEDE and which machines use which adaptors (e.g. Lonestar uses the SGE batch queuing system, Kraken uses Torque for job submission). Refer to this table to find out how to edit the service_url in the pilot_compute_description as shown above.
Supported Adaptors | Description | Information | Machine |
---|---|---|---|
fork | Submit jobs only on localhost head node. Password-less login to localhost is required. Example usage: fork://localhost | | localhost |
SSH | Submit jobs on target machine's head node. Password-less login to target machine is required. Example usage: ssh://eric1.loni.org | Allows submitting jobs to a remote host via SSH | localhost |
PBS | Submit jobs to target machine's scheduling system. Password-less login to target machine is required. Example usage: Remote (over SSH): pbs+ssh://eric1.loni.org or Local: pbs://localhost | Interfaces with a PBS, PBS Pro or TORQUE scheduler locally or remotely via SSH | Trestles, Kraken |
SGE | Submit jobs to target machine's scheduling system. Password-less login to target machine is required. Example usage: Remote (over SSH): sge+ssh://lonestar.tacc.utexas.edu or Local: sge://localhost | Interfaces with a Sun Grid Engine (SGE) scheduler locally or remotely via SSH | Lonestar, Ranger |
GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate grid proxy (myproxy-logon) before executing the BigJob application. Example usage of URL gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge | Please find the globus resource URLs of XSEDE machines at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html | Kraken, Lonestar, Ranger, Trestles |
Torque+GSISSH | Submit jobs using gsissh. Globus certificates are required. Initiate grid proxy (myproxy-logon) before executing the BigJob application. Example usage of URL: xt5torque+gsissh://gsissh.kraken.nics.xsede.org | Please find the GSISSH resource URLs of XSEDE machines at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html | Kraken |
Note: For Lonestar/Ranger, sge+ssh:// is required when the target machine is not the same as the machine that the script is being run on. For example, if you run the script on Lonestar but wish to submit jobs TO Ranger, you must use sge+ssh://, but if you are running the script ON Lonestar and submitting jobs TO Lonestar only, you can use sge://. The same is true with PBS. Also note that to run from your LOCAL laptop TO Lonestar, you would use sge+ssh:// and you MUST have SSH keys configured.
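The local-versus-remote rule in this note can be captured in a small helper that composes the `service_url`. This is a sketch covering only the SGE and PBS adaptors discussed in the note; the function name `build_service_url` and its parameters are hypothetical.

```python
def build_service_url(adaptor, host="localhost", remote=False):
    """Compose a service_url such as sge://localhost or sge+ssh://<host>.

    Hypothetical helper; only the two batch adaptors from the note are handled.
    """
    if adaptor not in ("sge", "pbs"):
        raise ValueError("this sketch only handles 'sge' and 'pbs'")
    # Submitting to a different machine goes over SSH, hence the +ssh suffix.
    scheme = adaptor + "+ssh" if remote else adaptor
    return scheme + "://" + host


# Script runs on Lonestar and submits to Lonestar.
local_url = build_service_url("sge")
# Script runs elsewhere (laptop, Ranger) and submits to Lonestar.
remote_url = build_service_url("sge", "lonestar.tacc.utexas.edu", remote=True)
```

The resulting string is what goes into the `service_url` field of the `pilot_compute_description`.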
You can work through examples for running the BigJob code by visiting the comprehensive BigJob tutorial.
- BigJob Tutorial: Learn more about BigJob and work through examples.