diff --git a/search/search_index.json b/search/search_index.json index 4726c00..d0f6029 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"DiRAC Extreme Scaling User Documentation","text":"
DiRAC Extreme Scaling is part of the DiRAC National HPC Service. You can find more information on the service and the research it supports on the DiRAC website.
The DiRAC Extreme Scaling service is an HPC resource for UK researchers. DiRAC Extreme Scaling is provided by UKRI, EPCC and the University of Edinburgh. The hardware is provided by ATOS.
"},{"location":"#what-the-documentation-covers","title":"What the documentation covers","text":"The documentation currently includes:
The source for this documentation is publicly available in the DiRAC Extreme Scaling documentation GitHub repository so that anyone can contribute to improve the documentation for the service. Contributions can be in the form of improvements or additions to the content and/or the addition of Issues providing suggestions for how it can be improved.
Full details of how to contribute can be found in the README.md
file of the repository.
The Tursa User Guide covers all aspects of use of the Tursa resource. This includes fundamentals (required by all users to use the system effectively), best practice for getting the most out of Tursa and more technical topics.
The Tursa User Guide contains the following sections:
On the Tursa system, interactive access can be achieved via SSH, either directly from a command line terminal or using an SSH client. In addition data can be transferred to and from the Tursa system using scp
from the command line or by using a file transfer client.
This section covers the basic connection methods.
Before following the process below, we assume you have set up an account on Tursa through the DiRAC SAFE. Documentation on how to do this can be found at:
Linux distributions come installed with a terminal application that can be used for SSH access to the login nodes. Linux users will have different terminals depending on their distribution and window manager (e.g. GNOME Terminal in GNOME, Konsole in KDE). Consult your Linux distribution's documentation for details on how to load a terminal.
"},{"location":"tursa-user-guide/connecting/#macos","title":"MacOS","text":"MacOS users can use the Terminal application, located in the Utilities folder within the Applications folder.
"},{"location":"tursa-user-guide/connecting/#windows","title":"Windows","text":"A typical Windows installation will not include a terminal client, though there are various clients available. We recommend all our Windows users to download and install MobaXterm to access Tursa. It is very easy to use and includes an integrated X server with SSH client to run any graphical applications on Tursa.
You can download MobaXterm Home Edition (Installer Edition) from the following link:
Double-click the downloaded Microsoft Installer file (.msi), and the Windows wizard will automatically guide you through the installation process. Note that you might need administrator rights to install on some versions of Windows. Also check that Windows Firewall has not blocked any features of this program after installation.
Start MobaXterm and then click \"Start local terminal\"
Tips
If you download the .zip file rather than the .msi, make sure you unzip before attempting to run the installer.
If you are using a \"managed desktop\" machine, and so do not have admin rights, you can use the Portable edition of MobaXterm, which does not need install privileges.
If this is your first time using MobaXterm, check that a permanent /home directory has been set up (or all saved info will be lost from session to session). Go to \"Settings\" -> \"Configuration\" -> check that the path to \"Persistent home directory\" is set, and make sure the path is \"private\" if prompted.
Any ssh key generated in MobaXterm will, by default, be stored in the permanent /home directory (see above) i.e. if your /home directory is _MyDocuments_\\MobaXterm\\home
then within that folder you will find _MyDocuments_\\MobaXterm\\home\\.ssh
with your keys. This folder will be 'hidden' by default so you may need to tick 'Hidden items' under 'View' in Windows Explorer to see it.
MobaXterm also allows you to set up ssh sessions with the username, login host and key details saved. You are welcome to use this, rather than using the \"Local terminal\" but we are not able to assist with debugging connection issues if you choose this method. We recommend sticking to command line terminal access.
To access Tursa, you need to use two credentials:
You can find more detailed instructions on how to set up your credentials to access Tursa from Windows, macOS and Linux below.
"},{"location":"tursa-user-guide/connecting/#ssh-key-pairs","title":"SSH Key Pairs","text":"You will need to generate an SSH key pair protected by a passphrase to access Tursa.
Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:
$ ssh-keygen -t rsa -C \"your@email.com\"\n...\n-bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\nEnter passphrase (empty for no passphrase): [Passphrase]\nEnter same passphrase again: [Passphrase]\nYour identification has been saved in /Home/user/.ssh/id_rsa.\nYour public key has been saved in /Home/user/.ssh/id_rsa.pub.\nThe key fingerprint is:\n03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\nThe key's randomart image is:\n+--[ RSA 2048]----+\n| . ...+o++++. |\n| . . . =o.. |\n|+ . . .......o o |\n|oE . . |\n|o = . S |\n|. +.+ . |\n|. oo |\n|. . |\n| .. |\n+-----------------+\n
(remember to replace \"your@email.com\" with your e-mail address).
"},{"location":"tursa-user-guide/connecting/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"You should now upload the public part of your SSH key pair to the SAFE by following the instructions at:
Login to SAFE. Then:
Once you have done this, your SSH key will be added to your Tursa account.
Remember, you will need to use both an SSH key and password to log into Tursa so you will also need to collect your initial password before you can log into Tursa. We cover this next.
Note
If you want to connect to Tursa from more than one machine, e.g. from your home laptop as well as your work laptop, you should generate an ssh key on each machine, and add each of the public keys into SAFE.
"},{"location":"tursa-user-guide/connecting/#initial-passwords-up-to-13-feb-2024","title":"Initial passwords (up to 13 Feb 2024)","text":"The SAFE web interface is used to provide your initial password for logging onto Tursa (see the SAFE Documentation for more details on requesting accounts and picking up passwords).
Note
You may now change your password on the Tursa machine itself using the passwd command or when you are prompted the first time you login. This change will not be reflected in the SAFE. If you forget your password, you should use the SAFE to request a new one-shot password.
"},{"location":"tursa-user-guide/connecting/#mfa-time-based-one-time-passcode-totp-from-13-feb-2024","title":"MFA Time-based one-time passcode (TOTP) (from 13 Feb 2024)","text":"You will need to use both an SSH key and time-based one-time passcode to log into Tursa so you will also need to set up a method for generating a TOTP code before you can log into Tursa.
"},{"location":"tursa-user-guide/connecting/#first-login-from-a-new-account-password-required","title":"First login from a new account: password required","text":"Important
You will not use your password when logging on to Tursa after the first login for a new account.
As an additional security measure, you will also need to use a password from SAFE for your first login to Tursa with a new account. When you log into Tursa for the first time with a new account, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed. You will no longer need this password to log into Tursa from this point forwards; you will use your SSH key and TOTP as described above.
"},{"location":"tursa-user-guide/connecting/#ssh-clients","title":"SSH Clients","text":"Interaction with Tursa is done remotely, over an encrypted communication channel, Secure Shell version 2 (SSH-2). This allows command-line access to one of the login nodes of a Tursa, from which you can run commands or use a command-line text editor to edit files. SSH can also be used to run graphical programs such as GUI text editors and debuggers when used in conjunction with an X client.
"},{"location":"tursa-user-guide/connecting/#logging-in","title":"Logging in","text":"You can use the following command from the terminal window to login into Tursa:
ssh username@tursa.dirac.ed.ac.uk\n
You need to enter both credentials correctly to be able to access Tursa.
Tip
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key with the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_Tursa
you would use the command ssh -i keys/id_rsa_Tursa username@login.tursa.ac.uk
to log in.
Tip
When you first log into Tursa, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed
To allow remote programs, especially graphical applications, to control your local display (for example, to open a new GUI window for a debugger), use:
ssh -X username@tursa.dirac.ed.ac.uk\n
Some sites recommend using the -Y
flag. While this can fix some compatibility issues, the -X
flag is more secure.
Current MacOS systems do not have an X window system. Users should install the XQuartz package to allow for SSH with X11 forwarding on MacOS systems:
Typing in the full command to login or transfer data to Tursa can become tedious as it often has to be repeated many times. You can use the SSH configuration file, usually located on your local machine at .ssh/config
to make things a bit more convenient.
Each remote site (or group of sites) can have an entry in this file which may look something like:
Host tursa\n HostName tursa.dirac.ed.ac.uk\n User username\n
(remember to replace username
with your actual username!).
The Host tursa
line defines a short name for the entry. In this case, instead of typing ssh username@tursa.dirac.ed.ac.uk
to access the Tursa login nodes, you could use ssh tursa
instead. The remaining lines define the options for the tursa
host.
Hostname tursa.dirac.ed.ac.uk
- defines the full address of the host
User username
- defines the username to use by default for this host (replace username with your own username on the remote host)
Now you can use SSH to access Tursa without needing to enter your username or the full hostname every time:
$ ssh tursa\n
You can set up as many of these entries as you need in your local configuration file. Other options are available. See the ssh_config man page (or man ssh_config
on any machine with SSH installed) for a description of the SSH configuration file. You may find the IdentityFile
option useful if you have to manage multiple SSH key pairs for different systems as this allows you to specify which SSH key to use for each system.
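For example, a config entry that also sets IdentityFile might look like the following (a sketch; the key file name is illustrative):
Host tursa\n HostName tursa.dirac.ed.ac.uk\n User username\n IdentityFile ~/.ssh/id_rsa_tursa\n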
Bug
There is a known bug with Windows ssh-agent. If you get the error message: Warning: agent returned different signature type ssh-rsa (expected rsa-sha2-512)
, you will need to either specify the path to your ssh key in the command line (using the -i
option as described above) or add the path to your SSH config file by using the IdentityFile
option.
If you find you are unable to connect via SSH, there are a number of ways you can try to diagnose the issue. Some of these are collected below - if you are having difficulties connecting we suggest trying these before contacting the Tursa service desk.
"},{"location":"tursa-user-guide/connecting/#use-the-usertursadiracedacuk-syntax-rather-than-l-user-tursadiracedacuk","title":"Use theuser@tursa.dirac.ed.ac.uk
syntax rather than -l user tursa.dirac.ed.ac.uk
","text":"We have seen a number of instances where people using the syntax
ssh -l user tursa.dirac.ed.ac.uk\n
have not been able to connect properly and get prompted for a password many times. We have found that using the alternative syntax:
ssh user@tursa.dirac.ed.ac.uk\n
works more reliably. If you are using the -l user
option to connect and are seeing issues, then try using user@tursa.dirac.ed.ac.uk
instead.
Try the command ping -c 3 tursa.dirac.ed.ac.uk
. If you successfully connect to the login node, the output should include:
--- tursa.dirac.ed.ac.uk ping statistics ---\n3 packets transmitted, 3 received, 0% packet loss, time 38ms\n
(the ping time '38ms' is not important). If not all packets are received there could be a problem with your internet connection, or the login node could be unavailable.
"},{"location":"tursa-user-guide/connecting/#password","title":"Password","text":"If you are having trouble entering your password consider using a password manager, from which you can copy and paste it. This will also help you generate a secure password. If you need to reset your password, instructions for doing so can be found in the SAFE documentation
Windows users please note that Ctrl+V
does not work to paste in to PuTTY, MobaXterm, or PowerShell. Instead use Shift+Ins
to paste. Alternatively, right-click and select 'Paste' in PuTTY and MobaXterm, or simply right-click to paste in PowerShell.
If you get the error message Permission denied (publickey)
this can indicate a problem with your SSH key. Some things to check:
Have you uploaded the key to SAFE? Please note that if the same key is reuploaded SAFE will not map the \"new\" key to Tursa. If for some reason this is required, please delete the key first, then reupload.
Is ssh using the correct key? You can check which keys are being found and offered by ssh using ssh -vvv
. If your private key has a non-default name you can use the -i
flag to provide it to ssh, i.e. ssh -i path/to/key username@tursa.dirac.ed.ac.uk
.
Are you entering the passphrase correctly? You will be asked for your private key's passphrase first. If you enter it incorrectly you will usually be asked to enter it again, and usually up to three times in total, after which ssh will fail with Permission denied (publickey)
. If you would like to confirm your passphrase without attempting to connect, you can use ssh-keygen -y -f /path/to/private/key
. If successful, this command will print the corresponding public key. You can also use this to check it is the one uploaded to SAFE.
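For example, with an illustrative key path:
ssh-keygen -y -f ~/.ssh/id_rsa_tursa\n
You will be asked for the key's passphrase and, if it is correct, the corresponding public key is printed to the terminal.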
Are permissions correct on the ssh key? One common issue is that the permissions are incorrect on either the key file, or the directory it's contained in. On Linux/MacOS for example, if your private keys are held in ~/.ssh/
you can check this with ls -al ~/.ssh
. This should give something similar to the following output:
$ ls -al ~/.ssh/\n drwx------. 2 user group 48 Jul 15 20:24 .\n drwx------. 12 user group 4096 Oct 13 12:11 ..\n -rw-------. 1 user group 113 Jul 15 20:23 authorized_keys\n -rw-------. 1 user group 12686 Jul 15 20:23 id_rsa\n -rw-r--r--. 1 user group 2785 Jul 15 20:23 id_rsa.pub\n -rw-r--r--. 1 user group 1967 Oct 13 14:11 known_hosts\n
The important section here is the string of letters and dashes at the start, for the lines ending in .
, id_rsa
, and id_rsa.pub
, which indicate permissions on the containing directory, private key, and public key respectively. If your permissions are not correct, they can be set with chmod
. Consult the table below for the relevant chmod
command. On Windows, permissions are handled differently but can be set by right-clicking on the file and selecting Properties > Security > Advanced. The user, SYSTEM, and Administrators should have Full control
, and no other permissions should exist for both public and private key files, and the containing folder.
chmod Code
Directory drwx------ 700
Private Key -rw------- 600
Public Key -rw-r--r-- 644
chmod
can be used to set permissions on the target in the following way: chmod <code> <target>
. So for example to set correct permissions on the private key file id_rsa_Tursa
one would use the command chmod 600 id_rsa_Tursa
.
Tip
Unix file permissions can be understood in the following way. There are three groups that can have file permissions: (owning) users, (owning) groups, and others. The available permissions are read, write, and execute. The first character indicates whether the target is a file -
, or directory d
. The next three characters indicate the owning user's permissions. The first character is r
if they have read permission, -
if they don't, the second character is w
if they have write permission, -
if they don't, the third character is x
if they have execute permission, -
if they don't. This pattern is then repeated for group, and other permissions. For example the pattern -rw-r--r--
indicates that the owning user can read and write the file, members of the owning group can read it, and anyone else can also read it. The chmod
codes are constructed by treating the user, group, and owner permission strings as binary numbers, then converting them to decimal. For example the permission string -rwx------
becomes 111 000 000
-> 700
.
Verbose debugging output from ssh
can be very useful for diagnosing the issue. In particular, it can be used to distinguish between problems with the SSH key and password - further details are given below. To enable verbose output add the -vvv
flag to your SSH command. For example:
ssh -vvv username@tursa.dirac.ed.ac.uk\n
The output is lengthy, but somewhere in there you should see lines similar to the following:
debug1: Next authentication method: keyboard-interactive\ndebug2: userauth_kbdint\ndebug3: send packet: type 50\ndebug2: we sent a keyboard-interactive packet, wait for reply\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 1\nPassword:\ndebug3: send packet: type 61\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 0\ndebug3: send packet: type 61\ndebug3: receive packet: type 51\nAuthenticated with partial success.\ndebug1: Authentications that can continue: publickey,password\n
If you do not see the Password:
prompt you may have connection issues, or there could be a problem with the Tursa login nodes. If you do not see Authenticated with partial success
it means your password was not accepted. You will be asked to re-enter your password, usually two more times before the connection will be rejected. Consider the suggestions under Password above. If you do see Authenticated with partial success
, it means your password was accepted, and your SSH key will now be checked.
You should next see something similar to:
debug1: Next authentication method: publickey\ndebug1: Offering public key: RSA SHA256:<key_hash> <path_to_private_key>\ndebug3: send_pubkey_test\ndebug3: send packet: type 50\ndebug2: we sent a publickey packet, wait for reply\ndebug3: receive packet: type 60\ndebug1: Server accepts key: pkalg rsa-sha2-512 blen 2071\ndebug2: input_userauth_pk_ok: fp SHA256:<key_hash>\ndebug3: sign_and_send_pubkey: RSA SHA256:<key_hash>\nEnter passphrase for key '<path_to_private_key>':\ndebug3: send packet: type 50\ndebug3: receive packet: type 52\ndebug1: Authentication succeeded (publickey).\n
Most importantly, you can see which files ssh has checked for private keys, and you can see if any key is accepted. The line Authentication succeeded
indicates that the SSH key has been accepted. By default ssh will go through a list of standard private key files, as well as any you have specified with -i
or a config file. This is fine, as long as one of the files mentioned is the one that matches the public key uploaded to SAFE.
If your SSH key passphrase is incorrect, you will be asked to try again up to three times in total, before being disconnected with Permission denied (publickey)
. If you enter your passphrase correctly, but still see this error message, please consider the advice under SSH key above.
This section covers best practice and tools for data management on Tursa as well as information on the storage available on the system.
Information
If you have any questions on data management and transfer please do not hesitate to contact the DiRAC service desk at dirac-support@epcc.ed.ac.uk.
"},{"location":"tursa-user-guide/data/#useful-resources-and-links","title":"Useful resources and links","text":"We strongly recommend that you give some thought to how you use the various data storage facilities that are part of the Tursa service. This will not only allow you to use the machine more effectively but also to ensure that your valuable data is protected.
"},{"location":"tursa-user-guide/data/#tursa-storage","title":"Tursa storage","text":"Tursa has two different storage systems available:
The Tursa storage is provided by a parallel Lustre file system that provides your home directories and working storage. When you log in you will be placed in your home directory.
The home directory for each user is located at:
/home/[project code]/[group code]/[username]\n
where
[project code]
is the code for your project (e.g., x01);[group code]
is the code for your project group, if your project has groups, (e.g. x01-a) or the same as the project code, if not;[username]
is your login name.Each project is allocated a portion of the total storage available, and the project PI will be able to sub-divide this quota among the groups and users within the project. As is standard practice on UNIX and Linux systems, the environment variable $HOME
is automatically set to point to your home directory.
The tape storage can be made available to any Tursa user on request and can be used to store data from the Lustre parallel file system.
Managing and transferring data to/from the Tursa tape storage is done via the Miria web interface via an SSH tunnel to the Tursa login nodes.
Important
All data on the tape storage is shared project data rather than data associated with individual user accounts. Any data you move to tape will be visible to all users in the same project as you who have access to the tape storage.
"},{"location":"tursa-user-guide/data/#requesting-access-to-the-tape-storage","title":"Requesting access to the tape storage","text":"If you want to use the Tursa tape storage, you should contact the DiRAC Service Desk with the username and project ID you want to use to access the storage.
"},{"location":"tursa-user-guide/data/#data-locations","title":"Data locations","text":"In order to move data to the tape storage it must exist in a specific directory on the Tursa Lustre file system. You will need to move or copy the data to this location before it can be moved to tape and when you restore data from tape it will be placed in this location.
There is one directory per project on Tursa. The directory has the path:
/mnt/lustre/tursafs1/archive/[project code]\n
So, for example, the directory for project dp001
would be:
/mnt/lustre/tursafs1/archive/dp001\n
"},{"location":"tursa-user-guide/data/#setup-the-ssh-tunnel-for-miria","title":"Setup the SSH tunnel for Miria","text":"Once your tape storage access has been setup and you have moved data to the archive directory, you will need to connect to the Miria web interface in a web browser on your local system by setting up an SSH tunnel to the Tursa login nodes.
You do this by logging into Tursa in the usual way (with your SSH key and password) and adding the -L 9080:10.144.20.95:443
option to the ssh
command.
For example, if your username is dc-user1
, you would setup the tunnel by logging into Tursa with (assuming your SSH key is in the default location):
ssh -L 9080:10.144.20.95:443 dc-user1@tursa.dirac.ed.ac.uk\n
Enter your SSH key passphrase and password in the usual way.
Note
You will need to setup the SSH tunnel each time you want to access the Miria interface.
"},{"location":"tursa-user-guide/data/#access-the-miria-interface","title":"Access the Miria interface","text":"Once you have setup the SSH tunnel, you should be able to access the Miria interface in a web browser on your local system. Open a new tab and enter the URL:
You should see an interface asking you for a username and password. Use the username and password that you use to log into Tursa to log into the tape storage interface.
"},{"location":"tursa-user-guide/data/#transfer-data-from-tursa-lustre-to-tape","title":"Transfer data from Tursa Lustre to tape","text":"You use the \"Easy Move\" option from the left-hand menu to transfer data.
Your transfer request will be added to the queue. You can check on progress by selecting the \"Activity\" option in the left hand menu.
"},{"location":"tursa-user-guide/data/#restore-data-from-tape-to-tursa-lustre","title":"Restore data from tape to Tursa Lustre","text":"You use the \"Easy Move\" option from the left-hand menu to transfer data.
Your transfer request will be added to the queue. You can check on progress by selecting the \"Activity\" option in the left hand menu.
Bug
If you restore a file rather than a directory, the Miria tool will give the file the name NULL
once it is restored. You should use the mv
command to rename the file to the correct name once it has been restored.
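For example, if the restored file should be called mydata.tar (an illustrative name), you could rename it with:
mv NULL mydata.tar\n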
How you share data with other Tursa users depends on whether or not they belong to the same project as you. Each project has two shared folders that can be used for sharing data.
"},{"location":"tursa-user-guide/data/#sharing-data-with-tursa-users-in-your-project","title":"Sharing data with Tursa users in your project","text":"Each project has an inner shared folder.
/home/[project code]/[project code]/shared\n
This folder has read/write permissions for all project members. You can place any data you wish to share with other project members in this directory. For example, if your project code is x01 the inner shared folder would be located at /home/x01/x01/shared
.
Each project also has an outer shared folder:
/home/[project code]/shared\n
It is writable by all project members and readable by any user on the system. You can place any data you wish to share with other Tursa users who are not members of your project in this directory. For example, if your project code is x01 the outer shared folder would be located at /home/x01/shared
.
You should check the permissions of any files that you place in the shared area, especially if those files were created in your own Tursa account, as such files are likely to be readable by you only.
The chmod
command below shows how to make sure that a file placed in the outer shared folder is also readable by all Tursa users.
chmod a+r /home/x01/shared/your-shared-file.txt\n
Similarly, for the inner shared folder, chmod
can be called such that read permission is granted to all users within the x01 project.
chmod g+r /home/x01/x01/shared/your-shared-file.txt\n
If you're sharing a set of files stored within a folder hierarchy the chmod
is slightly more complicated.
chmod -R a+Xr /home/x01/shared/my-shared-folder\nchmod -R g+Xr /home/x01/x01/shared/my-shared-folder\n
The -R
option ensures that the read permission is enabled recursively and the +X
guarantees that the user(s) you're sharing the folder with can access the subdirectories below my-shared-folder
.
Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.
The method you use to transfer data to/from Tursa will depend on how much you want to transfer and where to. The methods we cover in this guide are:
Before discussing specific data transfer methods, we cover archiving which is an essential process for transferring data efficiently.
"},{"location":"tursa-user-guide/data/#archiving","title":"Archiving","text":"If you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger \"archive\" file for ease of transfer and manipulation. A single large file makes more efficient use of the file system and is easier to move and copy and transfer because significantly fewer meta-data operations are required. Archive files can be created using tools like tar
and zip
.
The tar
command packs files into a \"tape archive\" format. The command has general form:
tar [options] [file(s)]\n
Common options include:
-c
create a new archive-v
verbosely list files processed-W
verify the archive after writing-l
confirm all file hard links are included in the archive-f
use an archive file (for historical reasons, tar writes its output to stdout by default rather than a file).Putting these together:
tar -cvWlf mydata.tar mydata\n
will create and verify an archive.
To extract files from a tar file, the option -x
is used. For example:
tar -xf mydata.tar\n
will recover the contents of mydata.tar
to the current working directory.
To verify an existing tar file against a set of data, the -d
(diff) option can be used. By default, no output will be given if a verification succeeds and an example of a failed verification follows:
$> tar -df mydata.tar mydata/*\nmydata/damaged_file: Mod time differs\nmydata/damaged_file: Size differs\n
Note
tar files do not store checksums with their data, requiring the original data to be present during verification.
Tip
Further information on using tar
can be found in the tar
manual (accessed via man tar
or at man tar).
The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:
zip [options] mydata.zip [file(s)]\n
Common options are:
-r
used to zip up a directory-#
where \"#\" represents a digit ranging from 0 to 9 to specify compression level, 0 being the least and 9 the most. Default compression is -6 but we recommend using -0 to speed up the archiving process.Together:
zip -0r mydata.zip mydata\n
will create an archive.
Note
Unlike tar, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system layout.
The corresponding unzip
command is used to extract data from the archive. The simplest use case is:
unzip mydata.zip\n
which recovers the contents of the archive to the current working directory.
Files in a zip archive are stored with a CRC checksum to help detect data loss. unzip
provides options for verifying this checksum against the stored files. The relevant flag is -t
and is used as follows:
$> unzip -t mydata.zip\nArchive: mydata.zip\n testing: mydata/ OK\n testing: mydata/file OK\nNo errors detected in compressed data of mydata.zip.\n
Tip
Further information on using zip
can be found in the zip
manual (accessed via man zip
or at man zip).
The easiest way of transferring data to/from Tursa is to use one of the standard programs based on the SSH protocol such as scp
, sftp
or rsync
. These all use the same underlying mechanism (SSH) as you normally use to log in to Tursa. So, once the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine (Tursa in this case).
To avoid having to type in your password multiple times you can set up an SSH key pair and use an SSH agent as documented in the User Guide at connecting
.
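As a brief sketch for Linux/macOS (the key path is illustrative), you can start an agent and add your key for the current session with:
eval $(ssh-agent)\nssh-add ~/.ssh/id_rsa\n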
The SSH protocol encrypts all traffic it sends. This means that file transfer using SSH consumes a relatively large amount of CPU time at both ends of the transfer (for encryption and decryption). The Tursa login nodes have fairly fast processors that can sustain about 100 MB/s transfer. The encryption algorithm used is negotiated between the SSH client and the SSH server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm than the default. The aes128-ctr
or aes256-ctr
algorithms are well supported and fast as they are implemented in hardware. These are not usually the default choice when using scp
so you will need to manually specify them.
A single SSH based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth so you may see an overall improvement by running several data transfer operations in parallel. To reduce metadata interactions it is a good idea to overlap transfers of files from different directories.
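As an illustrative sketch, one simple way to run two transfers from different directories in parallel is to background them in the shell (the directory names and username are placeholders):
scp -r dir1 user@tursa.dirac.ed.ac.uk: &\nscp -r dir2 user@tursa.dirac.ed.ac.uk: &\nwait\n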
In addition, you should consider the following when transferring data:
gzip
.The scp
command creates a copy of a file, or if given the -r
flag, a directory either from a local machine onto a remote machine or from a remote machine onto a local machine.
For example, to transfer files to Tursa from a local machine:
scp [options] source user@tursa.dirac.ed.ac.uk:[destination]\n
(Remember to replace user
with your Tursa username in the example above.)
In the above example, the [destination]
is optional, as when left out scp
will copy the source into your home directory. Also, the source
should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.
If you want to request a different encryption algorithm add the -c [algorithm-name]
flag to the scp
options. For example, to use the (usually faster) aes128-ctr encryption algorithm you would use:
scp [options] -c aes128-ctr source user@tursa.dirac.ed.ac.uk:[destination]\n
(Remember to replace user
with your Tursa username in the example above.)
The rsync
command can also transfer data between hosts using a ssh
connection. It creates a copy of a file or, if given the -r
flag, a directory at the given destination, similar to scp
above.
Given the -a
option rsync can also make exact copies (including permissions), this is referred to as mirroring. In this case the rsync
command is executed with ssh
to create the copy on a remote machine.
To transfer files to Tursa using rsync
with ssh
the command has the form:
rsync [options] -e ssh source user@tursa.dirac.ed.ac.uk:[destination]\n
(Remember to replace user
with your Tursa username in the example above.)
In the above example, the [destination]
is optional, as when left out rsync will copy the source into your home directory. Also the source
should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.
Additional flags can be specified for the underlying ssh
command by using a quoted string as the argument of the -e
flag. e.g.
rsync [options] -e \"ssh -c arcfour\" source user@tursa.dirac.ed.ac.uk:[destination]\n
(Remember to replace user
with your Tursa username in the example above.)
Tip
Further information on using rsync
can be found in the rsync
manual (accessed via man rsync
or at man rsync).
Here we have a short example demonstrating transfer of data directly from a laptop/workstation to Tursa.
Note
This guide assumes you are using a command line interface to transfer data. This means the terminal on Linux or macOS, MobaXterm local terminal on Windows or Powershell.
Before we can transfer data to Tursa we need to make sure we have an SSH key set up to access Tursa from the system we are transferring data from. If you are using the same system that you use to log into Tursa then you should be all set. If you want to use a different system you will need to generate a new SSH key there (or use SSH key forwarding) to allow you to connect to Tursa.
Tip
Remember that you will need to use both a key and your password to transfer data to Tursa.
Once we know our keys are setup correctly, we are now ready to transfer data directly between the two machines. We begin by combining our important research data in to a single archive file using the following command:
tar -czf all_my_files.tar.gz file1.txt file2.txt file3.txt\n
We then initiate the data transfer from our system to Tursa, here using rsync
to allow the transfer to be recommenced without needing to start again, in the event of a loss of connection or other failure. For example, using the SSH key in the file ~/.ssh/id_RSA_A2
on our local system:
rsync -Pv -e\"ssh -c aes128-gcm@openssh.com -i $HOME/.ssh/id_RSA_A2\" ./all_my_files.tar.gz otbz19@tursa.dirac.ed.ac.uk:/home/z19/z19/otbz19/\n
Note the use of the -P
flag to allow partial transfer -- the same command could be used to restart the transfer after a loss of connection. The -e
flag allows specification of the ssh command - we have used this to add the location of the identity file. The -c
option specifies the cipher to be used as aes128-gcm
which has been found to increase performance. Unfortunately the ~
shortcut is not correctly expanded, so we have specified the full path. We move our research archive to our project work directory on Tursa.
Note
Remember to replace otbz19
with your username on Tursa.
If we were unconcerned about being able to restart an interrupted transfer, we could instead use the scp
command,
scp -c aes128-gcm@openssh.com -i ~/.ssh/id_RSA_A2 all_my_files.tar.gz otbz19@transfer.dyn.tursa.ac.uk:/home/z19/z19/otbz19/\n
but rsync
is recommended for larger transfers.
Note
Some of the material in this section is closely based on information provided by NASA as part of the documentation for the Aitken HPC system.
"},{"location":"tursa-user-guide/hardware/#system-overview","title":"System overview","text":"Tursa is a Eviden supercomputing system which has a total of 178 GPU compute nodes. Each GPU compute node has a CPU with 48 cores and 4 NVIDIA A100 GPU. Compute nodes are connected together by an Infiniband interconnect.
There are additional login nodes, which provide access to the system.
Compute nodes are only accessible via the Slurm job scheduling system.
There is a single file system which is available on login and compute nodes (see Data management and transfer).
The Lustre file system has a capacity of 5.1 PiB.
The interconnect uses a Fat Tree topology.
"},{"location":"tursa-user-guide/hardware/#interconnect-details","title":"Interconnect details","text":"Tursa has a high performance interconnect with 4x 200 Gb/s infiniband interfaces per node. It uses a 2-layer fat tree topology:
As with most HPC services, Tursa uses a scheduler to manage access to resources and ensure that the thousands of different users of system are able to share the system and all get access to the resources they require. Tursa uses the Slurm software to schedule jobs.
Writing a submission script is typically the most convenient way to submit your job to the scheduler. Example submission scripts (with explanations) for the most common job types are provided below.
Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.
Hint
If you have any questions on how to run jobs on Tursa do not hesitate to contact the DiRAC Service Desk.
You typically interact with Slurm by issuing Slurm commands from the login nodes (to submit, check and cancel jobs), and by specifying Slurm directives that describe the resources required for your jobs in job submission scripts.
"},{"location":"tursa-user-guide/scheduler/#resources","title":"Resources","text":""},{"location":"tursa-user-guide/scheduler/#gpuh","title":"GPUh","text":"Time used on Tursa nodes is measured in GPUh. 1 GPUh = 1 GPU for 1 hour. So a Tursa compute node with 4 GPUs would cost 4 GPUh per hour.
Note
The minimum resource request on Tursa is one full node which is charged at a rate of 4 GPUh per hour.
"},{"location":"tursa-user-guide/scheduler/#checking-available-budget","title":"Checking available budget","text":"You can check in SAFE by selecting Login accounts
from the menu, select the login account you want to query.
Under Login account details
you will see each of the budget codes you have access to listed e.g. dp123 resources
and then under Resource Pool to the right of this, a note of the remaining budgets.
When logged in to the machine you can also use the command
sacctmgr show assoc where user=$LOGNAME format=account,user,maxtresmins%75\n
This will list all the budget codes that you have access to e.g.
Account User MaxTRESMins \n---------- ---------- --------------------------------------------------------------------------- \n t01 dc-user1 gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0 \n z01 dc-user1 \n
This shows that dc-user1
is a member of budgets t01
and z01
. However, the gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0
indicates that the t01
budget can only run GPU jobs in standard (charged) partitions (all other options are disabled, indicated by =0
for CPU standard, CPU low and GPU low). This user can also submit jobs to any partition using the z01
budget.
To see the number of coreh or GPUh remaining you must check in SAFE.
"},{"location":"tursa-user-guide/scheduler/#charging","title":"Charging","text":"Jobs run on Tursa are charged for the time they use i.e. from the time the job begins to run until the time the job ends (not the full wall time requested).
Jobs are charged for the full number of nodes which are requested, even if they are not all used.
Charging takes place at the time the job ends, and the job is charged in full to the budget which is live at the end time.
"},{"location":"tursa-user-guide/scheduler/#basic-slurm-commands","title":"Basic Slurm commands","text":"There are three key commands used to interact with the Slurm on the command line:
sinfo
- Get information on the partitions and resources availablesbatch jobscript.slurm
- Submit a job submission script (in this case called: jobscript.slurm
) to the schedulersqueue
- Get the current status of jobs submitted to the schedulerscancel 12345
- Cancel a job (in this case with the job ID 12345
)We cover each of these commands in more detail below.
"},{"location":"tursa-user-guide/scheduler/#sinfo-information-on-resources","title":"sinfo
: information on resources","text":"sinfo
is used to query information about available resources and partitions. Without any options, sinfo
lists the status of all resources and partitions, e.g.
[dc-user1@tursa-login1 ~]$ sinfo \n\nPARTITION AVAIL TIMELIMIT NODES STATE NODELIST\ncpu up 2-00:00:00 4 alloc tu-c0r0n[66-69]\ncpu up 2-00:00:00 2 idle tu-c0r0n[70-71]\ngpu up 2-00:00:00 1 plnd tu-c0r2n93\ngpu up 2-00:00:00 11 drain tu-c0r0n75,tu-c0r5n[48,51,54,57],tu-c0r6n[48,51,54,57],tu-c0r7n[00,48]\ngpu up 2-00:00:00 112 mix tu-c0r0n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,72,87,90],tu-c0r1n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90,93],tu-c0r2n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90],tu-c0r3n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,90,93],tu-c0r4n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,81,84,87,90,93]\ngpu up 2-00:00:00 56 resv tu-c0r0n93,tu-c0r4n78,tu-c0r5n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45],tu-c0r6n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,60,63,66,69],tu-c0r7n[03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,51,54,57]\ngpu up 2-00:00:00 1 idle tu-c0r3n87\n
alloc
nodes are those that are running jobsidle
nodes are emptydrain
, down
, maint
nodes are unavailable to usersplnd
nodes are reserved for future jobssbatch
: submitting jobs","text":"sbatch
is used to submit a job script to the job submission system. The script will typically contain one or more mpirun
commands to launch parallel tasks.
When you submit the job, the scheduler provides the job ID, which is used to identify this job in other Slurm commands and when looking at resource usage in SAFE.
sbatch test-job.slurm\nSubmitted batch job 12345\n
"},{"location":"tursa-user-guide/scheduler/#squeue-monitoring-jobs","title":"squeue
: monitoring jobs","text":"squeue
without any options or arguments shows the current status of all jobs known to the scheduler. For example:
squeue\n
will list all jobs on Tursa.
The output of this is often large. You can restrict the output to just your jobs by adding the --me
option:
squeue --me\n
"},{"location":"tursa-user-guide/scheduler/#scancel-deleting-jobs","title":"scancel
: deleting jobs","text":"scancel
is used to delete a jobs from the scheduler. If the job is waiting to run it is simply cancelled, if it is a running job then it is stopped immediately. You need to provide the job ID of the job you wish to cancel/stop. For example:
scancel 12345\n
will cancel (if waiting) or stop (if running) the job with ID 12345
.
The Tursa resource limits for any given job are covered by three separate attributes.
The primary resource you can request for your job is the compute node.
Information
The --exclusive
option is enforced on Tursa which means you will always have access to all of the memory on the compute node regardless of how many processes are actually running on the node.
Note
You will not generally have access to the full amount of memory resource on the the node as some is retained for running the operating system and other system processes.
"},{"location":"tursa-user-guide/scheduler/#partitions","title":"Partitions","text":"On Tursa, compute nodes are grouped into partitions. You will have to specify a partition using the --partition
option in your Slurm submission script. The following table has a list of active partitions on Tursa.
You can list the active partitions by running sinfo
.
Tip
You may not have access to all the available partitions.
"},{"location":"tursa-user-guide/scheduler/#quality-of-service-qos","title":"Quality of Service (QoS)","text":"On Tursa, job limits are defined by the requested Quality of Service (QoS), as specified by the --qos
Slurm directive. The following table lists the active QoS on Tursa.
gpu-a100-40
(1-node maximum) or gpu-a100-80
(2-node maximum) partitions. You can find out the QoS that you can use by running the following command:
sacctmgr show assoc user=$USER cluster=tursa format=cluster,account,user,qos%50\n
As long as you have a positive budget, you should use the standard
QoS. Once you have exhausted your budget you can use the low
QoS to continue to run jobs at a lower priority than jobs in the standard
QoS.
Hint
If you have needs which do not fit within the current QoS, please contact the Service Desk and we can discuss how to accommodate your requirements.
Important
Only jobs sizes that are powers of 2 nodes are allowed. i.e. 1, 2, 4, 8, 16, 32, 64 nodes on the gpu
partition and 1, 2, 4 nodes on the cpu
partition.
Job priority on Tursa depends on a number of different factors:
Each of these factors is normalised to a value between 0 and 1, is multiplied with a weight and the resulting values combined to produce a priority for the job. The current job priority formula on Tursa is:
Priority = [10000 * P(QoS)] + [500 * P(Age)] + [300 * P(Fairshare)]\n
The priority factors are:
standard
QoS has a value of 5000 and low
QoS a value of 1.You can view the priorities for current queued jobs on the system with the sprio
command:
[dc-user1@tursa-login1 ~]$ sprio \n JOBID PARTITION PRIORITY SITE AGE FAIRSHARE QOS\n 43963 gpu 5055 0 51 5 5000\n 43975 gpu 5061 0 41 20 5000\n 43976 gpu 5061 0 41 20 5000\n 43982 gpu 5046 0 26 20 5000\n 43986 gpu 5011 0 6 5 5000\n 43996 gpu 5020 0 0 20 5000\n 43997 gpu 5020 0 0 20 5000\n
"},{"location":"tursa-user-guide/scheduler/#troubleshooting","title":"Troubleshooting","text":""},{"location":"tursa-user-guide/scheduler/#slurm-error-messages","title":"Slurm error messages","text":"An incorrect submission will cause Slurm to return an error. Some common problems are listed below, with a suggestion about the likely cause:
sbatch: unrecognized option <text>
One of your options is invalid or has a typo. man sbatch
to help.
error: Batch job submission failed: No partition specified or system default partition
A --partition=
option is missing. You must specify the partition (see the list above). This is most often --partition=standard
.
error: invalid partition specified: <partition>
error: Batch job submission failed: Invalid partition name specified
Check the partition exists and check the spelling is correct.
error: Batch job submission failed: Invalid account or account/partition combination specified
This probably means an invalid account has been given. Check the --account=
options against valid accounts in SAFE.
error: Batch job submission failed: Invalid qos specification
A QoS option is either missing or invalid. Check the script has a --qos=
option and that the option is a valid one from the table above. (Check the spelling of the QoS is correct.)
error: Your job has no time specification (--time=)...
Add an option of the form --time=hours:minutes:seconds
to the submission script. E.g., --time=01:30:00
gives a time limit of 90 minutes.
error: QOSMaxWallDurationPerJobLimit
error: Batch job submission failed: Job violates accounting/QOS policy
(job submit limit, user's size and/or time limits)
The script has probably specified a time limit which is too long for the corresponding QoS. E.g., the time limit for the short QoS is 20 minutes.
The squeue
command allows users to view information for jobs managed by Slurm. Jobs typically go through the following states: PENDING, RUNNING, COMPLETING, and COMPLETED. The first table provides a description of some job state codes. The second table provides a description of the reasons that cause a job to be in a state.
For a full list of see Job State Codes.
Reason Description Priority One or more higher priority jobs exist for this partition or advanced reservation. Resources The job is waiting for resources to become available. BadConstraints The job's constraints can not be satisfied. BeginTime The job's earliest start time has not yet been reached. Dependency This job is waiting for a dependent job to complete. Licenses The job is waiting for a license. WaitingForScheduling No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. Prolog Its PrologSlurmctld program is still running. JobHeldAdmin The job is held by a system administrator. JobHeldUser The job is held by the user. JobLaunchFailure The job could not be launched. This may be due to a file system problem, invalid program name, etc. NonZeroExitCode The job terminated with a non-zero exit code. InvalidAccount The job's account is invalid. InvalidQOS The job's QOS is invalid. QOSUsageThreshold Required QOS threshold has been breached. QOSJobLimit The job's QOS has reached its maximum job count. QOSResourceLimit The job's QOS has reached some resource limit. QOSTimeLimit The job's QOS has reached its time limit. NodeDown A node required by the job is down. TimeLimit The job exhausted its time limit. ReqNodeNotAvail Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's \"reason\" field as \"UnavailableNodes\". Such nodes will typically require the intervention of a system administrator to make available.For a full list of see Job Reasons.
"},{"location":"tursa-user-guide/scheduler/#output-from-slurm-jobs","title":"Output from Slurm jobs","text":"Slurm places standard output (STDOUT) and standard error (STDERR) for each job in the file slurm_<JobID>.out
. This file appears in the job's working directory once your job starts running.
Hint
Output may be buffered - to enable live output, e.g. for monitoring job status, add --unbuffered
to the srun
command in your SLURM script.
You specify the resources you require for your job using directives at the top of your job submission script using lines that start with the directive #SBATCH
.
Hint
Most options provided using #SBATCH
directives can also be specified as command line options to srun
.
If you do not specify any options, then the default for each option will be applied. As a minimum, all job submissions must specify the budget that they wish to charge the job too with the option:
--account=<budgetID>
your budget ID is usually something like t01
or t01-test
. You can see which budget codes you can charge to in SAFE.Other common options that are used are:
--time=<hh:mm:ss>
the maximum walltime for your job. e.g. For a 6.5 hour walltime, you would use --time=6:30:0
.--job-name=<jobname>
set a name for the job to help identify it inIn addition, parallel jobs will also need to specify how many nodes, parallel processes and threads they require.
--nodes=<nodes>
the number of nodes to use for the job.--tasks-per-node=<processes per node>
the number of parallel processes (e.g. MPI ranks) per node. For Grid on GPU nodes this will typically be 4 to give 1 MPI process per GPU. The CPU nodes have 128 cores per node.--cpus-per-task=<stride between processes>
for Grid jobs on GPU nodes where you typically use 1 MPI process per GPU, 4 per node, this will usually be 12 (so that the 48 cores on a node are evenly divided between the 4 MPI processes)--gres=gpu:4
the number of GPU to use per node. This will almost always be 4 to use all GPUs on a node. (This option should not be specified for jobs on the CPU nodes.)If you are happy to have any GPU type for your job (A100-40 or A100-80) then you select the gpu
partition:
--partition=gpu
If you wish to use just the A100-80 GPU nodes which have higher memory, you add the following option:
--partition=gpu-a100-80
request the job is placed on nodes with high-memory (80 GB) GPUs - there are 64 high memory GPU nodes on the system. To just use the A100-40 GPU nodes:
--partition=gpu-a100-40
request the job is placed on nodes with standard memory (40 GB) GPUs.If you do not specfy a partition, the scheduler may use any available node types for the job (equivalent of --partition=gpu
).
Note
For parallel jobs, Tursa operates in a node exclusive way. This means that you are assigned resources in the units of full compute nodes for your jobs (i.e. 32 cores and 4 GPU on GPU A100-40 nodes, 48 cores and 4 GPU on A100-80 nodes, 128 cores on CPU nodes) and that no other user can share those compute nodes with you. Hence, the minimum amount of resource you can request for a parallel job is 1 node (or 32 cores and 4 GPU on GPU A100-40 nodes, 48 cores and 4 GPU on A100-80 nodes, 128 cores on CPU nodes).
To prevent the behaviour of batch scripts being dependent on the user environment at the point of submission, the option
--export=none
prevents the user environment from being exported to the batch system.Using the --export=none
means that the behaviour of batch submissions should be repeatable. We strongly recommend its use, although see the following section to enable access to the usual modules.
Important
The default GPU frequency on Tursa compute nodes was changed from 1410 MHz to 1040 MHz on Thursday 15 Dec 2022 to improve the energy efficiency of the service.
Users can control the GPU frequency in their job submission scripts:
--gpu-freq=<desired GPU freq in MHz>
allows users to set the GPU frequency on a per job basis. The frequency can be set in the range 210 - 1410 MHz in steps of 15 MHz.Bug
When setting the GPU frequency you will see an error in the output from the job that says control disabled
. This is an incorrect message due to an issue with how Slurm sets the GPU frequency and can be safely ignored.
srun
: Launching parallel jobs","text":"If you are running parallel jobs, your job submission script should contain one or more srun commands to launch the parallel executable across the compute nodes. In most cases you will want to add the options --distribution=block:block and --hint=nomultithread to your srun command to ensure you get the correct pinning of processes to cores on a compute node.
A brief explanation of these options: - --hint=nomultithread
- do not use hyperthreads/SMP - --distribution=block:block
- the first block
means use a block distribution of processes across nodes (i.e. fill nodes before moving onto the next one) and the second block
means use a block distribution of processes across \"sockets\" within a node (i.e. fill a \"socket\" before moving on to the next one).
Important
The Slurm definition of a \"socket\" does not usually correspond to a physical CPU socket. On Tursa GPU nodes it corresponds to half the cores on a socket as the GPU nodes are configured with NPS2.
On the Tursa CPU nodes, the Slurm definition of a socket does correspond to a physical CPU socket (64 cores) as the CPU nodes are configured with NPS1.
"},{"location":"tursa-user-guide/scheduler/#example-job-submission-scripts","title":"Example job submission scripts","text":""},{"location":"tursa-user-guide/scheduler/#example-job-submission-script-for-a-parallel-job-using-cuda","title":"Example: job submission script for a parallel job using CUDA","text":"A job submission script for a parallel job that uses 4 compute nodes, 4 MPI processes per node and 4 GPUs per node. It does not restrict what type of GPU the job can run on so both A100-40 and A100-80 can be used:
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Grid_job\n#SBATCH --time=12:0:0\n#SBATCH --nodes=4\n#SBATCH --tasks-per-node=4\n#SBATCH --cpus-per-task=8\n#SBATCH --gres=gpu:4\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code] \n\n# Load the correct modules\nmodule load /home/y07/shared/tursa-modules/setup-env\nmodule load gcc/9.3.0\nmodule load cuda/12.3\nmodule load openmpi/4.1.5-cuda12.3 \n\nexport OMP_NUM_THREADS=8\n\n# These will need to be changed to match the actual application you are running\napplication=\"my_mpi_openmp_app.x\"\noptions=\"arg 1 arg2\"\n\nsrun --hint=nomultithread --distribution=block:block \\\n gpu_launch.sh \\\n ${application} ${options}\n
This will run your executable \"my_mpi_openmp_app.x\" in parallel using 16 MPI processes on 4 nodes. 4 GPUs will be used per node.
Important
You must use the gpu_launch.sh
wrapper script to get the correct binding of GPU to MPI processes and of network interface to GPU and MPI process. This script is described in more detail below.
gpu_launch.sh
wrapper script","text":"The gpu_launch.sh
wrapper script is required to set the correct binding of GPU to MPI processes and the correct binding of interconnect interfaces to MPI process and GPU. We provide this centrally for convenience but its contents are simple:
#!/bin/bash\n\n# Compute the node-local rank of this process for binding to GPU and NIC\nlrank=$((SLURM_PROCID % SLURM_NTASKS_PER_NODE))\n\n# Bind the process to the correct GPU and NIC\nexport CUDA_VISIBLE_DEVICES=${lrank}\nexport UCX_NET_DEVICES=mlx5_${lrank}:1\n\n$@\n
"},{"location":"tursa-user-guide/scheduler/#using-the-dev-qos","title":"Using the dev
QoS","text":"The dev
QoS is designed for faster turnaround of short jobs than is usually available through the production QoS. It is subject to a number of restrictions:
gpu-a100-80
partitiongpu-a100-40
partitionIn addition, you must specify either the gpu-a100-80
or gpu-a100-40
partitions when using the dev
QoS.
Tip
The generic gpu
partition will not work consistently when using the dev
QoS.
Here is an example job submission script for a 2-node job in the dev
QoS using the gpu-a100-80
partition. Note the use of the gpu_launch.sh
wrapper script to get correct GPU and NIC binding.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Dev_Job\n#SBATCH --time=12:0:0\n#SBATCH --nodes=2\n#SBATCH --tasks-per-node=4\n#SBATCH --cpus-per-task=12\n#SBATCH --gres=gpu:4\n#SBATCH --partition=gpu-a100-80\n#SBATCH --qos=dev\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n\nexport OMP_NUM_THREADS=1\n\n# Load the correct modules\nmodule load /home/y07/shared/tursa-modules/setup-env\nmodule load gcc/9.3.0\nmodule load cuda/12.3\nmodule load openmpi/4.1.5-cuda12.3 \n\n# These will need to be changed to match the actual application you are running\napplication=\"my_mpi_openmp_app.x\"\noptions=\"arg 1 arg2\"\n\nsrun --hint=nomultithread --distribution=block:block \\\n    gpu_launch.sh \\\n    ${application} ${options}\n
"},{"location":"tursa-user-guide/sw-environment/","title":"Software environment","text":"The software environment on Tursa is primarily controlled through the module
command. By loading and switching software modules you control which software and versions are available to you.
Information
A module is a self-contained description of a software package -- it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.
By default, all users on Tursa start with the default software environment loaded.
Software modules on Tursa are provided by both Eviden and EPCC.
In this section, we provide:
module
commandmodule
command manipulates your environmentmodule
command","text":"We only cover basic usage of the module
command here. For full documentation please see the Linux manual page on modules
The module
command takes a subcommand to indicate what operation you wish to perform. Common subcommands are:
module list [name]
- List modules currently loaded in your environment, optionally filtered by [name]
module avail [name]
- List modules available, optionally filtered by [name]
module savelist
- List module collections available (usually used for accessing different programming environments)module restore name
- Restore the module collection called name
(usually used for setting up a programming environment)module load name
- Load the module called name
into your environmentmodule remove name
- Remove the module called name
from your environmentmodule swap old new
- Swap module new
for module old
in your environmentmodule help name
- Show help information on module name
module show name
- List what module name
actually does to your environmentThese are described in more detail below.
"},{"location":"tursa-user-guide/sw-environment/#information-on-the-available-modules","title":"Information on the available modules","text":"The module list
command will give the names of the modules and their versions you have presently loaded in your environment. By default, you will have no modules loaded when you first log into Tursa
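For example, on a fresh login you would typically see output similar to the following (a sketch of the standard Environment Modules message):
[dc-user1@tursa-login1 ~]$ module list\nNo Modulefiles Currently Loaded.\n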
Finding out which software modules are available on the system is performed using the module avail
command. To list all software modules available, use:
[dc-user1@tursa-login1 ~]$ module avail\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles -------------------------------------------\ncuda/11.0.2 openmpi/4.1.1-cuda11.0.2 ucx/1.10.1-cuda11.0.2 \n\n------------------------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles --------------------------------------------\ncuda/11.4.1 openmpi/4.1.1-cuda11.4 ucx/1.12.0-cuda11.4 \n\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles -------------------------------------------\ncuda/11.4.1 openmpi/4.1.1-cuda11.4.1 ucx/1.12.0-cuda11.4.1 \n\n------------------------------------------------ /mnt/lustre/tursafs1/apps/modulefiles -------------------------------------------------\ncuda/11.0.3 dot gcc/9.3.0 module-git module-info modules null openmpi/4.1.1 ucx/1.10.1 use.own xpmem/2.6.5 \n
This will list all the names and versions of the modules available on the service. Not all of them may work in your account though due to, for example, licensing restrictions. You will notice that for many modules we have more than one version, each of which is identified by a version number. One of these versions is the default. As the service develops the default version will change and old versions of software may be deleted.
You can list all the modules of a particular type by providing an argument to the module avail
command. For example, to list all available versions of the OpenMPI library, use:
[dc-user1@tursa-login1 ~]$ module avail openmpi\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles -------------------------------------------\nopenmpi/4.1.1-cuda11.0.2 \n\n------------------------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles --------------------------------------------\nopenmpi/4.1.1-cuda11.4 \n\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles -------------------------------------------\nopenmpi/4.1.1-cuda11.4.1 \n\n----------------\n
The module show
command reveals what operations the module actually performs to change your environment when it is loaded. We provide a brief overview below of what these different settings mean. For example, for the default openmpi module:
[dc-user1@tursa-login1 ~]$ module show openmpi\n-------------------------------------------------------------------\n/mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles/openmpi/4.1.1-cuda11.0.2:\n\nmodule-whatis Sets up OpenMPI on your environment\nsetenv MPI_ROOT /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/\nprepend-path PATH /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/bin/\nprepend-path LD_LIBRARY_PATH /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/lib\nprepend-path MANPATH /opt/mpi/openmpi/4.0.4.1/share/man\nmodule load ucx/1.10.1\nsetenv OMPI_CC cc\nsetenv OMPI_CXX g++\nsetenv OMPI_CFLAGS -g -m64\nsetenv OMPI_CXXFLAGS -g -m64\n-------------------------------------------------------------------\n
"},{"location":"tursa-user-guide/sw-environment/#loading-removing-and-swapping-modules","title":"Loading, removing and swapping modules","text":"To load a module to use the module load
command. For example, to load the default version of OpenMPI into your environment, use:
[dc-user1@tursa-login1 ~]$ module load openmpi\n\n UCX 1.10 loaded\n\n\n OpenMPI 4.1.1 loaded\n
Once you have done this, your environment will be setup to use the OpenMPI library. The above command will load the default version of OpenMPI. If you need a specific version of the software, you can add more information:
[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4.1\n\n UCX 1.12.0 compiled with cuda 11.4.1 loaded\n\n\n OpenMPI 4.1.1 with cuda-11.4.1 and UCX 1.12.0 loaded\n
will load OpenMPI version 4.1.1 with CUDA 11.4.1 into your environment, regardless of the default.
If you want to remove software from your environment, module rm
will remove a loaded module:
[dc-user1@tursa-login1 ~]$ module rm openmpi\n
will unload whatever version of openmpi
(even if it is not the default) you might have loaded.
There are many situations in which you might want to change the presently loaded version to a different one, such as trying the latest version which is not yet the default or using a legacy version to keep compatibility with old data. This can be achieved most easily by using module swap oldmodule newmodule
.
Suppose you have loaded version 4.1.1 of openmpi
, the following command will change to version 4.1.1-cuda11.4.1:
[dc-user1@tursa-login1 ~]$ module swap openmpi openmpi/4.1.1-cuda11.4.1\n\n UCX 1.12.0 compiled with cuda 11.4.1 loaded\n\n\n OpenMPI 4.1.1 with cuda-11.4.1 and UCX 1.12.0 loaded\n
You do not need to specify the version of the module already loaded in your current environment: it can be inferred, since it is the only version of openmpi you have loaded.
"},{"location":"tursa-user-guide/sw-environment/#capturing-your-environment-for-reuse","title":"Capturing your environment for reuse","text":"Sometimes it is useful to save the module environment that you are using to compile a piece of code or execute a piece of software. This is saved as a module collection. You can save a collection from your current environment by executing:
[dc-user1@tursa-login1 ~]$ module save [collection_name]\n
Note
If you do not specify the environment name, it is called default
.
You can find the list of saved module environments by executing:
[dc-user1@tursa-login1 ~]$ module savelist\nNamed collection list:\n 1) default\n
To list the modules in a collection, you can execute, e.g.,:
[dc-user1@tursa-login1 ~]$ module saveshow default\n-------------------------------------------------------------------\n/home/z01/z01/dc-turn1/.module/default:\n\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/modulefilesintel\nmodule use --append /mnt/lustre/tursafs1/apps/modulefiles\nmodule load ucx/1.12.0-cuda11.4.1\nmodule load openmpi/4.1.1-cuda11.4.1\n\n-------------------------------------------------------------------\n
Note again that the details of the collection have been saved to the home directory (the first line of output above). It is possible to save a module collection with a fully qualified path, e.g.,
[dc-user1@tursa-login1 ~]$ module save /home/t01/z01/auser/my-module-collection\n
if you want to save to a specific file name.
To delete a module environment, you can execute:
[dc-user1@tursa-login1 ~]$ module saverm <environment_name>\n
"},{"location":"tursa-user-guide/sw-environment/#shell-environment-overview","title":"Shell environment overview","text":"When you log in to Tursa, you are using the bash shell by default. As any other software, the bash shell has loaded a set of environment variables that can be listed by executing printenv
or export
.
The environment variables listed before are useful to define the behaviour of the software you run. For instance, OMP_NUM_THREADS
define the number of threads.
To define an environment variable, you need to execute:
export OMP_NUM_THREADS=4\n
Please note there are no blanks between the variable name, the assignation symbol, and the value. If the value is a string, enclose the string in double quotation marks.
You can show the value of a specific environment variable if you print it:
echo $OMP_NUM_THREADS\n
Do not forget the dollar symbol. To remove an environment variable, just execute:
unset OMP_NUM_THREADS\n
"},{"location":"tursa-user-guide/sw-environment/#compiler-environment","title":"Compiler environment","text":"The system supports two different primary compiler environments:
To compile on the system for GPU nodes using the GCC toolchain, you would typically load the required modules:
[dc-user1@tursa-login1 ~]$ module load gcc/9.3.0\n[dc-user1@tursa-login1 ~]$ module load cuda/11.4.1 \n[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4\n[dc-user1@tursa-login1 ~]$ module list\nCurrently Loaded Modulefiles:\n 1) gcc/9.3.0 2) cuda/11.4.1 3) ucx/1.12.0-cuda11.4 4) openmpi/4.1.1-cuda11.4 \n
Once you have loaded the modules, the standard OpenMPI compiler wrapper scripts are available:
mpicc
mpicxx
mpif90
You can find more information on these scripts in the OpenMPI documentation.
"},{"location":"tursa-user-guide/sw-environment/#nvhpc-toolchain","title":"NVHPC toolchain","text":"To compile on the system for GPU nodes using the GCC toolchain, you would typically load the required modules:
[dc-user1@tursa-login1 ~]$ module load /home/y07/shared/tursa-modules/setup-env\n[dc-user1@tursa-login1 ~]$ module load gcc/9.3.0\n[dc-user1@tursa-login1 ~]$ module load nvhpc/21.7-nompi\n[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4\n[dc-user1@tursa-login1 ~]$ module list\nCurrently Loaded Modulefiles:\n 1) /mnt/lustre/tursafs1/home/y07/shared/tursa-modules/setup-env 2) nvhpc/21.7-nompi\n 3) ucx/1.12.0-cuda11.4 4) openmpi/4.1.1-cuda11.4 5) gcc/9.3.0\n
Once you have loaded the modules, the standard OpenMPI compiler wrapper scripts are available:
mpicc
mpicxx
mpif90
and the NVIDIA compilers are available as:
nvcc
nvc++
nvfortran
Tip
Both the NVIDIA compilers and the MPI compiler wrapper scripts will use the GCC compilers directly in the default configuration - this is often what you want. If you want the compiler wrappers to call the NVIDIA compilers themselves rather than GCC directly, you would use:
export OMPI_CC=nvcc\nexport OMPI_CXX=nvc++\nexport OMPI_FC=nvfortran\n
"},{"location":"tursa-user-guide/sw-environment/#other-build-tools","title":"Other build tools","text":""},{"location":"tursa-user-guide/sw-environment/#cmake","title":"cmake","text":"CMake is available by using the commands:
[dc-user1@tursa-login1 ~]$ module load /home/y07/shared/tursa-modules/setup-env\n[dc-user1@tursa-login1 ~]$ module load cmake\n
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"DiRAC Extreme Scaling User Documentation","text":"DiRAC Extreme Scaling is part of the DiRAC National HPC Service. You can find more information on the service and the research it supports on the DiRAC website.
The DiRAC Extreme Scaling service is an HPC resource for UK researchers. DiRAC Extreme Scaling is provided by UKRI, EPCC and the University of Edinburgh. The hardware is provided by ATOS.
"},{"location":"#what-the-documentation-covers","title":"What the documentation covers","text":"The documentation currently includes:
The source for this documentation is publicly available in the DiRAC Extreme Scaling documentation Github repository so that anyone can contribute to improve the documentation for the service. Contributions can be in the form of improvements or additions to the content and/or addition of Issues providing suggestions for how it can be improved.
Full details of how to contribute can be found in the README.md
file of the repository.
The Tursa User Guide covers all aspects of use of the Tursa resource. This includes fundamentals (required by all users to use the system effectively), best practice for getting the most out of Tursa and more technical topics.
The Tursa User Guide contains the following sections:
On the Tursa system, interactive access can be achieved via SSH, either directly from a command line terminal or using an SSH client. In addition data can be transferred to and from the Tursa system using scp
from the command line or by using a file transfer client.
This section covers the basic connection methods.
Before following the process below, we assume you have setup an account on Tursa through the DiRAC SAFE. Documentation on how to do this can be found at:
Linux distributions come installed with a terminal application that can be used for SSH access to the login nodes. Linux users will have different terminals depending on their distribution and window manager (e.g. GNOME Terminal in GNOME, Konsole in KDE). Consult your Linux distribution's documentation for details on how to load a terminal.
"},{"location":"tursa-user-guide/connecting/#macos","title":"MacOS","text":"MacOS users can use the Terminal application, located in the Utilities folder within the Applications folder.
"},{"location":"tursa-user-guide/connecting/#windows","title":"Windows","text":"A typical Windows installation will not include a terminal client, though there are various clients available. We recommend all our Windows users to download and install MobaXterm to access Tursa. It is very easy to use and includes an integrated X server with SSH client to run any graphical applications on Tursa.
You can download MobaXterm Home Edition (Installer Edition) from the following link:
Double-click the downloaded Microsoft Installer file (.msi), and the Windows wizard will automatically guides you through the installation process. Note, you might need to have administrator rights to install on some Windows OS. Also make sure to check whether Windows Firewall hasn't blocked any features of this program after installation.
Start MobaXterm and then click \"Start local terminal\"
Tips
If you download the .zip file rather than the .msi, make sure you unzip before attempting to run the installer.
If you are using a \"managed desktop\" machine, so do not have admin rights, you can use the Portable edition of MobaXterm which doesn't need install privilages.
If this is your first time using MobaXterm, check that a permanent /home directory has been set up (or all saved info will be lost from session to session). Go to \"Settings\" -> \"Configuration\"-> check path to \"Persistent home directory\" is set and make sure path is \"private\" if prompted.
Any ssh key generated in MobaXterm will, by default, be stored in the permanent /home directory (see above) i.e. if your /home directory is _MyDocuments_\\MobaXterm\\home
then within that folder you will find _MyDocuments_\\MobaXterm\\home\\.ssh
with your keys. This folder will be 'hidden' by default so you may need to tick 'Hidden items' under 'View' in Windows Explorer to see it.
MobaXterm also allows you to set up ssh sessions with the username, login host and key details saved. You are welcome to use this, rather than using the \"Local terminal\" but we are not able to assist with debugging connection issues if you choose this method. We recommend sticking to command line terminal access.
To access Tursa, you need to use two credentials:
You can find more detailed instructions on how to set up your credentials to access Tursa from Windows, macOS and Linux below.
"},{"location":"tursa-user-guide/connecting/#ssh-key-pairs","title":"SSH Key Pairs","text":"You will need to generate an SSH key pair protected by a passphrase to access Tursa.
Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:
$ ssh-keygen -t rsa -C \"your@email.com\"\n...\n-bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\nEnter passphrase (empty for no passphrase): [Passphrase]\nEnter same passphrase again: [Passphrase]\nYour identification has been saved in /Home/user/.ssh/id_rsa.\nYour public key has been saved in /Home/user/.ssh/id_rsa.pub.\nThe key fingerprint is:\n03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\nThe key's randomart image is:\n+--[ RSA 2048]----+\n| . ...+o++++. |\n| . . . =o.. |\n|+ . . .......o o |\n|oE . . |\n|o = . S |\n|. +.+ . |\n|. oo |\n|. . |\n| .. |\n+-----------------+\n
(remember to replace \"your@email.com\" with your e-mail address).
"},{"location":"tursa-user-guide/connecting/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"You should now upload the public part of your SSH key pair to the SAFE by following the instructions at:
Login to SAFE. Then:
Once you have done this, your SSH key will be added to your Tursa account.
Remember, you will need to use both an SSH key and password to log into Tursa so you will also need to collect your initial password before you can log into Tursa. We cover this next.
Note
If you want to connect to Tursa from more than one machine, e.g. from your home laptop as well as your work laptop, you should generate an ssh key on each machine, and add each of the public keys into SAFE.
"},{"location":"tursa-user-guide/connecting/#initial-passwords-up-to-13-feb-2024","title":"Initial passwords (up to 13 Feb 2024)","text":"The SAFE web interface is used to provide your initial password for logging onto Tursa (see the SAFE Documentation for more details on requesting accounts and picking up passwords).
Note
You may now change your password on the Tursa machine itself using the passwd command or when you are prompted the first time you login. This change will not be reflected in the SAFE. If you forget your password, you should use the SAFE to request a new one-shot password.
"},{"location":"tursa-user-guide/connecting/#mfa-time-based-one-time-passcode-totp-from-13-feb-2024","title":"MFA Time-based one-time passcode (TOTP) (from 13 Feb 2024)","text":"You will need to use both an SSH key and time-based one-time passcode to log into Tursa so you will also need to set up a method for generating a TOTP code before you can log into Tursa.
"},{"location":"tursa-user-guide/connecting/#first-login-from-a-new-account-password-required","title":"First login from a new account: password required","text":"Important
You will not use your password when logging on to Tursa after the first login for a new account.
As an additional security measure, you will also need to use a password from SAFE for your first login to Tursa with a new account. When you log into Tursa for the first time with a new account, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed. You will no longer need this password to log into Tursa from this point forwards, you will use your SSH key and TOTP as described above.
"},{"location":"tursa-user-guide/connecting/#ssh-clients","title":"SSH Clients","text":"Interaction with Tursa is done remotely, over an encrypted communication channel, Secure Shell version 2 (SSH-2). This allows command-line access to one of the login nodes of a Tursa, from which you can run commands or use a command-line text editor to edit files. SSH can also be used to run graphical programs such as GUI text editors and debuggers when used in conjunction with an X client.
"},{"location":"tursa-user-guide/connecting/#logging-in","title":"Logging in","text":"You can use the following command from the terminal window to login into Tursa:
ssh username@tursa.dirac.ed.ac.uk\n
You need to enter both credentials correctly to be able to access Tursa.
Tip
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key wih the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_Tursa
you would use the command ssh -i keys/id_rsa_Tursa username@login.tursa.ac.uk
to log in.
Tip
When you first log into Tursa, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed
To allow remote programs, especially graphical applications to control your local display, such as being able to open up a new GUI window (such as for a debugger), use:
ssh -X username@tursa.dirac.ed.ac.uk\n
Some sites recommend using the -Y
flag. While this can fix some compatibility issues, the -X
flag is more secure.
Current MacOS systems do not have an X window system. Users should install the XQuartz package to allow for SSH with X11 forwarding on MacOS systems:
Typing in the full command to login or transfer data to Tursa can become tedious as it often has to be repeated many times. You can use the SSH configuration file, usually located on your local machine at .ssh/config
to make things a bit more convenient.
Each remote site (or group of sites) can have an entry in this file which may look something like:
Host tursa\n HostName tursa.dirac.ed.ac.uk\n User username\n
(remember to replace username
with your actual username!).
The Host tursa
line defines a short name for the entry. In this case, instead of typing ssh username@tursa.dirac.ed.ac.uk
to access the Tursa login nodes, you could use ssh tursa
instead. The remaining lines define the options for the tursa
host.
Hostname tursa.dirac.ed.ac.uk
- defines the full address of the hostUser username
- defines the username to use by default for this host (replace username
with your own username on the remote host)Now you can use SSH to access Tursa without needing to enter your username or the full hostname every time:
$ ssh tursa\n
You can set up as many of these entries as you need in your local configuration file. Other options are available. See the ssh_config man page (or man ssh_config
on any machine with SSH installed) for a description of the SSH configuration file. You may find the IdentityFile
option useful if you have to manage multiple SSH key pairs for different systems as this allows you to specify which SSH key to use for each system.
Bug
There is a known bug with Windows ssh-agent. If you get the error message: Warning: agent returned different signature type ssh-rsa (expected rsa-sha2-512)
, you will need to either specify the path to your ssh key in the command line (using the -i
option as described above) or add the path to your SSH config file by using the IdentityFile
option.
If you find you are unable to connect via SSH there are a number of ways you can try and diagnose the issue. Some of these are collected below - if you are having difficulties connecting we suggest trying these before contacting the Tursa service desk.
"},{"location":"tursa-user-guide/connecting/#use-the-usertursadiracedacuk-syntax-rather-than-l-user-tursadiracedacuk","title":"Use theuser@tursa.dirac.ed.ac.uk
syntax rather than -l user tursa.dirac.ed.ac.uk
","text":"We have seen a number of instances where people using the syntax
ssh -l user tursa.dirac.ed.ac.uk\n
have not been able to connect properly and get prompted for a password many times. We have found that using the alternative syntax:
ssh user@tursa.dirac.ed.ac.uk\n
works more reliably. If you are using the -l user
option to connect and are seeing issues, then try using user@tursa.dirac.ed.ac.uk
instead.
Try the command ping -c 3 tursa.dirac.ed.ac.uk
. If you successfully connect to the login node, the output should include:
--- tursa.dirac.ed.ac.uk ping statistics ---\n3 packets transmitted, 3 received, 0% packet loss, time 38ms\n
(the ping time '38ms' is not important). If not all packets are received there could be a problem with your internet connection, or the login node could be unavailable.
"},{"location":"tursa-user-guide/connecting/#password","title":"Password","text":"If you are having trouble entering your password consider using a password manager, from which you can copy and paste it. This will also help you generate a secure password. If you need to reset your password, instructions for doing so can be found in the SAFE documentation
Windows users please note that Ctrl+V
does not work to paste in to PuTTY, MobaXterm, or PowerShell. Instead use Shift+Ins
to paste. Alternatively, right-click and select 'Paste' in PuTTY and MobaXterm, or simply right-click to paste in PowerShell.
If you get the error message Permission denied (publickey)
this can indicate a problem with your SSH key. Some things to check:
Have you uploaded the key to SAFE? Please note that if the same key is reuploaded SAFE will not map the \"new\" key to Tursa. If for some reason this is required, please delete the key first, then reupload.
Is ssh using the correct key? You can check which keys are being found and offered by ssh using ssh -vvv
. If your private key has a non-default name you can use the -i
flag to provide it to ssh, i.e. ssh -i path/to/key username@tursa.dirac.ed.ac.uk
.
Are you entering the passphrase correctly? You will be asked for your private key's passphrase first. If you enter it incorrectly you will usually be asked to enter it again, and usually up to three times in total, after which ssh will fail with Permission denied (publickey)
. If you would like to confirm your passphrase without attempting to connect, you can use ssh-keygen -y -f /path/to/private/key
. If successful, this command will print the corresponding public key. You can also use this to check it is the one uploaded to SAFE.
Are permissions correct on the ssh key? One common issue is that the permissions are incorrect on the either the key file, or the directory it's contained in. On Linux/MacOS for example, if your private keys are held in ~/.ssh/
you can check this with ls -al ~/.ssh
. This should give something similar to the following output:
$ ls -al ~/.ssh/\n drwx------. 2 user group 48 Jul 15 20:24 .\n drwx------. 12 user group 4096 Oct 13 12:11 ..\n -rw-------. 1 user group 113 Jul 15 20:23 authorized_keys\n -rw-------. 1 user group 12686 Jul 15 20:23 id_rsa\n -rw-r--r--. 1 user group 2785 Jul 15 20:23 id_rsa.pub\n -rw-r--r--. 1 user group 1967 Oct 13 14:11 known_hosts\n
The important section here is the string of letters and dashes at the start, for the lines ending in .
, id_rsa
, and id_rsa.pub
, which indicate permissions on the containing directory, private key, and public key respectively. If your permissions are not correct, they can be set with chmod
. Consult the table below for the relevant chmod
command. On Windows, permissions are handled differently but can be set by right-clicking on the file and selecting Properties > Security > Advanced. The user, SYSTEM, and Administrators should have Full control
, and no other permissions should exist for both public and private key files, and the containing folder.
chmod
Code Directory drwx------
700 Private Key -rw-------
600 Public Key -rw-r--r--
644 chmod
can be used to set permissions on the target in the following way: chmod <code> <target>
. So for example to set correct permissions on the private key file id_rsa_Tursa
one would use the command chmod 600 id_rsa_Tursa
.
Tip
Unix file permissions can be understood in the following way. There are three groups that can have file permissions: (owning) users, (owning) groups, and others. The available permissions are read, write, and execute. The first character indicates whether the target is a file -
, or directory d
. The next three characters indicate the owning user's permissions. The first character is r
if they have read permission, -
if they don't, the second character is w
if they have write permission, -
if they don't, the third character is x
if they have execute permission, -
if they don't. This pattern is then repeated for group, and other permissions. For example the pattern -rw-r--r--
indicates that the owning user can read and write the file, members of the owning group can read it, and anyone else can also read it. The chmod
codes are constructed by treating the user, group, and owner permission strings as binary numbers, then converting them to decimal. For example the permission string -rwx------
becomes 111 000 000
-> 700
.
Verbose debugging output from ssh
can be very useful for diagnosing the issue. In particular, it can be used to distinguish between problems with the SSH key and password - further details are given below. To enable verbose output add the -vvv
flag to your SSH command. For example:
ssh -vvv username@tursa.dirac.ed.ac.uk\n
The output is lengthy, but somewhere in there you should see lines similar to the following:
debug1: Next authentication method: keyboard-interactive\ndebug2: userauth_kbdint\ndebug3: send packet: type 50\ndebug2: we sent a keyboard-interactive packet, wait for reply\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 1\nPassword:\ndebug3: send packet: type 61\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 0\ndebug3: send packet: type 61\ndebug3: receive packet: type 51\nAuthenticated with partial success.\ndebug1: Authentications that can continue: publickey,password\n
If you do not see the Password:
prompt you may have connection issues, or there could be a problem with the Tursa login nodes. If you do not see Authenticated with partial success
it means your password was not accepted. You will be asked to re-enter your password, usually two more times before the connection will be rejected. Consider the suggestions under Password above. If you do see Authenticated with partial success
, it means your password was accepted, and your SSH key will now be checked.
You should next see something similiar to:
debug1: Next authentication method: publickey\ndebug1: Offering public key: RSA SHA256:<key_hash> <path_to_private_key>\ndebug3: send_pubkey_test\ndebug3: send packet: type 50\ndebug2: we sent a publickey packet, wait for reply\ndebug3: receive packet: type 60\ndebug1: Server accepts key: pkalg rsa-sha2-512 blen 2071\ndebug2: input_userauth_pk_ok: fp SHA256:<key_hash>\ndebug3: sign_and_send_pubkey: RSA SHA256:<key_hash>\nEnter passphrase for key '<path_to_private_key>':\ndebug3: send packet: type 50\ndebug3: receive packet: type 52\ndebug1: Authentication succeeded (publickey).\n
Most importantly, you can see which files ssh has checked for private keys, and you can see if any key is accepted. The line Authenticated succeeded
indicates that the SSH key has been accepted. By default ssh will go through a list of standard private key files, as well as any you have specified with -i
or a config file. This is fine, as long as one of the files mentioned is the one that matches the public key uploaded to SAFE.
If your SSH key passphrase is incorrect, you will be asked to try again up to three times in total, before being disconnected with Permission denied (publickey)
. If you enter your passphrase correctly, but still see this error message, please consider the advice under SSH key above.
This section covers best practice and tools for data management on Tursa as well as information on the storgae available on the system.
Information
If you have any questions on data management and transfer please do not hesitate to contact the DiRAC service desk at dirac-support@epcc.ed.ac.uk.
"},{"location":"tursa-user-guide/data/#useful-resources-and-links","title":"Useful resources and links","text":"We strongly recommend that you give some thought to how you use the various data storage facilities that are part of the Tursa service. This will not only allow you to use the machine more effectively but also to ensure that your valuable data is protected.
"},{"location":"tursa-user-guide/data/#tursa-storage","title":"Tursa storage","text":"Tursa has two different storage systems available:
The Tursa storage is provided by a parallel Lustre file system that provides your home directories and working storage. When you log in you will be placed in your home directory.
The home directory for each user is located at:
/home/[project code]/[group code]/[username]\n
where
[project code]
is the code for your project (e.g., x01);[group code]
is the code for your project group, if your project has groups, (e.g. x01-a) or the same as the project code, if not;[username]
is your login name.Each project is allocated a portion of the total storage available, and the project PI will be able to sub-divide this quota among the groups and users within the project. As is standard practice on UNIX and Linux systems, the environment variable $HOME
is automatically set to point to your home directory.
The tape storage can be made available to any Tursa user on request and can be used to store data from the Lustre parallel file system.
Managing and transferring data to/from the Tursa tape storage is done via the Miria web interface via an SSH tunnel to the Tursa login nodes.
Important
All data on the tape storage is shared project data rather than data associated with individual user accounts. Any data you move to tape will be visible to all users in the same project as you who have access to the tape storage.
"},{"location":"tursa-user-guide/data/#requesting-access-to-the-tape-storage","title":"Requesting access to the tape storage","text":"If you want to use the Tursa tape storage, you should contact the DiRAC Service Desk with the username and project ID you want to use to access the storage.
"},{"location":"tursa-user-guide/data/#data-locations","title":"Data locations","text":"In order to move data to the tape storage it must exist in a specific directory on the Tursa Lustre file system. You will need to move or copy the data to this location before it can be moved to tape and when you restore data from tape it will be placed in this location.
There is one directory per project on Tursa. The directory has the path:
/mnt/lustre/tursafs1/archive/[project code]\n
So, for example, the directory for project dp001
would be:
/mnt/lustre/tursafs1/archive/dp001\n
"},{"location":"tursa-user-guide/data/#setup-the-ssh-tunnel-for-miria","title":"Setup the SSH tunnel for Miria","text":"Once your tape storage access has been setup and you have moved data to the archive directory, you will need to connect to the Miria web interface in a web browser on your local system by setting up an SSH tunnel to the Tursa login nodes.
You do this by logging into Tursa in the usual way (with your SSH key and password) and adding the -L 9080:10.144.20.95:443
option to the ssh
command.
For example, if your username is dc-user1
, you would setup the tunnel by logging into Tursa with (assuming your SSH key is in the default location):
ssh -L 9080:10.144.20.95:443 dc-user1@tursa.dirac.ed.ac.uk\n
Enter your SSH key passphrase and password in the usual way.
Note
You will need to setup the SSH tunnel each time you want to access the Miria interface.
"},{"location":"tursa-user-guide/data/#access-the-miria-interface","title":"Access the Miria interface","text":"Once you have setup the SSH tunnel, you should be able to access the Miria interface in a web browser on your local system. Open a new tab and enter the URL:
You should see an interface asking you for a username and password. Use the username and password that you use to log into Tursa to log into the tape storage interface.
"},{"location":"tursa-user-guide/data/#transfer-data-from-tursa-lustre-to-tape","title":"Transfer data from Tursa Lustre to tape","text":"You use the \"Easy Move\" option from the left-hand menu to transfer data.
Your transfer request will be added to the queue. You can check on progress by selecting the \"Activity\" option in the left hand menu.
"},{"location":"tursa-user-guide/data/#restore-data-from-tape-to-tursa-lustre","title":"Restore data from tape to Tursa Lustre","text":"You use the \"Easy Move\" option from the left-hand menu to transfer data.
Your transfer request will be added to the queue. You can check on progress by selecting the \"Activity\" option in the left hand menu.
Bug
If you restore a file rather than a directory, the Miria tool will give the file the name NULL
once it is restored, you should use the mv
command to rename the file to the correct name once it has been restored.
How you share data with other Tursa users depends on whether or not they belong to the same project as you. Each project has two shared folders that can be used for sharing data.
"},{"location":"tursa-user-guide/data/#sharing-data-with-tursa-users-in-your-project","title":"Sharing data with Tursa users in your project","text":"Each project has an inner shared folder.
/home/[project code]/[project code]/shared\n
This folder has read/write permissions for all project members. You can place any data you wish to share with other project members in this directory. For example, if your project code is x01 the inner shared folder would be located at /home/x01/x01/shared
.
Each project also has an outer shared folder.:
/home/[project code]/shared\n
It is writable by all project members and readable by any user on the system. You can place any data you wish to share with other Tursa users who are not members of your project in this directory. For example, if your project code is x01 the outer shared folder would be located at /home/x01/shared
.
You should check the permissions of any files that you place in the shared area, especially if those files were created in your own Tursa account Files of the latter type are likely to be readable by you only.
The chmod
command below shows how to make sure that a file placed in the outer shared folder is also readable by all Tursa users.
chmod a+r /home/x01/shared/your-shared-file.txt\n
Similarly, for the inner shared folder, chmod
can be called such that read permission is granted to all users within the x01 project.
chmod g+r /home/x01/x01/shared/your-shared-file.txt\n
If you're sharing a set of files stored within a folder hierarchy the chmod
is slightly more complicated.
chmod -R a+Xr /home/x01/shared/my-shared-folder\nchmod -R g+Xr /home/x01/x01/shared/my-shared-folder\n
The -R
option ensures that the read permission is enabled recursively and the +X
guarantees that the user(s) you're sharing the folder with can access the subdirectories below my-shared-folder
.
Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.
The method you use to transfer data to/from Tursa will depend on how much you want to transfer and where to. The methods we cover in this guide are:
Before discussing specific data transfer methods, we cover archiving which is an essential process for transferring data efficiently.
"},{"location":"tursa-user-guide/data/#archiving","title":"Archiving","text":"If you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger \"archive\" file for ease of transfer and manipulation. A single large file makes more efficient use of the file system and is easier to move and copy and transfer because significantly fewer meta-data operations are required. Archive files can be created using tools like tar
and zip
.
The tar
command packs files into a \"tape archive\" format. The command has general form:
tar [options] [file(s)]\n
Common options include:
-c
create a new archive-v
verbosely list files processed-W
verify the archive after writing-l
confirm all file hard links are included in the archive-f
use an archive file (for historical reasons, tar writes its output to stdout by default rather than a file).Putting these together:
tar -cvWlf mydata.tar mydata\n
will create and verify an archive.
To extract files from a tar file, the option -x
is used. For example:
tar -xf mydata.tar\n
will recover the contents of mydata.tar
to the current working directory.
To verify an existing tar file against a set of data, the -d
(diff) option can be used. By default, no output will be given if a verification succeeds and an example of a failed verification follows:
$> tar -df mydata.tar mydata/*\nmydata/damaged_file: Mod time differs\nmydata/damaged_file: Size differs\n
Note
tar files do not store checksums with their data, requiring the original data to be present during verification.
Tip
Further information on using tar
can be found in the tar
manual (accessed via man tar
or at man tar).
The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:
zip [options] mydata.zip [file(s)]\n
Common options are:
-r
used to zip up a directory-#
where \"#\" represents a digit ranging from 0 to 9 to specify compression level, 0 being the least and 9 the most. Default compression is -6 but we recommend using -0 to speed up the archiving process.Together:
zip -0r mydata.zip mydata\n
will create an archive.
Note
Unlike tar, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system layout.
The corresponding unzip
command is used to extract data from the archive. The simplest use case is:
unzip mydata.zip\n
which recovers the contents of the archive to the current working directory.
Files in a zip archive are stored with a CRC checksum to help detect data loss. unzip
provides options for verifying this checksum against the stored files. The relevant flag is -t
and is used as follows:
$> unzip -t mydata.zip\nArchive: mydata.zip\n testing: mydata/ OK\n testing: mydata/file OK\nNo errors detected in compressed data of mydata.zip.\n
Tip
Further information on using zip
can be found in the zip
manual (accessed via man zip
or at man zip).
The easiest way of transferring data to/from Tursa is to use one of the standard programs based on the SSH protocol such as scp
, sftp
or rsync
. These all use the same underlying mechanism (SSH) as you normally use to log-in to Tursa. So, once the the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine (Tursa in this case).
To avoid having to type in your password multiple times you can set up a SSH key pair and use an SSH agent as documented in the User Guide at connecting
.
The SSH protocol encrypts all traffic it sends. This means that file transfer using SSH consumes a relatively large amount of CPU time at both ends of the transfer (for encryption and decryption). The Tursa login nodes have fairly fast processors that can sustain about 100 MB/s transfer. The encryption algorithm used is negotiated between the SSH client and the SSH server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm than the default. The aes128-ctr
or aes256-ctr
algorithms are well supported and fast as they are implemented in hardware. These are not usually the default choice when using scp
so you will need to manually specify them.
A single SSH based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth so you may see an overall improvement by running several data transfer operations in parallel. To reduce metadata interactions it is a good idea to overlap transfers of files from different directories.
In addition, you should consider the following when transferring data:
gzip
.The scp
command creates a copy of a file, or if given the -r
flag, a directory either from a local machine onto a remote machine or from a remote machine onto a local machine.
For example, to transfer files to Tursa from a local machine:
scp [options] source user@tursa.dirac.ed.ac.uk:[destination]\n
(Remember to replace user
with your Tursa username in the example above.)
In the above example, the [destination]
is optional, as when left out scp
will copy the source into your home directory. Also, the source
should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.
If you want to request a different encryption algorithm add the -c [algorithm-name]
flag to the scp
options. For example, to use the (usually faster) arcfour encryption algorithm you would use:
scp [options] -c aes128-ctr source user@tursa.dirac.ed.ac.uk:[destination]\n
(Remember to replace user
with your Tursa username in the example above.)
The rsync
command can also transfer data between hosts using a ssh
connection. It creates a copy of a file or, if given the -r
flag, a directory at the given destination, similar to scp
above.
Given the -a
option rsync can also make exact copies (including permissions), this is referred to as mirroring. In this case the rsync
command is executed with ssh
to create the copy on a remote machine.
To transfer files to Tursa using rsync
with ssh
the command has the form:
rsync [options] -e ssh source user@tursa.dirac.ed.ac.uk:[destination]\n
(Remember to replace user
with your Tursa username in the example above.)
In the above example, the [destination]
is optional, as when left out rsync will copy the source into your home directory. Also the source
should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.
Additional flags can be specified for the underlying ssh
command by using a quoted string as the argument of the -e
flag. e.g.
rsync [options] -e \"ssh -c arcfour\" source user@tursa.dirac.ed.ac.uk:[destination]\n
(Remember to replace user
with your Tursa username in the example above.)
Tip
Further information on using rsync
can be found in the rsync
manual (accessed via man rsync
or at man rsync).
Here we have a short example demonstrating transfer of data directly from a laptop/workstation to Tursa.
Note
This guide assumes you are using a command line interface to transfer data. This means the terminal on Linux or macOS, MobaXterm local terminal on Windows or Powershell.
Before we can transfer of data to Tursa we need to make sure we have an SSH key setup to access Tursa from the system we are transferring data from. If you are using the same system that you use to log into Tursa then you should be all set. If you want to use a different system you will need to generate a new SSH key there (or use SSH key forwarding) to allow you to connect to Tursa.
Tip
Remember that you will need to use both a key and your password to transfer data to Tursa.
Once we know our keys are setup correctly, we are now ready to transfer data directly between the two machines. We begin by combining our important research data in to a single archive file using the following command:
tar -czf all_my_files.tar.gz file1.txt file2.txt file3.txt\n
We then initiate the data transfer from our system to Tursa, here using rsync
to allow the transfer to be recommenced without needing to start again, in the event of a loss of connection or other failure. For example, using the SSH key in the file ~/.ssh/id_RSA_A2
on our local system:
rsync -Pv -e\"ssh -c aes128-gcm@openssh.com -i $HOME/.ssh/id_RSA_A2\" ./all_my_files.tar.gz otbz19@tursa.dirac.ed.ac.uk:/home/z19/z19/otbz19/\n
Note the use of the -P
flag to allow partial transfer -- the same command could be used to restart the transfer after a loss of connection. The -e
flag allows specification of the ssh command - we have used this to add the location of the identity file. The -c
option specifies the cipher to be used as aes128-gcm
which has been found to increase performance. Unfortunately the ~
shortcut is not correctly expanded, so we have specified the full path. We move our research archive to our project work directory on Tursa.
Note
Remember to replace otbz19
with your username on Tursa.
If we were unconcerned about being able to restart an interrupted transfer, we could instead use the scp
command,
scp -c aes128-gcm@openssh.com -i ~/.ssh/id_RSA_A2 all_my_files.tar.gz otbz19@transfer.dyn.tursa.ac.uk:/home/z19/z19/otbz19/\n
but rsync
is recommended for larger transfers.
Note
Some of the material in this section is closely based on information provided by NASA as part of the documentation for the Aitkin HPC system.
"},{"location":"tursa-user-guide/hardware/#system-overview","title":"System overview","text":"Tursa is a Eviden supercomputing system which has a total of 178 GPU compute nodes. Each GPU compute node has a CPU with 48 cores and 4 NVIDIA A100 GPU. Compute nodes are connected together by an Infiniband interconnect.
There are additional login nodes, which provide access to the system.
Compute nodes are only accessible via the Slurm job scheduling system.
There is a single file system which is available on login and compute nodes (see Data management and transfer).
The Lustre file system has a capacity of 5.1 PiB.
The interconnect uses a Fat Tree topology.
"},{"location":"tursa-user-guide/hardware/#interconnect-details","title":"Interconnect details","text":"Tursa has a high performance interconnect with 4x 200 Gb/s infiniband interfaces per node. It uses a 2-layer fat tree topology:
As with most HPC services, Tursa uses a scheduler to manage access to resources and ensure that the thousands of different users of system are able to share the system and all get access to the resources they require. Tursa uses the Slurm software to schedule jobs.
Writing a submission script is typically the most convenient way to submit your job to the scheduler. Example submission scripts (with explanations) for the most common job types are provided below.
Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.
Hint
If you have any questions on how to run jobs on Tursa do not hesitate to contact the DiRAC Service Desk.
You typically interact with Slurm by issuing Slurm commands from the login nodes (to submit, check and cancel jobs), and by specifying Slurm directives that describe the resources required for your jobs in job submission scripts.
"},{"location":"tursa-user-guide/scheduler/#resources","title":"Resources","text":""},{"location":"tursa-user-guide/scheduler/#gpuh","title":"GPUh","text":"Time used on Tursa nodes is measured in GPUh. 1 GPUh = 1 GPU for 1 hour. So a Tursa compute node with 4 GPUs would cost 4 GPUh per hour.
Note
The minimum resource request on Tursa is one full node which is charged at a rate of 4 GPUh per hour.
"},{"location":"tursa-user-guide/scheduler/#checking-available-budget","title":"Checking available budget","text":"You can check in SAFE by selecting Login accounts
from the menu, select the login account you want to query.
Under Login account details
you will see each of the budget codes you have access to listed e.g. dp123 resources
and then under Resource Pool to the right of this, a note of the remaining budgets.
When logged in to the machine you can also use the command
sacctmgr show assoc where user=$LOGNAME format=account,user,maxtresmins%75\n
This will list all the budget codes that you have access to e.g.
Account User MaxTRESMins \n---------- ---------- --------------------------------------------------------------------------- \n t01 dc-user1 gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0 \n z01 dc-user1 \n
This shows that dc-user1
is a member of budgets t01
and z01
. However, the gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0
indicates that the t01
budget can only run GPU jobs in standard (charged) partitions (all other options are disabled, indicated by =0
for CPU standard, CPU low and GPU low). This user can also submit jobs to any partition using the z01
budget.
To see the number of coreh or GPUh remaining you must check in SAFE.
"},{"location":"tursa-user-guide/scheduler/#charging","title":"Charging","text":"Jobs run on Tursa are charged for the time they use i.e. from the time the job begins to run until the time the job ends (not the full wall time requested).
Jobs are charged for the full number of nodes which are requested, even if they are not all used.
Charging takes place at the time the job ends, and the job is charged in full to the budget which is live at the end time.
"},{"location":"tursa-user-guide/scheduler/#basic-slurm-commands","title":"Basic Slurm commands","text":"There are three key commands used to interact with the Slurm on the command line:
sinfo
- Get information on the partitions and resources availablesbatch jobscript.slurm
- Submit a job submission script (in this case called: jobscript.slurm
) to the schedulersqueue
- Get the current status of jobs submitted to the schedulerscancel 12345
- Cancel a job (in this case with the job ID 12345
)We cover each of these commands in more detail below.
"},{"location":"tursa-user-guide/scheduler/#sinfo-information-on-resources","title":"sinfo
: information on resources","text":"sinfo
is used to query information about available resources and partitions. Without any options, sinfo
lists the status of all resources and partitions, e.g.
[dc-user1@tursa-login1 ~]$ sinfo \n\nPARTITION AVAIL TIMELIMIT NODES STATE NODELIST\ncpu up 2-00:00:00 4 alloc tu-c0r0n[66-69]\ncpu up 2-00:00:00 2 idle tu-c0r0n[70-71]\ngpu up 2-00:00:00 1 plnd tu-c0r2n93\ngpu up 2-00:00:00 11 drain tu-c0r0n75,tu-c0r5n[48,51,54,57],tu-c0r6n[48,51,54,57],tu-c0r7n[00,48]\ngpu up 2-00:00:00 112 mix tu-c0r0n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,72,87,90],tu-c0r1n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90,93],tu-c0r2n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90],tu-c0r3n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,90,93],tu-c0r4n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,81,84,87,90,93]\ngpu up 2-00:00:00 56 resv tu-c0r0n93,tu-c0r4n78,tu-c0r5n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45],tu-c0r6n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,60,63,66,69],tu-c0r7n[03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,51,54,57]\ngpu up 2-00:00:00 1 idle tu-c0r3n87\n
alloc
nodes are those that are running jobsidle
nodes are emptydrain
, down
, maint
nodes are unavailable to usersplnd
nodes are reserved for future jobssbatch
: submitting jobs","text":"sbatch
is used to submit a job script to the job submission system. The script will typically contain one or more srun
commands to launch parallel tasks.
When you submit the job, the scheduler provides the job ID, which is used to identify this job in other Slurm commands and when looking at resource usage in SAFE.
sbatch test-job.slurm\nSubmitted batch job 12345\n
"},{"location":"tursa-user-guide/scheduler/#squeue-monitoring-jobs","title":"squeue
: monitoring jobs","text":"squeue
without any options or arguments shows the current status of all jobs known to the scheduler. For example:
squeue\n
will list all jobs on Tursa.
The output of this is often large. You can restrict the output to just your jobs by adding the --me
option:
squeue --me\n
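If you want more control over the output, squeue also accepts a format string. The following is just a sketch (adjust the fields and widths to taste) showing the job ID, name, state, elapsed time, node count and the allocated nodes or pending reason:
squeue --me --format=\"%.10i %.14j %.8T %.10M %.6D %R\"\n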
"},{"location":"tursa-user-guide/scheduler/#scancel-deleting-jobs","title":"scancel
: deleting jobs","text":"scancel
is used to delete a job from the scheduler. If the job is waiting to run it is simply cancelled; if it is a running job then it is stopped immediately. You need to provide the job ID of the job you wish to cancel/stop. For example:
scancel 12345\n
will cancel (if waiting) or stop (if running) the job with ID 12345
.
The Tursa resource limits for any given job are covered by three separate attributes.
The primary resource you can request for your job is the compute node.
Information
The --exclusive
option is enforced on Tursa which means you will always have access to all of the memory on the compute node regardless of how many processes are actually running on the node.
Note
You will not generally have access to the full amount of memory resource on the node as some is retained for running the operating system and other system processes.
"},{"location":"tursa-user-guide/scheduler/#partitions","title":"Partitions","text":"On Tursa, compute nodes are grouped into partitions. You will have to specify a partition using the --partition
option in your Slurm submission script. The following table has a list of active partitions on Tursa.
You can list the active partitions by running sinfo
.
Tip
You may not have access to all the available partitions.
"},{"location":"tursa-user-guide/scheduler/#quality-of-service-qos","title":"Quality of Service (QoS)","text":"On Tursa, job limits are defined by the requested Quality of Service (QoS), as specified by the --qos
Slurm directive. The following table lists the active QoS on Tursa.
gpu-a100-40
(1-node maximum) or gpu-a100-80
(2-node maximum) partitions. You can find out the QoS that you can use by running the following command:
sacctmgr show assoc user=$USER cluster=tursa format=cluster,account,user,qos%50\n
As long as you have a positive budget, you should use the standard
QoS. Once you have exhausted your budget you can use the low
QoS to continue to run jobs at a lower priority than jobs in the standard
QoS.
Hint
If you have needs which do not fit within the current QoS, please contact the Service Desk and we can discuss how to accommodate your requirements.
Important
Only jobs sizes that are powers of 2 nodes are allowed. i.e. 1, 2, 4, 8, 16, 32, 64 nodes on the gpu
partition and 1, 2, 4 nodes on the cpu
partition.
Job priority on Tursa depends on a number of different factors:
Each of these factors is normalised to a value between 0 and 1, is multiplied with a weight and the resulting values combined to produce a priority for the job. The current job priority formula on Tursa is:
Priority = [10000 * P(QoS)] + [500 * P(Age)] + [300 * P(Fairshare)]\n
The priority factors are:
standard
QoS has a value of 5000 and low
QoS a value of 1.You can view the priorities for current queued jobs on the system with the sprio
command:
[dc-user1@tursa-login1 ~]$ sprio \n JOBID PARTITION PRIORITY SITE AGE FAIRSHARE QOS\n 43963 gpu 5055 0 51 5 5000\n 43975 gpu 5061 0 41 20 5000\n 43976 gpu 5061 0 41 20 5000\n 43982 gpu 5046 0 26 20 5000\n 43986 gpu 5011 0 6 5 5000\n 43996 gpu 5020 0 0 20 5000\n 43997 gpu 5020 0 0 20 5000\n
"},{"location":"tursa-user-guide/scheduler/#troubleshooting","title":"Troubleshooting","text":""},{"location":"tursa-user-guide/scheduler/#slurm-error-messages","title":"Slurm error messages","text":"An incorrect submission will cause Slurm to return an error. Some common problems are listed below, with a suggestion about the likely cause:
sbatch: unrecognized option <text>
One of your options is invalid or has a typo. man sbatch
to help.
error: Batch job submission failed: No partition specified or system default partition
A --partition=
option is missing. You must specify the partition (see the list above). This is most often --partition=standard
.
error: invalid partition specified: <partition>
error: Batch job submission failed: Invalid partition name specified
Check the partition exists and check the spelling is correct.
error: Batch job submission failed: Invalid account or account/partition combination specified
This probably means an invalid account has been given. Check the --account=
options against valid accounts in SAFE.
error: Batch job submission failed: Invalid qos specification
A QoS option is either missing or invalid. Check the script has a --qos=
option and that the option is a valid one from the table above. (Check the spelling of the QoS is correct.)
error: Your job has no time specification (--time=)...
Add an option of the form --time=hours:minutes:seconds
to the submission script. E.g., --time=01:30:00
gives a time limit of 90 minutes.
error: QOSMaxWallDurationPerJobLimit
error: Batch job submission failed: Job violates accounting/QOS policy
(job submit limit, user's size and/or time limits)
The script has probably specified a time limit which is too long for the corresponding QoS. E.g., the time limit for the short QoS is 20 minutes.
The squeue
command allows users to view information for jobs managed by Slurm. Jobs typically go through the following states: PENDING, RUNNING, COMPLETING, and COMPLETED. The first table provides a description of some job state codes. The second table provides a description of the reasons that cause a job to be in a state.
For a full list, see Job State Codes.
Reason Description Priority One or more higher priority jobs exist for this partition or advanced reservation. Resources The job is waiting for resources to become available. BadConstraints The job's constraints can not be satisfied. BeginTime The job's earliest start time has not yet been reached. Dependency This job is waiting for a dependent job to complete. Licenses The job is waiting for a license. WaitingForScheduling No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. Prolog Its PrologSlurmctld program is still running. JobHeldAdmin The job is held by a system administrator. JobHeldUser The job is held by the user. JobLaunchFailure The job could not be launched. This may be due to a file system problem, invalid program name, etc. NonZeroExitCode The job terminated with a non-zero exit code. InvalidAccount The job's account is invalid. InvalidQOS The job's QOS is invalid. QOSUsageThreshold Required QOS threshold has been breached. QOSJobLimit The job's QOS has reached its maximum job count. QOSResourceLimit The job's QOS has reached some resource limit. QOSTimeLimit The job's QOS has reached its time limit. NodeDown A node required by the job is down. TimeLimit The job exhausted its time limit. ReqNodeNotAvail Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's \"reason\" field as \"UnavailableNodes\". Such nodes will typically require the intervention of a system administrator to make available.For a full list of see Job Reasons.
"},{"location":"tursa-user-guide/scheduler/#output-from-slurm-jobs","title":"Output from Slurm jobs","text":"Slurm places standard output (STDOUT) and standard error (STDERR) for each job in the file slurm_<JobID>.out
. This file appears in the job's working directory once your job starts running.
Hint
Output may be buffered - to enable live output, e.g. for monitoring job status, add --unbuffered
to the srun
command in your SLURM script.
You specify the resources you require for your job using directives at the top of your job submission script using lines that start with the directive #SBATCH
.
Hint
Most options provided using #SBATCH
directives can also be specified as command line options to srun
.
If you do not specify any options, then the default for each option will be applied. As a minimum, all job submissions must specify the budget that they wish to charge the job too with the option:
--account=<budgetID>
your budget ID is usually something like t01
or t01-test
. You can see which budget codes you can charge to in SAFE.Other common options that are used are:
--time=<hh:mm:ss>
the maximum walltime for your job. e.g. For a 6.5 hour walltime, you would use --time=6:30:0
.--job-name=<jobname>
set a name for the job to help identify it inIn addition, parallel jobs will also need to specify how many nodes, parallel processes and threads they require.
--nodes=<nodes>
the number of nodes to use for the job.--tasks-per-node=<processes per node>
the number of parallel processes (e.g. MPI ranks) per node. For Grid on GPU nodes this will typically be 4 to give 1 MPI process per GPU. The CPU nodes have 128 cores per node.--cpus-per-task=<stride between processes>
for Grid jobs on GPU nodes where you typically use 1 MPI process per GPU, 4 per node, this will usually be 12 (so that the 48 cores on a node are evenly divided between the 4 MPI processes)--gres=gpu:4
the number of GPU to use per node. This will almost always be 4 to use all GPUs on a node. (This option should not be specified for jobs on the CPU nodes.)If you are happy to have any GPU type for your job (A100-40 or A100-80) then you select the gpu
partition:
--partition=gpu
If you wish to use just the A100-80 GPU nodes which have higher memory, you add the following option:
--partition=gpu-a100-80
request the job is placed on nodes with high-memory (80 GB) GPUs - there are 64 high memory GPU nodes on the system. To just use the A100-40 GPU nodes:
--partition=gpu-a100-40
request the job is placed on nodes with standard memory (40 GB) GPUs.If you do not specfy a partition, the scheduler may use any available node types for the job (equivalent of --partition=gpu
).
Note
For parallel jobs, Tursa operates in a node exclusive way. This means that you are assigned resources in the units of full compute nodes for your jobs (i.e. 32 cores and 4 GPU on GPU A100-40 nodes, 48 cores and 4 GPU on A100-80 nodes, 128 cores on CPU nodes) and that no other user can share those compute nodes with you. Hence, the minimum amount of resource you can request for a parallel job is 1 node (or 32 cores and 4 GPU on GPU A100-40 nodes, 48 cores and 4 GPU on A100-80 nodes, 128 cores on CPU nodes).
To prevent the behaviour of batch scripts being dependent on the user environment at the point of submission, the option
--export=none
prevents the user environment from being exported to the batch system.Using the --export=none
means that the behaviour of batch submissions should be repeatable. We strongly recommend its use, although see the following section to enable access to the usual modules.
Important
The default GPU frequency on Tursa compute nodes was changed from 1410 MHz to 1040 MHz on Thursday 15 Dec 2022 to improve the energy efficiency of the service.
Users can control the GPU frequency in their job submission scripts:
--gpu-freq=<desired GPU freq in MHz>
allows users to set the GPU frequency on a per job basis. The frequency can be set in the range 210 - 1410 MHz in steps of 15 MHz.Bug
When setting the GPU frequency you will see an error in the output from the job that says control disabled
. This is an incorrect message due to an issue with how Slurm sets the GPU frequency and can be safely ignored.
srun
: Launching parallel jobs","text":"If you are running parallel jobs, your job submission script should contain one or more srun commands to launch the parallel executable across the compute nodes. In most cases you will want to add the options --distribution=block:block and --hint=nomultithread to your srun command to ensure you get the correct pinning of processes to cores on a compute node.
A brief explanation of these options: - --hint=nomultithread
- do not use hyperthreads/SMP - --distribution=block:block
- the first block
means use a block distribution of processes across nodes (i.e. fill nodes before moving onto the next one) and the second block
means use a block distribution of processes across \"sockets\" within a node (i.e. fill a \"socket\" before moving on to the next one).
Important
The Slurm definition of a \"socket\" does not usually correspond to a physical CPU socket. On Tursa GPU nodes it corresponds to half the cores on a socket as the GPU nodes are configured with NPS2.
On the Tursa CPU nodes, the Slurm definition of a scoket does correspond to a physical CPU socket (64 cores) as the COU nodes are configured with NPS1.
"},{"location":"tursa-user-guide/scheduler/#example-job-submission-scripts","title":"Example job submission scripts","text":""},{"location":"tursa-user-guide/scheduler/#example-job-submission-script-for-a-parallel-job-using-cuda","title":"Example: job submission script for a parallel job using CUDA","text":"A job submission script for a parallel job that uses 4 compute nodes, 4 MPI processes per node and 4 GPUs per node. It does not restrict what type of GPU the job can run on so both A100-40 and A100-80 can be used:
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Grid_job\n#SBATCH --time=12:0:0\n#SBATCH --nodes=4\n#SBATCH --tasks-per-node=4\n#SBATCH --cpus-per-task=8\n#SBATCH --gres=gpu:4\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code] \n\n# Load the correct modules\nmodule load /home/y07/shared/tursa-modules/setup-env\nmodule load gcc/9.3.0\nmodule load cuda/12.3\nmodule load openmpi/4.1.5-cuda12.3 \n\nexport OMP_NUM_THREADS=8\nexport OMP_PLACES=cores\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# These will need to be changed to match the actual application you are running\napplication=\"my_mpi_openmp_app.x\"\noptions=\"arg 1 arg2\"\n\n# We have reserved the full nodes, now distribute the processes as\n# required: 4 MPI processes per node, stride of 12 cores between \n# MPI processes\n# \n# Note use of gpu_launch.sh wrapper script for GPU and NIC pinning \nsrun --nodes=4 --tasks-per-node=4 --cpus-per-task=12 \\\n --hint=nomultithread --distribution=block:block \\\n gpu_launch.sh \\\n ${application} ${options}\n
This will run your executable \"my_mpi_opnemp_app.x\" in parallel usimg 16 MPI processes on 4 nodes. 4 GPUs will be used per node.
Important
You must use the gpu_launch.sh
wrapper script to get the correct biniding of GPU to MPI processes and of network interface to GPU and MPI process. This script is described in more detail below.
gpu_launch.sh
wrapper script","text":"The gpu_launch.sh
wrapper script is required to set the correct binding of GPU to MPI processes and the correct binding of interconnect interfaces to MPI process and GPU. We provide this centrally for convenience but its contents are simple:
#!/bin/bash\n\n# Compute the raw process ID for binding to GPU and NIC\nlrank=$((SLURM_PROCID % SLURM_NTASKS_PER_NODE))\n\n# Bind the process to the correct GPU and NIC\nexport CUDA_VISIBLE_DEVICES=${lrank}\nexport UCX_NET_DEVICES=mlx5_${lrank}:1\n\n$@\n
"},{"location":"tursa-user-guide/scheduler/#using-the-dev-qos","title":"Using the dev
QoS","text":"The dev
QoS is designed for faster turnaround of short jobs than is usually available through the production QoS. It is subject to a number of restrictions:
gpu-a100-80
partitiongpu-a100-40
partitionIn addtion, you must specify either the gpu-a100-80
or gpu-a100-40
partitions when using the dev
QoS.
Tip
The generic gpu
partition will not work consistently when using the dev
QoS.
Here is an example job submission script for a 2-node job in the dev
QoS using the gpu-a100-80
partition. Note the use of the gpu_launch.sh
wrapper script to get correct GPU and NIC binding.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Dev_Job\n#SBATCH --time=12:0:0\n#SBATCH --nodes=2\n#SBATCH --tasks-per-node=48\n#SBATCH --cpus-per-task=\n#SBATCH --gres=gpu:4\n#SBATCH --partition=gpu-a100-80\n#SBATCH --qos=dev\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n\nexport OMP_NUM_THREADS=1\nexport OMP_PLACES=cores\n\n# Load the correct modules\nmodule load /home/y07/shared/tursa-modules/setup-env\nmodule load gcc/9.3.0\nmodule load cuda/12.3\nmodule load openmpi/4.1.5-cuda12.3 \n\n# These will need to be changed to match the actual application you are running\napplication=\"my_mpi_openmp_app.x\"\noptions=\"arg 1 arg2\"\n\n# We have reserved the full nodes, now distribute the processes as\n# required: 4 MPI processes per node, stride of 12 cores between \n# MPI processes\n# \n# Note use of gpu_launch.sh wrapper script for GPU and NIC pinning \nsrun --nodes=2 --tasks-per-node=4 --cpus-per-task=12 \\\n --hint=nomultithread --distribution=block:block \\\n gpu_launch.sh \\\n ${application} ${options}\n
"},{"location":"tursa-user-guide/sw-environment/","title":"Software environment","text":"The software environment on Tursa is primarily controlled through the module
command. By loading and switching software modules you control which software and versions are available to you.
Information
A module is a self-contained description of a software package -- it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.
By default, all users on Tursa start with the default software environment loaded.
Software modules on Tursa are provided by both Eviden and by EPCC.
In this section, we provide:
module
commandmodule
command manipulates your environmentmodule
command","text":"We only cover basic usage of the module
command here. For full documentation please see the Linux manual page on modules
The module
command takes a subcommand to indicate what operation you wish to perform. Common subcommands are:
module list [name]
- List modules currently loaded in your environment, optionally filtered by [name]
module avail [name]
- List modules available, optionally filtered by [name]
module savelist
- List module collections available (usually used for accessing different programming environments)module restore name
- Restore the module collection called name
(usually used for setting up a programming environment)module load name
- Load the module called name
into your environmentmodule remove name
- Remove the module called name
from your environmentmodule swap old new
- Swap module new
for module old
in your environmentmodule help name
- Show help information on module name
module show name
- List what module name
actually does to your environmentThese are described in more detail below.
"},{"location":"tursa-user-guide/sw-environment/#information-on-the-available-modules","title":"Information on the available modules","text":"The module list
command will give the names of the modules and their versions you have presently loaded in your environment. By default, you will have no modules loaded when you first log into Tursa
Finding out which software modules are available on the system is performed using the module avail
command. To list all software modules available, use:
[dc-user1@tursa-login1 ~]$ module avail\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles -------------------------------------------\ncuda/11.0.2 openmpi/4.1.1-cuda11.0.2 ucx/1.10.1-cuda11.0.2 \n\n------------------------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles --------------------------------------------\ncuda/11.4.1 openmpi/4.1.1-cuda11.4 ucx/1.12.0-cuda11.4 \n\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles -------------------------------------------\ncuda/11.4.1 openmpi/4.1.1-cuda11.4.1 ucx/1.12.0-cuda11.4.1 \n\n------------------------------------------------ /mnt/lustre/tursafs1/apps/modulefiles -------------------------------------------------\ncuda/11.0.3 dot gcc/9.3.0 module-git module-info modules null openmpi/4.1.1 ucx/1.10.1 use.own xpmem/2.6.5 \n
This will list all the names and versions of the modules available on the service. Not all of them may work in your account though due to, for example, licencing restrictions. You will notice that for many modules we have more than one version, each of which is identified by a version number. One of these versions is the default. As the service develops the default version will change and old versions of software may be deleted.
You can list all the modules of a particular type by providing an argument to the module avail
command. For example, to list all available versions of the OpenMPI library, use:
[dc-user1@tursa-login1 ~]$ module avail openmpi\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles -------------------------------------------\nopenmpi/4.1.1-cuda11.0.2 \n\n------------------------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles --------------------------------------------\nopenmpi/4.1.1-cuda11.4 \n\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles -------------------------------------------\nopenmpi/4.1.1-cuda11.4.1 \n\n----------------\n
The module show
command reveals what operations the module actually performs to change your environment when it is loaded. We provide a brief overview of what the significance of these different settings mean below. For example, for the default openmpi module:
[dc-user1@tursa-login1 ~]$ module show openmpi\n-------------------------------------------------------------------\n/mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles/openmpi/4.1.1-cuda11.0.2:\n\nmodule-whatis Sets up OpenMPI on your environment\nsetenv MPI_ROOT /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/\nprepend-path PATH /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/bin/\nprepend-path LD_LIBRARY_PATH /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/lib\nprepend-path MANPATH /opt/mpi/openmpi/4.0.4.1/share/man\nmodule load ucx/1.10.1\nsetenv OMPI_CC cc\nsetenv OMPI_CXX g++\nsetenv OMPI_CFLAGS -g -m64\nsetenv OMPI_CXXFLAGS -g -m64\n-------------------------------------------------------------------\n
"},{"location":"tursa-user-guide/sw-environment/#loading-removing-and-swapping-modules","title":"Loading, removing and swapping modules","text":"To load a module to use the module load
command. For example, to load the default version of OpenMPI into your environment, use:
[dc-user1@tursa-login1 ~]$ module load openmpi\n\n UCX 1.10 loaded\n\n\n OpenMPI 4.1.1 loaded\n
Once you have done this, your environment will be setup to use the OpenMPI library. The above command will load the default version of OpenMPI. If you need a specific version of the software, you can add more information:
[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4.1\n\n UCX 1.12.0 compiled with cuda 11.4.1 loaded\n\n\n OpenMPI 4.1.1 with cuda-11.4.1 and UCX 1.12.0 loaded\n
will load OpenMPI version 4.1.1 with CUDA 11.4.1 into your environment, regardless of the default.
If you want to remove software from your environment, module rm
will remove a loaded module:
[dc-user1@tursa-login1 ~]$ module rm openmpi\n
will unload what ever version of openmpi
(even if it is not the default) you might have loaded.
There are many situations in which you might want to change the presently loaded version to a different one, such as trying the latest version which is not yet the default or using a legacy version to keep compatibility with old data. This can be achieved most easily by using module swap oldmodule newmodule
.
Suppose you have loaded version 4.1.1 of openmpi
, the following command will change to version 4.1.1-cuda11.4.1:
[dc-user1@tursa-login1 ~]$ module swap openmpi openmpi/4.1.1-cuda11.4.1\n\n UCX 1.12.0 compiled with cuda 11.4.1 loaded\n\n\n OpenMPI 4.1.1 with cuda-11.4.1 and UCX 1.12.0 loaded\n
You did not need to specify the version of the loaded module in your current environment as this can be inferred as it will be the only one you have loaded.
"},{"location":"tursa-user-guide/sw-environment/#capturing-your-environment-for-reuse","title":"Capturing your environment for reuse","text":"Sometimes it is useful to save the module environment that you are using to compile a piece of code or execute a piece of software. This is saved as a module collection. You can save a collection from your current environment by executing:
[dc-user1@tursa-login1 ~]$ module save [collection_name]\n
Note
If you do not specify the environment name, it is called default
.
You can find the list of saved module environments by executing:
[dc-user1@tursa-login1 ~]$ module savelist\nNamed collection list:\n 1) default\n
To list the modules in a collection, you can execute, e.g.,:
[dc-user1@tursa-login1 ~]$ module saveshow default\n-------------------------------------------------------------------\n/home/z01/z01/dc-turn1/.module/default:\n\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/modulefilesintel\nmodule use --append /mnt/lustre/tursafs1/apps/modulefiles\nmodule load ucx/1.12.0-cuda11.4.1\nmodule load openmpi/4.1.1-cuda11.4.1\n\n-------------------------------------------------------------------\n
Note again that the details of the collection have been saved to the home directory (the first line of output above). It is possible to save a module collection with a fully qualified path, e.g.,
[dc-user1@tursa-login1 ~]$ module save /home/t01/z01/auser/my-module-collection\n
if you want to save to a specific file name.
To delete a module environment, you can execute:
[dc-user1@tursa-login1 ~]$ module saverm <environment_name>\n
"},{"location":"tursa-user-guide/sw-environment/#shell-environment-overview","title":"Shell environment overview","text":"When you log in to Tursa, you are using the bash shell by default. As any other software, the bash shell has loaded a set of environment variables that can be listed by executing printenv
or export
.
The environment variables listed before are useful to define the behaviour of the software you run. For instance, OMP_NUM_THREADS
define the number of threads.
To define an environment variable, you need to execute:
export OMP_NUM_THREADS=4\n
Please note there are no blanks between the variable name, the assignation symbol, and the value. If the value is a string, enclose the string in double quotation marks.
You can show the value of a specific environment variable if you print it:
echo $OMP_NUM_THREADS\n
Do not forget the dollar symbol. To remove an environment variable, just execute:
unset OMP_NUM_THREADS\n
"},{"location":"tursa-user-guide/sw-environment/#compiler-environment","title":"Compiler environment","text":"The system supports two different primary compiler environments:
To compile on the system for GPU nodes using the GCC toolchain, you would typically load the required modules:
[dc-user1@tursa-login1 ~]$ module load gcc/9.3.0\n[dc-user1@tursa-login1 ~]$ module load cuda/11.4.1 \n[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4\n[dc-user1@tursa-login1 ~]$ module list\nCurrently Loaded Modulefiles:\n 1) gcc/9.3.0 2) cuda/11.4.1 3) ucx/1.12.0-cuda11.4 4) openmpi/4.1.1-cuda11.4 \n
Once you have loaded the modules, the standard OpenMPI compiler wrapper scripts are available:
mpicc
mpicxx
mpif90
You can find more information on these scripts in the OpenMPI documentation.
"},{"location":"tursa-user-guide/sw-environment/#nvhpc-toolchain","title":"NVHPC toolchain","text":"To compile on the system for GPU nodes using the GCC toolchain, you would typically load the required modules:
[dc-user1@tursa-login1 ~]$ module load /home/y07/shared/tursa-modules/setup-env\n[dc-user1@tursa-login1 ~]$ module load gcc/9.3.0\n[dc-user1@tursa-login1 ~]$ module load nvhpc/21.7-nompi\n[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4\n[dc-user1@tursa-login1 ~]$ module list\nCurrently Loaded Modulefiles:\n 1) /mnt/lustre/tursafs1/home/y07/shared/tursa-modules/setup-env 2) nvhpc/21.7-nompi\n 3) ucx/1.12.0-cuda11.4 4) openmpi/4.1.1-cuda11.4 5) gcc/9.3.0\n
Once you have loaded the modules, the standard OpenMPI compiler wrapper scripts are available:
mpicc
mpicxx
mpif90
and the NVIDIA compilers are available as:
nvcc
nvc++
nvfortran
Tip
Both the NVIDIA compilers and the MPI compiler wrapper scripts will use the GCC compilers directly in the default configuration - this is often what you want. If you want the compiler wrappers to call the NVIDIA compilers themselves rather than GCC directly, you would use:
export OMPI_CC=nvcc\nexport OMPI_CXX=nvc++\nexport OMPI_FC=nvfortran\n
"},{"location":"tursa-user-guide/sw-environment/#other-build-tools","title":"Other build tools","text":""},{"location":"tursa-user-guide/sw-environment/#cmake","title":"cmake","text":"CMake is available by using the commands:
[dc-user1@tursa-login1 ~]$ module load /home/y07/shared/tursa-modules/setup-env\n[dc-user1@tursa-login1 ~]$ module load cmake\n
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index b817ed3..e986db1 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ
diff --git a/tursa-user-guide/scheduler/index.html b/tursa-user-guide/scheduler/index.html
index f27a881..c90a229 100644
--- a/tursa-user-guide/scheduler/index.html
+++ b/tursa-user-guide/scheduler/index.html
@@ -1632,12 +1632,20 @@ dev
QoSdev
QoSdev
QoS