diff --git a/search/search_index.json b/search/search_index.json index 4726c00..d0f6029 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"DiRAC Extreme Scaling User Documentation","text":"

DiRAC Extreme Scaling is part of the DiRAC National HPC Service. You can find more information on the service and the research it supports on the DiRAC website.

The DiRAC Extreme Scaling service is an HPC resource for UK researchers. DiRAC Extreme Scaling is provided by UKRI, EPCC and the University of Edinburgh. The hardware is provided by ATOS.

"},{"location":"#what-the-documentation-covers","title":"What the documentation covers","text":"

The documentation currently includes:

"},{"location":"#contributing-to-the-documentation","title":"Contributing to the documentation","text":"

The source for this documentation is publicly available in the DiRAC Extreme Scaling documentation GitHub repository so that anyone can contribute to improve the documentation for the service. Contributions can be in the form of improvements or additions to the content and/or Issues providing suggestions for how it can be improved.

Full details of how to contribute can be found in the README.md file of the repository.

"},{"location":"tursa-user-guide/","title":"Tursa User Guide","text":"

The Tursa User Guide covers all aspects of use of the Tursa resource. This includes fundamentals (required by all users to use the system effectively), best practice for getting the most out of Tursa and more technical topics.

The Tursa User Guide contains the following sections:

"},{"location":"tursa-user-guide/connecting/","title":"Connecting to Tursa","text":"

On the Tursa system, interactive access can be achieved via SSH, either directly from a command line terminal or using an SSH client. In addition, data can be transferred to and from the Tursa system using scp from the command line or by using a file transfer client.

This section covers the basic connection methods.

Before following the process below, we assume you have set up an account on Tursa through the DiRAC SAFE. Documentation on how to do this can be found at:

"},{"location":"tursa-user-guide/connecting/#command-line-terminal","title":"Command line terminal","text":""},{"location":"tursa-user-guide/connecting/#linux","title":"Linux","text":"

Linux distributions come installed with a terminal application that can be used for SSH access to the login nodes. Linux users will have different terminals depending on their distribution and window manager (e.g. GNOME Terminal in GNOME, Konsole in KDE). Consult your Linux distribution's documentation for details on how to load a terminal.

"},{"location":"tursa-user-guide/connecting/#macos","title":"MacOS","text":"

MacOS users can use the Terminal application, located in the Utilities folder within the Applications folder.

"},{"location":"tursa-user-guide/connecting/#windows","title":"Windows","text":"

A typical Windows installation will not include a terminal client, though there are various clients available. We recommend that all our Windows users download and install MobaXterm to access Tursa. It is very easy to use and includes an integrated X server with SSH client to run any graphical applications on Tursa.

You can download MobaXterm Home Edition (Installer Edition) from the following link:

Double-click the downloaded Microsoft Installer file (.msi), and the Windows wizard will automatically guide you through the installation process. Note that you might need administrator rights to install on some Windows systems. Also check that Windows Firewall has not blocked any features of this program after installation.

Start MobaXterm and then click \"Start local terminal\".

Tips

"},{"location":"tursa-user-guide/connecting/#access-credentials","title":"Access credentials","text":"

To access Tursa, you need to use two credentials:

You can find more detailed instructions on how to set up your credentials to access Tursa from Windows, macOS and Linux below.

"},{"location":"tursa-user-guide/connecting/#ssh-key-pairs","title":"SSH Key Pairs","text":"

You will need to generate an SSH key pair protected by a passphrase to access Tursa.

Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:

$ ssh-keygen -t rsa -C \"your@email.com\"\n...\n-bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\nEnter passphrase (empty for no passphrase): [Passphrase]\nEnter same passphrase again: [Passphrase]\nYour identification has been saved in /Home/user/.ssh/id_rsa.\nYour public key has been saved in /Home/user/.ssh/id_rsa.pub.\nThe key fingerprint is:\n03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\nThe key's randomart image is:\n+--[ RSA 2048]----+\n|    . ...+o++++. |\n| . . . =o..      |\n|+ . . .......o o |\n|oE .   .         |\n|o =     .   S    |\n|.    +.+     .   |\n|.  oo            |\n|.  .             |\n| ..              |\n+-----------------+\n

(remember to replace \"your@email.com\" with your e-mail address).

"},{"location":"tursa-user-guide/connecting/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"

You should now upload the public part of your SSH key pair to the SAFE by following the instructions at:

Log in to SAFE. Then:

  1. Go to the Menu Login accounts and select the Tursa account you want to add the SSH key to
  2. On the subsequent Login account details page click the Add Credential button
  3. Select SSH public key as the Credential Type and click Next
  4. Either copy and paste the public part of your SSH key into the SSH Public key box or use the button to select the public key file on your computer.
  5. Click Add to associate the public SSH key part with your account

Once you have done this, your SSH key will be added to your Tursa account.

Remember, you will need to use both an SSH key and password to log into Tursa so you will also need to collect your initial password before you can log into Tursa. We cover this next.

Note

If you want to connect to Tursa from more than one machine, e.g. from your home laptop as well as your work laptop, you should generate an ssh key on each machine, and add each of the public keys into SAFE.

"},{"location":"tursa-user-guide/connecting/#initial-passwords-up-to-13-feb-2024","title":"Initial passwords (up to 13 Feb 2024)","text":"

The SAFE web interface is used to provide your initial password for logging onto Tursa (see the SAFE Documentation for more details on requesting accounts and picking up passwords).

Note

You may now change your password on the Tursa machine itself using the passwd command or when you are prompted the first time you log in. This change will not be reflected in the SAFE. If you forget your password, you should use the SAFE to request a new one-shot password.

"},{"location":"tursa-user-guide/connecting/#mfa-time-based-one-time-passcode-totp-from-13-feb-2024","title":"MFA Time-based one-time passcode (TOTP) (from 13 Feb 2024)","text":"

You will need to use both an SSH key and time-based one-time passcode to log into Tursa so you will also need to set up a method for generating a TOTP code before you can log into Tursa.

"},{"location":"tursa-user-guide/connecting/#first-login-from-a-new-account-password-required","title":"First login from a new account: password required","text":"

Important

You will not use your password when logging on to Tursa after the first login for a new account.

As an additional security measure, you will also need to use a password from SAFE for your first login to Tursa with a new account. When you log into Tursa for the first time with a new account, you will be prompted to change your initial password. This is a three step process:

  1. When prompted to enter your LDAP password: Enter the password which you retrieved from SAFE
  2. When prompted to enter your new password: type in a new password
  3. When prompted to re-enter the new password: re-enter the new password

Your password has now been changed. You will no longer need this password to log into Tursa from this point forwards; you will use your SSH key and TOTP as described above.

"},{"location":"tursa-user-guide/connecting/#ssh-clients","title":"SSH Clients","text":"

Interaction with Tursa is done remotely, over an encrypted communication channel, Secure Shell version 2 (SSH-2). This allows command-line access to one of the login nodes of Tursa, from which you can run commands or use a command-line text editor to edit files. SSH can also be used to run graphical programs such as GUI text editors and debuggers when used in conjunction with an X client.

"},{"location":"tursa-user-guide/connecting/#logging-in","title":"Logging in","text":"

You can use the following command from the terminal window to log in to Tursa:

ssh username@tursa.dirac.ed.ac.uk\n

You need to enter both credentials correctly to be able to access Tursa.

Tip

If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa) on your local system, you may need to specify the path to the private part of the key with the -i option to ssh. For example, if your key is in a file called keys/id_rsa_Tursa you would use the command ssh -i keys/id_rsa_Tursa username@tursa.dirac.ed.ac.uk to log in.

Tip

When you first log into Tursa, you will be prompted to change your initial password. This is a three step process:

  1. When prompted to enter your LDAP password: Re-enter the password you retrieved from SAFE
  2. When prompted to enter your new password: type in a new password
  3. When prompted to re-enter the new password: re-enter the new password

Your password has now been changed.

To allow remote programs, especially graphical applications, to control your local display (for example, to open up a new GUI window for a debugger), use:

ssh -X username@tursa.dirac.ed.ac.uk\n

Some sites recommend using the -Y flag. While this can fix some compatibility issues, the -X flag is more secure.

Current macOS systems do not have an X window system. Users should install the XQuartz package to allow SSH with X11 forwarding on macOS systems:

"},{"location":"tursa-user-guide/connecting/#making-access-more-convenient-using-the-ssh-configuration-file","title":"Making access more convenient using the SSH configuration file","text":"

Typing in the full command to log in or transfer data to Tursa can become tedious as it often has to be repeated many times. You can use the SSH configuration file, usually located on your local machine at ~/.ssh/config, to make things a bit more convenient.

Each remote site (or group of sites) can have an entry in this file which may look something like:

Host tursa\n  HostName tursa.dirac.ed.ac.uk\n  User username\n

(remember to replace username with your actual username!).

The Host tursa line defines a short name for the entry. In this case, instead of typing ssh username@tursa.dirac.ed.ac.uk to access the Tursa login nodes, you could use ssh tursa instead. The remaining lines define the options for the tursa host.

Now you can use SSH to access Tursa without needing to enter your username or the full hostname every time:

$ ssh tursa\n

You can set up as many of these entries as you need in your local configuration file. Other options are available. See the ssh_config man page (or man ssh_config on any machine with SSH installed) for a description of the SSH configuration file. You may find the IdentityFile option useful if you have to manage multiple SSH key pairs for different systems as this allows you to specify which SSH key to use for each system.
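
For example, a minimal sketch of an entry that uses IdentityFile (the key path ~/.ssh/id_rsa_Tursa is illustrative; use the path to your own private key):

Host tursa\n  HostName tursa.dirac.ed.ac.uk\n  User username\n  IdentityFile ~/.ssh/id_rsa_Tursa\n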

Bug

There is a known bug with Windows ssh-agent. If you get the error message: Warning: agent returned different signature type ssh-rsa (expected rsa-sha2-512), you will need to either specify the path to your ssh key in the command line (using the -i option as described above) or add the path to your SSH config file by using the IdentityFile option.

"},{"location":"tursa-user-guide/connecting/#ssh-debugging-tips","title":"SSH debugging tips","text":"

If you find you are unable to connect via SSH there are a number of ways you can try to diagnose the issue. Some of these are collected below - if you are having difficulties connecting we suggest trying these before contacting the Tursa service desk.

"},{"location":"tursa-user-guide/connecting/#use-the-usertursadiracedacuk-syntax-rather-than-l-user-tursadiracedacuk","title":"Use the user@tursa.dirac.ed.ac.uk syntax rather than -l user tursa.dirac.ed.ac.uk","text":"

We have seen a number of instances where people using the syntax

ssh -l user tursa.dirac.ed.ac.uk\n

have not been able to connect properly and get prompted for a password many times. We have found that using the alternative syntax:

ssh user@tursa.dirac.ed.ac.uk\n

works more reliably. If you are using the -l user option to connect and are seeing issues, then try using user@tursa.dirac.ed.ac.uk instead.

"},{"location":"tursa-user-guide/connecting/#can-you-connect-to-the-login-node","title":"Can you connect to the login node?","text":"

Try the command ping -c 3 tursa.dirac.ed.ac.uk. If you successfully connect to the login node, the output should include:

--- tursa.dirac.ed.ac.uk ping statistics ---\n3 packets transmitted, 3 received, 0% packet loss, time 38ms\n

(the ping time '38ms' is not important). If not all packets are received there could be a problem with your internet connection, or the login node could be unavailable.

"},{"location":"tursa-user-guide/connecting/#password","title":"Password","text":"

If you are having trouble entering your password consider using a password manager, from which you can copy and paste it. This will also help you generate a secure password. If you need to reset your password, instructions for doing so can be found in the SAFE documentation.

Windows users please note that Ctrl+V does not work to paste into PuTTY, MobaXterm, or PowerShell. Instead use Shift+Ins to paste. Alternatively, right-click and select 'Paste' in PuTTY and MobaXterm, or simply right-click to paste in PowerShell.

"},{"location":"tursa-user-guide/connecting/#ssh-key","title":"SSH key","text":"

If you get the error message Permission denied (publickey) this can indicate a problem with your SSH key. Some things to check:

Target Permissions chmod Code Directory drwx------ 700 Private Key -rw------- 600 Public Key -rw-r--r-- 644

chmod can be used to set permissions on the target in the following way: chmod <code> <target>. So for example to set correct permissions on the private key file id_rsa_Tursa one would use the command chmod 600 id_rsa_Tursa.
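
For example, a minimal sketch of setting the recommended permissions from the table above, assuming your key pair is in the default ~/.ssh location:

# restrict the .ssh directory, private key and public key\nchmod 700 ~/.ssh\nchmod 600 ~/.ssh/id_rsa\nchmod 644 ~/.ssh/id_rsa.pub\n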

Tip

Unix file permissions can be understood in the following way. There are three groups that can have file permissions: (owning) users, (owning) groups, and others. The available permissions are read, write, and execute. The first character indicates whether the target is a file -, or directory d. The next three characters indicate the owning user's permissions: the first character is r if they have read permission, - if they don't; the second character is w if they have write permission, - if they don't; the third character is x if they have execute permission, - if they don't. This pattern is then repeated for group, and other permissions. For example the pattern -rw-r--r-- indicates that the owning user can read and write the file, members of the owning group can read it, and anyone else can also read it. The chmod codes are constructed by treating the user, group, and other permission strings as binary numbers, then converting them to decimal. For example the permission string -rwx------ becomes 111 000 000 -> 700.

"},{"location":"tursa-user-guide/connecting/#ssh-verbose-output","title":"SSH verbose output","text":"

Verbose debugging output from ssh can be very useful for diagnosing the issue. In particular, it can be used to distinguish between problems with the SSH key and password - further details are given below. To enable verbose output add the -vvv flag to your SSH command. For example:

ssh -vvv username@tursa.dirac.ed.ac.uk\n

The output is lengthy, but somewhere in there you should see lines similar to the following:

debug1: Next authentication method: keyboard-interactive\ndebug2: userauth_kbdint\ndebug3: send packet: type 50\ndebug2: we sent a keyboard-interactive packet, wait for reply\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 1\nPassword:\ndebug3: send packet: type 61\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 0\ndebug3: send packet: type 61\ndebug3: receive packet: type 51\nAuthenticated with partial success.\ndebug1: Authentications that can continue: publickey,password\n

If you do not see the Password: prompt you may have connection issues, or there could be a problem with the Tursa login nodes. If you do not see Authenticated with partial success it means your password was not accepted. You will be asked to re-enter your password, usually two more times before the connection will be rejected. Consider the suggestions under Password above. If you do see Authenticated with partial success, it means your password was accepted, and your SSH key will now be checked.

You should next see something similar to:

debug1: Next authentication method: publickey\ndebug1: Offering public key: RSA SHA256:<key_hash> <path_to_private_key>\ndebug3: send_pubkey_test\ndebug3: send packet: type 50\ndebug2: we sent a publickey packet, wait for reply\ndebug3: receive packet: type 60\ndebug1: Server accepts key: pkalg rsa-sha2-512 blen 2071\ndebug2: input_userauth_pk_ok: fp SHA256:<key_hash>\ndebug3: sign_and_send_pubkey: RSA SHA256:<key_hash>\nEnter passphrase for key '<path_to_private_key>':\ndebug3: send packet: type 50\ndebug3: receive packet: type 52\ndebug1: Authentication succeeded (publickey).\n

Most importantly, you can see which files ssh has checked for private keys, and you can see if any key is accepted. The line Authentication succeeded indicates that the SSH key has been accepted. By default ssh will go through a list of standard private key files, as well as any you have specified with -i or a config file. This is fine, as long as one of the files mentioned is the one that matches the public key uploaded to SAFE.

If your SSH key passphrase is incorrect, you will be asked to try again up to three times in total, before being disconnected with Permission denied (publickey). If you enter your passphrase correctly, but still see this error message, please consider the advice under SSH key above.

"},{"location":"tursa-user-guide/data/","title":"Data management and transfer","text":"

This section covers best practice and tools for data management on Tursa as well as information on the storage available on the system.

Information

If you have any questions on data management and transfer please do not hesitate to contact the DiRAC service desk at dirac-support@epcc.ed.ac.uk.

"},{"location":"tursa-user-guide/data/#useful-resources-and-links","title":"Useful resources and links","text":""},{"location":"tursa-user-guide/data/#data-management","title":"Data management","text":"

We strongly recommend that you give some thought to how you use the various data storage facilities that are part of the Tursa service. This will not only allow you to use the machine more effectively but also to ensure that your valuable data is protected.

"},{"location":"tursa-user-guide/data/#tursa-storage","title":"Tursa storage","text":"

Tursa has two different storage systems available: the parallel Lustre file system and the tape storage. Both are described below.

"},{"location":"tursa-user-guide/data/#parallel-lustre-file-system","title":"Parallel Lustre file system","text":"

The Tursa storage is provided by a parallel Lustre file system that provides your home directories and working storage. When you log in you will be placed in your home directory.

The home directory for each user is located at:

/home/[project code]/[group code]/[username]\n

where [project code] is your project code, [group code] is your group code within the project, and [username] is your Tursa username.

Each project is allocated a portion of the total storage available, and the project PI will be able to sub-divide this quota among the groups and users within the project. As is standard practice on UNIX and Linux systems, the environment variable $HOME is automatically set to point to your home directory.

"},{"location":"tursa-user-guide/data/#tape-storage","title":"Tape storage","text":"

The tape storage can be made available to any Tursa user on request and can be used to store data from the Lustre parallel file system.

Managing and transferring data to/from the Tursa tape storage is done via the Miria web interface, accessed through an SSH tunnel to the Tursa login nodes.

Important

All data on the tape storage is shared project data rather than data associated with individual user accounts. Any data you move to tape will be visible to all users in the same project as you who have access to the tape storage.

"},{"location":"tursa-user-guide/data/#requesting-access-to-the-tape-storage","title":"Requesting access to the tape storage","text":"

If you want to use the Tursa tape storage, you should contact the DiRAC Service Desk with the username and project ID you want to use to access the storage.

"},{"location":"tursa-user-guide/data/#data-locations","title":"Data locations","text":"

In order to move data to the tape storage it must exist in a specific directory on the Tursa Lustre file system. You will need to move or copy the data to this location before it can be moved to tape and when you restore data from tape it will be placed in this location.

There is one directory per project on Tursa. The directory has the path:

/mnt/lustre/tursafs1/archive/[project code]\n

So, for example, the directory for project dp001 would be:

/mnt/lustre/tursafs1/archive/dp001\n
"},{"location":"tursa-user-guide/data/#setup-the-ssh-tunnel-for-miria","title":"Setup the SSH tunnel for Miria","text":"

Once your tape storage access has been set up and you have moved data to the archive directory, you will need to connect to the Miria web interface in a web browser on your local system by setting up an SSH tunnel to the Tursa login nodes.

You do this by logging into Tursa in the usual way (with your SSH key and password) and adding the -L 9080:10.144.20.95:443 option to the ssh command.

For example, if your username is dc-user1, you would setup the tunnel by logging into Tursa with (assuming your SSH key is in the default location):

ssh -L 9080:10.144.20.95:443 dc-user1@tursa.dirac.ed.ac.uk\n

Enter your SSH key passphrase and password in the usual way.

Note

You will need to set up the SSH tunnel each time you want to access the Miria interface.

"},{"location":"tursa-user-guide/data/#access-the-miria-interface","title":"Access the Miria interface","text":"

Once you have set up the SSH tunnel, you should be able to access the Miria interface in a web browser on your local system. Open a new tab and enter the URL:

You should see an interface asking you for a username and password. Use the username and password that you use to log into Tursa to log into the tape storage interface.

"},{"location":"tursa-user-guide/data/#transfer-data-from-tursa-lustre-to-tape","title":"Transfer data from Tursa Lustre to tape","text":"

You use the \"Easy Move\" option from the left-hand menu to transfer data.

  1. Click on \"Easy Move\"
  2. Click on the \"Find a source\" menu and select the disk with your project ID (e.g. \"dp001\")
  3. Click on the \"Find a target\" menu and select the archive with your project ID (e.g. \"dp001\")
  4. Use the file explorer to select the files/directories you wish to move to tape
  5. Click the \"Add\" button
  6. Scroll to the bottom of the page and select \"Validate basket\" and confirm you wish to proceed

Your transfer request will be added to the queue. You can check on progress by selecting the \"Activity\" option in the left hand menu.

"},{"location":"tursa-user-guide/data/#restore-data-from-tape-to-tursa-lustre","title":"Restore data from tape to Tursa Lustre","text":"

You use the \"Easy Move\" option from the left-hand menu to transfer data.

  1. Click on \"Easy Move\"
  2. Select the \"Repository\" icon next to the \"Find a source\" menu
  3. Click on the \"Find a source\" menu and select the archive with your project ID (e.g. \"dp001\")
  4. Select the \"Platform\" icon next to the \"Find a source\" menu
  5. Click on the \"Find a target\" menu and select the disk with your project ID (e.g. \"dp001\")
  6. Use the source file explorer to select the files/directories you wish to restore
  7. Use the target file explorer to select the location on disk to restore the data to
  8. Click the \"Add\" button
  9. Scroll to the bottom of the page and select \"Validate basket\" and confirm you wish to proceed

Your transfer request will be added to the queue. You can check on progress by selecting the \"Activity\" option in the left hand menu.

Bug

If you restore a file rather than a directory, the Miria tool will give the file the name NULL once it is restored; you should use the mv command to rename the file to the correct name after it has been restored.

"},{"location":"tursa-user-guide/data/#sharing-data-with-other-tursa-users","title":"Sharing data with other Tursa users","text":"

How you share data with other Tursa users depends on whether or not they belong to the same project as you. Each project has two shared folders that can be used for sharing data.

"},{"location":"tursa-user-guide/data/#sharing-data-with-tursa-users-in-your-project","title":"Sharing data with Tursa users in your project","text":"

Each project has an inner shared folder.

/home/[project code]/[project code]/shared\n

This folder has read/write permissions for all project members. You can place any data you wish to share with other project members in this directory. For example, if your project code is x01 the inner shared folder would be located at /home/x01/x01/shared.

"},{"location":"tursa-user-guide/data/#sharing-data-with-all-tursa-users","title":"Sharing data with all Tursa users","text":"

Each project also has an outer shared folder:

/home/[project code]/shared\n

It is writable by all project members and readable by any user on the system. You can place any data you wish to share with other Tursa users who are not members of your project in this directory. For example, if your project code is x01 the outer shared folder would be located at /home/x01/shared.

"},{"location":"tursa-user-guide/data/#permissions","title":"Permissions","text":"

You should check the permissions of any files that you place in the shared area, especially if those files were created in your own Tursa account. Such files are likely to be readable by you only.

The chmod command below shows how to make sure that a file placed in the outer shared folder is also readable by all Tursa users.

chmod a+r /home/x01/shared/your-shared-file.txt\n

Similarly, for the inner shared folder, chmod can be called such that read permission is granted to all users within the x01 project.

chmod g+r /home/x01/x01/shared/your-shared-file.txt\n

If you're sharing a set of files stored within a folder hierarchy the chmod is slightly more complicated.

chmod -R a+Xr /home/x01/shared/my-shared-folder\nchmod -R g+Xr /home/x01/x01/shared/my-shared-folder\n

The -R option ensures that the read permission is enabled recursively and the +X guarantees that the user(s) you're sharing the folder with can access the subdirectories below my-shared-folder.

"},{"location":"tursa-user-guide/data/#archiving-and-data-transfer","title":"Archiving and data transfer","text":"

Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.

The method you use to transfer data to/from Tursa will depend on how much you want to transfer and where to. The methods we cover in this guide are the SSH-based tools scp and rsync.

Before discussing specific data transfer methods, we cover archiving which is an essential process for transferring data efficiently.

"},{"location":"tursa-user-guide/data/#archiving","title":"Archiving","text":"

If you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger \"archive\" file for ease of transfer and manipulation. A single large file makes more efficient use of the file system and is easier to move, copy and transfer because significantly fewer metadata operations are required. Archive files can be created using tools like tar and zip.

"},{"location":"tursa-user-guide/data/#tar","title":"tar","text":"

The tar command packs files into a \"tape archive\" format. The command has general form:

tar [options] [file(s)]\n

Common options include -c (create a new archive), -v (verbose output), -W (attempt to verify the archive after writing it), -l (check that hard links are preserved; GNU tar), and -f (specify the archive file name).

Putting these together:

tar -cvWlf mydata.tar mydata\n

will create and verify an archive.

To extract files from a tar file, the option -x is used. For example:

tar -xf mydata.tar\n

will recover the contents of mydata.tar to the current working directory.

To verify an existing tar file against a set of data, the -d (diff) option can be used. By default, no output will be given if a verification succeeds and an example of a failed verification follows:

$> tar -df mydata.tar mydata/*\nmydata/damaged_file: Mod time differs\nmydata/damaged_file: Size differs\n

Note

tar files do not store checksums with their data, requiring the original data to be present during verification.

Tip

Further information on using tar can be found in the tar manual (accessed via man tar).

"},{"location":"tursa-user-guide/data/#zip","title":"zip","text":"

The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:

zip [options] mydata.zip [file(s)]\n

Common options are -r (recurse into directories) and -0 (store files without compression).

Together:

zip -0r mydata.zip mydata\n

will create an archive.

Note

Unlike tar, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system layout.

The corresponding unzip command is used to extract data from the archive. The simplest use case is:

unzip mydata.zip\n

which recovers the contents of the archive to the current working directory.

Files in a zip archive are stored with a CRC checksum to help detect data loss. unzip provides options for verifying this checksum against the stored files. The relevant flag is -t and is used as follows:

$> unzip -t mydata.zip\nArchive:  mydata.zip\n    testing: mydata/                 OK\n    testing: mydata/file             OK\nNo errors detected in compressed data of mydata.zip.\n

Tip

Further information on using zip can be found in the zip manual (accessed via man zip).

"},{"location":"tursa-user-guide/data/#data-transfer-via-ssh","title":"Data transfer via SSH","text":"

The easiest way of transferring data to/from Tursa is to use one of the standard programs based on the SSH protocol such as scp, sftp or rsync. These all use the same underlying mechanism (SSH) as you normally use to log in to Tursa. So, once the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine (Tursa in this case).

To avoid having to type in your password multiple times you can set up a SSH key pair and use an SSH agent as documented in the User Guide at connecting.

"},{"location":"tursa-user-guide/data/#ssh-data-transfer-performance-considerations","title":"SSH data transfer performance considerations","text":"

The SSH protocol encrypts all traffic it sends. This means that file transfer using SSH consumes a relatively large amount of CPU time at both ends of the transfer (for encryption and decryption). The Tursa login nodes have fairly fast processors that can sustain about 100 MB/s transfer. The encryption algorithm used is negotiated between the SSH client and the SSH server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm than the default. The aes128-ctr or aes256-ctr algorithms are well supported and fast as they are implemented in hardware. These are not usually the default choice when using scp so you will need to manually specify them.

A single SSH based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth so you may see an overall improvement by running several data transfer operations in parallel. To reduce metadata interactions it is a good idea to overlap transfers of files from different directories.
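
As a sketch, assuming two illustrative directories dir1 and dir2 on your local machine (and user as a placeholder for your Tursa username), two transfers from different directories could be overlapped by running them in the background:

# start two transfers from different directories in parallel (destination defaults to your home directory)\nrsync -a dir1 user@tursa.dirac.ed.ac.uk: &\nrsync -a dir2 user@tursa.dirac.ed.ac.uk: &\n\n# wait for both background transfers to finish\nwait\n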

In addition, you should consider the following when transferring data:

"},{"location":"tursa-user-guide/data/#scp","title":"scp","text":"

The scp command creates a copy of a file, or if given the -r flag, a directory either from a local machine onto a remote machine or from a remote machine onto a local machine.

For example, to transfer files to Tursa from a local machine:

scp [options] source user@tursa.dirac.ed.ac.uk:[destination]\n

(Remember to replace user with your Tursa username in the example above.)

In the above example, the [destination] is optional, as when left out scp will copy the source into your home directory. Also, the source should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.

If you want to request a different encryption algorithm add the -c [algorithm-name] flag to the scp options. For example, to use the (usually faster) aes128-ctr encryption algorithm you would use:

scp [options] -c aes128-ctr source user@tursa.dirac.ed.ac.uk:[destination]\n

(Remember to replace user with your Tursa username in the example above.)

"},{"location":"tursa-user-guide/data/#rsync","title":"rsync","text":"

The rsync command can also transfer data between hosts using a ssh connection. It creates a copy of a file or, if given the -r flag, a directory at the given destination, similar to scp above.

Given the -a option, rsync can also make exact copies (including permissions); this is referred to as mirroring. In this case the rsync command is executed with ssh to create the copy on a remote machine.

To transfer files to Tursa using rsync with ssh the command has the form:

rsync [options] -e ssh source user@tursa.dirac.ed.ac.uk:[destination]\n

(Remember to replace user with your Tursa username in the example above.)

In the above example, the [destination] is optional, as when left out rsync will copy the source into your home directory. Also the source should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.

Additional flags can be specified for the underlying ssh command by using a quoted string as the argument of the -e flag, e.g.

rsync [options] -e \"ssh -c arcfour\" source user@tursa.dirac.ed.ac.uk:[destination]\n

(Remember to replace user with your Tursa username in the example above.)

Tip

Further information on using rsync can be found in the rsync manual (accessed via man rsync).

"},{"location":"tursa-user-guide/data/#ssh-data-transfer-example-laptopworkstation-to-tursa","title":"SSH data transfer example: laptop/workstation to Tursa","text":"

Here we have a short example demonstrating transfer of data directly from a laptop/workstation to Tursa.

Note

This guide assumes you are using a command line interface to transfer data. This means the terminal on Linux or macOS, or the MobaXterm local terminal or PowerShell on Windows.

Before we can transfer data to Tursa we need to make sure we have an SSH key set up to access Tursa from the system we are transferring data from. If you are using the same system that you use to log into Tursa then you should be all set. If you want to use a different system you will need to generate a new SSH key there (or use SSH key forwarding) to allow you to connect to Tursa.

Tip

Remember that you will need to use both a key and your password to transfer data to Tursa.

Once we know our keys are set up correctly, we are ready to transfer data directly between the two machines. We begin by combining our important research data into a single archive file using the following command:

tar -czf all_my_files.tar.gz file1.txt file2.txt file3.txt\n

We then initiate the data transfer from our system to Tursa, here using rsync to allow the transfer to be recommenced without needing to start again, in the event of a loss of connection or other failure. For example, using the SSH key in the file ~/.ssh/id_RSA_A2 on our local system:

rsync -Pv -e\"ssh -c aes128-gcm@openssh.com -i $HOME/.ssh/id_RSA_A2\" ./all_my_files.tar.gz otbz19@tursa.dirac.ed.ac.uk:/home/z19/z19/otbz19/\n

Note the use of the -P flag to allow partial transfer; the same command could be used to restart the transfer after a loss of connection. The -e flag allows specification of the ssh command - we have used this to add the location of the identity file. The -c option specifies the cipher to be used as aes128-gcm, which has been found to increase performance. Unfortunately the ~ shortcut is not correctly expanded, so we have specified the full path. We move our research archive to our home directory on Tursa.

Note

Remember to replace otbz19 with your username on Tursa.

If we were unconcerned about being able to restart an interrupted transfer, we could instead use the scp command,

scp -c aes128-gcm@openssh.com -i ~/.ssh/id_RSA_A2 all_my_files.tar.gz otbz19@tursa.dirac.ed.ac.uk:/home/z19/z19/otbz19/\n

but rsync is recommended for larger transfers.

"},{"location":"tursa-user-guide/hardware/","title":"ARCHER2 hardware","text":"

Note

Some of the material in this section is closely based on information provided by NASA as part of the documentation for the Aitken HPC system.

"},{"location":"tursa-user-guide/hardware/#system-overview","title":"System overview","text":"

Tursa is an Eviden supercomputing system which has a total of 178 GPU compute nodes. Each GPU compute node has 48 CPU cores and 4 NVIDIA A100 GPUs. Compute nodes are connected together by an InfiniBand interconnect.

There are additional login nodes, which provide access to the system.

Compute nodes are only accessible via the Slurm job scheduling system.

There is a single file system which is available on login and compute nodes (see Data management and transfer).

The Lustre file system has a capacity of 5.1 PiB.

The interconnect uses a Fat Tree topology.

"},{"location":"tursa-user-guide/hardware/#interconnect-details","title":"Interconnect details","text":"

Tursa has a high performance interconnect with 4x 200 Gb/s InfiniBand interfaces per node. It uses a 2-layer fat tree topology:

"},{"location":"tursa-user-guide/scheduler/","title":"Running jobs on Tursa","text":"

As with most HPC services, Tursa uses a scheduler to manage access to resources and ensure that the thousands of different users of the system are able to share it and all get access to the resources they require. Tursa uses the Slurm software to schedule jobs.

Writing a submission script is typically the most convenient way to submit your job to the scheduler. Example submission scripts (with explanations) for the most common job types are provided below.

Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.

Hint

If you have any questions on how to run jobs on Tursa do not hesitate to contact the DiRAC Service Desk.

You typically interact with Slurm by issuing Slurm commands from the login nodes (to submit, check and cancel jobs), and by specifying Slurm directives that describe the resources required for your jobs in job submission scripts.

"},{"location":"tursa-user-guide/scheduler/#resources","title":"Resources","text":""},{"location":"tursa-user-guide/scheduler/#gpuh","title":"GPUh","text":"

Time used on Tursa nodes is measured in GPUh. 1 GPUh = 1 GPU for 1 hour. So a Tursa compute node with 4 GPUs would cost 4 GPUh per hour.
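
For example, a job that runs on 2 full GPU nodes for 6 hours is charged:

2 nodes x 4 GPUs/node x 6 hours = 48 GPUh\n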

Note

The minimum resource request on Tursa is one full node which is charged at a rate of 4 GPUh per hour.

"},{"location":"tursa-user-guide/scheduler/#checking-available-budget","title":"Checking available budget","text":"

You can check in SAFE by selecting Login accounts from the menu and selecting the login account you want to query.

Under Login account details you will see each of the budget codes you have access to listed (e.g. dp123 resources) and then, under Resource Pool to the right of this, a note of the remaining budgets.

When logged in to the machine you can also use the command

sacctmgr show assoc where user=$LOGNAME format=account,user,maxtresmins%75\n

This will list all the budget codes that you have access to e.g.

Account       User                                                                 MaxTRESMins \n---------- ---------- --------------------------------------------------------------------------- \n       t01   dc-user1                           gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0 \n       z01   dc-user1   \n

This shows that dc-user1 is a member of budgets t01 and z01. However, the gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0 indicates that the t01 budget can only run GPU jobs in standard (charged) partitions (all other options are disabled, indicated by =0 for CPU standard, CPU low and GPU low). This user can also submit jobs to any partition using the z01 budget.

To see the number of coreh or GPUh remaining you must check in SAFE.

"},{"location":"tursa-user-guide/scheduler/#charging","title":"Charging","text":"

Jobs run on Tursa are charged for the time they use i.e. from the time the job begins to run until the time the job ends (not the full wall time requested).

Jobs are charged for the full number of nodes which are requested, even if they are not all used.

Charging takes place at the time the job ends, and the job is charged in full to the budget which is live at the end time.

"},{"location":"tursa-user-guide/scheduler/#basic-slurm-commands","title":"Basic Slurm commands","text":"

There are four key commands used to interact with Slurm on the command line: sinfo, sbatch, squeue and scancel.

We cover each of these commands in more detail below.

"},{"location":"tursa-user-guide/scheduler/#sinfo-information-on-resources","title":"sinfo: information on resources","text":"

sinfo is used to query information about available resources and partitions. Without any options, sinfo lists the status of all resources and partitions, e.g.

[dc-user1@tursa-login1 ~]$ sinfo \n\nPARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST\ncpu          up 2-00:00:00      4  alloc tu-c0r0n[66-69]\ncpu          up 2-00:00:00      2   idle tu-c0r0n[70-71]\ngpu          up 2-00:00:00      1   plnd tu-c0r2n93\ngpu          up 2-00:00:00     11  drain tu-c0r0n75,tu-c0r5n[48,51,54,57],tu-c0r6n[48,51,54,57],tu-c0r7n[00,48]\ngpu          up 2-00:00:00    112    mix tu-c0r0n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,72,87,90],tu-c0r1n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90,93],tu-c0r2n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90],tu-c0r3n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,90,93],tu-c0r4n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,81,84,87,90,93]\ngpu          up 2-00:00:00     56   resv tu-c0r0n93,tu-c0r4n78,tu-c0r5n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45],tu-c0r6n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,60,63,66,69],tu-c0r7n[03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,51,54,57]\ngpu          up 2-00:00:00      1   idle tu-c0r3n87\n
"},{"location":"tursa-user-guide/scheduler/#sbatch-submitting-jobs","title":"sbatch: submitting jobs","text":"

sbatch is used to submit a job script to the job submission system. The script will typically contain one or more mpirun commands to launch parallel tasks.

When you submit the job, the scheduler provides the job ID, which is used to identify this job in other Slurm commands and when looking at resource usage in SAFE.

sbatch test-job.slurm\nSubmitted batch job 12345\n
"},{"location":"tursa-user-guide/scheduler/#squeue-monitoring-jobs","title":"squeue: monitoring jobs","text":"

squeue without any options or arguments shows the current status of all jobs known to the scheduler. For example:

squeue\n

will list all jobs on Tursa.

The output of this is often large. You can restrict the output to just your jobs by adding the --me option:

squeue --me\n
"},{"location":"tursa-user-guide/scheduler/#scancel-deleting-jobs","title":"scancel: deleting jobs","text":"

scancel is used to delete a job from the scheduler. If the job is waiting to run it is simply cancelled; if it is a running job then it is stopped immediately. You need to provide the job ID of the job you wish to cancel/stop. For example:

scancel 12345\n

will cancel (if waiting) or stop (if running) the job with ID 12345.

"},{"location":"tursa-user-guide/scheduler/#resource-limits","title":"Resource Limits","text":"

The Tursa resource limits for any given job are covered by three separate attributes: the primary resource (compute nodes), the partition, and the Quality of Service (QoS). These are described below.

"},{"location":"tursa-user-guide/scheduler/#primary-resource","title":"Primary resource","text":"

The primary resource you can request for your job is the compute node.

Information

The --exclusive option is enforced on Tursa, which means you will always have access to all of the memory on the compute node regardless of how many processes are actually running on the node.

Note

You will not generally have access to the full amount of memory resource on the node as some is retained for running the operating system and other system processes.

"},{"location":"tursa-user-guide/scheduler/#partitions","title":"Partitions","text":"

On Tursa, compute nodes are grouped into partitions. You will have to specify a partition using the --partition option in your Slurm submission script. The following table has a list of active partitions on Tursa.

Partition Description Max nodes available cpu CPU nodes with AMD EPYC 48-core processor \u00d7 2 6 gpu GPU nodes with AMD EPYC 48-core processor and NVIDIA A100 GPU \u00d7 4 (this includes both A100-40 and A100-80 GPU) 181 gpu-a100-40 GPU nodes with 2 AMD EPYC 16-core processors and NVIDIA A100-40 GPU \u00d7 4 114 gpu-a100-80 GPU nodes with 2 AMD EPYC 24-core processor (3 nodes have 2 AMD EPYC 16-core processors) and NVIDIA A100-80 GPU \u00d7 4 67

You can list the active partitions by running sinfo.

Tip

You may not have access to all the available partitions.

"},{"location":"tursa-user-guide/scheduler/#quality-of-service-qos","title":"Quality of Service (QoS)","text":"

On Tursa, job limits are defined by the requested Quality of Service (QoS), as specified by the --qos Slurm directive. The following table lists the active QoS on Tursa.

QoS Max Nodes Per Job Max Walltime Jobs Queued Jobs Running Partition(s) Notes standard 64 48 hrs 32 16 gpu, gpu-a100-40, gpu-a100-80, cpu Only job sizes that are powers of 2 nodes are allowed (i.e. 1, 2, 4, 8, 16, 32, 64 nodes), only available when your budget is positive. low 64 24 hrs 4 4 gpu, gpu-a100-40, gpu-a100-80, cpu Only job sizes that are powers of 2 nodes are allowed (i.e. 1, 2, 4, 8, 16, 32, 64 nodes), only available when your budget is zero or negative. dev 2 4 hrs 2 1 gpu For faster turnaround for development jobs and interactive sessions, only available when your budget is positive. The dev QoS must be used with the gpu-a100-40 (1-node maximum) or gpu-a100-80 (2-node maximum) partitions.

You can find out the QoS that you can use by running the following command:

sacctmgr show assoc user=$USER cluster=tursa format=cluster,account,user,qos%50\n

As long as you have a positive budget, you should use the standard QoS. Once you have exhausted your budget you can use the low QoS to continue to run jobs at a lower priority than jobs in the standard QoS.

Hint

If you have needs which do not fit within the current QoS, please contact the Service Desk and we can discuss how to accommodate your requirements.

Important

Only job sizes that are powers of 2 nodes are allowed, i.e. 1, 2, 4, 8, 16, 32, 64 nodes on the gpu partition and 1, 2, 4 nodes on the cpu partition.

"},{"location":"tursa-user-guide/scheduler/#priority","title":"Priority","text":"

Job priority on Tursa depends on a number of different factors: the QoS the job uses, the length of time the job has been waiting in the queue (age), and the fairshare of the submitting account.

Each of these factors is normalised to a value between 0 and 1, is multiplied with a weight and the resulting values combined to produce a priority for the job. The current job priority formula on Tursa is:

Priority = [10000 * P(QoS)] + [500 * P(Age)] + [300 * P(Fairshare)]\n

The priority factors are:

You can view the priorities for current queued jobs on the system with the sprio command:

[dc-user1@tursa-login1 ~]$ sprio \n          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE        QOS\n          43963 gpu             5055          0         51          5       5000\n          43975 gpu             5061          0         41         20       5000\n          43976 gpu             5061          0         41         20       5000\n          43982 gpu             5046          0         26         20       5000\n          43986 gpu             5011          0          6          5       5000\n          43996 gpu             5020          0          0         20       5000\n          43997 gpu             5020          0          0         20       5000\n
"},{"location":"tursa-user-guide/scheduler/#troubleshooting","title":"Troubleshooting","text":""},{"location":"tursa-user-guide/scheduler/#slurm-error-messages","title":"Slurm error messages","text":"

An incorrect submission will cause Slurm to return an error. Some common problems are listed below, with a suggestion about the likely cause:

"},{"location":"tursa-user-guide/scheduler/#slurm-queued-reasons","title":"Slurm queued reasons","text":"

The squeue command allows users to view information for jobs managed by Slurm. Jobs typically go through the following states: PENDING, RUNNING, COMPLETING, and COMPLETED. The first table provides a description of some job state codes. The second table provides a description of the reasons that cause a job to be in a state.

Status Code Description PENDING PD Job is awaiting resource allocation. RUNNING R Job currently has an allocation. SUSPENDED S Job has an allocation, but execution has been suspended. COMPLETING CG Job is in the process of completing. Some processes on some nodes may still be active. COMPLETED CD Job has terminated all processes on all nodes with an exit code of zero. TIMEOUT TO Job terminated upon reaching its time limit. STOPPED ST Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job. OUT_OF_MEMORY OOM Job experienced out of memory error. FAILED F Job terminated with non-zero exit code or other failure condition. NODE_FAIL NF Job terminated due to failure of one or more allocated nodes. CANCELLED CA Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.

For a full list, see Job State Codes.

Reason Description Priority One or more higher priority jobs exist for this partition or advanced reservation. Resources The job is waiting for resources to become available. BadConstraints The job's constraints can not be satisfied. BeginTime The job's earliest start time has not yet been reached. Dependency This job is waiting for a dependent job to complete. Licenses The job is waiting for a license. WaitingForScheduling No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. Prolog Its PrologSlurmctld program is still running. JobHeldAdmin The job is held by a system administrator. JobHeldUser The job is held by the user. JobLaunchFailure The job could not be launched. This may be due to a file system problem, invalid program name, etc. NonZeroExitCode The job terminated with a non-zero exit code. InvalidAccount The job's account is invalid. InvalidQOS The job's QOS is invalid. QOSUsageThreshold Required QOS threshold has been breached. QOSJobLimit The job's QOS has reached its maximum job count. QOSResourceLimit The job's QOS has reached some resource limit. QOSTimeLimit The job's QOS has reached its time limit. NodeDown A node required by the job is down. TimeLimit The job exhausted its time limit. ReqNodeNotAvail Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's \"reason\" field as \"UnavailableNodes\". Such nodes will typically require the intervention of a system administrator to make available.

For a full list, see Job Reasons.

"},{"location":"tursa-user-guide/scheduler/#output-from-slurm-jobs","title":"Output from Slurm jobs","text":"

Slurm places standard output (STDOUT) and standard error (STDERR) for each job in the file slurm-<JobID>.out. This file appears in the job's working directory once your job starts running.

Hint

Output may be buffered - to enable live output, e.g. for monitoring job status, add --unbuffered to the srun command in your Slurm script.
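
As a sketch, based on the srun command used in the example job script later in this section, this might look like:

srun --unbuffered --hint=nomultithread --distribution=block:block \\\n     gpu_launch.sh \\\n     ${application} ${options}\n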

"},{"location":"tursa-user-guide/scheduler/#specifying-resources-in-job-scripts","title":"Specifying resources in job scripts","text":"

You specify the resources you require for your job using directives at the top of your job submission script using lines that start with the directive #SBATCH.

Hint

Most options provided using #SBATCH directives can also be specified as command line options to srun.

If you do not specify any options, then the default for each option will be applied. As a minimum, all job submissions must specify the budget that they wish to charge the job to using the --account=[budget code] option (for example, --account=t01).

Other common options that are used include the maximum walltime (--time) and the job name (--job-name), as can be seen in the example job scripts below.

In addition, parallel jobs will also need to specify how many nodes, parallel processes and threads they require.

If you are happy to have any GPU type for your job (A100-40 or A100-80) then you select the gpu partition; a sketch of the corresponding Slurm directives for all three cases is given below.

If you wish to use just the A100-80 GPU nodes, which have higher memory, you select the gpu-a100-80 partition.

To use just the A100-40 GPU nodes, you select the gpu-a100-40 partition.

If you do not specify a partition, the scheduler may use any available node types for the job (equivalent of --partition=gpu).
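
A minimal sketch of the corresponding Slurm directives for the three cases above (the partition names are taken from the table in the Partitions section):

#SBATCH --partition=gpu          # any GPU node (A100-40 or A100-80)\n#SBATCH --partition=gpu-a100-80  # A100-80 (higher memory) nodes only\n#SBATCH --partition=gpu-a100-40  # A100-40 nodes only\n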

Note

For parallel jobs, Tursa operates in a node exclusive way. This means that you are assigned resources in units of full compute nodes for your jobs (i.e. 32 cores and 4 GPUs on A100-40 GPU nodes, 48 cores and 4 GPUs on A100-80 nodes, 128 cores on CPU nodes) and that no other user can share those compute nodes with you. Hence, the minimum amount of resource you can request for a parallel job is 1 node (i.e. 32 cores and 4 GPUs on A100-40 GPU nodes, 48 cores and 4 GPUs on A100-80 nodes, 128 cores on CPU nodes).

To prevent the behaviour of batch scripts being dependent on the user environment at the point of submission, you can use the --export=none option.

Using --export=none means that the behaviour of batch submissions should be repeatable. We strongly recommend its use, although see the following section for how to enable access to the usual modules.
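
A minimal sketch, assuming the option is given as a directive at the top of the job script (it can also be passed directly to sbatch on the command line):

#SBATCH --export=none\n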

"},{"location":"tursa-user-guide/scheduler/#gpu-frequency","title":"GPU frequency","text":"

Important

The default GPU frequency on Tursa compute nodes was changed from 1410 MHz to 1040 MHz on Thursday 15 Dec 2022 to improve the energy efficiency of the service.

Users can control the GPU frequency in their job submission scripts:
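
As a sketch, assuming the standard Slurm --gpu-freq option is used for this purpose (here requesting the previous default of 1410 MHz), it could be added to the srun line in your job script:

srun --gpu-freq=1410 --hint=nomultithread --distribution=block:block \\\n     gpu_launch.sh \\\n     ${application} ${options}\n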

Bug

When setting the GPU frequency you will see an error in the output from the job that says control disabled. This is an incorrect message due to an issue with how Slurm sets the GPU frequency and can be safely ignored.

"},{"location":"tursa-user-guide/scheduler/#srun-launching-parallel-jobs","title":"srun: Launching parallel jobs","text":"

If you are running parallel jobs, your job submission script should contain one or more srun commands to launch the parallel executable across the compute nodes. In most cases you will want to add the options --distribution=block:block and --hint=nomultithread to your srun command to ensure you get the correct pinning of processes to cores on a compute node.

A brief explanation of these options: --hint=nomultithread means do not use hyperthreads/SMT; --distribution=block:block means use a block distribution of processes across nodes (i.e. fill nodes before moving onto the next one) and a block distribution of processes across \"sockets\" within a node (i.e. fill a \"socket\" before moving on to the next one).
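
Putting these options together, a typical launch line looks like the following (the application name and arguments are placeholders; the full example scripts below show this in context):

srun --hint=nomultithread --distribution=block:block \\\n     gpu_launch.sh \\\n     ./my_app.x arg1 arg2\n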

Important

The Slurm definition of a \"socket\" does not usually correspond to a physical CPU socket. On Tursa GPU nodes it corresponds to half the cores on a socket as the GPU nodes are configured with NPS2.

On the Tursa CPU nodes, the Slurm definition of a socket does correspond to a physical CPU socket (64 cores) as the CPU nodes are configured with NPS1.

"},{"location":"tursa-user-guide/scheduler/#example-job-submission-scripts","title":"Example job submission scripts","text":""},{"location":"tursa-user-guide/scheduler/#example-job-submission-script-for-a-parallel-job-using-cuda","title":"Example: job submission script for a parallel job using CUDA","text":"

A job submission script for a parallel job that uses 4 compute nodes, 4 MPI processes per node and 4 GPUs per node. It does not restrict what type of GPU the job can run on so both A100-40 and A100-80 can be used:

#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Grid_job\n#SBATCH --time=12:0:0\n#SBATCH --nodes=4\n#SBATCH --tasks-per-node=4\n#SBATCH --cpus-per-task=8\n#SBATCH --gres=gpu:4\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]             \n\n# Load the correct modules\nmodule load /home/y07/shared/tursa-modules/setup-env\nmodule load gcc/9.3.0\nmodule load cuda/12.3\nmodule load openmpi/4.1.5-cuda12.3 \n\nexport OMP_NUM_THREADS=8\n\n# These will need to be changed to match the actual application you are running\napplication=\"my_mpi_openmp_app.x\"\noptions=\"arg 1 arg2\"\n\nsrun --hint=nomultithread --distribution=block:block \\\n     gpu_launch.sh \\\n     ${application} ${options}\n

This will run your executable \"my_mpi_openmp_app.x\" in parallel using 16 MPI processes on 4 nodes. 4 GPUs will be used per node.

Important

You must use the gpu_launch.sh wrapper script to get the correct binding of GPU to MPI processes and of network interface to GPU and MPI process. This script is described in more detail below.

"},{"location":"tursa-user-guide/scheduler/#gpu_launchsh-wrapper-script","title":"gpu_launch.sh wrapper script","text":"

The gpu_launch.sh wrapper script is required to set the correct binding of GPU to MPI processes and the correct binding of interconnect interfaces to MPI process and GPU. We provide this centrally for convenience but its contents are simple:

#!/bin/bash\n\n# Compute the node-local rank for binding the process to a GPU and NIC\nlrank=$((SLURM_PROCID % SLURM_NTASKS_PER_NODE))\n\n# Bind the process to the correct GPU and NIC\nexport CUDA_VISIBLE_DEVICES=${lrank}\nexport UCX_NET_DEVICES=mlx5_${lrank}:1\n\n$@\n
"},{"location":"tursa-user-guide/scheduler/#using-the-dev-qos","title":"Using the dev QoS","text":"

The dev QoS is designed for faster turnaround of short jobs than is usually available through the production QoS. It is subject to a number of restrictions:

In addition, you must specify either the gpu-a100-80 or gpu-a100-40 partitions when using the dev QoS.

Tip

The generic gpu partition will not work consistently when using the dev QoS.

Here is an example job submission script for a 2-node job in the dev QoS using the gpu-a100-80 partition. Note the use of the gpu_launch.sh wrapper script to get correct GPU and NIC binding.

#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Dev_Job\n#SBATCH --time=12:0:0\n#SBATCH --nodes=2\n#SBATCH --tasks-per-node=4\n#SBATCH --cpus-per-task=12\n#SBATCH --gres=gpu:4\n#SBATCH --partition=gpu-a100-80\n#SBATCH --qos=dev\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n\nexport OMP_NUM_THREADS=1\n\n# Load the correct modules\nmodule load /home/y07/shared/tursa-modules/setup-env\nmodule load gcc/9.3.0\nmodule load cuda/12.3\nmodule load openmpi/4.1.5-cuda12.3 \n\n# These will need to be changed to match the actual application you are running\napplication=\"my_mpi_openmp_app.x\"\noptions=\"arg 1 arg2\"\n\nsrun --hint=nomultithread --distribution=block:block \\\n     gpu_launch.sh \\\n     ${application} ${options}\n
"},{"location":"tursa-user-guide/sw-environment/","title":"Software environment","text":"

The software environment on Tursa is primarily controlled through the module command. By loading and switching software modules you control which software and versions are available to you.

Information

A module is a self-contained description of a software package -- it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.

By default, all users on Tursa start with the default software environment loaded.

Software modules on Tursa are provided by both Eviden and by EPCC.

In this section, we provide:

"},{"location":"tursa-user-guide/sw-environment/#using-the-module-command","title":"Using the module command","text":"

We only cover basic usage of the module command here. For full documentation please see the Linux manual page on modules.

The module command takes a subcommand to indicate what operation you wish to perform. Common subcommands are:
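
As a quick reference, the subcommands demonstrated in the remainder of this section are summarised below (this summary is illustrative; see the module manual page for the full list):

module list          # show currently loaded modules\nmodule avail         # show all modules available on the system\nmodule show <name>   # show what loading a module does to your environment\nmodule load <name>   # load a module\nmodule rm <name>     # unload a module\nmodule swap <old> <new>   # swap one loaded module for another\nmodule save [name]   # save the current set of modules as a collection\n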

These are described in more detail below.

"},{"location":"tursa-user-guide/sw-environment/#information-on-the-available-modules","title":"Information on the available modules","text":"

The module list command lists the names and versions of the modules you currently have loaded in your environment. By default, you will have no modules loaded when you first log into Tursa.

Finding out which software modules are available on the system is performed using the module avail command. To list all software modules available, use:

[dc-user1@tursa-login1 ~]$ module avail\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles -------------------------------------------\ncuda/11.0.2  openmpi/4.1.1-cuda11.0.2  ucx/1.10.1-cuda11.0.2  \n\n------------------------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles --------------------------------------------\ncuda/11.4.1  openmpi/4.1.1-cuda11.4  ucx/1.12.0-cuda11.4  \n\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles -------------------------------------------\ncuda/11.4.1  openmpi/4.1.1-cuda11.4.1  ucx/1.12.0-cuda11.4.1  \n\n------------------------------------------------ /mnt/lustre/tursafs1/apps/modulefiles -------------------------------------------------\ncuda/11.0.3  dot  gcc/9.3.0  module-git  module-info  modules  null  openmpi/4.1.1  ucx/1.10.1  use.own  xpmem/2.6.5   \n

This will list all the names and versions of the modules available on the service. Not all of them may work in your account though due to, for example, licensing restrictions. You will notice that for many modules we have more than one version, each of which is identified by a version number. One of these versions is the default. As the service develops the default version will change and old versions of software may be deleted.

You can list all the modules of a particular type by providing an argument to the module avail command. For example, to list all available versions of the OpenMPI library, use:

[dc-user1@tursa-login1 ~]$ module avail openmpi\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles -------------------------------------------\nopenmpi/4.1.1-cuda11.0.2  \n\n------------------------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles --------------------------------------------\nopenmpi/4.1.1-cuda11.4  \n\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles -------------------------------------------\nopenmpi/4.1.1-cuda11.4.1  \n\n----------------\n

The module show command reveals what operations the module actually performs to change your environment when it is loaded. We provide a brief overview of the significance of these different settings below. For example, for the default openmpi module:

[dc-user1@tursa-login1 ~]$ module show openmpi\n-------------------------------------------------------------------\n/mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles/openmpi/4.1.1-cuda11.0.2:\n\nmodule-whatis   Sets up OpenMPI on your environment\nsetenv          MPI_ROOT        /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/\nprepend-path    PATH /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/bin/\nprepend-path    LD_LIBRARY_PATH /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/lib\nprepend-path    MANPATH /opt/mpi/openmpi/4.0.4.1/share/man\nmodule load     ucx/1.10.1\nsetenv          OMPI_CC cc\nsetenv          OMPI_CXX        g++\nsetenv          OMPI_CFLAGS     -g -m64\nsetenv          OMPI_CXXFLAGS   -g -m64\n-------------------------------------------------------------------\n
"},{"location":"tursa-user-guide/sw-environment/#loading-removing-and-swapping-modules","title":"Loading, removing and swapping modules","text":"

To load a module, use the module load command. For example, to load the default version of OpenMPI into your environment, use:

[dc-user1@tursa-login1 ~]$ module load openmpi\n\n        UCX 1.10 loaded\n\n\n        OpenMPI 4.1.1 loaded\n

Once you have done this, your environment will be setup to use the OpenMPI library. The above command will load the default version of OpenMPI. If you need a specific version of the software, you can add more information:

[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4.1\n\n        UCX 1.12.0 compiled with cuda 11.4.1 loaded\n\n\n        OpenMPI 4.1.1 with cuda-11.4.1 and UCX 1.12.0  loaded\n

will load OpenMPI version 4.1.1 with CUDA 11.4.1 into your environment, regardless of the default.

If you want to remove software from your environment, module rm will remove a loaded module:

[dc-user1@tursa-login1 ~]$ module rm openmpi\n

will unload whatever version of openmpi you might have loaded (even if it is not the default).

There are many situations in which you might want to change the presently loaded version to a different one, such as trying the latest version which is not yet the default or using a legacy version to keep compatibility with old data. This can be achieved most easily by using module swap oldmodule newmodule.

Suppose you have loaded version 4.1.1 of openmpi; the following command will swap it for version 4.1.1-cuda11.4.1:

[dc-user1@tursa-login1 ~]$ module swap openmpi openmpi/4.1.1-cuda11.4.1\n\n        UCX 1.12.0 compiled with cuda 11.4.1 loaded\n\n\n        OpenMPI 4.1.1 with cuda-11.4.1 and UCX 1.12.0  loaded\n

Note that you do not need to specify the version of the loaded module: it can be inferred because it is the only version of openmpi you have loaded.

"},{"location":"tursa-user-guide/sw-environment/#capturing-your-environment-for-reuse","title":"Capturing your environment for reuse","text":"

Sometimes it is useful to save the module environment that you are using to compile a piece of code or execute a piece of software. This is saved as a module collection. You can save a collection from your current environment by executing:

[dc-user1@tursa-login1 ~]$ module save [collection_name]\n

Note

If you do not specify the environment name, it is called default.

You can find the list of saved module environments by executing:

[dc-user1@tursa-login1 ~]$ module savelist\nNamed collection list:\n 1) default\n

To list the modules in a collection, you can execute, e.g.,:

[dc-user1@tursa-login1 ~]$ module saveshow default\n-------------------------------------------------------------------\n/home/z01/z01/dc-turn1/.module/default:\n\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/modulefilesintel\nmodule use --append /mnt/lustre/tursafs1/apps/modulefiles\nmodule load ucx/1.12.0-cuda11.4.1\nmodule load openmpi/4.1.1-cuda11.4.1\n\n-------------------------------------------------------------------\n

Note again that the details of the collection have been saved to the home directory (the first line of output above). It is possible to save a module collection with a fully qualified path, e.g.,

[dc-user1@tursa-login1 ~]$ module save /home/t01/z01/auser/my-module-collection\n

if you want to save to a specific file name.

To delete a module environment, you can execute:

[dc-user1@tursa-login1 ~]$ module saverm <environment_name>\n
"},{"location":"tursa-user-guide/sw-environment/#shell-environment-overview","title":"Shell environment overview","text":"

When you log in to Tursa, you are using the bash shell by default. Like any other software, the bash shell has a set of environment variables loaded; these can be listed by executing printenv or export.

These environment variables are useful for defining the behaviour of the software you run. For instance, OMP_NUM_THREADS defines the number of OpenMP threads.

To define an environment variable, you need to execute:

export OMP_NUM_THREADS=4\n

Please note there are no blanks between the variable name, the assignment symbol (=), and the value. If the value is a string, enclose it in double quotation marks.
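
For example, to set a variable whose value is a string containing spaces (the variable name here is purely illustrative):

export MY_RUN_LABEL=\"production run 42\"\n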

You can show the value of a specific environment variable by printing it:

echo $OMP_NUM_THREADS\n

Do not forget the dollar symbol. To remove an environment variable, just execute:

unset OMP_NUM_THREADS\n
"},{"location":"tursa-user-guide/sw-environment/#compiler-environment","title":"Compiler environment","text":"

The system supports two different primary compiler environments:

"},{"location":"tursa-user-guide/sw-environment/#gcc-toolchain","title":"GCC toolchain","text":"

To compile on the system for GPU nodes using the GCC toolchain, you would typically load the required modules:

[dc-user1@tursa-login1 ~]$ module load gcc/9.3.0\n[dc-user1@tursa-login1 ~]$ module load cuda/11.4.1 \n[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4\n[dc-user1@tursa-login1 ~]$ module list\nCurrently Loaded Modulefiles:\n 1) gcc/9.3.0   2) cuda/11.4.1   3) ucx/1.12.0-cuda11.4   4) openmpi/4.1.1-cuda11.4 \n

Once you have loaded the modules, the standard OpenMPI compiler wrapper scripts are available:

You can find more information on these scripts in the OpenMPI documentation.
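
For illustration, with these modules loaded you would compile MPI source code using the standard OpenMPI wrapper commands mpicc, mpicxx and mpif90 (source and output file names below are placeholders):

mpicc  -O3 -o my_c_app   my_c_app.c      # C\nmpicxx -O3 -o my_cxx_app my_cxx_app.cpp  # C++\nmpif90 -O3 -o my_f_app   my_f_app.f90    # Fortran\n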

"},{"location":"tursa-user-guide/sw-environment/#nvhpc-toolchain","title":"NVHPC toolchain","text":"

To compile on the system for GPU nodes using the NVHPC toolchain, you would typically load the required modules:

[dc-user1@tursa-login1 ~]$ module load /home/y07/shared/tursa-modules/setup-env\n[dc-user1@tursa-login1 ~]$ module load gcc/9.3.0\n[dc-user1@tursa-login1 ~]$ module load nvhpc/21.7-nompi\n[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4\n[dc-user1@tursa-login1 ~]$ module list\nCurrently Loaded Modulefiles:\n 1) /mnt/lustre/tursafs1/home/y07/shared/tursa-modules/setup-env   2) nvhpc/21.7-nompi\n 3) ucx/1.12.0-cuda11.4   4) openmpi/4.1.1-cuda11.4    5) gcc/9.3.0\n

Once you have loaded the modules, the standard OpenMPI compiler wrapper scripts are available:

and the NVIDIA compilers are available as:
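
The NVHPC suite provides the compiler drivers nvc (C), nvc++ (C++), nvfortran (Fortran) and nvcc (CUDA). As an illustrative example (file names are placeholders), a CUDA source file targeting the A100 GPUs could be compiled with:

nvcc -O3 -arch=sm_80 -o my_gpu_app my_gpu_app.cu\n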

Tip

Both the NVIDIA compilers and the MPI compiler wrapper scripts will use the GCC compilers directly in the default configuration - this is often what you want. If you want the compiler wrappers to call the NVIDIA compilers themselves rather than GCC directly, you would use:

export OMPI_CC=nvcc\nexport OMPI_CXX=nvc++\nexport OMPI_FC=nvfortran\n
"},{"location":"tursa-user-guide/sw-environment/#other-build-tools","title":"Other build tools","text":""},{"location":"tursa-user-guide/sw-environment/#cmake","title":"cmake","text":"

CMake is available by using the commands:

[dc-user1@tursa-login1 ~]$ module load /home/y07/shared/tursa-modules/setup-env\n[dc-user1@tursa-login1 ~]$ module load cmake\n
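
For illustration, assuming a reasonably recent CMake, a typical out-of-source build of a project in the current directory might then look like:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release\ncmake --build build -j 8\n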
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"DiRAC Extreme Scaling User Documentation","text":"

DiRAC Extreme Scaling is part of the DiRAC National HPC Service. You can find more information on the service and the research it supports on the DiRAC website.

The DiRAC Extreme Scaling service is an HPC resource for UK researchers. DiRAC Extreme Scaling is provided by UKRI, EPCC and the University of Edinburgh. The hardware is provided by ATOS.

"},{"location":"#what-the-documentation-covers","title":"What the documentation covers","text":"

The documentation currently includes:

"},{"location":"#contributing-to-the-documentation","title":"Contributing to the documentation","text":"

The source for this documentation is publicly available in the DiRAC Extreme Scaling documentation Github repository so that anyone can contribute to improve the documentation for the service. Contributions can be in the form of improvements or addtions to the content and/or addtion of Issues providing suggestions for how it can be improved.

Full details of how to contribute can be found in the README.md file of the repository.

"},{"location":"tursa-user-guide/","title":"Tursa User Guide","text":"

The Tursa User Guide covers all aspects of use of the Tursa resource. This includes fundamentals (required by all users to use the system effectively), best practice for getting the most out of Tursa and more technical topics.

The Tursa User Guide contains the following sections:

"},{"location":"tursa-user-guide/connecting/","title":"Connecting to Tursa","text":"

On the Tursa system, interactive access can be achieved via SSH, either directly from a command line terminal or using an SSH client. In addition data can be transferred to and from the Tursa system using scp from the command line or by using a file transfer client.

This section covers the basic connection methods.

Before following the process below, we assume you have setup an account on Tursa through the DiRAC SAFE. Documentation on how to do this can be found at:

"},{"location":"tursa-user-guide/connecting/#command-line-terminal","title":"Command line terminal","text":""},{"location":"tursa-user-guide/connecting/#linux","title":"Linux","text":"

Linux distributions come installed with a terminal application that can be used for SSH access to the login nodes. Linux users will have different terminals depending on their distribution and window manager (e.g. GNOME Terminal in GNOME, Konsole in KDE). Consult your Linux distribution's documentation for details on how to load a terminal.

"},{"location":"tursa-user-guide/connecting/#macos","title":"MacOS","text":"

MacOS users can use the Terminal application, located in the Utilities folder within the Applications folder.

"},{"location":"tursa-user-guide/connecting/#windows","title":"Windows","text":"

A typical Windows installation will not include a terminal client, though there are various clients available. We recommend all our Windows users to download and install MobaXterm to access Tursa. It is very easy to use and includes an integrated X server with SSH client to run any graphical applications on Tursa.

You can download MobaXterm Home Edition (Installer Edition) from the following link:

Double-click the downloaded Microsoft Installer file (.msi), and the Windows wizard will automatically guides you through the installation process. Note, you might need to have administrator rights to install on some Windows OS. Also make sure to check whether Windows Firewall hasn't blocked any features of this program after installation.

Start MobaXterm and then click \"Start local terminal\"

Tips

"},{"location":"tursa-user-guide/connecting/#access-credentials","title":"Access credentials","text":"

To access Tursa, you need to use two credentials:

You can find more detailed instructions on how to set up your credentials to access Tursa from Windows, macOS and Linux below.

"},{"location":"tursa-user-guide/connecting/#ssh-key-pairs","title":"SSH Key Pairs","text":"

You will need to generate an SSH key pair protected by a passphrase to access Tursa.

Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:

$ ssh-keygen -t rsa -C \"your@email.com\"\n...\n-bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\nEnter passphrase (empty for no passphrase): [Passphrase]\nEnter same passphrase again: [Passphrase]\nYour identification has been saved in /Home/user/.ssh/id_rsa.\nYour public key has been saved in /Home/user/.ssh/id_rsa.pub.\nThe key fingerprint is:\n03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\nThe key's randomart image is:\n+--[ RSA 2048]----+\n|    . ...+o++++. |\n| . . . =o..      |\n|+ . . .......o o |\n|oE .   .         |\n|o =     .   S    |\n|.    +.+     .   |\n|.  oo            |\n|.  .             |\n| ..              |\n+-----------------+\n

(remember to replace \"your@email.com\" with your e-mail address).

"},{"location":"tursa-user-guide/connecting/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"

You should now upload the public part of your SSH key pair to the SAFE by following the instructions at:

Login to SAFE. Then:

  1. Go to the Menu Login accounts and select the Tursa account you want to add the SSH key to
  2. On the subsequent Login account details page click the Add Credential button
  3. Select SSH public key as the Credential Type and click Next
  4. Either copy and paste the public part of your SSH key into the SSH Public key box or use the button to select the public key file on your computer.
  5. Click Add to associate the public SSH key part with your account

Once you have done this, your SSH key will be added to your Tursa account.

Remember, you will need to use both an SSH key and password to log into Tursa, so you will also need to collect your initial password before you can log in for the first time. We cover this next.

Note

If you want to connect to Tursa from more than one machine, e.g. from your home laptop as well as your work laptop, you should generate an ssh key on each machine, and add each of the public keys into SAFE.

"},{"location":"tursa-user-guide/connecting/#initial-passwords-up-to-13-feb-2024","title":"Initial passwords (up to 13 Feb 2024)","text":"

The SAFE web interface is used to provide your initial password for logging onto Tursa (see the SAFE Documentation for more details on requesting accounts and picking up passwords).

Note

You may now change your password on the Tursa machine itself using the passwd command or when you are prompted the first time you log in. This change will not be reflected in the SAFE. If you forget your password, you should use the SAFE to request a new one-shot password.

"},{"location":"tursa-user-guide/connecting/#mfa-time-based-one-time-passcode-totp-from-13-feb-2024","title":"MFA Time-based one-time passcode (TOTP) (from 13 Feb 2024)","text":"

You will need to use both an SSH key and a time-based one-time passcode to log into Tursa, so you will also need to set up a method for generating a TOTP code before you can log in.

"},{"location":"tursa-user-guide/connecting/#first-login-from-a-new-account-password-required","title":"First login from a new account: password required","text":"

Important

You will not use your password when logging on to Tursa after the first login for a new account.

As an additional security measure, you will also need to use a password from SAFE for your first login to Tursa with a new account. During this first login you will be prompted to change your initial password. This is a three-step process:

  1. When prompted to enter your ldap password: Enter the password which you retrieve from SAFE
  2. When prompted to enter your new password: type in a new password
  3. When prompted to re-enter the new password: re-enter the new password

Your password has now been changed. You will no longer need this password to log into Tursa from this point forwards; you will use your SSH key and TOTP as described above.

"},{"location":"tursa-user-guide/connecting/#ssh-clients","title":"SSH Clients","text":"

Interaction with Tursa is done remotely, over an encrypted communication channel, Secure Shell version 2 (SSH-2). This allows command-line access to one of the login nodes of Tursa, from which you can run commands or use a command-line text editor to edit files. SSH can also be used to run graphical programs such as GUI text editors and debuggers when used in conjunction with an X client.

"},{"location":"tursa-user-guide/connecting/#logging-in","title":"Logging in","text":"

You can use the following command from the terminal window to log in to Tursa:

ssh username@tursa.dirac.ed.ac.uk\n

You need to enter both credentials correctly to be able to access Tursa.

Tip

If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa) on your local system, you may need to specify the path to the private part of the key with the -i option to ssh. For example, if your key is in a file called keys/id_rsa_Tursa you would use the command ssh -i keys/id_rsa_Tursa username@tursa.dirac.ed.ac.uk to log in.

Tip

When you first log into Tursa, you will be prompted to change your initial password. This is a three-step process:

  1. When prompted to enter your ldap password: Re-enter the password you retrieved from SAFE
  2. When prompted to enter your new password: type in a new password
  3. When prompted to re-enter the new password: re-enter the new password

Your password has now been changed.

To allow remote programs, especially graphical applications, to control your local display (for example, to open a new GUI window such as a debugger), use:

ssh -X username@tursa.dirac.ed.ac.uk\n

Some sites recommend using the -Y flag. While this can fix some compatibility issues, the -X flag is more secure.

Current MacOS systems do not have an X window system. Users should install the XQuartz package to allow for SSH with X11 forwarding on MacOS systems:

"},{"location":"tursa-user-guide/connecting/#making-access-more-convenient-using-the-ssh-configuration-file","title":"Making access more convenient using the SSH configuration file","text":"

Typing in the full command to log in or transfer data to Tursa can become tedious as it often has to be repeated many times. You can use the SSH configuration file, usually located on your local machine at ~/.ssh/config, to make things a bit more convenient.

Each remote site (or group of sites) can have an entry in this file which may look something like:

Host tursa\n  HostName tursa.dirac.ed.ac.uk\n  User username\n

(remember to replace username with your actual username!).

The Host tursa line defines a short name for the entry. In this case, instead of typing ssh username@tursa.dirac.ed.ac.uk to access the Tursa login nodes, you could use ssh tursa instead. The remaining lines define the options for the tursa host.

Now you can use SSH to access Tursa without needing to enter your username or the full hostname every time:

$ ssh tursa\n

You can set up as many of these entries as you need in your local configuration file. Other options are available. See the ssh_config man page (or man ssh_config on any machine with SSH installed) for a description of the SSH configuration file. You may find the IdentityFile option useful if you have to manage multiple SSH key pairs for different systems as this allows you to specify which SSH key to use for each system.
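
For example, an entry that also selects a specific SSH key for Tursa might look like the following (the key path is a placeholder for wherever your private key is stored):

Host tursa\n  HostName tursa.dirac.ed.ac.uk\n  User username\n  IdentityFile ~/.ssh/id_rsa_Tursa\n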

Bug

There is a known bug with Windows ssh-agent. If you get the error message: Warning: agent returned different signature type ssh-rsa (expected rsa-sha2-512), you will need to either specify the path to your ssh key in the command line (using the -i option as described above) or add the path to your SSH config file by using the IdentityFile option.

"},{"location":"tursa-user-guide/connecting/#ssh-debugging-tips","title":"SSH debugging tips","text":"

If you find you are unable to connect via SSH there are a number of ways you can try to diagnose the issue. Some of these are collected below - if you are having difficulties connecting we suggest trying these before contacting the Tursa service desk.

"},{"location":"tursa-user-guide/connecting/#use-the-usertursadiracedacuk-syntax-rather-than-l-user-tursadiracedacuk","title":"Use the user@tursa.dirac.ed.ac.uk syntax rather than -l user tursa.dirac.ed.ac.uk","text":"

We have seen a number of instances where people using the syntax

ssh -l user tursa.dirac.ed.ac.uk\n

have not been able to connect properly and get prompted for a password many times. We have found that using the alternative syntax:

ssh user@tursa.dirac.ed.ac.uk\n

works more reliably. If you are using the -l user option to connect and are seeing issues, then try using user@tursa.dirac.ed.ac.uk instead.

"},{"location":"tursa-user-guide/connecting/#can-you-connect-to-the-login-node","title":"Can you connect to the login node?","text":"

Try the command ping -c 3 tursa.dirac.ed.ac.uk. If you successfully connect to the login node, the output should include:

--- tursa.dirac.ed.ac.uk ping statistics ---\n3 packets transmitted, 3 received, 0% packet loss, time 38ms\n

(the ping time '38ms' is not important). If not all packets are received there could be a problem with your internet connection, or the login node could be unavailable.

"},{"location":"tursa-user-guide/connecting/#password","title":"Password","text":"

If you are having trouble entering your password consider using a password manager, from which you can copy and paste it. This will also help you generate a secure password. If you need to reset your password, instructions for doing so can be found in the SAFE documentation.

Windows users please note that Ctrl+V does not work to paste in to PuTTY, MobaXterm, or PowerShell. Instead use Shift+Ins to paste. Alternatively, right-click and select 'Paste' in PuTTY and MobaXterm, or simply right-click to paste in PowerShell.

"},{"location":"tursa-user-guide/connecting/#ssh-key","title":"SSH key","text":"

If you get the error message Permission denied (publickey), this can indicate a problem with your SSH key. Some things to check:

Target: Permissions (chmod code)
  Directory: drwx------ (700)
  Private Key: -rw------- (600)
  Public Key: -rw-r--r-- (644)

chmod can be used to set permissions on the target in the following way: chmod <code> <target>. So for example to set correct permissions on the private key file id_rsa_Tursa one would use the command chmod 600 id_rsa_Tursa.

Tip

Unix file permissions can be understood in the following way. There are three groups that can have file permissions: (owning) users, (owning) groups, and others. The available permissions are read, write, and execute. The first character indicates whether the target is a file -, or directory d. The next three characters indicate the owning user's permissions. The first character is r if they have read permission, - if they don't, the second character is w if they have write permission, - if they don't, the third character is x if they have execute permission, - if they don't. This pattern is then repeated for group, and other permissions. For example the pattern -rw-r--r-- indicates that the owning user can read and write the file, members of the owning group can read it, and anyone else can also read it. The chmod codes are constructed by treating the user, group, and owner permission strings as binary numbers, then converting them to decimal. For example the permission string -rwx------ becomes 111 000 000 -> 700.

"},{"location":"tursa-user-guide/connecting/#ssh-verbose-output","title":"SSH verbose output","text":"

Verbose debugging output from ssh can be very useful for diagnosing the issue. In particular, it can be used to distinguish between problems with the SSH key and password - further details are given below. To enable verbose output add the -vvv flag to your SSH command. For example:

ssh -vvv username@tursa.dirac.ed.ac.uk\n

The output is lengthy, but somewhere in there you should see lines similar to the following:

debug1: Next authentication method: keyboard-interactive\ndebug2: userauth_kbdint\ndebug3: send packet: type 50\ndebug2: we sent a keyboard-interactive packet, wait for reply\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 1\nPassword:\ndebug3: send packet: type 61\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 0\ndebug3: send packet: type 61\ndebug3: receive packet: type 51\nAuthenticated with partial success.\ndebug1: Authentications that can continue: publickey,password\n

If you do not see the Password: prompt you may have connection issues, or there could be a problem with the Tursa login nodes. If you do not see Authenticated with partial success it means your password was not accepted. You will be asked to re-enter your password, usually two more times before the connection will be rejected. Consider the suggestions under Password above. If you do see Authenticated with partial success, it means your password was accepted, and your SSH key will now be checked.

You should next see something similar to:

debug1: Next authentication method: publickey\ndebug1: Offering public key: RSA SHA256:<key_hash> <path_to_private_key>\ndebug3: send_pubkey_test\ndebug3: send packet: type 50\ndebug2: we sent a publickey packet, wait for reply\ndebug3: receive packet: type 60\ndebug1: Server accepts key: pkalg rsa-sha2-512 blen 2071\ndebug2: input_userauth_pk_ok: fp SHA256:<key_hash>\ndebug3: sign_and_send_pubkey: RSA SHA256:<key_hash>\nEnter passphrase for key '<path_to_private_key>':\ndebug3: send packet: type 50\ndebug3: receive packet: type 52\ndebug1: Authentication succeeded (publickey).\n

Most importantly, you can see which files ssh has checked for private keys, and you can see if any key is accepted. The line Authentication succeeded (publickey) indicates that the SSH key has been accepted. By default ssh will go through a list of standard private key files, as well as any you have specified with -i or a config file. This is fine, as long as one of the files mentioned is the one that matches the public key uploaded to SAFE.

If your SSH key passphrase is incorrect, you will be asked to try again up to three times in total, before being disconnected with Permission denied (publickey). If you enter your passphrase correctly, but still see this error message, please consider the advice under SSH key above.

"},{"location":"tursa-user-guide/data/","title":"Data management and transfer","text":"

This section covers best practice and tools for data management on Tursa as well as information on the storage available on the system.

Information

If you have any questions on data management and transfer please do not hesitate to contact the DiRAC service desk at dirac-support@epcc.ed.ac.uk.

"},{"location":"tursa-user-guide/data/#useful-resources-and-links","title":"Useful resources and links","text":""},{"location":"tursa-user-guide/data/#data-management","title":"Data management","text":"

We strongly recommend that you give some thought to how you use the various data storage facilities that are part of the Tursa service. This will not only allow you to use the machine more effectively but also to ensure that your valuable data is protected.

"},{"location":"tursa-user-guide/data/#tursa-storage","title":"Tursa storage","text":"

Tursa has two different storage systems available:

"},{"location":"tursa-user-guide/data/#parallel-lustre-file-system","title":"Parallel Lustre file system","text":"

The Tursa storage is provided by a parallel Lustre file system that provides your home directories and working storage. When you log in you will be placed in your home directory.

The home directory for each user is located at:

/home/[project code]/[group code]/[username]\n

where

Each project is allocated a portion of the total storage available, and the project PI will be able to sub-divide this quota among the groups and users within the project. As is standard practice on UNIX and Linux systems, the environment variable $HOME is automatically set to point to your home directory.

"},{"location":"tursa-user-guide/data/#tape-storage","title":"Tape storage","text":"

The tape storage can be made available to any Tursa user on request and can be used to store data from the Lustre parallel file system.

Managing and transferring data to/from the Tursa tape storage is done via the Miria web interface via an SSH tunnel to the Tursa login nodes.

Important

All data on the tape storage is shared project data rather than data associated with individual user accounts. Any data you move to tape will be visible to all users in the same project as you who have access to the tape storage.

"},{"location":"tursa-user-guide/data/#requesting-access-to-the-tape-storage","title":"Requesting access to the tape storage","text":"

If you want to use the Tursa tape storage, you should contact the DiRAC Service Desk with the username and project ID you want to use to access the storage.

"},{"location":"tursa-user-guide/data/#data-locations","title":"Data locations","text":"

In order to move data to the tape storage it must exist in a specific directory on the Tursa Lustre file system. You will need to move or copy the data to this location before it can be moved to tape and when you restore data from tape it will be placed in this location.

There is one directory per project on Tursa. The directory has the path:

/mnt/lustre/tursafs1/archive/[project code]\n

So, for example, the directory for project dp001 would be:

/mnt/lustre/tursafs1/archive/dp001\n
"},{"location":"tursa-user-guide/data/#setup-the-ssh-tunnel-for-miria","title":"Setup the SSH tunnel for Miria","text":"

Once your tape storage access has been setup and you have moved data to the archive directory, you will need to connect to the Miria web interface in a web browser on your local system by setting up an SSH tunnel to the Tursa login nodes.

You do this by logging into Tursa in the usual way (with your SSH key and password) and adding the -L 9080:10.144.20.95:443 option to the ssh command.

For example, if your username is dc-user1, you would setup the tunnel by logging into Tursa with (assuming your SSH key is in the default location):

ssh -L 9080:10.144.20.95:443 dc-user1@tursa.dirac.ed.ac.uk\n

Enter your SSH key passphrase and password in the usual way.

Note

You will need to setup the SSH tunnel each time you want to access the Miria interface.

"},{"location":"tursa-user-guide/data/#access-the-miria-interface","title":"Access the Miria interface","text":"

Once you have setup the SSH tunnel, you should be able to access the Miria interface in a web browser on your local system. Open a new tab and enter the URL:

You should see an interface asking you for a username and password. Use the username and password that you use to log into Tursa to log into the tape storage interface.

"},{"location":"tursa-user-guide/data/#transfer-data-from-tursa-lustre-to-tape","title":"Transfer data from Tursa Lustre to tape","text":"

You use the \"Easy Move\" option from the left-hand menu to transfer data.

  1. Click on \"Easy Move\"
  2. Click on the \"Find a source\" menu and select the disk with your project ID (e.g. \"dp001\")
  3. Click on the \"Find a target\" menu and select the archive with your project ID (e.g. \"dp001\")
  4. Use the file explorer to select the files/directories you wish to move to tape
  5. Click the \"Add\" button
  6. Scroll to the bottom of the page and select \"Validate basket\" and confirm you wish to proceed

Your transfer request will be added to the queue. You can check on progress by selecting the \"Activity\" option in the left hand menu.

"},{"location":"tursa-user-guide/data/#restore-data-from-tape-to-tursa-lustre","title":"Restore data from tape to Tursa Lustre","text":"

You use the \"Easy Move\" option from the left-hand menu to transfer data.

  1. Click on \"Easy Move\"
  2. Select the \"Repository\" icon next to the \"Find a source\" menu
  3. Click on the \"Find a source\" menu and select the archive with your project ID (e.g. \"dp001\")
  4. Select the \"Platform\" icon next to the \"Find a source\" menu
  5. Click on the \"Find a target\" menu and select the disk with your project ID (e.g. \"dp001\")
  6. Use the source file explorer to select the files/directories you wish to restore
  7. Use the target file explorer to select the location on disk to restore the data
  8. Click the \"Add\" button
  9. Scroll to the bottom of the page and select \"Validate basket\" and confirm you wish to proceed

Your transfer request will be added to the queue. You can check on progress by selecting the \"Activity\" option in the left hand menu.

Bug

If you restore a file rather than a directory, the Miria tool will give the file the name NULL once it is restored, you should use the mv command to rename the file to the correct name once it has been restored.

"},{"location":"tursa-user-guide/data/#sharing-data-with-other-tursa-users","title":"Sharing data with other Tursa users","text":"

How you share data with other Tursa users depends on whether or not they belong to the same project as you. Each project has two shared folders that can be used for sharing data.

"},{"location":"tursa-user-guide/data/#sharing-data-with-tursa-users-in-your-project","title":"Sharing data with Tursa users in your project","text":"

Each project has an inner shared folder.

/home/[project code]/[project code]/shared\n

This folder has read/write permissions for all project members. You can place any data you wish to share with other project members in this directory. For example, if your project code is x01 the inner shared folder would be located at /home/x01/x01/shared.

"},{"location":"tursa-user-guide/data/#sharing-data-with-all-tursa-users","title":"Sharing data with all Tursa users","text":"

Each project also has an outer shared folder:

/home/[project code]/shared\n

It is writable by all project members and readable by any user on the system. You can place any data you wish to share with other Tursa users who are not members of your project in this directory. For example, if your project code is x01 the outer shared folder would be located at /home/x01/shared.

"},{"location":"tursa-user-guide/data/#permissions","title":"Permissions","text":"

You should check the permissions of any files that you place in the shared area, especially if those files were created in your own Tursa account. Such files are likely to be readable by you only.

The chmod command below shows how to make sure that a file placed in the outer shared folder is also readable by all Tursa users.

chmod a+r /home/x01/shared/your-shared-file.txt\n

Similarly, for the inner shared folder, chmod can be called such that read permission is granted to all users within the x01 project.

chmod g+r /home/x01/x01/shared/your-shared-file.txt\n

If you're sharing a set of files stored within a folder hierarchy, the chmod is slightly more complicated.

chmod -R a+Xr /home/x01/shared/my-shared-folder\nchmod -R g+Xr /home/x01/x01/shared/my-shared-folder\n

The -R option ensures that the read permission is enabled recursively and the +X guarantees that the user(s) you're sharing the folder with can access the subdirectories below my-shared-folder.

"},{"location":"tursa-user-guide/data/#archiving-and-data-transfer","title":"Archiving and data transfer","text":"

Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.

The method you use to transfer data to/from Tursa will depend on how much you want to transfer and where to. The methods we cover in this guide are:

Before discussing specific data transfer methods, we cover archiving which is an essential process for transferring data efficiently.

"},{"location":"tursa-user-guide/data/#archiving","title":"Archiving","text":"

If you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger \"archive\" file for ease of transfer and manipulation. A single large file makes more efficient use of the file system and is easier to move and copy and transfer because significantly fewer meta-data operations are required. Archive files can be created using tools like tar and zip.

"},{"location":"tursa-user-guide/data/#tar","title":"tar","text":"

The tar command packs files into a \"tape archive\" format. The command has general form:

tar [options] [file(s)]\n

Common options include:

Putting these together:

tar -cvWlf mydata.tar mydata\n

will create and verify an archive.

To extract files from a tar file, the option -x is used. For example:

tar -xf mydata.tar\n

will recover the contents of mydata.tar to the current working directory.

To verify an existing tar file against a set of data, the -d (diff) option can be used. By default, no output will be given if a verification succeeds and an example of a failed verification follows:

$> tar -df mydata.tar mydata/*\nmydata/damaged_file: Mod time differs\nmydata/damaged_file: Size differs\n

Note

tar files do not store checksums with their data, requiring the original data to be present during verification.

Tip

Further information on using tar can be found in the tar manual (accessed via man tar or in the online tar manual).

"},{"location":"tursa-user-guide/data/#zip","title":"zip","text":"

The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:

zip [options] mydata.zip [file(s)]\n

Common options are:

Together:

zip -0r mydata.zip mydata\n

will create an archive.

Note

Unlike tar, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system layout.

The corresponding unzip command is used to extract data from the archive. The simplest use case is:

unzip mydata.zip\n

which recovers the contents of the archive to the current working directory.

Files in a zip archive are stored with a CRC checksum to help detect data loss. unzip provides options for verifying this checksum against the stored files. The relevant flag is -t and is used as follows:

$> unzip -t mydata.zip\nArchive:  mydata.zip\n    testing: mydata/                 OK\n    testing: mydata/file             OK\nNo errors detected in compressed data of mydata.zip.\n

Tip

Further information on using zip can be found in the zip manual (accessed via man zip or in the online zip manual).

"},{"location":"tursa-user-guide/data/#data-transfer-via-ssh","title":"Data transfer via SSH","text":"

The easiest way of transferring data to/from Tursa is to use one of the standard programs based on the SSH protocol such as scp, sftp or rsync. These all use the same underlying mechanism (SSH) as you normally use to log in to Tursa. So, once the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine (Tursa in this case).

To avoid having to type in your password multiple times you can set up a SSH key pair and use an SSH agent as documented in the User Guide at connecting.

"},{"location":"tursa-user-guide/data/#ssh-data-transfer-performance-considerations","title":"SSH data transfer performance considerations","text":"

The SSH protocol encrypts all traffic it sends. This means that file transfer using SSH consumes a relatively large amount of CPU time at both ends of the transfer (for encryption and decryption). The Tursa login nodes have fairly fast processors that can sustain about 100 MB/s transfer. The encryption algorithm used is negotiated between the SSH client and the SSH server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm than the default. The aes128-ctr or aes256-ctr algorithms are well supported and fast as they are implemented in hardware. These are not usually the default choice when using scp so you will need to manually specify them.

A single SSH based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth so you may see an overall improvement by running several data transfer operations in parallel. To reduce metadata interactions it is a good idea to overlap transfers of files from different directories.
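
As a sketch of this approach, two separate directory trees could be transferred concurrently by backgrounding independent rsync processes (paths and username are placeholders):

rsync -a -e ssh dir1 user@tursa.dirac.ed.ac.uk:/home/[project code]/[group code]/[username]/ &\nrsync -a -e ssh dir2 user@tursa.dirac.ed.ac.uk:/home/[project code]/[group code]/[username]/ &\nwait   # wait for both background transfers to finish\n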

In addition, you should consider the following when transferring data:

"},{"location":"tursa-user-guide/data/#scp","title":"scp","text":"

The scp command creates a copy of a file, or if given the -r flag, a directory either from a local machine onto a remote machine or from a remote machine onto a local machine.

For example, to transfer files to Tursa from a local machine:

scp [options] source user@tursa.dirac.ed.ac.uk:[destination]\n

(Remember to replace user with your Tursa username in the example above.)

In the above example, the [destination] is optional, as when left out scp will copy the source into your home directory. Also, the source should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.

If you want to request a different encryption algorithm add the -c [algorithm-name] flag to the scp options. For example, to use the (usually faster) aes128-ctr encryption algorithm you would use:

scp [options] -c aes128-ctr source user@tursa.dirac.ed.ac.uk:[destination]\n

(Remember to replace user with your Tursa username in the example above.)

"},{"location":"tursa-user-guide/data/#rsync","title":"rsync","text":"

The rsync command can also transfer data between hosts using a ssh connection. It creates a copy of a file or, if given the -r flag, a directory at the given destination, similar to scp above.

Given the -a option rsync can also make exact copies (including permissions), this is referred to as mirroring. In this case the rsync command is executed with ssh to create the copy on a remote machine.

To transfer files to Tursa using rsync with ssh the command has the form:

rsync [options] -e ssh source user@tursa.dirac.ed.ac.uk:[destination]\n

(Remember to replace user with your Tursa username in the example above.)

In the above example, the [destination] is optional, as when left out rsync will copy the source into your home directory. Also the source should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.

Additional flags can be specified for the underlying ssh command by using a quoted string as the argument of the -e flag, e.g.

rsync [options] -e \"ssh -c aes128-ctr\" source user@tursa.dirac.ed.ac.uk:[destination]\n

(Remember to replace user with your Tursa username in the example above.)

Tip

Further information on using rsync can be found in the rsync manual (accessed via man rsync or in the online rsync manual).

"},{"location":"tursa-user-guide/data/#ssh-data-transfer-example-laptopworkstation-to-tursa","title":"SSH data transfer example: laptop/workstation to Tursa","text":"

Here we have a short example demonstrating transfer of data directly from a laptop/workstation to Tursa.

Note

This guide assumes you are using a command line interface to transfer data. This means the terminal on Linux or macOS, MobaXterm local terminal on Windows or Powershell.

Before we can transfer data to Tursa we need to make sure we have an SSH key set up to access Tursa from the system we are transferring data from. If you are using the same system that you use to log into Tursa then you should be all set. If you want to use a different system you will need to generate a new SSH key there (or use SSH key forwarding) to allow you to connect to Tursa.

Tip

Remember that you will need to use both a key and your password to transfer data to Tursa.

Once we know our keys are set up correctly, we are ready to transfer data directly between the two machines. We begin by combining our important research data into a single archive file using the following command:

tar -czf all_my_files.tar.gz file1.txt file2.txt file3.txt\n

We then initiate the data transfer from our system to Tursa, here using rsync to allow the transfer to be recommenced without needing to start again, in the event of a loss of connection or other failure. For example, using the SSH key in the file ~/.ssh/id_RSA_A2 on our local system:

rsync -Pv -e\"ssh -c aes128-gcm@openssh.com -i $HOME/.ssh/id_RSA_A2\" ./all_my_files.tar.gz otbz19@tursa.dirac.ed.ac.uk:/home/z19/z19/otbz19/\n

Note the use of the -P flag to allow partial transfer -- the same command could be used to restart the transfer after a loss of connection. The -e flag allows specification of the ssh command - we have used this to add the location of the identity file. The -c option specifies the cipher to be used as aes128-gcm which has been found to increase performance. Unfortunately the ~ shortcut is not correctly expanded, so we have specified the full path. We move our research archive to our project work directory on Tursa.

Note

Remember to replace otbz19 with your username on Tursa.

If we were unconcerned about being able to restart an interrupted transfer, we could instead use the scp command,

scp -c aes128-gcm@openssh.com -i ~/.ssh/id_RSA_A2 all_my_files.tar.gz otbz19@tursa.dirac.ed.ac.uk:/home/z19/z19/otbz19/\n

but rsync is recommended for larger transfers.

"},{"location":"tursa-user-guide/hardware/","title":"ARCHER2 hardware","text":"

Note

Some of the material in this section is closely based on information provided by NASA as part of the documentation for the Aitken HPC system.

"},{"location":"tursa-user-guide/hardware/#system-overview","title":"System overview","text":"

Tursa is an Eviden supercomputing system which has a total of 178 GPU compute nodes. Each GPU compute node has a CPU with 48 cores and 4 NVIDIA A100 GPUs. Compute nodes are connected together by an InfiniBand interconnect.

There are additional login nodes, which provide access to the system.

Compute nodes are only accessible via the Slurm job scheduling system.

There is a single file system which is available on login and compute nodes (see Data management and transfer).

The Lustre file system has a capacity of 5.1 PiB.

The interconnect uses a Fat Tree topology.

"},{"location":"tursa-user-guide/hardware/#interconnect-details","title":"Interconnect details","text":"

Tursa has a high performance interconnect with 4x 200 Gb/s InfiniBand interfaces per node. It uses a 2-layer fat tree topology:

"},{"location":"tursa-user-guide/scheduler/","title":"Running jobs on Tursa","text":"

As with most HPC services, Tursa uses a scheduler to manage access to resources and ensure that the thousands of different users of the system are able to share the system and all get access to the resources they require. Tursa uses the Slurm software to schedule jobs.

Writing a submission script is typically the most convenient way to submit your job to the scheduler. Example submission scripts (with explanations) for the most common job types are provided below.

Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.

Hint

If you have any questions on how to run jobs on Tursa do not hesitate to contact the DiRAC Service Desk.

You typically interact with Slurm by issuing Slurm commands from the login nodes (to submit, check and cancel jobs), and by specifying Slurm directives that describe the resources required for your jobs in job submission scripts.

"},{"location":"tursa-user-guide/scheduler/#resources","title":"Resources","text":""},{"location":"tursa-user-guide/scheduler/#gpuh","title":"GPUh","text":"

Time used on Tursa nodes is measured in GPUh. 1 GPUh = 1 GPU for 1 hour. So a Tursa compute node with 4 GPUs would cost 4 GPUh per hour.
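For example, a job that runs on 4 nodes (16 GPUs in total) for 6 hours would consume 4 nodes x 4 GPUh per node-hour x 6 hours = 96 GPUh.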

Note

The minimum resource request on Tursa is one full node which is charged at a rate of 4 GPUh per hour.

"},{"location":"tursa-user-guide/scheduler/#checking-available-budget","title":"Checking available budget","text":"

You can check your available budgets in SAFE by selecting Login accounts from the menu and then selecting the login account you want to query.

Under Login account details you will see each of the budget codes you have access to listed (e.g. dp123 resources) and, under Resource Pool to the right of this, a note of the remaining budget.

When logged in to the machine you can also use the command

sacctmgr show assoc where user=$LOGNAME format=account,user,maxtresmins%75\n

This will list all the budget codes that you have access to e.g.

Account       User                                                                 MaxTRESMins \n---------- ---------- --------------------------------------------------------------------------- \n       t01   dc-user1                           gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0 \n       z01   dc-user1   \n

This shows that dc-user1 is a member of budgets t01 and z01. However, the gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0 indicates that the t01 budget can only run GPU jobs in standard (charged) partitions (all other options are disabled, indicated by =0 for CPU standard, CPU low and GPU low). This user can also submit jobs to any partition using the z01 budget.

To see the number of coreh or GPUh remaining you must check in SAFE.

"},{"location":"tursa-user-guide/scheduler/#charging","title":"Charging","text":"

Jobs run on Tursa are charged for the time they use i.e. from the time the job begins to run until the time the job ends (not the full wall time requested).

Jobs are charged for the full number of nodes which are requested, even if they are not all used.

Charging takes place at the time the job ends, and the job is charged in full to the budget which is live at the end time.
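For example, a 2-node GPU job that requests 24 hours of wall time but finishes after 10 hours would be charged 2 nodes x 4 GPUh per node-hour x 10 hours = 80 GPUh, and that amount is deducted from the budget that is live when the job ends.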

"},{"location":"tursa-user-guide/scheduler/#basic-slurm-commands","title":"Basic Slurm commands","text":"

There are three key commands used to interact with Slurm on the command line:

We cover each of these commands in more detail below.

"},{"location":"tursa-user-guide/scheduler/#sinfo-information-on-resources","title":"sinfo: information on resources","text":"

sinfo is used to query information about available resources and partitions. Without any options, sinfo lists the status of all resources and partitions, e.g.

[dc-user1@tursa-login1 ~]$ sinfo \n\nPARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST\ncpu          up 2-00:00:00      4  alloc tu-c0r0n[66-69]\ncpu          up 2-00:00:00      2   idle tu-c0r0n[70-71]\ngpu          up 2-00:00:00      1   plnd tu-c0r2n93\ngpu          up 2-00:00:00     11  drain tu-c0r0n75,tu-c0r5n[48,51,54,57],tu-c0r6n[48,51,54,57],tu-c0r7n[00,48]\ngpu          up 2-00:00:00    112    mix tu-c0r0n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,72,87,90],tu-c0r1n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90,93],tu-c0r2n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90],tu-c0r3n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,90,93],tu-c0r4n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,81,84,87,90,93]\ngpu          up 2-00:00:00     56   resv tu-c0r0n93,tu-c0r4n78,tu-c0r5n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45],tu-c0r6n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,60,63,66,69],tu-c0r7n[03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,51,54,57]\ngpu          up 2-00:00:00      1   idle tu-c0r3n87\n
"},{"location":"tursa-user-guide/scheduler/#sbatch-submitting-jobs","title":"sbatch: submitting jobs","text":"

sbatch is used to submit a job script to the job submission system. The script will typically contain one or more mpirun commands to launch parallel tasks.

When you submit the job, the scheduler provides the job ID, which is used to identify this job in other Slurm commands and when looking at resource usage in SAFE.

sbatch test-job.slurm\nSubmitted batch job 12345\n
"},{"location":"tursa-user-guide/scheduler/#squeue-monitoring-jobs","title":"squeue: monitoring jobs","text":"

squeue without any options or arguments shows the current status of all jobs known to the scheduler. For example:

squeue\n

will list all jobs on Tursa.

The output of this is often large. You can restrict the output to just your jobs by adding the --me option:

squeue --me\n
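You can also restrict the output to a single job by passing the job ID reported by sbatch, for example (using the hypothetical job ID 12345 from the sbatch example above):

squeue --job 12345\n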
"},{"location":"tursa-user-guide/scheduler/#scancel-deleting-jobs","title":"scancel: deleting jobs","text":"

scancel is used to delete a job from the scheduler. If the job is waiting to run it is simply cancelled; if it is a running job then it is stopped immediately. You need to provide the job ID of the job you wish to cancel/stop. For example:

scancel 12345\n

will cancel (if waiting) or stop (if running) the job with ID 12345.

"},{"location":"tursa-user-guide/scheduler/#resource-limits","title":"Resource Limits","text":"

The Tursa resource limits for any given job are covered by three separate attributes.

"},{"location":"tursa-user-guide/scheduler/#primary-resource","title":"Primary resource","text":"

The primary resource you can request for your job is the compute node.

Information

The --exclusive option is enforced on Tursa which means you will always have access to all of the memory on the compute node regardless of how many processes are actually running on the node.

Note

You will not generally have access to the full amount of memory resource on the node as some is retained for running the operating system and other system processes.

"},{"location":"tursa-user-guide/scheduler/#partitions","title":"Partitions","text":"

On Tursa, compute nodes are grouped into partitions. You will have to specify a partition using the --partition option in your Slurm submission script. The following table has a list of active partitions on Tursa.

Partition | Description | Max nodes available
cpu | CPU nodes with AMD EPYC 48-core processor \u00d7 2 | 6
gpu | GPU nodes with AMD EPYC 48-core processor and NVIDIA A100 GPU \u00d7 4 (this includes both A100-40 and A100-80 GPU) | 181
gpu-a100-40 | GPU nodes with 2 AMD EPYC 16-core processors and NVIDIA A100-40 GPU \u00d7 4 | 114
gpu-a100-80 | GPU nodes with 2 AMD EPYC 24-core processors (3 nodes have 2 AMD EPYC 16-core processors) and NVIDIA A100-80 GPU \u00d7 4 | 67

You can list the active partitions by running sinfo.

Tip

You may not have access to all the available partitions.

"},{"location":"tursa-user-guide/scheduler/#quality-of-service-qos","title":"Quality of Service (QoS)","text":"

On Tursa, job limits are defined by the requested Quality of Service (QoS), as specified by the --qos Slurm directive. The following table lists the active QoS on Tursa.

QoS | Max Nodes Per Job | Max Walltime | Jobs Queued | Jobs Running | Partition(s) | Notes
standard | 64 | 48 hrs | 32 | 16 | gpu, gpu-a100-40, gpu-a100-80, cpu | Only job sizes that are powers of 2 nodes are allowed (i.e. 1, 2, 4, 8, 16, 32, 64 nodes), only available when your budget is positive.
low | 64 | 24 hrs | 4 | 4 | gpu, gpu-a100-40, gpu-a100-80, cpu | Only job sizes that are powers of 2 nodes are allowed (i.e. 1, 2, 4, 8, 16, 32, 64 nodes), only available when your budget is zero or negative.
dev | 2 | 4 hrs | 2 | 1 | gpu | For faster turnaround for development jobs and interactive sessions, only available when your budget is positive. The dev QoS must be used with the gpu-a100-40 (1-node maximum) or gpu-a100-80 (2-node maximum) partitions.

You can find out the QoS that you can use by running the following command:

sacctmgr show assoc user=$USER cluster=tursa format=cluster,account,user,qos%50\n

As long as you have a positive budget, you should use the standard QoS. Once you have exhausted your budget you can use the low QoS to continue to run jobs at a lower priority than jobs in the standard QoS.

Hint

If you have needs which do not fit within the current QoS, please contact the Service Desk and we can discuss how to accommodate your requirements.

Important

Only job sizes that are powers of 2 nodes are allowed, i.e. 1, 2, 4, 8, 16, 32, 64 nodes on the gpu partition and 1, 2, 4 nodes on the cpu partition.

"},{"location":"tursa-user-guide/scheduler/#priority","title":"Priority","text":"

Job priority on Tursa depends on a number of different factors: the QoS the job is submitted to, the length of time the job has been waiting in the queue (age) and the recent usage of the job's account (fairshare):

Each of these factors is normalised to a value between 0 and 1, multiplied by a weight, and the resulting values are combined to produce a priority for the job. The current job priority formula on Tursa is:

Priority = [10000 * P(QoS)] + [500 * P(Age)] + [300 * P(Fairshare)]\n

The priority factors are:

You can view the priorities for current queued jobs on the system with the sprio command:

[dc-user1@tursa-login1 ~]$ sprio \n          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE        QOS\n          43963 gpu             5055          0         51          5       5000\n          43975 gpu             5061          0         41         20       5000\n          43976 gpu             5061          0         41         20       5000\n          43982 gpu             5046          0         26         20       5000\n          43986 gpu             5011          0          6          5       5000\n          43996 gpu             5020          0          0         20       5000\n          43997 gpu             5020          0          0         20       5000\n
"},{"location":"tursa-user-guide/scheduler/#troubleshooting","title":"Troubleshooting","text":""},{"location":"tursa-user-guide/scheduler/#slurm-error-messages","title":"Slurm error messages","text":"

An incorrect submission will cause Slurm to return an error. Some common problems are listed below, with a suggestion about the likely cause:

"},{"location":"tursa-user-guide/scheduler/#slurm-queued-reasons","title":"Slurm queued reasons","text":"

The squeue command allows users to view information for jobs managed by Slurm. Jobs typically go through the following states: PENDING, RUNNING, COMPLETING, and COMPLETED. The first table provides a description of some job state codes. The second table provides a description of the reasons that cause a job to be in a state.

Status | Code | Description
PENDING | PD | Job is awaiting resource allocation.
RUNNING | R | Job currently has an allocation.
SUSPENDED | S | Job currently has an allocation.
COMPLETING | CG | Job is in the process of completing. Some processes on some nodes may still be active.
COMPLETED | CD | Job has terminated all processes on all nodes with an exit code of zero.
TIMEOUT | TO | Job terminated upon reaching its time limit.
STOPPED | ST | Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job.
OUT_OF_MEMORY | OOM | Job experienced out of memory error.
FAILED | F | Job terminated with non-zero exit code or other failure condition.
NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes.
CANCELLED | CA | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.

For a full list, see Job State Codes.

Reason | Description
Priority | One or more higher priority jobs exist for this partition or advanced reservation.
Resources | The job is waiting for resources to become available.
BadConstraints | The job's constraints can not be satisfied.
BeginTime | The job's earliest start time has not yet been reached.
Dependency | This job is waiting for a dependent job to complete.
Licenses | The job is waiting for a license.
WaitingForScheduling | No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason.
Prolog | Its PrologSlurmctld program is still running.
JobHeldAdmin | The job is held by a system administrator.
JobHeldUser | The job is held by the user.
JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc.
NonZeroExitCode | The job terminated with a non-zero exit code.
InvalidAccount | The job's account is invalid.
InvalidQOS | The job's QOS is invalid.
QOSUsageThreshold | Required QOS threshold has been breached.
QOSJobLimit | The job's QOS has reached its maximum job count.
QOSResourceLimit | The job's QOS has reached some resource limit.
QOSTimeLimit | The job's QOS has reached its time limit.
NodeDown | A node required by the job is down.
TimeLimit | The job exhausted its time limit.
ReqNodeNotAvail | Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's \"reason\" field as \"UnavailableNodes\". Such nodes will typically require the intervention of a system administrator to make available.

For a full list, see Job Reasons.

"},{"location":"tursa-user-guide/scheduler/#output-from-slurm-jobs","title":"Output from Slurm jobs","text":"

Slurm places standard output (STDOUT) and standard error (STDERR) for each job in the file slurm-<JobID>.out. This file appears in the job's working directory once your job starts running.

Hint

Output may be buffered - to enable live output, e.g. for monitoring job status, add --unbuffered to the srun command in your SLURM script.
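For example, based on the launch line used in the example scripts below, the srun command would become:

srun --unbuffered --hint=nomultithread --distribution=block:block gpu_launch.sh ${application} ${options}\n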

"},{"location":"tursa-user-guide/scheduler/#specifying-resources-in-job-scripts","title":"Specifying resources in job scripts","text":"

You specify the resources you require for your job using #SBATCH directives at the top of your job submission script.

Hint

Most options provided using #SBATCH directives can also be specified as command line options to srun.

If you do not specify any options, then the default for each option will be applied. As a minimum, all job submissions must specify the budget that they wish to charge the job to using the option --account=[budget code].

Other common options that are used are:

In addition, parallel jobs will also need to specify how many nodes, parallel processes and threads they require; a minimal set of directives is sketched below.
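As an illustration, a minimal sketch of the directives for a 4-node GPU job is shown below. The values are taken from the example scripts later in this section; the job name is purely illustrative and [budget code] should be replaced with your own budget:

#SBATCH --account=[budget code]\n#SBATCH --job-name=My_Job\n#SBATCH --time=12:0:0\n#SBATCH --nodes=4\n#SBATCH --tasks-per-node=4\n#SBATCH --cpus-per-task=8\n#SBATCH --gres=gpu:4\n#SBATCH --partition=gpu\n#SBATCH --qos=standard\n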

If you are happy to have any GPU type for your job (A100-40 or A100-80) then you select the gpu partition with --partition=gpu.

If you wish to use just the A100-80 GPU nodes, which have higher memory, you add the option --partition=gpu-a100-80.

To use just the A100-40 GPU nodes, add the option --partition=gpu-a100-40.

If you do not specify a partition, the scheduler may use any available node types for the job (equivalent to --partition=gpu).

Note

For parallel jobs, Tursa operates in a node exclusive way. This means that you are assigned resources in units of full compute nodes for your jobs (i.e. 32 cores and 4 GPUs on A100-40 nodes, 48 cores and 4 GPUs on A100-80 nodes, 128 cores on CPU nodes) and that no other user can share those compute nodes with you. Hence, the minimum amount of resource you can request for a parallel job is 1 node.

To prevent the behaviour of batch scripts being dependent on the user environment at the point of submission, you can add the option --export=none to your submission script.

Using the --export=none means that the behaviour of batch submissions should be repeatable. We strongly recommend its use, although see the following section to enable access to the usual modules.
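A sketch of how the start of such a script might look, using the module names from the examples later in this guide, is shown below. This is an illustration rather than a complete script:

#!/bin/bash\n#SBATCH --export=none\n#SBATCH --account=[budget code]\n\n# Rebuild the module environment inside the job rather than inheriting it\nmodule load /home/y07/shared/tursa-modules/setup-env\nmodule load gcc/9.3.0\nmodule load cuda/12.3\nmodule load openmpi/4.1.5-cuda12.3\n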

"},{"location":"tursa-user-guide/scheduler/#gpu-frequency","title":"GPU frequency","text":"

Important

The default GPU frequency on Tursa compute nodes was changed from 1410 MHz to 1040 MHz on Thursday 15 Dec 2022 to improve the energy efficiency of the service.

Users can control the GPU frequency in their job submission scripts:
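For example, to request the previous default frequency for a job you could add the standard Slurm GPU frequency directive to your submission script. The exact directive supported on Tursa is an assumption here; check the output of your job to confirm the frequency that was applied:

#SBATCH --gpu-freq=1410\n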

Bug

When setting the GPU frequency you will see an error in the output from the job that says control disabled. This is an incorrect message due to an issue with how Slurm sets the GPU frequency and can be safely ignored.

"},{"location":"tursa-user-guide/scheduler/#srun-launching-parallel-jobs","title":"srun: Launching parallel jobs","text":"

If you are running parallel jobs, your job submission script should contain one or more srun commands to launch the parallel executable across the compute nodes. In most cases you will want to add the options --distribution=block:block and --hint=nomultithread to your srun command to ensure you get the correct pinning of processes to cores on a compute node.

A brief explanation of these options:

- --hint=nomultithread - do not use hyperthreads/SMP
- --distribution=block:block - the first block means use a block distribution of processes across nodes (i.e. fill nodes before moving onto the next one) and the second block means use a block distribution of processes across \"sockets\" within a node (i.e. fill a \"socket\" before moving on to the next one).

Important

The Slurm definition of a \"socket\" does not usually correspond to a physical CPU socket. On Tursa GPU nodes it corresponds to half the cores on a socket as the GPU nodes are configured with NPS2.

On the Tursa CPU nodes, the Slurm definition of a socket does correspond to a physical CPU socket (64 cores) as the CPU nodes are configured with NPS1.

"},{"location":"tursa-user-guide/scheduler/#example-job-submission-scripts","title":"Example job submission scripts","text":""},{"location":"tursa-user-guide/scheduler/#example-job-submission-script-for-a-parallel-job-using-cuda","title":"Example: job submission script for a parallel job using CUDA","text":"

A job submission script for a parallel job that uses 4 compute nodes, 4 MPI processes per node and 4 GPUs per node. It does not restrict what type of GPU the job can run on so both A100-40 and A100-80 can be used:

#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Grid_job\n#SBATCH --time=12:0:0\n#SBATCH --nodes=4\n#SBATCH --tasks-per-node=4\n#SBATCH --cpus-per-task=8\n#SBATCH --gres=gpu:4\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]             \n\n# Load the correct modules\nmodule load /home/y07/shared/tursa-modules/setup-env\nmodule load gcc/9.3.0\nmodule load cuda/12.3\nmodule load openmpi/4.1.5-cuda12.3 \n\nexport OMP_NUM_THREADS=8\nexport OMP_PLACES=cores\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# These will need to be changed to match the actual application you are running\napplication=\"my_mpi_openmp_app.x\"\noptions=\"arg 1 arg2\"\n\n# We have reserved the full nodes, now distribute the processes as\n# required: 4 MPI processes per node, stride of 12 cores between \n# MPI processes\n# \n# Note use of gpu_launch.sh wrapper script for GPU and NIC pinning \nsrun --nodes=4 --tasks-per-node=4 --cpus-per-task=12 \\\n     --hint=nomultithread --distribution=block:block \\\n     gpu_launch.sh \\\n     ${application} ${options}\n

This will run your executable \"my_mpi_openmp_app.x\" in parallel using 16 MPI processes on 4 nodes. 4 GPUs will be used per node.

Important

You must use the gpu_launch.sh wrapper script to get the correct binding of GPU to MPI processes and of network interface to GPU and MPI process. This script is described in more detail below.

"},{"location":"tursa-user-guide/scheduler/#gpu_launchsh-wrapper-script","title":"gpu_launch.sh wrapper script","text":"

The gpu_launch.sh wrapper script is required to set the correct binding of GPU to MPI processes and the correct binding of interconnect interfaces to MPI process and GPU. We provide this centrally for convenience but its contents are simple:

#!/bin/bash\n\n# Compute the raw process ID for binding to GPU and NIC\nlrank=$((SLURM_PROCID % SLURM_NTASKS_PER_NODE))\n\n# Bind the process to the correct GPU and NIC\nexport CUDA_VISIBLE_DEVICES=${lrank}\nexport UCX_NET_DEVICES=mlx5_${lrank}:1\n\n$@\n
"},{"location":"tursa-user-guide/scheduler/#using-the-dev-qos","title":"Using the dev QoS","text":"

The dev QoS is designed for faster turnaround of short jobs than is usually available through the production QoS. It is subject to a number of restrictions:

In addition, you must specify either the gpu-a100-80 or gpu-a100-40 partition when using the dev QoS.

Tip

The generic gpu partition will not work consistently when using the dev QoS.

Here is an example job submission script for a 2-node job in the dev QoS using the gpu-a100-80 partition. Note the use of the gpu_launch.sh wrapper script to get correct GPU and NIC binding.

#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Dev_Job\n#SBATCH --time=4:0:0\n#SBATCH --nodes=2\n#SBATCH --tasks-per-node=48\n#SBATCH --cpus-per-task=1\n#SBATCH --gres=gpu:4\n#SBATCH --partition=gpu-a100-80\n#SBATCH --qos=dev\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n\nexport OMP_NUM_THREADS=1\nexport OMP_PLACES=cores\n\n# Load the correct modules\nmodule load /home/y07/shared/tursa-modules/setup-env\nmodule load gcc/9.3.0\nmodule load cuda/12.3\nmodule load openmpi/4.1.5-cuda12.3 \n\n# These will need to be changed to match the actual application you are running\napplication=\"my_mpi_openmp_app.x\"\noptions=\"arg 1 arg2\"\n\n# We have reserved the full nodes, now distribute the processes as\n# required: 4 MPI processes per node, stride of 12 cores between \n# MPI processes\n# \n# Note use of gpu_launch.sh wrapper script for GPU and NIC pinning \nsrun --nodes=2 --tasks-per-node=4 --cpus-per-task=12 \\\n     --hint=nomultithread --distribution=block:block \\\n     gpu_launch.sh \\\n     ${application} ${options}\n
"},{"location":"tursa-user-guide/sw-environment/","title":"Software environment","text":"

The software environment on Tursa is primarily controlled through the module command. By loading and switching software modules you control which software and versions are available to you.

Information

A module is a self-contained description of a software package -- it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.

By default, all users on Tursa start with the default software environment loaded.

Software modules on Tursa are provided by both Eviden and by EPCC.

In this section, we provide:

"},{"location":"tursa-user-guide/sw-environment/#using-the-module-command","title":"Using the module command","text":"

We only cover basic usage of the module command here. For full documentation please see the Linux manual page on modules

The module command takes a subcommand to indicate what operation you wish to perform. Common subcommands are list, avail, show, load, rm, swap, save, savelist, saveshow and saverm.

These are described in more detail below.

"},{"location":"tursa-user-guide/sw-environment/#information-on-the-available-modules","title":"Information on the available modules","text":"

The module list command will give the names of the modules and their versions you have presently loaded in your environment. By default, you will have no modules loaded when you first log into Tursa

Finding out which software modules are available on the system is performed using the module avail command. To list all software modules available, use:

[dc-user1@tursa-login1 ~]$ module avail\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles -------------------------------------------\ncuda/11.0.2  openmpi/4.1.1-cuda11.0.2  ucx/1.10.1-cuda11.0.2  \n\n------------------------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles --------------------------------------------\ncuda/11.4.1  openmpi/4.1.1-cuda11.4  ucx/1.12.0-cuda11.4  \n\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles -------------------------------------------\ncuda/11.4.1  openmpi/4.1.1-cuda11.4.1  ucx/1.12.0-cuda11.4.1  \n\n------------------------------------------------ /mnt/lustre/tursafs1/apps/modulefiles -------------------------------------------------\ncuda/11.0.3  dot  gcc/9.3.0  module-git  module-info  modules  null  openmpi/4.1.1  ucx/1.10.1  use.own  xpmem/2.6.5   \n

This will list all the names and versions of the modules available on the service. Not all of them may work in your account though due to, for example, licensing restrictions. You will notice that for many modules we have more than one version, each of which is identified by a version number. One of these versions is the default. As the service develops, the default version will change and old versions of software may be deleted.

You can list all the modules of a particular type by providing an argument to the module avail command. For example, to list all available versions of the OpenMPI library, use:

[dc-user1@tursa-login1 ~]$ module avail openmpi\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles -------------------------------------------\nopenmpi/4.1.1-cuda11.0.2  \n\n------------------------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles --------------------------------------------\nopenmpi/4.1.1-cuda11.4  \n\n------------------------------------------ /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles -------------------------------------------\nopenmpi/4.1.1-cuda11.4.1  \n\n----------------\n

The module show command reveals what operations the module actually performs to change your environment when it is loaded. We provide a brief overview of what the significance of these different settings mean below. For example, for the default openmpi module:

[dc-user1@tursa-login1 ~]$ module show openmpi\n-------------------------------------------------------------------\n/mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles/openmpi/4.1.1-cuda11.0.2:\n\nmodule-whatis   Sets up OpenMPI on your environment\nsetenv          MPI_ROOT        /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/\nprepend-path    PATH /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/bin/\nprepend-path    LD_LIBRARY_PATH /mnt/lustre/tursafs1/apps/basestack/cuda-11.0.2/openmpi/4.1.1/lib\nprepend-path    MANPATH /opt/mpi/openmpi/4.0.4.1/share/man\nmodule load     ucx/1.10.1\nsetenv          OMPI_CC cc\nsetenv          OMPI_CXX        g++\nsetenv          OMPI_CFLAGS     -g -m64\nsetenv          OMPI_CXXFLAGS   -g -m64\n-------------------------------------------------------------------\n
"},{"location":"tursa-user-guide/sw-environment/#loading-removing-and-swapping-modules","title":"Loading, removing and swapping modules","text":"

To load a module, use the module load command. For example, to load the default version of OpenMPI into your environment, use:

[dc-user1@tursa-login1 ~]$ module load openmpi\n\n        UCX 1.10 loaded\n\n\n        OpenMPI 4.1.1 loaded\n

Once you have done this, your environment will be set up to use the OpenMPI library. The above command will load the default version of OpenMPI. If you need a specific version of the software, you can add more information:

[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4.1\n\n        UCX 1.12.0 compiled with cuda 11.4.1 loaded\n\n\n        OpenMPI 4.1.1 with cuda-11.4.1 and UCX 1.12.0  loaded\n

will load OpenMPI version 4.1.1 with CUDA 11.4.1 into your environment, regardless of the default.

If you want to remove software from your environment, module rm will remove a loaded module:

[dc-user1@tursa-login1 ~]$ module rm openmpi\n

will unload whatever version of openmpi (even if it is not the default) you might have loaded.

There are many situations in which you might want to change the presently loaded version to a different one, such as trying the latest version which is not yet the default or using a legacy version to keep compatibility with old data. This can be achieved most easily by using module swap oldmodule newmodule.

Suppose you have loaded version 4.1.1 of openmpi; the following command will swap to version 4.1.1-cuda11.4.1:

[dc-user1@tursa-login1 ~]$ module swap openmpi openmpi/4.1.1-cuda11.4.1\n\n        UCX 1.12.0 compiled with cuda 11.4.1 loaded\n\n\n        OpenMPI 4.1.1 with cuda-11.4.1 and UCX 1.12.0  loaded\n

You do not need to specify the version of the currently loaded module as it can be inferred: it is the only version of openmpi you have loaded.

"},{"location":"tursa-user-guide/sw-environment/#capturing-your-environment-for-reuse","title":"Capturing your environment for reuse","text":"

Sometimes it is useful to save the module environment that you are using to compile a piece of code or execute a piece of software. This is saved as a module collection. You can save a collection from your current environment by executing:

[dc-user1@tursa-login1 ~]$ module save [collection_name]\n

Note

If you do not specify the environment name, it is called default.

You can find the list of saved module environments by executing:

[dc-user1@tursa-login1 ~]$ module savelist\nNamed collection list:\n 1) default\n

To list the modules in a collection, you can execute, e.g.,:

[dc-user1@tursa-login1 ~]$ module saveshow default\n-------------------------------------------------------------------\n/home/z01/z01/dc-turn1/.module/default:\n\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.0.2-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.4-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles\nmodule use --append /mnt/lustre/tursafs1/apps/modulefilesintel\nmodule use --append /mnt/lustre/tursafs1/apps/modulefiles\nmodule load ucx/1.12.0-cuda11.4.1\nmodule load openmpi/4.1.1-cuda11.4.1\n\n-------------------------------------------------------------------\n

Note again that the details of the collection have been saved to the home directory (the first line of output above). It is possible to save a module collection with a fully qualified path, e.g.,

[dc-user1@tursa-login1 ~]$ module save /home/t01/z01/auser/my-module-collection\n

if you want to save to a specific file name.

To delete a module environment, you can execute:

[dc-user1@tursa-login1 ~]$ module saverm <environment_name>\n
"},{"location":"tursa-user-guide/sw-environment/#shell-environment-overview","title":"Shell environment overview","text":"

When you log in to Tursa, you are using the bash shell by default. As any other software, the bash shell has loaded a set of environment variables that can be listed by executing printenv or export.

The environment variables listed by these commands define the behaviour of the software you run. For instance, OMP_NUM_THREADS defines the number of OpenMP threads.

To define an environment variable, you need to execute:

export OMP_NUM_THREADS=4\n

Please note there are no blanks between the variable name, the assignment symbol (=), and the value. If the value is a string, enclose the string in double quotation marks.
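For example, to set a hypothetical variable to a string value that contains spaces:

export MY_OPTIONS=\"arg1 arg2\"\n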

You can show the value of a specific environment variable if you print it:

echo $OMP_NUM_THREADS\n

Do not forget the dollar symbol. To remove an environment variable, just execute:

unset OMP_NUM_THREADS\n
"},{"location":"tursa-user-guide/sw-environment/#compiler-environment","title":"Compiler environment","text":"

The system supports two different primary compiler environments: the GCC toolchain and the NVHPC toolchain.

"},{"location":"tursa-user-guide/sw-environment/#gcc-toolchain","title":"GCC toolchain","text":"

To compile on the system for GPU nodes using the GCC toolchain, you would typically load the required modules:

[dc-user1@tursa-login1 ~]$ module load gcc/9.3.0\n[dc-user1@tursa-login1 ~]$ module load cuda/11.4.1 \n[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4\n[dc-user1@tursa-login1 ~]$ module list\nCurrently Loaded Modulefiles:\n 1) gcc/9.3.0   2) cuda/11.4.1   3) ucx/1.12.0-cuda11.4   4) openmpi/4.1.1-cuda11.4 \n

Once you have loaded the modules, the standard OpenMPI compiler wrapper scripts are available: mpicc (C), mpicxx (C++) and mpif90 (Fortran).

You can find more information on these scripts in the OpenMPI documentation.
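For example, to compile a simple MPI program in C with the wrappers (the source file name here is purely illustrative):

mpicc -O2 -o my_mpi_app.x my_mpi_app.c\n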

"},{"location":"tursa-user-guide/sw-environment/#nvhpc-toolchain","title":"NVHPC toolchain","text":"

To compile on the system for GPU nodes using the NVHPC toolchain, you would typically load the required modules:

[dc-user1@tursa-login1 ~]$ module load /home/y07/shared/tursa-modules/setup-env\n[dc-user1@tursa-login1 ~]$ module load gcc/9.3.0\n[dc-user1@tursa-login1 ~]$ module load nvhpc/21.7-nompi\n[dc-user1@tursa-login1 ~]$ module load openmpi/4.1.1-cuda11.4\n[dc-user1@tursa-login1 ~]$ module list\nCurrently Loaded Modulefiles:\n 1) /mnt/lustre/tursafs1/home/y07/shared/tursa-modules/setup-env   2) nvhpc/21.7-nompi\n 3) ucx/1.12.0-cuda11.4   4) openmpi/4.1.1-cuda11.4    5) gcc/9.3.0\n

Once you have loaded the modules, the standard OpenMPI compiler wrapper scripts are available: mpicc (C), mpicxx (C++) and mpif90 (Fortran).

The NVIDIA compilers are available as nvc (C), nvc++ (C++) and nvfortran (Fortran).

Tip

Both the NVIDIA compilers and the MPI compiler wrapper scripts will use the GCC compilers directly in the default configuration - this is often what you want. If you want the compiler wrappers to call the NVIDIA compilers themselves rather than GCC directly, you would use:

export OMPI_CC=nvcc\nexport OMPI_CXX=nvc++\nexport OMPI_FC=nvfortran\n
"},{"location":"tursa-user-guide/sw-environment/#other-build-tools","title":"Other build tools","text":""},{"location":"tursa-user-guide/sw-environment/#cmake","title":"cmake","text":"

CMake is available by using the commands:

[dc-user1@tursa-login1 ~]$ module load /home/y07/shared/tursa-modules/setup-env\n[dc-user1@tursa-login1 ~]$ module load cmake\n
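A typical out-of-source build with CMake then looks something like the following sketch, where the project layout and install location are illustrative:

mkdir build\ncd build\ncmake .. -DCMAKE_INSTALL_PREFIX=$HOME/my-app\nmake -j 8\nmake install\n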
"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index b817ed3..e986db1 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ diff --git a/tursa-user-guide/scheduler/index.html b/tursa-user-guide/scheduler/index.html index f27a881..c90a229 100644 --- a/tursa-user-guide/scheduler/index.html +++ b/tursa-user-guide/scheduler/index.html @@ -1632,12 +1632,20 @@

Example: jo module load openmpi/4.1.5-cuda12.3 export OMP_NUM_THREADS=8 +export OMP_PLACES=cores +export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK # These will need to be changed to match the actual application you are running application="my_mpi_openmp_app.x" options="arg 1 arg2" -srun --hint=nomultithread --distribution=block:block \ +# We have reserved the full nodes, now distribute the processes as +# required: 4 MPI processes per node, stride of 12 cores between +# MPI processes +# +# Note use of gpu_launch.sh wrapper script for GPU and NIC pinning +srun --nodes=4 --tasks-per-node=4 --cpus-per-task=12 \ + --hint=nomultithread --distribution=block:block \ gpu_launch.sh \ ${application} ${options} @@ -1694,8 +1702,8 @@

Using the dev QoS

#SBATCH --job-name=Example_Dev_Job #SBATCH --time=12:0:0 #SBATCH --nodes=2 -#SBATCH --tasks-per-node=4 -#SBATCH --cpus-per-task=12 +#SBATCH --tasks-per-node=48 +#SBATCH --cpus-per-task= #SBATCH --gres=gpu:4 #SBATCH --partition=gpu-a100-80 #SBATCH --qos=dev @@ -1704,12 +1712,7 @@

Using the dev QoS

#SBATCH --account=[budget code] export OMP_NUM_THREADS=1 - -# Load the correct modules -module load /home/y07/shared/tursa-modules/setup-env -module load gcc/9.3.0 -module load cuda/12.3 -module load openmpi/4.1.5-cuda12.3 +export OMP_PLACES=cores # Load the correct modules module load /home/y07/shared/tursa-modules/setup-env @@ -1721,7 +1724,13 @@

Using the dev QoS

application="my_mpi_openmp_app.x" options="arg 1 arg2" -srun --hint=nomultithread --distribution=block:block \ +# We have reserved the full nodes, now distribute the processes as +# required: 4 MPI processes per node, stride of 12 cores between +# MPI processes +# +# Note use of gpu_launch.sh wrapper script for GPU and NIC pinning +srun --nodes=2 --tasks-per-node=4 --cpus-per-task=12 \ + --hint=nomultithread --distribution=block:block \ gpu_launch.sh \ ${application} ${options}