diff --git a/search/search_index.json b/search/search_index.json index 6f1d78b31..7376f35f5 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"EIDF User Documentation","text":"
The Edinburgh International Data Facility (EIDF) is built and operated by EPCC at the University of Edinburgh. EIDF is a place to store, find and work with data of all kinds. You can find more information on the service and the research it supports on the EIDF website.
For more information or for support with our services, please email eidf@epcc.ed.ac.uk
in the first instance.
This documentation gives more in-depth coverage of current EIDF services. It is aimed primarily at developers or power users.
"},{"location":"#contributing-to-the-documentation","title":"Contributing to the documentation","text":"The source for this documentation is publicly available in the EIDF documentation Github repository so that anyone can contribute to improve the documentation for the service. Contributions can be in the form of improvements or additions to the content and/or addition of Issues providing suggestions for how it can be improved.
Full details of how to contribute can be found in the README.md
file of the repository.
This documentation set is a work in progress.
"},{"location":"#credits","title":"Credits","text":"This documentation draws on the ARCHER2 National Supercomputing Service documentation.
"},{"location":"access/","title":"Accessing EIDF","text":"Some EIDF services are accessed via a Web browser and some by \"traditional\" command-line ssh
.
All EIDF services use the EPCC SAFE service management back end, to ensure compatibility with other EPCC high-performance computing services.
"},{"location":"access/#web-access-to-virtual-machines","title":"Web Access to Virtual Machines","text":"The Virtual Desktop VM service is browser-based, providing a virtual desktop interface (Apache Guacamole) for \"desktop-in-a-browser\" access. Applications to use the VM service are made through the EIDF Portal.
EIDF Portal: how to ask to join an existing EIDF project and how to apply for a new project
VDI access to virtual machines: how to connect to the virtual desktop interface.
"},{"location":"access/#ssh-access-to-virtual-machines","title":"SSH Access to Virtual Machines","text":"Users with the appropriate permissions can also use ssh
to log in to Virtual Desktop VMs.
Includes access to the following services:
To log in to most command-line services with ssh
you should use the username and password you obtained from SAFE when you applied for access, along with the SSH Key you registered when creating the account. You can then log in to the host following the appropriately linked instructions above.
Projects using the Virtual Desktop cloud service are accessed via the EIDF Portal.
The EIDF Portal uses EPCC's SAFE service management software to manage user accounts across all EPCC services. To log in to the Portal you will first be redirected to the SAFE log on page. If you do not have a SAFE account, follow the instructions in the SAFE documentation on how to register and receive your password.
"},{"location":"access/project/#how-to-request-to-join-a-project","title":"How to request to join a project","text":"Log in to the EIDF Portal and navigate to \"Projects\" and choose \"Request access\". Select the project that you want to join in the \"Project\" dropdown list - you can search for the project name or the project code, e.g. \"eidf0123\".
Now you have to wait for your PI or project manager to accept your request to register.
"},{"location":"access/project/#how-to-apply-for-a-project-as-a-principal-investigator","title":"How to apply for a project as a Principal Investigator","text":""},{"location":"access/project/#create-a-new-project-application","title":"Create a new project application","text":"Navigate to the EIDF Portal and log in via SAFE if necessary (see above).
Once you have logged in click on \"Applications\" in the menu and choose \"New Application\".
Once the application has been created you will see an overview of the form you are required to fill in. You can revisit the application at any time by clicking on \"Applications\" and choosing \"Your applications\" to display all your current and past applications and their status, or follow the link https://portal.eidf.ac.uk/proposal/.
"},{"location":"access/project/#populate-a-project-application","title":"Populate a project application","text":"Fill in each section of the application as required:
You can edit and save each section separately and revisit the application at a later time.
"},{"location":"access/project/#datasets","title":"Datasets","text":"You are required to fill in a \"Dataset\" form for each dataset that you are planning to store and process as part of your project.
We are required to ensure that projects involving \"sensitive\" data have the necessary permissions in place. The answers to these questions will enable us to decide what additional documentation we may need, and whether your project may need to be set up in an independently governed Safe Haven. There may be some projects we are simply unable to host for data protection reasons.
"},{"location":"access/project/#resource-requirements","title":"Resource Requirements","text":"Add an estimate for each size and type of VM that is required.
"},{"location":"access/project/#submission","title":"Submission","text":"When you are happy with your application, click \"Submit\". If there are missing fields that are required these are highlighted and your submission will fail.
When your submission is successful, the application status is marked as \"Submitted\" and you must wait while the EIDF approval team considers your application. You may be contacted if there are any questions regarding your application or if further information is required, and you will be notified of the outcome of your application.
"},{"location":"access/project/#approved-project","title":"Approved Project","text":"If your application was approved, refer to Data Science Virtual Desktops: Quickstart how to view your project and to Data Science Virtual Desktops: Managing VMs how to manage a project and how to create virtual machines and user accounts.
"},{"location":"access/ssh/","title":"SSH Access to Virtual Machines using the EIDF-Gateway Jump Host","text":"The EIDF-Gateway is an SSH gateway suitable for accessing EIDF Services via a console or terminal. As the gateway cannot be 'landed' on, a user can only pass through it and so the destination (the VM IP) has to be known for the service to work. Users connect to their VM through the jump host using their given accounts. You will require three things to use the gateway:
Steps to meet all of these requirements are explained below.
"},{"location":"access/ssh/#generating-and-adding-an-ssh-key","title":"Generating and Adding an SSH Key","text":"In order to make use of the EIDF-Gateway, your EIDF account needs an SSH-Key associated with it. If you added one while creating your EIDF account, you can skip this step.
"},{"location":"access/ssh/#check-for-an-existing-ssh-key","title":"Check for an existing SSH Key","text":"To check if you have an SSH Key associated with your account:
If there is an entry under 'Credentials', then you're all set up. If not, you'll need to generate an SSH-Key. To do this:
"},{"location":"access/ssh/#generate-a-new-ssh-key","title":"Generate a new SSH Key","text":"Generate a new SSH Key:
ssh-keygen\n
It is fine to accept the default name and path for the key unless you manage a number of keys.
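If you do manage several keys, one option is to give the EIDF key its own name and comment so it is easy to identify later. This is a minimal sketch only; the key type, file name and comment are illustrative and not mandated by EIDF:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_eidf -C \"EIDF gateway key\"\n\n# Confirm the key exists by listing its fingerprint\nssh-keygen -lf ~/.ssh/id_ed25519_eidf.pub\n
If you use a non-default name like this, remember to pass the same path to ssh-add and to any ~/.ssh/config entries described later.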
This should not be necessary for most users, so only follow this process if you have an issue or have been told to by the EPCC Helpdesk. If you need to add an SSH Key directly to SAFE, you can follow this guide. However, select your '[username]@EIDF' login account, not 'Archer2' as specified in that guide.
"},{"location":"access/ssh/#enabling-mfa-via-the-portal","title":"Enabling MFA via the Portal","text":"A multi-factor Time-Based One-Time Password is now required to access the SSH Gateway.
To enable this for your EIDF account:
Note
TOTP is only required for the SSH Gateway, not for the VMs themselves, and not for access through the VDI. An MFA token will have to be set for each account you'd like to use to access the EIDF SSH Gateway.
"},{"location":"access/ssh/#using-the-ssh-key-and-totp-code-to-access-eidf-windows-and-linux","title":"Using the SSH-Key and TOTP Code to access EIDF - Windows and Linux","text":"From your local terminal, import the SSH Key you generated above: ssh-add /path/to/ssh-key
This should return \"Identity added [Path to SSH Key]\" if successful. You can then follow the steps below to access your VM.
OpenSSH is usually installed by default on Linux and macOS, so you can access the gateway natively from the terminal.
Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the commands below:
ssh-add /path/to/ssh-key\nssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip]\n
For example:
ssh-add ~/.ssh/keys/id_ed25519\nssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1\n
Info
If the ssh-add
command fails saying the SSH Agent is not running, run the below command:
eval `ssh-agent`
Then re-run the ssh-add command above.
The -J
flag is used to specify that we will access the second specified host by jumping through the first specified host.
You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application.
"},{"location":"access/ssh/#accessing-from-windows","title":"Accessing from Windows","text":"Windows will require the installation of OpenSSH-Server to use SSH. Putty or MobaXTerm can also be used but won\u2019t be covered in this tutorial.
"},{"location":"access/ssh/#installing-and-using-openssh","title":"Installing and using OpenSSH","text":"Import the SSH Key you generated above:
ssh-add \\path\\to\\sshkey\n\nFor Example:\nssh-add .\\.ssh\\id_ed25519\n
This should return \"Identity added [Path to SSH Key]\" if successful. If it doesn't, run the following in Powershell:
Get-Service -Name ssh-agent | Set-Service -StartupType Manual\nStart-Service ssh-agent\nssh-add \\path\\to\\sshkey\n
Login by jumping through the gateway.
ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip]\n\nFor Example:\nssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1\n
You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application.
"},{"location":"access/ssh/#ssh-aliases","title":"SSH Aliases","text":"You can use SSH Aliases to access your VMs with a single command.
Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. Using the text editor of your choice (vi used as an example), edit the .ssh/config file:
vi ~/.ssh/config\n
Insert the following lines:
Host eidf-gateway\n Hostname eidf-gateway.epcc.ed.ac.uk\n User <eidf project username>\n IdentityFile /path/to/ssh/key\n
For example:
Host eidf-gateway\n Hostname eidf-gateway.epcc.ed.ac.uk\n User alice\n IdentityFile ~/.ssh/id_ed25519\n
Save and quit the file.
Now you can ssh to your VM using the below command:
ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key\n
For Example:
ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519\n
You can add further alias options to make accessing your VM quicker. For example, if you use the below template to create an entry below the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to automatically jump through the EIDF-Gateway and onto your VM:
Host <vm name/alias>\n HostName 10.24.VM.IP\n User <vm username>\n IdentityFile /path/to/ssh/key\n ProxyCommand ssh eidf-gateway -W %h:%p\n
For Example:
Host demo\n HostName 10.24.1.1\n User alice\n IdentityFile ~/.ssh/id_ed25519\n ProxyCommand ssh eidf-gateway -W %h:%p\n
Now, by running ssh demo
your ssh agent will automatically follow the 'ProxyCommand' section in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM. Note that for this setup, if your key is RSA, you will need to add the following line to the bottom of the 'demo' alias: HostKeyAlgorithms +ssh-rsa (see the sketch below).
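For reference, a complete 'demo' entry with that extra line added might look like the sketch below (the host, username and key path are just the examples from above, with an RSA key assumed):
Host demo\n HostName 10.24.1.1\n User alice\n IdentityFile ~/.ssh/id_rsa\n ProxyCommand ssh eidf-gateway -W %h:%p\n HostKeyAlgorithms +ssh-rsa\n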
Info
This has added an 'Alias' entry to your ssh config, so whenever you ssh to 'eidf-gateway' your ssh agent will automatically fill the hostname, your username and ssh key. This method allows for a much less complicated ssh command to reach your VMs. You can replace the alias name with whatever you like, just change the 'Host' line from saying 'eidf-gateway' to the alias you would like. The -J
flag is used to specify that we will access the second specified host by jumping through the first specified host.
You do not have to set a password to log into virtual machines. However, if you have been given sudo permission, you will need to set a password to be able to make use of sudo. You can set (or reset) a password using the web form in the EIDF Portal following the instructions in Set or change the password for a user account.
"},{"location":"access/virtualmachines-vdi/","title":"Virtual Machines (VMs) and the EIDF Virtual Desktop Interface (VDI)","text":"Using the EIDF VDI, members of EIDF projects can connect to VMs that they have been granted access to. The EIDF VDI is a web portal that displays the connections to VMs a user has available to them, and then those connections can be easily initiated by clicking on them in the user interface. Once connected to the target VM, all interactions are mediated through the user's web browser by the EIDF VDI.
"},{"location":"access/virtualmachines-vdi/#login-to-the-eidf-vdi","title":"Login to the EIDF VDI","text":"Once your membership request to join the appropriate EIDF project has been approved, you will be able to login to the EIDF VDI at https://eidf-vdi.epcc.ed.ac.uk/vdi.
Authentication to the VDI is provided by SAFE, so if you do not have an active web browser session in SAFE, you will be redirected to the SAFE log on page. If you do not have a SAFE account, follow the instructions in the SAFE documentation on how to register and receive your password.
"},{"location":"access/virtualmachines-vdi/#navigating-the-eidf-vdi","title":"Navigating the EIDF VDI","text":"After you have been authenticated through SAFE and logged into the EIDF VDI, if you have multiple connections available to you that have been associated with your user (typically in the case of research projects), you will be presented with the VDI home screen as shown below:
VDI home page with list of available VM connections
Adding connections
Note that if a project manager has added a new connection for you it may not appear in the list of connections immediately. You must log out and log in again to refresh your connections list.
"},{"location":"access/virtualmachines-vdi/#connecting-to-a-vm","title":"Connecting to a VM","text":"If you have only one connection associated with your VDI user account (typically in the case of workshops), you will be automatically connected to the target VM's virtual desktop. Once you are connected to the VM, you will be asked for your username and password as shown below (if you are participating in a workshop, then you may not be asked for credentials)
Warning
If this is your first time connecting to EIDF using a new account, you have to set a password as described in Set or change the password for a user account.
VM virtual desktop connection user account login screen
Once your credentials have been accepted, you will be connected to your VM's desktop environment. For instance, the screenshot below shows a resulting connection to a Xubuntu 20.04 VM with the Xfce desktop environment.
VM virtual desktop
"},{"location":"access/virtualmachines-vdi/#vdi-features-for-the-virtual-desktop","title":"VDI Features for the Virtual Desktop","text":"The EIDF VDI is an instance of the Apache Guacamole clientless remote desktop gateway. Since the connection to your VM virtual desktop is entirely managed through Guacamole in the web browser, there are some additional features to be aware of that may assist you when using the VDI.
"},{"location":"access/virtualmachines-vdi/#the-vdi-menu","title":"The VDI Menu","text":"The Guacamole menu is a sidebar which is hidden until explicitly shown. On a desktop or other device which has a hardware keyboard, you can show this menu by pressing <Ctrl> + <Alt> + <Shift> on a Windows PC client, or <Ctrl> + <Command> + <Shift> on a Mac client. To hide the menu, you press the same key combination once again. The menu provides various options, including:
After you have activated the Guacamole menu using the key combination above, at the top of the menu is a text area labeled \u201cclipboard\u201d along with some basic instructions:
Text copied/cut within Guacamole will appear here. Changes to the text below will affect the remote clipboard.
The text area functions as an interface between the remote clipboard and the local clipboard. Text from the local clipboard can be pasted into the text area, causing that text to be sent to the clipboard of the remote desktop. Similarly, if you copy or cut text within the remote desktop, you will see that text within the text area, and can manually copy it into the local clipboard if desired.
You can use the standard keyboard shortcuts to copy text from your client PC or Mac to the Guacamole menu clipboard, then again copy that text from the Guacamole menu clipboard into an application or CLI terminal on the VM's remote desktop. An example of using the copy and paste clipboard is shown in the screenshot below.
The EIDF VDI Clipboard
"},{"location":"access/virtualmachines-vdi/#keyboard-language-and-layout-settings","title":"Keyboard Language and Layout Settings","text":"For users who do not have standard English (UK)
keyboard layouts, key presses can have unexpected translations as they are transmitted to your VM. Please contact the EIDF helpdesk at eidf@epcc.ed.ac.uk if you are experiencing difficulties with your keyboard mapping, and we will help to resolve this by changing some settings in the Guacamole VDI connection configuration.
Submit a query in the EIDF Portal by selecting \"Submit a Support Request\" in the \"Help and Support\" menu and filling in this form.
You can also email us at eidf@epcc.ed.ac.uk.
"},{"location":"faq/#how-do-i-request-more-resources-for-my-project-can-i-extend-my-project","title":"How do I request more resources for my project? Can I extend my project?","text":"Submit a support request: In the form select the project that your request relates to and select \"EIDF Project extension: duration and quota\" from the dropdown list of categories. Then enter the new quota or extension date in the description text box below and submit the request.
The EIDF approval team will consider the extension and you will be notified of the outcome.
"},{"location":"faq/#new-vms-and-vdi-connections","title":"New VMs and VDI connections","text":"My project manager gave me access to a VM but the connection doesn't show up in the VDI connections list?
This may happen when a machine/VM was added to your connections list while you were logged in to the VDI. Please refresh the connections list by logging out and logging in again, and the new connections should appear.
"},{"location":"faq/#non-default-ssh-keys","title":"Non-default SSH Keys","text":"I have different SSH keys for the SSH gateway and my VM, or I use a key which does not have the default name (~/.ssh/id_rsa) and I cannot login.
The command syntax shown in our SSH documentation (using the -J <username>@eidf-gateway
stanza) makes assumptions about SSH keys and their naming. You should try the full version of the command:
ssh -o ProxyCommand=\"ssh -i ~/.ssh/<gateway_private_key> -W %h:%p <gateway_username>@eidf-gateway.epcc.ed.ac.uk\" -i ~/.ssh/<vm_private_key> <vm_username>@<vm_ip>\n
Note that for the majority of users, gateway_username and vm_username are the same, as are gateway_private_key and vm_private_key. An equivalent ~/.ssh/config entry is sketched below.
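If you would rather not type that full command each time, the same assumptions can be captured in ~/.ssh/config instead. The sketch below is illustrative only; the alias name my-vm and the placeholder key names should be replaced with your own values:
Host eidf-gateway\n Hostname eidf-gateway.epcc.ed.ac.uk\n User <gateway_username>\n IdentityFile ~/.ssh/<gateway_private_key>\n\nHost my-vm\n HostName <vm_ip>\n User <vm_username>\n IdentityFile ~/.ssh/<vm_private_key>\n ProxyCommand ssh eidf-gateway -W %h:%p\n
With this in place, ssh my-vm should select the correct key for each hop.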
"},{"location":"faq/#username-policy","title":"Username Policy","text":"I already have an EIDF username for project Y, can I use this for project X?
We mandate that every username must be unique across our estate. EPCC machines, including EIDF services such as the SDF and DSC VMs and HPC services such as Cirrus, require you to create a new machine account with a unique username for each project you work on. Usernames cannot be used on multiple projects, even if the previous project has finished. However, some projects span multiple machines, so you may be able to log in to multiple machines with the same username.
"},{"location":"known-issues/","title":"Known Issues","text":""},{"location":"known-issues/#virtual-desktops","title":"Virtual desktops","text":"No known issues.
"},{"location":"overview/","title":"A Unique Service for Academia and Industry","text":"The Edinburgh International Data Facility (EIDF) is a growing set of data and compute services developed to support the Data Driven Innovation Programme at the University of Edinburgh.
Our goal is to support learners, researchers and innovators across the spectrum, with services from data discovery through simple learn-as-you-play-with-data notebooks to GPU-enabled machine-learning platforms for driving AI application development.
"},{"location":"overview/#eidf-and-the-data-driven-innovation-initiative","title":"EIDF and the Data-Driven Innovation Initiative","text":"Launched at the end of 2018, the Data-Driven Innovation (DDI) programme is one of six funded within the Edinburgh & South-East Scotland City Region Deal. The DDI programme aims to make Edinburgh the \u201cData Capital of Europe\u201d, with ambitious targets to support, enhance and improve talent, research, commercial adoption and entrepreneurship across the region through better use of data.
The programme targets ten industry sectors, with interactions managed through five DDI Hubs: the Bayes Centre, the Usher Institute, Edinburgh Futures Institute, the National Robotarium, and Easter Bush. The activities of these Hubs are underpinned by EIDF.
"},{"location":"overview/acknowledgements/","title":"Acknowledging EIDF","text":"If you make use of EIDF services in your work, we encourage you to acknowledge us in any publications.
Acknowledgement of using the facility in publications can be used as an identifiable metric to evaluate the scientific support provided, and helps promote the impact of the wider DDI Programme.
We encourage our users to ensure that an acknowledgement of EIDF is included in the relevant section of their manuscript. We would suggest:
This work was supported by the Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh.
"},{"location":"overview/contacts/","title":"Contact","text":"The Edinburgh International Data Facility is located at the Advanced Computing Facility of EPCC, the supercomputing centre based at the University of Edinburgh.
"},{"location":"overview/contacts/#email-us","title":"Email us","text":"Email EIDF: eidf@epcc.ed.ac.uk
"},{"location":"overview/contacts/#sign-up","title":"Sign up","text":"Join our mailing list to receive updates about EIDF.
"},{"location":"safe-haven-services/network-access-controls/","title":"Safe Haven Network Access Controls","text":"The TRE Safe Haven services are protected against open, global access by IPv4 source address filtering. These network access controls ensure that connections are permitted only from Safe Haven controller partner networks and collaborating research institutions.
The network access controls are managed by the Safe Haven service controllers who instruct EPCC to add and remove the IPv4 addresses allowed to connect to the service gateway. Researchers must connect to the Safe Haven service by first connecting to their institution or corporate VPN and then connecting to the Safe Haven.
The Safe Haven IG controller and research project co-ordination teams must submit and confirm IPv4 address filter changes to their service help desk via email.
"},{"location":"safe-haven-services/overview/","title":"Safe Haven Services","text":"The EIDF Trusted Research Environment (TRE) hosts several Safe Haven services that enable researchers to work with sensitive data in a secure environment. These services are operated by EPCC in partnership with Safe Haven controllers who manage the Information Governance (IG) appropriate for the research activities and the data access of their Safe Haven service.
It is the responsibility of EPCC as the Safe Haven operator to design, implement and administer the technical controls required to deliver the Safe Haven security regime demanded by the Safe Haven controller.
The role of the Safe Haven controller is to satisfy the needs of the researchers and the data suppliers. The controller is responsible for guaranteeing the confidentiality needs of the data suppliers and matching these with the availability needs of the researchers.
The service offers secure data sharing and analysis environments allowing researchers access to sensitive data under the terms and conditions prescribed by the data providers. The service prioritises the requirements of the data provider over the demands of the researcher and is an academic TRE operating under the guidance of the Five Safes framework.
The TRE has dedicated, private cloud infrastructure at EPCC's Advanced Computing Facility (ACF) data centre and has its own HPC cluster and high-performance file systems. When a new Safe Haven service is commissioned in the TRE it is created in a new virtual private cloud providing the Safe Haven service controller with an independent IG domain separate from other Safe Havens in the TRE. All TRE service infrastructure and all TRE project data are hosted at ACF.
If you have any questions about the EIDF TRE or about Safe Haven services, please contact us.
"},{"location":"safe-haven-services/safe-haven-access/","title":"Safe Haven Service Access","text":"Safe Haven services are accessed from a registered network connection address using a browser. The service URL will be \"https://shs.epcc.ed.ac.uk/<service>\" where <service> is the Safe Haven service name.
The Safe Haven access process has three stages, from multi-factor authentication to project desktop login.
Researchers who are active in many research projects and in more than one Safe Haven will need to pay attention to the service they connect to, the project desktop they log in to, and the accounts and identities they are using.
"},{"location":"safe-haven-services/safe-haven-access/#safe-haven-login","title":"Safe Haven Login","text":"The first step in the process prompts the user for a Safe Haven username and then for a session PIN code sent via SMS text to the mobile number registered for the username.
Valid PIN code entry allows the user access to all of the Safe Haven service remote desktop gateways for up to 24 hours without entry of a new PIN code. A user who has successfully entered a PIN code once can access shs.epcc.ed.ac.uk/haven1 and shs.epcc.ed.ac.uk/haven2 without repeating PIN code identity verification.
When a valid PIN code is accepted, the user is prompted to accept the service use terms and conditions.
Registration of the user mobile phone number is managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.
"},{"location":"safe-haven-services/safe-haven-access/#remote-desktop-gateway-login","title":"Remote Desktop Gateway Login","text":"The second step in the access process is for the user to login to the Safe Haven service remote desktop gateway so that a project desktop connection can be chosen. The user is prompted for a Safe Haven service account identity.
VDI Safe Haven Service Login Page
Safe Haven accounts are managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.
"},{"location":"safe-haven-services/safe-haven-access/#project-desktop-connection","title":"Project Desktop Connection","text":"The third stage in the process is to select the virtual connection from those available on the account's home page. An example home page is shown below offering two connection options to the same virtual machine. Remote desktop connections will have an _rdp suffix and SSH terminal connections have an _ssh suffix. The most recently used connections are shown as screen thumbnails at the top of the page and all the connections available to the user are shown in a tree list below this.
VM connections available home page
The remote desktop gateway software used in the Safe Haven services in the TRE is the Apache Guacamole web application. Users new to this application can find the user manual here. It is recommended that users read this short guide, but note that the data sharing features such as copy and paste, connection sharing, and file transfers are disabled on all connections in the TRE Safe Havens.
A remote desktop or SSH connection is used to access data provided for a specific research project. If a researcher is working on multiple projects within a Safe Haven they can only log in to one project at a time. Some connections may allow the user to log in to any project and some connections will only allow the user to log in to one specific project. This depends on project IG restrictions specified by the Safe Haven and project controllers.
Project desktop accounts are managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.
"},{"location":"safe-haven-services/using-the-hpc-cluster/","title":"Using the TRE HPC Cluster","text":""},{"location":"safe-haven-services/using-the-hpc-cluster/#introduction","title":"Introduction","text":"The TRE HPC system, also called the SuperDome Flex, is a single node, large memory HPC system. It is provided for compute and data intensive workloads that require more CPU, memory, and better IO performance than can be provided by the project VMs, which have the performance equivalent of small rack mount servers.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#specifications","title":"Specifications","text":"The system is an HPE SuperDome Flex configured with 1152 hyper-threaded cores (576 physical cores) and 18TB of memory, of which 17TB is available to users. User home and project data directories are on network mounted storage pods running the BeeGFS parallel filesystem. This storage is built in blocks of 768TB per pod. Multiple pods are available in the TRE for use by the HPC system and the total storage available will vary depending on the project configuration.
The HPC system runs Red Hat Enterprise Linux, which is not the same flavour of Linux as the Ubuntu distribution running on the desktop VMs. However, most jobs in the TRE run Python and R, and there are few issues moving between the two version of Linux. Use of virtual environments is strongly encouraged to ensure there are no differences between the desktop and HPC runtimes.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#software-management","title":"Software Management","text":"All system level software installed and configured on the TRE HPC system is managed by the TRE admin team. Software installation requests may be made by the Safe Haven IG controllers, research project co-ordinators, and researchers by submitting change requests through the dedicated service help desk via email.
Minor software changes will be made as soon as admin effort can be allocated. Major changes are likely to be scheduled for the TRE monthly maintenance session on the first Thursday of each month.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#hpc-login","title":"HPC Login","text":"Login to the HPC system is from the project VM using SSH and is not direct from the VDI. The HPC cluster accounts are the same accounts used on the project VMs, with the same username and password. All project data access on the HPC system is private to the project accounts as it is on the VMs, but it is important to understand that the TRE HPC cluster is shared by projects in other TRE Safe Havens.
To login to the HPC cluster from the project VMs use ssh shs-sdf01
from an xterm. If you wish to avoid entry of the account password for every SSH session or remote command execution you can use SSH key authentication by following the SSH key configuration instructions here. SSH key passphrases are not strictly enforced within the Safe Haven but are strongly encouraged.
To use the HPC system fully and fairly, all jobs must be run using the SLURM job manager. More information about SLURM, running batch jobs and running interactive jobs can be found here. Please read this carefully before using the cluster if you have not used SLURM before. The SLURM site also has a set of useful tutorials on HPC clusters and job scheduling.
All analysis and processing jobs must be run via SLURM. SLURM manages access to all the cores on the system beyond the first 32. If SLURM is not used and programs are run directly from the command line, then there are only 32 cores available, and these are shared by the other users. Normal code development, short test runs, and debugging can be done from the command line without using SLURM.
There is only one node
The HPC system is a single node with all cores sharing all the available memory. SLURM jobs should always specify '#SBATCH --nodes=1' to run correctly.
SLURM email alerts for job status change events are not supported in the TRE.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#resource-limits","title":"Resource Limits","text":"There are no resource constraints imposed on the default SLURM partition at present. There are user limits (see the output of ulimit -a
). If a project has a requirement for more than 200 cores, more than 4TB of memory, or an elapsed runtime of more than 96 hours, a resource reservation request should be made by the researchers through email to the service help desk.
There are no storage quotas enforced in the HPC cluster storage at present. The project storage requirements are negotiated, and space allocated before the project accounts are released. Storage use is monitored, and guidance will be issued before quotas are imposed on projects.
The HPC system is managed in the spirit of utilising it as fully as possible and as fairly as possible. This approach works best when researchers are aware of their project workload demands and cooperate rather than compete for cluster resources.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#python-jobs","title":"Python Jobs","text":"A basic script to run a Python job in a virtual environment is shown below.
#!/bin/bash\n#\n#SBATCH --export=ALL # Job inherits all env vars\n#SBATCH --job-name=my_job_name # Job name\n#SBATCH --mem=512G # Job memory request\n#SBATCH --output=job-%j.out # Standard output file\n#SBATCH --error=job-%j.err # Standard error file\n#SBATCH --nodes=1 # Run on a single node\n#SBATCH --ntasks=1 # Run one task per node\n#SBATCH --time=02:00:00 # Time limit hrs:min:sec\n#SBATCH --partition standard # Run on partition (queue)\n\npwd\nhostname\ndate \"+DATE: %d/%m/%Y TIME: %H:%M:%S\"\necho \"Running job on a single CPU core\"\n\n# Create the job\u2019s virtual environment\nsource ${HOME}/my_venv/bin/activate\n\n# Run the job code\npython3 ${HOME}/my_job.py\n\ndate \"+DATE: %d/%m/%Y TIME: %H:%M:%S\"\n
"},{"location":"safe-haven-services/using-the-hpc-cluster/#mpi-jobs","title":"MPI Jobs","text":"An example script for a multi-process MPI example is shown. The system currently supports MPICH MPI.
#!/bin/bash\n#\n#SBATCH --export=ALL\n#SBATCH --job-name=mpi_test\n#SBATCH --output=job-%j.out\n#SBATCH --error=job-%j.err\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=5\n#SBATCH --time=05:00\n#SBATCH --partition standard\n\necho \"Submitted Open MPI job\"\necho \"Running on host ${HOSTNAME}\"\necho \"Using ${SLURM_NTASKS_PER_NODE} tasks per node\"\necho \"Using ${SLURM_CPUS_PER_TASK} cpus per task\"\nlet mpi_threads=${SLURM_NTASKS_PER_NODE}*${SLURM_CPUS_PER_TASK}\necho \"Using ${mpi_threads} MPI threads\"\n\n# load Open MPI module\nmodule purge\nmodule load mpi/mpich-x86_64\n\n# run mpi program\nmpirun ${HOME}/test_mpi\n
"},{"location":"safe-haven-services/using-the-hpc-cluster/#managing-files-and-data","title":"Managing Files and Data","text":"There are three file systems to manage in the VM and HPC environment.
The /safe_data file system with the project data cannot be used by the HPC system. The /safe_data file system has restricted access and a relatively slow IO performance compared to the parallel BeeGFS file system storage on the HPC system.
The process to use the TRE HPC service is to copy and synchronise the project code and data files on the /safe_data file system with the HPC /home file system before and after login sessions and job runs on the HPC cluster. Assuming all the code and data required for the job is in a directory 'current_wip' on the project VM, the workflow is as follows:
rsync -avPz -e ssh /safe_data/my_project/current_wip shs-sdf01:
ssh shs-sdf01
, cd current_wip
, sbatch/srun my_job
rsync -avPz -e ssh shs-sdf01:current_wip /safe_data/my_project
Sessions on project VMs may be either remote desktop (RDP) logins or SSH terminal logins. Most users will prefer to use the remote desktop connections, but the SSH terminal connection is useful when remote network performance is poor and it must be used for account password changes.
"},{"location":"safe-haven-services/virtual-desktop-connections/#first-time-login-and-account-password-changes","title":"First Time Login and Account Password Changes","text":"Account Password Changes
Note that first time account login cannot be through RDP as a password change is required. Password reset logins must be SSH terminal sessions as password changes can only be made through SSH connections.
"},{"location":"safe-haven-services/virtual-desktop-connections/#connecting-to-a-remote-ssh-session","title":"Connecting to a Remote SSH Session","text":"When a VM SSH connection is selected the browser screen becomes a text terminal and the user is prompted to \"Login as: \" with a project account name, and then prompted for the account password. This connection type is equivalent to a standard xterm SSH session.
"},{"location":"safe-haven-services/virtual-desktop-connections/#connecting-to-a-remote-desktop-session","title":"Connecting to a Remote Desktop Session","text":"Remote desktop connections work best by first placing the browser in Full Screen mode and leaving it in this mode for the entire duration of the Safe Haven session.
When a VM RDP connection is selected the browser screen becomes a remote desktop presenting the login screen shown below.
VM virtual desktop connection user account login screen
Once the project account credentials have been accepted, a remote dekstop similar to the one shown below is presented. The default VM environment in the TRE is Ubuntu 22.04 with the Xfce desktop.
VM virtual desktop
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/","title":"Accessing the Superdome Flex inside the EPCC Trusted Research Environment","text":""},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#what-is-the-superdome-flex","title":"What is the Superdome Flex?","text":"The Superdome Flex (SDF) is a high-performance computing cluster manufactured by Hewlett Packard Enterprise. It has been designed to handle multi-core, high-memory tasks in environments where security is paramount. The hardware specifications of the SDF within the Trusted Research Environment (TRE) are as follows:
The software specification of the SDF are:
The SDF is within the TRE. Therefore, the same restrictions apply, i.e. the SDF is isolated from the internet (no downloading code from public GitHub repos) and copying/recording/extracting anything on the SDF outside of the TRE is strictly prohibited unless through approved processes.
Users can only access the SDF by ssh-ing into it via their VM desktop.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#hello-world","title":"Hello world","text":"**** On the VM desktop terminal ****\n\nssh shs-sdf01\n<Enter VM password>\n\necho \"Hello World\"\n\nexit\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#sdf-vs-vm-file-systems","title":"SDF vs VM file systems","text":"The SDF file system is separate from the VM file system, which is again separate from the project data space. Files need to be transferred between the three systems for any analysis to be completed within the SDF.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#example-showing-separate-sdf-and-vm-file-systems","title":"Example showing separate SDF and VM file systems","text":"**** On the VM desktop terminal ****\n\ncd ~\ntouch test.txt\nls\n\nssh shs-sdf01\n<Enter VM password>\n\nls # test.txt is not here\nexit\n\nscp test.txt shs-sdf01:/home/<USERNAME>/\n\nssh shs-sdf01\n<Enter VM password>\n\nls # test.txt is here\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#example-copying-data-between-project-data-space-and-sdf","title":"Example copying data between project data space and SDF","text":"Transferring and synchronising data sets between the project data space and the SDF is easier with the rsync command (rather than manually checking and copying files/folders with scp). rsync only transfers files that are different between the two targets, more details in its manual.
**** On the VM desktop terminal ****\n\nman rsync # check instructions for using rsync\n\nrsync -avPz -e ssh /safe_data/my_project/ shs-sdf01:/home/<USERNAME>/my_project/ # sync project folder and SDF home folder\n\nssh shs-sdf01\n<Enter VM password>\n\n*** Conduct analysis on SDF ***\n\nexit\n\nrsync -avPz -e ssh /safe_data/my_project/current_wip shs-sdf01:/home/<USERNAME>/my_project/ # sync project file and ssh home page # re-syncronise project folder and SDF home folder\n\n*** Optionally remove the project folder on SDF ***\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/","title":"Running R/Python Scripts","text":"Running analysis scripts on the SDF is slightly different to running scripts on the Desktop VMs. The Linux distribution differs between the two with the SDF using Red Hat Enterprise Linux (RHEL) and the Desktop VMs using Ubuntu. Therefore, it is highly advisable to use virtual environments (e.g. conda environments) to complete any analysis and aid the transition between the two distributions. Conda should run out of the box on the Desktop VMs, but some configuration is required on the SDF.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#setting-up-conda-environments-on-you-first-connection-to-the-sdf","title":"Setting up conda environments on you first connection to the SDF","text":"*** SDF Terminal ***\n\nconda activate base # Test conda environment\n\n# Conda command will not be found. There is no need to install!\n\neval \"$(/opt/anaconda3/bin/conda shell.bash hook)\" # Tells your terminal where conda is\n\nconda init # changes your .bashrc file so conda is automatically available in the future\n\nconda config --set auto_activate_base false # stop conda base from being activated on startup\n\npython # note python version\n\nexit()\n
The base conda environment is now available but note that the python and gcc compilers are not the latest (Python 3.9.7 and gcc 7.5.0).
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#getting-an-up-to-date-python-version","title":"Getting an up-to-date python version","text":"In order to get an up-to-date python version we first need to use an updated gcc version. Fortunately, conda has an updated gcc toolset that can be installed.
*** SDF Terminal ***\n\nconda activate base # If conda isn't already active\n\nconda create -n python-v3.11 gcc_linux-64=11.2.0 python=3.11.3\n\nconda activate python-v3.11\n\npython\n\nexit()\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#running-r-scripts-on-the-sdf","title":"Running R scripts on the SDF","text":"The default version of R available on the SDF is v4.1.2. Alternative R versions can be installed using conda similar to the python conda environment above.
conda create -n r-v4.3 gcc_linux-64=11.2.0 r-base=4.3\n\nconda activate r-v4.3\n\nR\n\nq()\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#final-points","title":"Final points","text":"The SDF, like the rest of the SHS, is separated from the internet. The installation of python/R libraries to your environment is from a local copy of the respective conda/CRAN library repositories. Therefore, not all packages may be available and not all package versions may be available.
It is discouraged to run extensive python/R analyses without submitting them as job requests using Slurm.
Slurm is a workload manager that schedules jobs submitted to a shared resource. Slurm is a well-developed tool that can manage large computing clusters, such as ARCHER2, with thousands of users each with different priorities and allocated computing hours. Inside the TRE, Slurm is used to help ensure all users of the SDF get equitable access. Therefore, users who are submitting jobs with high resource requirements (>80 cores, >1TB of memory) may have to wait longer for resource allocation to enable users with lower resource demands to continue their work.
Slurm is currently set up so all users have equal priority and there is no limit to the total number of CPU hours allocated to a user per month. However, there are limits to the maximum amount of resources that can be allocated to an individual job. Jobs that require more than 200 cores, more than 4TB of memory, or an elapsed runtime of more than 96 hours will be rejected. If users need to submit jobs with large resource demand, they need to submit a resource reservation request by emailing their project's service desk.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#why-do-you-need-to-use-slurm","title":"Why do you need to use Slurm?","text":"The SDF is a resource shared across all projects within the TRE and all users should have equal opportunity to use the SDF to complete resource-intense tasks appropriate to their projects. Users of the SDF are required to consider the needs of the wider community by:
requesting resources appropriate to their intended task and timeline.
submitting resource requests via Slurm to enable automatic scheduling and fair allocation alongside other user requests.
Users can develop code, complete test runs, and debug from the SDF command line without using Slurm. However, only 32 of the 512 cores are accessible without submitting a job request to Slurm. These cores are accessible to all users simultaneously.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#slurm-basics","title":"Slurm basics","text":"Slurm revolves around four main entities: nodes, partitions, jobs and job steps. Nodes and partitions are relevant for more complex distributed computing clusters so Slurm can allocate appropriate resources to jobs across multiple pieces of hardware. Jobs are requests for resources and job steps are what need to be completed once the resources have been allocated (completed in sequence or parallel). Job steps can be further broken down into tasks.
There are four key commands for Slurm users:
squeue: get details on a job or job step, i.e. has a job been allocated resources or is it still pending?
srun: initiate a job step or execute a job. A versatile function that can initiate job steps as part of a larger batch job or submit a job itself to get resources and run a job step. This is useful for testing job steps, experimenting with different resource allocations or running job steps that require large resources but are relatively easy to define in a line or two (i.e. running a sequence alignment).
scancel: stop a job from continuing.
sbatch: submit a job script containing multiple steps (i.e. srun) to be completed with the defined resources. This is the typical function for submitting jobs to Slurm.
More details on these functions (and several not mentioned here) can be seen on the Slurm website.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-a-simple-job","title":"Submitting a simple job","text":"*** SDF Terminal ***\n\nsqueue -u $USER # Check if there are jobs already queued or running for you\n\nsrun --job-name=my_first_slurm_job --nodes 1 --ntasks 10 --cpus-per-task 2 echo 'Hello World'\n\nsqueue -u $USER --state=CD # List all completed jobs\n
In this instance, the srun command completes two steps: job submission and job step execution. First, it submits a job request to be allocated 10 CPUs (1 CPU for each of the 10 tasks). Once the resources are available, it executes the job step consisting of 10 tasks each running the 'echo \"Hello World\"' function.
srun accepts a wide variety of options to specify the resources required to complete its job step. Within the SDF, you must always request 1 node (as there is only one node) and never use the --exclusive option (as no one will have exclusive access to this shared resource). Notice that running srun blocks your terminal from accepting any more commands and the output from each task in the job step, i.e. Hello World in the above example, outputs to your terminal. We will compare this to running a sbatch command.\u0011
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-a-batch-job","title":"Submitting a batch job","text":"Batch jobs are incredibly useful because they run in the background without blocking your terminal. Batch jobs also output the results to a log file rather than straight to your terminal. This allows you to check a job was completed successfully at a later time so you can move on to other things whilst waiting for a job to complete.
A batch job can be submitted to Slurm by passing a job script to the sbatch command. The first few lines of a job script outline the resources to be requested as part of the job. The remainder of a job script consists of one or more srun commands outlining the job steps that need to be completed (in sequence or parallel) once the resources are available. There are numerous options for defining the resource requirements of a job including:
More information on the various options are in the sbatch documentation.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#example-job-script","title":"Example Job Script","text":"#!/usr/bin/env bash\n#SBATCH -J HelloWorld\n#SBATCH --nodes=1\n#SBATCH --tasks-per-node=10\n#SBATCH --cpus-per-task=2\n\n% Run echo task in sequence\n\nsrun --ntasks 5 --cpus-per-task 2 echo \"Series Task A. Time: \" $(date +\u201d%H:%M:%S\u201d)\n\nsrun --ntasks 5 --cpus-per-task 2 echo \"Series Task B. Time: \" $(date +\u201d%H:%M:%S\u201d)\n\n% Run echo task in parallel with the ampersand character\n\nsrun --exclusive --ntasks 5 --cpus-per-task 2 echo \"Parallel Task A. Time: \" $(date +\u201d%H:%M:%S\u201d) &\n\nsrun --exclusive --ntasks 5 --cpus-per-task 2 echo \"Parallel Task B. Time: \" $(date +\u201d%H:%M:%S\u201d)\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#example-job-script-submission","title":"Example job script submission","text":"*** SDF Terminal ***\n\nnano example_job_script.sh\n\n*** Copy example job script above ***\n\nsbatch example_job_script.sh\n\nsqueue -u $USER -r 5\n\n*** Wait for the batch job to be completed ***\n\ncat example_job_script.log # The series tasks should be grouped together and the parallel tasks interspersed.\n
The example batch job is intended to show two things: 1) the usefulness of the sbatch command and 2) the versatility of a job script. As the sbatch command allows you to submit scripts and check their outcome at your own discretion, it is the most common way of interacting with Slurm. Meanwhile, the job script command allows you to specify one global resource request and break it up into multiple job steps with different resource demands that can be completed in parallel or in sequence.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-pythonr-code-to-slurm","title":"Submitting python/R code to Slurm","text":"Although submitting job steps containing python/R analysis scripts can be done with srun directly, as below, it is more common to submit bash scripts that call the analysis scripts after setting up the environment (i.e. after calling conda activate).
**** Python code job submission ****\n\nsrun --job-name=my_first_python_job --nodes 1 --ntasks 10 --cpus-per-task 2 --mem 10G python3 example_script.py\n\n**** R code job submission ****\n\nsrun --job-name=my_first_r_job --nodes 1 --ntasks 10 --cpus-per-task 2 --mem 10G Rscript example_script.R\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#signposting","title":"Signposting","text":"Useful websites for learning more about Slurm:
The Slurm documentation website
The Introduction to HPC carpentries lesson on Slurm
This lesson is adapted from a workshop introducing users to running python scripts on ARCHER2 as developed by Adrian Jackson.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#introduction","title":"Introduction","text":"Python does not have native support for parallelisation. Python contains a Global Interpreter Lock (GIL) which means the python interpreter only allows one thread to execute at a time. The advantage of the GIL is that C libraries can be easily integrated into Python scripts without checking if they are thread-safe. However, this means that most common python modules cannot be easily parallelised. Fortunately, there are now several re-implementations of common python modules that work around the GIL and are therefore parallelisable. Dask is a python module that contains a parallelised version of the pandas data frame as well as a general format for parallelising any python code.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#dask","title":"Dask","text":"Dask enables thread-safe parallelised python execution by creating task graphs (a graph of the dependencies of the inputs and outputs of each function) and then deducing which ones can be run separately. This lesson introduces some general concepts required for programming using Dask. There are also some exercises with example answers to help you write your first parallelised python scripts.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#arrays-data-frames-and-bags","title":"Arrays, data frames and bags","text":"Dask contains three data objects to enable parallelised analysis of large data sets in a way familiar to most python programmers. If the same operations are being applied to a large data set then Dask can split up the data set and apply the operations in parallel. The three data objects that Dask can easily split up are:
Arrays: Contains large numbers of elements in multiple dimensions, but each element must be of the same type. Each element has a unique index that allows users to specify changes to individual elements.
Data frames: Contains large numbers of elements which are typically highly structured with multiple object types allowed together. Each element has a unique index that allows users to specify changes to individual elements.
Bags: Contains large numbers of elements which are semi/un-structured. Elements are immutable once inside the bag. Bags are useful for conducting initial analysis/wrangling of raw data before more complex analysis is performed.
You may need to install dask or create a new conda environment with it in.
conda create -n dask-env gcc_linux-64=11.2.0 python=3.11.3 dask\n\nconda activate dask-env\n
Try running the following Python code using Dask:
import dask.array as da\n\nx = da.random.random((10000, 10000), chunks=(1000, 1000))\n\nprint(x)\n\nprint(x.compute())\n\nprint(x.sum())\n\nprint(x.sum().compute())\n
This should demonstrate that Dask makes it straightforward to implement simple parallelism, but is also lazy: it does not compute anything until you force it to with the .compute() function.
You can also try out dask DataFrames, using the following code:
import dask.dataframe as dd\n\ndf = dd.read_csv('surveys.csv')\n\ndf.head()\ndf.tail()\n\ndf.weight.max().compute()\n
You can try using different blocksizes when reading in the CSV file and then performing an operation on the data, as follows. Experiment with varying blocksizes, but be aware that making the blocksize too small is likely to cause poor performance (the blocksize determines the number of bytes read in at each operation).
df = dd.read_csv('surveys.csv', blocksize=\"10000\")\ndf.weight.max().compute()\n
You can also experiment with Dask Bags to see how that functionality works:
import dask.bag as db\nfrom operator import add\nb = db.from_sequence([1, 2, 3, 4, 5], npartitions=2)\nprint(b.compute())\n# Combine the elements of the bag in parallel using the imported add operator\nprint(b.fold(add).compute())\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#dask-delayed","title":"Dask Delayed","text":"Dask delayed lets you construct your own task graphs/parallelism from Python functions. You can find out more about dask delayed from the dask documentation Try parallelising the code below using the .delayed function or the @delayed decorator, an example answer can be found here.
def inc(x):\n return x + 1\n\ndef double(x):\n return x * 2\n\ndef add(x, y):\n return x + y\n\ndata = [1, 2, 3, 4, 5]\n\noutput = []\nfor x in data:\n a = inc(x)\n b = double(x)\n c = add(a, b)\n output.append(c)\n\ntotal = sum(output)\n\nprint(total)\n
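As a rough sketch of one possible approach (not the linked example answer), the functions can be wrapped with the @delayed decorator and the final sum forced with .compute():
from dask import delayed\n\n@delayed\ndef inc(x):\n    return x + 1\n\n@delayed\ndef double(x):\n    return x * 2\n\n@delayed\ndef add(x, y):\n    return x + y\n\ndata = [1, 2, 3, 4, 5]\n\noutput = []\nfor x in data:\n    a = inc(x)\n    b = double(x)\n    c = add(a, b)\n    output.append(c)\n\n# Nothing has been computed yet; build the final node of the task graph and run it\ntotal = delayed(sum)(output)\n\nprint(total.compute())\n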
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#mandelbrot-exercise","title":"Mandelbrot Exercise","text":"The code below calculates the members of a Mandelbrot set using Python functions:
import sys\nimport time\nimport numpy as np\nimport matplotlib.pyplot as plt\n\ndef mandelbrot(h, w, maxit=20, r=2):\n \"\"\"Returns an image of the Mandelbrot fractal of size (h,w).\"\"\"\n start = time.time()\n\n x = np.linspace(-2.5, 1.5, 4*h+1)\n\n y = np.linspace(-1.5, 1.5, 3*w+1)\n\n A, B = np.meshgrid(x, y)\n\n C = A + B*1j\n\n z = np.zeros_like(C)\n\n divtime = maxit + np.zeros(z.shape, dtype=int)\n\n for i in range(maxit):\n z = z**2 + C\n diverge = abs(z) > r # who is diverging\n div_now = diverge & (divtime == maxit) # who is diverging now\n divtime[div_now] = i # note when\n z[diverge] = r # avoid diverging too much\n\n end = time.time()\n\n return divtime, end-start\n\nh = 2000\nw = 2000\n\nmandelbrot_space, time = mandelbrot(h, w)\n\nplt.imshow(mandelbrot_space)\n\nprint(time)\n
Your task is to parallelise this code using Dask Array functionality. Using the base python code above, extend it with Dask Array for the main arrays in the computation. Remember you need to specify a chunk size with Dask Arrays, and you will also need to call compute at some point to force Dask to actually undertake the computation. Note, depending on where you run this you may not see any actual speed up of the computation. You need access to extra resources (compute cores) for the calculation to go faster. If in doubt, submit a python script of your solution to the SDF compute nodes to see if you see speed up there. If you are struggling with this parallelisation exercise, there is a solution available for you here.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#pi-exercise","title":"Pi Exercise","text":"The code below calculates Pi using a function that can split it up into chunks and calculate each chunk separately. Currently it uses a single chunk to produce the final value of Pi, but that can be changed by calling pi_chunk multiple times with different inputs. This is not necessarily the most efficient method for calculating Pi in serial, but it does enable parallelisation of the calculation of Pi using multiple copies of pi_chunk called simultaneously.
import time\nimport sys\n\n# Calculate pi in chunks\n\n# n - total number of steps to be undertaken across all chunks\n# lower - the lowest number of this chunk\n# upper - the upper limit of this chunk such that i < upper\n\ndef pi_chunk(n, lower, upper):\n step = 1.0 / n\n p = step * sum(4.0/(1.0 + ((i + 0.5) * (i + 0.5) * step * step)) for i in range(lower, upper))\n return p\n\n# Number of slices\n\nnum_steps = 10000000\n\nprint(\"Calculating PI using:\\n \" + str(num_steps) + \" slices\")\n\nstart = time.time()\n\n# Calculate using a single chunk containing all steps\n\np = pi_chunk(num_steps, 1, num_steps)\n\nstop = time.time()\n\nprint(\"Obtained value of Pi: \" + str(p))\n\nprint(\"Time taken: \" + str(stop - start) + \" seconds\")\n
For this exercise, your task is to implement the above code on the SDF, and then parallelise it using Dask. There are a number of different ways you could parallelise this using Dask, but we suggest using the Futures map functionality to run the pi_chunk function on a range of different inputs. Futures map has the following definition:
Client.map(func, *iterables[, key, workers, ...])\n
Here func is the function you want to run, and the subsequent arguments are the inputs to that function. To utilise this for the Pi calculation, you will first need to set up and configure a Dask Client, and also create and populate lists or vectors of the inputs to be passed to the pi_chunk function for each function run that Dask launches.
If you run Dask with processes then it is possible that you will get errors about forking processes, such as these:
An attempt has been made to start a new process before the current process has finished its bootstrapping phase.\n This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:\n
In that case you need to encapsulate your code within a main function, using something like this:
if __name__ == \"__main__\":\n
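Putting these pieces together, a minimal sketch using the Dask distributed Client is shown below. The worker count and the way the range is split into chunks are purely illustrative.
from dask.distributed import Client\n\n# pi_chunk as defined in the example above\ndef pi_chunk(n, lower, upper):\n    step = 1.0 / n\n    return step * sum(4.0 / (1.0 + ((i + 0.5) * (i + 0.5) * step * step)) for i in range(lower, upper))\n\nif __name__ == \"__main__\":\n    client = Client(n_workers=4)\n    num_steps = 10000000\n    num_chunks = 4\n    size = num_steps // num_chunks\n    lowers = [i * size for i in range(num_chunks)]\n    uppers = [(i + 1) * size for i in range(num_chunks)]\n    # Launch one pi_chunk call per chunk and sum the partial results\n    futures = client.map(pi_chunk, [num_steps] * num_chunks, lowers, uppers)\n    print(sum(client.gather(futures)))\n    client.close()\n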
If you are struggling with this exercise then there is a solution available for you here.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#signposting","title":"Signposting","text":"More information on parallelised python code can be found in the carpentries lesson
Dask itself has several detailed tutorials
This lesson is adapted from a workshop introducing users to running R scripts on ARCHER2 as developed by Adrian Jackson.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#introduction","title":"Introduction","text":"In this exercise we are going to try different methods of parallelising R on the SDF. This will include single node parallelisation functionality (e.g. using threads or processes to use cores within a single node), and distributed memory functionality that enables the parallelisation of R programs across multiple nodes.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#example-parallelised-r-code","title":"Example parallelised R code","text":"You may need to activate an R conda environment.
conda activate r-v4.2\n
Try running the following R script using R on the SDF login node:
n <- 8*2048\nA <- matrix( rnorm(n*n), ncol=n, nrow=n )\nB <- matrix( rnorm(n*n), ncol=n, nrow=n )\nC <- A %*% B\n
You can run this as follows on the SDF (assuming you have saved the above code into a file named matrix.R):
Rscript ./matrix.R\n
You can check the resources used by R when running on the login node using this command:
top -u $USER\n
If you run the R script in the background using & (e.g. Rscript ./matrix.R &), you can then monitor your run using the top command. You may notice when you run your program that at points R uses many more resources than a single core can provide, as demonstrated below:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND\n 178357 adrianj 20 0 15.542 0.014t 13064 R 10862 2.773 9:01.66 R\n
In the example above it can be seen that >10862% of a single core is being used by R. This is an example of R using automatic parallelisation. You can experiment with controlling the automatic parallelisation using the OMP_NUM_THREADS variable to restrict the number of cores available to R. Try using the following values:
export OMP_NUM_THREADS=8\n\nexport OMP_NUM_THREADS=4\n\nexport OMP_NUM_THREADS=2\n
You may also notice that not all of the R script is parallelised. Only the actual matrix multiplication is undertaken in parallel; the initialisation/creation of the matrices is done in serial.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-datatables","title":"Parallelisation with data.tables","text":"We can also experiment with the implicit parallelism in other libraries, such as data.table. You will first need to install this library on the SDF. To do this you can simply run the following command:
install.packages(\"data.table\")\n
Once you have installed data.table you can experiment with the following code:
library(data.table)\nvenue_data <- data.table(\n  ID = 1:50000000,\n  Capacity = sample(100:1000, size = 50000000, replace = T),\n  Code = sample(LETTERS, 50000000, replace = T),\n  Country = rep(c(\"England\",\"Scotland\",\"Wales\",\"NorthernIreland\"), 50000000)\n)\nsystem.time(venue_data[, mean(Capacity), by = Country])\n
This creates some random data in a large data table and then performs a calculation on it. Try running R with varying numbers of threads to see what impact that has on performance. Remember, you can vary the number of threads R uses by setting OMP_NUM_THREADS= before you run R. If you want to try easily varying the number of threads you can save the above code into a script and run it using Rscript, changing OMP_NUM_THREADS each time you run it, e.g.:
export OMP_NUM_THREADS=1\n\nRscript ./data_table_test.R\n\nexport OMP_NUM_THREADS=2\n\nRscript ./data_table_test.R\n
The elapsed time that is printed out when the calculation is run represents how long the script/program took to run. It\u2019s important to bear in mind that, as with the matrix multiplication exercise, not everything will be parallelised. Creating the data table is done in serial so does not benefit from the addition of more threads.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#loop-and-function-parallelism","title":"Loop and function parallelism","text":"R provides a number of different functions to run loops or functions in parallel. One of the most common functions is to use are the {X}apply
functions:
apply
Apply a function over a matrix or data frame
lapply
Apply a function over a list, vector, or data frame
sapply
Same as lapply
but returns a vector
vapply
Same as sapply
but with a specified return type that improves safety and can improve speed
For example:
res <- lapply(1:3, function(i) {\n sqrt(i)*sqrt(i*2)\n })\n
The {X}apply
functionality supports iteration over a dataset without requiring a loop to be constructed. However, the functions outlined above do not exploit parallelism, even though there is potential for parallelising many operations that utilise them.
There are a number of mechanisms that can be used to implement parallelism using the {X}apply
functions. One of the simplest is using the parallel
library, and the mclapply
function:
library(parallel)\nres <- mclapply(1:3, function(i) {\n sqrt(i)\n})\n
Try experimenting with the above functions on large numbers of iterations, both with lapply and mclapply. Can you achieve better performance using the MC_CORES environment variable to specify how many parallel processes R uses to complete these calculations? The default on the SDF is 2 cores, but you can increase this in the same way we did for OMP_NUM_THREADS, e.g.:
export MC_CORES=16\n
Try different numbers of iterations of the functions (e.g. change 1:3 in the code to something much larger), and different numbers of parallel processes, e.g.:
export MC_CORES=2\n\nexport MC_CORES=8\n\nexport MC_CORES=16\n
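For example, a simple way to compare the serial and parallel versions (the iteration count here is chosen purely for illustration) is:
library(parallel)\n# Time the same work done serially and with mclapply\nsystem.time(res1 <- lapply(1:2000000, function(i) sqrt(i) * sqrt(i * 2)))\nsystem.time(res2 <- mclapply(1:2000000, function(i) sqrt(i) * sqrt(i * 2)))\n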
If you have separate functions then the above approach will provide a simple method for parallelising using the resources within a single node. However, if your functionality is more loop-based, then you may not wish to have to package this up into separate functions to parallelise.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-foreach","title":"Parallelisation with foreach","text":"The foreach
package can be used to parallelise loops as well as functions. Consider a loop of the following form:
main_list <- c()\nfor (i in 1:3) {\n main_list <- c(main_list, sqrt(i))\n}\n
This can be converted to foreach
functionality as follows:
library(foreach)\n# foreach returns the iteration results, which .combine=c collects into a vector\nmain_list <- foreach(i=1:3, .combine=c) %do% {\n  sqrt(i)\n}\n
Whilst this approach does not significantly change the performance or functionality of the code, it does let us then exploit parallel functionality in foreach. The %do%
can be replaced with a %dopar%
which will execute the code in parallel.
To test this out we\u2019re going to try an example using the randomForest
library. We can now run the following code in R:
library(foreach)\nlibrary(randomForest)\nx <- matrix(runif(50000), 1000)\ny <- gl(2, 500)\nrf <- foreach(ntree=rep(250, 4), .combine=combine) %do%\nrandomForest(x, y, ntree=ntree)\nprint(rf)\n
Implement the above code and run with a system.time
to see how long it takes. Once you have done this you can change the %do%
to a %dopar%
and re-run. Does this provide any performance benefits?
To exploit the parallelism with dopar
we need to provide parallel execution functionality and configure it to use extra cores on the system. One method to do this is using the doParallel
package.
library(doParallel)\nregisterDoParallel(8)\n
Does this now improve performance when running the randomForest
example? Experiment with different numbers of workers by changing the number set in registerDoParallel(8)
to see what kind of performance you can get. Note, you may also need to change the number of clusters used in the foreach, e.g. what is specified in the rep(250, 4)
part of the code, to enable more than 4 different sets to be run at once if using more than 4 workers. The number of parallel workers you can use depends on the hardware you have access to, the number of workers you specify when you set up your parallel backend, and the number of chunks of work you have to distribute with your foreach configuration.
It is possible to use different parallel backends for foreach
. The one we have used in the example above creates new worker processes to provide the parallelism, but you can also use larger numbers of workers through a parallel cluster, e.g.:
my.cluster <- parallel::makeCluster(8)\nregisterDoParallel(cl = my.cluster)\n
By default makeCluster
creates a socket cluster, where each worker is a new independent process. This can enable running the same R program across a range of systems, as it works on Linux and Windows (and other clients). However, you can also fork the existing R process to create your new workers, e.g.:
cl <- makeCluster(5, type=\"FORK\")\n
This saves you from having to export the variables or objects that were set up in the R program/script prior to the creation of the cluster, as they are automatically copied to the workers when using this forking mode. However, it is limited to Linux-style systems and cannot scale beyond a single node.
Once you have finished using a parallel cluster you should shut it down to free up computational resources, using stopCluster
, e.g.:
stopCluster(cl)\n
When using clusters without the forking approach, you need to distribute objects and variables from the main process to the workers using the clusterExport
function, e.g.:
library(parallel)\nvariableA <- 10\nvariableB <- 20\nmySum <- function(x) variableA + variableB + x\ncl <- makeCluster(4)\nres <- try(parSapply(cl=cl, 1:40, mySum))\n
The program above will fail because variableA
and variableB
are not present on the cluster workers. Try the above on the SDF and see what result you get.
To fix this issue you can modify the program using clusterExport
to send variableA
and variableB
to the workers, prior to running the parSapply
e.g.:
clusterExport(cl=cl, c('variableA', 'variableB'))\n
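Putting the pieces together, a corrected version of the example might look like this (a sketch using the same variable names as above):
library(parallel)\nvariableA <- 10\nvariableB <- 20\nmySum <- function(x) variableA + variableB + x\ncl <- makeCluster(4)\n# Copy the required variables to the workers before calling parSapply\nclusterExport(cl=cl, c('variableA', 'variableB'))\nres <- parSapply(cl=cl, 1:40, mySum)\nprint(res)\nstopCluster(cl)\n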
"},{"location":"safe-haven-services/tre-container-user-guide/advise-ig-required-software-stack/","title":"Advising Information Governance of required software stack","text":"Projects must establish that the software stack they intend to import in the container is acceptable for the project\u2019s IG approvals. Projects should only seek to use container-based software if the standard TRE desktop environment is not sufficient for the research scope. However, it is broadly understood that the standard desktop software, whilst useful in most cases, is inadequate for many purposes and specifically ML, and software import using containers is intended to address this.
"},{"location":"safe-haven-services/tre-container-user-guide/building-and-testing-containers/","title":"Building and Testing Containers","text":""},{"location":"safe-haven-services/tre-container-user-guide/building-and-testing-containers/#choose-a-container-base-from-dockerhub","title":"Choose a container base from DockerHub","text":"Projects should build containers by starting with a well-known application base container on a public registry. Projects should add a minimum of additional project software and packages so that the container is clearly built for a specific purpose. Containers built for one specific batch job, either a data transformation or analysis, are examples of this approach. Container builds that assemble groups of tools and then used to run a variety of tasks should be avoided. Additionally, container builds that start from generic distributions such as Debian or Ubuntu should also be avoided as leaner and more focussed application and language containers are already available.
Examples of batch job container bases are Python and PyTorch, and other language specific and ML software stacks. Examples of interactive container bases are Rocker, Jupyter Docker Stacks, and NVIDIA RAPIDS extended with additional package sets and code required by your project.
"},{"location":"safe-haven-services/tre-container-user-guide/building-and-testing-containers/#add-tre-file-system-directories","title":"Add TRE file system directories","text":"Container images built to run in the TRE must implement the following line in the Dockerfile to interface with the project data and the TRE file system:
RUN mkdir /safe_data /safe_outputs /scratch\n
The project\u2019s private /safe_data/<project id>
directory is mapped to the container\u2019s /safe_data
directory. A unique container job output directory is created in the user's home directory and mapped to /safe_outputs
in the container. And /scratch
is a temporary workspace that is deleted when the container exits. If the container processing uses the TMP
environment variable, it should be set to the /scratch
volume mount. Hence, containers have access to three directories located on the host system as detailed in the following table:
/safe_data/<your_project_name>/ (mounted as /safe_data): read-only access if required by IG, or read-write access, to data and other project files.
~/outputs_<unique_id> (mounted as /safe_outputs): created at container startup as an empty directory; intended for any outputs: logs, data, models.
~/scratch_<unique_id> (mounted as /scratch): temporary directory that is removed after container termination on the host system. Any temporary files should be placed here.
Currently, temporary files can also be written into any directory in the container\u2019s internal file system. This is allowed to prevent container software failure when it is dependent on the container file system being writable. However, the use of /scratch
is encouraged for two reasons:
/scratch
. Writing only to /scratch
at runtime is therefore future proof./scratch
can be more efficient if the service is able to mount it on high-performing storage devices.Research software stacks are complex and dependent on many package sets and libraries, and sometimes on specific version combinations of these. The container build process presents the project team with the opportunity to manage and resolve these dependencies before contact with the project data in the restricted TRE setting.
Unlike the TRE desktop servers, containers do not have access to external network repositories, and do not have access to external licence servers. Any container software that requires a licence to run must be copied into the container at build time. EPCC are not responsible for verifying that the appropriate licences are installed or that license terms are being met.
Projects using configuration files for multiple containers, for example ML models, can also import these to the TRE directly by copying them to the project /safe_data
file system.
Batch jobs built to run as containers should start directly from the ENTRYPOINT
or CMD
section of the Dockerfile. Batch jobs should run without any user interaction after start, should read input from the /safe_data
directory and write outputs to the /scratch
and /safe_outputs
directories.
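As an illustration only (the base image, file names and packages are examples rather than requirements), a minimal batch-job Dockerfile following these rules might look like:
FROM python:3.11-slim\n\n# Required TRE file system interface directories\nRUN mkdir /safe_data /safe_outputs /scratch\n\n# Copy the analysis code and its dependencies into the image at build time\nCOPY requirements.txt analysis.py /app/\nRUN pip install --no-cache-dir -r /app/requirements.txt\n\n# Point temporary files at the /scratch volume\nENV TMP=/scratch\n\n# Run the analysis without any user interaction\nENTRYPOINT [\"python3\", \"/app/analysis.py\"]\n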
Interactive containers should present a connection port for the user session. Once the container is started the user can connect to the session port from the TRE desktop. If code files are being changed during the session these should be saved on the /safe_data
or the /safe_outputs
directories as the internal container file space is not preserved when the session ends.
When the container is running in the TRE it will not have any external network or internet access. If the code, or any of its dependencies, rely on data files downloaded at runtime (for example machine learning models) this will fail in the TRE. Code must be modified to load these files explicitly from a file path which is accessible from inside the container.
An example of the significance of TRE network isolation, and the considerations arising from it, is provided by Hugging Face, where models are cached in the user's local ~/.cache/huggingface/hub/
directory in the container. The environment variable HF_HOME
must be set to a directory in a /safe_data
project folder and the cache_dir
option of the from_pretrained()
call used at runtime.
If a model is downloaded from Hugging Face the advice is to set the environment variable HF_HUB_OFFLINE=1
to prevent attempts to contact the Hugging Face Hub. Any connection attempts will take a significant time to time out and then fail in the TRE setting.
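A sketch of how this might look in Python is shown below; the model name, the /safe_data path and the use of the transformers library are illustrative assumptions rather than requirements.
import os\n\n# Point the Hugging Face cache at a project folder and stay offline inside the TRE\nos.environ[\"HF_HOME\"] = \"/safe_data/my_project/huggingface\"\nos.environ[\"HF_HUB_OFFLINE\"] = \"1\"\n\nfrom transformers import AutoModel\n\nmodel = AutoModel.from_pretrained(\n    \"bert-base-uncased\",\n    cache_dir=\"/safe_data/my_project/huggingface/hub\",\n)\n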
It is recommended that the checklist for Dockerfile composition is followed: Container Build Guide
Information Governance requirements may require a security scan of your container, and Trivy is a tool that can help with this task. Trivy inspects container images to find items which have known vulnerabilities and produces a report that may be used to help assess the risk. The use of the Trivy misconfiguration tool on Dockerfiles is also recommended. This tool option will highlight many common security issues:
docker run --rm -v $(pwd):/repo ghcr.io/aquasecurity/trivy:latest config \"/repo/Dockerfile\"\n
The security posture of containers and the build process may be of interest to IG teams; however, it is not expected that security issues indicated by the tool need to be addressed before the container is run in the TRE, unless the IG team issues specific guidance on vulnerability and configuration remediation and mitigation.
"},{"location":"safe-haven-services/tre-container-user-guide/building-and-testing-containers/#scan-container-using-trivy-ci","title":"Scan container using Trivy CI","text":"Trivy can be run manually but it is easier to have it run automatically whenever you update your container image. An example GitHub Actions workflow to run Trivy and publish the outputs can be found here
The Trivy report can be downloaded as an artifact from the job summary page. Before using a specific container in the TRE it may be necessary to test the security risk and gain IG team approval.
"},{"location":"safe-haven-services/tre-container-user-guide/development-workflow/","title":"Development workflow","text":"The general guidance for TRE container development is:
Develop all code on a git platform, typically GitHub or a university managed GitLab service
Start Dockerfiles from a well-known base image. This is especially important if using a GPU as your container will need to have a compatible version of CUDA (currently 11.1 or later)
Add all the additional content (code files, libraries, packages, data, and licences) needed for your analysis work to your Dockerfile
Build and test the container to ensure that it has no external runtime dependencies
Push the Dockerfile to the project git repository so the container image build is recorded
Push the container image to the GitHub container registry ghcr.io (GHCR)
Login to a TRE desktop enabled for container execution to pull and run the container
Containers are connected to three external directories when run inside the TRE: one for access to the project data files (which may be read-only in some cases); one for temporary work files that are all deleted when the container exits; and one for the job output files (which may be set as read-only in some cases when the container exits). All container outputs remain inside the TRE project file space and there is no automatic export when the container finishes.
Container images that have been pulled into the TRE are destroyed after they have been run. Only the files written to the container outputs directory are guaranteed to be retained.
You must ensure that, apart from the input data, your container has everything that it needs to run, including all code and dependencies, and any ancillary files such as machine learning models. It is likely that your development environment, which is always outside the TRE, does not normally have these three directories, but you need to build a container that uses them (see below for path names) because there is no ability inside the TRE to change which directories are available.
The input data file names may change so you may not want to hard-code them into your container. For example, instead of your code using open(\"/safe_data/my_patients.csv\")
you should consider putting a list of input file names into a config file and reading that config file at container start-up to determine which input data files to use. This will allow you to re-run your container on different data sets much faster than building a new container each time.
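For example (a sketch; the config file name and keys are illustrative), the container entry point could read a small JSON config file from /safe_data:
import json\n\n# Read the list of input files from a config file kept alongside the project data\nwith open(\"/safe_data/config.json\") as f:\n    config = json.load(f)\n\nfor path in config[\"input_files\"]:\n    # Replace the print with the project-specific analysis of each file\n    print(\"Processing\", path)\n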
This guide sets out the required activities for researchers using containers in the EPCC TRE (Safe Haven). The intended audience is software developers with experience of containers, and of Docker and Podman in particular. Online courses such as Intro to Containers demonstrate the base skills needed if there is any doubt.
The Container Execution Service (CES) has been introduced to allow project code developed and tested by researchers outside the TRE in personal development environments to be imported and run on the project data inside the TRE using a well-documented, transparent, secure workflow. The primary role of the TRE is to store and share data securely; it is not intended to be a software development and testing environment. The CES removes the need for software development in the TRE.
The use of containers and software configuration management processes is also strongly advocated by the research software engineering community for experiment management and reproducibility. It is recommended that TRE container builders take a disciplined approach to code management and use git to create container build audit trails to satisfy any IG (Information Governance) concerns about the provenance of the project code.
"},{"location":"safe-haven-services/tre-container-user-guide/using-containers-in-the-tre/","title":"Using Containers in the TRE","text":"Once you have built and tested your container, you are ready to start using it within the TRE.
"},{"location":"safe-haven-services/tre-container-user-guide/using-containers-in-the-tre/#pulling-a-container-into-the-tre","title":"Pulling a container into the TRE","text":"Containers can only be used on the TRE desktop hosts using shell commands. And containers can only be pulled from the GitHub Container Registry (GHCR) into the TRE using a ces-pull
script. Hence containers must be pushed to GHCR for them to be used in the TRE.
As use of containers in the TRE is a new service, it is at this stage regarded as an activity that requires additional security controls. As a result the ces-pull
command is a privileged one that can only be run using sudo. Researcher accounts must be explicitly enabled for use of the sudo ces-pull
command through IG approval \u2013 sudo access for these accounts will be constrained to only run the ces-pull
command.
To pull a private image, you must create an access token to authenticate with GHCR (see Authenticating to the container registry). The container is then pulled by the user with the command:
sudo ces-pull <github_user> <github_token> ghcr.io/<namespace>/<container_name>[:<container_tag>]\n
To pull a public image, which does not require authenticating with username and token, pass two empty strings:
sudo ces-pull \"\" \"\" ghcr.io/<namespace>/<container_name>[:<container_tag>]\n
Once the container image has been pulled into the TRE desktop host, the image can be managed with Podman commands. However, containers must not be run directly using Podman. Instead, commands developed for use within the TRE must be used as will now be described.
"},{"location":"safe-haven-services/tre-container-user-guide/using-containers-in-the-tre/#running-the-container-in-the-tre","title":"Running the container in the TRE","text":"Containers may be run in the TRE using one of two commands: use ces-gpu-run
if a GPU is to be connected to the container, otherwise use the ces-run
command. The sudo privilege escalation is not required to run containers. The basic command to start a container is one of:
ces-run ghcr.io/<namespace>/<container_name>[:<container_tag>]\n
or, if your container requires a GPU:
ces-gpu-run ghcr.io/<namespace>/<container_name>[:<container_tag>]\n
Each command supports a number of options to control resource allocation and to pass parameters to the podman run command and to the container itself. Each command has a help option to output the following information:
Usage:\nces-run [options] <container>\nAvailable Options:\n-c|--cores CPU cores to allocate (default is sharing all of them)\n--dry-run Do not run the container, print out all the command options\n--env-file File with env vars to pass to container\n-h|--help Print this stuff\n-m|--memory Memory to allocate in Gb (default is 4Gb)\n-n|--name Assign a name to the container\n--opt-file File with additional options to pass to run command\n-v|--verbose Print out all command options\n--version Print out version string\n
The --env-file
and --opt-file
arguments can be used to extend the command-line script that is executed when the container is started. The --env-file
option is exactly the docker and podman run option with the file containing lines of the form Variable=Value
. See the Docker option reference
The --opt-file
option allows you to have a file containing additional arguments to the ces-run
and ces-gpu-run
command.
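For example (the file name and variables are illustrative), an environment file and the corresponding run command might look like:
# my_env.txt - one Variable=Value pair per line\n#   INPUT_CONFIG=/safe_data/config.json\n#   OMP_NUM_THREADS=4\n\nces-run --env-file my_env.txt ghcr.io/<namespace>/<container_name>:<container_tag>\n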
Data Science Virtual Desktops
Managed File Transfer
Notebooks
Cerebras CS-2
Ultra2
Graphcore Bow Pod64
"},{"location":"services/#data-services","title":"Data Services","text":"S3
Data Catalogue
"},{"location":"services/cs2/","title":"Cerebras CS-2","text":"Get Access
Running codes
"},{"location":"services/cs2/access/","title":"Cerebras CS-2","text":""},{"location":"services/cs2/access/#getting-access","title":"Getting Access","text":"Access to the Cerebras CS-2 system is currently by arrangement with EPCC. Please email eidf@epcc.ed.ac.uk with a short description of the work you would like to perform.
"},{"location":"services/cs2/run/","title":"Cerebras CS-2","text":""},{"location":"services/cs2/run/#introduction","title":"Introduction","text":"The Cerebras CS-2 Wafer-scale cluster (WSC) uses the Ultra2 system as a host system which provides login services, access to files, the SLURM batch system etc.
"},{"location":"services/cs2/run/#connecting-to-the-cluster","title":"Connecting to the cluster","text":"To gain access to the CS-2 WSC you need to login to the host system, Ultra2. See the documentation for Ultra2.
"},{"location":"services/cs2/run/#running-jobs","title":"Running Jobs","text":"All jobs must be run via SLURM to avoid inconveniencing other users of the system. An example job is shown below.
"},{"location":"services/cs2/run/#slurm-example","title":"SLURM example","text":"This is based on the sample job from the Cerebras documentation Cerebras documentation - Execute your job
#!/bin/bash\n#SBATCH --job-name=Example # Job name\n#SBATCH --cpus-per-task=2 # Request 2 cores\n#SBATCH --output=example_%j.log # Standard output and error log\n#SBATCH --time=01:00:00 # Set time limit for this job to 1 hour\n#SBATCH --gres=cs:1 # Request CS-2 system\n\nsource venv_cerebras_pt/bin/activate\npython run.py \\\n CSX \\\n --params params.yaml \\\n --num_csx=1 \\\n --model_dir model_dir \\\n --mode {train,eval,eval_all,train_and_eval} \\\n --mount_dirs {paths to modelzoo and to data} \\\n --python_paths {paths to modelzoo and other python code if used}\n
See the 'Troubleshooting' section below for known issues.
"},{"location":"services/cs2/run/#creating-an-environment","title":"Creating an environment","text":"To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions to do this Cerebras setup environment docs however our host system is slightly different so we recommend the following:
"},{"location":"services/cs2/run/#create-the-venv","title":"Create the venv","text":"python3.8 -m venv venv_cerebras_pt\n
"},{"location":"services/cs2/run/#install-the-dependencies","title":"Install the dependencies","text":"source venv_cerebras_pt/bin/activate\npip install --upgrade pip\npip install cerebras_pytorch==2.2.1\n
"},{"location":"services/cs2/run/#validate-the-setup","title":"Validate the setup","text":"source venv_cerebras_pt/bin/activate\ncerebras_install_check\n
"},{"location":"services/cs2/run/#modify-venv-files-to-remove-clock-sync-check-on-epcc-system","title":"Modify venv files to remove clock sync check on EPCC system","text":"Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround:
"},{"location":"services/cs2/run/#from-within-your-python-venv-edit-the-libpython38site-packagescerebras_pytorchsaverstoragepy-file","title":"From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py filevi <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py\n
","text":""},{"location":"services/cs2/run/#navigate-to-line-530","title":"Navigate to line 530 :530\n
The section should look like this:
if modified_time > self._last_modified:\n raise RuntimeError(\n f\"Attempting to materialize deferred tensor with key \"\n f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n f\"since been modified. The loaded tensor value may be \"\n f\"different from originally loaded tensor. Please refrain \"\n f\"from modifying the file while the run is in progress.\"\n )\n
","text":""},{"location":"services/cs2/run/#comment-out-the-section-if-modified_time-self_last_modified","title":"Comment out the section if modified_time > self._last_modified
#if modified_time > self._last_modified:\n # raise RuntimeError(\n # f\"Attempting to materialize deferred tensor with key \"\n # f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n # f\"since been modified. The loaded tensor value may be \"\n # f\"different from originally loaded tensor. Please refrain \"\n # f\"from modifying the file while the run is in progress.\"\n # )\n
","text":""},{"location":"services/cs2/run/#navigate-to-line-774","title":"Navigate to line 774 :774\n
The section should look like this:
if stat.st_mtime_ns > self._stat.st_mtime_ns:\n raise RuntimeError(\n f\"Attempting to {msg} deferred tensor with key \"\n f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n f\"since been modified. The loaded tensor value may be \"\n f\"different from originally loaded tensor. Please refrain \"\n f\"from modifying the file while the run is in progress.\"\n )\n
","text":""},{"location":"services/cs2/run/#comment-out-the-section-if-statst_mtime_ns-self_statst_mtime_ns","title":"Comment out the section if stat.st_mtime_ns > self._stat.st_mtime_ns
#if stat.st_mtime_ns > self._stat.st_mtime_ns:\n # raise RuntimeError(\n # f\"Attempting to {msg} deferred tensor with key \"\n # f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n # f\"since been modified. The loaded tensor value may be \"\n # f\"different from originally loaded tensor. Please refrain \"\n # f\"from modifying the file while the run is in progress.\"\n # )\n
","text":""},{"location":"services/cs2/run/#save-the-file","title":"Save the file","text":""},{"location":"services/cs2/run/#run-jobs-as-per-existing-documentation","title":"Run jobs as per existing documentation","text":""},{"location":"services/cs2/run/#paths-pythonpath-and-mount_dirs","title":"Paths, PYTHONPATH and mount_dirs","text":"There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras which explains these parameters and how they should be used. Python, paths and mount directories.
"},{"location":"services/datacatalogue/","title":"EIDF Data Catalogue Information","text":"Metadata information
"},{"location":"services/datacatalogue/metadata/","title":"EIDF Metadata Information","text":""},{"location":"services/datacatalogue/metadata/#what-is-fair","title":"What is FAIR?","text":"FAIR stands for Findable, Accessible, Interoperable, and Reusable, and helps emphasise the best practices with publishing and sharing data (more details: FAIR Principles)
"},{"location":"services/datacatalogue/metadata/#what-is-metadata","title":"What is metadata?","text":"Metadata is data about data, to help describe the dataset. Common metadata fields are things like the title of the dataset, who produced it, where it was generated (if relevant), when it was generated, and some key words describing it
"},{"location":"services/datacatalogue/metadata/#what-is-ckan","title":"What is CKAN?","text":"CKAN is a metadata catalogue - i.e. it is a database for metadata rather than data. This will help with all aspects of FAIR:
Using a standard vocabulary (such as the FAST Vocabulary) has many benefits:
All of these advantages mean that we, as a project, don't need to think about this - there is no need to reinvent the wheel when other institutes (e.g. National Libraries) have created. You might recognise WorldCat - it is an organisation which manages a global catalogue of ~18000 libraries world-wide, so they are in a good position to generate a comprehensive vocabulary of academic topics!
"},{"location":"services/datacatalogue/metadata/#what-about-licensing-what-does-cc-by-sa-40-mean","title":"What about licensing? (What does CC-BY-SA 4.0 mean?)","text":"The R in FAIR stands for reusable - more specifically it includes this subphrase: \"(Meta)data are released with a clear and accessible data usage license\". This means that we have to tell anyone else who uses the data what they're allowed to do with it - and, under the FAIR philosophy, more freedom is better.
CC-BY-SA 4.0 allows anyone to remix, adapt, and build upon your work (even for commercial purposes), as long as they credit you and license their new creations under the identical terms. It also explicitly includes Sui Generis Database Rights, giving rights to the curation of a database even if you don't have the rights to the items in a database (e.g. a Spotify playlist, even though you don't own the rights to each track).
Human readable summary: Creative Commons 4.0 Human Readable Full legal code: Creative Commons 4.0 Legal Code
"},{"location":"services/datacatalogue/metadata/#im-stuck-how-do-i-get-help","title":"I'm stuck! How do I get help?","text":"Contact the EIDF Service Team via eidf@epcc.ed.ac.uk
"},{"location":"services/gpuservice/","title":"Overview","text":"The EIDF GPU Service (EIDF GPU Service) provides access to a range of Nvidia GPUs, in both full GPU and MIG variants. The EIDF GPU Service is built upon Kubernetes.
MIG (Multi-instance GPU) allow a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion.
The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU respectively.
The service provides access to:
The current full specification of the EIDF GPU Service as of 14 February 2024:
Quotas
This is the full configuration of the cluster.
Each project will have access to a quota across this shared configuration.
Changes to the default quota must be discussed and agreed with the EIDF Services team.
NOTE
If you request a GPU on the EIDF GPU Service you will be assigned one at random unless you specify a GPU type. Please see Getting started with Kubernetes to learn about specifying GPU resources.
"},{"location":"services/gpuservice/#service-access","title":"Service Access","text":"Users should have an EIDF Account as the EIDF GPU Service is only accessible through EIDF Virtual Machines.
Existing projects can request access to the EIDF GPU Service through a service request to the EIDF helpdesk or emailing eidf@epcc.ed.ac.uk .
New projects wanting to using the GPU Service should include this in their EIDF Project Application.
Each project will be given a namespace within the EIDF GPU service to operate in.
This namespace will normally be the EIDF Project code appended with \u2019ns\u2019, i.e. eidf989ns
for a project with code 'eidf989'.
Once access to the EIDF GPU service has been confirmed, Project Leads will be give the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available here.
All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl command line tool.
The VM does not require to be GPU-enabled.
A quick check to see if a VM has access to the EIDF GPU service can be completed by typing kubectl -n <project-namespace> get jobs
in to the command line.
If this is first time you have connected to the GPU service the response should be No resources found in <project-namespace> namespace
.
EIDF GPU Service vs EIDF GPU-Enabled VMs
The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs.
This allows a project to access multiple GPUs of different types.
An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.
Projects do not have to apply for a GPU-enabled VM to access the GPU Service.
"},{"location":"services/gpuservice/#project-quotas","title":"Project Quotas","text":"A standard project namespace has the following initial quota (subject to ongoing review):
Quota is a maximum on a Shared Resource
A project quota is the maximum proportion of the service available for use by that project.
Any submitted job requests that would exceed the total project quota will be queued.
"},{"location":"services/gpuservice/#project-queues","title":"Project Queues","text":"EIDF GPU Service is introducing the Kueue system in February 2024. The use of this is detailed in the Kueue.
Job Queuing
During periods of high demand, jobs will be queued awaiting resource availability on the Service.
As a general rule, the higher the GPU/CPU/Memory resource request of a single job, the longer it will wait in the queue before enough resources are free on a single node for it to be allocated.
GPUs in high demand, such as Nvidia H100s, typically have longer wait times.
Furthermore, a project may have a quota of up to 12 GPUs but due to demand may only be able to access a smaller number at any given time.
"},{"location":"services/gpuservice/#additional-service-policy-information","title":"Additional Service Policy Information","text":"Additional information on service policies can be found here.
"},{"location":"services/gpuservice/#eidf-gpu-service-tutorial","title":"EIDF GPU Service Tutorial","text":"This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it is not a comprehensive overview of Kubernetes.
Lesson Objective Getting started with Kubernetes a. What is Kubernetes?b. How to send a task to a GPU node.c. How to define the GPU resources needed. Requesting persistent volumes with Kubernetes a. What is a persistent volume? b. How to request a PV resource. Running a PyTorch task a. Accessing a Pytorch container.b. Submitting a PyTorch task to the cluster.c. Inspecting the results. Template workflow a. Loading large data sets asynchronously.b. Manually or automatically building Docker images.c. Iteratively changing and testing code in a job."},{"location":"services/gpuservice/#further-reading-and-help","title":"Further Reading and Help","text":"The Nvidia developers blog provides several examples of how to run ML tasks on a Kubernetes GPU cluster.
Kubernetes documentation has a useful kubectl cheat sheet.
More detailed use cases for the kubectl
can be found in the Kubernetes documentation.
The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM will have access to all EIDF resources for your project and can be accessed through the VDI (SSH or if enabled RDP) or via the EIDF SSH Gateway.
"},{"location":"services/gpuservice/faq/#how-do-i-obtain-my-project-kubeconfig-file","title":"How do I obtain my project kubeconfig file?","text":"Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. Project Leads and Managers can provide the file on any of the project VMs or give it to individuals within the project.
"},{"location":"services/gpuservice/faq/#access-to-gpu-service-resources-in-default-namespace-is-forbidden","title":"Access to GPU Service resources in default namespace is 'Forbidden'","text":"Error from server (Forbidden): error when creating \"myjobfile.yml\": jobs is forbidden: User <user> cannot create resource \"jobs\" in API group \"\" in the namespace \"default\"\n
Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when the project namespace is not included in the kubectl command for submitting job/pods and kubectl tries to use the \"default\" namespace which projects do not have permissions to use. Resubmitting the job/pod with kubectl -n <project-namespace> create \"myjobfile.yml\"
should solve the issue.
The current PVC provisioner is based on Ceph RBD. The block devices provided by Ceph to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time. They can only be accessed by one pod at a time, once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation.
"},{"location":"services/gpuservice/faq/#how-many-gpus-can-i-use-in-a-pod","title":"How many GPUs can I use in a pod?","text":"The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, you will be placed in a queue until a node with 8 GPUs is free or other jobs to run. If you request 4 GPUs this could run on a node with 4 or 8 GPUs.
"},{"location":"services/gpuservice/faq/#why-did-a-validation-error-occur-when-submitting-a-pod-or-job-with-a-valid-specification-file","title":"Why did a validation error occur when submitting a pod or job with a valid specification file?","text":"If an error like the below occurs:
error: error validating \"myjobfile.yml\": error validating data: the server does not allow access to the requested resource; if you choose to ignore these errors, turn validation off with --validate=false\n
There may be an issue with the kubectl version that is being run. This can occur if installing in virtual environments or from packages repositories.
The current version verified to operate with the GPU Service is v1.24.10. kubectl and the Kubernetes API version can suffer from version skew if not with a defined number of releases. More information can be found on this under the Kubernetes Version Skew Policy.
"},{"location":"services/gpuservice/faq/#insufficient-shared-memory-size","title":"Insufficient Shared Memory Size","text":"My SHM is very small, and it causes \"OSError: [Errno 28] No space left on device\" when I train a model using multi-GPU. How to increase SHM size?
The default size of SHM is only 64M. You can mount an empty dir to /dev/shm to solve this problem:
spec:\n containers:\n - name: [NAME]\n image: [IMAGE]\n volumeMounts:\n - mountPath: /dev/shm\n name: dshm\n volumes:\n - name: dshm\n emptyDir:\n medium: Memory\n
"},{"location":"services/gpuservice/faq/#pytorch-slow-performance-issues","title":"Pytorch Slow Performance Issues","text":"Pytorch on Kubernetes may operate slower than expected - much slower than an equivalent VM setup.
Pytorch defaults to auto-detecting the number of OMP Threads and it will report an incorrect number of potential threads compared to your requested CPU core count. This is a consequence in operating in a container environment, the CPU information is reported by standard libraries and tools will be the node level information rather than your container.
To help correct this issue, the environment variable OMP_NUM_THREADS should be set in the job submission file to the number of cores requested or less.
This has been tested using:
Example fragment for a Bash command start:
containers:\n - args:\n - >\n export OMP_NUM_THREADS=1;\n python mypytorchprogram.py;\n command:\n - /bin/bash\n - '-c'\n - '--'\n
"},{"location":"services/gpuservice/faq/#my-large-number-of-gpus-job-takes-a-long-time-to-be-scheduled","title":"My large number of GPUs Job takes a long time to be scheduled","text":"When requesting a large number of GPUs for a job, this may require an entire node to be free. This could take some time to become available, the default scheduling algorithm in the queues in place is Best Effort FIFO - this means that large jobs will not block small jobs from running if there is sufficient quota and space available.
"},{"location":"services/gpuservice/kueue/","title":"Kueue","text":""},{"location":"services/gpuservice/kueue/#overview","title":"Overview","text":"Kueue is a native Kubernetes quota and job management system.
This is the job queue system for the EIDF GPU Service, starting with February 2024.
All users should submit jobs to their local namespace user queue, this queue will have the name eidf project namespace
-user-queue.
Jobs can be submitted as before but will require the addition of a metadata label:
labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\n
This is the only change required to make Jobs Kueue functional. A policy will be in place that will stop jobs without this label being accepted.
"},{"location":"services/gpuservice/kueue/#useful-commands-for-looking-at-your-local-queue","title":"Useful commands for looking at your local queue","text":""},{"location":"services/gpuservice/kueue/#kubectl-get-queue","title":"kubectl get queue
","text":"This command will output the high level status of your namespace queue with the number of workloads currently running and the number waiting to start:
NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS\neidf001-user-queue eidf001-project-gpu-cq 0 2\n
"},{"location":"services/gpuservice/kueue/#kubectl-describe-queue-queue","title":"kubectl describe queue <queue>
","text":"This command will output more detailed information on the current resource usage in your queue:
Name: eidf001-user-queue\nNamespace: eidf001\nLabels: <none>\nAnnotations: <none>\nAPI Version: kueue.x-k8s.io/v1beta1\nKind: LocalQueue\nMetadata:\n Creation Timestamp: 2024-02-06T13:06:23Z\n Generation: 1\n Managed Fields:\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:spec:\n .:\n f:clusterQueue:\n Manager: kubectl-create\n Operation: Update\n Time: 2024-02-06T13:06:23Z\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:status:\n .:\n f:admittedWorkloads:\n f:conditions:\n .:\n k:{\"type\":\"Active\"}:\n .:\n f:lastTransitionTime:\n f:message:\n f:reason:\n f:status:\n f:type:\n f:flavorUsage:\n .:\n k:{\"name\":\"default-flavor\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"cpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"memory\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-1g\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-3g\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-80\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n f:flavorsReservation:\n .:\n k:{\"name\":\"default-flavor\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"cpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"memory\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-1g\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-3g\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-80\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n f:pendingWorkloads:\n f:reservingWorkloads:\n Manager: kueue\n Operation: Update\n Subresource: status\n Time: 2024-02-14T10:54:20Z\n Resource Version: 333898946\n UID: bca097e2-6c55-4305-86ac-d1bd3c767751\nSpec:\n Cluster Queue: eidf001-project-gpu-cq\nStatus:\n Admitted Workloads: 2\n Conditions:\n Last Transition Time: 2024-02-06T13:06:23Z\n Message: Can submit new workloads to clusterQueue\n Reason: Ready\n Status: True\n Type: Active\n Flavor Usage:\n Name: gpu-a100\n Resources:\n Name: nvidia.com/gpu\n Total: 2\n Name: gpu-a100-3g\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: gpu-a100-1g\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: gpu-a100-80\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: default-flavor\n Resources:\n Name: cpu\n Total: 16\n Name: memory\n Total: 256Gi\n Flavors Reservation:\n Name: gpu-a100\n Resources:\n Name: nvidia.com/gpu\n Total: 2\n Name: gpu-a100-3g\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: gpu-a100-1g\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: gpu-a100-80\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: default-flavor\n Resources:\n Name: cpu\n Total: 16\n Name: memory\n Total: 256Gi\n Pending Workloads: 0\n Reserving Workloads: 2\nEvents: <none>\n
"},{"location":"services/gpuservice/kueue/#kubectl-get-workloads","title":"kubectl get workloads
","text":"This command will return the list of workloads in the queue:
NAME QUEUE ADMITTED BY AGE\njob-jobtest-366ab eidf001-user-queue eidf001-project-gpu-cq 4h45m\njob-jobtest-34ba9 eidf001-user-queue eidf001-project-gpu-cq 6h48m\n
"},{"location":"services/gpuservice/kueue/#kubectl-describe-workload-workload","title":"kubectl describe workload <workload>
","text":"This command will return a detailed summary of the workload including status and resource usage:
Name: job-pytorch-job-0b664\nNamespace: t4\nLabels: kueue.x-k8s.io/job-uid=33bc1e48-4dca-4252-9387-bf68b99759dc\nAnnotations: <none>\nAPI Version: kueue.x-k8s.io/v1beta1\nKind: Workload\nMetadata:\n Creation Timestamp: 2024-02-14T15:22:16Z\n Generation: 2\n Managed Fields:\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:status:\n f:admission:\n f:clusterQueue:\n f:podSetAssignments:\n k:{\"name\":\"main\"}:\n .:\n f:count:\n f:flavors:\n f:cpu:\n f:memory:\n f:nvidia.com/gpu:\n f:name:\n f:resourceUsage:\n f:cpu:\n f:memory:\n f:nvidia.com/gpu:\n f:conditions:\n k:{\"type\":\"Admitted\"}:\n .:\n f:lastTransitionTime:\n f:message:\n f:reason:\n f:status:\n f:type:\n k:{\"type\":\"QuotaReserved\"}:\n .:\n f:lastTransitionTime:\n f:message:\n f:reason:\n f:status:\n f:type:\n Manager: kueue-admission\n Operation: Apply\n Subresource: status\n Time: 2024-02-14T15:22:16Z\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:status:\n f:conditions:\n k:{\"type\":\"Finished\"}:\n .:\n f:lastTransitionTime:\n f:message:\n f:reason:\n f:status:\n f:type:\n Manager: kueue-job-controller-Finished\n Operation: Apply\n Subresource: status\n Time: 2024-02-14T15:25:06Z\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:metadata:\n f:labels:\n .:\n f:kueue.x-k8s.io/job-uid:\n f:ownerReferences:\n .:\n k:{\"uid\":\"33bc1e48-4dca-4252-9387-bf68b99759dc\"}:\n f:spec:\n .:\n f:podSets:\n .:\n k:{\"name\":\"main\"}:\n .:\n f:count:\n f:name:\n f:template:\n .:\n f:metadata:\n .:\n f:labels:\n .:\n f:controller-uid:\n f:job-name:\n f:name:\n f:spec:\n .:\n f:containers:\n f:dnsPolicy:\n f:nodeSelector:\n f:restartPolicy:\n f:schedulerName:\n f:securityContext:\n f:terminationGracePeriodSeconds:\n f:volumes:\n f:priority:\n f:priorityClassSource:\n f:queueName:\n Manager: kueue\n Operation: Update\n Time: 2024-02-14T15:22:16Z\n Owner References:\n API Version: batch/v1\n Block Owner Deletion: true\n Controller: true\n Kind: Job\n Name: pytorch-job\n UID: 33bc1e48-4dca-4252-9387-bf68b99759dc\n Resource Version: 270812029\n UID: 8cfa93ba-1142-4728-bc0c-e8de817e8151\nSpec:\n Pod Sets:\n Count: 1\n Name: main\n Template:\n Metadata:\n Labels:\n Controller - UID: 33bc1e48-4dca-4252-9387-bf68b99759dc\n Job - Name: pytorch-job\n Name: pytorch-pod\n Spec:\n Containers:\n Args:\n /mnt/ceph_rbd/example_pytorch_code.py\n Command:\n python3\n Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n Image Pull Policy: IfNotPresent\n Name: pytorch-con\n Resources:\n Limits:\n Cpu: 4\n Memory: 4Gi\n nvidia.com/gpu: 1\n Requests:\n Cpu: 2\n Memory: 1Gi\n Termination Message Path: /dev/termination-log\n Termination Message Policy: File\n Volume Mounts:\n Mount Path: /mnt/ceph_rbd\n Name: volume\n Dns Policy: ClusterFirst\n Node Selector:\n nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB\n Restart Policy: Never\n Scheduler Name: default-scheduler\n Security Context:\n Termination Grace Period Seconds: 30\n Volumes:\n Name: volume\n Persistent Volume Claim:\n Claim Name: pytorch-pvc\n Priority: 0\n Priority Class Source:\n Queue Name: t4-user-queue\nStatus:\n Admission:\n Cluster Queue: project-cq\n Pod Set Assignments:\n Count: 1\n Flavors:\n Cpu: default-flavor\n Memory: default-flavor\n nvidia.com/gpu: gpu-a100\n Name: main\n Resource Usage:\n Cpu: 2\n Memory: 1Gi\n nvidia.com/gpu: 1\n Conditions:\n Last Transition Time: 2024-02-14T15:22:16Z\n Message: Quota reserved in ClusterQueue project-cq\n Reason: QuotaReserved\n Status: True\n Type: 
QuotaReserved\n Last Transition Time: 2024-02-14T15:22:16Z\n Message: The workload is admitted\n Reason: Admitted\n Status: True\n Type: Admitted\n Last Transition Time: 2024-02-14T15:25:06Z\n Message: Job finished successfully\n Reason: JobFinished\n Status: True\n Type: Finished\n
"},{"location":"services/gpuservice/policies/","title":"GPU Service Policies","text":""},{"location":"services/gpuservice/policies/#namespaces","title":"Namespaces","text":"Each project will be given a namespace which will have an applied quota.
Default Quota:
Each project will be assigned a kubeconfig file for access to the service which will allow operation in the assigned namespace and access to exposed service operators, for example the GPU and CephRBD operators.
"},{"location":"services/gpuservice/policies/#kubernetes-job-time-to-live","title":"Kubernetes Job Time to Live","text":"All Kubernetes Jobs submitted to the service will have a Time to Live (TTL) applied via spec.ttlSecondsAfterFinished
automatically. The default TTL for jobs using the service will be 1 week (604800 seconds). A completed job (in success or error state) will be deleted from the service once one week has elapsed after execution has completed. This will reduce excessive object accumulation on the service.
Important
This policy is automated and does not require users to change their job specifications.
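If you want to confirm the TTL that has been applied to one of your submitted jobs, a command along the following lines should display the field (the jsonpath shown is the standard Kubernetes Job field; the value itself is set by the service):
kubectl -n <project-namespace> get job <job name> -o jsonpath='{.spec.ttlSecondsAfterFinished}'\n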
"},{"location":"services/gpuservice/policies/#kubernetes-active-deadline-seconds","title":"Kubernetes Active Deadline Seconds","text":"All Kubernetes User Pods submitted to the service will have an Active Deadline Seconds (ADS) applied via spec.spec.activeDeadlineSeconds
automatically. The default ADS for pods using the service will be 5 days (432000 seconds). A pod will be terminated 5 days after execution has begun. This will reduce the number of unused pods remaining on the service.
Important
This policy is automated and does not require users to change their job or pod specifications.
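Similarly, the deadline applied to one of your pods can be checked with something like:
kubectl -n <project-namespace> get pod <pod name> -o jsonpath='{.spec.activeDeadlineSeconds}'\n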
"},{"location":"services/gpuservice/policies/#kueue","title":"Kueue","text":"All jobs will be managed through the Kueue scheduling system. All pods will be required to be owned by a Kubernetes workload.
Each project will have a local user queue in their namespace. This will provide access to their cluster queue. To enable the use of the queue in your job definitions, the following will need to be added to the job specification file as part of the metadata:
labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\n
Jobs without this queue name tag will be rejected.
Pods bypassing the queue system will be deleted.
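To confirm that a submitted job has been admitted by Kueue, you can list and inspect the workloads in your namespace; these are standard Kueue resources, so the commands below should work with the version deployed on the service:
kubectl -n <project-namespace> get workloads\n\nkubectl -n <project-namespace> describe workload <workload name>\n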
"},{"location":"services/gpuservice/training/L1_getting_started/","title":"Getting started with Kubernetes","text":""},{"location":"services/gpuservice/training/L1_getting_started/#requirements","title":"Requirements","text":"In order to follow this tutorial on the EIDF GPU Cluster you will need to have:
An account on the EIDF Portal.
An active EIDF Project on the Portal with access to the EIDF GPU Service.
The EIDF GPU Service kubernetes namespace associated with the project, e.g. eidf001ns.
The EIDF GPU Service queue name associated with the project, e.g. eidf001ns-user-queue.
Downloaded the kubeconfig file to a Project VM along with the kubectl command line tool to interact with the K8s API.
Downloading the kubeconfig file and kubectl
Project Leads should use the 'Download kubeconfig' button on the EIDF Portal to complete this step to ensure the correct kubeconfig file and kubectl version are installed.
"},{"location":"services/gpuservice/training/L1_getting_started/#introduction","title":"Introduction","text":"Kubernetes (K8s) is a container orchestration system, originally developed by Google, for the deployment, scaling, and management of containerised applications.
Nvidia GPUs are supported through K8s native Nvidia GPU Operators.
The use of K8s to manage the EIDF GPU Service provides two key advantages:
An overview of the key components of a K8s container can be seen on the Kubernetes docs website.
The primary component of a K8s cluster is a pod.
A pod is a set of one or more docker containers (and their storage volumes) that share resources.
It is the EIDF GPU Cluster policy that all pods should be wrapped within a K8s job.
This allows GPU/CPU/Memory resource requests to be managed by the cluster queue management system, kueue.
Pods which attempt to bypass the queue mechanism will affect the experience of other project users.
Any pods not associated with a job (or other K8s object) are at risk of being deleted without notice.
K8s jobs also provide additional functionality such as parallelism (described later in this tutorial).
Users define the resource requirements of a pod (i.e. number/type of GPU) and the containers/code to be run in the pod by defining a template within a job manifest file written in yaml.
The job yaml file is sent to the cluster using the K8s API and is assigned to an appropriate node to be run.
A node is a part of the cluster such as a physical or virtual host which exposes CPU, Memory and GPUs.
Users interact with the K8s API using the kubectl
(short for kubernetes control) commands.
Some of the kubectl commands are restricted on the EIDF cluster in order to ensure project details are not shared across namespaces.
Ensure kubectl is interacting with your project namespace.
You will need to pass the name of your project namespace to kubectl
in order for it to have permission to interact with the cluster.
kubectl
will attempt to interact with the default
namespace which will return a permissions error if it is not told otherwise.
kubectl -n <project-namespace> <command>
will tell kubectl to pass the commands to the correct namespace.
Useful commands are:
kubectl -n <project-namespace> create -f <job definition yaml>
: Create a new job with requested resources. Returns an error if a job with the same name already exists.kubectl -n <project-namespace> apply -f <job definition yaml>
: Create a new job with requested resources. If a job with the same name already exists it updates that job with the new resource/container requirements outlined in the yaml.kubectl -n <project-namespace> delete pod <pod name>
: Delete a pod from the cluster.kubectl -n <project-namespace> get pods
: Summarise all pods the namespace has active (or pending).kubectl -n <project-namespace> describe pods
: Verbose description of all pods the namespace has active (or pending).kubectl -n <project-namespace> describe pod <pod name>
: Verbose summary of the specified pod.kubectl -n <project-namespace> logs <pod name>
: Retrieve the log files associated with a running pod.kubectl -n <project-namespace> get jobs
: List all jobs the namespace has active (or pending).kubectl -n <project-namespace> describe job <job name>
: Verbose summary of the specified job.kubectl -n <project-namespace> delete job <job name>
: Delete a job from the cluster.To access the GPUs on the service, it is recommended to start with one of the prebuilt container images provided by Nvidia; these images are intended to perform different tasks using Nvidia GPUs.
The list of Nvidia images is available on their website.
The following example uses their CUDA sample code simulating nbody interactions.
apiVersion: batch/v1\nkind: Job\nmetadata:\n generateName: jobtest-\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: job-test\n spec:\n containers:\n - name: cudasample\n image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n resources:\n requests:\n cpu: 2\n memory: '1Gi'\n limits:\n cpu: 2\n memory: '4Gi'\n nvidia.com/gpu: 1\n restartPolicy: Never\n
The pod resources are defined under the resources
tags using the requests
and limits
tags.
Resources defined under the requests
tags are the reserved resources required for the pod to be scheduled.
If a pod is assigned to a node with unused resources then it may burst up to use resources beyond those requested.
This may allow the task within the pod to run faster, but it will also throttle back down when further pods are scheduled to the node.
The limits
tag specifies the maximum resources that can be assigned to a pod.
The EIDF GPU Service requires all pods to have requests
and limits
tags for CPU and memory defined in order to be accepted.
GPU resource requests are optional; only an entry under the limits
tag is needed to specify the use of a GPU, nvidia.com/gpu: 1
. Without this no GPU will be available to the pod.
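Putting these rules together, a minimal resources section that should satisfy the service requirements might look like the following (the values are purely illustrative):
resources:\n requests:\n cpu: 2\n memory: '1Gi'\n limits:\n cpu: 2\n memory: '4Gi'\n nvidia.com/gpu: 1\n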
The label kueue.x-k8s.io/queue-name
specifies the queue you are submitting your job to. This is part of the Kueue system in operation on the service to allow for improved resource management for users.
<project-namespace>-user-queue
, e.g. eidf001ns-user-queue. Run kubectl -n <project-namespace> create -f test_NBody.yml
This will output something like:
job.batch/jobtest-b92qg created\n
The five character code appended to the job name, i.e. b92qg
, is randomly generated and will be different in your run.
Run kubectl -n <project-namespace> get jobs
This will output something like:
NAME COMPLETIONS DURATION AGE\njobtest-b92qg 1/1 48s 29m\n
There may be more than one entry as this displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age.
Inspect your job further using the command kubectl -n <project-namespace> describe job jobtest-b92qg
, updating the job name with your five character code.
This will output something like:
Name: jobtest-b92qg\nNamespace: t4\nSelector: controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3\nLabels: kueue.x-k8s.io/queue-name=t4-user-queue\nAnnotations: batch.kubernetes.io/job-tracking:\nParallelism: 1\nCompletions: 3\nCompletion Mode: NonIndexed\nStart Time: Wed, 14 Feb 2024 14:07:44 +0000\nCompleted At: Wed, 14 Feb 2024 14:08:32 +0000\nDuration: 48s\nPods Statuses: 0 Active (0 Ready) / 3 Succeeded / 0 Failed\nPod Template:\n Labels: controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3\n job-name=jobtest-b92qg\n Containers:\n cudasample:\n Image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n Port: <none>\n Host Port: <none>\n Args:\n -benchmark\n -numbodies=512000\n -fp64\n -fullscreen\n Limits:\n cpu: 2\n memory: 4Gi\n nvidia.com/gpu: 1\n Requests:\n cpu: 2\n memory: 1Gi\n Environment: <none>\n Mounts: <none>\n Volumes: <none>\nEvents:\nType Reason Age From Message\n---- ------ ---- ---- -------\nNormal Suspended 8m1s job-controller Job suspended\nNormal CreatedWorkload 8m1s batch/job-kueue-controller Created Workload: t4/job-jobtest-b92qg-3b890\nNormal Started 8m1s batch/job-kueue-controller Admitted by clusterQueue project-cq\nNormal SuccessfulCreate 8m job-controller Created pod: jobtest-b92qg-lh64s\nNormal Resumed 8m job-controller Job resumed\nNormal SuccessfulCreate 7m44s job-controller Created pod: jobtest-b92qg-xhvdm\nNormal SuccessfulCreate 7m28s job-controller Created pod: jobtest-b92qg-lvmrf\nNormal Completed 7m12s job-controller Job completed\n
Run kubectl -n <project-namespace> get pods
This will output something like:
NAME READY STATUS RESTARTS AGE\njobtest-b92qg-lh64s 0/1 Completed 0 11m\n
Again, there may be more than one entry as this displays all the pods in the current namespace. Also, each pod within a job is given another unique 5 character code appended to the job name.
View the logs of a pod from the job you ran with kubectl -n <project-namespace> logs jobtest-b92qg-lh64s
- again, update this with your run's pod and job five character codes.
This will output something like:
Run \"nbody -benchmark [-numbodies=<numBodies>]\" to measure performance.\n -fullscreen (run n-body simulation in fullscreen mode)\n -fp64 (use double precision floating point values for simulation)\n -hostmem (stores simulation data in host memory)\n -benchmark (run benchmark to measure performance)\n -numbodies=<N> (number of bodies (>= 1) to run in simulation)\n -device=<d> (where d=0,1,2.... for the CUDA device to use)\n -numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)\n -compare (compares simulation results running once on the default GPU and once on the CPU)\n -cpu (run n-body simulation on the CPU)\n -tipsy=<file.bin> (load a tipsy model file for simulation)\n\nNOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.\n\n> Fullscreen mode\n> Simulation data stored in video memory\n> Double precision floating point simulation\n> 1 Devices used for simulation\nGPU Device 0: \"Ampere\" with compute capability 8.0\n\n> Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]\nnumber of bodies = 512000\n512000 bodies, total time for 10 iterations: 10570.778 ms\n= 247.989 billion interactions per second\n= 7439.679 double-precision GFLOP/s at 30 flops per interaction\n
Delete your job with kubectl -n <project-namespace> delete job jobtest-b92qg
- this will delete the associated pods as well.
If you create multiple jobs with the same definition file and compare their log files, you may notice that the CUDA device differs from Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]
.
The GPU Operator on K8s is allocating the pod to the first node with a GPU free that matches the other resource specifications irrespective of the type of GPU present on the node.
The GPU resource requests can be made more specific by adding the type of GPU product the pod template is requesting to the node selector:
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-3g.20gb'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'
nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'
The nodeSelector:
key at the bottom of the pod template states that the pod should be run on a node with a 1g.5gb MIG GPU.
Exact GPU product names only
K8s will fail to assign the pod if you misspell the GPU type.
Be especially careful when requesting a full 80GB or 40GB A100 GPU, as attempting to load a GPU with more data than its memory can hold can have unexpected consequences.
apiVersion: batch/v1\nkind: Job\nmetadata:\n generateName: jobtest-\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: job-test\n spec:\n containers:\n - name: cudasample\n image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n resources:\n requests:\n cpu: 2\n memory: '1Gi'\n limits:\n cpu: 2\n memory: '4Gi'\n nvidia.com/gpu: 1\n restartPolicy: Never\n nodeSelector:\n nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n
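If your project kubeconfig permits listing nodes (some kubectl commands are restricted on the EIDF cluster), you may be able to check the exact GPU product names available with a command such as:
kubectl get nodes -L nvidia.com/gpu.product\n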
"},{"location":"services/gpuservice/training/L1_getting_started/#running-multiple-pods-with-k8s-jobs","title":"Running multiple pods with K8s jobs","text":"Wrapping a pod within a job provides additional functionality on top of accessing the queuing system.
Firstly, the restartPolicy within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod.
Jobs also allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate.
See below for an example K8s job that requires three pods to successfully complete the example CUDA code before the job itself ends.
apiVersion: batch/v1\nkind: Job\nmetadata:\n generateName: jobtest-\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 3\n parallelism: 1\n template:\n metadata:\n name: job-test\n spec:\n containers:\n - name: cudasample\n image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n resources:\n requests:\n cpu: 2\n memory: '1Gi'\n limits:\n cpu: 2\n memory: '4Gi'\n nvidia.com/gpu: 1\n restartPolicy: Never\n
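If you would prefer the three completions to run concurrently rather than one after another, the parallelism value can be raised to match, subject to the resources available to your project queue, e.g.:
spec:\n completions: 3\n parallelism: 3\n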
"},{"location":"services/gpuservice/training/L1_getting_started/#change-the-default-kubectl-namespace-in-the-project-kubeconfig-file","title":"Change the default kubectl namespace in the project kubeconfig file","text":"Passing the -n <project-namespace>
flag every time you want to interact with the cluster can be cumbersome.
You can alter the kubeconfig on your VM to send commands to your project namespace by default.
Only users with sudo privileges can change the root kubectl config file.
Open the command line on your EIDF VM with access to the EIDF GPU Service.
Open the root kubeconfig file with sudo privileges.
sudo nano /kubernetes/config\n
Add the namespace line with your project's kubernetes namespace to the \"eidf-general-prod\" context entry in your copy of the config file.
*** MORE CONFIG ***\n\ncontexts:\n- name: \"eidf-general-prod\"\n context:\n user: \"eidf-general-prod\"\n namespace: \"<project-namespace>\" # INSERT LINE\n cluster: \"eidf-general-prod\"\n\n*** MORE CONFIG ***\n
Check kubectl connects to the cluster. If this does not work, delete and re-download the kubeconfig file using the button on the project page of the EIDF portal.
kubectl get pods\n
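If editing the file by hand is awkward, the same default namespace can usually be set with kubectl itself, assuming your user is permitted to write to the kubeconfig in use:
kubectl config set-context --current --namespace=<project-namespace>\n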
It is recommended that users complete Getting started with Kubernetes before proceeding with this tutorial.
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#overview","title":"Overview","text":"Pods in the K8s EIDF GPU Service are intentionally ephemeral.
They only last as long as required to complete the task that they were created for.
Keeping pods ephemeral ensures the cluster resources are released for other users to request.
However, this means the default storage volumes within a pod are temporary.
If multiple pods require access to the same large data set or they output large files, then computationally costly file transfers need to be included in every pod instance.
K8s allows you to request persistent volumes that can be mounted to multiple pods to share files or collate outputs.
These persistent volumes will remain even if the pods they are mounted to are deleted, are updated or crash.
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#submitting-a-persistent-volume-claim","title":"Submitting a Persistent Volume Claim","text":"Before a persistent volume can be mounted to a pod, the required storage resources need to be requested and reserved to your namespace.
A PersistentVolumeClaim (PVC) needs to be submitted to K8s to request the storage resources.
The storage resources are held on a Ceph server which can accept requests up to 100 TiB. Currently, each PVC can only be accessed by one pod at a time; this limitation is being addressed in further development of the EIDF GPU Service. This means that, at this stage, pods can mount the same PVC in sequence, but not concurrently.
Example PVCs can be seen on the Kubernetes documentation page.
All PVCs on the EIDF GPU Service must use the csi-rbd-sc
storage class.
kind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n name: test-ceph-pvc\nspec:\n accessModes:\n - ReadWriteOnce\n resources:\n requests:\n storage: 2Gi\n storageClassName: csi-rbd-sc\n
You create a persistent volume claim by passing the yaml file to kubectl, just like a pod specification yaml: kubectl -n <project-namespace> create -f <PVC specification yaml>
Once you have successfully created a persistent volume you can interact with it using the standard kubectl commands:
kubectl -n <project-namespace> delete pvc <PVC name>
kubectl -n <project-namespace> get pvc <PVC name>
kubectl -n <project-namespace> apply -f <PVC specification yaml>
Introducing a persistent volume to a pod requires the addition of a volumeMount option to the container and a volume option linking to the PVC in the pod specification yaml.
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#example-pod-specification-yaml-with-mounted-persistent-volume","title":"Example pod specification yaml with mounted persistent volume","text":"apiVersion: batch/v1\nkind: Job\nmetadata:\n name: test-ceph-pvc-job\n labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: test-ceph-pvc-pod\n spec:\n containers:\n - name: cudasample\n image: busybox\n args: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 2\n memory: '1Gi'\n limits:\n cpu: 2\n memory: '4Gi'\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n restartPolicy: Never\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: test-ceph-pvc\n
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#accessing-the-persistent-volume-outside-a-pod","title":"Accessing the persistent volume outside a pod","text":"To move files in/out of the persistent volume from outside a pod you can use the kubectl cp command.
*** On Login Node - replacing pod name with your pod name ***\nkubectl -n <project-namespace> cp /home/data/test_data.csv test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd\n
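Copying files out of the persistent volume works the same way with the source and destination swapped; the pod name below is taken from the example above and the file name is just a placeholder:
kubectl -n <project-namespace> cp test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd/output_data.csv /home/data/output_data.csv\n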
For more complex file transfers and synchronisation, create a low resource pod with the persistent volume mounted.
The bash command rsync can be adapted to manage file transfers into the mounted PV following this GitHub repo.
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#clean-up","title":"Clean up","text":"kubectl -n <project-namespace> delete job test-ceph-pvc-job\n\nkubectl -n <project-namespace> delete pvc test-ceph-pvc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/","title":"Running a PyTorch task","text":""},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#requirements","title":"Requirements","text":"It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes With Kubernetes before proceeding with this tutorial.
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#overview","title":"Overview","text":"In the following lesson, we'll build a CNN neural network and train it using the EIDF GPU Service.
The model was taken from the PyTorch Tutorials.
The lesson will be split into three parts:
Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec yaml below).
kubectl -n <project-namespace> create -f <pvc-spec-yaml>\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-pytorch-persistentvolumeclaim","title":"Example PyTorch PersistentVolumeClaim","text":"kind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n name: pytorch-pvc\nspec:\n accessModes:\n - ReadWriteOnce\n resources:\n requests:\n storage: 2Gi\n storageClassName: csi-rbd-sc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#transfer-codedata-to-persistent-volume","title":"Transfer code/data to persistent volume","text":"Check PVC has been created
kubectl -n <project-namespace> get pvc <pv-name>\n
Create a lightweight job with pod with PV mounted (example job below)
kubectl -n <project-namespace> create -f lightweight-pod-job.yaml\n
Download the PyTorch code
wget https://github.com/EPCCed/eidf-docs/raw/main/docs/services/gpuservice/training/resources/example_pytorch_code.py\n
Copy the Python script into the PV
kubectl -n <project-namespace> cp example_pytorch_code.py lightweight-job-<identifier>:/mnt/ceph_rbd/\n
Check whether the files were transferred successfully
kubectl -n <project-namespace> exec lightweight-job-<identifier> -- ls /mnt/ceph_rbd\n
Delete the lightweight job
kubectl -n <project-namespace> delete job lightweight-job-<identifier>\n
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: lightweight-pod\n spec:\n containers:\n - name: data-loader\n image: busybox\n args: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 1\n memory: '1Gi'\n limits:\n cpu: 1\n memory: '1Gi'\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n restartPolicy: Never\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: pytorch-pvc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#creating-a-job-with-a-pytorch-container","title":"Creating a Job with a PyTorch container","text":"We will use the pre-made PyTorch Docker image available on Docker Hub to run the PyTorch ML model.
The PyTorch container will be held within a pod that has the persistent volume mounted and access a MIG GPU.
Submit the specification file below to K8s to create the job, replacing the queue name with your project namespace queue name.
kubectl -n <project-namespace> create -f <pytorch-job-yaml>\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-pytorch-job-specification-file","title":"Example PyTorch Job Specification File","text":"apiVersion: batch/v1\nkind: Job\nmetadata:\n name: pytorch-job\n labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: pytorch-pod\n spec:\n restartPolicy: Never\n containers:\n - name: pytorch-con\n image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n command: [\"python3\"]\n args: [\"/mnt/ceph_rbd/example_pytorch_code.py\"]\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n resources:\n requests:\n cpu: 2\n memory: \"1Gi\"\n limits:\n cpu: 4\n memory: \"4Gi\"\n nvidia.com/gpu: 1\n nodeSelector:\n nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: pytorch-pvc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#reviewing-the-results-of-the-pytorch-model","title":"Reviewing the results of the PyTorch model","text":"This is not intended to be an introduction to PyTorch, please see the online tutorial for details about the model.
Check that the model ran to completion
kubectl -n <project-namespace> logs <pytorch-pod-name>\n
Spin up a lightweight pod to retrieve results
kubectl -n <project-namespace> create -f lightweight-pod-job.yaml\n
Copy the trained model back to your access VM
kubectl -n <project-namespace> cp lightweight-job-<identifier>:mnt/ceph_rbd/model.pth model.pth\n
A common ML training workflow may consist of training multiple iterations of a model: such as models with different hyperparameters or models trained on multiple different data sets.
A Kubernetes job can create and manage multiple pods with identical or different initial parameters.
NVIDIA provide a detailed tutorial on how to conduct a ML hyperparameter search with a Kubernetes job.
Below is an example job yaml for running the pytorch model which will continue to create pods until three have successfully completed the task of training the model.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: pytorch-job\n labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\nspec:\n completions: 3\n template:\n metadata:\n name: pytorch-pod\n spec:\n restartPolicy: Never\n containers:\n - name: pytorch-con\n image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n command: [\"python3\"]\n args: [\"/mnt/ceph_rbd/example_pytorch_code.py\"]\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n resources:\n requests:\n cpu: 2\n memory: \"1Gi\"\n limits:\n cpu: 4\n memory: \"4Gi\"\n nvidia.com/gpu: 1\n nodeSelector:\n nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: pytorch-pvc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#clean-up","title":"Clean up","text":"kubectl -n <project-namespace> delete pod pytorch-job\n\nkubectl -n <project-namespace> delete pvc pytorch-pvc\n
"},{"location":"services/gpuservice/training/L4_template_workflow/","title":"Template workflow","text":""},{"location":"services/gpuservice/training/L4_template_workflow/#requirements","title":"Requirements","text":"It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes With Kubernetes before proceeding with this tutorial.
"},{"location":"services/gpuservice/training/L4_template_workflow/#overview","title":"Overview","text":"An example workflow for code development using K8s is outlined below.
In theory, users can create docker images with all the code, software and data included to complete their analysis.
In practice, docker images with the required software can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is then added.
Therefore, it is recommended to separate code, software, and data preparation into distinct steps:
Data Loading: Loading large data sets asynchronously.
Developing a Docker environment: Manually or automatically building Docker images.
Code development with K8s: Iteratively changing and testing code in a job.
The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service.
The three stages are interchangeable and may not be relevant to every project.
Some strategies in the workflow require a GitHub account and Docker Hub account for automatic building (this can be adapted for other platforms such as GitLab).
"},{"location":"services/gpuservice/training/L4_template_workflow/#data-loading","title":"Data loading","text":"The EIDF GPU service contains GPUs with 40Gb/80Gb of on board memory and it is expected that data sets of > 100 Gb will be loaded onto the service to utilise this hardware.
Persistent volume claims need to be of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO.
Read the requesting persistent volumes with Kubernetes lesson to learn how to request and mount persistent volumes to pods.
It often takes several hours or days to download data sets of 1/2 TB or more to a persistent volume.
Therefore, the data download step needs to be completed asynchronously as maintaining a contention to the server for long periods of time can be unreliable.
"},{"location":"services/gpuservice/training/L4_template_workflow/#asynchronous-data-downloading-with-a-lightweight-job","title":"Asynchronous data downloading with a lightweight job","text":"Check a PVC has been created.
kubectl -n <project-namespace> get pvc template-workflow-pvc\n
Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n metadata:\n name: lightweight-job\n spec:\n restartPolicy: Never\n containers:\n - name: data-loader\n image: alpine/curl:latest\n command: ['sh', '-c', \"cd /mnt/ceph_rbd; curl https://archive.ics.uci.edu/static/public/53/iris.zip -o iris.zip\"]\n resources:\n requests:\n cpu: 1\n memory: \"1Gi\"\n limits:\n cpu: 1\n memory: \"1Gi\"\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n
Run the data download job.
kubectl -n <project-namespace> create -f lightweight-pod.yaml\n
Check if the download has completed.
kubectl -n <project-namespace> get jobs\n
Delete the lightweight job once completed.
kubectl -n <project-namespace> delete job lightweight-job\n
Screen is a window manager available in Linux that allows you to create multiple interactive shells and swap between then.
Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect.
This allows you to start a task, such as downloading a data set, and check in on it asynchronously.
Once you have started a screen session, you can create a new window with ctrl-a c
, swap between windows with ctrl-a 0-9
and exit screen (but keep any task running) with ctrl-a d
.
Using screen rather than a single download job can be helpful if downloading multiple data sets or if you intend to do some simple QC or tidying up before/after downloading.
Start a screen session.
screen\n
Create an interactive lightweight job session.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n metadata:\n name: lightweight-pod\n spec:\n restartPolicy: Never\n containers:\n - name: data-loader\n image: alpine/curl:latest\n command: ['sleep','infinity']\n resources:\n requests:\n cpu: 1\n memory: \"1Gi\"\n limits:\n cpu: 1\n memory: \"1Gi\"\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n
Download data set. Change the curl URL to your data set of interest.
kubectl -n <project-namespace> exec <lightweight-pod-name> -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip\n
Exit the remote session by either ending the session or ctrl-a d
.
Reconnect at a later time and reattach the screen window.
screen -list\n\nscreen -r <session-name>\n
Check the download was successful and delete the job.
kubectl -n <project-namespace> exec <lightweight-pod-name> -- ls /mnt/ceph_rbd/\n\nkubectl -n <project-namespace> delete job lightweight-job\n
Exit the screen session.
exit\n
Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub.
It does not provide functionality to build images and create pods from docker files.
However, use cases may require some custom modifications of a base image, such as adding a python library.
These custom images need to be built locally (using docker) or online (using a GitHub/GitLab worker) and pushed to a repository such as Docker Hub.
This is not an introduction to building docker images, please see the Docker tutorial for a general overview.
"},{"location":"services/gpuservice/training/L4_template_workflow/#manually-building-a-docker-image-locally","title":"Manually building a Docker image locally","text":"Select a suitable base image (The Nvidia container catalog is often a useful starting place for GPU accelerated tasks). We'll use the base RAPIDS image.
Create a Dockerfile to add any additional packages required to the base image.
FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10\nRUN pip install pandas\nRUN pip install plotly\n
Build the Docker container locally (You will need to install Docker)
cd <dockerfile-folder>\n\ndocker build . -t <docker-hub-username>/template-docker-image:latest\n
Building images for different CPU architectures
Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture.
If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the --platform linux/amd64
flag to the build function.
Create a repository to hold the image on Docker Hub (You will need to create and setup an account).
Push the Docker image to the repository.
docker push <docker-hub-username>/template-docker-image:latest\n
Finally, specify your Docker image in the image:
tag of the job specification yaml file.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n spec:\n restartPolicy: Never\n containers:\n - name: template-docker-image\n image: <docker-hub-username>/template-docker-image:latest\n command: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n
In cases where the Docker image needs to be built and tested iteratively (i.e. to check for comparability issues), git version control and GitHub Actions can simplify the build process.
A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the docker file in a git repo.
This process requires you to already have a GitHub and Docker Hub account.
Create an access token on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo.
Create two GitHub secrets to securely provide your Docker Hub username and access token.
Add the dockerfile to a code/docker folder within an active GitHub repo.
Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected.
name: ci\non:\n push:\n paths:\n - 'code/docker/**'\n\njobs:\n docker:\n runs-on: ubuntu-latest\n steps:\n -\n name: Set up QEMU\n uses: docker/setup-qemu-action@v3\n -\n name: Set up Docker Buildx\n uses: docker/setup-buildx-action@v3\n -\n name: Login to Docker Hub\n uses: docker/login-action@v3\n with:\n username: ${{ secrets.DOCKERHUB_USERNAME }}\n password: ${{ secrets.DOCKERHUB_TOKEN }}\n -\n name: Build and push\n uses: docker/build-push-action@v5\n with:\n context: \"{{defaultContext}}:code/docker\"\n push: true\n tags: <target-dockerhub-image-name>\n
Push a change to the dockerfile and check the Docker Hub image is updated.
Production code can be included within a Docker image to aid reproducibility as the specific software versions required to run the code are packaged together.
However, binding the code to the docker image during development can delay the testing cycle as re-downloading all of the software for every change in a code block can take time.
If the docker image is consistent across tests, then it can be cached locally on the EIDFGPU Service instead of being re-downloaded (this occurs automatically although the cache is node specific and is not shared across nodes).
A pod yaml file can be defined to automatically pull the latest code version before running any tests.
Reducing the download time to fractions of a second allows rapid testing to be completed on the cluster with just the kubectl create
command.
You must already have a GitHub account to follow this process.
This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab).
A template GitHub repo with sample code, k8s yaml files and a Docker build Github Action is available here.
"},{"location":"services/gpuservice/training/L4_template_workflow/#create-a-job-that-downloads-and-runs-the-latest-code-version-at-runtime","title":"Create a job that downloads and runs the latest code version at runtime","text":"Write a standard yaml file for a k8s job with the required resources and custom docker image (example below)
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n spec:\n restartPolicy: Never\n containers:\n - name: template-docker-image\n image: <docker-hub-username>/template-docker-image:latest\n command: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n
Add an initial container that runs before the main container to download the latest version of the code.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n spec:\n restartPolicy: Never\n containers:\n - name: template-docker-image\n image: <docker-hub-username>/template-docker-image:latest\n command: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n - mountPath: /code\n name: github-code\n initContainers:\n - name: lightweight-git-container\n image: cicirello/alpine-plus-plus\n command: ['sh', '-c', \"cd /code; git clone <target-repo>\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n volumeMounts:\n - mountPath: /code\n name: github-code\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n - name: github-code\n emptyDir:\n sizeLimit: 1Gi\n
Change the command argument in the main container to run the code once started. Add the URL of the GitHub repo of interest to the initContainers: command:
tag.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n spec:\n restartPolicy: Never\n containers:\n - name: template-docker-image\n image: <docker-hub-username>/template-docker-image:latest\n command: ['sh', '-c', \"python3 /code/<python-script>\"]\n resources:\n requests:\n cpu: 10\n memory: \"40Gi\"\n limits:\n cpu: 10\n memory: \"80Gi\"\n nvidia.com/gpu: 1\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n - mountPath: /code\n name: github-code\n initContainers:\n - name: lightweight-git-container\n image: cicirello/alpine-plus-plus\n command: ['sh', '-c', \"cd /code; git clone <target-repo>\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n volumeMounts:\n - mountPath: /code\n name: github-code\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n - name: github-code\n emptyDir:\n sizeLimit: 1Gi\n
Submit the yaml file to kubernetes
kubectl -n <project-namespace> create -f <job-yaml-file>\n
EIDF hosts a Graphcore Bow Pod64 system for AI acceleration.
The specification of the Bow Pod64 is:
For more details about the IPU architecture, see documentation from Graphcore.
The smallest unit of compute resource that can be requested is a single IPU.
Similarly to the EIDF GPU Service, usage of the Graphcore is managed using Kubernetes.
"},{"location":"services/graphcore/#service-access","title":"Service Access","text":"Access to the Graphcore accelerator is provisioning through the EIDF GPU Service.
Users should apply for access to Graphcore via the EIDF GPU Service.
"},{"location":"services/graphcore/#project-quotas","title":"Project Quotas","text":"Currently there is no active quota mechanism on the Graphcore accelerator. IPUJobs should be actively using partitions on the Graphcore.
"},{"location":"services/graphcore/#graphcore-tutorial","title":"Graphcore Tutorial","text":"The following tutorial teaches users how to submit tasks to the Graphcore system. This tutorial assumes basic familiary with submitting jobs via Kubernetes. For a tutorial on using Kubernetes, see the GPU service tutorial. For more in-depth lessons about developing applications for Graphcore, see the general documentation and guide for creating IPU jobs via Kubernetes.
Lesson Objective Getting started with IPU jobs a. How to send an IPUJob.b. Monitoring and Cancelling your IPUJob. Multi-IPU Jobs a. Using multiple IPUs for distributed training. Profiling with PopVision a. Enabling profiling in your code.b. Downloading the profile reports. Other Frameworks a. Using Tensorflow and PopART.b. Writing IPU programs with PopLibs (C++)."},{"location":"services/graphcore/#further-reading-and-help","title":"Further Reading and Help","text":"The Graphcore documentation provides information about using the Graphcore system.
The Graphcore examples repository on GitHub provides a catalogue of application examples that have been optimised to run on Graphcore IPUs for both training and inference. It also contains tutorials for using various frameworks.
IPUJobs
manages the launcher and worker pods
, therefore the pods will be deleted when the IPUJob
is deleted, using kubectl delete ipujobs <IPUJob-name>
. If only the pod
is deleted via kubectl delete pod
, the IPUJob
may respawn the pod
.
To see running or terminated IPUJobs
, run kubectl get ipujobs
.
'poptorch_cpp_error': Failed to acquire X IPU(s)
. Why?","text":"This error may appear when the IPUJob name is too long.
We have identified that for IPUJobs with metadata:name
length over 36 characters, this error may appear. A solution is to reduce the name to under 36 characters.
This guide assumes basic familiarity with Kubernetes (K8s) and usage of kubectl
. See GPU service tutorial to get started.
Graphcore provides prebuilt docker containers (full lists here) which contain the required libraries (pytorch, tensorflow, poplar etc.) and can be used directly within the K8s to run on the Graphcore IPUs.
In this tutorial we will cover running training with a single IPU. The subsequent tutorial will cover using multiple IPUs, which can be used for distrubed training jobs.
"},{"location":"services/graphcore/training/L1_getting_started/#creating-your-first-ipu-job","title":"Creating your first IPU job","text":"For our first IPU job, we will be using the Graphcore PyTorch (PopTorch) container image (graphcore/pytorch:3.3.0
) to run a simple example of training a neural network for classification on the MNIST dataset, which is provided here. More applications can be found in the repository https://github.com/graphcore/examples.
To get started:
mnist-training-ipujob.yaml
, then copy and save the following content into the file:apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: mnist-training-\nspec:\n # jobInstances defines the number of job instances.\n # More than 1 job instance is usually useful for inference jobs only.\n jobInstances: 1\n # ipusPerJobInstance refers to the number of IPUs required per job instance.\n # A separate IPU partition of this size will be created by the IPU Operator\n # for each job instance.\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: mnist-training\n image: graphcore/pytorch:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd;\n mkdir build;\n cd build;\n git clone https://github.com/graphcore/examples.git;\n cd examples/tutorials/simple_applications/pytorch/mnist;\n python -m pip install -r requirements.txt;\n python mnist_poptorch_code_only.py --epochs 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
to submit the job - run kubectl create -f mnist-training-ipujob.yaml
, which will give the following output:
ipujob.graphcore.ai/mnist-training-<random string> created\n
to monitor progress of the job - run kubectl get pods
, which will give the following output
NAME READY STATUS RESTARTS AGE\nmnist-training-<random string>-worker-0 0/1 Completed 0 2m56s\n
to read the result - run kubectl logs mnist-training-<random string>-worker-0
, which will give the following output (or similar)
...\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [00:23<00:00]\nEpochs: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:34<00:00, 34.18s/it]\n...\nAccuracy on test set: 97.08%\n
"},{"location":"services/graphcore/training/L1_getting_started/#monitoring-and-cancelling-your-ipu-job","title":"Monitoring and Cancelling your IPU job","text":"An IPU job creates an IPU Operator, which manages the required worker or launcher pods. To see running or complete IPUjobs
, run kubectl get ipujobs
, which will show:
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE\nmnist-training Completed 0 1 All instances done 10m\n
To delete the IPUjob
, run kubectl delete ipujobs <job-name>
, e.g. kubectl delete ipujobs mnist-training-<random string>
. This will also delete the associated worker pod mnist-training-<random string>-worker-0
.
Note: simply deleting the pod via kubectl delete pods mnist-training-<random-string>-worker-0
does not delete the IPU job, which will need to be deleted separately.
Note: you can list all pods via kubectl get all
or kubectl get pods
, but they do not show the ipujobs. These can be obtained using kubectl get ipujobs
.
Note: kubectl describe <pod-name>
provides verbose description of a specific pod.
The Graphcore IPU Operator (Kubernetes interface) extends the Kubernetes API by introducing a custom resource definition (CRD) named IPUJob
, which can be seen at the beginning of the included yaml file:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\n
An IPUJob
allows users to defineworkloads that can use IPUs. There are several fields specific to an IPUJob
:
job instances : This defines the number of jobs. In the case of training it should be 1.
ipusPerJobInstance : This defines the size of IPU partition that will be created for each job instance.
workers : This defines a Pod specification that will be used for Worker
Pods, including the container image and commands.
These fields have been populated in the example .yaml file. For distributed training (with multiple IPUs), additional fields need to be included, which will be described in the next lesson.
"},{"location":"services/graphcore/training/L1_getting_started/#additional-information","title":"Additional Information","text":"It is possible to further specify the restart policy (Always
/OnFailure
/Never
/ExitCode
) and clean up policy (Workers
/All
/None
); see here.
In this tutorial, we will cover how to run larger models, including examples provided by Graphcore on https://github.com/graphcore/examples. These may require distributed training on multiple IPUs.
The number of IPUs requested must be in powers of two, i.e. 1, 2, 4, 8, 16, 32, or 64.
"},{"location":"services/graphcore/training/L2_multiple_IPU/#first-example","title":"First example","text":"As an example, we will use 4 IPUs to perform the pre-training step of BERT, an NLP transformer model. The code is available from https://github.com/graphcore/examples/tree/master/nlp/bert/pytorch.
To get started, save and create an IPUJob with the following .yaml
file:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: bert-training-multi-ipu-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"4\"\n workers:\n template:\n spec:\n containers:\n - name: bert-training-multi-ipu\n image: graphcore/pytorch:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd ;\n mkdir build;\n cd build ;\n git clone https://github.com/graphcore/examples.git;\n cd examples/nlp/bert/pytorch;\n apt update ;\n apt upgrade -y;\n DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;\n pip3 install -r requirements.txt ;\n python3 run_pretraining.py --dataset generated --config pretrain_base_128_pod4 --training-steps 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Running the above IPUJob and querying the log via kubectl logs pod/bert-training-multi-ipu-<random string>-worker-0
should give:
...\nData loaded in 8.559805537108332 secs\n-----------------------------------------------------------\n-------------------- Device Allocation --------------------\nEmbedding --> IPU 0\nEncoder 0 --> IPU 1\nEncoder 1 --> IPU 1\nEncoder 2 --> IPU 1\nEncoder 3 --> IPU 1\nEncoder 4 --> IPU 2\nEncoder 5 --> IPU 2\nEncoder 6 --> IPU 2\nEncoder 7 --> IPU 2\nEncoder 8 --> IPU 3\nEncoder 9 --> IPU 3\nEncoder 10 --> IPU 3\nEncoder 11 --> IPU 3\nPooler --> IPU 0\nClassifier --> IPU 0\n-----------------------------------------------------------\n---------- Compilation/Loading from Cache Started ---------\n\n...\n\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [08:02<00:00]\nCompiled/Loaded model in 500.756152929971 secs\n-----------------------------------------------------------\n--------------------- Training Started --------------------\nStep: 0 / 0 - LR: 0.00e+00 - total loss: 10.817 - mlm_loss: 10.386 - nsp_loss: 0.432 - mlm_acc: 0.000 % - nsp_acc: 1.000 %: 0%| | 0/1 [00:16<?, ?it/s, throughput: 4035.0 samples/sec]\n-----------------------------------------------------------\n-------------------- Training Metrics ---------------------\nglobal_batch_size: 65536\ndevice_iterations: 1\ntraining_steps: 1\nTraining time: 16.245 secs\n-----------------------------------------------------------\n
"},{"location":"services/graphcore/training/L2_multiple_IPU/#details","title":"Details","text":"In this example, we have requested 4 IPUs:
ipusPerJobInstance: \"4\"\n
The python flag --config pretrain_base_128_pod4
uses one of the preset configurations for this model with 4 IPUs. Here we also use the --datset generated
flag to generate data rather than download the required dataset.
To provided sufficient shm for the IPU pod, it may be necessary to mount /dev/shm
as follows:
volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
It is also required to set spec.hostIPC
to true
:
hostIPC: true\n
and add a securityContext
to the container definition than enables the IPC_LOCK
capability:
securityContext:\n capabilities:\n add:\n - IPC_LOCK\n
Note: IPC_LOCK
allows for the RDMA software stack to use pinned memory \u2014 which is particularly useful for PyTorch dataloaders, which can be very memory hungry. This is since all data going to the IPUs go via the network interfaces (via 100Gbps ethernet).
In general, the graph compilation phase of running large models can require significant memory, and far less during the execution phase.
In the example above, it is possible to explicitly request the memory via:
resources:\n limits:\n memory: \"128Gi\"\n requests:\n memory: \"128Gi\"\n
which will succeed. (The graph compilation fails if only 32Gi
is requested.)
As a general guideline, 128GB memory should be enough for the majority of tasks, and rarely exceed 200GB even for jobs with high IPU count. In the example .yaml
script, we do not specifically request the memory.
In the example above, python is launched directly in the pod. When scaling up the number of IPUs (e.g. above 8 IPUs), it may be possible to run into a CPU bottleneck. This may be observed when the throughput scales sub-linearly with the number of data-parallel replicas (i.e. when doubling the IPU count, the performance does not double). This can also be verified by profiling the application and observing a significant proportion of runtime spent on host CPU workload.
In this case, Poprun can be used launch multiple instances. As an example, we will save the following .yaml configuratoin and run:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: bert-poprun-64ipus-\nspec:\n jobInstances: 1\n modelReplicasPerWorker: \"16\"\n ipusPerJobInstance: \"64\"\n workers:\n template:\n spec:\n containers:\n - name: bert-poprun-64ipus\n image: graphcore/pytorch:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd ;\n mkdir build;\n cd build ;\n git clone https://github.com/graphcore/examples.git;\n cd examples/nlp/bert/pytorch;\n apt update ;\n apt upgrade -y;\n DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;\n pip3 install -r requirements.txt ;\n OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 OMPI_ALLOW_RUN_AS_ROOT=1 \\\n poprun \\\n --allow-run-as-root 1 \\\n --vv \\\n --num-instances 1 \\\n --num-replicas 16 \\\n --mpi-global-args=\"--tag-output\" \\\n --ipus-per-replica 4 \\\n python3 run_pretraining.py \\\n --config pretrain_large_128_POD64 \\\n --dataset generated --training-steps 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Inspecting the log via kubectl logs <pod-name>
should produce:
...\n ===========================================================================================\n| poprun topology |\n|===========================================================================================|\n10:10:50.154 1 POPRUN [D] Done polling, final state of p-bert-poprun-64ipus-gc-dev-0: PS_ACTIVE\n10:10:50.154 1 POPRUN [D] Target options from environment: {}\n| hosts | localhost |\n|-----------|-------------------------------------------------------------------------------|\n| ILDs | 0 |\n|-----------|-------------------------------------------------------------------------------|\n| instances | 0 |\n|-----------|-------------------------------------------------------------------------------|\n| replicas | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |\n -------------------------------------------------------------------------------------------\n10:10:50.154 1 POPRUN [D] Target options from V-IPU partition: {\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\"instanceSize\":\"64\"}\n10:10:50.154 1 POPRUN [D] Using target options: {\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\"instanceSize\":\"64\"}\n10:10:50.203 1 POPRUN [D] No hosts specified; ignoring host-subnet setting\n10:10:50.203 1 POPRUN [D] Default network/RNIC for host communication: None\n10:10:50.203 1 POPRUN [I] Running command: /opt/poplar/bin/mpirun '--tag-output' '--bind-to' 'none' '--tag-output'\n'--allow-run-as-root' '-np' '1' '-x' 'POPDIST_NUM_TOTAL_REPLICAS=16' '-x' 'POPDIST_NUM_IPUS_PER_REPLICA=4' '-x'\n'POPDIST_NUM_LOCAL_REPLICAS=16' '-x' 'POPDIST_UNIFORM_REPLICAS_PER_INSTANCE=1' '-x' 'POPDIST_REPLICA_INDEX_OFFSET=0' '-x'\n'POPDIST_LOCAL_INSTANCE_INDEX=0' '-x' 'IPUOF_VIPU_API_HOST=10.21.21.129' '-x' 'IPUOF_VIPU_API_PORT=8090' '-x'\n'IPUOF_VIPU_API_PARTITION_ID=p-bert-poprun-64ipus-gc-dev-0' '-x' 'IPUOF_VIPU_API_TIMEOUT=120' '-x' 'IPUOF_VIPU_API_GCD_ID=0'\n'-x' 'IPUOF_LOG_LEVEL=WARN' '-x' 'PATH' '-x' 'LD_LIBRARY_PATH' '-x' 'PYTHONPATH' '-x' 'POPLAR_TARGET_OPTIONS=\n{\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\n\"instanceSize\":\"64\"}' 'python3' 'run_pretraining.py' '--config' 'pretrain_large_128_POD64' '--dataset' 'generated' '--training-steps' '1'\n10:10:50.204 1 POPRUN [I] Waiting for mpirun (PID 4346)\n[1,0]<stderr>: Registered metric hook: total_compiling_time with object: <function get_results_for_compile_time at 0x7fe0a6e8af70>\n[1,0]<stderr>:Using config: pretrain_large_128_POD64\n...\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [10:11<00:00][1,0]<stderr>:\n[1,0]<stderr>:Compiled/Loaded model in 683.6591004971415 secs\n[1,0]<stderr>:-----------------------------------------------------------\n[1,0]<stderr>:--------------------- Training Started --------------------\nStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %: 0%| | 0/1 [00:03<?, ?itStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %: 0%| | 0/1 [00:03<?, ?itStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %: 0%| | 0/1 [00:03<?, ?it/s, throughput: 17692.1 
samples/sec][1,0]<stderr>:\n[1,0]<stderr>:-----------------------------------------------------------\n[1,0]<stderr>:-------------------- Training Metrics ---------------------\n[1,0]<stderr>:global_batch_size: 65536\n[1,0]<stderr>:device_iterations: 1\n[1,0]<stderr>:training_steps: 1\n[1,0]<stderr>:Training time: 3.718 secs\n[1,0]<stderr>:-----------------------------------------------------------\n
"},{"location":"services/graphcore/training/L2_multiple_IPU/#notes-on-using-the-examples-respository","title":"Notes on using the examples respository","text":"Graphcore provides examples of a variety of models on Github https://github.com/graphcore/examples. When following the instructions, note that since we are using a container within a Kubernetes pod, there is no need to enable the Poplar/PopART SDK, set up a virtual python environment, or install the PopTorch wheel.
"},{"location":"services/graphcore/training/L3_profiling/","title":"Profiling with PopVision","text":"Graphcore provides various tools for profiling, debugging, and instrumenting programs run on IPUs. In this tutorial we will briefly demonstrate an example using the PopVision Graph Analyser. For more information, see Profiling and Debugging and PopVision Graph Analyser User Guide.
We will reuse the same PyTorch MNIST example from lesson 1 (from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/pytorch/mnist).
To enable profiling and create IPU reports, we need to add the following line to the training script mnist_poptorch_code_only.py
:
training_opts = training_opts.enableProfiling()\n
(for details of the API, see the API reference)
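Alternatively, profiling can often be enabled without editing the training script by setting the Poplar engine options in the environment before the run starts. The exact option keys below are an assumption from memory and should be checked against the PopVision documentation:
export POPLAR_ENGINE_OPTIONS='{\"autoReport.all\":\"true\", \"autoReport.directory\":\"./training\"}'\n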
Save and run kubectl create -f <yaml-file>
on the following:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: mnist-training-profiling-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: mnist-training-profiling\n image: graphcore/pytorch:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd;\n mkdir build;\n cd build;\n git clone https://github.com/graphcore/examples.git;\n cd examples/tutorials/simple_applications/pytorch/mnist;\n python -m pip install -r requirements.txt;\n sed -i '131i training_opts = training_opts.enableProfiling()' mnist_poptorch_code_only.py;\n python mnist_poptorch_code_only.py --epochs 1;\n echo 'RUNNING ls ./training';\n ls training\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
After completion, using kubectl logs <pod-name>
, we can see the following result
...\nAccuracy on test set: 96.69%\nRUNNING ls ./training\narchive.a\nprofile.pop\n
We can see that the training has created two Poplar report files: archive.a
which is an archive of the ELF executable files, one for each tile; and profile.pop
, the poplar profile, which contains compile-time and execution information about the Poplar graph.
To download the training profiles to your local environment, you can use kubectl cp
. For example, run
kubectl cp <pod-name>:/root/build/examples/tutorials/simple_applications/pytorch/mnist/training .\n
Once you have downloaded the profile report files, you can view the contents locally using the PopVision Graph Analyser tool, which is available for download here https://www.graphcore.ai/developer/popvision-tools.
From the Graph Analyser, you can analyse information including memory usage, execution trace and more.
"},{"location":"services/graphcore/training/L4_other_frameworks/","title":"Other Frameworks","text":"In this tutorial we'll briefly cover running tensorflow and PopART for Machine Learning, and writing IPU programs directly via the PopLibs library in C++. Extra links and resources will be provided for more in-depth information.
"},{"location":"services/graphcore/training/L4_other_frameworks/#terminology","title":"Terminology","text":"Within Graphcore, Poplar
refers to the tools (e.g. Poplar Graph Engine
or Poplar Graph Compiler
) and libraries (PopLibs
) for programming on IPUs.
The Poplar SDK
is a package of software development tools, including
For more details see here.
"},{"location":"services/graphcore/training/L4_other_frameworks/#other-ml-frameworks-tensorflow-and-popart","title":"Other ML frameworks: Tensorflow and PopART","text":"Besides being able to run PyTorch code, as demonstrated in the previous lessons, the Poplar SDK also supports running ML learning applications with tensorflow or PopART.
"},{"location":"services/graphcore/training/L4_other_frameworks/#tensorflow","title":"Tensorflow","text":"The Poplar SDK includes implementation of TensorFlow and Keras for the IPU.
For more information, refer to Targeting the IPU from TensorFlow 2 and TensorFlow 2 Quick Start.
These are available from the image graphcore/tensorflow:2
.
For a quick example, we will run an example script from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/tensorflow2/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file>
to create the IPUJob:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: tensorflow-example-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: tensorflow-example\n image: graphcore/tensorflow:2\n command: [/bin/bash, -c, --]\n args:\n - |\n apt update;\n apt upgrade -y;\n apt install git -y;\n cd;\n mkdir build;\n cd build;\n git clone https://github.com/graphcore/examples.git;\n cd examples/tutorials/simple_applications/tensorflow2/mnist;\n python -m pip install -r requirements.txt;\n python mnist_code_only.py --epochs 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Running kubectl logs <pod>
should show results similar to the following
...\n2023-10-25 13:21:40.263823: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:43] Poplar version: 3.2.0 (1513789a51) Poplar package: b82480c629\n2023-10-25 13:21:42.203515: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1619] TensorFlow device /device:IPU:0 attached to 1 IPU with Poplar device ID: 0\nDownloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz\n11493376/11490434 [==============================] - 0s 0us/step\n11501568/11490434 [==============================] - 0s 0us/step\n2023-10-25 13:21:43.789573: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)\n2023-10-25 13:21:44.164207: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.\n2023-10-25 13:21:57.935339: I tensorflow/compiler/jit/xla_compilation_cache.cc:376] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.\nEpoch 1/4\n2000/2000 [==============================] - 17s 8ms/step - loss: 0.6188\nEpoch 2/4\n2000/2000 [==============================] - 1s 427us/step - loss: 0.3330\nEpoch 3/4\n2000/2000 [==============================] - 1s 371us/step - loss: 0.2857\nEpoch 4/4\n2000/2000 [==============================] - 1s 439us/step - loss: 0.2568\n
"},{"location":"services/graphcore/training/L4_other_frameworks/#popart","title":"PopART","text":"The Poplar Advanced Run Time (PopART) enables importing and constructing ONNX graphs, and running graphs in inference, evaluation or training modes. PopART provides both a C++ and Python API.
For more information, see the PopART User Guide
PopART is available from the image graphcore/popart
.
For a quick example, we will run an example script from https://github.com/graphcore/tutorials/tree/sdk-release-3.1/simple_applications/popart/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file>
to create the IPUJob:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: popart-example-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: popart-example\n image: graphcore/popart:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd ;\n mkdir build;\n cd build ;\n git clone https://github.com/graphcore/tutorials.git;\n cd tutorials;\n git checkout sdk-release-3.1;\n cd simple_applications/popart/mnist;\n python3 -m pip install -r requirements.txt;\n ./get_data.sh;\n python3 popart_mnist.py --epochs 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Running kubectl logs <pod>
should show results similar to the following
...\nCreating ONNX model.\nCompiling the training graph.\nCompiling the validation graph.\nRunning training loop.\nEpoch #1\n Loss=16.2605\n Accuracy=88.88%\n
"},{"location":"services/graphcore/training/L4_other_frameworks/#writing-ipu-programs-directly-with-poplibs","title":"Writing IPU programs directly with PopLibs","text":"The Poplar libraries are a set of C++ libraries consisting of the Poplar graph library and the open-source PopLibs libraries.
The Poplar graph library provides direct access to the IPU by code written in C++. You can write complete programs using Poplar, or use it to write functions to be called from your application written in a higher-level framework such as TensorFlow.
The PopLibs libraries are a set of application libraries that implement operations commonly required by machine learning applications, such as linear algebra operations, element-wise tensor operations, non-linearities and reductions. These provide a fast and easy way to create programs that run efficiently using the parallelism of the IPU.
For more information, see Poplar Quick Start and Poplar and PopLibs User Guide.
These are available from the image graphcore/poplar
.
When using the PopLibs libraries, you will have to include the header files in the include/popops
directory, e.g.
#include <include/popops/ElementWise.hpp>\n
and to link the relevant PopLibs libraries, in addition to the Poplar library, e.g.
g++ -std=c++11 my-program.cpp -lpoplar -lpopops\n
For a quick example, we will run an example from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/poplar/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file>
to create the IPUJob:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: poplib-example-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: poplib-example\n image: graphcore/poplar:3.3.0\n command: [\"bash\"]\n args: [\"-c\", \"cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/poplar/mnist/ && ./get_data.sh && make && ./regression-demo -IPU 1 50\"]\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Running kubectl logs <pod>
should show results similar to the following
...\nUsing the IPU\nTrying to attach to IPU\nAttached to IPU 0\nTarget:\n Number of IPUs: 1\n Tiles per IPU: 1,472\n Total Tiles: 1,472\n Memory Per-Tile: 624.0 kB\n Total Memory: 897.0 MB\n Clock Speed (approx): 1,850.0 MHz\n Number of Replicas: 1\n IPUs per Replica: 1\n Tiles per Replica: 1,472\n Memory per Replica: 897.0 MB\n\nGraph:\n Number of vertices: 5,466\n Number of edges: 16,256\n Number of variables: 41,059\n Number of compute sets: 20\n\n...\n\nEpoch 1 (99%), accuracy 76%\n
"},{"location":"services/jhub/","title":"EIDF Notebook Service","text":"The EIDF Notebook Service is a scalable Jupyterhub deployment in the EIDF Data Science Cloud.
The Notebook Service is open to all EIDF users and offers a selection of data science environments and user interfaces, including Jupyter notebooks, Jupyter Lab and RStudio.
Follow Quickstart to start using the EIDF Notebook Service.
"},{"location":"services/jhub/quickstart/","title":"Quickstart","text":""},{"location":"services/jhub/quickstart/#accessing","title":"Accessing","text":"Access the EIDF Notebooks in your browser by opening https://notebook.eidf.ac.uk/. You must be a member of an active EIDF project and have a user account to use the EIDF Notebook Service.
Click on \"Sign In with SAFE\". You will be redirected to the SAFE login page.
Log into the SAFE if you're not logged in already. If you have more than one account you will be presented with the form \"Approve Token\" and a choice of user accounts for the Notebook Service. This account is the user in your notebooks and you can share data with your DSC VMs within the same project.
Select the account you would like to use from the dropdown \"User Account\" at the end of the form. Then press \"Accept\" to return to the EIDF Notebook Service where you can select a server environment.
Select the environment that you would like to use for your notebooks and press \"Start\". Now your notebook container will be launched. This may take a little while.
"},{"location":"services/jhub/quickstart/#first-notebook","title":"First Notebook","text":"You will be presented with the JupyterLab dashboard view when the container has started.
The availability of launchers depends on the environment that you selected.
For example launch a Python 3 notebook or an R notebook from the dashboard. You can also launch a terminal session.
"},{"location":"services/jhub/quickstart/#python-packages","title":"Python packages","text":"Note that Python packages are installed into the system space of your container by default. However this means that they are not available after a restart of your notebook container which may happen when your session was idle for a while. We recommend specifying --user
to install packages into your user directory to preserve installations across sessions.
To install python packages in a notebook use the command:
!pip install <package> --user\n
or run the command in a terminal:
pip install <package> --user\n
"},{"location":"services/jhub/quickstart/#data","title":"Data","text":"There is a project space mounted in /project_data
. Only project accounts have permissions to view and write to their project folder in this space. Here you can share data with other notebook users in your project. Data placed in /project_data/shared
is shared with other notebook users outside your project.
You can also share data with DSC VMs in your project. Please contact the helpdesk if you would like to mount this project space to one of your VMs.
"},{"location":"services/jhub/quickstart/#limits","title":"Limits","text":"Note that there are limited amounts of memory and cores available per user. Users do not have sudo permissions in the containers so you cannot install any system packages.
Currently there is no access to GPUs. You can submit jobs to the EIDF GPU Service but you cannot run your notebooks on a GPU.
"},{"location":"services/mft/","title":"MFT","text":""},{"location":"services/mft/quickstart/","title":"Managed File Transfer","text":""},{"location":"services/mft/quickstart/#getting-to-the-mft","title":"Getting to the MFT","text":"The EIDF MFT can be accessed at https://eidf-mft.epcc.ed.ac.uk
"},{"location":"services/mft/quickstart/#how-it-works","title":"How it works","text":"The MFT provides a 'drop zone' for the project. All users in a given project will have access to the same shared transfer area. They will have the ability to upload, download, and delete files from the project's transfer area. This area is linked to a directory within the projects space on the shared backend storage.
Files which are uploaded are owned by the Linux user 'nobody' and the group ID of whatever project the file is being uploaded to. They have the permissions: Owner = rw Group = r Others = r
Once the file is opened on the VM, the user that opened it will become the owner and they can make further changes.
"},{"location":"services/mft/quickstart/#gaining-access-to-the-mft","title":"Gaining access to the MFT","text":"By default a project won't have access to the MFT, this has to be enabled. Currently this can be done by the PI sending a request to the EIDF Helpdesk. Once the project is enabled within the MFT, every user with the project will be able to log into the MFT using their usual EIDF credentials.
Once MFT access has been enabled for a project, PIs can give a project user access to the MFT. A new 'eidf-mft' machine option will be available for each user within the portal, which the PI can select to grant the user access to the MFT.
"},{"location":"services/mft/using-the-mft/","title":"Using the MFT Web Portal","text":""},{"location":"services/mft/using-the-mft/#logging-in-to-the-web-browser","title":"Logging in to the web browser","text":"When you reach the MFT home page you can log in using your usual VM project credentials.
You will then be asked what type of session you would like to start. Select New Web Client or Web Client and continue.
"},{"location":"services/mft/using-the-mft/#file-ingress","title":"File Ingress","text":"Once logged in, all files currently in the projects transfer directory will be displayed. Click the 'Upload' button under the 'Home' title to open the dialogue for file upload. You can then drag and drop files in, or click 'Browse' to find them locally.
Once uploaded, the file will be immediately accessible from the project area, and can be used within any EIDF service which has the filesystem mounted.
"},{"location":"services/mft/using-the-mft/#file-egress","title":"File Egress","text":"File egress can be done in the reverse way. By placing the file into the project transfer directory, it will become available in the MFT portal.
"},{"location":"services/mft/using-the-mft/#file-management","title":"File Management","text":"Directories can be created within the project transfer directory, for example with 'Import' and 'Export' to allow for better file management. Files deleted from either the MFT portal or from the VM itself will remove it from the other, as both locations point at the same file. It's only stored in one place, so modifications made from either place will remove the file.
"},{"location":"services/mft/using-the-mft/#sftp","title":"SFTP","text":"Once a project and user have access to the MFT, they can connect to it using SFTP as well as through the web browser.
This can be done by logging into the MFT URL with the user's project account:
```bash
sftp [EIDF username]@eidf-mft.epcc.ed.ac.uk\n
```
"},{"location":"services/mft/using-the-mft/#scp","title":"SCP","text":"Files can be scripted to be upload to the MFT using SCP.
To copy a file to the project MFT area using SCP:
scp /path/to/file [EIDF username]@eidf-mft.epcc.ed.ac.uk:/\n
"},{"location":"services/s3/","title":"Overview","text":"The EIDF S3 Service is an object store with an interface that is compatible with a subset of the Amazon S3 RESTful API.
"},{"location":"services/s3/#service-access","title":"Service Access","text":"Users should have an EIDF account as described in EIDF Accounts.
Project leads can request an object store allocation through a request to the EIDF helpdesk.
"},{"location":"services/s3/#access-keys","title":"Access keys","text":"Select your project at https://portal.eidf.ac.uk/project/. Your access keys are displayed in the table at the top of the page.
For each account, the quota and the number of buckets that it is permitted to create is shown, as well as the access keys. Click on \"Secret\" to view the access secret. You will need the access key, the corresponding access secret and the endpoint https://s3.eidf.ac.uk
to connect to the EIDF S3 Service with an S3 client.
Access management: Project management guide to managing accounts and access permissions for your S3 allocation.
Tutorial: Examples using EIDF S3
"},{"location":"services/s3/manage/","title":"Manage EIDF S3 access","text":"Access keys and accounts for the object store are managed by project managers via the EIDF Portal.
"},{"location":"services/s3/manage/#request-an-allocation","title":"Request an allocation","text":"An object store allocation for a project may be requested by contacting the EIDF helpdesk.
"},{"location":"services/s3/manage/#object-store-accounts","title":"Object store accounts","text":"Select your project at https://portal.eidf.ac.uk/project/ and jump to \"S3 Allocation\" on the project page to manage access keys and accounts.
S3 buckets and objects are owned by an account. Each account has a quota for storage and the number of buckets that it can create. The sum of all account quotas is limited by the total storage quota of the project object store allocation shown at the top.
An account with the minimum storage quota (1B) and zero buckets is effectively read only as it may not create new buckets and so cannot upload files.
To create an account:
_
only)You will not be allowed to create an account with the quota greater than the available storage quota of the project.
It may take a little while for the account to become available. Refresh the project page to update the list of accounts.
"},{"location":"services/s3/manage/#access-keys","title":"Access keys","text":"To use S3 (listing or creating buckets, listing objects or uploading and downloading files) you need an access key and a secret. An account can own any number of access keys. These keys share the account's quota and have access to the same buckets.
To create an access key:
It can take a little while for the access keys to become available. Refresh the project page to update the list of keys.
"},{"location":"services/s3/manage/#access-key-permissions","title":"Access key permissions","text":"You can control which project members are allowed to view an access key and secret in the EIDF Portal or the SAFE. Project managers and the PI have access to all S3 accounts and can view associated access keys and secrets in the project management view.
To grant view permissions for an access key to a project member:
It can take a little while for the permissions update to complete.
Note
Anyone who knows an access key and secret will be able to perform the associated activities via the S3 API regardless of the view permissions.
"},{"location":"services/s3/manage/#delete-an-access-key","title":"Delete an access key","text":"Click on the \"Bin\" icon next to a key and press \"Delete\" on the form.
"},{"location":"services/s3/tutorial/","title":"Tutorial","text":"Buckets owned by an EIDF project are placed in a tenancy in the EIDF S3 Service. The project code is a prefix on the bucket name, separated by a colon (:
), for example eidfXX1:somebucket
. Note that some S3 client libraries do not accept bucket names in this format.
The following examples use the AWS Command Line Interface (AWS CLI) to connect to EIDF S3.
"},{"location":"services/s3/tutorial/#setup","title":"Setup","text":"Install with pip
python -m pip install awscli\n
Installers are available for various platforms if you are not using Python: see https://aws.amazon.com/cli/
"},{"location":"services/s3/tutorial/#configure","title":"Configure","text":"Set your access key and secret as environment variables or configure a credentials file at ~/.aws/credentials
on Linux or %USERPROFILE%\\.aws\\credentials
on Windows.
Credentials file:
[default]\naws_access_key_id=<key>\naws_secret_access_key=<secret>\n
Environment variables:
export AWS_ACCESS_KEY_ID=<key>\nexport AWS_SECRET_ACCESS_KEY=<secret>\n
The parameter --endpoint-url https://s3.eidf.ac.uk
must always be set when calling a command.
List the buckets in your account:
aws s3 ls --endpoint-url https://s3.eidf.ac.uk\n
Create a bucket:
aws s3api create-bucket --bucket <bucketname> --endpoint-url https://s3.eidf.ac.uk\n
Upload a file:
aws s3 cp <filename> s3://<bucketname> --endpoint-url https://s3.eidf.ac.uk\n
Check that the file above was uploaded successfully by listing objects in the bucket:
aws s3 ls s3://<bucketname> --endpoint-url https://s3.eidf.ac.uk\n
To read from a public bucket without providing credentials, add the option --no-sign-request
to the call:
aws s3 ls s3://<bucketname> --no-sign-request --endpoint-url https://s3.eidf.ac.uk\n
"},{"location":"services/s3/tutorial/#python-using-boto3","title":"Python using boto3
","text":"The following examples use the Python library boto3
.
Installation:
python -m pip install boto3\n
"},{"location":"services/s3/tutorial/#usage","title":"Usage","text":"By default, the boto3
Python library raises an error that bucket names with a colon :
(as used by the EIDF S3 Service) are invalid, so we have to switch off the bucket name validation:
import boto3\nfrom botocore.handlers import validate_bucket_name\n\ns3 = boto3.resource('s3', endpoint_url='https://s3.eidf.ac.uk')\ns3.meta.client.meta.events.unregister('before-parameter-build.s3', validate_bucket_name)\n
List buckets:
for bucket in s3.buckets.all():\n print(f'{bucket.name}')\n
List objects in a bucket:
project_code = 'eidfXXX'\nbucket_name = 'somebucket'\nbucket = s3.Bucket(f'{project_code}:{bucket_name}')\nfor obj in bucket.objects.all():\n print(f'{obj.key}')\n
Upload a file to a bucket:
bucket = s3.Bucket(f'{project_code}:{bucket_name}')\nbucket.upload_file('./somedata.csv', 'somedata.csv')\n
"},{"location":"services/s3/tutorial/#access-policies","title":"Access policies","text":"Bucket permissions use IAM policies. You can grant other accounts (within the same project or from other projects) read or write access to your buckets. For example to grant permissions to put, get, delete and list objects in bucket eidfXX1:somebucket
to the account account2
in project eidfXX2
:
{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Sid\": \"AllowAccessToBucket\",\n \"Principal\": {\n \"AWS\": [\n \"arn:aws:iam::eidfXX2:user/account2\",\n ]\n },\n \"Effect\": \"Allow\",\n \"Action\": [\n \"s3:PutObject\",\n \"s3:GetObject\",\n \"s3:ListBucket\",\n \"s3:DeleteObject\",\n ],\n \"Resource\": [\n \"arn:aws:s3:::/*\",\n \"arn:aws:s3::eidfXX1:somebucket\"\n ]\n }\n ]\n}\n
You can chain multiple policies in the statement array:
{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Principal\": { ... }\n \"Effect\": \"Allow\",\n \"Action\": [ ... ],\n \"Resource\": [ ... ]\n },\n {\n \"Principal\": { ... }\n \"Effect\": \"Allow\",\n \"Action\": [ ... ],\n \"Resource\": [ ... ]\n }\n ]\n}\n
Give public read access to a bucket (listing and downloading files):
{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Effect\": \"Allow\",\n \"Principal\": \"*\",\n \"Action\": [\"s3:ListBucket\"],\n \"Resource\": [\n f\"arn:aws:s3::eidfXX1:somebucket\"\n ]\n },\n {\n \"Effect\": \"Allow\",\n \"Principal\": \"*\",\n \"Action\": [\"s3:GetObject\"],\n \"Resource\": [\n f\"arn:aws:s3::eidfXX1:somebucket/*\"\n ]\n }\n ]\n}\n
"},{"location":"services/s3/tutorial/#set-policy-using-aws-cli","title":"Set policy using AWS CLI","text":"Grant permissions stored in an IAM policy file:
aws put-bucket-policy --bucket <bucketname> --policy \"$(cat bucket-policy.json)\"\n
"},{"location":"services/s3/tutorial/#set-policy-using-python-boto3","title":"Set policy using Python boto3
","text":"Grant permissions to another account: In this example we grant ListBucket
and GetObject
permissions to account account1
in project eidfXX1
and account2
in project eidfXX2
.
import json\n\nbucket_policy = {\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Effect\": \"Allow\",\n \"Principal\": {\n \"AWS\": [\n \"arn:aws:iam::eidfXX1:user/account1\",\n \"arn:aws:iam::eidfXX2:user/account2\",\n ]\n },\n \"Action\": [\n \"s3:ListBucket\",\n \"s3:GetObject\"\n ],\n \"Resource\": [\n f\"arn:aws:s3::eidfXX1:{bucket_name}\"\n f\"arn:aws:s3::eidfXX1:{bucket_name}/*\"\n ]\n }\n ]\n}\n\npolicy = bucket.Policy()\npolicy.put(Policy=json.dumps(bucket_policy))\n
"},{"location":"services/ultra2/","title":"Ultra2 Large Memory System","text":"Overview
Connect
Running jobs
"},{"location":"services/ultra2/access/","title":"Overview","text":"Ultra2 is a single logical CPU system based at EPCC. It is suitable for running jobs which require large volumes of non-distributed memory (as opposed to a cluster).
"},{"location":"services/ultra2/access/#specifications","title":"Specifications","text":"The system is a HPE SuperDome Flex containing 576 individual cores in a SMT-1 arrangement (1 thread per core). The system has 18TB of memory available to users. Home directories are network mounted from the EIDF e1000 Lustre filesystem, although some local NVMe storage is available for temporary file storage during runs.
"},{"location":"services/ultra2/access/#getting-access","title":"Getting Access","text":"Access to Ultra2 is currently by arrangement with EPCC. Please email eidf@epcc.ed.ac.uk with a short description of the work you would like to perform.
"},{"location":"services/ultra2/connect/","title":"Login","text":"The hostname for SSH access to the system is ultra2.eidf.ac.uk
To access Ultra2, you need to use two credentials: your SSH key pair protected by a passphrase and a Time-based one-time password (TOTP).
"},{"location":"services/ultra2/connect/#ssh-key","title":"SSH Key","text":"You must upload the public part of your SSH key pair to the SAFE by following the instructions from the SAFE documentation
"},{"location":"services/ultra2/connect/#time-based-one-time-password-totp","title":"Time-based one-time password (TOTP)","text":"You must set up your TOTP token by following the instructions from the SAFE documentation
"},{"location":"services/ultra2/connect/#ssh-login-example","title":"SSH Login example","text":"To login to Ultra2, you will need to use the SSH Key and TOTP token as noted above. With the appropriate key loadedssh <username>@ultra2.eidf.ac.uk
will then prompt you, roughly once per day, for your TOTP code.
The primary HPC software provided is Intel's OneAPI suite containing mpi compilers and runtimes, debuggers and the vTune performance analyser. Standard GNU compilers are also available. The OneAPI suite can be loaded by sourcing the shell script:
source /opt/intel/oneapi/setvars.sh\n
"},{"location":"services/ultra2/run/#queue-system","title":"Queue system","text":"All jobs must be run via SLURM to avoid inconveniencing other users of the system. Users should not run jobs directly. Note that the system has one logical processor with a large number of threads and thus appears to SLURM as a single node. This is intentional.
"},{"location":"services/ultra2/run/#queue-limits","title":"Queue limits","text":"We kindly request that users limit their maximum total running job size to 288 cores and 4TB of memory, whether that be a divided into a single job, or a number of jobs. This may be enforced via SLURM in the future.
"},{"location":"services/ultra2/run/#example-mpi-job","title":"Example MPI job","text":"An example script to run a multi-process MPI \"Hello world\" example is shown.
#!/usr/bin/env bash\n#SBATCH -J HelloWorld\n#SBATCH --nodes=1\n#SBATCH --tasks-per-node=4\n#SBATCH --nodelist=sdf-cs1\n#SBATCH --partition=standard\n##SBATCH --exclusive\n\n\necho \"Running on host ${HOSTNAME}\"\necho \"Using ${SLURM_NTASKS_PER_NODE} tasks per node\"\necho \"Using ${SLURM_CPUS_PER_TASK} cpus per task\"\nlet mpi_threads=${SLURM_NTASKS_PER_NODE}*${SLURM_CPUS_PER_TASK}\necho \"Using ${mpi_threads} MPI threads\"\n\n# Source oneAPI to ensure mpirun available\nif [[ -z \"${SETVARS_COMPLETED}\" ]]; then\nsource /opt/intel/oneapi/setvars.sh\nfi\n\n# mpirun invocation for Intel suite.\nmpirun -n ${mpi_threads} ./helloworld.exe\n
"},{"location":"services/virtualmachines/","title":"Overview","text":"The EIDF Virtual Machine (VM) Service is the underlying infrastructure upon which the EIDF Data Science Cloud (DSC) is built.
The service currenly has a mixture of hardware node types which host VMs of various flavours:
The shapes and sizes of the flavours are based on subdivisions of this hardware, noting that CPUs are 4x oversubscribed for mcomp nodes (general VM flavours).
"},{"location":"services/virtualmachines/#service-access","title":"Service Access","text":"Users should have an EIDF account - EIDF Accounts.
Project Leads will be able to have access to the DSC added to their project during the project application process or through a request to the EIDF helpdesk.
"},{"location":"services/virtualmachines/#additional-service-policy-information","title":"Additional Service Policy Information","text":"Additional information on service policies can be found here.
"},{"location":"services/virtualmachines/docs/","title":"Service Documentation","text":""},{"location":"services/virtualmachines/docs/#project-management-guide","title":"Project Management Guide","text":""},{"location":"services/virtualmachines/docs/#required-member-permissions","title":"Required Member Permissions","text":"VMs and user accounts can only be managed by project members with Cloud Admin permissions. This includes the principal investigator (PI) of the project and all project managers (PM). Through SAFE the PI can designate project managers and the PI and PMs can grant a project member the Cloud Admin role:
For details please refer to the SAFE documentation: How can I designate a user as a project manager?
"},{"location":"services/virtualmachines/docs/#create-a-vm","title":"Create a VM","text":"To create a new VM:
eidfxxx
Complete the 'Create Machine' form as follows:
dev-01
. The project code will be prepended automatically to your VM name, in this case your VM would be named eidfxxx-dev-01
.Click on 'Create'
You may wish to ensure that the machine size selected (number of CPUs and RAM) does not exceed your remaining quota before you press Create, otherwise the request will fail.
In the list of 'Machines' in the project page in the portal, click on the name of new VM to see the configuration and properties, including the machine specification, its 10.24.*.*
IP address and any configured VDI connections.
Each project has a quota for the number of instances, total number of vCPUs, total RAM and storage. You will not be able to create a VM if it exceeds the quota.
You can view and refresh the project usage compared to the quota in a table near the bottom of the project page. This table will be updated automatically when VMs are created or removed, and you can refresh it manually by pressing the \"Refresh\" button at the top of the table.
Please contact the helpdesk if your quota requirements have changed.
"},{"location":"services/virtualmachines/docs/#add-a-user-account","title":"Add a user account","text":"User accounts allow project members to log in to the VMs in a project. The Project PI and project managers manage user accounts for each member of the project. Users usually use one account (username and password) to log in to all the VMs in the same project that they can access, however a user may have multiple accounts in a project, for example for different roles.
Complete the 'Create User Account' form as follows:
The user can now set the password for their new account on the account details page.
"},{"location":"services/virtualmachines/docs/#adding-access-to-the-vm-for-a-user","title":"Adding Access to the VM for a User","text":"User accounts can be granted or denied access to existing VMs.
If a user is logged in already to the VDI at https://eidf-vdi.epcc.ed.ac.uk/vdi newly added connections may not appear in their connections list immediately. They must log out and log in again to refresh the connection information, or wait until the login token expires and is refreshed automatically - this might take a while.
If a user only has one connection available in the VDI they will be automatically directed to the VM with the default connection.
"},{"location":"services/virtualmachines/docs/#sudo-permissions","title":"Sudo permissions","text":"A project manager or PI may also grant sudo permissions to users on selected VMs. Management of sudo permissions must be requested in the project application - if it was not requested or the request was denied the functionality described below is not available.
After a few minutes, the job to give the user account sudo permissions on the selected VMs will complete. On the account detail page a \"sudo\" badge will appear next to the selected VMs.
Please contact the helpdesk if sudo permission management is required but is not available in your project.
"},{"location":"services/virtualmachines/docs/#first-login","title":"First login","text":"A new user account must reset the password before they can log in for the first time. To do this:
If you did not select RDP access when you created the VM you can add it later:
Once the RDP job is completed, all users that are allowed to access the VM will also be permitted to use the RDP connection.
"},{"location":"services/virtualmachines/docs/#software-catalogue","title":"Software catalogue","text":"You can install packages from the software catalogue at a later time, even if you didn't select a package when first creating the machine.
It is the responsibility of project PIs to keep the VMs in their projects up to date as stated in the policy.
"},{"location":"services/virtualmachines/docs/#ubuntu","title":"Ubuntu","text":"To patch and update packages on Ubuntu run the following commands (requires sudo permissions):
sudo apt update\nsudo apt upgrade\n
Your system might require a restart after installing updates.
"},{"location":"services/virtualmachines/docs/#rocky","title":"Rocky","text":"To patch and update packages on Rocky run the following command (requires sudo permissions):
sudo dnf update\n
Your system might require a restart after installing updates.
"},{"location":"services/virtualmachines/docs/#reboot","title":"Reboot","text":"When logged in you can reboot a VM with this command (requires sudo permissions):
sudo reboot now\n
or use the reboot button in the EIDF Portal (requires project manager permissions).
"},{"location":"services/virtualmachines/flavours/","title":"Flavours","text":"These are the current Virtual Machine (VM) flavours (configurations) available on the the Virtual Desktop cloud service. Note that all VMs are built and configured using the EIDF Portal by PIs/Cloud Admins of projects, except GPU flavours which must be requested via the helpdesk or the support request form.
Flavour Name vCPUs DRAM in GB Pinned Cores GPU general.v2.tiny 1 2 No No general.v2.small 2 4 No No general.v2.medium 4 8 No No general.v2.large 8 16 No No general.v2.xlarge 16 32 No No capability.v2.8cpu 8 112 Yes No capability.v2.16cpu 16 224 Yes No capability.v2.32cpu 32 448 Yes No capability.v2.48cpu 48 672 Yes No capability.v2.64cpu 64 896 Yes No gpu.v1.8cpu 8 128 Yes Yes gpu.v1.16cpu 16 256 Yes Yes gpu.v1.32cpu 32 512 Yes Yes gpu.v1.48cpu 48 768 Yes Yes"},{"location":"services/virtualmachines/policies/","title":"EIDF Data Science Cloud Policies","text":""},{"location":"services/virtualmachines/policies/#end-of-life-policy-for-user-accounts-and-projects","title":"End of Life Policy for User Accounts and Projects","text":""},{"location":"services/virtualmachines/policies/#what-happens-when-an-account-or-project-is-no-longer-required-or-a-user-leaves-a-project","title":"What happens when an account or project is no longer required, or a user leaves a project","text":"These situations are most likely to come about during one of the following scenarios:
For each user account involved, assuming the relevant consent is given, the next step can be summarised as one of the following actions:
It will be possible to have the account re-activated up until resources are removed (as outlined above); after this time it will be necessary to re-apply.
A user's right to use EIDF is granted by a project. Our policy is to treat the account and associated data as the property of the PI as the owner of the project and its resources. It is the user's responsibility to ensure that any data they store on the EIDF DSC is handled appropriately and to copy off anything that they wish to keep to an appropriate location.
A project manager or the PI can revoke a user's access accounts within their project at any time, by locking, removing or re-owning the account as appropriate.
A user may give up access to an account and return it to the control of the project at any time.
When a project is due to end, the PI will receive notification of the closure of the project and its accounts one month before all project accounts and DSC resources (VMs, data volumes) are closed and cleaned or removed.
"},{"location":"services/virtualmachines/policies/#backup-policies","title":"Backup policies","text":"The current policy is:
We strongly advise that you keep copies of any critical data on an alternative system that is fully backed up.
"},{"location":"services/virtualmachines/policies/#patching-of-user-vms","title":"Patching of User VMs","text":"The EIDF team updates and patches the hypervisors and the cloud management software as part of the EIDF Maintenance sessions. It is the responsibility of project PIs to keep the VMs in their projects up to date. VMs running the Ubuntu and Rocky operating systems automatically install security patches and alert users at log-on (via SSH) to reboot as necessary for the changes to take effect. They also encourage users to update packages.
"},{"location":"services/virtualmachines/policies/#customer-run-outward-facing-web-services","title":"Customer-run outward facing web services","text":"PIs can apply to run an outward-facing service; that is a webservice on port 443, running on a project-owned VM. The policy requires the customer to accept the following conditions:
Pis can apply for such a service on application and also at any time by contacing the EIDF Service Desk.
"},{"location":"services/virtualmachines/quickstart/","title":"Quickstart","text":"Projects using the Virtual Desktop cloud service are accessed via the EIDF Portal.
Authentication is provided by SAFE, so if you do not have an active web browser session in SAFE, you will be redirected to the SAFE log on page. If you do not have a SAFE account follow the instructions in the SAFE documentation how to register and receive your password.
"},{"location":"services/virtualmachines/quickstart/#accessing-your-projects","title":"Accessing your projects","text":"Log into the portal at https://portal.eidf.ac.uk/. The login will redirect you to the SAFE.
View the projects that you have access to at https://portal.eidf.ac.uk/project/
Navigate to https://portal.eidf.ac.uk/project/ and click the link to \"Request access\", or choose \"Request Access\" in the \"Project\" menu.
Select the project that you want to join in the \"Project\" dropdown list - you can search for the project name or the project code, e.g. \"eidf0123\".
Now you have to wait for your PI or project manager to accept your request to join.
"},{"location":"services/virtualmachines/quickstart/#accessing-a-vm","title":"Accessing a VM","text":"Select a project and view your user accounts on the project page.
Click on an account name to view details of the VMs that are you allowed to access with this account, and to change the password for this account.
Before you log in for the first time with a new user account, you must change your password as described below.
Follow the link to the Guacamole login or log in directly at https://eidf-vdi.epcc.ed.ac.uk/vdi/. Please see the VDI guide for more information.
You can also log in via the EIDF Gateway Jump Host if this is available in your project.
Warning
You must set a password for a new account before you log in for the first time.
"},{"location":"services/virtualmachines/quickstart/#set-or-change-the-password-for-a-user-account","title":"Set or change the password for a user account","text":"Follow these instructions to set a password for a new account before you log in for the first time. If you have forgotten your password you may reset the password as described here.
Select a project and click the account name in the project page to view the account details.
In the user account detail page, press the button \"Set Password\" and follow the instructions in the form.
There may be a short delay while the change is implemented before the new password becomes usable.
"},{"location":"services/virtualmachines/quickstart/#further-information","title":"Further information","text":"Managing VMs: Project management guide to creating, configuring and removing VMs and managing user accounts in the portal.
Virtual Desktop Interface: Working with the VDI interface.
EIDF Gateway: SSH access to VMs via the EIDF SSH Gateway jump host.
"},{"location":"status/","title":"EIDF Service Status","text":"The table below represents the broad status of each EIDF service.
Service Status EIDF Portal VM SSH Gateway VM VDI Gateway Virtual Desktops Cerebras CS-2 Ultra2"},{"location":"status/#maintenance-sessions","title":"Maintenance Sessions","text":"There will be a service outage on the 3rd Thursday of every month from 9am to 5pm. We keep maintenance downtime to a minimum on the service but do occasionally need to perform essential work on the system. Maintenance sessions are used to ensure that:
The service will be returned to service ahead of 5pm if all the work is completed early.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"EIDF User Documentation","text":"The Edinburgh International Data Facility (EIDF) is built and operated by EPCC at the University of Edinburgh. EIDF is a place to store, find and work with data of all kinds. You can find more information on the service and the research it supports on the EIDF website.
For more information or for support with our services, please email eidf@epcc.ed.ac.uk
in the first instance.
This documentation gives more in-depth coverage of current EIDF services. It is aimed primarily at developers or power users.
"},{"location":"#contributing-to-the-documentation","title":"Contributing to the documentation","text":"The source for this documentation is publicly available in the EIDF documentation Github repository so that anyone can contribute to improve the documentation for the service. Contributions can be in the form of improvements or additions to the content and/or addition of Issues providing suggestions for how it can be improved.
Full details of how to contribute can be found in the README.md
file of the repository.
This documentation set is a work in progress.
"},{"location":"#credits","title":"Credits","text":"This documentation draws on the ARCHER2 National Supercomputing Service documentation.
"},{"location":"access/","title":"Accessing EIDF","text":"Some EIDF services are accessed via a Web browser and some by \"traditional\" command-line ssh
.
All EIDF services use the EPCC SAFE service management back end, to ensure compatibility with other EPCC high-performance computing services.
"},{"location":"access/#web-access-to-virtual-machines","title":"Web Access to Virtual Machines","text":"The Virtual Desktop VM service is browser-based, providing a virtual desktop interface (Apache Guacamole) for \"desktop-in-a-browser\" access. Applications to use the VM service are made through the EIDF Portal.
EIDF Portal: how to ask to join an existing EIDF project and how to apply for a new project
VDI access to virtual machines: how to connect to the virtual desktop interface.
"},{"location":"access/#ssh-access-to-virtual-machines","title":"SSH Access to Virtual Machines","text":"Users with the appropriate permissions can also use ssh
to login to Virtual Desktop VMs
Includes access to the following services:
To login to most command-line services with ssh
you should use the username and password you obtained from SAFE when you applied for access, along with the SSH Key you registered when creating the account. You can then login to the host following the appropriately linked instructions above.
Projects using the Virtual Desktop cloud service are accessed via the EIDF Portal.
The EIDF Portal uses EPCC's SAFE service management software to manage user accounts across all EPCC services. To log in to the Portal you will first be redirected to the SAFE log on page. If you do not have a SAFE account follow the instructions in the SAFE documentation how to register and receive your password.
"},{"location":"access/project/#how-to-request-to-join-a-project","title":"How to request to join a project","text":"Log in to the EIDF Portal and navigate to \"Projects\" and choose \"Request access\". Select the project that you want to join in the \"Project\" dropdown list - you can search for the project name or the project code, e.g. \"eidf0123\".
Now you have to wait for your PI or project manager to accept your request to register.
"},{"location":"access/project/#how-to-apply-for-a-project-as-a-principal-investigator","title":"How to apply for a project as a Principal Investigator","text":""},{"location":"access/project/#create-a-new-project-application","title":"Create a new project application","text":"Navigate to the EIDF Portal and log in via SAFE if necessary (see above).
Once you have logged in click on \"Applications\" in the menu and choose \"New Application\".
Once the application has been created you see an overview of the form you are required to fill in. You can revisit the application at any time by clicking on \"Applications\" and choosing \"Your applications\" to display all your current and past applications and their status, or follow the link https://portal.eidf.ac.uk/proposal/.
"},{"location":"access/project/#populate-a-project-application","title":"Populate a project application","text":"Fill in each section of the application as required:
You can edit and save each section separately and revisit the application at a later time.
"},{"location":"access/project/#datasets","title":"Datasets","text":"You are required to fill in a \"Dataset\" form for each dataset that you are planning to store and process as part of your project.
We are required to ensure that projects involving \"sensitive\" data have the necessary permissions in place. The answers to these questions will enable us to decide what additional documentation we may need, and whether your project may need to be set up in an independently governed Safe Haven. There may be some projects we are simply unable to host for data protection reasons.
"},{"location":"access/project/#resource-requirements","title":"Resource Requirements","text":"Add an estimate for each size and type of VM that is required.
"},{"location":"access/project/#submission","title":"Submission","text":"When you are happy with your application, click \"Submit\". If there are missing fields that are required these are highlighted and your submission will fail.
When your submission was successful the application status is marked as \"Submitted\" and now you have to wait while the EIDF approval team considers your application. You may be contacted if there are any questions regarding your application or further information is required, and you will be notified of the outcome of your application.
"},{"location":"access/project/#approved-project","title":"Approved Project","text":"If your application was approved, refer to Data Science Virtual Desktops: Quickstart how to view your project and to Data Science Virtual Desktops: Managing VMs how to manage a project and how to create virtual machines and user accounts.
"},{"location":"access/ssh/","title":"SSH Access to Virtual Machines using the EIDF-Gateway Jump Host","text":"The EIDF-Gateway is an SSH gateway suitable for accessing EIDF Services via a console or terminal. As the gateway cannot be 'landed' on, a user can only pass through it and so the destination (the VM IP) has to be known for the service to work. Users connect to their VM through the jump host using their given accounts. You will require three things to use the gateway:
Steps to meet all of these requirements are explained below.
"},{"location":"access/ssh/#generating-and-adding-an-ssh-key","title":"Generating and Adding an SSH Key","text":"In order to make use of the EIDF-Gateway, your EIDF account needs an SSH-Key associated with it. If you added one while creating your EIDF account, you can skip this step.
"},{"location":"access/ssh/#check-for-an-existing-ssh-key","title":"Check for an existing SSH Key","text":"To check if you have an SSH Key associated with your account:
If there is an entry under 'Credentials', then you're all setup. If not, you'll need to generate an SSH-Key, to do this:
"},{"location":"access/ssh/#generate-a-new-ssh-key","title":"Generate a new SSH Key","text":"Generate a new SSH Key:
ssh-keygen\n
It is fine to accept the default name and path for the key unless you manage a number of keys.
This should not be necessary for most users, so only follow this process if you have an issue or have been told to by the EPCC Helpdesk. If you need to add an SSH Key directly to SAFE, you can follow this guide. However, select your '[username]@EIDF' login account, not 'Archer2' as specified in that guide.
"},{"location":"access/ssh/#enabling-mfa-via-the-portal","title":"Enabling MFA via the Portal","text":"A multi-factor Time-Based One-Time Password is now required to access the SSH Gateway.
To enable this for your EIDF account:
Note
TOTP is only required for the SSH Gateway, not to the VMs themselves, and not through the VDI. An MFA token will have to be set for each account you'd like to use to access the EIDF SSH Gateway.
"},{"location":"access/ssh/#using-the-ssh-key-and-totp-code-to-access-eidf-windows-and-linux","title":"Using the SSH-Key and TOTP Code to access EIDF - Windows and Linux","text":"From your local terminal, import the SSH Key you generated above: ssh-add /path/to/ssh-key
This should return \"Identity added [Path to SSH Key]\" if successful. You can then follow the steps below to access your VM.
OpenSSH is installed on Linux and MacOS usually by default, so you can access the gateway natively from the terminal.
Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the commands below:
ssh-add /path/to/ssh-key\nssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip]\n
For example:
ssh-add ~/.ssh/keys/id_ed25519\nssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1\n
Info
If the ssh-add
command fails saying the SSH Agent is not running, run the below command:
eval `ssh-agent`
Then re-run the ssh-add command above.
The -J
flag is use to specify that we will access the second specified host by jumping through the first specified host.
You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application.
"},{"location":"access/ssh/#accessing-from-windows","title":"Accessing from Windows","text":"Windows will require the installation of OpenSSH-Server to use SSH. Putty or MobaXTerm can also be used but won\u2019t be covered in this tutorial.
"},{"location":"access/ssh/#installing-and-using-openssh","title":"Installing and using OpenSSH","text":"Import the SSH Key you generated above:
ssh-add \\path\\to\\sshkey\n\nFor Example:\nssh-add .\\.ssh\\id_ed25519\n
This should return \"Identity added [Path to SSH Key]\" if successful. If it doesn't, run the following in Powershell:
Get-Service -Name ssh-agent | Set-Service -StartupType Manual\nStart-Service ssh-agent\nssh-add \\path\\to\\sshkey\n
Login by jumping through the gateway.
ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip]\n\nFor Example:\nssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1\n
You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application.
"},{"location":"access/ssh/#ssh-aliases","title":"SSH Aliases","text":"You can use SSH Aliases to access your VMs with a single command.
Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. Using the text editor of your choice (vi used as an example), edit the .ssh/config file:
vi ~/.ssh/config\n
Insert the following lines:
Host eidf-gateway\n Hostname eidf-gateway.epcc.ed.ac.uk\n User <eidf project username>\n IdentityFile /path/to/ssh/key\n
For example:
Host eidf-gateway\n Hostname eidf-gateway.epcc.ed.ac.uk\n User alice\n IdentityFile ~/.ssh/id_ed25519\n
Save and quit the file.
Now you can ssh to your VM using the below command:
ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key\n
For Example:
ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519\n
You can add further alias options to make accessing your VM quicker. For example, if you use the below template to create an entry below the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to automatically jump through the EIDF-Gateway and onto your VM:
Host <vm name/alias>\n HostName 10.24.VM.IP\n User <vm username>\n IdentityFile /path/to/ssh/key\n ProxyCommand ssh eidf-gateway -W %h:%p\n
For Example:
Host demo\n HostName 10.24.1.1\n User alice\n IdentityFile ~/.ssh/id_ed25519\n ProxyCommand ssh eidf-gateway -W %h:%p\n
Now, by running ssh demo
your ssh agent will automatically follow the 'ProxyCommand' section in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM. Note for this setup, if your key is RSA, you will need to add the following line to the bottom of the 'demo' alias: HostKeyAlgorithms +ssh-rsa
Info
This has added an 'Alias' entry to your ssh config, so whenever you ssh to 'eidf-gateway' your ssh client will automatically fill in the hostname, your username and SSH key. This method allows for a much less complicated ssh command to reach your VMs. You can replace the alias name with whatever you like; just change the 'Host' line from 'eidf-gateway' to the alias you would like. The -J
flag is used to specify that we will access the second specified host by jumping through the first specified host.
You do not have to set a password to log into virtual machines. However, if you have been given sudo permission, you will need to set a password to be able to make use of sudo. You can set (or reset) a password using the web form in the EIDF Portal following the instructions in Set or change the password for a user account.
"},{"location":"access/virtualmachines-vdi/","title":"Virtual Machines (VMs) and the EIDF Virtual Desktop Interface (VDI)","text":"Using the EIDF VDI, members of EIDF projects can connect to VMs that they have been granted access to. The EIDF VDI is a web portal that displays the connections to VMs a user has available to them, and then those connections can be easily initiated by clicking on them in the user interface. Once connected to the target VM, all interactions are mediated through the user's web browser by the EIDF VDI.
"},{"location":"access/virtualmachines-vdi/#login-to-the-eidf-vdi","title":"Login to the EIDF VDI","text":"Once your membership request to join the appropriate EIDF project has been approved, you will be able to login to the EIDF VDI at https://eidf-vdi.epcc.ed.ac.uk/vdi.
Authentication to the VDI is provided by SAFE, so if you do not have an active web browser session in SAFE, you will be redirected to the SAFE log on page. If you do not have a SAFE account follow the instructions in the SAFE documentation how to register and receive your password.
"},{"location":"access/virtualmachines-vdi/#navigating-the-eidf-vdi","title":"Navigating the EIDF VDI","text":"After you have been authenticated through SAFE and logged into the EIDF VDI, if you have multiple connections available to you that have been associated with your user (typically in the case of research projects), you will be presented with the VDI home screen as shown below:
VDI home page with list of available VM connections
Adding connections
Note that if a project manager has added a new connection for you it may not appear in the list of connections immediately. You must log out and log in again to refresh your connections list.
"},{"location":"access/virtualmachines-vdi/#connecting-to-a-vm","title":"Connecting to a VM","text":"If you have only one connection associated with your VDI user account (typically in the case of workshops), you will be automatically connected to the target VM's virtual desktop. Once you are connected to the VM, you will be asked for your username and password as shown below (if you are participating in a workshop, then you may not be asked for credentials)
Warning
If this is your first time connecting to EIDF using a new account, you have to set a password as described in Set or change the password for a user account.
VM virtual desktop connection user account login screen
Once your credentials have been accepted, you will be connected to your VM's desktop environment. For instance, the screenshot below shows a resulting connection to a Xubuntu 20.04 VM with the Xfce desktop environment.
VM virtual desktop
"},{"location":"access/virtualmachines-vdi/#vdi-features-for-the-virtual-desktop","title":"VDI Features for the Virtual Desktop","text":"The EIDF VDI is an instance of the Apache Guacamole clientless remote desktop gateway. Since the connection to your VM virtual desktop is entirely managed through Guacamole in the web browser, there are some additional features to be aware of that may assist you when using the VDI.
"},{"location":"access/virtualmachines-vdi/#the-vdi-menu","title":"The VDI Menu","text":"The Guacamole menu is a sidebar which is hidden until explicitly shown. On a desktop or other device which has a hardware keyboard, you can show this menu by pressing <Ctrl> + <Alt> + <Shift> on a Windows PC client, or <Ctrl> + <Command> + <Shift> on a Mac client. To hide the menu, you press the same key combination once again. The menu provides various options, including:
After you have activated the Guacamole menu using the key combination above, at the top of the menu is a text area labeled \u201cclipboard\u201d along with some basic instructions:
Text copied/cut within Guacamole will appear here. Changes to the text below will affect the remote clipboard.
The text area functions as an interface between the remote clipboard and the local clipboard. Text from the local clipboard can be pasted into the text area, causing that text to be sent to the clipboard of the remote desktop. Similarly, if you copy or cut text within the remote desktop, you will see that text within the text area, and can manually copy it into the local clipboard if desired.
You can use the standard keyboard shortcuts to copy text from your client PC or Mac to the Guacamole menu clipboard, then again copy that text from the Guacamole menu clipboard into an application or CLI terminal on the VM's remote desktop. An example of using the copy and paste clipboard is shown in the screenshot below.
The EIDF VDI Clipboard
"},{"location":"access/virtualmachines-vdi/#keyboard-language-and-layout-settings","title":"Keyboard Language and Layout Settings","text":"For users who do not have standard English (UK)
keyboard layouts, key presses can have unexpected translations as they are transmitted to your VM. Please contact the EIDF helpdesk at eidf@epcc.ed.ac.uk if you are experiencing difficulties with your keyboard mapping, and we will help to resolve this by changing some settings in the Guacamole VDI connection configuration.
Submit a query in the EIDF Portal by selecting \"Submit a Support Request\" in the \"Help and Support\" menu and filling in this form.
You can also email us at eidf@epcc.ed.ac.uk.
"},{"location":"faq/#how-do-i-request-more-resources-for-my-project-can-i-extend-my-project","title":"How do I request more resources for my project? Can I extend my project?","text":"Submit a support request: In the form select the project that your request relates to and select \"EIDF Project extension: duration and quota\" from the dropdown list of categories. Then enter the new quota or extension date in the description text box below and submit the request.
The EIDF approval team will consider the extension and you will be notified of the outcome.
"},{"location":"faq/#new-vms-and-vdi-connections","title":"New VMs and VDI connections","text":"My project manager gave me access to a VM but the connection doesn't show up in the VDI connections list?
This may happen when a machine/VM was added to your connections list while you were logged in to the VDI. Please refresh the connections list by logging out and logging in again, and the new connections should appear.
"},{"location":"faq/#non-default-ssh-keys","title":"Non-default SSH Keys","text":"I have different SSH keys for the SSH gateway and my VM, or I use a key which does not have the default name (~/.ssh/id_rsa) and I cannot login.
The command syntax shown in our SSH documentation (using the -J <username>@eidf-gateway
stanza) makes assumptions about SSH keys and their naming. You should try the full version of the command:
ssh -o ProxyCommand=\"ssh -i ~/.ssh/<gateway_private_key> -W %h:%p <gateway_username>@eidf-gateway.epcc.ed.ac.uk\" -i ~/.ssh/<vm_private_key> <vm_username>@<vm_ip>\n
Note that for the majority of users, gateway_username and vm_username are the same, as are gateway_private_key and vm_private_key.
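As a worked example, using the same hypothetical username (alice), key (~/.ssh/id_ed25519) and VM address (10.24.1.1) as earlier in this documentation, the command when the gateway and VM share the same account and key would look like:
ssh -o ProxyCommand=\"ssh -i ~/.ssh/id_ed25519 -W %h:%p alice@eidf-gateway.epcc.ed.ac.uk\" -i ~/.ssh/id_ed25519 alice@10.24.1.1\n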
"},{"location":"faq/#username-policy","title":"Username Policy","text":"I already have an EIDF username for project Y, can I use this for project X?
We mandate that every username must be unique across our estate. EPCC machines, including EIDF services such as the SDF and DSC VMs and HPC services such as Cirrus, require you to create a new machine account with a unique username for each project you work on. Usernames cannot be reused across projects, even if the previous project has finished. However, some projects span multiple machines, so you may be able to login to multiple machines with the same username.
"},{"location":"known-issues/","title":"Known Issues","text":""},{"location":"known-issues/#virtual-desktops","title":"Virtual desktops","text":"No known issues.
"},{"location":"overview/","title":"A Unique Service for Academia and Industry","text":"The Edinburgh International Data Facility (EIDF) is a growing set of data and compute services developed to support the Data Driven Innovation Programme at the University of Edinburgh.
Our goal is to support learners, researchers and innovators across the spectrum, with services from data discovery through simple learn-as-you-play-with-data notebooks to GPU-enabled machine-learning platforms for driving AI application development.
"},{"location":"overview/#eidf-and-the-data-driven-innovation-initiative","title":"EIDF and the Data-Driven Innovation Initiative","text":"Launched at the end of 2018, the Data-Driven Innovation (DDI) programme is one of six funded within the Edinburgh & South-East Scotland City Region Deal. The DDI programme aims to make Edinburgh the \u201cData Capital of Europe\u201d, with ambitious targets to support, enhance and improve talent, research, commercial adoption and entrepreneurship across the region through better use of data.
The programme targets ten industry sectors, with interactions managed through five DDI Hubs: the Bayes Centre, the Usher Institute, Edinburgh Futures Institute, the National Robotarium, and Easter Bush. The activities of these Hubs are underpinned by EIDF.
"},{"location":"overview/acknowledgements/","title":"Acknowledging EIDF","text":"If you make use of EIDF services in your work, we encourage you to acknowledge us in any publications.
Acknowledgement of using the facility in publications can be used as an identifiable metric to evaluate the scientific support provided, and helps promote the impact of the wider DDI Programme.
We encourage our users to ensure that an acknowledgement of EIDF is included in the relevant section of their manuscript. We would suggest:
This work was supported by the Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh.
"},{"location":"overview/contacts/","title":"Contact","text":"The Edinburgh International Data Facility is located at the Advanced Computing Facility of EPCC, the supercomputing centre based at the University of Edinburgh.
"},{"location":"overview/contacts/#email-us","title":"Email us","text":"Email EIDF: eidf@epcc.ed.ac.uk
"},{"location":"overview/contacts/#sign-up","title":"Sign up","text":"Join our mailing list to receive updates about EIDF.
"},{"location":"safe-haven-services/network-access-controls/","title":"Safe Haven Network Access Controls","text":"The TRE Safe Haven services are protected against open, global access by IPv4 source address filtering. These network access controls ensure that connections are permitted only from Safe Haven controller partner networks and collaborating research institutions.
The network access controls are managed by the Safe Haven service controllers who instruct EPCC to add and remove the IPv4 addresses allowed to connect to the service gateway. Researchers must connect to the Safe Haven service by first connecting to their institution or corporate VPN and then connecting to the Safe Haven.
The Safe Haven IG controller and research project co-ordination teams must submit and confirm IPv4 address filter changes to their service help desk via email.
"},{"location":"safe-haven-services/overview/","title":"Safe Haven Services","text":"The EIDF Trusted Research Environment (TRE) hosts several Safe Haven services that enable researchers to work with sensitive data in a secure environment. These services are operated by EPCC in partnership with Safe Haven controllers who manage the Information Governance (IG) appropriate for the research activities and the data access of their Safe Haven service.
It is the responsibility of EPCC as the Safe Haven operator to design, implement and administer the technical controls required to deliver the Safe Haven security regime demanded by the Safe Haven controller.
The role of the Safe Haven controller is to satisfy the needs of the researchers and the data suppliers. The controller is responsible for guaranteeing the confidentiality needs of the data suppliers and matching these with the availability needs of the researchers.
The service offers secure data sharing and analysis environments allowing researchers access to sensitive data under the terms and conditions prescribed by the data providers. The service prioritises the requirements of the data provider over the demands of the researcher and is an academic TRE operating under the guidance of the Five Safes framework.
The TRE has dedicated, private cloud infrastructure at EPCC's Advanced Computing Facility (ACF) data centre and has its own HPC cluster and high-performance file systems. When a new Safe Haven service is commissioned in the TRE it is created in a new virtual private cloud providing the Safe Haven service controller with an independent IG domain separate from other Safe Havens in the TRE. All TRE service infrastructure and all TRE project data are hosted at ACF.
If you have any questions about the EIDF TRE or about Safe Haven services, please contact us.
"},{"location":"safe-haven-services/safe-haven-access/","title":"Safe Haven Service Access","text":"Safe Haven services are accessed from a registered network connection address using a browser. The service URL will be \"https://shs.epcc.ed.ac.uk/<service>\" where <service> is the Safe Haven service name.
The Safe Haven access process is in three stages from multi-factor authentication to project desktop login.
Researchers who are active in many research projects and in more than one Safe Haven will need to pay attention to the service they connect to, the project desktop they login to, and the accounts and identities they are using.
"},{"location":"safe-haven-services/safe-haven-access/#safe-haven-login","title":"Safe Haven Login","text":"The first step in the process prompts the user for a Safe Haven username and then for a session PIN code sent via SMS text to the mobile number registered for the username.
Valid PIN code entry allows the user access to all of the Safe Haven service remote desktop gateways for up to 24 hours without entry of a new PIN code. A user who has successfully entered a PIN code once can access shs.epcc.ed.ac.uk/haven1 and shs.epcc.ed.ac.uk/haven2 without repeating PIN code identity verification.
When a valid PIN code is accepted, the user is prompted to accept the service use terms and conditions.
Registration of the user mobile phone number is managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.
"},{"location":"safe-haven-services/safe-haven-access/#remote-desktop-gateway-login","title":"Remote Desktop Gateway Login","text":"The second step in the access process is for the user to login to the Safe Haven service remote desktop gateway so that a project desktop connection can be chosen. The user is prompted for a Safe Haven service account identity.
VDI Safe Haven Service Login Page
Safe Haven accounts are managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.
"},{"location":"safe-haven-services/safe-haven-access/#project-desktop-connection","title":"Project Desktop Connection","text":"The third stage in the process is to select the virtual connection from those available on the account's home page. An example home page is shown below offering two connection options to the same virtual machine. Remote desktop connections will have an _rdp suffix and SSH terminal connections have an _ssh suffix. The most recently used connections are shown as screen thumbnails at the top of the page and all the connections available to the user are shown in a tree list below this.
VM connections available home page
The remote desktop gateway software used in the Safe Haven services in the TRE is the Apache Guacamole web application. Users new to this application can find the user manual here. It is recommended that users read this short guide, but note that the data sharing features such as copy and paste, connection sharing, and file transfers are disabled on all connections in the TRE Safe Havens.
A remote desktop or SSH connection is used to access data provided for a specific research project. If a researcher is working on multiple projects within a Safe Haven they can only login to one project at a time. Some connections may allow the user to login to any project and some connections will only allow the user to login into one specific project. This depends on project IG restrictions specified by the Safe Haven and project controllers.
Project desktop accounts are managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.
"},{"location":"safe-haven-services/using-the-hpc-cluster/","title":"Using the TRE HPC Cluster","text":""},{"location":"safe-haven-services/using-the-hpc-cluster/#introduction","title":"Introduction","text":"The TRE HPC system, also called the SuperDome Flex, is a single node, large memory HPC system. It is provided for compute and data intensive workloads that require more CPU, memory, and better IO performance than can be provided by the project VMs, which have the performance equivalent of small rack mount servers.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#specifications","title":"Specifications","text":"The system is an HPE SuperDome Flex configured with 1152 hyper-threaded cores (576 physical cores) and 18TB of memory, of which 17TB is available to users. User home and project data directories are on network mounted storage pods running the BeeGFS parallel filesystem. This storage is built in blocks of 768TB per pod. Multiple pods are available in the TRE for use by the HPC system and the total storage available will vary depending on the project configuration.
The HPC system runs Red Hat Enterprise Linux, which is not the same flavour of Linux as the Ubuntu distribution running on the desktop VMs. However, most jobs in the TRE run Python and R, and there are few issues moving between the two version of Linux. Use of virtual environments is strongly encouraged to ensure there are no differences between the desktop and HPC runtimes.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#software-management","title":"Software Management","text":"All system level software installed and configured on the TRE HPC system is managed by the TRE admin team. Software installation requests may be made by the Safe Haven IG controllers, research project co-ordinators, and researchers by submitting change requests through the dedicated service help desk via email.
Minor software changes will be made as soon as admin effort can be allocated. Major changes are likely to be scheduled for the TRE monthly maintenance session on the first Thursday of each month.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#hpc-login","title":"HPC Login","text":"Login to the HPC system is from the project VM using SSH and is not direct from the VDI. The HPC cluster accounts are the same accounts used on the project VMs, with the same username and password. All project data access on the HPC system is private to the project accounts as it is on the VMs, but it is important to understand that the TRE HPC cluster is shared by projects in other TRE Safe Havens.
To login to the HPC cluster from the project VMs use ssh shs-sdf01
from an xterm. If you wish to avoid entry of the account password for every SSH session or remote command execution you can use SSH key authentication by following the SSH key configuration instructions here. SSH key passphrases are not strictly enforced within the Safe Haven but are strongly encouraged.
To use the HPC system fully and fairly, all jobs must be run using the SLURM job manager. More information about SLURM, running batch jobs and running interactive jobs can be found here. Please read this carefully before using the cluster if you have not used SLURM before. The SLURM site also has a set of useful tutorials on HPC clusters and job scheduling.
All analysis and processing jobs must be run via SLURM. SLURM manages access to all the cores on the system beyond the first 32. If SLURM is not used and programs are run directly from the command line, then there are only 32 cores available, and these are shared by the other users. Normal code development, short test runs, and debugging can be done from the command line without using SLURM.
There is only one node
The HPC system is a single node with all cores sharing all the available memory. SLURM jobs should always specify '#SBATCH --nodes=1' to run correctly.
SLURM email alerts for job status change events are not supported in the TRE.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#resource-limits","title":"Resource Limits","text":"There are no resource constraints imposed on the default SLURM partition at present. There are user limits (see the output of ulimit -a
). If a project has a requirement for more than 200 cores, more than 4TB of memory, or an elapsed runtime of more than 96 hours, a resource reservation request should be made by the researchers through email to the service help desk.
There are no storage quotas enforced in the HPC cluster storage at present. The project storage requirements are negotiated, and space allocated before the project accounts are released. Storage use is monitored, and guidance will be issued before quotas are imposed on projects.
The HPC system is managed in the spirit of utilising it as fully as possible and as fairly as possible. This approach works best when researchers are aware of their project workload demands and cooperate rather than compete for cluster resources.
"},{"location":"safe-haven-services/using-the-hpc-cluster/#python-jobs","title":"Python Jobs","text":"A basic script to run a Python job in a virtual environment is shown below.
#!/bin/bash\n#\n#SBATCH --export=ALL # Job inherits all env vars\n#SBATCH --job-name=my_job_name # Job name\n#SBATCH --mem=512G # Job memory request\n#SBATCH --output=job-%j.out # Standard output file\n#SBATCH --error=job-%j.err # Standard error file\n#SBATCH --nodes=1 # Run on a single node\n#SBATCH --ntasks=1 # Run one task per node\n#SBATCH --time=02:00:00 # Time limit hrs:min:sec\n#SBATCH --partition standard # Run on partition (queue)\n\npwd\nhostname\ndate \"+DATE: %d/%m/%Y TIME: %H:%M:%S\"\necho \"Running job on a single CPU core\"\n\n# Create the job\u2019s virtual environment\nsource ${HOME}/my_venv/bin/activate\n\n# Run the job code\npython3 ${HOME}/my_job.py\n\ndate \"+DATE: %d/%m/%Y TIME: %H:%M:%S\"\n
"},{"location":"safe-haven-services/using-the-hpc-cluster/#mpi-jobs","title":"MPI Jobs","text":"An example script for a multi-process MPI example is shown. The system currently supports MPICH MPI.
#!/bin/bash\n#\n#SBATCH --export=ALL\n#SBATCH --job-name=mpi_test\n#SBATCH --output=job-%j.out\n#SBATCH --error=job-%j.err\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=5\n#SBATCH --time=05:00\n#SBATCH --partition standard\n\necho \"Submitted Open MPI job\"\necho \"Running on host ${HOSTNAME}\"\necho \"Using ${SLURM_NTASKS_PER_NODE} tasks per node\"\necho \"Using ${SLURM_CPUS_PER_TASK} cpus per task\"\nlet mpi_threads=${SLURM_NTASKS_PER_NODE}*${SLURM_CPUS_PER_TASK}\necho \"Using ${mpi_threads} MPI threads\"\n\n# load Open MPI module\nmodule purge\nmodule load mpi/mpich-x86_64\n\n# run mpi program\nmpirun ${HOME}/test_mpi\n
"},{"location":"safe-haven-services/using-the-hpc-cluster/#managing-files-and-data","title":"Managing Files and Data","text":"There are three file systems to manage in the VM and HPC environment.
The /safe_data file system with the project data cannot be used by the HPC system. The /safe_data file system has restricted access and a relatively slow IO performance compared to the parallel BeeGFS file system storage on the HPC system.
The process to use the TRE HPC service is to copy and synchronise the project code and data files on the /safe_data file system with the HPC /home file system before and after login sessions and job runs on the HPC cluster. Assuming all the code and data required for the job is in a directory 'current_wip' on the project VM, the workflow is as follows:
rsync -avPz -e ssh /safe_data/my_project/current_wip shs-sdf01:
ssh shs-sdf01
, cd current_wip
, sbatch/srun my_job
rsync -avPz -e ssh shs-sdf01:current_wip /safe_data/my_project
Sessions on project VMs may be either remote desktop (RDP) logins or SSH terminal logins. Most users will prefer to use the remote desktop connections, but the SSH terminal connection is useful when remote network performance is poor and it must be used for account password changes.
"},{"location":"safe-haven-services/virtual-desktop-connections/#first-time-login-and-account-password-changes","title":"First Time Login and Account Password Changes","text":"Account Password Changes
Note that first time account login cannot be through RDP as a password change is required. Password reset logins must be SSH terminal sessions as password changes can only be made through SSH connections.
"},{"location":"safe-haven-services/virtual-desktop-connections/#connecting-to-a-remote-ssh-session","title":"Connecting to a Remote SSH Session","text":"When a VM SSH connection is selected the browser screen becomes a text terminal and the user is prompted to \"Login as: \" with a project account name, and then prompted for the account password. This connection type is equivalent to a standard xterm SSH session.
"},{"location":"safe-haven-services/virtual-desktop-connections/#connecting-to-a-remote-desktop-session","title":"Connecting to a Remote Desktop Session","text":"Remote desktop connections work best by first placing the browser in Full Screen mode and leaving it in this mode for the entire duration of the Safe Haven session.
When a VM RDP connection is selected the browser screen becomes a remote desktop presenting the login screen shown below.
VM virtual desktop connection user account login screen
Once the project account credentials have been accepted, a remote dekstop similar to the one shown below is presented. The default VM environment in the TRE is Ubuntu 22.04 with the Xfce desktop.
VM virtual desktop
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/","title":"Accessing the Superdome Flex inside the EPCC Trusted Research Environment","text":""},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#what-is-the-superdome-flex","title":"What is the Superdome Flex?","text":"The Superdome Flex (SDF) is a high-performance computing cluster manufactured by Hewlett Packard Enterprise. It has been designed to handle multi-core, high-memory tasks in environments where security is paramount. The hardware specifications of the SDF within the Trusted Research Environment (TRE) are as follows:
The software specification of the SDF are:
The SDF is within the TRE. Therefore, the same restrictions apply, i.e. the SDF is isolated from the internet (no downloading code from public GitHub repos) and copying/recording/extracting anything on the SDF outside of the TRE is strictly prohibited unless through approved processes.
Users can only access the SDF by ssh-ing into it via their VM desktop.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#hello-world","title":"Hello world","text":"**** On the VM desktop terminal ****\n\nssh shs-sdf01\n<Enter VM password>\n\necho \"Hello World\"\n\nexit\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#sdf-vs-vm-file-systems","title":"SDF vs VM file systems","text":"The SDF file system is separate from the VM file system, which is again separate from the project data space. Files need to be transferred between the three systems for any analysis to be completed within the SDF.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#example-showing-separate-sdf-and-vm-file-systems","title":"Example showing separate SDF and VM file systems","text":"**** On the VM desktop terminal ****\n\ncd ~\ntouch test.txt\nls\n\nssh shs-sdf01\n<Enter VM password>\n\nls # test.txt is not here\nexit\n\nscp test.txt shs-sdf01:/home/<USERNAME>/\n\nssh shs-sdf01\n<Enter VM password>\n\nls # test.txt is here\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#example-copying-data-between-project-data-space-and-sdf","title":"Example copying data between project data space and SDF","text":"Transferring and synchronising data sets between the project data space and the SDF is easier with the rsync command (rather than manually checking and copying files/folders with scp). rsync only transfers files that are different between the two targets, more details in its manual.
**** On the VM desktop terminal ****\n\nman rsync # check instructions for using rsync\n\nrsync -avPz -e ssh /safe_data/my_project/ shs-sdf01:/home/<USERNAME>/my_project/ # sync project folder and SDF home folder\n\nssh shs-sdf01\n<Enter VM password>\n\n*** Conduct analysis on SDF ***\n\nexit\n\nrsync -avPz -e ssh /safe_data/my_project/current_wip shs-sdf01:/home/<USERNAME>/my_project/ # sync project file and ssh home page # re-syncronise project folder and SDF home folder\n\n*** Optionally remove the project folder on SDF ***\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/","title":"Running R/Python Scripts","text":"Running analysis scripts on the SDF is slightly different to running scripts on the Desktop VMs. The Linux distribution differs between the two with the SDF using Red Hat Enterprise Linux (RHEL) and the Desktop VMs using Ubuntu. Therefore, it is highly advisable to use virtual environments (e.g. conda environments) to complete any analysis and aid the transition between the two distributions. Conda should run out of the box on the Desktop VMs, but some configuration is required on the SDF.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#setting-up-conda-environments-on-you-first-connection-to-the-sdf","title":"Setting up conda environments on you first connection to the SDF","text":"*** SDF Terminal ***\n\nconda activate base # Test conda environment\n\n# Conda command will not be found. There is no need to install!\n\neval \"$(/opt/anaconda3/bin/conda shell.bash hook)\" # Tells your terminal where conda is\n\nconda init # changes your .bashrc file so conda is automatically available in the future\n\nconda config --set auto_activate_base false # stop conda base from being activated on startup\n\npython # note python version\n\nexit()\n
The base conda environment is now available but note that the python and gcc compilers are not the latest (Python 3.9.7 and gcc 7.5.0).
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#getting-an-up-to-date-python-version","title":"Getting an up-to-date python version","text":"In order to get an up-to-date python version we first need to use an updated gcc version. Fortunately, conda has an updated gcc toolset that can be installed.
*** SDF Terminal ***\n\nconda activate base # If conda isn't already active\n\nconda create -n python-v3.11 gcc_linux-64=11.2.0 python=3.11.3\n\nconda activate python-v3.11\n\npython\n\nexit()\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#running-r-scripts-on-the-sdf","title":"Running R scripts on the SDF","text":"The default version of R available on the SDF is v4.1.2. Alternative R versions can be installed using conda similar to the python conda environment above.
conda create -n r-v4.3 gcc_linux-64=11.2.0 r-base=4.3\n\nconda activate r-v4.3\n\nR\n\nq()\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#final-points","title":"Final points","text":"The SDF, like the rest of the SHS, is separated from the internet. The installation of python/R libraries to your environment is from a local copy of the respective conda/CRAN library repositories. Therefore, not all packages may be available and not all package versions may be available.
It is discouraged to run extensive python/R analyses without submitting them as job requests using Slurm.
Slurm is a workload manager that schedules jobs submitted to a shared resource. Slurm is a well-developed tool that can manage large computing clusters, such as ARCHER2, with thousands of users each with different priorities and allocated computing hours. Inside the TRE, Slurm is used to help ensure all users of the SDF get equitable access. Therefore, users who are submitting jobs with high resource requirements (>80 cores, >1TB of memory) may have to wait longer for resource allocation to enable users with lower resource demands to continue their work.
Slurm is currently set up so all users have equal priority and there is no limit to the total number of CPU hours allocated to a user per month. However, there are limits to the maximum amount of resources that can be allocated to an individual job. Jobs that require more than 200 cores, more than 4TB of memory, or an elapsed runtime of more than 96 hours will be rejected. If users need to submit jobs with large resource demand, they need to submit a resource reservation request by emailing their project's service desk.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#why-do-you-need-to-use-slurm","title":"Why do you need to use Slurm?","text":"The SDF is a resource shared across all projects within the TRE and all users should have equal opportunity to use the SDF to complete resource-intense tasks appropriate to their projects. Users of the SDF are required to consider the needs of the wider community by:
requesting resources appropriate to their intended task and timeline.
submitting resource requests via Slurm to enable automatic scheduling and fair allocation alongside other user requests.
Users can develop code, complete test runs, and debug from the SDF command line without using Slurm. However, only 32 of the 512 cores are accessible without submitting a job request to Slurm. These cores are accessible to all users simultaneously.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#slurm-basics","title":"Slurm basics","text":"Slurm revolves around four main entities: nodes, partitions, jobs and job steps. Nodes and partitions are relevant for more complex distributed computing clusters so Slurm can allocate appropriate resources to jobs across multiple pieces of hardware. Jobs are requests for resources and job steps are what need to be completed once the resources have been allocated (completed in sequence or parallel). Job steps can be further broken down into tasks.
There are four key commands for Slurm users:
squeue: get details on a job or job step, i.e. has a job been allocated resources or is it still pending?
srun: initiate a job step or execute a job. A versatile function that can initiate job steps as part of a larger batch job or submit a job itself to get resources and run a job step. This is useful for testing job steps, experimenting with different resource allocations or running job steps that require large resources but are relatively easy to define in a line or two (i.e. running a sequence alignment).
scancel: stop a job from continuing.
sbatch: submit a job script containing multiple steps (i.e. srun) to be completed with the defined resources. This is the typical function for submitting jobs to Slurm.
More details on these functions (and several not mentioned here) can be seen on the Slurm website.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-a-simple-job","title":"Submitting a simple job","text":"*** SDF Terminal ***\n\nsqueue -u $USER # Check if there are jobs already queued or running for you\n\nsrun --job-name=my_first_slurm_job --nodes 1 --ntasks 10 --cpus-per-task 2 echo 'Hello World'\n\nsqueue -u $USER --state=CD # List all completed jobs\n
In this instance, the srun command completes two steps: job submission and job step execution. First, it submits a job request to be allocated 10 CPUs (1 CPU for each of the 10 tasks). Once the resources are available, it executes the job step consisting of 10 tasks each running the 'echo \"Hello World\"' function.
srun accepts a wide variety of options to specify the resources required to complete its job step. Within the SDF, you must always request 1 node (as there is only one node) and never use the --exclusive option (as no one will have exclusive access to this shared resource). Notice that running srun blocks your terminal from accepting any more commands and the output from each task in the job step, i.e. Hello World in the above example, outputs to your terminal. We will compare this to running a sbatch command.\u0011
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-a-batch-job","title":"Submitting a batch job","text":"Batch jobs are incredibly useful because they run in the background without blocking your terminal. Batch jobs also output the results to a log file rather than straight to your terminal. This allows you to check a job was completed successfully at a later time so you can move on to other things whilst waiting for a job to complete.
A batch job can be submitted to Slurm by passing a job script to the sbatch command. The first few lines of a job script outline the resources to be requested as part of the job. The remainder of a job script consists of one or more srun commands outlining the job steps that need to be completed (in sequence or parallel) once the resources are available. There are numerous options for defining the resource requirements of a job including:
More information on the various options are in the sbatch documentation.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#example-job-script","title":"Example Job Script","text":"#!/usr/bin/env bash\n#SBATCH -J HelloWorld\n#SBATCH --nodes=1\n#SBATCH --tasks-per-node=10\n#SBATCH --cpus-per-task=2\n\n% Run echo task in sequence\n\nsrun --ntasks 5 --cpus-per-task 2 echo \"Series Task A. Time: \" $(date +\u201d%H:%M:%S\u201d)\n\nsrun --ntasks 5 --cpus-per-task 2 echo \"Series Task B. Time: \" $(date +\u201d%H:%M:%S\u201d)\n\n% Run echo task in parallel with the ampersand character\n\nsrun --exclusive --ntasks 5 --cpus-per-task 2 echo \"Parallel Task A. Time: \" $(date +\u201d%H:%M:%S\u201d) &\n\nsrun --exclusive --ntasks 5 --cpus-per-task 2 echo \"Parallel Task B. Time: \" $(date +\u201d%H:%M:%S\u201d)\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#example-job-script-submission","title":"Example job script submission","text":"*** SDF Terminal ***\n\nnano example_job_script.sh\n\n*** Copy example job script above ***\n\nsbatch example_job_script.sh\n\nsqueue -u $USER -r 5\n\n*** Wait for the batch job to be completed ***\n\ncat example_job_script.log # The series tasks should be grouped together and the parallel tasks interspersed.\n
The example batch job is intended to show two things: 1) the usefulness of the sbatch command and 2) the versatility of a job script. As the sbatch command allows you to submit scripts and check their outcome at your own discretion, it is the most common way of interacting with Slurm. Meanwhile, the job script command allows you to specify one global resource request and break it up into multiple job steps with different resource demands that can be completed in parallel or in sequence.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-pythonr-code-to-slurm","title":"Submitting python/R code to Slurm","text":"Although submitting job steps containing python/R analysis scripts can be done with srun directly, as below, it is more common to submit bash scripts that call the analysis scripts after setting up the environment (i.e. after calling conda activate).
**** Python code job submission ****\n\nsrun --job-name=my_first_python_job --nodes 1 --ntasks 10 --cpus-per-task 2 --mem 10G python3 example_script.py\n\n**** R code job submission ****\n\nsrun --job-name=my_first_r_job --nodes 1 --ntasks 10 --cpus-per-task 2 --mem 10G Rscript example_script.R\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#signposting","title":"Signposting","text":"Useful websites for learning more about Slurm:
The Slurm documentation website
The Introduction to HPC carpentries lesson on Slurm
This lesson is adapted from a workshop introducing users to running python scripts on ARCHER2 as developed by Adrian Jackson.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#introduction","title":"Introduction","text":"Python does not have native support for parallelisation. Python contains a Global Interpreter Lock (GIL) which means the python interpreter only allows one thread to execute at a time. The advantage of the GIL is that C libraries can be easily integrated into Python scripts without checking if they are thread-safe. However, this means that most common python modules cannot be easily parallelised. Fortunately, there are now several re-implementations of common python modules that work around the GIL and are therefore parallelisable. Dask is a python module that contains a parallelised version of the pandas data frame as well as a general format for parallelising any python code.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#dask","title":"Dask","text":"Dask enables thread-safe parallelised python execution by creating task graphs (a graph of the dependencies of the inputs and outputs of each function) and then deducing which ones can be run separately. This lesson introduces some general concepts required for programming using Dask. There are also some exercises with example answers to help you write your first parallelised python scripts.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#arrays-data-frames-and-bags","title":"Arrays, data frames and bags","text":"Dask contains three data objects to enable parallelised analysis of large data sets in a way familiar to most python programmers. If the same operations are being applied to a large data set then Dask can split up the data set and apply the operations in parallel. The three data objects that Dask can easily split up are:
Arrays: Contains large numbers of elements in multiple dimensions, but each element must be of the same type. Each element has a unique index that allows users to specify changes to individual elements.
Data frames: Contains large numbers of elements which are typically highly structured with multiple object types allowed together. Each element has a unique index that allows users to specify changes to individual elements.
Bags: Contains large numbers of elements which are semi/un-structured. Elements are immutable once inside the bag. Bags are useful for conducting initial analysis/wrangling of raw data before more complex analysis is performed.
You may need to install dask or create a new conda environment with it in.
conda create -n dask-env gcc_linux-64=11.2.0 python=3.11.3 dask\n\nconda activate dask-env\n
Try running the following Python using dask:
import dask.array as da\n\nx = da.random.random((10000, 10000), chunks=(1000, 1000))\n\nprint(x)\n\nprint(x.compute())\n\nprint(x.sum())\n\nprint(x.sum().compute())\n
This should demonstrate that dask is both straightforward to implement simple parallelism, but also lazy in that it does not compute anything until you force it to with the .compute() function.
You can also try out dask DataFrames, using the following code:
import dask.dataframe as dd\n\ndf = dd.read_csv('surveys.csv')\n\ndf.head()\ndf.tail()\n\ndf.weight.max().compute()\n
You can try using different blocksizes when reading in the csv file, and then undertaking an operation on the data, as follows: Experiment with varying blocksizes, although you should be aware that making your block size too small is likely to cause poor performance (the blocksize affects the number of bytes read in at each operation).
df = dd.read_csv('surveys.csv', blocksize=\"10000\")\ndf.weight.max().compute()\n
You can also experiment with Dask Bags to see how that functionality works:
import dask.bag as db\nfrom operator import add\nb = db.from_sequence([1, 2, 3, 4, 5], npartitions=2)\nprint(b.compute())\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#dask-delayed","title":"Dask Delayed","text":"Dask delayed lets you construct your own task graphs/parallelism from Python functions. You can find out more about dask delayed from the dask documentation Try parallelising the code below using the .delayed function or the @delayed decorator, an example answer can be found here.
def inc(x):\n return x + 1\n\ndef double(x):\n return x * 2\n\ndef add(x, y):\n return x + y\n\ndata = [1, 2, 3, 4, 5]\n\noutput = []\nfor x in data:\n a = inc(x)\n b = double(x)\n c = add(a, b)\n output.append(c)\n\ntotal = sum(output)\n\nprint(total)\n
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#mandelbrot-exercise","title":"Mandelbrot Exercise","text":"The code below calculates the members of a Mandelbrot set using Python functions:
import sys\nimport time\nimport numpy as np\nimport matplotlib.pyplot as plt\n\ndef mandelbrot(h, w, maxit=20, r=2):\n \"\"\"Returns an image of the Mandelbrot fractal of size (h,w).\"\"\"\n start = time.time()\n\n x = np.linspace(-2.5, 1.5, 4*h+1)\n\n y = np.linspace(-1.5, 1.5, 3*w+1)\n\n A, B = np.meshgrid(x, y)\n\n C = A + B*1j\n\n z = np.zeros_like(C)\n\n divtime = maxit + np.zeros(z.shape, dtype=int)\n\n for i in range(maxit):\n z = z**2 + C\n diverge = abs(z) > r # who is diverging\n div_now = diverge & (divtime == maxit) # who is diverging now\n divtime[div_now] = i # note when\n z[diverge] = r # avoid diverging too much\n\n end = time.time()\n\n return divtime, end-start\n\nh = 2000\nw = 2000\n\nmandelbrot_space, time = mandelbrot(h, w)\n\nplt.imshow(mandelbrot_space)\n\nprint(time)\n
Your task is to parallelise this code using Dask Array functionality. Using the base python code above, extend it with Dask Array for the main arrays in the computation. Remember you need to specify a chunk size with Dask Arrays, and you will also need to call compute at some point to force Dask to actually undertake the computation. Note, depending on where you run this you may not see any actual speed up of the computation. You need access to extra resources (compute cores) for the calculation to go faster. If in doubt, submit a python script of your solution to the SDF compute nodes to see if you see speed up there. If you are struggling with this parallelisation exercise, there is a solution available for you here.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#pi-exercise","title":"Pi Exercise","text":"The code below calculates Pi using a function that can split it up into chunks and calculate each chunk separately. Currently it uses a single chunk to produce the final value of Pi, but that can be changed by calling pi_chunk multiple times with different inputs. This is not necessarily the most efficient method for calculating Pi in serial, but it does enable parallelisation of the calculation of Pi using multiple copies of pi_chunk called simultaneously.
import time\nimport sys\n\n# Calculate pi in chunks\n\n# n - total number of steps to be undertaken across all chunks\n# lower - the lowest number of this chunk\n# upper - the upper limit of this chunk such that i < upper\n\ndef pi_chunk(n, lower, upper):\n step = 1.0 / n\n p = step * sum(4.0/(1.0 + ((i + 0.5) * (i + 0.5) * step * step)) for i in range(lower, upper))\n return p\n\n# Number of slices\n\nnum_steps = 10000000\n\nprint(\"Calculating PI using:\\n \" + str(num_steps) + \" slices\")\n\nstart = time.time()\n\n# Calculate using a single chunk containing all steps\n\np = pi_chunk(num_steps, 1, num_steps)\n\nstop = time.time()\n\nprint(\"Obtained value of Pi: \" + str(p))\n\nprint(\"Time taken: \" + str(stop - start) + \" seconds\")\n
For this exercise, your task is to implemented the above code on the SDF, and then parallelise using Dask. There are a number of different ways you could parallelise this using Dask, but we suggest using the Futures map functionality to run the pi_chunk function on a range of different inputs. Futures map has the following definition:
Client.map(func, *iterables[, key, workers, ...])\n
Where func is the function you want to run, and then the subsequent arguments are inputs to that function. To utilise this for the Pi calculation, you will first need to setup and configure a Dask Client to use, and also create and populate lists or vectors of inputs to be passed to the pi_chunk function for each function run that Dask launches.
If you run Dask with processes then it is possible that you will get errors about forking processes, such as these:
An attempt has been made to start a new process before the current process has finished its bootstrapping phase.\n This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:\n
In that case you need to encapsulate your code within a main function, using something like this:
if __name__ == \"__main__\":\n
If you are struggling with this exercise then there is a solution available for you here.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#signposting","title":"Signposting","text":"More information on parallelised python code can be found in the carpentries lesson
Dask itself has several detailed tutorials
This lesson is adapted from a workshop introducing users to running R scripts on ARCHER2 as developed by Adrian Jackson.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#introduction","title":"Introduction","text":"In this exercise we are going to try different methods of parallelising R on the SDF. This will include single node parallelisation functionality (e.g. using threads or processes to use cores within a single node), and distributed memory functionality that enables the parallelisation of R programs across multiple nodes.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#example-parallelised-r-code","title":"Example parallelised R code","text":"You may need to activate an R conda environment.
conda activate r-v4.2\n
Try running the following R script using R on the SDF login node:
n <- 8*2048\nA <- matrix( rnorm(n*n), ncol=n, nrow=n )\nB <- matrix( rnorm(n*n), ncol=n, nrow=n )\nC <- A %*% B\n
You can run this as follows on the SDF (assuming you have saved the above code into a file named matrix.R):
Rscript ./matrix.R\n
You can check the resources used by R when running on the login node using this command:
top -u $USER\n
If you run the R script in the background using &, as follows, you can then monitor your run using the top command. You may notice when you run your program that at points R uses many more resources than a single core can provide, as demonstrated below:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND\n 178357 adrianj 20 0 15.542 0.014t 13064 R 10862 2.773 9:01.66 R\n
In the example above it can be seen that >10862% of a single core is being used by R. This is an example of R using automatic parallelisation. You can experiment with controlling the automatic parallelisation using the OMP_NUM_THREADS variable to restrict the number of cores available to R. Try using the following values:
export OMP_NUM_THREADS=8\n\nexport OMP_NUM_THREADS=4\n\nexport OMP_NUM_THREADS=2\n
You may also notice that not all the R script is parallelised. Only the actual matrix multiplication is undertaken in parallel, the initialisation/creation of the matrices is done in serial.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-datatables","title":"Parallelisation with data.tables","text":"We can also experiment with the implicit parallelism in other libraries, such as data.table. You will first need to install this library on the SDF. To do this you can simply run the following command:
install.packages(data.table)\n
Once you have installed data.table you can experiment with the following code:
library(data.table)\nvenue_data <- data.table( ID = 1:50000000,\nCapacity = sample(100:1000, size = 50000000, replace = T), Code = sample(LETTERS, 50000000, replace = T),\nCountry = rep(c(\"England\",\"Scotland\",\"Wales\",\"NorthernIreland\"), 50000000))\nsystem.time(venue_data[, mean(Capacity), by = Country])\n
This creates some random data in a large data table and then performs a calculation on it. Try running R with varying numbers of threads to see what impact that has on performance. Remember, you can vary the number of threads R uses by setting OMP_NUM_THREADS= before you run R. If you want to try easily varying the number of threads you can save the above code into a script and run it using Rscript, changing OMP_NUM_THREADS each time you run it, e.g.:
export OMP_NUM_THREADS=1\n\nRscript ./data_table_test.R\n\nexport OMP_NUM_THREADS=2\n\nRscript ./data_table_test.R\n
The elapsed time that is printed out when the calculation is run represents how long the script/program took to run. It\u2019s important to bear in mind that, as with the matrix multiplication exercise, not everything will be parallelised. Creating the data table is done in serial so does not benefit from the addition of more threads.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#loop-and-function-parallelism","title":"Loop and function parallelism","text":"R provides a number of different functions to run loops or functions in parallel. One of the most common functions is to use are the {X}apply
functions:
apply
Apply a function over a matrix or data frame
lapply
Apply a function over a list, vector, or data frame
sapply
Same as lapply
but returns a vector
vapply
Same as sapply
but with a specified return type that improves safety and can improve speed
For example:
res <- lapply(1:3, function(i) {\n sqrt(i)*sqrt(i*2)\n })\n
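For comparison, the same computation with sapply and vapply is sketched below; the FUN.VALUE argument (here numeric(1)) is what tells vapply that every iteration must return a single numeric value:
res_s <- sapply(1:3, function(i) {\n sqrt(i)*sqrt(i*2)\n })\nres_v <- vapply(1:3, function(i) {\n sqrt(i)*sqrt(i*2)\n }, FUN.VALUE = numeric(1))\n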
The {X}apply
functionality supports iteration over a dataset without requiring a loop to be constructed. However, the functions outlined above do not exploit parallelism, even though there is potential for parallelisation in many operations that use them.
There are a number of mechanisms that can be used to implement parallelism using the {X}apply
functions. One of the simplest is using the parallel
library, and the mclapply
function:
library(parallel)\nres <- mclapply(1:3, function(i) {\n sqrt(i)\n})\n
Try experimenting with the above functions on large numbers of iterations, both with lapply and mclapply. Can you achieve better performance using the MC_CORES environment variable to specify how many parallel processes R uses to complete these calculations? The default on the SDF is 2 cores, but you can increase this in the same way we did for OMP_NUM_THREADS, e.g.:
export MC_CORES=16\n
Try different numbers of iterations of the functions (e.g. change 1:3 in the code to something much larger), and different numbers of parallel processes, e.g.:
export MC_CORES=2\n\nexport MC_CORES=8\n\nexport MC_CORES=16\n
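To make the comparison concrete you can time both calls on a larger number of iterations inside the same script; the sketch below is one way to do this (the 100000 iterations and the mc.cores value of 8 are just example choices):
library(parallel)\nn_iter <- 100000\n# serial version\nprint( system.time( res_serial <- lapply(1:n_iter, function(i) sqrt(i)*sqrt(i*2)) ) )\n# parallel version; mc.cores can also be set directly, overriding MC_CORES\nprint( system.time( res_parallel <- mclapply(1:n_iter, function(i) sqrt(i)*sqrt(i*2), mc.cores = 8) ) )\n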
If you have separate functions then the above approach will provide a simple method for parallelising using the resources within a single node. However, if your functionality is more loop-based, then you may not wish to have to package this up into separate functions to parallelise.
"},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-foreach","title":"Parallelisation with foreach","text":"The foreach
package can be used to parallelise loops as well as functions. Consider a loop of the following form:
main_list <- c()\nfor (i in 1:3) {\n main_list <- c(main_list, sqrt(i))\n}\n
This can be converted to foreach
functionality as follows:
main_list <- c()\nlibrary(foreach)\nforeach(i=1:3) %do% {\n main_list <- c(main_list, sqrt(i))\n}\n
Whilst this approach does not significantly change the performance or functionality of the code, it does let us then exploit parallel functionality in foreach. The %do%
can be replaced with a %dopar%
which will execute the code in parallel.
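One point worth noting before switching to %dopar%: appending to main_list inside the loop is a side effect, and parallel workers cannot modify variables in the main session. The idiomatic foreach form instead collects the value returned by each iteration, as in this sketch:
library(foreach)\n# .combine='c' concatenates the per-iteration results into a single vector\nmain_list <- foreach(i=1:3, .combine='c') %do% {\n sqrt(i)\n}\n
Written this way, the loop behaves identically whether %do% or %dopar% is used.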
To test this out we\u2019re going to try an example using the randomForest
library. We can now run the following code in R:
library(foreach)\nlibrary(randomForest)\nx <- matrix(runif(50000), 1000)\ny <- gl(2, 500)\nrf <- foreach(ntree=rep(250, 4), .combine=combine) %do%\nrandomForest(x, y, ntree=ntree)\nprint(rf)\n
Implement the above code and run with a system.time
to see how long it takes. Once you have done this you can change the %do%
to a %dopar%
and re-run. Does this provide any performance benefits?
To exploit the parallelism with dopar
we need to provide parallel execution functionality and configure it to use extra cores on the system. One method to do this is using the doParallel
package.
library(doParallel)\nregisterDoParallel(8)\n
Does this now improve performance when running the randomForest
example? Experiment with different numbers of workers by changing the number set in registerDoParallel(8)
to see what kind of performance you can get. Note, you may also need to change the number of clusters used in the foreach, e.g. what is specified in the rep(250, 4)
part of the code, to enable more than 4 different sets to be run at once if using more than 4 workers. The number of parallel workers you can exploit depends on the hardware you have access to, the number of workers you specify when you set up your parallel backend, and the number of chunks of work you have to distribute with your foreach configuration.
It is possible to use different parallel backends for foreach
. The one we have used in the example above creates new worker processes to provide the parallelism, but you can also use larger numbers of workers through a parallel cluster, e.g.:
my.cluster <- parallel::makeCluster(8)\nregisterDoParallel(cl = my.cluster)\n
By default makeCluster
creates a socket cluster, where each worker is a new independent process. This can enable running the same R program across a range of systems, as it works on Linux and Windows (and other clients). However, you can also fork the existing R process to create your new workers, e.g.:
cl <- makeCluster(5, type=\"FORK\")\n
This saves you from having to export the variables or objects that were set up in the R program/script prior to the creation of the cluster, as they are automatically copied to the workers when using this forking mode. However, it is limited to Linux-style systems and cannot scale beyond a single node.
Once you have finished using a parallel cluster you should shut it down to free up computational resources, using stopCluster
, e.g.:
stopCluster(cl)\n
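Putting the pieces together, one possible backend lifecycle for the randomForest example is sketched below (the 8 workers and the rep(250, 8) task split are example values chosen so there is one chunk of trees per worker; .packages ensures randomForest is loaded on each worker):
library(doParallel)\nlibrary(foreach)\nlibrary(randomForest)\nx <- matrix(runif(50000), 1000)\ny <- gl(2, 500)\ncl <- parallel::makeCluster(8)\nregisterDoParallel(cl)\nrf <- foreach(ntree=rep(250, 8), .combine=combine, .packages='randomForest') %dopar%\n randomForest(x, y, ntree=ntree)\nparallel::stopCluster(cl)\nprint(rf)\n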
When using clusters without the forking approach, you need to distribute objects and variables from the main process to the workers using the clusterExport
function, e.g.:
library(parallel)\nvariableA <- 10\nvariableB <- 20\nmySum <- function(x) variableA + variableB + x\ncl <- makeCluster(4)\nres <- try(parSapply(cl=cl, 1:40, mySum))\n
The program above will fail because variableA
and variableB
are not present on the cluster workers. Try the above on the SDF and see what result you get.
To fix this issue you can modify the program using clusterExport
to send variableA
and variableB
to the workers, prior to running the parSapply
e.g.:
clusterExport(cl=cl, c('variableA', 'variableB'))\n
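For completeness, a corrected version of the whole program might look like the following sketch, with the export placed after the cluster is created and before parSapply runs:
library(parallel)\nvariableA <- 10\nvariableB <- 20\nmySum <- function(x) variableA + variableB + x\ncl <- makeCluster(4)\n# copy the two variables to every worker before they are needed\nclusterExport(cl=cl, c('variableA', 'variableB'))\nres <- parSapply(cl=cl, 1:40, mySum)\nstopCluster(cl)\nprint(res)\n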
"},{"location":"safe-haven-services/tre-container-user-guide/advise-ig-required-software-stack/","title":"Advising Information Governance of required software stack","text":"Projects must establish that the software stack they intend to import in the container is acceptable for the project\u2019s IG approvals. Projects should only seek to use container-based software if the standard TRE desktop environment is not sufficient for the research scope. However, it is broadly understood that the standard desktop software, whilst useful in most cases, is inadequate for many purposes and specifically ML, and software import using containers is intended to address this.
"},{"location":"safe-haven-services/tre-container-user-guide/building-and-testing-containers/","title":"Building and Testing Containers","text":""},{"location":"safe-haven-services/tre-container-user-guide/building-and-testing-containers/#choose-a-container-base-from-dockerhub","title":"Choose a container base from DockerHub","text":"Projects should build containers by starting with a well-known application base container on a public registry. Projects should add a minimum of additional project software and packages so that the container is clearly built for a specific purpose. Containers built for one specific batch job, either a data transformation or analysis, are examples of this approach. Container builds that assemble groups of tools and then used to run a variety of tasks should be avoided. Additionally, container builds that start from generic distributions such as Debian or Ubuntu should also be avoided as leaner and more focussed application and language containers are already available.
Examples of batch job container bases are Python and PyTorch, and other language specific and ML software stacks. Examples of interactive container bases are Rocker, Jupyter Docker Stacks, and NVIDIA RAPIDS extended with additional package sets and code required by your project.
"},{"location":"safe-haven-services/tre-container-user-guide/building-and-testing-containers/#add-tre-file-system-directories","title":"Add TRE file system directories","text":"Container images built to run in the TRE must implement the following line in the Dockerfile to interface with the project data and the TRE file system:
RUN mkdir /safe_data /safe_outputs /scratch\n
The project\u2019s private /safe_data/<project id>
directory is mapped to the container\u2019s /safe_data
directory. A unique container job output directory is created in the user's home directory and mapped to /safe_outputs
in the container. And /scratch
is a temporary workspace that is deleted when the container exits. If the container processing uses the TMP
environment variable, it should be set to the /scratch
volume mount. Hence, containers have access to three directories located on the host system as detailed in the following table:
/safe_data/<your_project_name>/
/safe_data
Read-only access if required by IG, or read-write access, to data and other project files. ~/outputs_<unique_id>
/safe_outputs
Will be created at container startup as an empty directory. Intended for any outputs: logs, data, models. ~/scratch_<unique_id>
/scratch
Temporary directory that is removed after container termination on the host system. Any temporary files should be placed here. Currently, temporary files can also be written into any directory in the container\u2019s internal file system. This is allowed to prevent container software failure when it is dependent on the container file system being writable. However, the use of /scratch
is encouraged for two reasons:
/scratch
. Writing only to /scratch
at runtime is therefore future proof./scratch
can be more efficient if the service is able to mount it on high-performing storage devices.Research software stacks are complex and dependent on many package sets and libraries, and sometimes on specific version combinations of these. The container build process presents the project team with the opportunity to manage and resolve these dependencies before contact with the project data in the restricted TRE setting.
Unlike the TRE desktop servers, containers do not have access to external network repositories, and do not have access to external licence servers. Any container software that requires a licence to run must be copied into the container at build time. EPCC are not responsible for verifying that the appropriate licences are installed or that license terms are being met.
Projects using configuration files for multiple containers, for example ML models, can also import these to the TRE directly by copying them to the project /safe_data
file system.
Batch jobs built to run as containers should start directly from the ENTRYPOINT
or CMD
section of the Dockerfile. Batch jobs should run without any user interaction after start, should read input from the /safe_data
directory and write outputs to the /scratch
and /safe_outputs
directories.
Interactive containers should present a connection port for the user session. Once the container is started the user can connect to the session port from the TRE desktop. If code files are being changed during the session these should be saved on the /safe_data
or the /safe_outputs
directories as the internal container file space is not preserved when the session ends.
When the container is running in the TRE it will not have any external network or internet access. If the code, or any of its dependencies, rely on data files downloaded at runtime (for example machine learning models) this will fail in the TRE. Code must be modified to load these files explicitly from a file path which is accessible from inside the container.
An example of TRE network isolation significance and the considerations arising from this is provided by Hugging Face, where models are cached in the user local ~/.cache/huggingface/hub/
directory in the container. The environment variable HF_HOME
must be set to a directory in a /safe_data
project folder and the cache_dir
option of the from_pretrained()
call used at runtime.
If a model is downloaded from Hugging Face the advice is to set the environment variable HF_HUB_OFFLINE=1
to prevent attempts at contact the Hugging Face Hub. Any connection attempts will take a significant time to elapse and then fail in the TRE setting.
It is recommended that the checklist for Dockerfile composition is followed: Container Build Guide
Information Governance requirements may require a security scan of your container, and Trivy is a tool that can help with this task. Trivy inspects container images to find items which have known vulnerabilities and produces a report that may be used to help assess the risk. The use of the Trivy misconfiguration tool on Dockerfiles is also recommended. This tool option will highlight many common security issues:
docker run --rm -v $(pwd):/repo ghcr.io/aquasecurity/trivy:latest config \"/repo/Dockerfile\"\n
The security posture of containers and the build process may be of interest to IG teams, however, it is not expected that security issues indicated by the tool need to be addressed before the container is run in the TRE unless the IG team issues specific guidance on vulnerability and configuration remediation and mitigation.
"},{"location":"safe-haven-services/tre-container-user-guide/building-and-testing-containers/#scan-container-using-trivy-ci","title":"Scan container using Trivy CI","text":"Trivy can be run manually but it is easier to have it run automatically whenever you update your container image. An example GitHub Actions workflow to run Trivy and publish the outputs can be found here
The Trivy report can be downloaded as an artifact from the job summary page. Before using a specific container in the TRE it may be necessary to test the security risk and gain IG team approval.
"},{"location":"safe-haven-services/tre-container-user-guide/development-workflow/","title":"Development workflow","text":"The general guidance for TRE container development is:
Develop all code on a git platform, typically GitHub or a university managed GitLab service
Start Dockerfiles from a well-known base image. This is especially important if using a GPU as your container will need to have a compatible version of CUDA (currently 11.1 or later)
Add all the additional content (code files, libraries, packages, data, and licences) needed for your analysis work to your Dockerfile
Build and test the container to ensure that it has no external runtime dependencies
Push the Dockerfile to the project git repository so the container image build is recorded
Push the container image to the GitHub container registry ghcr.io (GHCR)
Login to a TRE desktop enabled for container execution to pull and run the container
Containers are connected to three external directories when run inside the TRE: one for access to the project data files (which may be read-only in some cases); one for temporary work files that are all deleted when the container exits; and one for the job output files (which may be set as read-only in some cases when the container exits). All container outputs remain inside the TRE project file space and there is no automatic export when the container finishes.
Container images that have been pulled into the TRE are destroyed after they have been run. Only the files written to the container outputs directory are guaranteed to be retained.
You must ensure that, apart from the input data, your container has everything that it needs to run, including all code and dependencies, and any ancillary files such as machine learning models. It is likely that your development environment, which is always outside the TRE, does not normally have these three directories, but you need to build a container that uses them (see below for path names) because there is no ability inside the TRE to change which directories are available.
The input data file names may change so you may not want to hard-code them into your container. For example, instead of your code using open(\"/safe_data/my_patients.csv\")
you should consider putting a list of input file names into a config file and reading that config file in your container start up to determine which input data files to use. This will allow you to re-run your container on different data sets much faster than building a new container each time.
This guide sets out the required activities for researchers using containers in the EPCC TRE (Safe Haven). The intended audience are software developers with experience of containers and Docker and Podman in particular. Online courses such as Intro to Containers demonstrate the base skills needed if there is any doubt.
The Container Execution Service (CES) has been introduced to allow project code developed and tested by researchers outside the TRE in personal development environments to be imported and run on the project data inside the TRE using a well-documented, transparent, secure workflow. The primary role of the TRE is to store and share data securely; it is not intended to be a software development and testing environment. The CES removes the need for software development in the TRE.
The use of containers and software configuration management processes is also strongly advocated by the research software engineering community for experiment management and reproducibility. It is recommended that TRE container builders take a disciplined approach to code management and use git to create container build audit trails to satisfy any IG (Information Governance) concerns about the provenance of the project code.
"},{"location":"safe-haven-services/tre-container-user-guide/using-containers-in-the-tre/","title":"Using Containers in the TRE","text":"Once you have built and tested your container, you are ready to start using it within the TRE.
"},{"location":"safe-haven-services/tre-container-user-guide/using-containers-in-the-tre/#pulling-a-container-into-the-tre","title":"Pulling a container into the TRE","text":"Containers can only be used on the TRE desktop hosts using shell commands. And containers can only be pulled from the GitHub Container Registry (GHCR) into the TRE using a ces-pull
script. Hence containers must be pushed to GHCR for them to be used in the TRE.
As use of containers in the TRE is a new service, it is at this stage regarded as an activity that requires additional security controls. As result the ces-pull
command is a privileged one that can only be run using sudo. Researcher accounts must be explicitly enabled for use of the sudo ces-pull
command through IG approval \u2013 sudo access for these accounts will be constrained to only run the ces-pull
command.
To pull a private image, you must create an access token to authenticate with GHCR (see Authenticating to the container registry). The container is then pulled by the user with the command:
sudo ces-pull <github_user> <github_token> ghcr.io/<namespace>/<container_name>[:<container_tag>]\n
To pull a public image, which does not require authenticating with username and token, pass two empty strings:
sudo ces-pull \"\" \"\" ghcr.io/<namespace>/<container_name>[:<container_tag>]\n
Once the container image has been pulled into the TRE desktop host, the image can be managed with Podman commands. However, containers must not be run directly using Podman. Instead, commands developed for use within the TRE must be used as will now be described.
"},{"location":"safe-haven-services/tre-container-user-guide/using-containers-in-the-tre/#running-the-container-in-the-tre","title":"Running the container in the TRE","text":"Containers may be run in the TRE using one of two commands: use ces-gpu-run
if a GPU is to be connected to the container, otherwise use the ces-run
command. The sudo privilege escalation is not required to run containers. The basic command to start a container is one of:
ces-run ghcr.io/<namespace>/<container_name>[:<container_tag>]\n
or, if your container requires a GPU:
ces-gpu-run ghcr.io/<namespace>/<container_name>[:<container_tag>]\n
Each command supports a number of options to control resource allocation and to pass parameters to the podman run command and to the container itself. Each command has a help option to output the following information:
Usage:\nces-run [options] <container>\nAvailable Options:\n-c|--cores CPU cores to allocate (default is sharing all of them)\n--dry-run Do not run the container, print out all the command options\n--env-file File with env vars to pass to container\n-h|--help Print this stuff\n-m|--memory Memory to allocate in Gb (default is 4Gb)\n-n|--name Assign a name to the container\n--opt-file File with additional options to pass to run command\n-v|--verbose Print out all command options\n--version Print out version string\n
The --env-file
and --opt-file
arguments can be used to extend the command-line script that is executed when the container is started. The --env-file
option is exactly the docker and podman run option with the file containing lines of the form Variable=Value
. See the Docker option reference
The --opt-file
option allows you to have a file containing additional arguments to the ces-run
and ces-gpu-run
command.
Data Science Virtual Desktops
Managed File Transfer
Notebooks
Cerebras CS-2
Ultra2
Graphcore Bow Pod64
"},{"location":"services/#data-services","title":"Data Services","text":"S3
Data Catalogue
"},{"location":"services/cs2/","title":"Cerebras CS-2","text":"Get Access
Running codes
"},{"location":"services/cs2/access/","title":"Cerebras CS-2","text":""},{"location":"services/cs2/access/#getting-access","title":"Getting Access","text":"Access to the Cerebras CS-2 system is currently by arrangement with EPCC. Please email eidf@epcc.ed.ac.uk with a short description of the work you would like to perform.
"},{"location":"services/cs2/run/","title":"Cerebras CS-2","text":""},{"location":"services/cs2/run/#introduction","title":"Introduction","text":"The Cerebras CS-2 Wafer-scale cluster (WSC) uses the Ultra2 system as a host system which provides login services, access to files, the SLURM batch system etc.
"},{"location":"services/cs2/run/#connecting-to-the-cluster","title":"Connecting to the cluster","text":"To gain access to the CS-2 WSC you need to login to the host system, Ultra2. See the documentation for Ultra2.
"},{"location":"services/cs2/run/#running-jobs","title":"Running Jobs","text":"All jobs must be run via SLURM to avoid inconveniencing other users of the system. An example job is shown below.
"},{"location":"services/cs2/run/#slurm-example","title":"SLURM example","text":"This is based on the sample job from the Cerebras documentation Cerebras documentation - Execute your job
#!/bin/bash\n#SBATCH --job-name=Example # Job name\n#SBATCH --cpus-per-task=2 # Request 2 cores\n#SBATCH --output=example_%j.log # Standard output and error log\n#SBATCH --time=01:00:00 # Set time limit for this job to 1 hour\n#SBATCH --gres=cs:1 # Request CS-2 system\n\nsource venv_cerebras_pt/bin/activate\npython run.py \\\n CSX \\\n --params params.yaml \\\n --num_csx=1 \\\n --model_dir model_dir \\\n --mode {train,eval,eval_all,train_and_eval} \\\n --mount_dirs {paths to modelzoo and to data} \\\n --python_paths {paths to modelzoo and other python code if used}\n
See the 'Troubleshooting' section below for known issues.
"},{"location":"services/cs2/run/#creating-an-environment","title":"Creating an environment","text":"To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions to do this Cerebras setup environment docs however our host system is slightly different so we recommend the following:
"},{"location":"services/cs2/run/#create-the-venv","title":"Create the venv","text":"python3.8 -m venv venv_cerebras_pt\n
"},{"location":"services/cs2/run/#install-the-dependencies","title":"Install the dependencies","text":"source venv_cerebras_pt/bin/activate\npip install --upgrade pip\npip install cerebras_pytorch==2.2.1\n
"},{"location":"services/cs2/run/#validate-the-setup","title":"Validate the setup","text":"source venv_cerebras_pt/bin/activate\ncerebras_install_check\n
"},{"location":"services/cs2/run/#paths-pythonpath-and-mount_dirs","title":"Paths, PYTHONPATH and mount_dirs","text":"There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras which explains these parameters and how they should be used. Python, paths and mount directories.
"},{"location":"services/datacatalogue/","title":"EIDF Data Catalogue Information","text":"Metadata information
"},{"location":"services/datacatalogue/metadata/","title":"EIDF Metadata Information","text":""},{"location":"services/datacatalogue/metadata/#what-is-fair","title":"What is FAIR?","text":"FAIR stands for Findable, Accessible, Interoperable, and Reusable, and helps emphasise the best practices with publishing and sharing data (more details: FAIR Principles)
"},{"location":"services/datacatalogue/metadata/#what-is-metadata","title":"What is metadata?","text":"Metadata is data about data, to help describe the dataset. Common metadata fields are things like the title of the dataset, who produced it, where it was generated (if relevant), when it was generated, and some key words describing it
"},{"location":"services/datacatalogue/metadata/#what-is-ckan","title":"What is CKAN?","text":"CKAN is a metadata catalogue - i.e. it is a database for metadata rather than data. This will help with all aspects of FAIR:
Using a standard vocabulary (such as the FAST Vocabulary) has many benefits:
All of these advantages mean that we, as a project, don't need to think about this - there is no need to reinvent the wheel when other institutes (e.g. National Libraries) have created. You might recognise WorldCat - it is an organisation which manages a global catalogue of ~18000 libraries world-wide, so they are in a good position to generate a comprehensive vocabulary of academic topics!
"},{"location":"services/datacatalogue/metadata/#what-about-licensing-what-does-cc-by-sa-40-mean","title":"What about licensing? (What does CC-BY-SA 4.0 mean?)","text":"The R in FAIR stands for reusable - more specifically it includes this subphrase: \"(Meta)data are released with a clear and accessible data usage license\". This means that we have to tell anyone else who uses the data what they're allowed to do with it - and, under the FAIR philosophy, more freedom is better.
CC-BY-SA 4.0 allows anyone to remix, adapt, and build upon your work (even for commercial purposes), as long as they credit you and license their new creations under the identical terms. It also explicitly includes Sui Generis Database Rights, giving rights to the curation of a database even if you don't have the rights to the items in a database (e.g. a Spotify playlist, even though you don't own the rights to each track).
Human readable summary: Creative Commons 4.0 Human Readable Full legal code: Creative Commons 4.0 Legal Code
"},{"location":"services/datacatalogue/metadata/#im-stuck-how-do-i-get-help","title":"I'm stuck! How do I get help?","text":"Contact the EIDF Service Team via eidf@epcc.ed.ac.uk
"},{"location":"services/gpuservice/","title":"Overview","text":"The EIDF GPU Service (EIDF GPU Service) provides access to a range of Nvidia GPUs, in both full GPU and MIG variants. The EIDF GPU Service is built upon Kubernetes.
MIG (Multi-instance GPU) allow a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion.
The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU respectively.
The service provides access to:
The current full specification of the EIDF GPU Service as of 14 February 2024:
Quotas
This is the full configuration of the cluster.
Each project will have access to a quota across this shared configuration.
Changes to the default quota must be discussed and agreed with the EIDF Services team.
NOTE
If you request a GPU on the EIDF GPU Service you will be assigned one at random unless you specify a GPU type. Please see Getting started with Kubernetes to learn about specifying GPU resources.
"},{"location":"services/gpuservice/#service-access","title":"Service Access","text":"Users should have an EIDF Account as the EIDF GPU Service is only accessible through EIDF Virtual Machines.
Existing projects can request access to the EIDF GPU Service through a service request to the EIDF helpdesk or emailing eidf@epcc.ed.ac.uk .
New projects wanting to using the GPU Service should include this in their EIDF Project Application.
Each project will be given a namespace within the EIDF GPU service to operate in.
This namespace will normally be the EIDF Project code appended with \u2019ns\u2019, i.e. eidf989ns
for a project with code 'eidf989'.
Once access to the EIDF GPU service has been confirmed, Project Leads will be give the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available here.
All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl command line tool.
The VM does not require to be GPU-enabled.
A quick check to see if a VM has access to the EIDF GPU service can be completed by typing kubectl -n <project-namespace> get jobs
in to the command line.
If this is first time you have connected to the GPU service the response should be No resources found in <project-namespace> namespace
.
EIDF GPU Service vs EIDF GPU-Enabled VMs
The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs.
This allows a project to access multiple GPUs of different types.
An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.
Projects do not have to apply for a GPU-enabled VM to access the GPU Service.
"},{"location":"services/gpuservice/#project-quotas","title":"Project Quotas","text":"A standard project namespace has the following initial quota (subject to ongoing review):
Quota is a maximum on a Shared Resource
A project quota is the maximum proportion of the service available for use by that project.
Any submitted job requests that would exceed the total project quota will be queued.
"},{"location":"services/gpuservice/#project-queues","title":"Project Queues","text":"EIDF GPU Service is introducing the Kueue system in February 2024. The use of this is detailed in the Kueue.
Job Queuing
During periods of high demand, jobs will be queued awaiting resource availability on the Service.
As a general rule, the higher the GPU/CPU/Memory resource request of a single job the longer it will wait in the queue before enough resources are free on a single node for it be allocated.
GPUs in high demand, such as Nvidia H100s, typically have longer wait times.
Furthermore, a project may have a quota of up to 12 GPUs but due to demand may only be able to access a smaller number at any given time.
"},{"location":"services/gpuservice/#additional-service-policy-information","title":"Additional Service Policy Information","text":"Additional information on service policies can be found here.
"},{"location":"services/gpuservice/#eidf-gpu-service-tutorial","title":"EIDF GPU Service Tutorial","text":"This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it is not a comprehensive overview of Kubernetes.
Lesson Objective Getting started with Kubernetes a. What is Kubernetes?b. How to send a task to a GPU node.c. How to define the GPU resources needed. Requesting persistent volumes with Kubernetes a. What is a persistent volume? b. How to request a PV resource. Running a PyTorch task a. Accessing a Pytorch container.b. Submitting a PyTorch task to the cluster.c. Inspecting the results. Template workflow a. Loading large data sets asynchronously.b. Manually or automatically building Docker images.c. Iteratively changing and testing code in a job."},{"location":"services/gpuservice/#further-reading-and-help","title":"Further Reading and Help","text":"The Nvidia developers blog provides several examples of how to run ML tasks on a Kubernetes GPU cluster.
Kubernetes documentation has a useful kubectl cheat sheet.
More detailed use cases for the kubectl
can be found in the Kubernetes documentation.
The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM will have access to all EIDF resources for your project and can be accessed through the VDI (SSH or if enabled RDP) or via the EIDF SSH Gateway.
"},{"location":"services/gpuservice/faq/#how-do-i-obtain-my-project-kubeconfig-file","title":"How do I obtain my project kubeconfig file?","text":"Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. Project Leads and Managers can provide the file on any of the project VMs or give it to individuals within the project.
"},{"location":"services/gpuservice/faq/#access-to-gpu-service-resources-in-default-namespace-is-forbidden","title":"Access to GPU Service resources in default namespace is 'Forbidden'","text":"Error from server (Forbidden): error when creating \"myjobfile.yml\": jobs is forbidden: User <user> cannot create resource \"jobs\" in API group \"\" in the namespace \"default\"\n
Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when the project namespace is not included in the kubectl command for submitting job/pods and kubectl tries to use the \"default\" namespace which projects do not have permissions to use. Resubmitting the job/pod with kubectl -n <project-namespace> create \"myjobfile.yml\"
should solve the issue.
The current PVC provisioner is based on Ceph RBD. The block devices provided by Ceph to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time. They can only be accessed by one pod at a time, once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation.
"},{"location":"services/gpuservice/faq/#how-many-gpus-can-i-use-in-a-pod","title":"How many GPUs can I use in a pod?","text":"The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, you will be placed in a queue until a node with 8 GPUs is free or other jobs to run. If you request 4 GPUs this could run on a node with 4 or 8 GPUs.
"},{"location":"services/gpuservice/faq/#why-did-a-validation-error-occur-when-submitting-a-pod-or-job-with-a-valid-specification-file","title":"Why did a validation error occur when submitting a pod or job with a valid specification file?","text":"If an error like the below occurs:
error: error validating \"myjobfile.yml\": error validating data: the server does not allow access to the requested resource; if you choose to ignore these errors, turn validation off with --validate=false\n
There may be an issue with the kubectl version that is being run. This can occur if installing in virtual environments or from packages repositories.
The current version verified to operate with the GPU Service is v1.24.10. kubectl and the Kubernetes API version can suffer from version skew if not with a defined number of releases. More information can be found on this under the Kubernetes Version Skew Policy.
"},{"location":"services/gpuservice/faq/#insufficient-shared-memory-size","title":"Insufficient Shared Memory Size","text":"My SHM is very small, and it causes \"OSError: [Errno 28] No space left on device\" when I train a model using multi-GPU. How to increase SHM size?
The default size of SHM is only 64M. You can mount an empty dir to /dev/shm to solve this problem:
spec:\n containers:\n - name: [NAME]\n image: [IMAGE]\n volumeMounts:\n - mountPath: /dev/shm\n name: dshm\n volumes:\n - name: dshm\n emptyDir:\n medium: Memory\n
"},{"location":"services/gpuservice/faq/#pytorch-slow-performance-issues","title":"Pytorch Slow Performance Issues","text":"Pytorch on Kubernetes may operate slower than expected - much slower than an equivalent VM setup.
Pytorch defaults to auto-detecting the number of OMP Threads and it will report an incorrect number of potential threads compared to your requested CPU core count. This is a consequence in operating in a container environment, the CPU information is reported by standard libraries and tools will be the node level information rather than your container.
To help correct this issue, the environment variable OMP_NUM_THREADS should be set in the job submission file to the number of cores requested or less.
This has been tested using:
Example fragment for a Bash command start:
containers:\n - args:\n - >\n export OMP_NUM_THREADS=1;\n python mypytorchprogram.py;\n command:\n - /bin/bash\n - '-c'\n - '--'\n
"},{"location":"services/gpuservice/faq/#my-large-number-of-gpus-job-takes-a-long-time-to-be-scheduled","title":"My large number of GPUs Job takes a long time to be scheduled","text":"When requesting a large number of GPUs for a job, this may require an entire node to be free. This could take some time to become available, the default scheduling algorithm in the queues in place is Best Effort FIFO - this means that large jobs will not block small jobs from running if there is sufficient quota and space available.
"},{"location":"services/gpuservice/kueue/","title":"Kueue","text":""},{"location":"services/gpuservice/kueue/#overview","title":"Overview","text":"Kueue is a native Kubernetes quota and job management system.
This is the job queue system for the EIDF GPU Service, starting with February 2024.
All users should submit jobs to their local namespace user queue, this queue will have the name eidf project namespace
-user-queue.
Jobs can be submitted as before but will require the addition of a metadata label:
labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\n
This is the only change required to make Jobs Kueue functional. A policy will be in place that will stop jobs without this label being accepted.
"},{"location":"services/gpuservice/kueue/#useful-commands-for-looking-at-your-local-queue","title":"Useful commands for looking at your local queue","text":""},{"location":"services/gpuservice/kueue/#kubectl-get-queue","title":"kubectl get queue
","text":"This command will output the high level status of your namespace queue with the number of workloads currently running and the number waiting to start:
NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS\neidf001-user-queue eidf001-project-gpu-cq 0 2\n
"},{"location":"services/gpuservice/kueue/#kubectl-describe-queue-queue","title":"kubectl describe queue <queue>
","text":"This command will output more detailed information on the current resource usage in your queue:
Name: eidf001-user-queue\nNamespace: eidf001\nLabels: <none>\nAnnotations: <none>\nAPI Version: kueue.x-k8s.io/v1beta1\nKind: LocalQueue\nMetadata:\n Creation Timestamp: 2024-02-06T13:06:23Z\n Generation: 1\n Managed Fields:\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:spec:\n .:\n f:clusterQueue:\n Manager: kubectl-create\n Operation: Update\n Time: 2024-02-06T13:06:23Z\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:status:\n .:\n f:admittedWorkloads:\n f:conditions:\n .:\n k:{\"type\":\"Active\"}:\n .:\n f:lastTransitionTime:\n f:message:\n f:reason:\n f:status:\n f:type:\n f:flavorUsage:\n .:\n k:{\"name\":\"default-flavor\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"cpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"memory\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-1g\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-3g\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-80\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n f:flavorsReservation:\n .:\n k:{\"name\":\"default-flavor\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"cpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"memory\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-1g\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-3g\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n k:{\"name\":\"gpu-a100-80\"}:\n .:\n f:name:\n f:resources:\n .:\n k:{\"name\":\"nvidia.com/gpu\"}:\n .:\n f:name:\n f:total:\n f:pendingWorkloads:\n f:reservingWorkloads:\n Manager: kueue\n Operation: Update\n Subresource: status\n Time: 2024-02-14T10:54:20Z\n Resource Version: 333898946\n UID: bca097e2-6c55-4305-86ac-d1bd3c767751\nSpec:\n Cluster Queue: eidf001-project-gpu-cq\nStatus:\n Admitted Workloads: 2\n Conditions:\n Last Transition Time: 2024-02-06T13:06:23Z\n Message: Can submit new workloads to clusterQueue\n Reason: Ready\n Status: True\n Type: Active\n Flavor Usage:\n Name: gpu-a100\n Resources:\n Name: nvidia.com/gpu\n Total: 2\n Name: gpu-a100-3g\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: gpu-a100-1g\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: gpu-a100-80\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: default-flavor\n Resources:\n Name: cpu\n Total: 16\n Name: memory\n Total: 256Gi\n Flavors Reservation:\n Name: gpu-a100\n Resources:\n Name: nvidia.com/gpu\n Total: 2\n Name: gpu-a100-3g\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: gpu-a100-1g\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: gpu-a100-80\n Resources:\n Name: nvidia.com/gpu\n Total: 0\n Name: default-flavor\n Resources:\n Name: cpu\n Total: 16\n Name: memory\n Total: 256Gi\n Pending Workloads: 0\n Reserving Workloads: 2\nEvents: <none>\n
"},{"location":"services/gpuservice/kueue/#kubectl-get-workloads","title":"kubectl get workloads
","text":"This command will return the list of workloads in the queue:
NAME QUEUE ADMITTED BY AGE\njob-jobtest-366ab eidf001-user-queue eidf001-project-gpu-cq 4h45m\njob-jobtest-34ba9 eidf001-user-queue eidf001-project-gpu-cq 6h48m\n
"},{"location":"services/gpuservice/kueue/#kubectl-describe-workload-workload","title":"kubectl describe workload <workload>
","text":"This command will return a detailed summary of the workload including status and resource usage:
Name: job-pytorch-job-0b664\nNamespace: t4\nLabels: kueue.x-k8s.io/job-uid=33bc1e48-4dca-4252-9387-bf68b99759dc\nAnnotations: <none>\nAPI Version: kueue.x-k8s.io/v1beta1\nKind: Workload\nMetadata:\n Creation Timestamp: 2024-02-14T15:22:16Z\n Generation: 2\n Managed Fields:\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:status:\n f:admission:\n f:clusterQueue:\n f:podSetAssignments:\n k:{\"name\":\"main\"}:\n .:\n f:count:\n f:flavors:\n f:cpu:\n f:memory:\n f:nvidia.com/gpu:\n f:name:\n f:resourceUsage:\n f:cpu:\n f:memory:\n f:nvidia.com/gpu:\n f:conditions:\n k:{\"type\":\"Admitted\"}:\n .:\n f:lastTransitionTime:\n f:message:\n f:reason:\n f:status:\n f:type:\n k:{\"type\":\"QuotaReserved\"}:\n .:\n f:lastTransitionTime:\n f:message:\n f:reason:\n f:status:\n f:type:\n Manager: kueue-admission\n Operation: Apply\n Subresource: status\n Time: 2024-02-14T15:22:16Z\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:status:\n f:conditions:\n k:{\"type\":\"Finished\"}:\n .:\n f:lastTransitionTime:\n f:message:\n f:reason:\n f:status:\n f:type:\n Manager: kueue-job-controller-Finished\n Operation: Apply\n Subresource: status\n Time: 2024-02-14T15:25:06Z\n API Version: kueue.x-k8s.io/v1beta1\n Fields Type: FieldsV1\n fieldsV1:\n f:metadata:\n f:labels:\n .:\n f:kueue.x-k8s.io/job-uid:\n f:ownerReferences:\n .:\n k:{\"uid\":\"33bc1e48-4dca-4252-9387-bf68b99759dc\"}:\n f:spec:\n .:\n f:podSets:\n .:\n k:{\"name\":\"main\"}:\n .:\n f:count:\n f:name:\n f:template:\n .:\n f:metadata:\n .:\n f:labels:\n .:\n f:controller-uid:\n f:job-name:\n f:name:\n f:spec:\n .:\n f:containers:\n f:dnsPolicy:\n f:nodeSelector:\n f:restartPolicy:\n f:schedulerName:\n f:securityContext:\n f:terminationGracePeriodSeconds:\n f:volumes:\n f:priority:\n f:priorityClassSource:\n f:queueName:\n Manager: kueue\n Operation: Update\n Time: 2024-02-14T15:22:16Z\n Owner References:\n API Version: batch/v1\n Block Owner Deletion: true\n Controller: true\n Kind: Job\n Name: pytorch-job\n UID: 33bc1e48-4dca-4252-9387-bf68b99759dc\n Resource Version: 270812029\n UID: 8cfa93ba-1142-4728-bc0c-e8de817e8151\nSpec:\n Pod Sets:\n Count: 1\n Name: main\n Template:\n Metadata:\n Labels:\n Controller - UID: 33bc1e48-4dca-4252-9387-bf68b99759dc\n Job - Name: pytorch-job\n Name: pytorch-pod\n Spec:\n Containers:\n Args:\n /mnt/ceph_rbd/example_pytorch_code.py\n Command:\n python3\n Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n Image Pull Policy: IfNotPresent\n Name: pytorch-con\n Resources:\n Limits:\n Cpu: 4\n Memory: 4Gi\n nvidia.com/gpu: 1\n Requests:\n Cpu: 2\n Memory: 1Gi\n Termination Message Path: /dev/termination-log\n Termination Message Policy: File\n Volume Mounts:\n Mount Path: /mnt/ceph_rbd\n Name: volume\n Dns Policy: ClusterFirst\n Node Selector:\n nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB\n Restart Policy: Never\n Scheduler Name: default-scheduler\n Security Context:\n Termination Grace Period Seconds: 30\n Volumes:\n Name: volume\n Persistent Volume Claim:\n Claim Name: pytorch-pvc\n Priority: 0\n Priority Class Source:\n Queue Name: t4-user-queue\nStatus:\n Admission:\n Cluster Queue: project-cq\n Pod Set Assignments:\n Count: 1\n Flavors:\n Cpu: default-flavor\n Memory: default-flavor\n nvidia.com/gpu: gpu-a100\n Name: main\n Resource Usage:\n Cpu: 2\n Memory: 1Gi\n nvidia.com/gpu: 1\n Conditions:\n Last Transition Time: 2024-02-14T15:22:16Z\n Message: Quota reserved in ClusterQueue project-cq\n Reason: QuotaReserved\n Status: True\n Type: 
QuotaReserved\n Last Transition Time: 2024-02-14T15:22:16Z\n Message: The workload is admitted\n Reason: Admitted\n Status: True\n Type: Admitted\n Last Transition Time: 2024-02-14T15:25:06Z\n Message: Job finished successfully\n Reason: JobFinished\n Status: True\n Type: Finished\n
"},{"location":"services/gpuservice/policies/","title":"GPU Service Policies","text":""},{"location":"services/gpuservice/policies/#namespaces","title":"Namespaces","text":"Each project will be given a namespace which will have an applied quota.
Default Quota:
Each project will be assigned a kubeconfig file for access to the service which will allow operation in the assigned namespace and access to exposed service operators, for example the GPU and CephRBD operators.
"},{"location":"services/gpuservice/policies/#kubernetes-job-time-to-live","title":"Kubernetes Job Time to Live","text":"All Kubernetes Jobs submitted to the service will have a Time to Live (TTL) applied via spec.ttlSecondsAfterFinished
> automatically. The default TTL for jobs using the service will be 1 week (604800 seconds). A completed job (in success or error state) will be deleted from the service once one week has elapsed after execution has completed. This will reduce excessive object accumulation on the service.
Important
This policy is automated and does not require users to change their job specifications.
"},{"location":"services/gpuservice/policies/#kubernetes-active-deadline-seconds","title":"Kubernetes Active Deadline Seconds","text":"All Kubernetes User Pods submitted to the service will have an Active Deadline Seconds (ADS) applied via spec.spec.activeDeadlineSeconds
automatically. The default ADS for pods using the service will be 5 days (432000 seconds). A pod will be terminated 5 days after execution has begun. This will reduce the number of unused pods remaining on the service.
Important
This policy is automated and does not require users to change their job or pod specifications.
"},{"location":"services/gpuservice/policies/#kueue","title":"Kueue","text":"All jobs will be managed through the Kueue scheduling system. All pods will be required to be owned by a Kubernetes workload.
Each project will have a local user queue in their namespace. This will provide access to their cluster queue. To enable the use of the queue in your job definitions, the following will need to be added to the job specification file as part of the metadata:
labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\n
Jobs without this queue name tag will be rejected.
Pods bypassing the queue system will be deleted.
"},{"location":"services/gpuservice/training/L1_getting_started/","title":"Getting started with Kubernetes","text":""},{"location":"services/gpuservice/training/L1_getting_started/#requirements","title":"Requirements","text":"In order to follow this tutorial on the EIDF GPU Cluster you will need to have:
An account on the EIDF Portal.
An active EIDF Project on the Portal with access to the EIDF GPU Service.
The EIDF GPU Service kubernetes namespace associated with the project, e.g. eidf001ns.
The EIDF GPU Service queue name associated with the project, e.g. eidf001ns-user-queue.
Downloaded the kubeconfig file to a Project VM along with the kubectl command line tool to interact with the K8s API.
Downloading the kubeconfig file and kubectl
Project Leads should use the 'Download kubeconfig' button on the EIDF Portal to complete this step to ensure the correct kubeconfig file and kubectl version is installed.
"},{"location":"services/gpuservice/training/L1_getting_started/#introduction","title":"Introduction","text":"Kubernetes (K8s) is a container orchestration system, originally developed by Google, for the deployment, scaling, and management of containerised applications.
Nvidia GPUs are supported through K8s native Nvidia GPU Operators.
The use of K8s to manage the EIDF GPU Service provides two key advantages:
An overview of the key components of a K8s container can be seen on the Kubernetes docs website.
The primary component of a K8s cluster is a pod.
A pod is a set of one or more docker containers (and their storage volumes) that share resources.
It is the EIDF GPU Cluster policy that all pods should be wrapped within a K8s job.
This allows GPU/CPU/Memory resource requests to be managed by the cluster queue management system, kueue.
Pods which attempt to bypass the queue mechanism will affect the experience of other project users.
Any pods not associated with a job (or other K8s object) are at risk of being deleted without notice.
K8s jobs also provide additional functionality such as parallelism (described later in this tutorial).
Users define the resource requirements of a pod (i.e. number/type of GPU) and the containers/code to be ran in the pod by defining a template within a job manifest file written in yaml.
The job yaml file is sent to the cluster using the K8s API and is assigned to an appropriate node to be ran.
A node is a part of the cluster such as a physical or virtual host which exposes CPU, Memory and GPUs.
Users interact with the K8s API using the kubectl
(short for kubernetes control) commands.
Some of the kubectl commands are restricted on the EIDF cluster in order to ensure project details are not shared across namespaces.
Ensure kubectl is interacting with your project namespace.
You will need to pass the name of your project namespace to kubectl
in order for it to have permission to interact with the cluster.
kubectl
will attempt to interact with the default
namespace which will return a permissions error if it is not told otherwise.
kubectl -n <project-namespace> <command>
will tell kubectl to pass the commands to the correct namespace.
Useful commands are:
kubectl -n <project-namespace> create -f <job definition yaml>
: Create a new job with requested resources. Returns an error if a job with the same name already exists.kubectl -n <project-namespace> apply -f <job definition yaml>
: Create a new job with requested resources. If a job with the same name already exists it updates that job with the new resource/container requirements outlined in the yaml.kubectl -n <project-namespace> delete pod <pod name>
: Delete a pod from the cluster.kubectl -n <project-namespace> get pods
: Summarise all pods the namespace has active (or pending).kubectl -n <project-namespace> describe pods
: Verbose description of all pods the namespace has active (or pending).kubectl -n <project-namespace> describe pod <pod name>
: Verbose summary of the specified pod.kubectl -n <project-namespace> logs <pod name>
: Retrieve the log files associated with a running pod.kubectl -n <project-namespace> get jobs
: List all jobs the namespace has active (or pending).kubectl -n <project-namespace> describe job <job name>
: Verbose summary of the specified job.kubectl -n <project-namespace> delete job <job name>
: Delete a job from the cluster.To access the GPUs on the service, it is recommended to start with one of the prebuilt container images provided by Nvidia, these images are intended to perform different tasks using Nvidia GPUs.
The list of Nvidia images is available on their website.
The following example uses their CUDA sample code simulating nbody interactions.
apiVersion: batch/v1\nkind: Job\nmetadata:\n generateName: jobtest-\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: job-test\n spec:\n containers:\n - name: cudasample\n image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n resources:\n requests:\n cpu: 2\n memory: '1Gi'\n limits:\n cpu: 2\n memory: '4Gi'\n nvidia.com/gpu: 1\n restartPolicy: Never\n
The pod resources are defined under the resources
tags using the requests
and limits
tags.
Resources defined under the requests
tags are the reserved resources required for the pod to be scheduled.
If a pod is assigned to a node with unused resources then it may burst up to use resources beyond those requested.
This may allow the task within the pod to run faster, but it will also throttle back down when further pods are scheduled to the node.
The limits
tag specifies the maximum resources that can be assigned to a pod.
The EIDF GPU Service requires all pods have requests
and limits
tags for CPU and memory defined in order to be accepted.
GPU resources requests are optional and only an entry under the limits
tag is needed to specify the use of a GPU, nvidia.com/gpu: 1
. Without this no GPU will be available to the pod.
The label kueue.x-k8s.io/queue-name
specifies the queue you are submitting your job to. This is part of the Kueue system in operation on the service to allow for improved resource management for users.
<project-namespace>-user-queue
, e.g. eidf001ns-user-queue:kubectl -n <project-namespace> create -f test_NBody.yml
This will output something like:
job.batch/jobtest-b92qg created\n
The five character code appended to the job name, i.e. b92qg
, is randomly generated and will differ from your run.
Run kubectl -n <project-namespace> get jobs
This will output something like:
NAME COMPLETIONS DURATION AGE\njobtest-b92qg 1/1 48s 29m\n
There may be more than one entry as this displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age.
Inspect your job further using the command kubectl -n <project-namespace> describe job jobtest-b92qg
, updating the job name with your five character code.
This will output something like:
Name: jobtest-b92qg\nNamespace: t4\nSelector: controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3\nLabels: kueue.x-k8s.io/queue-name=t4-user-queue\nAnnotations: batch.kubernetes.io/job-tracking:\nParallelism: 1\nCompletions: 3\nCompletion Mode: NonIndexed\nStart Time: Wed, 14 Feb 2024 14:07:44 +0000\nCompleted At: Wed, 14 Feb 2024 14:08:32 +0000\nDuration: 48s\nPods Statuses: 0 Active (0 Ready) / 3 Succeeded / 0 Failed\nPod Template:\n Labels: controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3\n job-name=jobtest-b92qg\n Containers:\n cudasample:\n Image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n Port: <none>\n Host Port: <none>\n Args:\n -benchmark\n -numbodies=512000\n -fp64\n -fullscreen\n Limits:\n cpu: 2\n memory: 4Gi\n nvidia.com/gpu: 1\n Requests:\n cpu: 2\n memory: 1Gi\n Environment: <none>\n Mounts: <none>\n Volumes: <none>\nEvents:\nType Reason Age From Message\n---- ------ ---- ---- -------\nNormal Suspended 8m1s job-controller Job suspended\nNormal CreatedWorkload 8m1s batch/job-kueue-controller Created Workload: t4/job-jobtest-b92qg-3b890\nNormal Started 8m1s batch/job-kueue-controller Admitted by clusterQueue project-cq\nNormal SuccessfulCreate 8m job-controller Created pod: jobtest-b92qg-lh64s\nNormal Resumed 8m job-controller Job resumed\nNormal SuccessfulCreate 7m44s job-controller Created pod: jobtest-b92qg-xhvdm\nNormal SuccessfulCreate 7m28s job-controller Created pod: jobtest-b92qg-lvmrf\nNormal Completed 7m12s job-controller Job completed\n
Run kubectl -n <project-namespace> get pods
This will output something like:
NAME READY STATUS RESTARTS AGE\njobtest-b92qg-lh64s 0/1 Completed 0 11m\n
Again, there may be more than one entry as this displays all the pods in the current namespace. Also, each pod within a job is given another unique five character code appended to the job name.
View the logs of a pod from the job you ran kubectl -n <project-namespace> logs jobtest-b92qg-lh64s
- again, update these with your run's pod and job five character codes.
This will output something like:
Run \"nbody -benchmark [-numbodies=<numBodies>]\" to measure performance.\n -fullscreen (run n-body simulation in fullscreen mode)\n -fp64 (use double precision floating point values for simulation)\n -hostmem (stores simulation data in host memory)\n -benchmark (run benchmark to measure performance)\n -numbodies=<N> (number of bodies (>= 1) to run in simulation)\n -device=<d> (where d=0,1,2.... for the CUDA device to use)\n -numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)\n -compare (compares simulation results running once on the default GPU and once on the CPU)\n -cpu (run n-body simulation on the CPU)\n -tipsy=<file.bin> (load a tipsy model file for simulation)\n\nNOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.\n\n> Fullscreen mode\n> Simulation data stored in video memory\n> Double precision floating point simulation\n> 1 Devices used for simulation\nGPU Device 0: \"Ampere\" with compute capability 8.0\n\n> Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]\nnumber of bodies = 512000\n512000 bodies, total time for 10 iterations: 10570.778 ms\n= 247.989 billion interactions per second\n= 7439.679 double-precision GFLOP/s at 30 flops per interaction\n
Delete your job with kubectl -n <project-namespace> delete job jobtest-b92qg
- this will delete the associated pods as well.
If you create multiple jobs with the same definition file and compare their log files, you may notice that the CUDA device differs from Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]
.
The GPU Operator on K8s allocates the pod to the first node with a free GPU that matches the other resource specifications, irrespective of the type of GPU present on that node.
The GPU resource requests can be made more specific by adding the type of GPU product the pod template is requesting to the node selector:
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-3g.20gb'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'
nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'
The nodeSelector:
key at the bottom of the pod template below states that the pod should be run on a node with a 1g.5gb MIG GPU.
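If you are unsure which product names are currently available, and assuming your account is allowed to list cluster nodes, the GPU product label can be inspected directly; a minimal sketch:
kubectl get nodes -L nvidia.com/gpu.product\n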
Exact GPU product names only
K8s will fail to assign the pod if you misspell the GPU type.
Be especially careful when requesting a full 80GB or 40GB A100 GPU, as attempting to load a GPU with more data than its memory can handle can have unexpected consequences.
apiVersion: batch/v1\nkind: Job\nmetadata:\n generateName: jobtest-\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: job-test\n spec:\n containers:\n - name: cudasample\n image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n resources:\n requests:\n cpu: 2\n memory: '1Gi'\n limits:\n cpu: 2\n memory: '4Gi'\n nvidia.com/gpu: 1\n restartPolicy: Never\n nodeSelector:\n nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n
"},{"location":"services/gpuservice/training/L1_getting_started/#running-multiple-pods-with-k8s-jobs","title":"Running multiple pods with K8s jobs","text":"Wrapping a pod within a job provides additional functionality on top of accessing the queuing system.
Firstly, the restartPolicy within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod.
Jobs also allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate.
See below for an example K8s job that requires three pods to successfully complete the example CUDA code before the job itself ends.
apiVersion: batch/v1\nkind: Job\nmetadata:\n generateName: jobtest-\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 3\n parallelism: 1\n template:\n metadata:\n name: job-test\n spec:\n containers:\n - name: cudasample\n image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n resources:\n requests:\n cpu: 2\n memory: '1Gi'\n limits:\n cpu: 2\n memory: '4Gi'\n nvidia.com/gpu: 1\n restartPolicy: Never\n
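Because parallelism is set to 1 in this example, the three pods run one after another; assuming the job has been submitted as above, the job and its pods can be followed live with the watch flag:
kubectl -n <project-namespace> get jobs -w\n\nkubectl -n <project-namespace> get pods -w\n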
"},{"location":"services/gpuservice/training/L1_getting_started/#change-the-default-kubectl-namespace-in-the-project-kubeconfig-file","title":"Change the default kubectl namespace in the project kubeconfig file","text":"Passing the -n <project-namespace>
flag every time you want to interact with the cluster can be cumbersome.
You can alter the kubeconfig on your VM to send commands to your project namespace by default.
Only users with sudo privileges can change the root kubectl config file.
Open the command line on your EIDF VM with access to the EIDF GPU Service.
Open the root kubeconfig file with sudo privileges.
sudo nano /kubernetes/config\n
Add the namespace line with your project's kubernetes namespace to the \"eidf-general-prod\" context entry in your copy of the config file.
*** MORE CONFIG ***\n\ncontexts:\n- name: \"eidf-general-prod\"\n context:\n user: \"eidf-general-prod\"\n namespace: \"<project-namespace>\" # INSERT LINE\n cluster: \"eidf-general-prod\"\n\n*** MORE CONFIG ***\n
Check kubectl connects to the cluster. If this does not work, you can delete and re-download the kubeconfig file using the button on the project page of the EIDF portal.
kubectl get pods\n
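Alternatively, if you are working with a kubeconfig file that your user can write to, kubectl itself can set the default namespace for the current context without editing the file by hand; a sketch:
kubectl config set-context --current --namespace=<project-namespace>\n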
It is recommended that users complete Getting started with Kubernetes before proceeding with this tutorial.
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#overview","title":"Overview","text":"Pods in the K8s EIDF GPU Service are intentionally ephemeral.
They only last as long as required to complete the task that they were created for.
Keeping pods ephemeral ensures the cluster resources are released for other users to request.
However, this means the default storage volumes within a pod are temporary.
If multiple pods require access to the same large data set or they output large files, then computationally costly file transfers need to be included in every pod instance.
K8s allows you to request persistent volumes that can be mounted to multiple pods to share files or collate outputs.
These persistent volumes will remain even if the pods they are mounted to are deleted, are updated or crash.
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#submitting-a-persistent-volume-claim","title":"Submitting a Persistent Volume Claim","text":"Before a persistent volume can be mounted to a pod, the required storage resources need to be requested and reserved to your namespace.
A PersistentVolumeClaim (PVC) needs to be submitted to K8s to request the storage resources.
The storage resources are held on a Ceph server which can accept requests of up to 100 TiB. Currently, each PVC can only be accessed by one pod at a time; this limitation is being addressed in further development of the EIDF GPU Service. This means that, at this stage, pods can mount the same PVC in sequence, but not concurrently.
Example PVCs can be seen on the Kubernetes documentation page.
All PVCs on the EIDF GPU Service must use the csi-rbd-sc
storage class.
kind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n name: test-ceph-pvc\nspec:\n accessModes:\n - ReadWriteOnce\n resources:\n requests:\n storage: 2Gi\n storageClassName: csi-rbd-sc\n
You create a persistent volume by passing the yaml file to kubectl, as you would a pod specification yaml: kubectl -n <project-namespace> create -f <PVC specification yaml>
Once you have successfully created a persistent volume you can interact with it using the standard kubectl commands:
kubectl -n <project-namespace> delete pvc <PVC name>
kubectl -n <project-namespace> get pvc <PVC name>
kubectl -n <project-namespace> apply -f <PVC specification yaml>
Introducing a persistent volume to a pod requires the addition of a volumeMount option to the container and a volume option linking to the PVC in the pod specification yaml.
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#example-pod-specification-yaml-with-mounted-persistent-volume","title":"Example pod specification yaml with mounted persistent volume","text":"apiVersion: batch/v1\nkind: Job\nmetadata:\n name: test-ceph-pvc-job\n labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: test-ceph-pvc-pod\n spec:\n containers:\n - name: cudasample\n image: busybox\n args: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 2\n memory: '1Gi'\n limits:\n cpu: 2\n memory: '4Gi'\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n restartPolicy: Never\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: test-ceph-pvc\n
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#accessing-the-persistent-volume-outside-a-pod","title":"Accessing the persistent volume outside a pod","text":"To move files in/out of the persistent volume from outside a pod you can use the kubectl cp command.
*** On Login Node - replacing pod name with your pod name ***\nkubectl -n <project-namespace> cp /home/data/test_data.csv test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd\n
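Files can be copied back out of the persistent volume in the same way; a sketch, using the pod name from the example above and a hypothetical output file:
kubectl -n <project-namespace> cp test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd/output.csv ./output.csv\n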
For more complex file transfers and synchronisation, create a low resource pod with the persistent volume mounted.
The bash command rsync can be adapted to manage file transfers into the mounted PV, following the approach in this GitHub repo.
"},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#clean-up","title":"Clean up","text":"kubectl -n <project-namespace> delete job test-ceph-pvc-job\n\nkubectl -n <project-namespace> delete pvc test-ceph-pvc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/","title":"Running a PyTorch task","text":""},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#requirements","title":"Requirements","text":"It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes With Kubernetes before proceeding with this tutorial.
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#overview","title":"Overview","text":"In the following lesson, we'll build a CNN neural network and train it using the EIDF GPU Service.
The model was taken from the PyTorch Tutorials.
The lesson will be split into three parts:
Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec yaml below).
kubectl -n <project-namespace> create -f <pvc-spec-yaml>\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-pytorch-persistentvolumeclaim","title":"Example PyTorch PersistentVolumeClaim","text":"kind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n name: pytorch-pvc\nspec:\n accessModes:\n - ReadWriteOnce\n resources:\n requests:\n storage: 2Gi\n storageClassName: csi-rbd-sc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#transfer-codedata-to-persistent-volume","title":"Transfer code/data to persistent volume","text":"Check PVC has been created
kubectl -n <project-namespace> get pvc <pv-name>\n
Create a lightweight job with pod with PV mounted (example job below)
kubectl -n <project-namespace> create -f lightweight-pod-job.yaml\n
Download the PyTorch code
wget https://github.com/EPCCed/eidf-docs/raw/main/docs/services/gpuservice/training/resources/example_pytorch_code.py\n
Copy the Python script into the PV
kubectl -n <project-namespace> cp example_pytorch_code.py lightweight-job-<identifier>:/mnt/ceph_rbd/\n
Check whether the files were transferred successfully
kubectl -n <project-namespace> exec lightweight-job-<identifier> -- ls /mnt/ceph_rbd\n
Delete the lightweight job
kubectl -n <project-namespace> delete job lightweight-job-<identifier>\n
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: lightweight-pod\n spec:\n containers:\n - name: data-loader\n image: busybox\n args: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 1\n memory: '1Gi'\n limits:\n cpu: 1\n memory: '1Gi'\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n restartPolicy: Never\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: pytorch-pvc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#creating-a-job-with-a-pytorch-container","title":"Creating a Job with a PyTorch container","text":"We will use the pre-made PyTorch Docker image available on Docker Hub to run the PyTorch ML model.
The PyTorch container will be held within a pod that has the persistent volume mounted and access a MIG GPU.
Submit the specification file below to K8s to create the job, replacing the queue name with your project namespace queue name.
kubectl -n <project-namespace> create -f <pytorch-job-yaml>\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-pytorch-job-specification-file","title":"Example PyTorch Job Specification File","text":"apiVersion: batch/v1\nkind: Job\nmetadata:\n name: pytorch-job\n labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\nspec:\n completions: 1\n template:\n metadata:\n name: pytorch-pod\n spec:\n restartPolicy: Never\n containers:\n - name: pytorch-con\n image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n command: [\"python3\"]\n args: [\"/mnt/ceph_rbd/example_pytorch_code.py\"]\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n resources:\n requests:\n cpu: 2\n memory: \"1Gi\"\n limits:\n cpu: 4\n memory: \"4Gi\"\n nvidia.com/gpu: 1\n nodeSelector:\n nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: pytorch-pvc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#reviewing-the-results-of-the-pytorch-model","title":"Reviewing the results of the PyTorch model","text":"This is not intended to be an introduction to PyTorch, please see the online tutorial for details about the model.
Check that the model ran to completion
kubectl -n <project-namespace> logs <pytorch-pod-name>\n
Spin up a lightweight pod to retrieve results
kubectl -n <project-namespace> create -f lightweight-pod-job.yaml\n
Copy the trained model back to your access VM
kubectl -n <project-namespace> cp lightweight-job-<identifier>:mnt/ceph_rbd/model.pth model.pth\n
A common ML training workflow may consist of training multiple iterations of a model: such as models with different hyperparameters or models trained on multiple different data sets.
A Kubernetes job can create and manage multiple pods with identical or different initial parameters.
NVIDIA provide a detailed tutorial on how to conduct a ML hyperparameter search with a Kubernetes job.
Below is an example job yaml for running the pytorch model which will continue to create pods until three have successfully completed the task of training the model.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: pytorch-job\n labels:\n kueue.x-k8s.io/queue-name: <project namespace>-user-queue\nspec:\n completions: 3\n template:\n metadata:\n name: pytorch-pod\n spec:\n restartPolicy: Never\n containers:\n - name: pytorch-con\n image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n command: [\"python3\"]\n args: [\"/mnt/ceph_rbd/example_pytorch_code.py\"]\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n resources:\n requests:\n cpu: 2\n memory: \"1Gi\"\n limits:\n cpu: 4\n memory: \"4Gi\"\n nvidia.com/gpu: 1\n nodeSelector:\n nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: pytorch-pvc\n
"},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#clean-up","title":"Clean up","text":"kubectl -n <project-namespace> delete pod pytorch-job\n\nkubectl -n <project-namespace> delete pvc pytorch-pvc\n
"},{"location":"services/gpuservice/training/L4_template_workflow/","title":"Template workflow","text":""},{"location":"services/gpuservice/training/L4_template_workflow/#requirements","title":"Requirements","text":"It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes With Kubernetes before proceeding with this tutorial.
"},{"location":"services/gpuservice/training/L4_template_workflow/#overview","title":"Overview","text":"An example workflow for code development using K8s is outlined below.
In theory, users can create docker images with all the code, software and data included to complete their analysis.
In practice, docker images with the required software can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is then added.
Therefore, it is recommended to separate code, software, and data preparation into distinct steps:
Data Loading: Loading large data sets asynchronously.
Developing a Docker environment: Manually or automatically building Docker images.
Code development with K8s: Iteratively changing and testing code in a job.
The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service.
The three stages are interchangeable and may not be relevant to every project.
Some strategies in the workflow require a GitHub account and Docker Hub account for automatic building (this can be adapted for other platforms such as GitLab).
"},{"location":"services/gpuservice/training/L4_template_workflow/#data-loading","title":"Data loading","text":"The EIDF GPU service contains GPUs with 40Gb/80Gb of on board memory and it is expected that data sets of > 100 Gb will be loaded onto the service to utilise this hardware.
Persistent volume claims need to be of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO.
Read the requesting persistent volumes with Kubernetes lesson to learn how to request and mount persistent volumes to pods.
It often takes several hours or days to download data sets of 1/2 TB or more to a persistent volume.
Therefore, the data download step needs to be completed asynchronously as maintaining a contention to the server for long periods of time can be unreliable.
"},{"location":"services/gpuservice/training/L4_template_workflow/#asynchronous-data-downloading-with-a-lightweight-job","title":"Asynchronous data downloading with a lightweight job","text":"Check a PVC has been created.
kubectl -n <project-namespace> get pvc template-workflow-pvc\n
Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n metadata:\n name: lightweight-job\n spec:\n restartPolicy: Never\n containers:\n - name: data-loader\n image: alpine/curl:latest\n command: ['sh', '-c', \"cd /mnt/ceph_rbd; curl https://archive.ics.uci.edu/static/public/53/iris.zip -o iris.zip\"]\n resources:\n requests:\n cpu: 1\n memory: \"1Gi\"\n limits:\n cpu: 1\n memory: \"1Gi\"\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n
Run the data download job.
kubectl -n <project-namespace> create -f lightweight-pod.yaml\n
Check if the download has completed.
kubectl -n <project-namespace> get jobs\n
Delete the lightweight job once completed.
kubectl -n <project-namespace> delete job lightweight-job\n
Screen is a window manager available in Linux that allows you to create multiple interactive shells and swap between then.
Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect.
This allows you to start a task, such as downloading a data set, and check in on it asynchronously.
Once you have started a screen session, you can create a new window with ctrl-a c
, swap between windows with ctrl-a 0-9
and exit screen (but keep any task running) with ctrl-a d
.
Using screen rather than a single download job can be helpful if downloading multiple data sets or if you intend to do some simple QC or tidying up before/after downloading.
Start a screen session.
screen\n
Create an interactive lightweight job session.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n metadata:\n name: lightweight-pod\n spec:\n restartPolicy: Never\n containers:\n - name: data-loader\n image: alpine/curl:latest\n command: ['sleep','infinity']\n resources:\n requests:\n cpu: 1\n memory: \"1Gi\"\n limits:\n cpu: 1\n memory: \"1Gi\"\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n
Download data set. Change the curl URL to your data set of interest.
kubectl -n <project-namespace> exec <lightweight-pod-name> -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip\n
Exit the remote session by either ending the session or ctrl-a d
.
Reconnect at a later time and reattach the screen window.
screen -list\n\nscreen -r <session-name>\n
Check the download was successful and delete the job.
kubectl -n <project-namespace> exec <lightweight-pod-name> -- ls /mnt/ceph_rbd/\n\nkubectl -n <project-namespace> delete job lightweight-job\n
Exit the screen session.
exit\n
Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub.
It does not provide functionality to build images and create pods from docker files.
However, use cases may require some custom modifications of a base image, such as adding a python library.
These custom images need to be built locally (using docker) or online (using a GitHub/GitLab worker) and pushed to a repository such as Docker Hub.
This is not an introduction to building docker images, please see the Docker tutorial for a general overview.
"},{"location":"services/gpuservice/training/L4_template_workflow/#manually-building-a-docker-image-locally","title":"Manually building a Docker image locally","text":"Select a suitable base image (The Nvidia container catalog is often a useful starting place for GPU accelerated tasks). We'll use the base RAPIDS image.
Create a Dockerfile to add any additional packages required to the base image.
FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10\nRUN pip install pandas\nRUN pip install plotly\n
Build the Docker container locally (You will need to install Docker)
cd <dockerfile-folder>\n\ndocker build . -t <docker-hub-username>/template-docker-image:latest\n
Building images for different CPU architectures
Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture.
If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the --platform linux/amd64
flag to the build function.
Create a repository to hold the image on Docker Hub (You will need to create and setup an account).
Push the Docker image to the repository.
docker push <docker-hub-username>/template-docker-image:latest\n
Finally, specify your Docker image in the image:
tag of the job specification yaml file.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n spec:\n restartPolicy: Never\n containers:\n - name: template-docker-image\n image: <docker-hub-username>/template-docker-image:latest\n command: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n
In cases where the Docker image needs to be built and tested iteratively (i.e. to check for comparability issues), git version control and GitHub Actions can simplify the build process.
A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the docker file in a git repo.
This process requires you to already have a GitHub and Docker Hub account.
Create an access token on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo.
Create two GitHub secrets to securely provide your Docker Hub username and access token.
Add the dockerfile to a code/docker folder within an active GitHub repo.
Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected.
name: ci\non:\n push:\n paths:\n - 'code/docker/**'\n\njobs:\n docker:\n runs-on: ubuntu-latest\n steps:\n -\n name: Set up QEMU\n uses: docker/setup-qemu-action@v3\n -\n name: Set up Docker Buildx\n uses: docker/setup-buildx-action@v3\n -\n name: Login to Docker Hub\n uses: docker/login-action@v3\n with:\n username: ${{ secrets.DOCKERHUB_USERNAME }}\n password: ${{ secrets.DOCKERHUB_TOKEN }}\n -\n name: Build and push\n uses: docker/build-push-action@v5\n with:\n context: \"{{defaultContext}}:code/docker\"\n push: true\n tags: <target-dockerhub-image-name>\n
Push a change to the dockerfile and check the Docker Hub image is updated.
Production code can be included within a Docker image to aid reproducibility as the specific software versions required to run the code are packaged together.
However, binding the code to the docker image during development can delay the testing cycle as re-downloading all of the software for every change in a code block can take time.
If the docker image is consistent across tests, then it can be cached locally on the EIDFGPU Service instead of being re-downloaded (this occurs automatically although the cache is node specific and is not shared across nodes).
A pod yaml file can be defined to automatically pull the latest code version before running any tests.
Reducing the download time to fractions of a second allows rapid testing to be completed on the cluster with just the kubectl create
command.
You must already have a GitHub account to follow this process.
This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab).
A template GitHub repo with sample code, k8s yaml files and a Docker build Github Action is available here.
"},{"location":"services/gpuservice/training/L4_template_workflow/#create-a-job-that-downloads-and-runs-the-latest-code-version-at-runtime","title":"Create a job that downloads and runs the latest code version at runtime","text":"Write a standard yaml file for a k8s job with the required resources and custom docker image (example below)
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n spec:\n restartPolicy: Never\n containers:\n - name: template-docker-image\n image: <docker-hub-username>/template-docker-image:latest\n command: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n
Add an initial container that runs before the main container to download the latest version of the code.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n spec:\n restartPolicy: Never\n containers:\n - name: template-docker-image\n image: <docker-hub-username>/template-docker-image:latest\n command: [\"sleep\", \"infinity\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n - mountPath: /code\n name: github-code\n initContainers:\n - name: lightweight-git-container\n image: cicirello/alpine-plus-plus\n command: ['sh', '-c', \"cd /code; git clone <target-repo>\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n volumeMounts:\n - mountPath: /code\n name: github-code\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n - name: github-code\n emptyDir:\n sizeLimit: 1Gi\n
Change the command argument in the main container to run the code once started. Add the URL of the GitHub repo of interest to the initContainers: command:
tag.
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n spec:\n restartPolicy: Never\n containers:\n - name: template-docker-image\n image: <docker-hub-username>/template-docker-image:latest\n command: ['sh', '-c', \"python3 /code/<python-script>\"]\n resources:\n requests:\n cpu: 10\n memory: \"40Gi\"\n limits:\n cpu: 10\n memory: \"80Gi\"\n nvidia.com/gpu: 1\n volumeMounts:\n - mountPath: /mnt/ceph_rbd\n name: volume\n - mountPath: /code\n name: github-code\n initContainers:\n - name: lightweight-git-container\n image: cicirello/alpine-plus-plus\n command: ['sh', '-c', \"cd /code; git clone <target-repo>\"]\n resources:\n requests:\n cpu: 1\n memory: \"4Gi\"\n limits:\n cpu: 1\n memory: \"8Gi\"\n volumeMounts:\n - mountPath: /code\n name: github-code\n volumes:\n - name: volume\n persistentVolumeClaim:\n claimName: template-workflow-pvc\n - name: github-code\n emptyDir:\n sizeLimit: 1Gi\n
Submit the yaml file to kubernetes
kubectl -n <project-namespace> create -f <job-yaml-file>\n
EIDF hosts a Graphcore Bow Pod64 system for AI acceleration.
The specification of the Bow Pod64 is:
For more details about the IPU architecture, see documentation from Graphcore.
The smallest unit of compute resource that can be requested is a single IPU.
Similarly to the EIDF GPU Service, usage of the Graphcore is managed using Kubernetes.
"},{"location":"services/graphcore/#service-access","title":"Service Access","text":"Access to the Graphcore accelerator is provisioning through the EIDF GPU Service.
Users should apply for access to Graphcore via the EIDF GPU Service.
"},{"location":"services/graphcore/#project-quotas","title":"Project Quotas","text":"Currently there is no active quota mechanism on the Graphcore accelerator. IPUJobs should be actively using partitions on the Graphcore.
"},{"location":"services/graphcore/#graphcore-tutorial","title":"Graphcore Tutorial","text":"The following tutorial teaches users how to submit tasks to the Graphcore system. This tutorial assumes basic familiary with submitting jobs via Kubernetes. For a tutorial on using Kubernetes, see the GPU service tutorial. For more in-depth lessons about developing applications for Graphcore, see the general documentation and guide for creating IPU jobs via Kubernetes.
Lesson Objective Getting started with IPU jobs a. How to send an IPUJob.b. Monitoring and Cancelling your IPUJob. Multi-IPU Jobs a. Using multiple IPUs for distributed training. Profiling with PopVision a. Enabling profiling in your code.b. Downloading the profile reports. Other Frameworks a. Using Tensorflow and PopART.b. Writing IPU programs with PopLibs (C++)."},{"location":"services/graphcore/#further-reading-and-help","title":"Further Reading and Help","text":"The Graphcore documentation provides information about using the Graphcore system.
The Graphcore examples repository on GitHub provides a catalogue of application examples that have been optimised to run on Graphcore IPUs for both training and inference. It also contains tutorials for using various frameworks.
IPUJobs
manages the launcher and worker pods
, therefore the pods will be deleted when the IPUJob
is deleted, using kubectl delete ipujobs <IPUJob-name>
. If only the pod
is deleted via kubectl delete pod
, the IPUJob
may respawn the pod
.
To see running or terminated IPUJobs
, run kubectl get ipujobs
.
'poptorch_cpp_error': Failed to acquire X IPU(s)
. Why?","text":"This error may appear when the IPUJob name is too long.
We have identified that for IPUJobs with metadata:name
length over 36 characters, this error may appear. A solution is to reduce the name to under 36 characters.
This guide assumes basic familiarity with Kubernetes (K8s) and usage of kubectl
. See GPU service tutorial to get started.
Graphcore provides prebuilt docker containers (full lists here) which contain the required libraries (pytorch, tensorflow, poplar etc.) and can be used directly within the K8s to run on the Graphcore IPUs.
In this tutorial we will cover running training with a single IPU. The subsequent tutorial will cover using multiple IPUs, which can be used for distrubed training jobs.
"},{"location":"services/graphcore/training/L1_getting_started/#creating-your-first-ipu-job","title":"Creating your first IPU job","text":"For our first IPU job, we will be using the Graphcore PyTorch (PopTorch) container image (graphcore/pytorch:3.3.0
) to run a simple example of training a neural network for classification on the MNIST dataset, which is provided here. More applications can be found in the repository https://github.com/graphcore/examples.
To get started:
mnist-training-ipujob.yaml
, then copy and save the following content into the file:apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: mnist-training-\nspec:\n # jobInstances defines the number of job instances.\n # More than 1 job instance is usually useful for inference jobs only.\n jobInstances: 1\n # ipusPerJobInstance refers to the number of IPUs required per job instance.\n # A separate IPU partition of this size will be created by the IPU Operator\n # for each job instance.\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: mnist-training\n image: graphcore/pytorch:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd;\n mkdir build;\n cd build;\n git clone https://github.com/graphcore/examples.git;\n cd examples/tutorials/simple_applications/pytorch/mnist;\n python -m pip install -r requirements.txt;\n python mnist_poptorch_code_only.py --epochs 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
to submit the job - run kubectl create -f mnist-training-ipujob.yaml
, which will give the following output:
ipujob.graphcore.ai/mnist-training-<random string> created\n
to monitor progress of the job - run kubectl get pods
, which will give the following output
NAME READY STATUS RESTARTS AGE\nmnist-training-<random string>-worker-0 0/1 Completed 0 2m56s\n
to read the result - run kubectl logs mnist-training-<random string>-worker-0
, which will give the following output (or similar)
...\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [00:23<00:00]\nEpochs: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:34<00:00, 34.18s/it]\n...\nAccuracy on test set: 97.08%\n
"},{"location":"services/graphcore/training/L1_getting_started/#monitoring-and-cancelling-your-ipu-job","title":"Monitoring and Cancelling your IPU job","text":"An IPU job creates an IPU Operator, which manages the required worker or launcher pods. To see running or complete IPUjobs
, run kubectl get ipujobs
, which will show:
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE\nmnist-training Completed 0 1 All instances done 10m\n
To delete the IPUjob
, run kubectl delete ipujobs <job-name>
, e.g. kubectl delete ipujobs mnist-training-<random string>
. This will also delete the associated worker pod mnist-training-<random string>-worker-0
.
Note: simply deleting the pod via kubectl delete pods mnist-training-<random-string>-worker-0
does not delete the IPU job, which will need to be deleted separately.
Note: you can list all pods via kubectl get all
or kubectl get pods
, but they do not show the ipujobs. These can be obtained using kubectl get ipujobs
.
Note: kubectl describe <pod-name>
provides verbose description of a specific pod.
The Graphcore IPU Operator (Kubernetes interface) extends the Kubernetes API by introducing a custom resource definition (CRD) named IPUJob
, which can be seen at the beginning of the included yaml file:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\n
An IPUJob
allows users to defineworkloads that can use IPUs. There are several fields specific to an IPUJob
:
job instances : This defines the number of jobs. In the case of training it should be 1.
ipusPerJobInstance : This defines the size of IPU partition that will be created for each job instance.
workers : This defines a Pod specification that will be used for Worker
Pods, including the container image and commands.
These fields have been populated in the example .yaml file. For distributed training (with multiple IPUs), additional fields need to be included, which will be described in the next lesson.
"},{"location":"services/graphcore/training/L1_getting_started/#additional-information","title":"Additional Information","text":"It is possible to further specify the restart policy (Always
/OnFailure
/Never
/ExitCode
) and clean up policy (Workers
/All
/None
); see here.
In this tutorial, we will cover how to run larger models, including examples provided by Graphcore on https://github.com/graphcore/examples. These may require distributed training on multiple IPUs.
The number of IPUs requested must be in powers of two, i.e. 1, 2, 4, 8, 16, 32, or 64.
"},{"location":"services/graphcore/training/L2_multiple_IPU/#first-example","title":"First example","text":"As an example, we will use 4 IPUs to perform the pre-training step of BERT, an NLP transformer model. The code is available from https://github.com/graphcore/examples/tree/master/nlp/bert/pytorch.
To get started, save and create an IPUJob with the following .yaml
file:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: bert-training-multi-ipu-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"4\"\n workers:\n template:\n spec:\n containers:\n - name: bert-training-multi-ipu\n image: graphcore/pytorch:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd ;\n mkdir build;\n cd build ;\n git clone https://github.com/graphcore/examples.git;\n cd examples/nlp/bert/pytorch;\n apt update ;\n apt upgrade -y;\n DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;\n pip3 install -r requirements.txt ;\n python3 run_pretraining.py --dataset generated --config pretrain_base_128_pod4 --training-steps 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Running the above IPUJob and querying the log via kubectl logs pod/bert-training-multi-ipu-<random string>-worker-0
should give:
...\nData loaded in 8.559805537108332 secs\n-----------------------------------------------------------\n-------------------- Device Allocation --------------------\nEmbedding --> IPU 0\nEncoder 0 --> IPU 1\nEncoder 1 --> IPU 1\nEncoder 2 --> IPU 1\nEncoder 3 --> IPU 1\nEncoder 4 --> IPU 2\nEncoder 5 --> IPU 2\nEncoder 6 --> IPU 2\nEncoder 7 --> IPU 2\nEncoder 8 --> IPU 3\nEncoder 9 --> IPU 3\nEncoder 10 --> IPU 3\nEncoder 11 --> IPU 3\nPooler --> IPU 0\nClassifier --> IPU 0\n-----------------------------------------------------------\n---------- Compilation/Loading from Cache Started ---------\n\n...\n\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [08:02<00:00]\nCompiled/Loaded model in 500.756152929971 secs\n-----------------------------------------------------------\n--------------------- Training Started --------------------\nStep: 0 / 0 - LR: 0.00e+00 - total loss: 10.817 - mlm_loss: 10.386 - nsp_loss: 0.432 - mlm_acc: 0.000 % - nsp_acc: 1.000 %: 0%| | 0/1 [00:16<?, ?it/s, throughput: 4035.0 samples/sec]\n-----------------------------------------------------------\n-------------------- Training Metrics ---------------------\nglobal_batch_size: 65536\ndevice_iterations: 1\ntraining_steps: 1\nTraining time: 16.245 secs\n-----------------------------------------------------------\n
"},{"location":"services/graphcore/training/L2_multiple_IPU/#details","title":"Details","text":"In this example, we have requested 4 IPUs:
ipusPerJobInstance: \"4\"\n
The python flag --config pretrain_base_128_pod4
uses one of the preset configurations for this model with 4 IPUs. Here we also use the --datset generated
flag to generate data rather than download the required dataset.
To provided sufficient shm for the IPU pod, it may be necessary to mount /dev/shm
as follows:
volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
It is also required to set spec.hostIPC
to true
:
hostIPC: true\n
and add a securityContext
to the container definition than enables the IPC_LOCK
capability:
securityContext:\n capabilities:\n add:\n - IPC_LOCK\n
Note: IPC_LOCK
allows for the RDMA software stack to use pinned memory \u2014 which is particularly useful for PyTorch dataloaders, which can be very memory hungry. This is since all data going to the IPUs go via the network interfaces (via 100Gbps ethernet).
In general, the graph compilation phase of running large models can require significant memory, and far less during the execution phase.
In the example above, it is possible to explicitly request the memory via:
resources:\n limits:\n memory: \"128Gi\"\n requests:\n memory: \"128Gi\"\n
which will succeed. (The graph compilation fails if only 32Gi
is requested.)
As a general guideline, 128GB memory should be enough for the majority of tasks, and rarely exceed 200GB even for jobs with high IPU count. In the example .yaml
script, we do not specifically request the memory.
In the example above, python is launched directly in the pod. When scaling up the number of IPUs (e.g. above 8 IPUs), it may be possible to run into a CPU bottleneck. This may be observed when the throughput scales sub-linearly with the number of data-parallel replicas (i.e. when doubling the IPU count, the performance does not double). This can also be verified by profiling the application and observing a significant proportion of runtime spent on host CPU workload.
In this case, Poprun can be used launch multiple instances. As an example, we will save the following .yaml configuratoin and run:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: bert-poprun-64ipus-\nspec:\n jobInstances: 1\n modelReplicasPerWorker: \"16\"\n ipusPerJobInstance: \"64\"\n workers:\n template:\n spec:\n containers:\n - name: bert-poprun-64ipus\n image: graphcore/pytorch:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd ;\n mkdir build;\n cd build ;\n git clone https://github.com/graphcore/examples.git;\n cd examples/nlp/bert/pytorch;\n apt update ;\n apt upgrade -y;\n DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;\n pip3 install -r requirements.txt ;\n OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 OMPI_ALLOW_RUN_AS_ROOT=1 \\\n poprun \\\n --allow-run-as-root 1 \\\n --vv \\\n --num-instances 1 \\\n --num-replicas 16 \\\n --mpi-global-args=\"--tag-output\" \\\n --ipus-per-replica 4 \\\n python3 run_pretraining.py \\\n --config pretrain_large_128_POD64 \\\n --dataset generated --training-steps 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Inspecting the log via kubectl logs <pod-name>
should produce:
...\n ===========================================================================================\n| poprun topology |\n|===========================================================================================|\n10:10:50.154 1 POPRUN [D] Done polling, final state of p-bert-poprun-64ipus-gc-dev-0: PS_ACTIVE\n10:10:50.154 1 POPRUN [D] Target options from environment: {}\n| hosts | localhost |\n|-----------|-------------------------------------------------------------------------------|\n| ILDs | 0 |\n|-----------|-------------------------------------------------------------------------------|\n| instances | 0 |\n|-----------|-------------------------------------------------------------------------------|\n| replicas | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |\n -------------------------------------------------------------------------------------------\n10:10:50.154 1 POPRUN [D] Target options from V-IPU partition: {\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\"instanceSize\":\"64\"}\n10:10:50.154 1 POPRUN [D] Using target options: {\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\"instanceSize\":\"64\"}\n10:10:50.203 1 POPRUN [D] No hosts specified; ignoring host-subnet setting\n10:10:50.203 1 POPRUN [D] Default network/RNIC for host communication: None\n10:10:50.203 1 POPRUN [I] Running command: /opt/poplar/bin/mpirun '--tag-output' '--bind-to' 'none' '--tag-output'\n'--allow-run-as-root' '-np' '1' '-x' 'POPDIST_NUM_TOTAL_REPLICAS=16' '-x' 'POPDIST_NUM_IPUS_PER_REPLICA=4' '-x'\n'POPDIST_NUM_LOCAL_REPLICAS=16' '-x' 'POPDIST_UNIFORM_REPLICAS_PER_INSTANCE=1' '-x' 'POPDIST_REPLICA_INDEX_OFFSET=0' '-x'\n'POPDIST_LOCAL_INSTANCE_INDEX=0' '-x' 'IPUOF_VIPU_API_HOST=10.21.21.129' '-x' 'IPUOF_VIPU_API_PORT=8090' '-x'\n'IPUOF_VIPU_API_PARTITION_ID=p-bert-poprun-64ipus-gc-dev-0' '-x' 'IPUOF_VIPU_API_TIMEOUT=120' '-x' 'IPUOF_VIPU_API_GCD_ID=0'\n'-x' 'IPUOF_LOG_LEVEL=WARN' '-x' 'PATH' '-x' 'LD_LIBRARY_PATH' '-x' 'PYTHONPATH' '-x' 'POPLAR_TARGET_OPTIONS=\n{\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\n\"instanceSize\":\"64\"}' 'python3' 'run_pretraining.py' '--config' 'pretrain_large_128_POD64' '--dataset' 'generated' '--training-steps' '1'\n10:10:50.204 1 POPRUN [I] Waiting for mpirun (PID 4346)\n[1,0]<stderr>: Registered metric hook: total_compiling_time with object: <function get_results_for_compile_time at 0x7fe0a6e8af70>\n[1,0]<stderr>:Using config: pretrain_large_128_POD64\n...\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [10:11<00:00][1,0]<stderr>:\n[1,0]<stderr>:Compiled/Loaded model in 683.6591004971415 secs\n[1,0]<stderr>:-----------------------------------------------------------\n[1,0]<stderr>:--------------------- Training Started --------------------\nStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %: 0%| | 0/1 [00:03<?, ?itStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %: 0%| | 0/1 [00:03<?, ?itStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %: 0%| | 0/1 [00:03<?, ?it/s, throughput: 17692.1 
samples/sec][1,0]<stderr>:\n[1,0]<stderr>:-----------------------------------------------------------\n[1,0]<stderr>:-------------------- Training Metrics ---------------------\n[1,0]<stderr>:global_batch_size: 65536\n[1,0]<stderr>:device_iterations: 1\n[1,0]<stderr>:training_steps: 1\n[1,0]<stderr>:Training time: 3.718 secs\n[1,0]<stderr>:-----------------------------------------------------------\n
"},{"location":"services/graphcore/training/L2_multiple_IPU/#notes-on-using-the-examples-respository","title":"Notes on using the examples respository","text":"Graphcore provides examples of a variety of models on Github https://github.com/graphcore/examples. When following the instructions, note that since we are using a container within a Kubernetes pod, there is no need to enable the Poplar/PopART SDK, set up a virtual python environment, or install the PopTorch wheel.
"},{"location":"services/graphcore/training/L3_profiling/","title":"Profiling with PopVision","text":"Graphcore provides various tools for profiling, debugging, and instrumenting programs run on IPUs. In this tutorial we will briefly demonstrate an example using the PopVision Graph Analyser. For more information, see Profiling and Debugging and PopVision Graph Analyser User Guide.
We will reuse the same PyTorch MNIST example from lesson 1 (from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/pytorch/mnist).
To enable profiling and create IPU reports, we need to add the following line to the training script mnist_poptorch_code_only.py
:
training_opts = training_opts.enableProfiling()\n
(for details the API, see API reference)
Save and run kubectl create -f <yaml-file>
on the following:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: mnist-training-profiling-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: mnist-training-profiling\n image: graphcore/pytorch:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd;\n mkdir build;\n cd build;\n git clone https://github.com/graphcore/examples.git;\n cd examples/tutorials/simple_applications/pytorch/mnist;\n python -m pip install -r requirements.txt;\n sed -i '131i training_opts = training_opts.enableProfiling()' mnist_poptorch_code_only.py;\n python mnist_poptorch_code_only.py --epochs 1;\n echo 'RUNNING ls ./training';\n ls training\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
After completion, using kubectl logs <pod-name>
, we can see the following result
...\nAccuracy on test set: 96.69%\nRUNNING ls ./training\narchive.a\nprofile.pop\n
We can see that the training has created two Poplar report files: archive.a
which is an archive of the ELF executable files, one for each tile; and profile.pop
, the poplar profile, which contains compile-time and execution information about the Poplar graph.
To download the traing profiles to your local environment, you can use kubectl cp
. For example, run
kubectl cp <pod-name>:/root/build/examples/tutorials/simple_applications/pytorch/mnist/training .\n
Once you have downloaded the profile report files, you can view the contents locally using the PopVision Graph Analyser tool, which is available for download here https://www.graphcore.ai/developer/popvision-tools.
From the Graph Analyser, you can analyse information including memory usage, execution trace and more.
"},{"location":"services/graphcore/training/L4_other_frameworks/","title":"Other Frameworks","text":"In this tutorial we'll briefly cover running tensorflow and PopART for Machine Learning, and writing IPU programs directly via the PopLibs library in C++. Extra links and resources will be provided for more in-depth information.
"},{"location":"services/graphcore/training/L4_other_frameworks/#terminology","title":"Terminology","text":"Within Graphcore, Poplar
refers to the tools (e.g. Poplar Graph Engine
or Poplar Graph Compiler
) and libraries (PopLibs
) for programming on IPUs.
The Poplar SDK
is a package of software development tools, including
For more details see here.
"},{"location":"services/graphcore/training/L4_other_frameworks/#other-ml-frameworks-tensorflow-and-popart","title":"Other ML frameworks: Tensorflow and PopART","text":"Besides being able to run PyTorch code, as demonstrated in the previous lessons, the Poplar SDK also supports running ML learning applications with tensorflow or PopART.
"},{"location":"services/graphcore/training/L4_other_frameworks/#tensorflow","title":"Tensorflow","text":"The Poplar SDK includes implementation of TensorFlow and Keras for the IPU.
For more information, refer to Targeting the IPU from TensorFlow 2 and TensorFlow 2 Quick Start.
These are available from the image graphcore/tensorflow:2
.
For a quick example, we will run an example script from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/tensorflow2/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file>
to create the IPUJob:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: tensorflow-example-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: tensorflow-example\n image: graphcore/tensorflow:2\n command: [/bin/bash, -c, --]\n args:\n - |\n apt update;\n apt upgrade -y;\n apt install git -y;\n cd;\n mkdir build;\n cd build;\n git clone https://github.com/graphcore/examples.git;\n cd examples/tutorials/simple_applications/tensorflow2/mnist;\n python -m pip install -r requirements.txt;\n python mnist_code_only.py --epochs 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Running kubectl logs <pod>
should show the results similar to the following
...\n2023-10-25 13:21:40.263823: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:43] Poplar version: 3.2.0 (1513789a51) Poplar package: b82480c629\n2023-10-25 13:21:42.203515: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1619] TensorFlow device /device:IPU:0 attached to 1 IPU with Poplar device ID: 0\nDownloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz\n11493376/11490434 [==============================] - 0s 0us/step\n11501568/11490434 [==============================] - 0s 0us/step\n2023-10-25 13:21:43.789573: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)\n2023-10-25 13:21:44.164207: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.\n2023-10-25 13:21:57.935339: I tensorflow/compiler/jit/xla_compilation_cache.cc:376] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.\nEpoch 1/4\n2000/2000 [==============================] - 17s 8ms/step - loss: 0.6188\nEpoch 2/4\n2000/2000 [==============================] - 1s 427us/step - loss: 0.3330\nEpoch 3/4\n2000/2000 [==============================] - 1s 371us/step - loss: 0.2857\nEpoch 4/4\n2000/2000 [==============================] - 1s 439us/step - loss: 0.2568\n
"},{"location":"services/graphcore/training/L4_other_frameworks/#popart","title":"PopART","text":"The Poplar Advanced Run Time (PopART) enables importing and constructing ONNX graphs, and running graphs in inference, evaluation or training modes. PopART provides both a C++ and Python API.
For more information, see the PopART User Guide
PopART is available from the image graphcore/popart
.
For a quick example, we will run an example script from https://github.com/graphcore/tutorials/tree/sdk-release-3.1/simple_applications/popart/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file>
to create the IPUJob:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: popart-example-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: popart-example\n image: graphcore/popart:3.3.0\n command: [/bin/bash, -c, --]\n args:\n - |\n cd ;\n mkdir build;\n cd build ;\n git clone https://github.com/graphcore/tutorials.git;\n cd tutorials;\n git checkout sdk-release-3.1;\n cd simple_applications/popart/mnist;\n python3 -m pip install -r requirements.txt;\n ./get_data.sh;\n python3 popart_mnist.py --epochs 1\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Running kubectl logs <pod>
should show output similar to the following
...\nCreating ONNX model.\nCompiling the training graph.\nCompiling the validation graph.\nRunning training loop.\nEpoch #1\n Loss=16.2605\n Accuracy=88.88%\n
"},{"location":"services/graphcore/training/L4_other_frameworks/#writing-ipu-programs-directly-with-poplibs","title":"Writing IPU programs directly with PopLibs","text":"The Poplar libraries are a set of C++ libraries consisting of the Poplar graph library and the open-source PopLibs libraries.
The Poplar graph library provides direct access to the IPU from code written in C++. You can write complete programs using Poplar, or use it to write functions to be called from your application written in a higher-level framework such as TensorFlow.
The PopLibs libraries are a set of application libraries that implement operations commonly required by machine learning applications, such as linear algebra operations, element-wise tensor operations, non-linearities and reductions. These provide a fast and easy way to create programs that run efficiently using the parallelism of the IPU.
For more information, see Poplar Quick Start and Poplar and PopLibs User Guide.
These are available from the image graphcore/poplar
.
When using the PopLibs libraries, you will have to include the relevant header files from the include/popops
directory, e.g.
#include <include/popops/ElementWise.hpp>\n
and to link the relevant PopLibs libraries, in addition to the Poplar library, e.g.
g++ -std=c++11 my-program.cpp -lpoplar -lpopops\n
For a quick example, we will run an example from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/poplar/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file>
to create the IPUJob:
apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n generateName: poplib-example-\nspec:\n jobInstances: 1\n ipusPerJobInstance: \"1\"\n workers:\n template:\n spec:\n containers:\n - name: poplib-example\n image: graphcore/poplar:3.3.0\n command: [\"bash\"]\n args: [\"-c\", \"cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/poplar/mnist/ && ./get_data.sh && make && ./regression-demo -IPU 1 50\"]\n resources:\n limits:\n cpu: 32\n memory: 200Gi\n securityContext:\n capabilities:\n add:\n - IPC_LOCK\n volumeMounts:\n - mountPath: /dev/shm\n name: devshm\n restartPolicy: Never\n hostIPC: true\n volumes:\n - emptyDir:\n medium: Memory\n sizeLimit: 10Gi\n name: devshm\n
Running kubectl logs <pod>
should show output similar to the following
...\nUsing the IPU\nTrying to attach to IPU\nAttached to IPU 0\nTarget:\n Number of IPUs: 1\n Tiles per IPU: 1,472\n Total Tiles: 1,472\n Memory Per-Tile: 624.0 kB\n Total Memory: 897.0 MB\n Clock Speed (approx): 1,850.0 MHz\n Number of Replicas: 1\n IPUs per Replica: 1\n Tiles per Replica: 1,472\n Memory per Replica: 897.0 MB\n\nGraph:\n Number of vertices: 5,466\n Number of edges: 16,256\n Number of variables: 41,059\n Number of compute sets: 20\n\n...\n\nEpoch 1 (99%), accuracy 76%\n
"},{"location":"services/jhub/","title":"EIDF Notebook Service","text":"The EIDF Notebook Service is a scalable Jupyterhub deployment in the EIDF Data Science Cloud.
The Notebook Service is open to all EIDF users and offers a selection of data science environments and user interfaces, including Jupyter notebooks, JupyterLab and RStudio.
Follow Quickstart to start using the EIDF Notebook Service.
"},{"location":"services/jhub/quickstart/","title":"Quickstart","text":""},{"location":"services/jhub/quickstart/#accessing","title":"Accessing","text":"Access the EIDF Notebooks in your browser by opening https://notebook.eidf.ac.uk/. You must be a member of an active EIDF project and have a user account to use the EIDF Notebook Service.
Click on \"Sign In with SAFE\". You will be redirected to the SAFE login page.
Log into the SAFE if you're not logged in already. If you have more than one account you will be presented with the form \"Approve Token\" and a choice of user accounts for the Notebook Service. The account you select is the user identity in your notebooks, and you can share data with DSC VMs in the same project.
Select the account you would like to use from the dropdown \"User Account\" at the end of the form. Then press \"Accept\" to return to the EIDF Notebook Service where you can select a server environment.
Select the environment that you would like to use for your notebooks and press \"Start\". Now your notebook container will be launched. This may take a little while.
"},{"location":"services/jhub/quickstart/#first-notebook","title":"First Notebook","text":"You will be presented with the JupyterLab dashboard view when the container has started.
The availability of launchers depends on the environment that you selected.
For example, launch a Python 3 notebook or an R notebook from the dashboard. You can also launch a terminal session.
"},{"location":"services/jhub/quickstart/#python-packages","title":"Python packages","text":"Note that Python packages are installed into the system space of your container by default. However this means that they are not available after a restart of your notebook container which may happen when your session was idle for a while. We recommend specifying --user
to install packages into your user directory to preserve installations across sessions.
To install python packages in a notebook use the command:
!pip install <package> --user\n
or run the command in a terminal:
pip install <package> --user\n
"},{"location":"services/jhub/quickstart/#data","title":"Data","text":"There is a project space mounted in /project_data
. Only project accounts have permissions to view and write to their project folder in this space. Here you can share data with other notebook users in your project. Data placed in /project_data/shared
is shared with other notebook users outside your project.
You can also share data with DSC VMs in your project. Please contact the helpdesk if you would like to mount this project space to one of your VMs.
"},{"location":"services/jhub/quickstart/#limits","title":"Limits","text":"Note that there are limited amounts of memory and cores available per user. Users do not have sudo permissions in the containers so you cannot install any system packages.
Currently there is no access to GPUs. You can submit jobs to the EIDF GPU Service but you cannot run your notebooks on a GPU.
"},{"location":"services/mft/","title":"MFT","text":""},{"location":"services/mft/quickstart/","title":"Managed File Transfer","text":""},{"location":"services/mft/quickstart/#getting-to-the-mft","title":"Getting to the MFT","text":"The EIDF MFT can be accessed at https://eidf-mft.epcc.ed.ac.uk
"},{"location":"services/mft/quickstart/#how-it-works","title":"How it works","text":"The MFT provides a 'drop zone' for the project. All users in a given project will have access to the same shared transfer area. They will have the ability to upload, download, and delete files from the project's transfer area. This area is linked to a directory within the projects space on the shared backend storage.
Files which are uploaded are owned by the Linux user 'nobody' and the group ID of whatever project the file is being uploaded to. They have the permissions: Owner = rw Group = r Others = r
Once the file is opened on the VM, the user that opened it will become the owner and they can make further changes.
"},{"location":"services/mft/quickstart/#gaining-access-to-the-mft","title":"Gaining access to the MFT","text":"By default a project won't have access to the MFT, this has to be enabled. Currently this can be done by the PI sending a request to the EIDF Helpdesk. Once the project is enabled within the MFT, every user with the project will be able to log into the MFT using their usual EIDF credentials.
Once MFT access has been enabled for a project, PIs can give a project user access to the MFT. A new 'eidf-mft' machine option will be available for each user within the portal, which the PI can select to grant the user access to the MFT.
"},{"location":"services/mft/using-the-mft/","title":"Using the MFT Web Portal","text":""},{"location":"services/mft/using-the-mft/#logging-in-to-the-web-browser","title":"Logging in to the web browser","text":"When you reach the MFT home page you can log in using your usual VM project credentials.
You will then be asked what type of session you would like to start. Select New Web Client or Web Client and continue.
"},{"location":"services/mft/using-the-mft/#file-ingress","title":"File Ingress","text":"Once logged in, all files currently in the projects transfer directory will be displayed. Click the 'Upload' button under the 'Home' title to open the dialogue for file upload. You can then drag and drop files in, or click 'Browse' to find them locally.
Once uploaded, the file will be immediately accessible from the project area, and can be used within any EIDF service which has the filesystem mounted.
"},{"location":"services/mft/using-the-mft/#file-egress","title":"File Egress","text":"File egress can be done in the reverse way. By placing the file into the project transfer directory, it will become available in the MFT portal.
"},{"location":"services/mft/using-the-mft/#file-management","title":"File Management","text":"Directories can be created within the project transfer directory, for example with 'Import' and 'Export' to allow for better file management. Files deleted from either the MFT portal or from the VM itself will remove it from the other, as both locations point at the same file. It's only stored in one place, so modifications made from either place will remove the file.
"},{"location":"services/mft/using-the-mft/#sftp","title":"SFTP","text":"Once a project and user have access to the MFT, they can connect to it using SFTP as well as through the web browser.
This can be done by logging into the MFT URL with the user's project account:
```bash
sftp [EIDF username]@eidf-mft.epcc.ed.ac.uk\n
```
"},{"location":"services/mft/using-the-mft/#scp","title":"SCP","text":"Files can be scripted to be upload to the MFT using SCP.
To copy a file to the project MFT area using SCP:
scp /path/to/file [EIDF username]@eidf-mft.epcc.ed.ac.uk:/\n
"},{"location":"services/s3/","title":"Overview","text":"The EIDF S3 Service is an object store with an interface that is compatible with a subset of the Amazon S3 RESTful API.
"},{"location":"services/s3/#service-access","title":"Service Access","text":"Users should have an EIDF account as described in EIDF Accounts.
Project leads can request an object store allocation through a request to the EIDF helpdesk.
"},{"location":"services/s3/#access-keys","title":"Access keys","text":"Select your project at https://portal.eidf.ac.uk/project/. Your access keys are displayed in the table at the top of the page.
For each account, the quota and the number of buckets that it is permitted to create is shown, as well as the access keys. Click on \"Secret\" to view the access secret. You will need the access key, the corresponding access secret and the endpoint https://s3.eidf.ac.uk
to connect to the EIDF S3 Service with an S3 client.
Access management: Project management guide to managing accounts and access permissions for your S3 allocation.
Tutorial: Examples using EIDF S3
"},{"location":"services/s3/manage/","title":"Manage EIDF S3 access","text":"Access keys and accounts for the object store are managed by project managers via the EIDF Portal.
"},{"location":"services/s3/manage/#request-an-allocation","title":"Request an allocation","text":"An object store allocation for a project may be requested by contacting the EIDF helpdesk.
"},{"location":"services/s3/manage/#object-store-accounts","title":"Object store accounts","text":"Select your project at https://portal.eidf.ac.uk/project/ and jump to \"S3 Allocation\" on the project page to manage access keys and accounts.
S3 buckets and objects are owned by an account. Each account has a quota for storage and the number of buckets that it can create. The sum of all account quotas is limited by the total storage quota of the project object store allocation shown at the top.
An account with the minimum storage quota (1B) and zero buckets is effectively read only as it may not create new buckets and so cannot upload files.
To create an account:
_
only)You will not be allowed to create an account with the quota greater than the available storage quota of the project.
It may take a little while for the account to become available. Refresh the project page to update the list of accounts.
"},{"location":"services/s3/manage/#access-keys","title":"Access keys","text":"To use S3 (listing or creating buckets, listing objects or uploading and downloading files) you need an access key and a secret. An account can own any number of access keys. These keys share the account's quota and have access to the same buckets.
To create an access key:
It can take a little while for the access keys to become available. Refresh the project page to update the list of keys.
"},{"location":"services/s3/manage/#access-key-permissions","title":"Access key permissions","text":"You can control which project members are allowed to view an access key and secret in the EIDF Portal or the SAFE. Project managers and the PI have access to all S3 accounts and can view associated access keys and secrets in the project management view.
To grant view permissions for an access key to a project member:
It can take a little while for the permissions update to complete.
Note
Anyone who knows an access key and secret will be able to perform the associated activities via the S3 API regardless of the view permissions.
"},{"location":"services/s3/manage/#delete-an-access-key","title":"Delete an access key","text":"Click on the \"Bin\" icon next to a key and press \"Delete\" on the form.
"},{"location":"services/s3/tutorial/","title":"Tutorial","text":"Buckets owned by an EIDF project are placed in a tenancy in the EIDF S3 Service. The project code is a prefix on the bucket name, separated by a colon (:
), for example eidfXX1:somebucket
. Note that some S3 client libraries do not accept bucket names in this format.
The following examples use the AWS Command Line Interface (AWS CLI) to connect to EIDF S3.
"},{"location":"services/s3/tutorial/#setup","title":"Setup","text":"Install with pip
python -m pip install awscli\n
Installers are available for various platforms if you are not using Python: see https://aws.amazon.com/cli/
"},{"location":"services/s3/tutorial/#configure","title":"Configure","text":"Set your access key and secret as environment variables or configure a credentials file at ~/.aws/credentials
on Linux or %USERPROFILE%\\.aws\\credentials
on Windows.
Credentials file:
[default]\naws_access_key_id=<key>\naws_secret_access_key=<secret>\n
Environment variables:
export AWS_ACCESS_KEY_ID=<key>\nexport AWS_SECRET_ACCESS_KEY=<secret>\n
The parameter --endpoint-url https://s3.eidf.ac.uk
must always be set when calling a command.
List the buckets in your account:
aws s3 ls --endpoint-url https://s3.eidf.ac.uk\n
Create a bucket:
aws s3api create-bucket --bucket <bucketname> --endpoint-url https://s3.eidf.ac.uk\n
Upload a file:
aws s3 cp <filename> s3://<bucketname> --endpoint-url https://s3.eidf.ac.uk\n
Check that the file above was uploaded successfully by listing objects in the bucket:
aws s3 ls s3://<bucketname> --endpoint-url https://s3.eidf.ac.uk\n
To read from a public bucket without providing credentials, add the option --no-sign-request
to the call:
aws s3 ls s3://<bucketname> --no-sign-request --endpoint-url https://s3.eidf.ac.uk\n
"},{"location":"services/s3/tutorial/#python-using-boto3","title":"Python using boto3
","text":"The following examples use the Python library boto3
.
Installation:
python -m pip install boto3\n
"},{"location":"services/s3/tutorial/#usage","title":"Usage","text":"By default, the boto3
Python library raises an error that bucket names with a colon :
(as used by the EIDF S3 Service) are invalid, so we have to switch off the bucket name validation:
import boto3\nfrom botocore.handlers import validate_bucket_name\n\ns3 = boto3.resource('s3', endpoint_url='https://s3.eidf.ac.uk')\ns3.meta.client.meta.events.unregister('before-parameter-build.s3', validate_bucket_name)\n
List buckets:
for bucket in s3.buckets.all():\n print(f'{bucket.name}')\n
List objects in a bucket:
project_code = 'eidfXXX'\nbucketname = 'somebucket'\nbucket = s3.Bucket(f'{project_code}:{bucket_name}')\nfor obj in bucket.objects.all():\n print(f'{obj.key}')\n
Upload a file to a bucket:
bucket = s3.Bucket(f'{project_code}:{bucket_name}')\nbucket.upload_file('./somedata.csv', 'somedata.csv')\n
"},{"location":"services/s3/tutorial/#access-policies","title":"Access policies","text":"Bucket permissions use IAM policies. You can grant other accounts (within the same project or from other projects) read or write access to your buckets. For example to grant permissions to put, get, delete and list objects in bucket eidfXX1:somebucket
to the account account2
in project eidfXX2
:
{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Sid\": \"AllowAccessToBucket\",\n \"Principal\": {\n \"AWS\": [\n \"arn:aws:iam::eidfXX2:user/account2\",\n ]\n },\n \"Effect\": \"Allow\",\n \"Action\": [\n \"s3:PutObject\",\n \"s3:GetObject\",\n \"s3:ListBucket\",\n \"s3:DeleteObject\",\n ],\n \"Resource\": [\n \"arn:aws:s3:::/*\",\n \"arn:aws:s3::eidfXX1:somebucket\"\n ]\n }\n ]\n}\n
You can chain multiple policies in the statement array:
{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Principal\": { ... }\n \"Effect\": \"Allow\",\n \"Action\": [ ... ],\n \"Resource\": [ ... ]\n },\n {\n \"Principal\": { ... }\n \"Effect\": \"Allow\",\n \"Action\": [ ... ],\n \"Resource\": [ ... ]\n }\n ]\n}\n
Give public read access to a bucket (listing and downloading files):
{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Effect\": \"Allow\",\n \"Principal\": \"*\",\n \"Action\": [\"s3:ListBucket\"],\n \"Resource\": [\n f\"arn:aws:s3::eidfXX1:somebucket\"\n ]\n },\n {\n \"Effect\": \"Allow\",\n \"Principal\": \"*\",\n \"Action\": [\"s3:GetObject\"],\n \"Resource\": [\n f\"arn:aws:s3::eidfXX1:somebucket/*\"\n ]\n }\n ]\n}\n
"},{"location":"services/s3/tutorial/#set-policy-using-aws-cli","title":"Set policy using AWS CLI","text":"Grant permissions stored in an IAM policy file:
aws put-bucket-policy --bucket <bucketname> --policy \"$(cat bucket-policy.json)\"\n
"},{"location":"services/s3/tutorial/#set-policy-using-python-boto3","title":"Set policy using Python boto3
","text":"Grant permissions to another account: In this example we grant ListBucket
and GetObject
permissions to account account1
in project eidfXX1
and account2
in project eidfXX2
.
import json\n\nbucket_policy = {\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Effect\": \"Allow\",\n \"Principal\": {\n \"AWS\": [\n \"arn:aws:iam::eidfXX1:user/account1\",\n \"arn:aws:iam::eidfXX2:user/account2\",\n ]\n },\n \"Action\": [\n \"s3:ListBucket\",\n \"s3:GetObject\"\n ],\n \"Resource\": [\n f\"arn:aws:s3::eidfXX1:{bucket_name}\"\n f\"arn:aws:s3::eidfXX1:{bucket_name}/*\"\n ]\n }\n ]\n}\n\npolicy = bucket.Policy()\npolicy.put(Policy=json.dumps(bucket_policy))\n
"},{"location":"services/ultra2/","title":"Ultra2 Large Memory System","text":"Overview
Connect
Running jobs
"},{"location":"services/ultra2/access/","title":"Overview","text":"Ultra2 is a single logical CPU system based at EPCC. It is suitable for running jobs which require large volumes of non-distributed memory (as opposed to a cluster).
"},{"location":"services/ultra2/access/#specifications","title":"Specifications","text":"The system is a HPE SuperDome Flex containing 576 individual cores in a SMT-1 arrangement (1 thread per core). The system has 18TB of memory available to users. Home directories are network mounted from the EIDF e1000 Lustre filesystem, although some local NVMe storage is available for temporary file storage during runs.
"},{"location":"services/ultra2/access/#getting-access","title":"Getting Access","text":"Access to Ultra2 is currently by arrangement with EPCC. Please email eidf@epcc.ed.ac.uk with a short description of the work you would like to perform.
"},{"location":"services/ultra2/connect/","title":"Login","text":"The hostname for SSH access to the system is ultra2.eidf.ac.uk
To access Ultra2, you need to use two credentials: your SSH key pair protected by a passphrase and a Time-based one-time password (TOTP).
"},{"location":"services/ultra2/connect/#ssh-key","title":"SSH Key","text":"You must upload the public part of your SSH key pair to the SAFE by following the instructions from the SAFE documentation
"},{"location":"services/ultra2/connect/#time-based-one-time-password-totp","title":"Time-based one-time password (TOTP)","text":"You must set up your TOTP token by following the instructions from the SAFE documentation
"},{"location":"services/ultra2/connect/#ssh-login-example","title":"SSH Login example","text":"To login to Ultra2, you will need to use the SSH Key and TOTP token as noted above. With the appropriate key loadedssh <username>@ultra2.eidf.ac.uk
will then prompt you, roughly once per day, for your TOTP code.
The primary HPC software provided is Intel's OneAPI suite containing mpi compilers and runtimes, debuggers and the vTune performance analyser. Standard GNU compilers are also available. The OneAPI suite can be loaded by sourcing the shell script:
source /opt/intel/oneapi/setvars.sh\n
"},{"location":"services/ultra2/run/#queue-system","title":"Queue system","text":"All jobs must be run via SLURM to avoid inconveniencing other users of the system. Users should not run jobs directly. Note that the system has one logical processor with a large number of threads and thus appears to SLURM as a single node. This is intentional.
"},{"location":"services/ultra2/run/#queue-limits","title":"Queue limits","text":"We kindly request that users limit their maximum total running job size to 288 cores and 4TB of memory, whether that be a divided into a single job, or a number of jobs. This may be enforced via SLURM in the future.
"},{"location":"services/ultra2/run/#example-mpi-job","title":"Example MPI job","text":"An example script to run a multi-process MPI \"Hello world\" example is shown.
#!/usr/bin/env bash\n#SBATCH -J HelloWorld\n#SBATCH --nodes=1\n#SBATCH --tasks-per-node=4\n#SBATCH --nodelist=sdf-cs1\n#SBATCH --partition=standard\n##SBATCH --exclusive\n\n\necho \"Running on host ${HOSTNAME}\"\necho \"Using ${SLURM_NTASKS_PER_NODE} tasks per node\"\necho \"Using ${SLURM_CPUS_PER_TASK} cpus per task\"\nlet mpi_threads=${SLURM_NTASKS_PER_NODE}*${SLURM_CPUS_PER_TASK}\necho \"Using ${mpi_threads} MPI threads\"\n\n# Source oneAPI to ensure mpirun available\nif [[ -z \"${SETVARS_COMPLETED}\" ]]; then\nsource /opt/intel/oneapi/setvars.sh\nfi\n\n# mpirun invocation for Intel suite.\nmpirun -n ${mpi_threads} ./helloworld.exe\n
"},{"location":"services/virtualmachines/","title":"Overview","text":"The EIDF Virtual Machine (VM) Service is the underlying infrastructure upon which the EIDF Data Science Cloud (DSC) is built.
The service currenly has a mixture of hardware node types which host VMs of various flavours:
The shapes and sizes of the flavours are based on subdivisions of this hardware, noting that CPUs are 4x oversubscribed for mcomp nodes (general VM flavours).
"},{"location":"services/virtualmachines/#service-access","title":"Service Access","text":"Users should have an EIDF account - EIDF Accounts.
Project Leads will be able to have access to the DSC added to their project during the project application process or through a request to the EIDF helpdesk.
"},{"location":"services/virtualmachines/#additional-service-policy-information","title":"Additional Service Policy Information","text":"Additional information on service policies can be found here.
"},{"location":"services/virtualmachines/docs/","title":"Service Documentation","text":""},{"location":"services/virtualmachines/docs/#project-management-guide","title":"Project Management Guide","text":""},{"location":"services/virtualmachines/docs/#required-member-permissions","title":"Required Member Permissions","text":"VMs and user accounts can only be managed by project members with Cloud Admin permissions. This includes the principal investigator (PI) of the project and all project managers (PM). Through SAFE the PI can designate project managers and the PI and PMs can grant a project member the Cloud Admin role:
For details please refer to the SAFE documentation: How can I designate a user as a project manager?
"},{"location":"services/virtualmachines/docs/#create-a-vm","title":"Create a VM","text":"To create a new VM:
eidfxxx
Complete the 'Create Machine' form as follows:
dev-01
. The project code will be prepended automatically to your VM name, in this case your VM would be named eidfxxx-dev-01
.Click on 'Create'
You may wish to ensure that the machine size selected (number of CPUs and RAM) does not exceed your remaining quota before you press Create, otherwise the request will fail.
In the list of 'Machines' in the project page in the portal, click on the name of new VM to see the configuration and properties, including the machine specification, its 10.24.*.*
IP address and any configured VDI connections.
Each project has a quota for the number of instances, total number of vCPUs, total RAM and storage. You will not be able to create a VM if it exceeds the quota.
You can view and refresh the project usage compared to the quota in a table near the bottom of the project page. This table will be updated automatically when VMs are created or removed, and you can refresh it manually by pressing the \"Refresh\" button at the top of the table.
Please contact the helpdesk if your quota requirements have changed.
"},{"location":"services/virtualmachines/docs/#add-a-user-account","title":"Add a user account","text":"User accounts allow project members to log in to the VMs in a project. The Project PI and project managers manage user accounts for each member of the project. Users usually use one account (username and password) to log in to all the VMs in the same project that they can access, however a user may have multiple accounts in a project, for example for different roles.
Complete the 'Create User Account' form as follows:
The user can now set the password for their new account on the account details page.
"},{"location":"services/virtualmachines/docs/#adding-access-to-the-vm-for-a-user","title":"Adding Access to the VM for a User","text":"User accounts can be granted or denied access to existing VMs.
If a user is logged in already to the VDI at https://eidf-vdi.epcc.ed.ac.uk/vdi newly added connections may not appear in their connections list immediately. They must log out and log in again to refresh the connection information, or wait until the login token expires and is refreshed automatically - this might take a while.
If a user only has one connection available in the VDI they will be automatically directed to the VM with the default connection.
"},{"location":"services/virtualmachines/docs/#sudo-permissions","title":"Sudo permissions","text":"A project manager or PI may also grant sudo permissions to users on selected VMs. Management of sudo permissions must be requested in the project application - if it was not requested or the request was denied the functionality described below is not available.
After a few minutes, the job to give the user account sudo permissions on the selected VMs will complete. On the account detail page a \"sudo\" badge will appear next to the selected VMs.
Please contact the helpdesk if sudo permission management is required but is not available in your project.
"},{"location":"services/virtualmachines/docs/#first-login","title":"First login","text":"A new user account must reset the password before they can log in for the first time. To do this:
If you did not select RDP access when you created the VM you can add it later:
Once the RDP job is completed, all users that are allowed to access the VM will also be permitted to use the RDP connection.
"},{"location":"services/virtualmachines/docs/#software-catalogue","title":"Software catalogue","text":"You can install packages from the software catalogue at a later time, even if you didn't select a package when first creating the machine.
It is the responsibility of project PIs to keep the VMs in their projects up to date as stated in the policy.
"},{"location":"services/virtualmachines/docs/#ubuntu","title":"Ubuntu","text":"To patch and update packages on Ubuntu run the following commands (requires sudo permissions):
sudo apt update\nsudo apt upgrade\n
Your system might require a restart after installing updates.
"},{"location":"services/virtualmachines/docs/#rocky","title":"Rocky","text":"To patch and update packages on Rocky run the following command (requires sudo permissions):
sudo dnf update\n
Your system might require a restart after installing updates.
"},{"location":"services/virtualmachines/docs/#reboot","title":"Reboot","text":"When logged in you can reboot a VM with this command (requires sudo permissions):
sudo reboot now\n
or use the reboot button in the EIDF Portal (requires project manager permissions).
"},{"location":"services/virtualmachines/flavours/","title":"Flavours","text":"These are the current Virtual Machine (VM) flavours (configurations) available on the the Virtual Desktop cloud service. Note that all VMs are built and configured using the EIDF Portal by PIs/Cloud Admins of projects, except GPU flavours which must be requested via the helpdesk or the support request form.
Flavour Name vCPUs DRAM in GB Pinned Cores GPU general.v2.tiny 1 2 No No general.v2.small 2 4 No No general.v2.medium 4 8 No No general.v2.large 8 16 No No general.v2.xlarge 16 32 No No capability.v2.8cpu 8 112 Yes No capability.v2.16cpu 16 224 Yes No capability.v2.32cpu 32 448 Yes No capability.v2.48cpu 48 672 Yes No capability.v2.64cpu 64 896 Yes No gpu.v1.8cpu 8 128 Yes Yes gpu.v1.16cpu 16 256 Yes Yes gpu.v1.32cpu 32 512 Yes Yes gpu.v1.48cpu 48 768 Yes Yes"},{"location":"services/virtualmachines/policies/","title":"EIDF Data Science Cloud Policies","text":""},{"location":"services/virtualmachines/policies/#end-of-life-policy-for-user-accounts-and-projects","title":"End of Life Policy for User Accounts and Projects","text":""},{"location":"services/virtualmachines/policies/#what-happens-when-an-account-or-project-is-no-longer-required-or-a-user-leaves-a-project","title":"What happens when an account or project is no longer required, or a user leaves a project","text":"These situations are most likely to come about during one of the following scenarios:
For each user account involved, assuming the relevant consent is given, the next step can be summarised as one of the following actions:
It will be possible to have the account re-activated up until resources are removed (as outlined above); after this time it will be necessary to re-apply.
A user's right to use EIDF is granted by a project. Our policy is to treat the account and associated data as the property of the PI as the owner of the project and its resources. It is the user's responsibility to ensure that any data they store on the EIDF DSC is handled appropriately and to copy off anything that they wish to keep to an appropriate location.
A project manager or the PI can revoke a user's access accounts within their project at any time, by locking, removing or re-owning the account as appropriate.
A user may give up access to an account and return it to the control of the project at any time.
When a project is due to end, the PI will receive notification of the closure of the project and its accounts one month before all project accounts and DSC resources (VMs, data volumes) are closed and cleaned or removed.
"},{"location":"services/virtualmachines/policies/#backup-policies","title":"Backup policies","text":"The current policy is:
We strongly advise that you keep copies of any critical data on an alternative system that is fully backed up.
"},{"location":"services/virtualmachines/policies/#patching-of-user-vms","title":"Patching of User VMs","text":"The EIDF team updates and patches the hypervisors and the cloud management software as part of the EIDF Maintenance sessions. It is the responsibility of project PIs to keep the VMs in their projects up to date. VMs running the Ubuntu and Rocky operating systems automatically install security patches and alert users at log-on (via SSH) to reboot as necessary for the changes to take effect. They also encourage users to update packages.
"},{"location":"services/virtualmachines/policies/#customer-run-outward-facing-web-services","title":"Customer-run outward facing web services","text":"PIs can apply to run an outward-facing service; that is a webservice on port 443, running on a project-owned VM. The policy requires the customer to accept the following conditions:
Pis can apply for such a service on application and also at any time by contacing the EIDF Service Desk.
"},{"location":"services/virtualmachines/quickstart/","title":"Quickstart","text":"Projects using the Virtual Desktop cloud service are accessed via the EIDF Portal.
Authentication is provided by SAFE, so if you do not have an active web browser session in SAFE, you will be redirected to the SAFE log on page. If you do not have a SAFE account follow the instructions in the SAFE documentation how to register and receive your password.
"},{"location":"services/virtualmachines/quickstart/#accessing-your-projects","title":"Accessing your projects","text":"Log into the portal at https://portal.eidf.ac.uk/. The login will redirect you to the SAFE.
View the projects that you have access to at https://portal.eidf.ac.uk/project/
Navigate to https://portal.eidf.ac.uk/project/ and click the link to \"Request access\", or choose \"Request Access\" in the \"Project\" menu.
Select the project that you want to join in the \"Project\" dropdown list - you can search for the project name or the project code, e.g. \"eidf0123\".
Now you have to wait for your PI or project manager to accept your request to join.
"},{"location":"services/virtualmachines/quickstart/#accessing-a-vm","title":"Accessing a VM","text":"Select a project and view your user accounts on the project page.
Click on an account name to view details of the VMs that are you allowed to access with this account, and to change the password for this account.
Before you log in for the first time with a new user account, you must change your password as described below.
Follow the link to the Guacamole login or log in directly at https://eidf-vdi.epcc.ed.ac.uk/vdi/. Please see the VDI guide for more information.
You can also log in via the EIDF Gateway Jump Host if this is available in your project.
Warning
You must set a password for a new account before you log in for the first time.
"},{"location":"services/virtualmachines/quickstart/#set-or-change-the-password-for-a-user-account","title":"Set or change the password for a user account","text":"Follow these instructions to set a password for a new account before you log in for the first time. If you have forgotten your password you may reset the password as described here.
Select a project and click the account name in the project page to view the account details.
In the user account detail page, press the button \"Set Password\" and follow the instructions in the form.
There may be a short delay while the change is implemented before the new password becomes usable.
"},{"location":"services/virtualmachines/quickstart/#further-information","title":"Further information","text":"Managing VMs: Project management guide to creating, configuring and removing VMs and managing user accounts in the portal.
Virtual Desktop Interface: Working with the VDI interface.
EIDF Gateway: SSH access to VMs via the EIDF SSH Gateway jump host.
"},{"location":"status/","title":"EIDF Service Status","text":"The table below represents the broad status of each EIDF service.
Service Status EIDF Portal VM SSH Gateway VM VDI Gateway Virtual Desktops Cerebras CS-2 Ultra2"},{"location":"status/#maintenance-sessions","title":"Maintenance Sessions","text":"There will be a service outage on the 3rd Thursday of every month from 9am to 5pm. We keep maintenance downtime to a minimum on the service but do occasionally need to perform essential work on the system. Maintenance sessions are used to ensure that:
The service will be returned to service ahead of 5pm if all the work is completed early.
"}]} \ No newline at end of file diff --git a/services/cs2/run/index.html b/services/cs2/run/index.html index 27b464764..91d08d2c2 100644 --- a/services/cs2/run/index.html +++ b/services/cs2/run/index.html @@ -884,78 +884,6 @@ - - -source venv_cerebras_pt/bin/activate
cerebras_install_check
Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround:
-vi <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py
-
:530
-
The section should look like this:
-if modified_time > self._last_modified:
- raise RuntimeError(
- f"Attempting to materialize deferred tensor with key "
- f"\"{self._key}\" from file {self._filepath}, but the file has "
- f"since been modified. The loaded tensor value may be "
- f"different from originally loaded tensor. Please refrain "
- f"from modifying the file while the run is in progress."
- )
-
if modified_time > self._last_modified
Comment out the if modified_time > self._last_modified check so that the section reads:
- # raise RuntimeError(
- # f"Attempting to materialize deferred tensor with key "
- # f"\"{self._key}\" from file {self._filepath}, but the file has "
- # f"since been modified. The loaded tensor value may be "
- # f"different from originally loaded tensor. Please refrain "
- # f"from modifying the file while the run is in progress."
- # )
-
:774
-
The section should look like this:
- if stat.st_mtime_ns > self._stat.st_mtime_ns:
- raise RuntimeError(
- f"Attempting to {msg} deferred tensor with key "
- f"\"{self._key}\" from file {self._filepath}, but the file has "
- f"since been modified. The loaded tensor value may be "
- f"different from originally loaded tensor. Please refrain "
- f"from modifying the file while the run is in progress."
- )
-
if stat.st_mtime_ns > self._stat.st_mtime_ns
Comment out the if stat.st_mtime_ns > self._stat.st_mtime_ns check so that the section reads:
- # raise RuntimeError(
- # f"Attempting to {msg} deferred tensor with key "
- # f"\"{self._key}\" from file {self._filepath}, but the file has "
- # f"since been modified. The loaded tensor value may be "
- # f"different from originally loaded tensor. Please refrain "
- # f"from modifying the file while the run is in progress."
- # )
-
There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras, Python, paths and mount directories, which explains these parameters and how they should be used.