The presentation given during the session is at https://mikecroucher.github.io/Intro_to_HPC/#/
The University of Sheffield has two HPC systems:
- ShARC: Sheffield's newest system. It contains about 2000 CPU cores, all of which are latest generation.
- Iceberg: Sheffield's older system. It contains 3440 CPU cores, but many of them are very old and slow.
The two systems are broadly similar but have small differences in the way you load applications using the module system.
We recommend that you use ShARC as much as possible.
To use the HPC from a Windows machine, you need a way to connect - we recommend you install MobaXterm. This is available from http://mobaxterm.mobatek.net. On a University machine, you need to install the Portable edition (highlighted in the image below):
The download is a zip file that contains three other files. You should extract these files, for example to your desktop, before you use them. Do not run MobaXterm directly from the zip file.
MobaXterm also contains MobaTextEditor, which you can use to write your programs.
You can connect to ShARC using MobaXterm as shown in the screenshot below.
The Remote Host field should contain sharc.sheffield.ac.uk:
If your log-in is successful, you should see something like the screen below.
If you cannot see the file browser pane on the left-hand side then see Troubleshooting
At this point, you are on the login (or master) node of ShARC. There isn't much compute power here and many people use it simultaneously, so we should move onto a compute node as soon as possible.
Since ShARC is a shared system, used by hundreds of users, we need to request resources from the scheduler using the command qrshx. We also need to tell the system how much memory we want to use.
For example, to request 8 Gigabytes (8G) of memory, we would enter
qrshx -l rmem=8G
Note: the l is a lowercase letter L, not the number 1.
If this command is successful, you should see the prompt change from sharc-login1 or sharc-login2 to sharc-nodeXXX, where XXX will be replaced with the number of the node you have been assigned.
You are now on a compute node and have access to your own CPU core and 8 Gigabytes of RAM.
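If you are ever unsure which kind of node a terminal is on, the host name tells you. The following is a minimal sketch, assuming the sharc-login* and sharc-node* naming pattern seen in the prompts above; classify_host is a hypothetical helper written for this tutorial, not a ShARC command.

```shell
# Classify a host name as a login node, a compute node, or something else.
# The patterns assume the sharc-login* / sharc-node* names shown in the prompt.
classify_host() {
    case "$1" in
        sharc-login*) echo "login node" ;;
        sharc-node*)  echo "compute node" ;;
        *)            echo "unknown host" ;;
    esac
}

# Report on the machine this shell is running on.
echo "$(uname -n) is a: $(classify_host "$(uname -n)")"
```

If this reports a login node, request an interactive session with qrshx before doing any real work.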
Now would be a good time to learn some Linux commands using our Mini Terminal Tutorial
Before a Scala program can run on a Linux machine, it needs to be compiled, and here we use the Scala build tool (sbt). This requires a very strict directory structure and a .sbt file specifying dependencies. We illustrate this with the helloWorld example.
On the compute node, download a prepared Hello World application from GitHub with the command
git clone https://github.com/mikecroucher/scala-spark-HelloWorld
Enter the directory containing the code with the command
cd scala-spark-HelloWorld/
List the files in this directory with the Linux command ls
ls
This should give the output
project.sbt README.md src
Take a look at the contents of project.sbt, which defines our project, with the Linux command more
more project.sbt
This should give the output
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.1",
"org.apache.spark" %% "spark-sql" % "2.0.1"
)
// Could add other dependencies here e.g.
// libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.0.1"
The project.sbt file defines how our project will be compiled.
The code that is to be compiled is hidden a few directories deeper.
View it using the more command
more src/main/scala/hello.scala
This should give the output
import org.apache.spark.sql.SparkSession
// define main method (scala entry point)
object HelloWorld {
def main(args: Array[String]): Unit = {
// initialise spark session (running in "local" mode)
val sparkSession = SparkSession.builder
.master("local")
.appName("Hello World")
.getOrCreate()
// do stuff
println("Hello, world!")
// terminate underlying spark context
sparkSession.stop()
}
}
We've downloaded a project, taken a look at it and all seems well. We are almost ready to compile.
The command we need to use is sbt package, but when we try it, it doesn't work:
sbt package
results in
bash: sbt: command not found
This error message occurs because the sbt command is not available to us by default when we start a qrshx session on a compute node.
To make sbt available (along with Java and Spark, which we also need), we first have to load the relevant module files
module load apps/java/jdk1.8.0_102/binary
module load dev/sbt/0.13.13
module load apps/spark/2.1.0/gcc-4.8.5
Now, when you type sbt package, it will compile your program.
If this is successful, you'll have a file in the location target/scala-2.11/hello_2.11-1.0.jar.
Run with
spark-submit --master local[1] target/scala-2.11/hello_2.11-1.0.jar
You will see warnings like:
WARN: Unable to load native-hadoop library
These can be ignored.
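If sbt package or spark-submit ever fails with "command not found", the usual cause is that the module files have not been loaded in the current session. A quick check along the following lines can confirm which tools are on your PATH. This is a sketch only: check_tools is a hypothetical helper, and it tests for the commands themselves rather than for the modules.

```shell
# Report whether each named tool is on the PATH.
# Returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool (load its module first)"
            missing=1
        fi
    done
    return "$missing"
}

# The three commands this tutorial relies on.
check_tools java sbt spark-submit || echo "Load the module files, then try again."
```

Module loads only last for the current session, so they need to be repeated each time you start a new qrshx session.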
We'll now learn how to create HelloWorld from scratch to give us practice in using Linux commands.
Ensure you are in your home directory by executing the command cd on its own. Check that you are where you expect to be using the pwd (print working directory) command.
The result should be
/home/abc123
where abc123 will be replaced by your username.
Start by creating the project directory, which we'll call hello.
To create our directory, we could use the graphical user interface of MobaXterm as shown in the screenshot below
It's much easier, however, to use the mkdir command
mkdir hello
We could then proceed to create the other directories we need, one command at a time:
mkdir hello/src
mkdir hello/src/main
mkdir hello/src/main/scala
Alternatively, we could take a shortcut and use the -p switch of mkdir to create the whole nested structure at once.
mkdir -p hello/src/main/scala
Linux geeks are terminally lazy so if it feels like there should be a shortcut, there probably is one.
However you do it, you need to end up with the four nested directories above: hello, hello/src, hello/src/main and hello/src/main/scala.
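To convince yourself that both approaches give the same result, you can try the shortcut in a throwaway location and list what was created. The sketch below uses a temporary directory (via mktemp, an addition for this example) so that nothing in your home directory is touched.

```shell
# Work in a throwaway directory so the real project is not affected.
demo_root=$(mktemp -d)
cd "$demo_root" || exit 1

# Create hello, hello/src, hello/src/main and hello/src/main/scala in one step.
mkdir -p hello/src/main/scala

# List every directory that now exists under hello.
find hello -type d | sort
```

Running mkdir -p a second time on the same path is harmless: it does not complain if the directories already exist.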
Here, we create the .sbt file and the .scala file on the Windows machine (by downloading them or by copying and pasting them into an editor) and then transfer them to ShARC.
Recall that the .sbt file contains the dependencies required by the program. Take a look at the .sbt file included here for the helloWorld program. The .scala program is also available.
The .sbt file needs to be placed at the top level of the project.
You can just drag and drop it from Windows to ShARC using MobaXterm.
The .scala file needs to be placed in the scala directory.
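Because sbt is strict about where files live, it can be worth checking the layout before compiling. The following is a minimal sketch, assuming the project directory is called hello and that you run it from the directory containing it; check_layout is a hypothetical helper, not part of sbt.

```shell
# Check that a project directory has the layout described above:
# a .sbt file at the top level and at least one .scala file in src/main/scala.
check_layout() {
    project_dir=$1
    problems=0
    if ! ls "$project_dir"/*.sbt >/dev/null 2>&1; then
        echo "no .sbt file in $project_dir"
        problems=1
    fi
    if ! ls "$project_dir"/src/main/scala/*.scala >/dev/null 2>&1; then
        echo "no .scala file in $project_dir/src/main/scala"
        problems=1
    fi
    if [ "$problems" -eq 0 ]; then
        echo "layout looks fine"
    fi
    return "$problems"
}

# Run from the directory that contains the hello project.
check_layout hello || echo "Move the files into place and check again."
```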
Exactly as before, we compile with sbt package, but first make sure that the present working directory of your terminal is the one containing your new project.sbt (otherwise sbt won't know how to compile your code).
You can check your present working directory (PWD) using pwd and list the files in your PWD using ls. Note that the PWD of the terminal and of MobaXterm's file browser do not need to be the same. Change into the hello directory if necessary using a two-letter command you learned earlier in this tutorial.
If sbt package is successful, you'll have a file in the location target/scala-2.11/hello_2.11-1.0.jar.
Run with
spark-submit --master local[1] target/scala-2.11/hello_2.11-1.0.jar
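The jar path is not arbitrary: by default sbt builds it from the name, version and scalaVersion settings in the .sbt file, keeping only the first two parts of the Scala version. The sketch below reconstructs the path from a sample file. It assumes simple one-line settings of the form key := "value" and a project name that is already lowercase with no spaces; sbt_setting and expected_jar are hypothetical helpers, not sbt commands.

```shell
# Read the quoted value of a simple `key := "value"` line from an sbt file.
sbt_setting() {
    sed -n "s/^[[:space:]]*${1}[[:space:]]*:=[[:space:]]*\"\\([^\"]*\\)\".*/\\1/p" "$2" | head -n 1
}

# Build the jar path that `sbt package` is expected to produce.
expected_jar() {
    sbt_file=$1
    name=$(sbt_setting name "$sbt_file")
    version=$(sbt_setting version "$sbt_file")
    scala_full=$(sbt_setting scalaVersion "$sbt_file")
    # 2.11.8 -> 2.11: drop the last dot-separated component.
    scala_binary=${scala_full%.*}
    echo "target/scala-${scala_binary}/${name}_${scala_binary}-${version}.jar"
}

# Demonstrate on a sample file holding the three settings from project.sbt.
sample_sbt=$(mktemp)
printf '%s\n' 'name := "hello"' 'version := "1.0"' 'scalaVersion := "2.11.8"' > "$sample_sbt"
expected_jar "$sample_sbt"
```

If you change the name or version in your own project.sbt, remember to change the path you pass to spark-submit to match.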
Running short jobs, such as compiling our Scala code or running Hello World, is fine in interactive qrshx sessions.
However, when we want to run long jobs or request resources such as multiple CPUs, we should start using batch processing.
Let's get an example from GitHub that calculates Pi using a Monte Carlo algorithm.
git clone https://github.com/mikecroucher/scala-spark-MontePi
Compile it as usual
cd scala-spark-MontePi/
sbt package
Instead of running it interactively, we are going to submit it to the scheduler queue.
The example includes a job submission script called submit_to_sharc.sh.
Look at this file using the more command to see if you can understand it.
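To help with reading it, here is a sketch of the general shape such a script tends to take. It is not the contents of submit_to_sharc.sh: the file name example_job.sh, the resource values, the "-pe smp" parallel environment name and the JAR_PATH placeholder are all assumptions made for illustration. The snippet writes the example to a file and prints only the scheduler option lines; it does not submit anything.

```shell
# Write a hypothetical submission script to example_job.sh.
cat > example_job.sh <<'EOF'
#!/bin/bash
# Lines beginning with "#$" are read by the scheduler as options.
# Request 4 cores ("-pe smp" is an assumed parallel environment name).
#$ -pe smp 4
# Request 2 Gigabytes of memory per core.
#$ -l rmem=2G

# Load the same modules used in the interactive session.
module load apps/java/jdk1.8.0_102/binary
module load apps/spark/2.1.0/gcc-4.8.5

# JAR_PATH is a placeholder: set it to the jar that sbt package produced.
spark-submit --master "local[4]" "$JAR_PATH"
EOF

# Show the scheduler options that the script requests.
grep '^#\$' example_job.sh
```

The key idea is that the resource requests which were typed after qrshx in an interactive session are instead written into the script as "#$" lines, followed by the same commands you would have typed by hand.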
When you are ready, submit it to the queue with the qsub command
qsub submit_to_sharc.sh
Your job 83909 ("submit_to_sharc.sh") has been submitted
You can see the status of the queuing or running job with the qstat command.
qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
81304 0.42776 bash ab1abc r 02/14/2017 23:53:29 [email protected] 1
83909 0.00000 submit_to_ ab1abc qw 02/15/2017 03:22:23 [email protected]. 4
If you don't see your submit_to_sharc job in qstat's output, this is probably because it has already finished running!
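The state column is the one to watch: in the output above, r means the job is running and qw means it is queued and waiting. If you only care about one job, you can pick its state out of the table. The following sketch assumes the column layout shown above, with the state in the fifth field; job_state is a hypothetical helper, and the sample text is a shortened copy of the output above rather than live data.

```shell
# Print the state (5th field) of one job from qstat-style output on stdin.
job_state() {
    awk -v id="$1" '
        $1 == id { print $5; found = 1 }
        END      { if (!found) print "not listed" }
    '
}

# A shortened copy of the qstat output shown above.
sample='job-ID prior name user state submit/start at
81304 0.42776 bash ab1abc r 02/14/2017 23:53:29
83909 0.00000 submit_to_ ab1abc qw 02/15/2017 03:22:23'

printf '%s\n' "$sample" | job_state 83909

# On ShARC itself, the equivalent would be:  qstat | job_state 83909
```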
qsub and qstat are examples of scheduler commands.
A list of them can be found on the HPC documentation website.
When the job has completed, you will see two new files in your current directory.
In my case, they were submit_to_sharc.sh.e83909 and submit_to_sharc.sh.o83909.
The number at the end refers to the job-ID.
Look at these two files with the more command
- The file that ends with .e83909 contains the standard error stream (stderr)
- The file that ends with .o83909 contains the standard output stream (stdout)
Refer to the Wikipedia article on standard streams for more information on this terminology.
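Since the file names follow a fixed pattern (script name, then .o or .e, then the job-ID), you can work them out from the message qsub printed when the job was submitted. The sketch below assumes the message has the form shown earlier, with the job-ID as the third word; job_id_from_message is a hypothetical helper.

```shell
# Extract the job-ID (the third word) from qsub's confirmation message.
job_id_from_message() {
    printf '%s\n' "$1" | awk '{ print $3 }'
}

# The confirmation message from the example above.
message='Your job 83909 ("submit_to_sharc.sh") has been submitted'
script=submit_to_sharc.sh

job_id=$(job_id_from_message "$message")
echo "stdout file: ${script}.o${job_id}"
echo "stderr file: ${script}.e${job_id}"
```

Once you know the names, more submit_to_sharc.sh.o83909 shows what the program printed, and the matching .e file is the first place to look if something went wrong.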
Most of ShARC's nodes have 64GB of RAM each. There are a small number with 256GB, but these are heavily oversubscribed.
Everyone who is part of the MSc in Data Analytics has access to our premium queue which includes access to nodes with up to 768GB of memory.
If you need to request a lot of memory for your job - for example 250 Gigabytes per core - add the following lines to your job submission script:
# Tell the system to make use of the project containing the big memory nodes
#$ -P rse
# Ask for 250 Gigabytes per core
#$ -l rmem=250G
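Because rmem is requested per core, the total memory a job asks for grows with the number of cores. A rough check like the one below shows how quickly a 250 Gigabyte per-core request outgrows even a 768GB node. It is a sketch under two assumptions: that all of a job's cores must sit on one node, and that the whole of a node's nominal memory is available to jobs (in practice somewhat less is). Both helper functions are hypothetical.

```shell
# Total memory requested = memory per core x number of cores (whole Gigabytes).
total_memory_gb() {
    echo $(( $1 * $2 ))
}

# Succeeds if the total request fits on a node with the given memory.
# Arguments: Gigabytes per core, number of cores, node memory in Gigabytes.
fits_on_node() {
    [ "$(total_memory_gb "$1" "$2")" -le "$3" ]
}

# Try 250G per core with 1 to 4 cores against a 768GB node.
for cores in 1 2 3 4; do
    total=$(total_memory_gb 250 "$cores")
    if fits_on_node 250 "$cores" 768; then
        echo "250G x $cores cores = ${total}G: fits on a 768GB node"
    else
        echo "250G x $cores cores = ${total}G: too large for a 768GB node"
    fi
done
```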
If after starting MobaXterm you cannot see the file browser pane on the left-hand side of the window then:
- Close and restart MobaXterm;
- Click Session;
- Under Advanced SSH settings ensure that the Use SCP protocol box is ticked (see below);
- Enter Remote host and username as before;
- Click OK to connect.