diff --git a/docs/tursa-user-guide/hardware.md b/docs/tursa-user-guide/hardware.md
new file mode 100644
index 0000000..150e02a
--- /dev/null
+++ b/docs/tursa-user-guide/hardware.md
@@ -0,0 +1,36 @@
+# Tursa hardware
+
+!!! note
+    Some of the material in this section is closely based on [information provided by NASA](https://www.nas.nasa.gov/hecc/support/kb/amd-rome-processors_658.html) as part of the documentation for the [Aitken HPC system](https://www.nas.nasa.gov/hecc/resources/aitken.html).
+
+## System overview
+
+Tursa is an Eviden supercomputing system which has a total of 178 GPU compute nodes. Each GPU compute node has a CPU with 48 cores and 4 NVIDIA A100 GPUs. Compute nodes are connected together by an InfiniBand interconnect.
+
+There are additional login nodes, which provide access to the system.
+
+Compute nodes are only accessible via the Slurm job scheduling system.
+
+There is a single file system which is available on login and compute nodes (see [Data management and transfer](data.md)).
+
+The Lustre file system has a capacity of 5.1 PiB.
+
+The interconnect uses a Fat Tree topology.
+
+## Interconnect details
+
+Tursa has a high performance interconnect with 4x 200 Gb/s InfiniBand interfaces per node. It uses a 2-layer fat tree topology:
+
+- Each node connects to 4 of the 5 L1 (leaf) switches within the same cabinet via 200 Gb/s links
+- Within an 8-node block, all nodes share the same 4 switches
+- Each L1 switch connects to all 20 L2 switches via 200 Gb/s links, leading to a maximum of 2 switch-to-switch hops between any 2 nodes
+- There are no direct L1 to L1 or L2 to L2 switch connections
+- 16-node, 32-node and 64-node blocks are constructed from 8-node blocks that show the required performance on the inter-block links
+
+
+
+
+
+
+
+
+
diff --git a/docs/tursa-user-guide/scheduler.md b/docs/tursa-user-guide/scheduler.md
index 004c7ab..e727dce 100644
--- a/docs/tursa-user-guide/scheduler.md
+++ b/docs/tursa-user-guide/scheduler.md
@@ -491,9 +491,9 @@ across the compute nodes. You will usually add the following options to
 
 ## Example job submission scripts
 
-### Example: job submission script for Grid parallel job using CUDA
+### Example: job submission script for a parallel job using CUDA
 
-A job submission script for a Grid job that uses 4 compute nodes, 16 MPI
+A job submission script for a parallel job that uses 4 compute nodes, 4 MPI
 processes per node and 4 GPUs per node. It does not restrict what type of
 GPU the job can run on so both A100-40 and A100-80 can be used:
 
@@ -540,48 +540,17 @@ export OMPI_MCA_btl_openib_if_exclude=mlx5_1,mlx5_2,mlx5_3
 
 application="my_mpi_openmp_app.x"
 options="arg 1 arg2"
 
-mpirun -np $SLURM_NTASKS --map-by numa -x LD_LIBRARY_PATH --bind-to none ./wrapper.sh ${application} ${options}
+mpirun -np $SLURM_NTASKS --map-by numa -x LD_LIBRARY_PATH --bind-to none ${application} ${options}
 ```
 
-This will run your executable "grid" in parallel usimg 16
+This will run your executable "my_mpi_openmp_app.x" in parallel using 16
 MPI processes on 4 nodes, 8 OpenMP thread will be used per MPI process
 and 4 GPUs will be used per node (32 cores per node, 4 GPUs per
 node). Slurm will allocate 4 nodes to your job and srun will place 4 MPI
 processes on each node.
 
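+If you want to confirm this placement before launching a long run, a short
+check script can be run in place of the application. The sketch below is
+illustrative only: the `check-placement.sh` name is made up for this example,
+and it assumes OpenMPI (which sets the `OMPI_COMM_WORLD_RANK` and
+`OMPI_COMM_WORLD_LOCAL_RANK` variables for each rank).
+
+```
+#!/bin/bash
+# check-placement.sh: report, for each MPI rank, the host it runs on,
+# its rank numbers and the GPUs it can see (an unset CUDA_VISIBLE_DEVICES
+# means all 4 GPUs on the node are visible to the rank).
+echo "$(hostname) rank=${OMPI_COMM_WORLD_RANK:-?} local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-?} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
+```
+
+Launching it with the same `mpirun` options as the real application (for
+example, `mpirun -np $SLURM_NTASKS --map-by numa ./check-placement.sh`)
+should print one line per rank, with four ranks on each of the 4 nodes.
+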
-When running on Tursa it is important that we specify how
-each of the GPU's interacts with the network interfaces to
-reach optimal network communication performance. To achieve
-this, we introduce a wrapper script (specified as `wrapper.sh`
-in the example job script above) that sets a number of
-environment parameters for each rank in a node (each GPU
-in a node) explicitly tell each rank which network interface
-it should use to communicate internode.
-
-`wrapper.sh` script example:
-
-```
-#!/bin/bash
-
-
-lrank=$OMPI_COMM_WORLD_LOCAL_RANK
-numa1=$(( 2 * $lrank))
-numa2=$(( 2 * $lrank + 1 ))
-netdev=mlx5_${lrank}:1
-
-export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
-export UCX_NET_DEVICES=mlx5_${lrank}:1
-BINDING="--interleave=$numa1,$numa2"
-
-echo "`hostname` - $lrank device=$CUDA_VISIBLE_DEVICES binding=$BINDING"
-
-numactl ${BINDING} $*
-```
-
 See above for a more detailed discussion of the different `sbatch`
 options.
-options
 
 ## Using the `dev` QoS
 
 The `dev` QoS is designed for faster turnaround of short jobs than is usually available through