Removes incorrect wrapper script

aturner-epcc committed Feb 29, 2024
1 parent db0a54b commit 889883d
Showing 2 changed files with 40 additions and 35 deletions.
36 changes: 36 additions & 0 deletions docs/tursa-user-guide/hardware.md
@@ -0,0 +1,36 @@
# Tursa hardware

!!! note
    Some of the material in this section is closely based on [information provided by NASA](https://www.nas.nasa.gov/hecc/support/kb/amd-rome-processors_658.html) as part of the documentation for the [Aitken HPC system](https://www.nas.nasa.gov/hecc/resources/aitken.html).

## System overview

Tursa is an Eviden supercomputing system which has a total of 178 GPU compute nodes. Each GPU compute node has a CPU with 48 cores and 4 NVIDIA A100 GPUs. Compute nodes are connected by an InfiniBand interconnect.

There are additional login nodes, which provide access to the system.

Compute nodes are only accessible via the Slurm job scheduling system.
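
A quick way to see what Slurm knows about the compute nodes from a login node is shown below (a minimal sketch using standard Slurm commands; the exact partition and node names on Tursa will differ):

```
# Summary of partitions and node states (idle, allocated, down, ...)
sinfo -s

# Detailed information for a single node; replace <nodename> with a
# name taken from the sinfo output
scontrol show node <nodename>
```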

There is a single file system which is available on login and compute nodes (see [Data management and transfer](data.md)).

The Lustre file system has a capacity of 5.1 PiB.
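
If you want to check capacity and usage on the Lustre file system yourself, the standard Lustre client tools report this directly (a sketch, assuming the `lfs` client utility is available on the login nodes):

```
# Report capacity and usage for each Lustre target and the file system total
lfs df -h
```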

The interconnect uses a Fat Tree topology.

## Interconnect details

Tursa has a high-performance interconnect with 4x 200 Gb/s InfiniBand interfaces per node. It uses a 2-layer fat tree topology:

- Each node connects to 4 of the 5 L1 (leaf) switches within the same cabinet with 200 Gb/s links
- Within an 8-node block, all nodes share the same 4 switches
- Each L1 switch connects to all 20 L2 switches via 200 Gb/s links, giving a maximum of 2 switch-to-switch hops between any 2 nodes
- There are no direct L1 to L1 or L2 to L2 switch connections
- 16-node, 32-node and 64-node blocks are constructed from 8-node blocks that show the required performance on the inter-block links
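
One way to confirm the per-node network configuration described above is to inspect the InfiniBand adapters from a compute node (a sketch, assuming the standard InfiniBand diagnostic tools are available in your environment):

```
# List the host channel adapters and their link rates; on a Tursa GPU node
# this should report four mlx5 devices, each with a 200 Gb/s rate
ibstat | grep -E "^CA '|Rate:"
```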








39 changes: 4 additions & 35 deletions docs/tursa-user-guide/scheduler.md
@@ -491,9 +491,9 @@ across the compute nodes. You will usually add the following options to

## Example job submission scripts

### Example: job submission script for Grid parallel job using CUDA
### Example: job submission script for a parallel job using CUDA

A job submission script for a Grid job that uses 4 compute nodes, 16 MPI
A job submission script for a parallel job that uses 4 compute nodes, 16 MPI
processes (4 per node) and 4 GPUs per node. It does not restrict what type of
GPU the job can run on so both A100-40 and A100-80 can be used:

@@ -540,48 +540,17 @@ export OMPI_MCA_btl_openib_if_exclude=mlx5_1,mlx5_2,mlx5_3
application="my_mpi_openmp_app.x"
options="arg 1 arg2"
mpirun -np $SLURM_NTASKS --map-by numa -x LD_LIBRARY_PATH --bind-to none ./wrapper.sh ${application} ${options}
mpirun -np $SLURM_NTASKS --map-by numa -x LD_LIBRARY_PATH --bind-to none ${application} ${options}
```

This will run your executable "grid" in parallel usimg 16
This will run your executable "my_mpi_opnemp_app.x" in parallel usimg 16
MPI processes on 4 nodes, 8 OpenMP threads will be used per
MPI process and 4 GPUs will be used per node (32 cores per
node, 4 GPUs per node). Slurm will allocate 4 nodes to your
job and mpirun will place 4 MPI processes on each node.
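
If you want to check how ranks, GPUs and network interfaces end up distributed, you can launch a small diagnostic script with the same `mpirun` options in place of the application (a sketch; the script name and contents below are illustrative and not part of the Tursa documentation):

```
#!/bin/bash
# check_placement.sh (hypothetical): print the host, local MPI rank and
# CUDA_VISIBLE_DEVICES for each rank, plus the GPUs present on the node
echo "$(hostname) local rank ${OMPI_COMM_WORLD_LOCAL_RANK:-unknown}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
nvidia-smi -L
```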

When running on Tursa it is important to specify how
each of the GPUs interacts with the network interfaces to
reach optimal network communication performance. To achieve
this, we introduce a wrapper script (specified as `wrapper.sh`
in the example job script above) that sets a number of
environment variables for each rank on a node (each GPU
in a node), explicitly telling each rank which network interface
it should use for internode communication.

`wrapper.sh` script example:

```
#!/bin/bash
# Map each MPI rank on the node to its own GPU, InfiniBand interface and
# pair of NUMA domains, then launch the application under numactl
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
numa1=$(( 2 * $lrank ))
numa2=$(( 2 * $lrank + 1 ))
netdev=mlx5_${lrank}:1
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
export UCX_NET_DEVICES=mlx5_${lrank}:1
BINDING="--interleave=$numa1,$numa2"
echo "`hostname` - $lrank device=$CUDA_VISIBLE_DEVICES binding=$BINDING"
numactl ${BINDING} $*
```

See above for a more detailed discussion of the different `sbatch` options.

## Using the `dev` QoS

The `dev` QoS is designed for faster turnaround of short jobs than is usually available through
