If one has more than one GPU, and certainly if one has more than one node with GPUs, it is natural to ask how to think about programming with MPI.
The first design decision is how to map MPI processes to devices.
A natural choice may be to run one MPI process per device. For example, on a node with 4 GPUs, we would ask for 4 MPI processes. Each individual MPI rank would just set the current device appropriately.
int rank = -1; /* MPI rank */
MPI_Comm_rank(comm, &rank);
cudaSetDevice(rank % ndevicePerNode);
The number of devices per node may be obtained via cudaGetDeviceCount(), or it may require external input.
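Note that the modulo mapping above relies on ranks being placed consecutively on each node. With a round-robin placement over two nodes of 4 GPUs, for example, ranks 0, 2, 4 and 6 share a node, and rank % 4 would select only devices 0 and 2. A minimal sketch of a more robust choice uses the rank within the node instead. The helper name device_for_local_rank is hypothetical, and we assume the node-local rank has been obtained elsewhere, e.g., from a communicator created by MPI_Comm_split_type() with MPI_COMM_TYPE_SHARED (not shown here).

```c
#include <assert.h>

/* Hypothetical helper (not part of MPI or CUDA): choose a device index.
 *   localRank       rank of this process among the processes on its node
 *                   (assumed to come from a node-local communicator)
 *   ndevicePerNode  number of devices visible on the node               */
int device_for_local_rank(int localRank, int ndevicePerNode) {
  assert(localRank >= 0);
  assert(ndevicePerNode > 0);
  /* If there are more processes than devices on the node, wrap around
   * so that several processes share a device. */
  return localRank % ndevicePerNode;
}
```

The result would then be passed to cudaSetDevice() in place of rank % ndevicePerNode.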
In order to pass a message between two devices, one might consider:
/* On the sending side ... */
cudaMemcpy(hmsgs, dmsgs, ndata*sizeof(double), cudaMemcpyDeviceToHost);
MPI_Isend(hmsgs, ndata, MPI_DOUBLE, dst, ...);
/* On the receiving side ... */
MPI_Recv(hmsgr, ndata, MPI_DOUBLE, src, ...);
cudaMemcpy(dmsgr, hmsgr, ndata*sizeof(double), cudaMemcpyHostToDevice);
This may very well lead to poor performance, as every message is staged through host memory with an extra copy at each end.
With a GPU-aware MPI implementation, it is possible to use device references in MPI calls on the host. E.g., the previous example might be replaced by
MPI_Isend(dmsgs, ndata, MPI_DOUBLE, dst, ...);
MPI_Recv(dmsgr, ndata, MPI_DOUBLE, src, ...);
Here dmsgs and dmsgr are device memory references. If the two devices are within a node with fast connections, the MPI library should route the message directly between the devices rather than via the host.
A fall-back to copy via the host may be required for inter-node messages.
Some architectures have the network interface cards connected directly to the GPUs (rather than the host), so inter-node transfers there would also favour use of GPU-aware MPI.
The NVIDIA HPC SDK includes a build of OpenMPI with GPU-aware MPI enabled. A sample program has been provided which measures the time taken for messages of different sizes to be sent between two MPI tasks by the two methods outlined above.
Have a look at the program, and try to compile and run it.
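When comparing the two methods, it can help to convert the measured times into bandwidths for a range of message sizes. The following is a small sketch of how that might be done. It is not taken from the sample program: the function names message_sizes and bandwidth are hypothetical, and we assume messages of ndata doubles, as in the fragments above, with lengths doubling from 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill sizes[] with message lengths 1, 2, 4, ...
 * not exceeding nmax. Returns the number of entries written, which is
 * at most nsizes. */
int message_sizes(size_t nmax, size_t *sizes, int nsizes) {
  int n = 0;
  size_t len = 1;
  while (n < nsizes && len <= nmax) {
    sizes[n++] = len;
    if (len > nmax / 2) break; /* next doubling would exceed nmax */
    len *= 2;
  }
  return n;
}

/* Hypothetical helper: bandwidth in bytes per second for a message of
 * ndata doubles which took the given time in seconds. */
double bandwidth(size_t ndata, double seconds) {
  assert(seconds > 0.0);
  return (double) (ndata * sizeof(double)) / seconds;
}
```

For small messages the time is likely to be dominated by latency rather than bandwidth, so it is worth looking at both the raw times and the bandwidths.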