├── .gitignore
├── README.md
├── cpu
│   ├── Makefile
│   ├── ping_pong.c
│   └── submit.lsf
├── cuda_aware
│   ├── Makefile
│   ├── ping_pong_cuda_aware.cu
│   └── submit.lsf
├── cuda_staged
│   ├── Makefile
│   ├── ping_pong_cuda_staged.cu
│   └── submit.lsf
└── images
    ├── cuda_aware.png
    └── cuda_staged.png

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
pp*
*.o
cpu_ping_pong*
staged_ping_pong*
cuda_aware_ping_pong*

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# MPI Ping Pong to Demonstrate CUDA-Aware MPI

In this tutorial, we will look at a simple ping pong code that measures the bandwidth of data transfers between 2 MPI ranks. We will look at a CPU-only version, a CUDA version that stages data through CPU memory, and a CUDA-Aware version that passes data directly between GPUs (using GPUDirect).

**NOTE:** This code is not optimized to achieve the best bandwidth results. It is simply meant to demonstrate how to use CUDA-Aware MPI.

## CPU Version

We will begin by looking at a CPU-only version of the code in order to understand the idea behind an MPI ping pong program: 2 MPI ranks pass data back and forth, and the bandwidth is calculated from the measured transfer times and the known size of the data being transferred.

Let's look at the `cpu/ping_pong.c` code to see how this is implemented. At the top of the `main` program, we initialize MPI, determine the total number of MPI ranks, determine each rank's ID, and make sure we are running with exactly 2 ranks:

``` c
/* -------------------------------------------------------------------------------------------
    MPI Initialization
--------------------------------------------------------------------------------------------*/
MPI_Init(&argc, &argv);

int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Status stat;

if(size != 2){
    if(rank == 0){
        printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
    }
    MPI_Finalize();
    exit(0);
}
```

Next, we enter our main `for` loop, where each iteration performs the data transfers and bandwidth calculation for a different message size, ranging from 8 B to 1 GB (note that each element of the array is a double-precision variable of size 8 B, and `1 << i` can be read as "2 raised to the power i"):

``` c
/* -------------------------------------------------------------------------------------------
    Loop from 8 B to 1 GB
--------------------------------------------------------------------------------------------*/

for(int i=0; i<=27; i++){

    long int N = 1 << i;

    // Allocate memory for A on CPU
    double *A = (double*)malloc(N*sizeof(double));

    ...
```
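Since `i` runs from 0 to 27, the transfer size `N*sizeof(double)` ranges from 1 × 8 B = 8 B up to 2^27 × 8 B = 2^30 B, i.e. 1 GiB (rounded to "1 GB" above). A quick standalone check of the two endpoints (this snippet is only an illustration and is not part of the repository):

``` c
#include <stdio.h>

int main(void)
{
    long int N_min = 1L << 0;    // i = 0  -> 1 double
    long int N_max = 1L << 27;   // i = 27 -> 2^27 doubles

    // Message sizes at the two ends of the loop
    printf("i = 0 : %ld B\n", N_min * (long int)sizeof(double));   // 8 B
    printf("i = 27: %ld B\n", N_max * (long int)sizeof(double));   // 1073741824 B = 1 GiB

    return 0;
}
```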
We then initialize the array `A`, set some tags to match MPI Send/Receive pairs, set `loop_count` (used later), and run a warm-up loop 5 times to remove any MPI setup costs:

``` c
// Initialize all elements of A to random values
for(int i=0; i<N; i++){
    A[i] = (double)rand()/(double)RAND_MAX;
}

...
```

--------------------------------------------------------------------------------
/cpu/ping_pong.c:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* -------------------------------------------------------------------------------------------
        MPI Initialization
    --------------------------------------------------------------------------------------------*/
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat;

    if(size != 2){
        if(rank == 0){
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }

    /* -------------------------------------------------------------------------------------------
        Loop from 8 B to 1 GB
    --------------------------------------------------------------------------------------------*/

    for(int i=0; i<=27; i++){

        long int N = 1 << i;

        // Allocate memory for A on CPU
        double *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to random values
        for(int i=0; i<N; i++){
            A[i] = (double)rand()/(double)RAND_MAX;
        }
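As the README describes, after initializing `A` the CPU code sets a pair of message tags, a `loop_count`, runs the 5 warm-up exchanges, and then times `loop_count` ping-pong iterations to compute bandwidth from the elapsed time and the known message size. The following is a minimal, self-contained sketch of that pattern; the tag values, the `loop_count` of 50, and the output format are illustrative assumptions rather than the repository's exact code.

``` c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status stat;

    if(size != 2){
        if(rank == 0) printf("This sketch requires exactly 2 MPI ranks.\n");
        MPI_Finalize();
        return 0;
    }

    for(int i=0; i<=27; i++){

        long int N = 1 << i;
        double *A = (double*)malloc(N*sizeof(double));
        for(long int j=0; j<N; j++) A[j] = (double)rand()/(double)RAND_MAX;

        int tag1 = 10, tag2 = 20;   // assumed tag values
        int loop_count = 50;        // assumed number of timed iterations

        // Warm-up loop: absorb one-time MPI setup costs before timing
        for(int iter=1; iter<=5; iter++){
            if(rank == 0){
                MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
            }else{
                MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
                MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

        // Timed ping-pong loop
        double start = MPI_Wtime();
        for(int iter=1; iter<=loop_count; iter++){
            if(rank == 0){
                MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
            }else{
                MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
                MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        // Each timed iteration is 2 one-way transfers of 8*N bytes
        long int num_bytes = 8*N;
        double time_per_transfer = elapsed / (2.0*loop_count);
        double bandwidth_GB_per_s = (num_bytes / (1024.0*1024.0*1024.0)) / time_per_transfer;

        if(rank == 0)
            printf("Transfer size (B): %10ld, Time per transfer (s): %12.9f, Bandwidth (GB/s): %10.6f\n",
                   num_bytes, time_per_transfer, bandwidth_GB_per_s);

        free(A);
    }

    MPI_Finalize();
    return 0;
}
```

Built with an MPI compiler wrapper (e.g. `mpicc`) and run on 2 ranks, this prints one bandwidth line per message size.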
--------------------------------------------------------------------------------
/cuda_aware/ping_pong_cuda_aware.cu:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Macro for checking errors in CUDA API calls
#define cudaErrorCheck(call)                                                                  \
do{                                                                                           \
    cudaError_t cuErr = call;                                                                 \
    if(cudaSuccess != cuErr){                                                                 \
        printf("CUDA Error - %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(cuErr));  \
        exit(0);                                                                              \
    }                                                                                         \
}while(0)


int main(int argc, char *argv[])
{
    /* -------------------------------------------------------------------------------------------
        MPI Initialization
    --------------------------------------------------------------------------------------------*/
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat;

    if(size != 2){
        if(rank == 0){
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }

    // Map MPI ranks to GPUs
    int num_devices = 0;
    cudaErrorCheck( cudaGetDeviceCount(&num_devices) );
    cudaErrorCheck( cudaSetDevice(rank % num_devices) );

    /* -------------------------------------------------------------------------------------------
        Loop from 8 B to 1 GB
    --------------------------------------------------------------------------------------------*/

    for(int i=0; i<=27; i++){

        long int N = 1 << i;

        // Allocate memory for A on CPU
        double *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to random values
        for(int i=0; i<N; i++){
            A[i] = (double)rand()/(double)RAND_MAX;
        }

--------------------------------------------------------------------------------
/cuda_staged/ping_pong_cuda_staged.cu:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Macro for checking errors in CUDA API calls
#define cudaErrorCheck(call)                                                                  \
do{                                                                                           \
    cudaError_t cuErr = call;                                                                 \
    if(cudaSuccess != cuErr){                                                                 \
        printf("CUDA Error - %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(cuErr));  \
        exit(0);                                                                              \
    }                                                                                         \
}while(0)


int main(int argc, char *argv[])
{
    /* -------------------------------------------------------------------------------------------
        MPI Initialization
    --------------------------------------------------------------------------------------------*/
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat;

    if(size != 2){
        if(rank == 0){
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }

    // Map MPI ranks to GPUs
    int num_devices = 0;
    cudaErrorCheck( cudaGetDeviceCount(&num_devices) );
    cudaErrorCheck( cudaSetDevice(rank % num_devices) );

    /* -------------------------------------------------------------------------------------------
        Loop from 8 B to 1 GB
    --------------------------------------------------------------------------------------------*/

    for(int i=0; i<=27; i++){

        long int N = 1 << i;

        // Allocate memory for A on CPU
        double *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to random values
        for(int i=0; i<N; i++){
            A[i] = (double)rand()/(double)RAND_MAX;
        }
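Both CUDA files share the setup shown above; the difference between them, as described in the README's introduction, lies in how the buffer is handed to MPI. The staged version copies the GPU buffer into CPU memory before each send and back to the GPU after each receive, while the CUDA-Aware version passes the device pointer directly to `MPI_Send`/`MPI_Recv` and lets a CUDA-aware MPI library (with GPUDirect) move the data. The sketch below contrasts the two exchange patterns for a single message; the buffer names, tags, and message size are illustrative assumptions, and the second exchange only works when the MPI installation is CUDA-aware.

``` c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Same style of error-checking macro as the files above
#define cudaErrorCheck(call)                                                                  \
do{                                                                                           \
    cudaError_t cuErr = call;                                                                 \
    if(cudaSuccess != cuErr){                                                                 \
        printf("CUDA Error - %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(cuErr));  \
        exit(0);                                                                              \
    }                                                                                         \
}while(0)

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status stat;

    if(size != 2){
        if(rank == 0) printf("This sketch requires exactly 2 MPI ranks.\n");
        MPI_Finalize();
        return 0;
    }

    // Map MPI ranks to GPUs, as in the files above
    int num_devices = 0;
    cudaErrorCheck( cudaGetDeviceCount(&num_devices) );
    cudaErrorCheck( cudaSetDevice(rank % num_devices) );

    long int N = 1 << 20;            // one fixed message size (assumed) for the illustration
    int tag1 = 10, tag2 = 20;        // assumed tag values

    double *A = (double*)malloc(N*sizeof(double));
    for(long int j=0; j<N; j++) A[j] = (double)rand()/(double)RAND_MAX;

    double *d_A;
    cudaErrorCheck( cudaMalloc((void**)&d_A, N*sizeof(double)) );
    cudaErrorCheck( cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice) );

    // --- Staged exchange: the GPU buffer travels through the host buffer A ---
    if(rank == 0){
        cudaErrorCheck( cudaMemcpy(A, d_A, N*sizeof(double), cudaMemcpyDeviceToHost) );
        MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
        MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
        cudaErrorCheck( cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice) );
    }else{
        MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
        cudaErrorCheck( cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice) );
        cudaErrorCheck( cudaMemcpy(A, d_A, N*sizeof(double), cudaMemcpyDeviceToHost) );
        MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
    }

    // --- CUDA-Aware exchange: the device pointer goes straight to MPI ---
    if(rank == 0){
        MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
        MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
    }else{
        MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
        MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
    }

    cudaErrorCheck( cudaFree(d_A) );
    free(A);

    MPI_Finalize();
    return 0;
}
```

The staged path pays for two extra host copies per round trip, which is exactly the overhead the CUDA-Aware version avoids.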