├── .gitignore
├── README.md
├── cpu
│   ├── Makefile
│   ├── ping_pong.c
│   └── submit.lsf
├── cuda_aware
│   ├── Makefile
│   ├── ping_pong_cuda_aware.cu
│   └── submit.lsf
├── cuda_staged
│   ├── Makefile
│   ├── ping_pong_cuda_staged.cu
│   └── submit.lsf
└── images
    ├── cuda_aware.png
    └── cuda_staged.png

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
pp*
*.o
cpu_ping_pong*
staged_ping_pong*
cuda_aware_ping_pong*

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# MPI Ping Pong to Demonstrate CUDA-Aware MPI

In this tutorial, we will look at a simple ping pong code that measures the bandwidth of data transfers between 2 MPI ranks. We will look at a CPU-only version, a CUDA version that stages data through CPU memory, and a CUDA-Aware version that passes data directly between GPUs (using GPUDirect).

**NOTE:** This code is not optimized to achieve the best bandwidth results. It is simply meant to demonstrate how to use CUDA-Aware MPI.

## CPU Version

We will begin by looking at a CPU-only version of the code in order to understand the idea behind an MPI ping pong program: 2 MPI ranks pass data back and forth, and the bandwidth is calculated from the measured transfer times and the known size of the data being transferred.

Let's look at the `cpu/ping_pong.c` code to see how this is implemented. At the top of the `main` program, we initialize MPI, determine the total number of MPI ranks, determine each rank's ID, and make sure we are running with exactly 2 ranks:

``` c
/* -------------------------------------------------------------------------------------------
    MPI Initialization
--------------------------------------------------------------------------------------------*/
MPI_Init(&argc, &argv);

int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Status stat;

if(size != 2){
    if(rank == 0){
        printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
    }
    MPI_Finalize();
    exit(0);
}
```

Next, we enter our main `for` loop, where each iteration performs the data transfers and bandwidth calculation for a different message size, ranging from 8 B to 1 GB (note that each element of the array is a double-precision variable of size 8 B, and `1 << i` can be read as "2 raised to the power i"):

``` c
/* -------------------------------------------------------------------------------------------
    Loop from 8 B to 1 GB
--------------------------------------------------------------------------------------------*/

for(int i=0; i<=27; i++){

    long int N = 1 << i;

    // Allocate memory for A on CPU
    double *A = (double*)malloc(N*sizeof(double));

    ...
```
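Since `i` runs from 0 to 27, the transfer size `N*sizeof(double)` ranges from 1 × 8 B = 8 B up to 2^27 × 8 B = 2^30 B, i.e. 1 GiB (rounded to "1 GB" above). A quick standalone check of the two endpoints (this snippet is only an illustration and is not part of the repository):

``` c
#include <stdio.h>

int main(void)
{
    long int N_min = 1L << 0;    // i = 0  -> 1 double
    long int N_max = 1L << 27;   // i = 27 -> 2^27 doubles

    // Message sizes at the two ends of the loop
    printf("i = 0 : %ld B\n", N_min * (long int)sizeof(double));   // 8 B
    printf("i = 27: %ld B\n", N_max * (long int)sizeof(double));   // 1073741824 B = 1 GiB

    return 0;
}
```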
We then initialize the array `A`, set some tags to match MPI Send/Receive pairs, set `loop_count` (used later), and run a warm-up loop 5 times to remove any MPI setup costs:

``` c
// Initialize all elements of A to random values
for(int i=0; i<N; i++){
    A[i] = (double)rand()/(double)RAND_MAX;
}

...
```

--------------------------------------------------------------------------------
/cpu/ping_pong.c:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* -------------------------------------------------------------------------------------------
        MPI Initialization
    --------------------------------------------------------------------------------------------*/
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat;

    if(size != 2){
        if(rank == 0){
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }

    /* -------------------------------------------------------------------------------------------
        Loop from 8 B to 1 GB
    --------------------------------------------------------------------------------------------*/

    for(int i=0; i<=27; i++){

        long int N = 1 << i;

        // Allocate memory for A on CPU
        double *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to random values
        for(int i=0; i<N; i++){
            A[i] = (double)rand()/(double)RAND_MAX;
        }
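As the README describes, after initializing `A` the CPU code sets a pair of message tags, a `loop_count`, runs the 5 warm-up exchanges, and then times `loop_count` ping-pong iterations to compute bandwidth from the elapsed time and the known message size. The following is a minimal, self-contained sketch of that pattern; the tag values, the `loop_count` of 50, and the output format are illustrative assumptions rather than the repository's exact code.

``` c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status stat;

    if(size != 2){
        if(rank == 0) printf("This sketch requires exactly 2 MPI ranks.\n");
        MPI_Finalize();
        return 0;
    }

    for(int i=0; i<=27; i++){

        long int N = 1 << i;
        double *A = (double*)malloc(N*sizeof(double));
        for(long int j=0; j<N; j++) A[j] = (double)rand()/(double)RAND_MAX;

        int tag1 = 10, tag2 = 20;   // assumed tag values
        int loop_count = 50;        // assumed number of timed iterations

        // Warm-up loop: absorb one-time MPI setup costs before timing
        for(int iter=1; iter<=5; iter++){
            if(rank == 0){
                MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
            }else{
                MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
                MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

        // Timed ping-pong loop
        double start = MPI_Wtime();
        for(int iter=1; iter<=loop_count; iter++){
            if(rank == 0){
                MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
            }else{
                MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
                MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        // Each timed iteration is 2 one-way transfers of 8*N bytes
        long int num_bytes = 8*N;
        double time_per_transfer = elapsed / (2.0*loop_count);
        double bandwidth_GB_per_s = (num_bytes / (1024.0*1024.0*1024.0)) / time_per_transfer;

        if(rank == 0)
            printf("Transfer size (B): %10ld, Time per transfer (s): %12.9f, Bandwidth (GB/s): %10.6f\n",
                   num_bytes, time_per_transfer, bandwidth_GB_per_s);

        free(A);
    }

    MPI_Finalize();
    return 0;
}
```

Built with an MPI compiler wrapper (e.g. `mpicc`) and run on 2 ranks, this prints one bandwidth line per message size.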
--------------------------------------------------------------------------------
/cuda_aware/ping_pong_cuda_aware.cu:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Macro for checking errors in CUDA API calls
#define cudaErrorCheck(call)                                                                  \
do{                                                                                           \
    cudaError_t cuErr = call;                                                                 \
    if(cudaSuccess != cuErr){                                                                 \
        printf("CUDA Error - %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(cuErr));  \
        exit(0);                                                                              \
    }                                                                                         \
}while(0)


int main(int argc, char *argv[])
{
    /* -------------------------------------------------------------------------------------------
        MPI Initialization
    --------------------------------------------------------------------------------------------*/
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat;

    if(size != 2){
        if(rank == 0){
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }

    // Map MPI ranks to GPUs
    int num_devices = 0;
    cudaErrorCheck( cudaGetDeviceCount(&num_devices) );
    cudaErrorCheck( cudaSetDevice(rank % num_devices) );

    /* -------------------------------------------------------------------------------------------
        Loop from 8 B to 1 GB
    --------------------------------------------------------------------------------------------*/

    for(int i=0; i<=27; i++){

        long int N = 1 << i;

        // Allocate memory for A on CPU
        double *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to random values
        for(int i=0; i<N; i++){
            A[i] = (double)rand()/(double)RAND_MAX;
        }

--------------------------------------------------------------------------------
/cuda_staged/ping_pong_cuda_staged.cu:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Macro for checking errors in CUDA API calls
#define cudaErrorCheck(call)                                                                  \
do{                                                                                           \
    cudaError_t cuErr = call;                                                                 \
    if(cudaSuccess != cuErr){                                                                 \
        printf("CUDA Error - %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(cuErr));  \
        exit(0);                                                                              \
    }                                                                                         \
}while(0)


int main(int argc, char *argv[])
{
    /* -------------------------------------------------------------------------------------------
        MPI Initialization
    --------------------------------------------------------------------------------------------*/
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat;

    if(size != 2){
        if(rank == 0){
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }

    // Map MPI ranks to GPUs
    int num_devices = 0;
    cudaErrorCheck( cudaGetDeviceCount(&num_devices) );
    cudaErrorCheck( cudaSetDevice(rank % num_devices) );

    /* -------------------------------------------------------------------------------------------
        Loop from 8 B to 1 GB
    --------------------------------------------------------------------------------------------*/

    for(int i=0; i<=27; i++){

        long int N = 1 << i;

        // Allocate memory for A on CPU
        double *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to random values
        for(int i=0; i<N; i++){
            A[i] = (double)rand()/(double)RAND_MAX;
        }
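Both CUDA files share the setup shown above; the difference between them, as described in the README's introduction, lies in how the buffer is handed to MPI. The staged version copies the GPU buffer into CPU memory before each send and back to the GPU after each receive, while the CUDA-Aware version passes the device pointer directly to `MPI_Send`/`MPI_Recv` and lets a CUDA-aware MPI library (with GPUDirect) move the data. The sketch below contrasts the two exchange patterns for a single message; the buffer names, tags, and message size are illustrative assumptions, and the second exchange only works when the MPI installation is CUDA-aware.

``` c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Same style of error-checking macro as the files above
#define cudaErrorCheck(call)                                                                  \
do{                                                                                           \
    cudaError_t cuErr = call;                                                                 \
    if(cudaSuccess != cuErr){                                                                 \
        printf("CUDA Error - %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(cuErr));  \
        exit(0);                                                                              \
    }                                                                                         \
}while(0)

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status stat;

    if(size != 2){
        if(rank == 0) printf("This sketch requires exactly 2 MPI ranks.\n");
        MPI_Finalize();
        return 0;
    }

    // Map MPI ranks to GPUs, as in the files above
    int num_devices = 0;
    cudaErrorCheck( cudaGetDeviceCount(&num_devices) );
    cudaErrorCheck( cudaSetDevice(rank % num_devices) );

    long int N = 1 << 20;            // one fixed message size (assumed) for the illustration
    int tag1 = 10, tag2 = 20;        // assumed tag values

    double *A = (double*)malloc(N*sizeof(double));
    for(long int j=0; j<N; j++) A[j] = (double)rand()/(double)RAND_MAX;

    double *d_A;
    cudaErrorCheck( cudaMalloc((void**)&d_A, N*sizeof(double)) );
    cudaErrorCheck( cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice) );

    // --- Staged exchange: the GPU buffer travels through the host buffer A ---
    if(rank == 0){
        cudaErrorCheck( cudaMemcpy(A, d_A, N*sizeof(double), cudaMemcpyDeviceToHost) );
        MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
        MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
        cudaErrorCheck( cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice) );
    }else{
        MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
        cudaErrorCheck( cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice) );
        cudaErrorCheck( cudaMemcpy(A, d_A, N*sizeof(double), cudaMemcpyDeviceToHost) );
        MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
    }

    // --- CUDA-Aware exchange: the device pointer goes straight to MPI ---
    if(rank == 0){
        MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
        MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
    }else{
        MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
        MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
    }

    cudaErrorCheck( cudaFree(d_A) );
    free(A);

    MPI_Finalize();
    return 0;
}
```

The staged path pays for two extra host copies per round trip, which is exactly the overhead the CUDA-Aware version avoids.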