# mpiBench
Times MPI collectives over a series of message sizes

# What is mpiBench?

mpiBench.c

This program measures MPI collective performance for a range of
message sizes. The user may specify:
- the collective to perform,
- the message size limits,
- the number of iterations to perform,
- the maximum memory a process may allocate for MPI buffers,
- the maximum time permitted for a given test,
- and the number of Cartesian dimensions to divide processes into.

By default, mpiBench runs 0-256K byte messages for all supported
collectives on MPI_COMM_WORLD with a 1G buffer limit. Each test
executes as many iterations as it can fit within a default time
limit of 50000 usecs.

crunch_mpiBench

This is a Perl script which can be used to filter data and generate
reports from mpiBench output files. It can merge data from
multiple mpiBench output files into a single report. It can also
filter output to a subset of collectives. By default, it reports
the operation duration time (i.e., how long the collective took to
complete). For some collectives, it can also report the effective
bandwidth. If provided two datasets, it computes a speedup factor.

# What is measured

mpiBench measures the total time required to iterate through a loop
of back-to-back invocations of the same collective (optionally
separated by a barrier), and divides by the number of iterations.
In other words, the timing kernel looks like the following:

    time_start = timer();
    for (i = 0; i < iterations; i++) {
      collective(msg_size);
      barrier();
    }
    time_end = timer();
    time = (time_end - time_start) / iterations;

Each participating MPI process performs this measurement and all
report their times. It is the average, minimum, and maximum across
this set of times which is reported.

Before the timing kernel is started, the collective is invoked once
to prime it, since the initial call may be subject to overhead that
later calls are not. Then, the collective is timed across a small
set of iterations (~5) to get a rough estimate of the time required
for a single invocation. If the user specifies a time limit using
the -t option, this estimate is used to reduce the number of
iterations made in the timing kernel loop, as necessary, so that it
executes within the time limit.

# Basic Usage

Build:

    make

Run:

    srun -n <procs> ./mpiBench > output.txt

Analyze:

    crunch_mpiBench output.txt

# Build Instructions

There are several make targets available:
- make       -- simple build
- make nobar -- build without barriers between consecutive collective invocations
- make debug -- build with "-g -O0" for debugging purposes
- make clean -- clean the build

If you'd like to build manually without the makefiles, there are some
compile-time options that you should be aware of:

    -D NO_BARRIER       - drop the barrier between consecutive
                          collective invocations
    -D USE_GETTIMEOFDAY - use gettimeofday() instead of MPI_Wtime()
                          for timing info

# Usage Syntax

    Usage: mpiBench [options] [operations]

    Options:
      -b <size>   Beginning message size in bytes (default 0)
      -e <size>   Ending message size in bytes (default 1K)
      -i <count>  Maximum number of
      iterations for a single test
                  (default 1000)
      -m <size>   Process memory buffer limit (send+recv) in bytes
                  (default 1G)
      -t <usec>   Time limit for any single test in microseconds
                  (default 0 = infinity)
      -d <ndims>  Number of dimensions to split processes in
                  (default 0 = MPI_COMM_WORLD only)
      -c          Check receive buffer for expected data in last
                  iteration (default disabled)
      -C          Check receive buffer for expected data every
                  iteration (default disabled)
      -h          Print this help screen and exit
      where <size> = [0-9]+[KMG], e.g., 32K or 64M

    Operations:
      Barrier
      Bcast
      Alltoall, Alltoallv
      Allgather, Allgatherv
      Gather, Gatherv
      Scatter
      Allreduce
      Reduce

# Examples

## mpiBench

Run the default set of tests:

    srun -n2 -ppdebug mpiBench

Run the default message size range and iteration count for Alltoall,
Allreduce, and Barrier:

    srun -n2 -ppdebug mpiBench Alltoall Allreduce Barrier

Run from 32-256 bytes and time across 100 iterations of Alltoall:

    srun -n2 -ppdebug mpiBench -b 32 -e 256 -i 100 Alltoall

Run from 0-2K bytes with the default iteration count for Gather, but
reduce the iteration count, as necessary, so each message size test
finishes within 100,000 usecs:

    srun -n2 -ppdebug mpiBench -e 2K -t 100000 Gather

## crunch_mpiBench

Show data for just Alltoall:

    crunch_mpiBench -op Alltoall out.txt

Merge data from several files into a single report:

    crunch_mpiBench out1.txt out2.txt out3.txt

Display effective bandwidth for Allgather and Alltoall:

    crunch_mpiBench -bw -op Allgather,Alltoall out.txt

Compare times in output files in dir1 with those in dir2:

    crunch_mpiBench -data DIR1_DATA dir1/* -data DIR2_DATA dir2/*

# Additional Notes

Rank 0
always acts as the root process for collectives which involve
a root.

If the minimum and maximum are quite different, then some processes
may be escaping ahead to start later iterations before the last one
has completely finished. In this case, one may use the maximum time
reported or insert a barrier between consecutive invocations (build
with "make" instead of "make nobar") to synchronize the processes.

For Reduce and Allreduce, vectors of doubles are added, so message
sizes of 1, 2, and 4 bytes are skipped.

The two available make commands build mpiBench with test kernels
like the following:

    "make"                     "make nobar"
    start = timer()            start = timer()
    for (i = 0; i < N; i++) {  for (i = 0; i < N; i++) {
      collective(size)           collective(size)
      barrier()                }
    }                          end = timer()
    end = timer()