├── .gitignore ├── CONTRIBUTING ├── LICENSE ├── Makefile ├── README ├── _clang-format ├── arena.c ├── arena.h ├── br_asm.c ├── br_asm.h ├── cpu_util.h ├── fairness.c ├── gen_expand ├── multichase.c ├── multiload.c ├── permutation.c ├── permutation.h ├── pingpong.c ├── run_multiload.sh ├── timer.h ├── util.c └── util.h /.gitignore: -------------------------------------------------------------------------------- 1 | /*.o 2 | /expand.h 3 | 4 | /fairness 5 | /multichase 6 | /multiload 7 | /pingpong 8 | -------------------------------------------------------------------------------- /CONTRIBUTING: -------------------------------------------------------------------------------- 1 | Want to contribute? Great! First, read this page (including the small print at the end). 2 | 3 | ### Before you contribute 4 | Before we can use your code, you must sign the 5 | [Google Individual Contributor License Agreement] 6 | (https://cla.developers.google.com/about/google-individual) 7 | (CLA), which you can do online. The CLA is necessary mainly because you own the 8 | copyright to your changes, even after your contribution becomes part of our 9 | codebase, so we need your permission to use and distribute your code. We also 10 | need to be sure of various other things—for instance that you'll tell us if you 11 | know that your code infringes on other people's patents. You don't have to sign 12 | the CLA until after you've submitted your code for review and a member has 13 | approved it, but you must do it before we can put your code into our codebase. 14 | Before you start working on a larger contribution, you should get in touch with 15 | us first through the issue tracker with your idea so that we can help out and 16 | possibly guide you. Coordinating up front makes it much easier to avoid 17 | frustration later on. 18 | 19 | ### Code reviews 20 | All submissions, including submissions by project members, require review. We 21 | use Github pull requests for this purpose. 
22 | 23 | ### The small print 24 | Contributions made by corporations are covered by a different agreement than 25 | the one above, the 26 | [Software Grant and Corporate Contributor License Agreement] 27 | (https://cla.developers.google.com/about/google-corporate). 28 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | 3 | Version 2.0, January 2004 4 | 5 | http://www.apache.org/licenses/ 6 | 7 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 8 | 9 | 1. Definitions. 10 | 11 | "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. 12 | "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. 13 | "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. 14 | "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. 15 | "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. 16 | "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. 
17 | "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). 18 | "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. 19 | "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." 20 | "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 21 | 22 | 2. Grant of Copyright License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 23 | 24 | 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 25 | 26 | 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: 27 | You must give any other recipients of the Work or Derivative Works a copy of this License; and 28 | You must cause any modified files to carry prominent notices stating that You changed the files; and 29 | You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and 30 | If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. 31 | You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 
32 | 33 | 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 34 | 35 | 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 36 | 37 | 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 38 | 39 | 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 40 | 41 | 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. 42 | 43 | END OF TERMS AND CONDITIONS 44 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Google Inc. All Rights Reserved. 2 | # Licensed under the Apache License, Version 2.0 (the "License"); 3 | # you may not use this file except in compliance with the License. 
4 | # You may obtain a copy of the License at 5 | # 6 | # http://www.apache.org/licenses/LICENSE-2.0 7 | # 8 | # Unless required by applicable law or agreed to in writing, software 9 | # distributed under the License is distributed on an "AS IS" BASIS, 10 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | # See the License for the specific language governing permissions and 12 | # limitations under the License. 13 | # 14 | CFLAGS=-std=gnu99 -g -O3 -fomit-frame-pointer -fno-unroll-loops -Wall -Wstrict-prototypes -Wmissing-prototypes -Wshadow -Wmissing-declarations -Wnested-externs -Wpointer-arith -W -Wno-unused-parameter -Werror -pthread -Wno-tautological-compare 15 | LDFLAGS=-g -O3 -static -pthread 16 | LDLIBS=-lrt -lm 17 | 18 | ARCH ?= $(shell uname -m) 19 | 20 | ifeq ($(ARCH),aarch64) 21 | CAP ?= $(shell cat /proc/cpuinfo | grep -E 'atomics|sve' | head -1) 22 | ifneq (,$(findstring sve,$(CAP))) 23 | CFLAGS+=-march=armv8.2-a+sve 24 | else ifneq (,$(findstring atomics,$(CAP))) 25 | CFLAGS+=-march=armv8.1-a+lse 26 | endif 27 | endif 28 | 29 | EXE=multichase multiload fairness pingpong 30 | 31 | all: $(EXE) 32 | 33 | clean: 34 | rm -f $(EXE) *.o expand.h 35 | 36 | .c.s: 37 | $(CC) $(CFLAGS) -S -c $< 38 | 39 | multichase: multichase.o permutation.o arena.o br_asm.o util.o 40 | 41 | multiload: multiload.o permutation.o arena.o util.o 42 | 43 | fairness: LDLIBS += -lm 44 | 45 | expand.h: gen_expand 46 | ./gen_expand 200 >expand.h.tmp 47 | mv expand.h.tmp expand.h 48 | 49 | depend: 50 | makedepend -Y -- $(CFLAGS) -- *.c 51 | 52 | # DO NOT DELETE 53 | 54 | arena.o: arena.h 55 | multichase.o: cpu_util.h timer.h expand.h permutation.h arena.h util.h 56 | multiload.o: cpu_util.h timer.h expand.h permutation.h arena.h util.h 57 | permutation.o: permutation.h 58 | util.o: util.h 59 | fairness.o: cpu_util.h expand.h timer.h 60 | pingpong.o: cpu_util.h timer.h 61 | -------------------------------------------------------------------------------- 
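The aarch64 block in the Makefile above selects `-march` flags by grepping /proc/cpuinfo. Below is a standalone sketch of that same selection logic; the `MARCH` variable and the empty fallback are illustrative, not part of the Makefile:

```shell
# Probe CPU features the way the Makefile's CAP variable does
# (empty on hosts without /proc/cpuinfo, taking the fallback branch).
CAP=$(grep -E 'atomics|sve' /proc/cpuinfo 2>/dev/null | head -1)

# Mirror the Makefile's findstring checks: SVE takes priority over LSE atomics.
case "$CAP" in
  *sve*)     MARCH="armv8.2-a+sve" ;;
  *atomics*) MARCH="armv8.1-a+lse" ;;
  *)         MARCH="" ;;  # no extra -march flags added
esac
echo "march: ${MARCH:-none}"
```

Because `CAP ?=` is a conditional assignment, the probe can also be overridden at invocation time, e.g. `make ARCH=aarch64 CAP=atomics`.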
/README: -------------------------------------------------------------------------------- 1 | Multichase - a pointer chaser benchmark 2 | Multiload - a superset of multichase which runs latency, memory bandwidth, and loaded-latency 3 | 4 | 1/ BUILD 5 | 6 | - just type: 7 | 8 | $ make 9 | 10 | 2/ INSTALL 11 | 12 | - just run from current directory or copy multichase wherever you need to 13 | 14 | 3.1/ RUN Multichase 15 | 16 | - To get help 17 | 18 | $ multichase -h 19 | 20 | - By default, multichase will perform a pointer chase through an array 21 | size of 256MB and a stride size of 256 bytes for 2.5 seconds on a single 22 | thread: 23 | 24 | $ multichase 25 | 26 | - Pointer chase through an array of 4MB with a stride size of 64 bytes: 27 | 28 | $ multichase -m 4m -s 64 29 | 30 | - Pointer chase through an array of 1GB for 10 seconds (-n is the number of 0.5 second samples): 31 | 32 | $ multichase -m 1g -n 20 33 | 34 | - Pointer chase through an array of 256KB with a stride size of 128 bytes on 2 threads. 35 | Thread 0 accesses every 128th byte, thread 1 accesses every 128th byte offset by sizeof(void*)=8 36 | on 64bit architectures: 37 | 38 | $ multichase -m 256k -s 128 -t 2 39 | 40 | 3.2/ RUN Multiload 41 | 42 | - Latency Only (simple pointer chase) 43 | In this mode, Multiload can run any of the multichase commands above. 44 | A "-c" chase arg (other than chaseload) can be used or it will default to "simple". 45 | Using either "-c chaseload" and/or the "-l" load arguments will choose a different test mode. 46 | 47 | $ multiload 48 | 49 | - Bandwidth Only 50 | Multiload can run a memory bandwidth test using the "-l" load argument. The "-c" chase argument MUST NOT be used. 51 | Below command runs 5 samples (~2.5 seconds each), using 16 threads, using the glibc memcpy() function, 52 | using a 512M buffer per thread. 53 | 54 | $ multiload -n 5 -t 16 -m 512M -l memcpy-libc 55 | 56 | - Loaded Latency. 
57 | Multiload can run 1 pointer chaser thread on logical cpu0 with multiple memory bandwidth load threads. 58 | The "-c chaseload" arg MUST be used. The "-l" arg MUST be used with one of the memory load arguments. 59 | Below command runs 5 samples (~2.5 seconds each), on 16 threads (1 chase, 15 stream-sum bandwidth loads), 60 | using a 512M buffer per thread. The chase thread uses a stride=16. 61 | 62 | $ multiload -s 16 -n 5 -t 16 -m 512M -c chaseload -l stream-sum 63 | 64 | 3.3/ RUN Pingpong & fairness 65 | 66 | - Pingpong: measure latency of exchanging a line between cores. 67 | To run, simply do: 68 | $ pingpong -u 69 | 70 | - Fairness: measure fairness with N threads competing to increment an atomic variable. 71 | To run, simply do: 72 | $ fairness 73 | -------------------------------------------------------------------------------- /_clang-format: -------------------------------------------------------------------------------- 1 | BasedOnStyle: Google 2 | IndentWidth: 2 3 | Language: Cpp 4 | -------------------------------------------------------------------------------- /arena.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
13 |  */
14 | #include "arena.h"
15 | 
16 | #include <linux/mempolicy.h>
17 | #include <stdbool.h>
18 | #include <stdint.h>
19 | #include <stdio.h>
20 | #include <stdlib.h>
21 | #include <string.h>
22 | #include <sys/mman.h>
23 | #include <sys/syscall.h>
24 | #include <unistd.h>
25 | 
26 | #include "permutation.h"
27 | 
28 | extern int verbosity;
29 | extern int is_weighted_mbind;
30 | extern uint16_t mbind_weights[MAX_MEM_NODES];
31 | 
32 | size_t get_native_page_size(void) {
33 |   long sz;
34 | 
35 |   sz = sysconf(_SC_PAGESIZE);
36 |   if (sz < 0) {
37 |     perror("failed to get native page size");
38 |     exit(1);
39 |   }
40 | 
41 |   return (size_t)sz;
42 | }
43 | 
44 | bool page_size_is_huge(size_t page_size) {
45 |   return page_size > get_native_page_size();
46 | }
47 | 
48 | void print_page_size(size_t page_size, bool use_thp) {
49 |   FILE *f;
50 |   size_t read;
51 |   /* Big enough to fit UINT64_MAX + '\n' + '\0'. */
52 |   char buf[22];
53 | 
54 |   if (!use_thp) {
55 |     printf("page_size = %zu bytes\n", page_size);
56 |     return;
57 |   }
58 | 
59 |   f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
60 |   if (!f) goto err;
61 | 
62 |   read = fread(buf, 1, sizeof(buf) - 1, f);
63 |   if (!feof(f)) goto err;
64 | 
65 |   if (fclose(f)) goto err;
66 | 
67 |   if (read && buf[read - 1] == '\n') --read;
68 |   buf[read] = '\0';
69 | 
70 |   printf("page_size = %s bytes (THP)\n", buf);
71 |   return;
72 | 
73 | err:
74 |   perror(
75 |       "page_size = <unknown>"
76 |       " bytes (THP)");
77 | }
78 | 
79 | static inline int mbind(void *addr, unsigned long len, int mode,
80 |                         unsigned long *nodemask, unsigned long maxnode,
81 |                         unsigned flags) {
82 |   return syscall(__NR_mbind, addr, len, mode, nodemask, maxnode, flags);
83 | }
84 | 
85 | static void arena_weighted_mbind(size_t page_size, void *arena,
86 |                                  size_t arena_size, uint16_t *weights,
87 |                                  size_t nr_weights) {
88 |   /* compute cumulative sum for weights
89 |    * cumulative sum starts at -1
90 |    * the method for determining a hit on a weight i is when the generated
91 |    * random number (modulo sum of weights) <= weights_cumsum[i]
92 |    */
93 |   int64_t *weights_cumsum =
      malloc(nr_weights * sizeof(int64_t));
94 |   if (!weights_cumsum) {
95 |     fprintf(stderr, "Couldn't allocate memory for weights.\n");
96 |     exit(1);
97 |   }
98 |   weights_cumsum[0] = weights[0] - 1;
99 |   for (unsigned int i = 1; i < nr_weights; i++) {
100 |     weights_cumsum[i] = weights_cumsum[i - 1] + weights[i];
101 |   }
102 |   const int32_t weight_sum = weights_cumsum[nr_weights - 1] + 1;
103 | 
104 |   uint64_t mask = 0;
105 |   char *q = (char *)arena + arena_size;
106 |   rng_init(1);
107 |   for (char *p = arena; p < q; p += page_size) {
108 |     uint32_t r = rng_int(1U << 31) % weight_sum;
109 |     unsigned int node;
110 |     for (node = 0; node < nr_weights; node++) {
111 |       if (weights_cumsum[node] >= r) {
112 |         break;
113 |       }
114 |     }
115 |     mask = 1ULL << node; /* 64-bit shift: node may exceed 31 */
116 |     if (mbind(p, page_size, MPOL_BIND, &mask, nr_weights, MPOL_MF_STRICT)) {
117 |       perror("mbind");
118 |       exit(1);
119 |     }
120 |     *p = 0;
121 |   }
122 |   free(weights_cumsum);
123 | }
124 | 
125 | static int get_page_size_flags(size_t page_size) {
126 |   int lg = 0;
127 | 
128 |   if (!page_size || (page_size & (page_size - 1))) {
129 |     fprintf(stderr, "page size must be a power of 2: %zu\n", page_size);
130 |     exit(1);
131 |   }
132 | 
133 |   if (!page_size_is_huge(page_size)) {
134 |     return 0;
135 |   }
136 | 
137 |   /*
138 |    * We need not just MAP_HUGETLB, but also a flag specifying the page size.
139 |    * mmap(2) says that these flags are defined as:
140 |    * log2(page size) << MAP_HUGE_SHIFT.
141 |    */
142 |   while (page_size >>= 1) {
143 |     ++lg;
144 |   }
145 |   return MAP_HUGETLB | (lg << MAP_HUGE_SHIFT);
146 | }
147 | 
148 | /*
149 |  * Reads a "state" file from sysfs at the given path, and returns the current
150 |  * state. The caller must free() the returned pointer when finished with it.
151 |  *
152 |  * The file must be formatted like this:
153 |  *
154 |  *   state1 state2 [state3] state4
155 |  *
156 |  * The state surrounded by []s is the currently active one. It is returned
157 |  * as-is, including the surrounding []s.
158 | */ 159 | static char *read_sysfs_state_file(char const *path) { 160 | FILE *f = fopen(path, "r"); 161 | char *token = NULL; 162 | int ret; 163 | 164 | if (f == NULL) { 165 | perror("open sysfs state file"); 166 | exit(1); 167 | } 168 | 169 | while ((ret = fscanf(f, "%ms", &token)) == 1) { 170 | if (token[0] == '[') break; 171 | 172 | free(token); 173 | token = NULL; 174 | } 175 | 176 | if (ferror(f) && !feof(f)) { 177 | perror("read sysfs state file"); 178 | exit(1); 179 | } 180 | 181 | if (fclose(f)) { 182 | perror("close sysfs state file"); 183 | exit(1); 184 | } 185 | 186 | return token; 187 | } 188 | 189 | static void write_sysfs_file(char const *path, char const *value) { 190 | FILE *f = fopen(path, "w"); 191 | 192 | if (f == NULL) { 193 | perror("open sysfs file for write"); 194 | exit(1); 195 | } 196 | 197 | if (fprintf(f, "%s\n", value) < 0) { 198 | perror("write value to sysfs file"); 199 | exit(1); 200 | } 201 | 202 | if (fclose(f)) { 203 | perror("close sysfs file"); 204 | exit(1); 205 | } 206 | } 207 | 208 | /* 209 | * In order for MADV_HUGEPAGE to work, THP configuration must be in one of 210 | * several acceptable states. Check if the existing system configuration is 211 | * acceptable, and if not, try to change the configuration. 
212 | */ 213 | static void check_thp_state(void) { 214 | char *enabled = 215 | read_sysfs_state_file("/sys/kernel/mm/transparent_hugepage/enabled"); 216 | char *defrag = 217 | read_sysfs_state_file("/sys/kernel/mm/transparent_hugepage/defrag"); 218 | 219 | if (strcmp(enabled, "[always]") && strcmp(enabled, "[madvise]")) { 220 | write_sysfs_file("/sys/kernel/mm/transparent_hugepage/enabled", "madvise"); 221 | } 222 | 223 | if (strcmp(defrag, "[always]") && strcmp(defrag, "[defer+madvise]") && 224 | strcmp(defrag, "[madvise]")) { 225 | write_sysfs_file("/sys/kernel/mm/transparent_hugepage/defrag", "madvise"); 226 | } 227 | 228 | free(enabled); 229 | free(defrag); 230 | } 231 | 232 | void *alloc_arena_mmap(size_t page_size, bool use_thp, size_t arena_size, int fd) { 233 | void *arena; 234 | size_t pagemask = page_size - 1; 235 | int flags; 236 | 237 | if (fd == -1) 238 | flags = MAP_PRIVATE | MAP_ANONYMOUS | get_page_size_flags(page_size); 239 | else 240 | flags = MAP_SHARED; 241 | 242 | arena_size = (arena_size + pagemask) & ~pagemask; 243 | arena = mmap(0, arena_size, PROT_READ | PROT_WRITE, flags, fd, 0); 244 | if (arena == MAP_FAILED) { 245 | perror("mmap"); 246 | exit(1); 247 | } 248 | 249 | if (use_thp && fd == -1) 250 | check_thp_state(); 251 | 252 | /* Explicitly disable THP for small pages. */ 253 | if (!page_size_is_huge(page_size) && fd == -1) { 254 | if (madvise(arena, arena_size, use_thp ? 
MADV_HUGEPAGE : MADV_NOHUGEPAGE)) { 255 | perror("madvise"); 256 | } 257 | } else if (use_thp) { 258 | fprintf(stderr, 259 | "Can't use transparent hugepages with a non-native page size.\n"); 260 | exit(1); 261 | } 262 | 263 | if (is_weighted_mbind) { 264 | arena_weighted_mbind(page_size, arena, arena_size, mbind_weights, 265 | MAX_MEM_NODES); 266 | } 267 | return arena; 268 | } 269 | 270 | void make_buffer_executable(void *buf, size_t len) { 271 | if (mprotect(buf, len, PROT_EXEC)) { 272 | perror("mprotect(PROT_EXEC)"); 273 | exit(1); 274 | } 275 | } 276 | 277 | -------------------------------------------------------------------------------- /arena.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
13 |  */
14 | #ifndef ARENA_H_INCLUDED
15 | #define ARENA_H_INCLUDED
16 | 
17 | #include <stdbool.h>
18 | #include <stddef.h>
19 | #include <stdint.h>
20 | 
21 | #define MAX_MEM_NODES (8 * sizeof(uint64_t))
22 | 
23 | size_t get_native_page_size(void);
24 | bool page_size_is_huge(size_t page_size);
25 | void print_page_size(size_t page_size, bool use_thp);
26 | 
27 | void *alloc_arena_mmap(size_t page_size, bool use_thp, size_t arena_size, int fd);
28 | 
29 | void make_buffer_executable(void *buf, size_t len);
30 | #endif
31 | 
-------------------------------------------------------------------------------- /br_asm.c: --------------------------------------------------------------------------------
1 | #include <math.h>
2 | #include <stdint.h>
3 | #include <stdio.h>
4 | #include <stdlib.h>
5 | 
6 | #include "br_asm.h"
7 | 
8 | static int cycle_len(void *p) {
9 |   int count = 0;
10 |   char *next = (char *)p;
11 |   do {
12 |     ++count;
13 |     next = *(char **)next;
14 |   } while (next != p);
15 |   return count;
16 | }
17 | 
18 | #if defined(__aarch64__)
19 | 
20 | static uint64_t rbits(uint64_t val, int rmask) {
21 |   return val & ((1ULL << rmask) - 1);
22 | }
23 | 
24 | static uint64_t rmask_lsh(uint64_t val, int rmask, int left_shift) {
25 |   return (val & ((1ULL << rmask) - 1)) << left_shift;
26 | }
27 | 
28 | static void arm_emit_br(char *p, int rs) {
29 |   *(int *)p = 0b11010110000111110000000000000000 | rmask_lsh(rs, 5, 5);
30 | }
31 | 
32 | static void arm_emit_movk(char *p, int rd, int imm16, int shift16) {
33 |   *(int *)p = (0b111100101 << 23) | rmask_lsh(shift16, 16, 21) |
34 |               rmask_lsh(imm16, 16, 5) | rmask_lsh(rd, 5, 0);
35 | }
36 | 
37 | static void arm_emit_movz(char *p, int rd, int imm16, int shift16) {
38 |   *(int *)p = (0b110100101 << 23) | rmask_lsh(shift16, 16, 21) |
39 |               rmask_lsh(imm16, 16, 5) | rmask_lsh(rd, 5, 0);
40 | }
41 | 
42 | static void arm_emit_ret(char *p) { *(int *)p = 0xd65f03c0; }
43 | 
44 | // Convert a pointer chase to a branch chase, returning after chunk_size
45 | // branches with a function pointer to the next branch in the chase.
46 | int convert_pointers_to_branches(void *head, int chunk_size) { 47 | int remain = cycle_len(head); 48 | chunk_size = (remain < chunk_size) 49 | ? remain 50 | : remain / (1 << lround(log2(1.0 * remain / chunk_size))); 51 | int base_chunk_size = chunk_size; 52 | int chunks_remaining = remain / chunk_size; 53 | int chunk_count = 0; 54 | const int ptr_reg = 0; // Address of next branch and return value. 55 | char *p = (char *)head; 56 | do { 57 | if (!chunk_count) chunk_count = remain / chunks_remaining; 58 | char *next = *((char **)p); 59 | const int br_code_len = 16; 60 | for (int i = 8; i < br_code_len; i++) { 61 | if (p[i]) { 62 | fprintf(stderr, "not enough space to convert a pointer to branches"); 63 | exit(1); 64 | } 65 | } 66 | // VA is 48 bits max. 67 | uint64_t shift00 = rbits((intptr_t)next, 16); 68 | uint64_t shift16 = rbits((intptr_t)next >> 16, 16); 69 | uint64_t shift32 = rbits((intptr_t)next >> 32, 16); 70 | arm_emit_movz(p, ptr_reg, shift00, 0); 71 | arm_emit_movk(p + 4, ptr_reg, shift16, 1); 72 | arm_emit_movk(p + 8, ptr_reg, shift32, 2); 73 | // At end of branch chunk, return with a pointer to the next chunk. 
74 | --remain; 75 | if (--chunk_count == 0) { 76 | arm_emit_ret(p + 12); 77 | --chunks_remaining; 78 | } else { 79 | arm_emit_br(p + 12, ptr_reg); 80 | } 81 | p = next; 82 | } while (p != head); 83 | return base_chunk_size; 84 | } 85 | 86 | #elif defined(__x86_64__) 87 | 88 | static char *x64_emit_mov_imm64_rax(char *p, uint64_t imm64) { 89 | *p++ = 0x48; // movabs 90 | *p++ = 0xb8; 91 | for (int i = 0; i < 8; i++) 92 | *p++ = imm64 >> (8 * i); // immediate value (8 bytes) 93 | return p; 94 | } 95 | 96 | static char *x64_emit_jmp_to_rax(char *p) { 97 | *p++ = 0xff; // jmp *rax 98 | *p++ = 0xe0; 99 | return p; 100 | } 101 | 102 | static char *x64_emit_ret(char *p) { 103 | *p++ = 0xc3; 104 | return p; 105 | } 106 | 107 | // Convert a pointer chase to a branch chase, returning after chunk_size 108 | // branches with a function pointer to the next branch in the chase. 109 | int convert_pointers_to_branches(void *head, int chunk_size) { 110 | int remain = cycle_len(head); 111 | chunk_size = (remain < chunk_size) 112 | ? remain 113 | : remain / (1 << lround(log2(1.0 * remain / chunk_size))); 114 | int base_chunk_size = chunk_size; 115 | int chunks_remaining = remain / chunk_size; 116 | int chunk_count = 0; 117 | const int br_code_len = 12; // len(mov_imm64) + max(len(jmp), len(ret)) 118 | char *p = (char *)head; 119 | do { 120 | if (!chunk_count) chunk_count = remain / chunks_remaining; 121 | char *next = *((char **)p); 122 | for (int i = 8; i < br_code_len; i++) { 123 | if (p[i]) { 124 | fprintf(stderr, "not enough space to convert a pointer to branches\n"); 125 | exit(1); 126 | } 127 | } 128 | p = x64_emit_mov_imm64_rax(p, (intptr_t)next); 129 | // At end of branch chunk, return with a pointer to the next chunk. 
130 | --remain; 131 | if (--chunk_count == 0) { 132 | p = x64_emit_ret(p); 133 | --chunks_remaining; 134 | } else { 135 | p = x64_emit_jmp_to_rax(p); 136 | } 137 | p = next; 138 | } while (p != head); 139 | return base_chunk_size; 140 | } 141 | 142 | #else 143 | int convert_pointers_to_branches(void *head, int chunk_size) { 144 | fprintf(stderr, "Not implemented on this architecture.\n"); 145 | exit(1); 146 | } 147 | #endif 148 | -------------------------------------------------------------------------------- /br_asm.h: -------------------------------------------------------------------------------- 1 | #ifndef ASM_H_INCLUDED 2 | #define ASM_H_INCLUDED 3 | 4 | int convert_pointers_to_branches(void *head, int chunk_size); 5 | #endif 6 | -------------------------------------------------------------------------------- /cpu_util.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #ifndef PORT_H_INCLUDED 15 | #define PORT_H_INCLUDED 16 | 17 | // We assume 1024 bytes is good enough alignment to avoid false sharing on all 18 | // architectures. 
19 | #define AVOID_FALSE_SHARING (1024) 20 | #ifndef CACHELINE_SIZE 21 | #define CACHELINE_SIZE 64 22 | #endif 23 | #ifndef SWEEP_MAX 24 | #define SWEEP_MAX 256 25 | #endif 26 | #ifndef SWEEP_SPACER 27 | #define SWEEP_SPACER (CACHELINE_SIZE - sizeof(unsigned)) 28 | #endif 29 | 30 | #if defined(__x86_64__) || defined(__i386__) 31 | static inline void cpu_relax(void) { asm volatile("rep; nop"); } 32 | #elif defined __powerpc__ 33 | static inline void cpu_relax(void) { 34 | // HMT_low() 35 | asm volatile("or 1,1,1"); 36 | // HMT_medium() 37 | asm volatile("or 2,2,2"); 38 | // barrier() 39 | asm volatile("" : : : "memory"); 40 | } 41 | #elif defined(__aarch64__) 42 | #define cpu_relax() asm volatile("yield" ::: "memory") 43 | #else 44 | #warning "no cpu_relax for your cpu" 45 | #define cpu_relax() \ 46 | do { \ 47 | } while (0) 48 | #endif 49 | 50 | #endif 51 | -------------------------------------------------------------------------------- /fairness.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
13 | */ 14 | #define _GNU_SOURCE 15 | #include <math.h> 16 | #include <pthread.h> 17 | #include <sched.h> 18 | #include <stdint.h> 19 | #include <stdio.h> 20 | #include <stdlib.h> 21 | #include <string.h> 22 | #include <sys/types.h> 23 | #include <unistd.h> 24 | 25 | #include "cpu_util.h" 26 | #include "expand.h" 27 | #include "timer.h" 28 | 29 | typedef unsigned atomic_t; 30 | 31 | typedef union { 32 | struct { 33 | atomic_t count; 34 | int cpu; 35 | } x; 36 | char pad[AVOID_FALSE_SHARING]; 37 | } per_thread_t; 38 | 39 | typedef struct { 40 | struct { 41 | atomic_t count; 42 | char spacer[SWEEP_SPACER]; 43 | } x[SWEEP_MAX]; 44 | int sweep_id; 45 | } sync_count; 46 | 47 | sync_count global_counter; 48 | 49 | static volatile int relaxed; 50 | 51 | static pthread_mutex_t wait_mutex = PTHREAD_MUTEX_INITIALIZER; 52 | static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER; 53 | static size_t nr_to_startup; 54 | static uint64_t delay_mask; 55 | 56 | static void wait_for_startup(void) { 57 | // wait for everyone to spawn 58 | pthread_mutex_lock(&wait_mutex); 59 | --nr_to_startup; 60 | if (nr_to_startup) { 61 | pthread_cond_wait(&wait_cond, &wait_mutex); 62 | } else { 63 | pthread_cond_broadcast(&wait_cond); 64 | } 65 | pthread_mutex_unlock(&wait_mutex); 66 | } 67 | 68 | static void *worker(void *_args) { 69 | per_thread_t *args = _args; 70 | 71 | // move to our target cpu 72 | cpu_set_t cpu; 73 | CPU_ZERO(&cpu); 74 | CPU_SET(args->x.cpu, &cpu); 75 | if (sched_setaffinity(0, sizeof(cpu), &cpu)) { 76 | perror("sched_setaffinity"); 77 | exit(1); 78 | } 79 | 80 | wait_for_startup(); 81 | 82 | if (delay_mask & ((uint64_t)1 << args->x.cpu)) { 83 | sleep(1); 84 | } 85 | while (!relaxed) { 86 | atomic_t *target = &(global_counter.x[global_counter.sweep_id].count); 87 | x50(__sync_fetch_and_add(target, 1);); 88 | __sync_fetch_and_add(&args->x.count, 50); 89 | } 90 | if (delay_mask & ((uint64_t)1 << args->x.cpu)) { 91 | sleep(1); 92 | } 93 | while (relaxed) { 94 | atomic_t *target = &(global_counter.x[global_counter.sweep_id].count); 95 | x50(__sync_fetch_and_add(target, 1); cpu_relax();); 96
| __sync_fetch_and_add(&args->x.count, 50); 97 | } 98 | return NULL; 99 | } 100 | 101 | int main(int argc, char **argv) { 102 | int c; 103 | int sweep_max = 1; 104 | size_t time_slice = 500000; 105 | char sep = ' '; 106 | size_t sample_size = 6; 107 | 108 | delay_mask = 0; 109 | while ((c = getopt(argc, argv, "d:n:s:t:S:")) != -1) { 110 | switch (c) { 111 | case 'd': 112 | delay_mask = strtoul(optarg, 0, 0); 113 | break; 114 | case 'n': 115 | sample_size = strtof(optarg, 0); 116 | break; 117 | case 's': 118 | sweep_max = strtoul(optarg, 0, 0); 119 | break; 120 | case 't': 121 | time_slice = strtof(optarg, 0) * 1000000; 122 | break; 123 | case 'S': 124 | sep = *optarg; 125 | break; 126 | default: 127 | goto usage; 128 | } 129 | } 130 | 131 | if (argc - optind != 0) { 132 | usage: 133 | fprintf( 134 | stderr, 135 | "usage: %s \n" 136 | " [-d delay_mask]\n" 137 | " [-n sample_nr]\n" 138 | " [-s sweep_max]\n" 139 | " [-t time]\n" 140 | "By default runs one thread on each cpu, use taskset(1) to " 141 | "restrict operation to fewer cpus/threads.\n" 142 | "The optional delay_mask specifies a mask of cpus on which to delay " 143 | "the startup.\n" 144 | "The optional sample_nr determines the number of samples taken for " 145 | "each test point.\n" 146 | "The optional sweep_max causes testing across multiple different " 147 | "cache lines.\n" 148 | "The optional time determines how often to poll results (float in " 149 | "seconds).\n", 150 | argv[0]); 151 | exit(1); 152 | } 153 | 154 | setvbuf(stdout, NULL, _IONBF, BUFSIZ); 155 | 156 | // find the active cpus 157 | cpu_set_t cpus; 158 | if (sched_getaffinity(getpid(), sizeof(cpus), &cpus)) { 159 | perror("sched_getaffinity"); 160 | exit(1); 161 | } 162 | 163 | // could do this more efficiently, but whatever 164 | size_t nr_threads = 0; 165 | int i; 166 | for (i = 0; i < CPU_SETSIZE; ++i) { 167 | if (CPU_ISSET(i, &cpus)) { 168 | ++nr_threads; 169 | } 170 | } 171 | 172 | per_thread_t *thread_args = calloc(nr_threads, 
sizeof(*thread_args)); 173 | nr_to_startup = nr_threads + 1; 174 | size_t u; 175 | i = 0; 176 | for (u = 0; u < nr_threads; ++u) { 177 | while (!CPU_ISSET(i, &cpus)) { 178 | ++i; 179 | } 180 | thread_args[u].x.cpu = i; 181 | ++i; 182 | thread_args[u].x.count = 0; 183 | pthread_t dummy; 184 | if (pthread_create(&dummy, NULL, worker, &thread_args[u])) { 185 | perror("pthread_create"); 186 | exit(1); 187 | } 188 | } 189 | 190 | wait_for_startup(); 191 | 192 | atomic_t *samples = calloc(nr_threads, sizeof(*samples)); 193 | 194 | printf( 195 | "results are avg latency per locked increment in ns, one column per " 196 | "thread\n"); 197 | char *fmt, *tail = ""; 198 | if (sep == ',') { 199 | printf("relaxed,sweep"); 200 | fmt = ",cpu-%u"; 201 | tail = ",avg,stdev,min,max"; 202 | } else { 203 | fmt = "%6u "; 204 | printf("cpu:"); 205 | } 206 | for (u = 0; u < nr_threads; ++u) { 207 | printf(fmt, thread_args[u].x.cpu); 208 | } 209 | printf("%s\n", tail); 210 | global_counter.sweep_id = 0; 211 | for (relaxed = 0; relaxed < 2; ++relaxed) { 212 | if (sep != ',') printf(relaxed ? "relaxed:\n" : "unrelaxed:\n"); 213 | for (int sweep = 0; sweep < sweep_max; sweep++) { 214 | global_counter.sweep_id = sweep; 215 | uint64_t last_stamp = now_nsec(); 216 | size_t sample_nr; 217 | // The first sample is thrown out, so increment sample_size by 1 to get 218 | // the requested number of samples. 
219 | for (sample_nr = 0; sample_nr < sample_size + 1; ++sample_nr) { 220 | double min = 1.0 / 0., max = 0.; 221 | usleep(time_slice); 222 | for (u = 0; u < nr_threads; ++u) { 223 | samples[u] = __sync_lock_test_and_set(&thread_args[u].x.count, 0); 224 | } 225 | uint64_t stamp = now_nsec(); 226 | int64_t time_delta = stamp - last_stamp; 227 | last_stamp = stamp; 228 | 229 | // throw away the first sample to avoid race issues at startup / mode 230 | // switch 231 | if (sample_nr == 0) continue; 232 | if (sep == ',') 233 | printf("%d,%p", relaxed, 234 | &(global_counter.x[global_counter.sweep_id].count)); 235 | 236 | if (sep == ',') { 237 | fmt = ",%.1f"; 238 | } else { 239 | fmt = " %6.1f"; 240 | printf(" "); 241 | } 242 | double sum = 0.; 243 | double sum_squared = 0.; 244 | for (u = 0; u < nr_threads; ++u) { 245 | double s = time_delta / (double)samples[u]; 246 | printf(fmt, s); 247 | min = min < s ? min : s; 248 | max = max > s ? max : s; 249 | sum += s; 250 | sum_squared += s * s; 251 | } 252 | if (sep == ',') { 253 | fmt = ",%.1f,%.1f,%.1f,%.1f\n"; 254 | } else { 255 | fmt = " : avg %6.1f sdev %6.1f min %6.1f max %6.1f\n"; 256 | } 257 | printf(fmt, sum / nr_threads, 258 | sqrt((sum_squared - sum * sum / nr_threads) / (nr_threads - 1)), 259 | min, max); 260 | fflush(stdout); 261 | } 262 | } 263 | } 264 | return 0; 265 | } 266 | -------------------------------------------------------------------------------- /gen_expand: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | # Copyright 2015 Google Inc. All Rights Reserved. 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | use warnings; 16 | use strict; 17 | 18 | defined(my $n = shift) 19 | or die "usage: $0 n\n"; 20 | 21 | # special cases i can't be bothered to handle otherwise 22 | print "#define x1(x) x\n"; 23 | print "#define x2(x) x x\n"; 24 | print "#define x3(x) x2(x) x\n"; 25 | 26 | sub find_largest_factors($) { 27 | my ($x) = @_; 28 | for (my $y = int(sqrt($x)); $y >= 2; --$y) { 29 | if ($x % $y == 0) { 30 | my $z = $x/$y; 31 | return "x$y(x$z(x))"; 32 | } 33 | } 34 | return undef; 35 | } 36 | 37 | foreach my $i (4..$n) { 38 | if (my $factors = find_largest_factors($i)) { 39 | print "#define x$i(x) $factors\n"; 40 | } 41 | else { 42 | my $factors = find_largest_factors($i-1); 43 | print "#define x$i(x) $factors x\n"; 44 | } 45 | } 46 | -------------------------------------------------------------------------------- /multichase.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #define _GNU_SOURCE 15 | #include <errno.h> 16 | #include <fcntl.h> 17 | #include <pthread.h> 18 | #include <sched.h> 19 | #include <stdbool.h> 20 | #include <stdint.h> 21 | #include <stdio.h> 22 | #include <stdlib.h> 23 | #include <string.h> 24 | #include <sys/time.h> 25 | #include <sys/types.h> 26 | #include <unistd.h> 27 | 28 | #include "arena.h" 29 | #include "br_asm.h" 30 | #include "cpu_util.h" 31 | #include "expand.h" 32 | #include "permutation.h" 33 | #include "timer.h" 34 | #include "util.h" 35 | 36 | // The total memory, stride, and TLB locality have been chosen carefully for 37 | // the current generation of CPUs: 38 | // 39 | // - at stride of 64 bytes the L2 next-line prefetch on p-m/core/core2 gives a 40 | //   helping hand 41 | // 42 | // - at stride of 128 bytes the stream prefetcher on various p4 decides the 43 | //   random accesses sometimes look like a stream and gives a helping hand. 44 | // 45 | // - the TLB locality could have been raised beyond 4 pages to defeat various 46 | //   stream prefetchers, but you need to get out well past 32 pages before 47 | //   all existing hw prefetchers are defeated, and then you start exceeding the 48 | //   TLB locality on several CPUs and incurring some TLB overhead. 49 | //   Hence, the default has been changed from 16 pages to 64 pages.
50 | // 51 | #define DEF_TOTAL_MEMORY ((size_t)256 * 1024 * 1024) 52 | #define DEF_STRIDE ((size_t)256) 53 | #define DEF_NR_SAMPLES ((size_t)5) 54 | #define DEF_TLB_LOCALITY ((size_t)64) 55 | #define DEF_NR_THREADS ((size_t)1) 56 | #define DEF_CACHE_FLUSH ((size_t)64 * 1024 * 1024) 57 | #define DEF_OFFSET ((size_t)0) 58 | #define DELAY_USECS 500000 59 | #define NSECS_PER_USEC 1000 60 | #define NSECS_PER_MSEC (1000 * 1000) 61 | 62 | int verbosity; 63 | int print_timestamp; 64 | int is_weighted_mbind; 65 | uint16_t mbind_weights[MAX_MEM_NODES]; 66 | 67 | #ifdef __i386__ 68 | #define MAX_PARALLEL (6) // maximum number of chases in parallel 69 | #else 70 | #define MAX_PARALLEL (10) 71 | #endif 72 | 73 | // forward declare 74 | typedef struct chase_t chase_t; 75 | 76 | // the arguments for the chase threads 77 | typedef union { 78 | char pad[AVOID_FALSE_SHARING]; 79 | struct { 80 | unsigned thread_num; // which thread is this 81 | unsigned count; // count of number of iterations 82 | void *cycle[MAX_PARALLEL]; // initial address for the chases 83 | const char *extra_args; 84 | int dummy; // useful for confusing the compiler 85 | 86 | const struct generate_chase_common_args *genchase_args; 87 | size_t nr_threads; 88 | const chase_t *chase; 89 | void *flush_arena; 90 | size_t cache_flush_size; 91 | bool use_longer_chase; 92 | int branch_chunk_size; 93 | } x; 94 | } per_thread_t; 95 | 96 | int always_zero; 97 | 98 | static void chase_simple(per_thread_t *t) { 99 | void *p = t->x.cycle[0]; 100 | 101 | do { 102 | x200(p = *(void **)p;) 103 | } while (__sync_add_and_fetch(&t->x.count, 200)); 104 | 105 | // we never actually reach here, but the compiler doesn't know that 106 | t->x.dummy = (uintptr_t)p; 107 | } 108 | 109 | // parallel chases 110 | 111 | #define declare(i) void *p##i = start[i]; 112 | #define cleanup(i) tmp += (uintptr_t)p##i; 113 | #if MAX_PARALLEL == 6 114 | #define parallel(foo) foo(0) foo(1) foo(2) foo(3) foo(4) foo(5) 115 | #else 116 | #define 
parallel(foo) \ 117 | foo(0) foo(1) foo(2) foo(3) foo(4) foo(5) foo(6) foo(7) foo(8) foo(9) 118 | #endif 119 | 120 | #define template(n, expand, inner) \ 121 | static void chase_parallel##n(per_thread_t *t) { \ 122 | void **start = t->x.cycle; \ 123 | parallel(declare) do { x##expand(inner) } \ 124 | while (__sync_add_and_fetch(&t->x.count, n * expand)) \ 125 | ; \ 126 | \ 127 | uintptr_t tmp = 0; \ 128 | parallel(cleanup) t->x.dummy = tmp; \ 129 | } 130 | 131 | #if defined(__x86_64__) || defined(__i386__) 132 | #define D(n) asm volatile("mov (%1),%0" : "=r"(p##n) : "r"(p##n)); 133 | #else 134 | #define D(n) p##n = *(void **)p##n; 135 | #endif 136 | template(2, 100, D(0) D(1)); 137 | template(3, 66, D(0) D(1) D(2)); 138 | template(4, 50, D(0) D(1) D(2) D(3)); 139 | template(5, 40, D(0) D(1) D(2) D(3) D(4)); 140 | template(6, 32, D(0) D(1) D(2) D(3) D(4) D(5)); 141 | #if MAX_PARALLEL > 6 142 | template(7, 28, D(0) D(1) D(2) D(3) D(4) D(5) D(6)); 143 | template(8, 24, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7)); 144 | template(9, 22, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7) D(8)); 145 | template(10, 20, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7) D(8) D(9)); 146 | #endif 147 | #undef D 148 | #undef parallel 149 | #undef cleanup 150 | #undef declare 151 | 152 | static void chase_work(per_thread_t *t) { 153 | void *p = t->x.cycle[0]; 154 | size_t extra_work = strtoul(t->x.extra_args, 0, 0); 155 | size_t work = 0; 156 | size_t i; 157 | 158 | // the extra work is intended to be overlapped with a dereference, 159 | // but we don't want it to skip past the next dereference. so 160 | // we fold in the value of the pointer, and launch the deref then 161 | // go into a loop performing extra work, hopefully while the 162 | // deref occurs. 
163 | do { 164 | x25(work += (uintptr_t)p; p = *(void **)p; 165 | for (i = 0; i < extra_work; ++i) { work ^= i; }) 166 | } while (__sync_add_and_fetch(&t->x.count, 25)); 167 | 168 | // we never actually reach here, but the compiler doesn't know that 169 | t->x.cycle[0] = p; 170 | t->x.dummy = work; 171 | } 172 | 173 | struct incr_struct { 174 | struct incr_struct *next; 175 | unsigned incme; 176 | }; 177 | 178 | static void chase_incr(per_thread_t *t) { 179 | struct incr_struct *p = t->x.cycle[0]; 180 | 181 | do { 182 | x50(++p->incme; p = *(void **)p;) 183 | } while (__sync_add_and_fetch(&t->x.count, 50)); 184 | 185 | // we never actually reach here, but the compiler doesn't know that 186 | t->x.cycle[0] = p; 187 | } 188 | 189 | static void chase_branch(per_thread_t *t) { 190 | void *p = t->x.cycle[0]; 191 | typedef void* (*fptr)(void); 192 | fptr fp = p; 193 | 194 | do { 195 | fp = (fptr)(fp)(); 196 | } while (__sync_add_and_fetch(&t->x.count, t->x.branch_chunk_size)); 197 | 198 | // we never actually reach here, but the compiler doesn't know that 199 | t->x.dummy = (uintptr_t)p; 200 | } 201 | 202 | #if defined(__x86_64__) || defined(__i386__) 203 | #define chase_prefetch(type) \ 204 | static void chase_prefetch##type(per_thread_t *t) { \ 205 | void *p = t->x.cycle[0]; \ 206 | \ 207 | do { \ 208 | x100(asm volatile("prefetch" #type " %0" ::"m"(*(void **)p)); \ 209 | p = *(void **)p;) \ 210 | } while (__sync_add_and_fetch(&t->x.count, 100)); \ 211 | \ 212 | /* we never actually reach here, but the compiler doesn't know that */ \ 213 | t->x.cycle[0] = p; \ 214 | } 215 | chase_prefetch(t0); 216 | chase_prefetch(t1); 217 | chase_prefetch(t2); 218 | chase_prefetch(nta); 219 | #undef chase_prefetch 220 | #endif 221 | 222 | #if defined(__x86_64__) 223 | static void chase_movdqa(per_thread_t *t) { 224 | void *p = t->x.cycle[0]; 225 | 226 | do { 227 | x100(asm volatile("\n movdqa (%%rax),%%xmm0" 228 | "\n movdqa 16(%%rax),%%xmm1" 229 | "\n paddq %%xmm1,%%xmm0" 230 | "\n 
movdqa 32(%%rax),%%xmm2" 231 | "\n paddq %%xmm2,%%xmm0" 232 | "\n movdqa 48(%%rax),%%xmm3" 233 | "\n paddq %%xmm3,%%xmm0" 234 | "\n movq %%xmm0,%%rax" 235 | : "=a"(p) 236 | : "0"(p));) 237 | } while (__sync_add_and_fetch(&t->x.count, 100)); 238 | t->x.cycle[0] = p; 239 | } 240 | 241 | static void chase_movntdqa(per_thread_t *t) { 242 | void *p = t->x.cycle[0]; 243 | 244 | do { 245 | x100(asm volatile( 246 | #ifndef BINUTILS_HAS_MOVNTDQA 247 | "\n .byte 0x66,0x0f,0x38,0x2a,0x00" 248 | "\n .byte 0x66,0x0f,0x38,0x2a,0x48,0x10" 249 | "\n paddq %%xmm1,%%xmm0" 250 | "\n .byte 0x66,0x0f,0x38,0x2a,0x50,0x20" 251 | "\n paddq %%xmm2,%%xmm0" 252 | "\n .byte 0x66,0x0f,0x38,0x2a,0x58,0x30" 253 | "\n paddq %%xmm3,%%xmm0" 254 | "\n movq %%xmm0,%%rax" 255 | #else 256 | "\n movntdqa (%%rax),%%xmm0" 257 | "\n movntdqa 16(%%rax),%%xmm1" 258 | "\n paddq %%xmm1,%%xmm0" 259 | "\n movntdqa 32(%%rax),%%xmm2" 260 | "\n paddq %%xmm2,%%xmm0" 261 | "\n movntdqa 48(%%rax),%%xmm3" 262 | "\n paddq %%xmm3,%%xmm0" 263 | "\n movq %%xmm0,%%rax" 264 | #endif 265 | : "=a"(p) 266 | : "0"(p));) 267 | } while (__sync_add_and_fetch(&t->x.count, 100)); 268 | t->x.cycle[0] = p; 269 | } 270 | 271 | static void chase_critword2(per_thread_t *t) { 272 | void *p = t->x.cycle[0]; 273 | size_t offset = strtoul(t->x.extra_args, 0, 0); 274 | void *q = (char *)p + offset; 275 | 276 | do { 277 | x100(asm volatile("mov (%1),%0" 278 | : "=r"(p) 279 | : "r"(p)); 280 | asm volatile("mov (%1),%0" 281 | : "=r"(q) 282 | : "r"(q));) 283 | } while (__sync_add_and_fetch(&t->x.count, 100)); 284 | 285 | t->x.cycle[0] = (void *)((uintptr_t)p + (uintptr_t)q); 286 | } 287 | #endif 288 | 289 | struct chase_t { 290 | void (*fn)(per_thread_t *t); 291 | size_t base_object_size; 292 | const char *name; 293 | const char *usage1; 294 | const char *usage2; 295 | int requires_arg; 296 | unsigned parallelism; // number of parallel chases (at least 1) 297 | }; 298 | static const chase_t chases[] = { 299 | // the default must be first 300 | { 
301 | .fn = chase_simple, 302 | .base_object_size = sizeof(void *), 303 | .name = "simple", 304 | .usage1 = "simple", 305 | .usage2 = "no frills pointer dereferencing", 306 | .requires_arg = 0, 307 | .parallelism = 1, 308 | }, 309 | { 310 | .fn = chase_work, 311 | .base_object_size = sizeof(void *), 312 | .name = "work", 313 | .usage1 = "work:N", 314 | .usage2 = "loop simple computation N times in between derefs", 315 | .requires_arg = 1, 316 | .parallelism = 1, 317 | }, 318 | { 319 | .fn = chase_incr, 320 | .base_object_size = sizeof(struct incr_struct), 321 | .name = "incr", 322 | .usage1 = "incr", 323 | .usage2 = "modify the cache line after each deref", 324 | .requires_arg = 0, 325 | .parallelism = 1, 326 | }, 327 | { 328 | .fn = chase_branch, 329 | .base_object_size = 16, 330 | .name = "branch", 331 | .usage1 = "branch", 332 | .usage2 = "convert pointers to branches", 333 | .requires_arg = 0, 334 | .parallelism = 1, 335 | }, 336 | #if defined(__x86_64__) || defined(__i386__) 337 | #define chase_prefetch(type) \ 338 | { \ 339 | .fn = chase_prefetch##type, .base_object_size = sizeof(void *), \ 340 | .name = #type, .usage1 = #type, \ 341 | .usage2 = "perform prefetch" #type " before each deref", \ 342 | .requires_arg = 0, .parallelism = 1, \ 343 | } 344 | chase_prefetch(t0), 345 | chase_prefetch(t1), 346 | chase_prefetch(t2), 347 | chase_prefetch(nta), 348 | #endif 349 | #if defined(__x86_64__) 350 | { 351 | .fn = chase_movdqa, 352 | .base_object_size = 64, 353 | .name = "movdqa", 354 | .usage1 = "movdqa", 355 | .usage2 = "use movdqa to read from memory", 356 | .requires_arg = 0, 357 | .parallelism = 1, 358 | }, 359 | { 360 | .fn = chase_movntdqa, 361 | .base_object_size = 64, 362 | .name = "movntdqa", 363 | .usage1 = "movntdqa", 364 | .usage2 = "use movntdqa to read from memory", 365 | .requires_arg = 0, 366 | .parallelism = 1, 367 | }, 368 | #endif 369 | #define PAR(n) \ 370 | { \ 371 | .fn = chase_parallel##n, .base_object_size = sizeof(void *), \ 372 | .name 
= "parallel" #n, .usage1 = "parallel" #n, \ 373 | .usage2 = "alternate " #n " non-dependent chases in each thread", \ 374 | .parallelism = n, \ 375 | } 376 | PAR(2), 377 | PAR(3), 378 | PAR(4), 379 | PAR(5), 380 | PAR(6), 381 | #if MAX_PARALLEL > 6 382 | PAR(7), 383 | PAR(8), 384 | PAR(9), 385 | PAR(10), 386 | #endif 387 | #undef PAR 388 | #if defined(__x86_64__) 389 | { 390 | .fn = chase_critword2, 391 | .base_object_size = 64, 392 | .name = "critword2", 393 | .usage1 = "critword2:N", 394 | .usage2 = "a two-parallel chase which reads at X and X+N", 395 | .requires_arg = 1, 396 | .parallelism = 1, 397 | }, 398 | #endif 399 | { 400 | .fn = chase_simple, 401 | .base_object_size = 64, 402 | .name = "critword", 403 | .usage1 = "critword:N", 404 | .usage2 = "a non-parallel chase which reads at X and X+N", 405 | .requires_arg = 1, 406 | .parallelism = 1, 407 | }, 408 | }; 409 | 410 | static pthread_mutex_t wait_mutex = PTHREAD_MUTEX_INITIALIZER; 411 | static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER; 412 | static size_t nr_to_startup; 413 | static int set_thread_affinity = 1; 414 | 415 | static void *thread_start(void *data) { 416 | per_thread_t *args = data; 417 | 418 | // ensure every thread has a different RNG 419 | rng_init(args->x.thread_num); 420 | 421 | if (set_thread_affinity) { 422 | // find out which cpus we can run on and move us to an appropriate cpu 423 | cpu_set_t cpus; 424 | if (sched_getaffinity(0, sizeof(cpus), &cpus)) { 425 | perror("sched_getaffinity"); 426 | exit(1); 427 | } 428 | int my_cpu; 429 | unsigned num = args->x.thread_num; 430 | for (my_cpu = 0; my_cpu < CPU_SETSIZE; ++my_cpu) { 431 | if (!CPU_ISSET(my_cpu, &cpus)) continue; 432 | if (num == 0) break; 433 | --num; 434 | } 435 | if (my_cpu == CPU_SETSIZE) { 436 | fprintf(stderr, "error: more threads than cpus available\n"); 437 | exit(1); 438 | } 439 | CPU_ZERO(&cpus); 440 | CPU_SET(my_cpu, &cpus); 441 | if (sched_setaffinity(0, sizeof(cpus), &cpus)) { 442 | 
perror("sched_setaffinity"); 443 | exit(1); 444 | } 445 | } 446 | 447 | // generate chases -- using a different mixer index for every 448 | // thread and for every parallel chase within a thread 449 | unsigned parallelism = args->x.chase->parallelism; 450 | for (unsigned par = 0; par < parallelism; ++par) { 451 | if (args->x.use_longer_chase) { 452 | args->x.cycle[par] = generate_chase_long(args->x.genchase_args, 453 | parallelism * args->x.thread_num + par, 454 | parallelism * args->x.nr_threads); 455 | } else { 456 | args->x.cycle[par] = generate_chase(args->x.genchase_args, 457 | parallelism * args->x.thread_num + par); 458 | } 459 | } 460 | 461 | // handle critword2 chases 462 | if (strcmp(args->x.chase->name, "critword2") == 0) { 463 | size_t offset = strtoul(args->x.extra_args, 0, 0); 464 | char *p = args->x.cycle[0]; 465 | char *q = p; 466 | do { 467 | char *next = *(char **)p; 468 | *(void **)(p + offset) = next + offset; 469 | p = next; 470 | } while (p != q); 471 | } 472 | 473 | // handle critword chases 474 | if (strcmp(args->x.chase->name, "critword") == 0) { 475 | size_t offset = strtoul(args->x.extra_args, 0, 0); 476 | char *p = args->x.cycle[0]; 477 | char *q = p; 478 | do { 479 | char *next = *(char **)p; 480 | *(void **)(p + offset) = next; 481 | *(void **)p = p + offset; 482 | p = next; 483 | } while (p != q); 484 | } 485 | 486 | // handle branch chases 487 | if (!strcmp(args->x.chase->name, "branch")) { 488 | void *p = args->x.cycle[0]; 489 | args->x.branch_chunk_size = convert_pointers_to_branches(p, 200); 490 | #if defined(__aarch64__) 491 | __builtin___clear_cache( 492 | args->x.genchase_args->arena, 493 | args->x.genchase_args->arena + args->x.genchase_args->total_memory); 494 | #endif 495 | } 496 | 497 | // now flush our caches 498 | if (args->x.cache_flush_size) { 499 | size_t nr_elts = args->x.cache_flush_size / sizeof(size_t); 500 | size_t *p = args->x.flush_arena; 501 | size_t sum = 0; 502 | while (nr_elts) { 503 | sum += *p; 504 | ++p; 
505 | --nr_elts; 506 | } 507 | args->x.dummy += sum; 508 | } 509 | 510 | // wait and/or wake up everyone if we're all ready 511 | pthread_mutex_lock(&wait_mutex); 512 | if (nr_to_startup == 1) { 513 | if (!strcmp(args->x.chase->name, "branch")) { 514 | // Change buffer to executable before letting the chases run. 515 | make_buffer_executable(args->x.genchase_args->arena, 516 | args->x.genchase_args->total_memory); 517 | } 518 | } 519 | --nr_to_startup; 520 | if (nr_to_startup) { 521 | pthread_cond_wait(&wait_cond, &wait_mutex); 522 | } else { 523 | pthread_cond_broadcast(&wait_cond); 524 | } 525 | pthread_mutex_unlock(&wait_mutex); 526 | 527 | args->x.chase->fn(data); 528 | return NULL; 529 | } 530 | 531 | static void timestamp(void) { 532 | if (!print_timestamp) return; 533 | struct timeval tv; 534 | gettimeofday(&tv, NULL); 535 | printf("%.6f ", tv.tv_sec + tv.tv_usec / 1000000.); 536 | } 537 | 538 | int main(int argc, char **argv) { 539 | char *p; 540 | int c; 541 | size_t i; 542 | size_t default_page_size = get_native_page_size(); 543 | size_t page_size = default_page_size; 544 | bool use_thp = false; 545 | bool use_malloc = false; 546 | bool use_longer_chase = false; 547 | size_t nr_threads = DEF_NR_THREADS; 548 | size_t nr_samples = DEF_NR_SAMPLES; 549 | size_t cache_flush_size = DEF_CACHE_FLUSH; 550 | size_t offset = DEF_OFFSET; 551 | int print_average = 0; 552 | const char *extra_args = NULL; 553 | const char *chase_optarg = chases[0].name; 554 | const chase_t *chase = &chases[0]; 555 | struct generate_chase_common_args genchase_args; 556 | int fd = -1; 557 | uint64_t duration_nsecs = -1; 558 | 559 | genchase_args.total_memory = DEF_TOTAL_MEMORY; 560 | genchase_args.stride = DEF_STRIDE; 561 | genchase_args.tlb_locality = DEF_TLB_LOCALITY * default_page_size; 562 | genchase_args.gen_permutation = gen_random_permutation; 563 | 564 | setvbuf(stdout, NULL, _IOLBF, BUFSIZ); 565 | 566 | while ((c = getopt(argc, argv, "ac:F:p:HLm:Nn:oO:S:s:T:t:vXyW:f:M")) != -1) { 
567 | switch (c) { 568 | case 'a': 569 | print_average = 1; 570 | break; 571 | case 'c': 572 | chase_optarg = optarg; 573 | p = strchr(optarg, ':'); 574 | if (p == NULL) p = optarg + strlen(optarg); 575 | for (i = 0; i < sizeof(chases) / sizeof(chases[0]); ++i) { 576 | if (strncmp(optarg, chases[i].name, p - optarg) == 0) { 577 | break; 578 | } 579 | } 580 | if (i == sizeof(chases) / sizeof(chases[0])) { 581 | fprintf(stderr, "not a recognized chase name: %s\n", optarg); 582 | goto usage; 583 | } 584 | chase = &chases[i]; 585 | if (chase->requires_arg) { 586 | if (p[0] != ':' || p[1] == 0) { 587 | fprintf(stderr, "that chase requires an argument:\n-c %s\t%s\n", 588 | chase->usage1, chase->usage2); 589 | exit(1); 590 | } 591 | extra_args = p + 1; 592 | } else if (*p != 0) { 593 | fprintf(stderr, "that chase does not take an argument:\n-c %s\t%s\n", 594 | chase->usage1, chase->usage2); 595 | exit(1); 596 | } 597 | break; 598 | case 'f': 599 | if (fd != -1) { 600 | fprintf(stderr, "only one file can be provided\n"); 601 | exit(1); 602 | } 603 | fd = open(optarg, O_RDWR); 604 | if (fd == -1) { 605 | perror("open"); 606 | exit(1); 607 | } 608 | break; 609 | case 'F': 610 | if (parse_mem_arg(optarg, &cache_flush_size)) { 611 | fprintf(stderr, 612 | "cache_flush_size must be a non-negative integer (suffixed " 613 | "with k, m, or g)\n"); 614 | exit(1); 615 | } 616 | break; 617 | case 'p': 618 | if (parse_mem_arg(optarg, &page_size)) { 619 | fprintf(stderr, 620 | "page size must be a non-negative integer (suffixed with k, " 621 | "m, or g)\n"); 622 | exit(1); 623 | } 624 | break; 625 | case 'H': 626 | use_thp = true; 627 | break; 628 | case 'M': 629 | use_malloc = true; 630 | break; 631 | case 'L': 632 | use_longer_chase = true; 633 | break; 634 | case 'm': 635 | if (parse_mem_arg(optarg, &genchase_args.total_memory) || 636 | genchase_args.total_memory == 0) { 637 | fprintf(stderr, 638 | "total_memory must be a positive integer (suffixed with k, m " 639 | "or g)\n"); 640 | 
exit(1); 641 | } 642 | break; 643 | case 'N': 644 | duration_nsecs = 0; 645 | break; 646 | case 'n': 647 | nr_samples = strtoul(optarg, &p, 0); 648 | if (*p) { 649 | fprintf(stderr, "nr_samples must be a non-negative integer\n"); 650 | exit(1); 651 | } 652 | break; 653 | case 'O': 654 | if (parse_mem_arg(optarg, &offset)) { 655 | fprintf(stderr, 656 | "offset must be a non-negative integer (suffixed with k, m, " 657 | "or g)\n"); 658 | exit(1); 659 | } 660 | break; 661 | case 'o': 662 | genchase_args.gen_permutation = gen_ordered_permutation; 663 | break; 664 | case 's': 665 | if (parse_mem_arg(optarg, &genchase_args.stride)) { 666 | fprintf( 667 | stderr, 668 | "stride must be a positive integer (suffixed with k, m, or g)\n"); 669 | exit(1); 670 | } 671 | break; 672 | case 'T': 673 | if (parse_mem_arg(optarg, &genchase_args.tlb_locality)) { 674 | fprintf(stderr, 675 | "tlb locality must be a positive integer (suffixed with k, " 676 | "m, or g)\n"); 677 | exit(1); 678 | } 679 | break; 680 | case 't': 681 | nr_threads = strtoul(optarg, &p, 0); 682 | if (*p || nr_threads == 0) { 683 | fprintf(stderr, "nr_threads must be positive integer\n"); 684 | exit(1); 685 | } 686 | break; 687 | case 'v': 688 | ++verbosity; 689 | break; 690 | case 'W': 691 | is_weighted_mbind = 1; 692 | char *tok = NULL, *saveptr = NULL; 693 | tok = strtok_r(optarg, ",", &saveptr); 694 | while (tok != NULL) { 695 | uint16_t node_id; 696 | uint16_t weight; 697 | int count = sscanf(tok, "%hu:%hu", &node_id, &weight); 698 | if (count != 2) { 699 | fprintf(stderr, "Expecting node_id:weight\n"); 700 | exit(1); 701 | } 702 | if (node_id >= sizeof(mbind_weights) / sizeof(mbind_weights[0])) { 703 | fprintf(stderr, "Maximum node_id is %lu\n", 704 | sizeof(mbind_weights) / sizeof(mbind_weights[0]) - 1); 705 | exit(1); 706 | } 707 | mbind_weights[node_id] = weight; 708 | tok = strtok_r(NULL, ",", &saveptr); 709 | } 710 | break; 711 | case 'X': 712 | set_thread_affinity = 0; 713 | break; 714 | case 'y': 715 
| print_timestamp = 1; 716 | break; 717 | default: 718 | goto usage; 719 | } 720 | } 721 | 722 | if (argc - optind != 0) { 723 | usage: 724 | fprintf(stderr, "usage: %s [options]\n", argv[0]); 725 | fprintf(stderr, 726 | "-a print average latency (default is best latency)\n"); 727 | fprintf(stderr, "-c chase select one of several different chases:\n"); 728 | for (i = 0; i < sizeof(chases) / sizeof(chases[0]); ++i) { 729 | fprintf(stderr, " %-12s%s\n", chases[i].usage1, chases[i].usage2); 730 | } 731 | fprintf(stderr, " default: %s\n", chases[0].name); 732 | fprintf(stderr, "-m nnnn[kmg] total memory size (default %zu)\n", 733 | DEF_TOTAL_MEMORY); 734 | fprintf(stderr, 735 | " NOTE: memory size will be rounded down to a " 736 | "multiple of -T option\n"); 737 | fprintf(stderr, 738 | "-N timed run (stops at nr_samples/2 seconds)\n"); 739 | fprintf(stderr, 740 | "-n nr_samples nr of 0.5 second samples to use (default %zu, 0 = " 741 | "infinite)\n", 742 | DEF_NR_SAMPLES); 743 | fprintf( 744 | stderr, 745 | "-o perform an ordered traversal (rather than random)\n"); 746 | fprintf(stderr, "-O nnnn[kmg] offset the entire chase by nnnn bytes\n"); 747 | fprintf(stderr, "-s nnnn[kmg] stride size (default %zu)\n", DEF_STRIDE); 748 | fprintf(stderr, "-T nnnn[kmg] TLB locality in bytes (default %zu)\n", 749 | DEF_TLB_LOCALITY * default_page_size); 750 | fprintf(stderr, 751 | " NOTE: TLB locality will be rounded down to a " 752 | "multiple of stride\n"); 753 | fprintf(stderr, "-t nr_threads number of threads (default %zu)\n", 754 | DEF_NR_THREADS); 755 | fprintf(stderr, "-p page_size backing page size to use (default %zu)\n", 756 | default_page_size); 757 | fprintf(stderr, "-f file mmap memory using the provided file\n"); 758 | fprintf(stderr, 759 | "-H use transparent hugepages (leave page size at " 760 | "default)\n"); 761 | fprintf(stderr, 762 | "-L use longer chase\n"); 763 | fprintf(stderr, 764 | "-M use malloc instead of mmap to allocate arena\n"); 765 | fprintf(stderr, 766 
| "-F nnnn[kmg] amount of memory to use to flush the caches after " 767 | "constructing\n" 768 | " the chase and before starting the benchmark (use " 769 | "with nta)\n" 770 | " default: %zu\n", 771 | DEF_CACHE_FLUSH); 772 | fprintf( 773 | stderr, 774 | "-W mbind list list of node:weight,... pairs for allocating memory\n" 775 | " has no effect if -H flag is specified\n" 776 | " 0:10,1:90 weights it as 10%% on 0 and 90%% on 1\n"); 777 | fprintf(stderr, "-X do not set thread affinity\n"); 778 | fprintf(stderr, "-y print timestamp in front of each line\n"); 779 | exit(1); 780 | } 781 | 782 | if (genchase_args.stride < sizeof(void *)) { 783 | fprintf(stderr, "stride must be at least %zu\n", sizeof(void *)); 784 | exit(1); 785 | } 786 | 787 | // ensure some sanity in the various arguments 788 | if (genchase_args.tlb_locality < genchase_args.stride) { 789 | genchase_args.tlb_locality = genchase_args.stride; 790 | } else { 791 | genchase_args.tlb_locality -= 792 | genchase_args.tlb_locality % genchase_args.stride; 793 | } 794 | 795 | if (genchase_args.total_memory < genchase_args.tlb_locality) { 796 | if (genchase_args.total_memory < genchase_args.stride) { 797 | genchase_args.total_memory = genchase_args.stride; 798 | } else { 799 | genchase_args.total_memory -= 800 | genchase_args.total_memory % genchase_args.stride; 801 | } 802 | genchase_args.tlb_locality = genchase_args.total_memory; 803 | } else { 804 | genchase_args.total_memory -= 805 | genchase_args.total_memory % genchase_args.tlb_locality; 806 | } 807 | 808 | genchase_args.nr_mixer_indices = 809 | genchase_args.stride / chase->base_object_size; 810 | if (genchase_args.nr_mixer_indices < nr_threads * chase->parallelism) { 811 | fprintf(stderr, 812 | "the stride is too small to interleave that many threads, need at " 813 | "least %zu bytes\n", 814 | nr_threads * chase->parallelism * chase->base_object_size); 815 | exit(1); 816 | } 817 | 818 | if (verbosity > 0) { 819 | printf("nr_mixer_indices = %zu\n", 
genchase_args.nr_mixer_indices); 820 | printf("base_object_size = %zu\n", chase->base_object_size); 821 | printf("nr_threads = %zu\n", nr_threads); 822 | print_page_size(page_size, use_thp); 823 | printf("total_memory = %zu (%.1f MiB)\n", genchase_args.total_memory, 824 | genchase_args.total_memory / (1024. * 1024.)); 825 | printf("stride = %zu\n", genchase_args.stride); 826 | printf("tlb_locality = %zu\n", genchase_args.tlb_locality); 827 | printf("chase = %s\n", chase_optarg); 828 | if (use_malloc) printf("malloc allocation\n"); 829 | } 830 | 831 | rng_init(1); 832 | 833 | generate_chase_mixer(&genchase_args, nr_threads * chase->parallelism); 834 | 835 | // generate the chases by launching multiple threads 836 | if (use_malloc) { 837 | char *buf = malloc(genchase_args.total_memory + offset); 838 | if (!buf) { // check the malloc result before applying the offset 839 | perror("malloc"); 840 | exit(1); 841 | } 842 | genchase_args.arena = buf + offset; 843 | } else { 844 | genchase_args.arena = 845 | (char *)alloc_arena_mmap(page_size, use_thp, 846 | genchase_args.total_memory + offset, fd) + 847 | offset; 848 | } 849 | per_thread_t *thread_data = alloc_arena_mmap( 850 | default_page_size, false, nr_threads * sizeof(per_thread_t), -1); 851 | void *flush_arena = NULL; 852 | if (cache_flush_size) { 853 | flush_arena = alloc_arena_mmap(default_page_size, false, cache_flush_size, 854 | -1); 855 | memset(flush_arena, 1, cache_flush_size); // ensure pages are mapped 856 | } 857 | 858 | pthread_t thread; 859 | nr_to_startup = nr_threads; 860 | for (i = 0; i < nr_threads; ++i) { 861 | thread_data[i].x.genchase_args = &genchase_args; 862 | thread_data[i].x.nr_threads = nr_threads; 863 | thread_data[i].x.thread_num = i; 864 | thread_data[i].x.extra_args = extra_args; 865 | thread_data[i].x.chase = chase; 866 | thread_data[i].x.flush_arena = flush_arena; 867 | thread_data[i].x.cache_flush_size = cache_flush_size; 868 | thread_data[i].x.use_longer_chase = use_longer_chase; 869 | if (pthread_create(&thread, NULL,
thread_start, &thread_data[i])) { 870 | perror("pthread_create"); 871 | exit(1); 872 | } 873 | } 874 | 875 | // now wait for them all to finish generating their chases and start chasing 876 | pthread_mutex_lock(&wait_mutex); 877 | if (nr_to_startup) { 878 | pthread_cond_wait(&wait_cond, &wait_mutex); 879 | } 880 | pthread_mutex_unlock(&wait_mutex); 881 | 882 | // now start sampling their progress 883 | if (!duration_nsecs) 884 | duration_nsecs = nr_samples * DELAY_USECS * NSECS_PER_USEC; 885 | nr_samples = nr_samples + 1; // we drop the first sample 886 | uint64_t *cur_samples = alloca(nr_threads * sizeof(*cur_samples)); 887 | uint64_t last_sample_time = now_nsec(); 888 | uint64_t start_time = last_sample_time; 889 | uint64_t total = 0; 890 | uint64_t stalls = 0; 891 | double best = 1. / 0.; 892 | double running_sum = 0.; 893 | if (verbosity > 0) 894 | printf("samples (one column per thread, one row per sample):\n"); 895 | for (size_t sample_no = 0; nr_samples == 1 || sample_no < nr_samples; 896 | ++sample_no) { 897 | usleep(DELAY_USECS); 898 | 899 | uint64_t sum = 0; 900 | for (i = 0; i < nr_threads; ++i) { 901 | cur_samples[i] = __sync_lock_test_and_set(&thread_data[i].x.count, 0); 902 | sum += cur_samples[i]; 903 | } 904 | 905 | uint64_t cur_sample_time = now_nsec(); 906 | uint64_t time_delta = cur_sample_time - last_sample_time; 907 | last_sample_time = cur_sample_time; 908 | 909 | // we drop the first sample because it's fairly likely one 910 | // thread had some advantage initially due to still having 911 | // portions of the chase in a cache. 912 | if (sample_no == 0) continue; 913 | 914 | if (verbosity > 0) { 915 | timestamp(); 916 | for (i = 0; i < nr_threads; ++i) { 917 | double z = time_delta / (double)cur_samples[i]; 918 | printf(" %6.*f", z < 100. ? 
3 : 1, z); 919 | } 920 | } 921 | 922 | total += sum; 923 | if (!sum) 924 | stalls++; 925 | double t = time_delta / (double)sum; 926 | running_sum += t; 927 | if (t < best) { 928 | best = t; 929 | } 930 | if (verbosity > 0) { 931 | double z = t * nr_threads; 932 | printf(" avg=%.*f\n", z < 100. ? 3 : 1, z); 933 | } 934 | 935 | if (last_sample_time - start_time > duration_nsecs) { 936 | printf("timed out: %lu samples, %lu stalls, %lu iterations per msec\n", 937 | sample_no, stalls, total * NSECS_PER_MSEC / duration_nsecs); 938 | break; 939 | } 940 | } 941 | timestamp(); 942 | double res; 943 | if (print_average) { 944 | res = running_sum * nr_threads / (nr_samples - 1); 945 | } else { 946 | res = best * nr_threads; 947 | } 948 | printf("%6.*f\n", res < 100. ? 3 : 1, res); 949 | 950 | exit(0); 951 | } 952 | -------------------------------------------------------------------------------- /multiload.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
| */ 14 | #define _GNU_SOURCE 15 | #include <alloca.h> 16 | #include <errno.h> 17 | #include <fcntl.h> 18 | #include <getopt.h> 19 | #include <inttypes.h> 20 | #include <limits.h> 21 | #include <pthread.h> 22 | #include <sched.h> 23 | #include <stdbool.h> 24 | #include <stdio.h> 25 | #include <stdlib.h> 26 | #include <string.h> 27 | #include <unistd.h> 28 | 29 | #include "arena.h" 30 | #include "cpu_util.h" 31 | #include "expand.h" 32 | #include "permutation.h" 33 | #include "timer.h" 34 | #include "util.h" 35 | 36 | #if defined(__x86_64__) 37 | #include <x86intrin.h> 38 | #endif 39 | 40 | // The total memory, stride, and TLB locality have been chosen carefully for 41 | // the current generation of CPUs: 42 | // 43 | // - at stride of 64 bytes the L2 next-line prefetch on p-m/core/core2 gives a 44 | // helping hand 45 | // 46 | // - at stride of 128 bytes the stream prefetcher on various p4 decides the 47 | // random accesses sometimes look like a stream and gives a helping hand. 48 | // 49 | // - the TLB locality could have been raised beyond 4 pages to defeat various 50 | // stream prefetchers, but you need to get out well past 32 pages before 51 | // all existing hw prefetchers are defeated, and then you start exceeding the 52 | // TLB locality on several CPUs and incurring some TLB overhead. 53 | // Hence, the default has been changed from 16 pages to 64 pages.
54 | // 55 | #define DEF_TOTAL_MEMORY ((size_t)256 * 1024 * 1024) 56 | #define DEF_STRIDE ((size_t)256) 57 | #define DEF_NR_SAMPLES ((size_t)5) 58 | #define DEF_TLB_LOCALITY ((size_t)64) 59 | #define DEF_NR_THREADS ((size_t)1) 60 | #define DEF_CACHE_FLUSH ((size_t)64 * 1024 * 1024) 61 | #define DEF_OFFSET ((size_t)0) 62 | #define DEF_DELAY ((size_t)1) 63 | #define DEF_CACHELINE ((size_t)64) 64 | 65 | #define LOAD_DELAY_WARMUP_uS \ 66 | 4000000 // Latency & Load thread warmup before data sampling starts 67 | #define LOAD_DELAY_RUN_uS 2000000 // Data sampling request frequency 68 | #define LOAD_DELAY_SAMPLE_uS \ 69 | 10000 // Data sample polling loop delay waiting for load threads to update 70 | // mutex variable 71 | 72 | typedef enum { RUN_CHASE, RUN_BANDWIDTH, RUN_CHASE_LOADED } test_type_t; 73 | static volatile uint64_t use_result_dummy = 0x0123456789abcdef; 74 | 75 | static size_t default_page_size; 76 | static size_t page_size; 77 | static bool use_thp; 78 | int verbosity; 79 | int print_timestamp; 80 | int is_weighted_mbind; 81 | uint16_t mbind_weights[MAX_MEM_NODES]; 82 | 83 | #ifdef __i386__ 84 | #define MAX_PARALLEL (6) // maximum number of chases in parallel 85 | #else 86 | #define MAX_PARALLEL (10) 87 | #endif 88 | 89 | // forward declare 90 | typedef struct chase_t chase_t; 91 | 92 | // the arguments for the chase threads 93 | typedef union { 94 | char pad[AVOID_FALSE_SHARING]; 95 | struct { 96 | unsigned thread_num; // which thread is this 97 | volatile uint64_t count; // return thread measurement - need 64 bits when 98 | // passing bandwidth 99 | void *cycle[MAX_PARALLEL]; // initial address for the chases 100 | const char *extra_args; 101 | int dummy; // useful for confusing the compiler 102 | 103 | const struct generate_chase_common_args *genchase_args; 104 | size_t nr_threads; 105 | const chase_t *chase; 106 | void *flush_arena; 107 | size_t cache_flush_size; 108 | bool use_longer_chase; 109 | test_type_t run_test_type; // test type: chase or memory 
bandwidth 110 | const chase_t *memload; // memory bandwidth function 111 | char *load_arena; // load memory buffer used by this thread 112 | size_t load_total_memory; // load size of the arena 113 | size_t load_offset; // load offset of the arena 114 | size_t load_tlb_locality; // group accesses within this range in order to 115 | // amortize TLB fills 116 | volatile size_t sample_no; // flag from main thread to tell bandwidth 117 | // thread to start the next sample. 118 | size_t delay; // injection delay for load 119 | } x; 120 | } per_thread_t; 121 | 122 | int always_zero; 123 | 124 | static void chase_simple(per_thread_t *t) { 125 | void *p = t->x.cycle[0]; 126 | 127 | do { 128 | x200(p = *(void **)p;) 129 | } while (__sync_add_and_fetch(&t->x.count, 200)); 130 | 131 | // we never actually reach here, but the compiler doesn't know that 132 | t->x.dummy = (uintptr_t)p; 133 | } 134 | 135 | // parallel chases 136 | 137 | #define declare(i) void *p##i = start[i]; 138 | #define cleanup(i) tmp += (uintptr_t)p##i; 139 | #if MAX_PARALLEL == 6 140 | #define parallel(foo) foo(0) foo(1) foo(2) foo(3) foo(4) foo(5) 141 | #else 142 | #define parallel(foo) \ 143 | foo(0) foo(1) foo(2) foo(3) foo(4) foo(5) foo(6) foo(7) foo(8) foo(9) 144 | #endif 145 | 146 | #define template(n, expand, inner) \ 147 | static void chase_parallel##n(per_thread_t *t) { \ 148 | void **start = t->x.cycle; \ 149 | parallel(declare) do { x##expand(inner) } \ 150 | while (__sync_add_and_fetch(&t->x.count, n * expand)) \ 151 | ; \ 152 | \ 153 | uintptr_t tmp = 0; \ 154 | parallel(cleanup) t->x.dummy = tmp; \ 155 | } 156 | 157 | #if defined(__x86_64__) || defined(__i386__) 158 | #define D(n) asm volatile("mov (%1),%0" : "=r"(p##n) : "r"(p##n)); 159 | #else 160 | #define D(n) p##n = *(void **)p##n; 161 | #endif 162 | template(2, 100, D(0) D(1)); 163 | template(3, 66, D(0) D(1) D(2)); 164 | template(4, 50, D(0) D(1) D(2) D(3)); 165 | template(5, 40, D(0) D(1) D(2) D(3) D(4)); 166 | template(6, 32, D(0
D(1) D(2) D(3) D(4) D(5)); 167 | #if MAX_PARALLEL > 6 168 | template(7, 28, D(0) D(1) D(2) D(3) D(4) D(5) D(6)); 169 | template(8, 24, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7)); 170 | template(9, 22, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7) D(8)); 171 | template(10, 20, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7) D(8) D(9)); 172 | #endif 173 | #undef D 174 | #undef parallel 175 | #undef cleanup 176 | #undef declare 177 | 178 | static void chase_work(per_thread_t *t) { 179 | void *p = t->x.cycle[0]; 180 | size_t extra_work = strtoul(t->x.extra_args, 0, 0); 181 | size_t work = 0; 182 | size_t i; 183 | 184 | // the extra work is intended to be overlapped with a dereference, 185 | // but we don't want it to skip past the next dereference. so 186 | // we fold in the value of the pointer, and launch the deref then 187 | // go into a loop performing extra work, hopefully while the 188 | // deref occurs. 189 | do { 190 | x25(work += (uintptr_t)p; p = *(void **)p; 191 | for (i = 0; i < extra_work; ++i) { work ^= i; }) 192 | } while (__sync_add_and_fetch(&t->x.count, 25)); 193 | 194 | // we never actually reach here, but the compiler doesn't know that 195 | t->x.cycle[0] = p; 196 | t->x.dummy = work; 197 | } 198 | 199 | struct incr_struct { 200 | struct incr_struct *next; 201 | unsigned incme; 202 | }; 203 | 204 | static void chase_incr(per_thread_t *t) { 205 | struct incr_struct *p = t->x.cycle[0]; 206 | 207 | do { 208 | x50(++p->incme; p = *(void **)p;) 209 | } while (__sync_add_and_fetch(&t->x.count, 50)); 210 | 211 | // we never actually reach here, but the compiler doesn't know that 212 | t->x.cycle[0] = p; 213 | } 214 | 215 | #if defined(__x86_64__) || defined(__i386__) 216 | #define chase_prefetch(type) \ 217 | static void chase_prefetch##type(per_thread_t *t) { \ 218 | void *p = t->x.cycle[0]; \ 219 | \ 220 | do { \ 221 | x100(asm volatile("prefetch" #type " %0" ::"m"(*(void **)p)); \ 222 | p = *(void **)p;) \ 223 | } while (__sync_add_and_fetch(&t->x.count, 100)); \ 224 | 
\ 225 | /* we never actually reach here, but the compiler doesn't know that */ \ 226 | t->x.cycle[0] = p; \ 227 | } 228 | chase_prefetch(t0); 229 | chase_prefetch(t1); 230 | chase_prefetch(t2); 231 | chase_prefetch(nta); 232 | #undef chase_prefetch 233 | #endif 234 | 235 | #if defined(__x86_64__) 236 | static void chase_movdqa(per_thread_t *t) { 237 | void *p = t->x.cycle[0]; 238 | 239 | do { 240 | x100(asm volatile("\n movdqa (%%rax),%%xmm0" 241 | "\n movdqa 16(%%rax),%%xmm1" 242 | "\n paddq %%xmm1,%%xmm0" 243 | "\n movdqa 32(%%rax),%%xmm2" 244 | "\n paddq %%xmm2,%%xmm0" 245 | "\n movdqa 48(%%rax),%%xmm3" 246 | "\n paddq %%xmm3,%%xmm0" 247 | "\n movq %%xmm0,%%rax" 248 | : "=a"(p) 249 | : "0"(p));) 250 | } while (__sync_add_and_fetch(&t->x.count, 100)); 251 | t->x.cycle[0] = p; 252 | } 253 | 254 | static void chase_movntdqa(per_thread_t *t) { 255 | void *p = t->x.cycle[0]; 256 | 257 | do { 258 | x100(asm volatile( 259 | #ifndef BINUTILS_HAS_MOVNTDQA 260 | "\n .byte 0x66,0x0f,0x38,0x2a,0x00" 261 | "\n .byte 0x66,0x0f,0x38,0x2a,0x48,0x10" 262 | "\n paddq %%xmm1,%%xmm0" 263 | "\n .byte 0x66,0x0f,0x38,0x2a,0x50,0x20" 264 | "\n paddq %%xmm2,%%xmm0" 265 | "\n .byte 0x66,0x0f,0x38,0x2a,0x58,0x30" 266 | "\n paddq %%xmm3,%%xmm0" 267 | "\n movq %%xmm0,%%rax" 268 | #else 269 | "\n movntdqa (%%rax),%%xmm0" 270 | "\n movntdqa 16(%%rax),%%xmm1" 271 | "\n paddq %%xmm1,%%xmm0" 272 | "\n movntdqa 32(%%rax),%%xmm2" 273 | "\n paddq %%xmm2,%%xmm0" 274 | "\n movntdqa 48(%%rax),%%xmm3" 275 | "\n paddq %%xmm3,%%xmm0" 276 | "\n movq %%xmm0,%%rax" 277 | #endif 278 | : "=a"(p) 279 | : "0"(p));) 280 | } while (__sync_add_and_fetch(&t->x.count, 100)); 281 | t->x.cycle[0] = p; 282 | } 283 | 284 | static void chase_critword2(per_thread_t *t) { 285 | void *p = t->x.cycle[0]; 286 | size_t offset = strtoul(t->x.extra_args, 0, 0); 287 | void *q = (char *)p + offset; 288 | 289 | do { 290 | x100(asm volatile("mov (%1),%0" 291 | : "=r"(p) 292 | : "r"(p)); 293 | asm volatile("mov (%1),%0" 294 | : 
"=r"(q) 295 | : "r"(q));) 296 | } while (__sync_add_and_fetch(&t->x.count, 100)); 297 | 298 | t->x.cycle[0] = (void *)((uintptr_t)p + (uintptr_t)q); 299 | } 300 | #endif 301 | 302 | struct chase_t { 303 | void (*fn)(per_thread_t *t); 304 | size_t base_object_size; 305 | const char *name; 306 | const char *usage1; 307 | const char *usage2; 308 | int requires_arg; 309 | unsigned parallelism; // number of parallel chases (at least 1) 310 | }; 311 | static const chase_t chases[] = { 312 | // the default must be first 313 | { 314 | .fn = chase_simple, 315 | .base_object_size = sizeof(void *), 316 | .name = "simple ", 317 | .usage1 = "simple", 318 | .usage2 = "no frills pointer dereferencing", 319 | .requires_arg = 0, 320 | .parallelism = 1, 321 | }, 322 | { 323 | .fn = chase_simple, 324 | .base_object_size = sizeof(void *), 325 | .name = "chaseload", 326 | .usage1 = "chaseload", 327 | .usage2 = "runs simple chase with multiple memory bandwidth loads", 328 | .requires_arg = 0, 329 | .parallelism = 1, 330 | }, 331 | { 332 | .fn = chase_work, 333 | .base_object_size = sizeof(void *), 334 | .name = "work ", 335 | .usage1 = "work:N", 336 | .usage2 = "loop simple computation N times in between derefs", 337 | .requires_arg = 1, 338 | .parallelism = 1, 339 | }, 340 | { 341 | .fn = chase_incr, 342 | .base_object_size = sizeof(struct incr_struct), 343 | .name = "incr ", 344 | .usage1 = "incr", 345 | .usage2 = "modify the cache line after each deref", 346 | .requires_arg = 0, 347 | .parallelism = 1, 348 | }, 349 | #if defined(__x86_64__) || defined(__i386__) 350 | #define chase_prefetch(type) \ 351 | { \ 352 | .fn = chase_prefetch##type, .base_object_size = sizeof(void *), \ 353 | .name = #type, .usage1 = #type, \ 354 | .usage2 = "perform prefetch" #type " before each deref", \ 355 | .requires_arg = 0, .parallelism = 1, \ 356 | } 357 | chase_prefetch(t0), 358 | chase_prefetch(t1), 359 | chase_prefetch(t2), 360 | chase_prefetch(nta), 361 | #endif 362 | #if defined(__x86_64__) 363 | 
{ 364 | .fn = chase_movdqa, 365 | .base_object_size = 64, 366 | .name = "movdqa", 367 | .usage1 = "movdqa", 368 | .usage2 = "use movdqa to read from memory", 369 | .requires_arg = 0, 370 | .parallelism = 1, 371 | }, 372 | { 373 | .fn = chase_movntdqa, 374 | .base_object_size = 64, 375 | .name = "movntdqa", 376 | .usage1 = "movntdqa", 377 | .usage2 = "use movntdqa to read from memory", 378 | .requires_arg = 0, 379 | .parallelism = 1, 380 | }, 381 | #endif 382 | #define PAR(n) \ 383 | { \ 384 | .fn = chase_parallel##n, .base_object_size = sizeof(void *), \ 385 | .name = "parallel" #n, .usage1 = "parallel" #n, \ 386 | .usage2 = "alternate " #n " non-dependent chases in each thread", \ 387 | .parallelism = n, \ 388 | } 389 | PAR(2), 390 | PAR(3), 391 | PAR(4), 392 | PAR(5), 393 | PAR(6), 394 | #if MAX_PARALLEL > 6 395 | PAR(7), 396 | PAR(8), 397 | PAR(9), 398 | PAR(10), 399 | #endif 400 | #undef PAR 401 | #if defined(__x86_64__) 402 | { 403 | .fn = chase_critword2, 404 | .base_object_size = 64, 405 | .name = "critword2", 406 | .usage1 = "critword2:N", 407 | .usage2 = "a two-parallel chase which reads at X and X+N", 408 | .requires_arg = 1, 409 | .parallelism = 1, 410 | }, 411 | #endif 412 | { 413 | .fn = chase_simple, 414 | .base_object_size = 64, 415 | .name = "critword", 416 | .usage1 = "critword:N", 417 | .usage2 = "a non-parallel chase which reads at X and X+N", 418 | .requires_arg = 1, 419 | .parallelism = 1, 420 | }, 421 | }; 422 | 423 | //======================================================================================================== 424 | // Memory Bandwidth load generation 425 | //======================================================================================================== 426 | #define LOAD_MEMORY_INIT_MIBPS \ 427 | register uint64_t loops = 0; \ 428 | size_t cur_sample = -1, nxt_sample = 0; \ 429 | double time0, time1, timetot; \ 430 | double bite_sum = 0, mibps = 0; /*mibps = MiB per sec.*/ \ 431 | time0 = (double)now_nsec(); 432 | 433 | 
#define LOAD_MEMORY_SAMPLE_MIBPS \ 434 | loops++; \ 435 | /* Main thread will increment x.sample_no when it wants a sample. */ \ 436 | nxt_sample = t->x.sample_no; \ 437 | /* printf("T(%d)%ld:%ld,%ld ", t->x.thread_num, cur_sample, nxt_sample, \ 438 | * t->x.count ); */ \ 439 | /* if ( (loops&0xF)==0xf ) printf("T(%d)%ld:%ld,%ld ", t->x.thread_num, \ 440 | * cur_sample, nxt_sample, t->x.count ); */ \ 441 | if ((cur_sample != nxt_sample) && (t->x.count == 0)) { \ 442 | time1 = (double)now_nsec(); \ 443 | bite_sum = loops * load_bites; \ 444 | timetot = time1 - time0; \ 445 | mibps = (bite_sum * 1000000000.0) / (timetot * 1024 * 1024); \ 446 | /* printf(" T(%d)=%.0f:%ld:%ld ", t->x.thread_num, mibps, cur_sample, \ 447 | * nxt_sample ); */ \ 448 | /* printf(" %ld,%ld,%ld,%.0f:%.0f(%.1fMiBs)\n", cur_sample, loops, \ 449 | * t->x.count, bite_sum, timetot, mibps); */ \ 450 | /* update the MiB/s count. Main thread will read and set to 0 so we know \ 451 | * this sample is done. */ \ 452 | __sync_add_and_fetch(&t->x.count, (uint64_t)mibps); \ 453 | cur_sample = nxt_sample; \ 454 | loops = 0; \ 455 | time0 = (double)now_nsec(); \ 456 | } 457 | 458 | //-------------------------------------------------------------------------------------------------------- 459 | static void load_memcpy_libc(per_thread_t *t) { 460 | #define LOOP_OPS 2 461 | uint64_t load_loop = t->x.load_total_memory / LOOP_OPS; 462 | uint64_t load_bites = load_loop * LOOP_OPS; 463 | register char *a = (char *)t->x.load_arena; 464 | register char *b = a + load_loop; 465 | register char *tmp; 466 | 467 | LOAD_MEMORY_INIT_MIBPS 468 | do { 469 | tmp = a; 470 | a = b; 471 | b = tmp; 472 | memcpy((void *)a, (void *)b, load_loop); 473 | LOAD_MEMORY_SAMPLE_MIBPS 474 | } while (1); 475 | #undef LOOP_OPS 476 | } 477 | 478 | //-------------------------------------------------------------------------------------------------------- 479 | static void load_memset_libc(per_thread_t *t) { 480 | #define LOOP_OPS 1 481 | 
uint64_t load_bites = t->x.load_total_memory * LOOP_OPS; 482 | register char *a = (char *)t->x.load_arena; 483 | 484 | LOAD_MEMORY_INIT_MIBPS 485 | do { 486 | memset((void *)a, 0xdeadbeef, load_bites); 487 | LOAD_MEMORY_SAMPLE_MIBPS 488 | } while (1); 489 | #undef LOOP_OPS 490 | } 491 | 492 | //-------------------------------------------------------------------------------------------------------- 493 | static void load_memsetz_libc(per_thread_t *t) { 494 | #define LOOP_OPS 1 495 | uint64_t load_bites = t->x.load_total_memory * LOOP_OPS; 496 | register char *a = (char *)t->x.load_arena; 497 | 498 | LOAD_MEMORY_INIT_MIBPS 499 | do { 500 | memset((void *)a, 0, load_bites); 501 | LOAD_MEMORY_SAMPLE_MIBPS 502 | } while (1); 503 | #undef LOOP_OPS 504 | } 505 | 506 | //-------------------------------------------------------------------------------------------------------- 507 | static void load_stream_triad(per_thread_t *t) { 508 | #define LOOP_OPS 3 509 | #define LOOP_ALIGN 16 510 | uint64_t load_loop, load_bites; 511 | register uint64_t N, i; 512 | register double *a; 513 | register double *b; 514 | register double *c; 515 | register double *tmp; 516 | 517 | load_loop = 518 | t->x.load_total_memory - 519 | (LOOP_OPS * LOOP_ALIGN); // subtract to allow aligning count/addresses 520 | load_loop = (load_loop / LOOP_OPS) & 521 | ~(LOOP_ALIGN - 1); // divide by 3 buffers and align byte count on 522 | // LOOP_ALIGN byte multiple 523 | N = load_loop / sizeof(double); 524 | load_bites = N * sizeof(double) * LOOP_OPS; 525 | size_t aa = (((size_t)t->x.load_arena + LOOP_ALIGN) & 526 | ~(LOOP_ALIGN - 1)); // align on 16 byte address 527 | a = (double *)aa; 528 | b = a + N; 529 | c = b + N; 530 | if (verbosity > 1) { 531 | printf( 532 | "load_arena=%p, load_total_memory=0x%lX, load_loop=0x%lX, N=0x%lX, " 533 | "a=%p, b=%p, c=%p\n", 534 | (char *)t->x.load_arena, t->x.load_total_memory, load_loop, N, a, b, c); 535 | } 536 | 537 | LOAD_MEMORY_INIT_MIBPS 538 | do { 539 | tmp = a; 540 | a =
b; 541 | b = c; 542 | c = tmp; 543 | 544 | for (i = 0; i < N; ++i) { 545 | a[i] = b[i] + c[i]; 546 | } 547 | LOAD_MEMORY_SAMPLE_MIBPS 548 | } while (1); 549 | #undef LOOP_OPS 550 | #undef LOOP_ALIGN 551 | } 552 | 553 | static inline void delay_until_iteration(uint64_t iteration) { 554 | while (iteration--) { 555 | asm volatile("nop"); 556 | } 557 | } 558 | 559 | //-------------------------------------------------------------------------------------------------------- 560 | // Stream triad pattern with nontemporal hint and adjustable injection delay 561 | static void load_stream_triad_nontemporal_injection_delay(per_thread_t *t) { 562 | #define LOOP_OPS 3 563 | #define LOOP_ALIGN 16 564 | uint64_t load_loop, load_bites; 565 | register uint64_t N, i; 566 | register uint64_t *a; 567 | register uint64_t *b; 568 | register uint64_t *c; 569 | register uint64_t *tmp; 570 | const uint64_t num_elem_twocachelines = DEF_CACHELINE / sizeof(uint64_t) * 2; 571 | 572 | load_loop = 573 | t->x.load_total_memory - 574 | (LOOP_OPS * LOOP_ALIGN); // subtract to allow aligning count/addresses 575 | load_loop = (load_loop / LOOP_OPS) & 576 | ~(LOOP_ALIGN - 1); // divide by 3 buffers and align byte count on 577 | // LOOP_ALIGN byte multiple 578 | N = load_loop / sizeof(uint64_t); 579 | load_bites = N * sizeof(uint64_t) * LOOP_OPS; 580 | size_t aa = (((size_t)t->x.load_arena + LOOP_ALIGN) & 581 | ~(LOOP_ALIGN - 1)); // align on 16 byte address 582 | a = (uint64_t *)aa; 583 | b = a + N; 584 | c = b + N; 585 | 586 | if (verbosity > 1) { 587 | printf( 588 | "load_arena=%p, load_total_memory=0x%lX, load_loop=0x%lX, N=0x%lX, " 589 | "a=%p, b=%p, c=%p\n", 590 | (char *)t->x.load_arena, t->x.load_total_memory, load_loop, N, a, b, c); 591 | } 592 | 593 | LOAD_MEMORY_INIT_MIBPS 594 | do { 595 | tmp = a; 596 | a = b; 597 | b = c; 598 | c = tmp; 599 | 600 | for (i = 0; i < N; i += 2) { 601 | if (i % num_elem_twocachelines == 0) delay_until_iteration(t->x.delay); 602 | #if defined(__aarch64__) 603 | asm
volatile ("stnp %0, %1, [%2]" :: "r"(b[i]+c[i]), "r"(b[i+1]+c[i+1]), "r" (a+i)); 604 | #elif defined(__x86_64__) 605 | _mm_stream_si64(&((long long*)a)[i], b[i]+c[i]); 606 | _mm_stream_si64(&((long long*)a)[i+1], b[i+1]+c[i+1]); 607 | #else 608 | a[i] = b[i] + c[i]; 609 | a[i+1] = b[i+1] + c[i+1]; 610 | #endif 611 | } 612 | LOAD_MEMORY_SAMPLE_MIBPS 613 | } while (1); 614 | #undef LOOP_OPS 615 | #undef LOOP_ALIGN 616 | } 617 | 618 | //-------------------------------------------------------------------------------------------------------- 619 | static void load_stream_copy(per_thread_t *t) { 620 | #define LOOP_OPS 2 621 | uint64_t load_loop = t->x.load_total_memory / LOOP_OPS; 622 | register uint64_t N = load_loop / sizeof(double); 623 | uint64_t load_bites = N * sizeof(double) * LOOP_OPS; 624 | register uint64_t i; 625 | register double *a = (double *)t->x.load_arena; 626 | register double *b = a + N; 627 | register double *tmp; 628 | 629 | LOAD_MEMORY_INIT_MIBPS 630 | do { 631 | tmp = a; 632 | a = b; 633 | b = tmp; 634 | for (i = 0; i < N; ++i) { 635 | b[i] = a[i]; 636 | } 637 | LOAD_MEMORY_SAMPLE_MIBPS 638 | } while (1); 639 | #undef LOOP_OPS 640 | } 641 | 642 | //-------------------------------------------------------------------------------------------------------- 643 | static void load_stream_sum(per_thread_t *t) { 644 | #define LOOP_OPS 1 645 | uint64_t load_loop = t->x.load_total_memory / LOOP_OPS; 646 | register uint64_t N = load_loop / sizeof(uint64_t); 647 | uint64_t load_bites = N * sizeof(uint64_t) * LOOP_OPS; 648 | register uint64_t i; 649 | register uint64_t *a = (uint64_t *)t->x.load_arena; 650 | register uint64_t s = 0; 651 | 652 | LOAD_MEMORY_INIT_MIBPS 653 | do { 654 | for (i = 0; i < N; ++i) { 655 | s += a[i]; 656 | } 657 | LOAD_MEMORY_SAMPLE_MIBPS 658 | use_result_dummy += s; 659 | } while (1); 660 | #undef LOOP_OPS 661 | } 662 | 663 | //-------------------------------------------------------------------------------------------------------- 664 
| static const chase_t memloads[] = { 665 | // the default must be first 666 | { 667 | .fn = load_memcpy_libc, 668 | .base_object_size = sizeof(void *), 669 | .name = "memcpy-libc", 670 | .usage1 = "memcpy-libc", 671 | .usage2 = "1:1 rd:wr - memcpy()", 672 | .requires_arg = 0, 673 | .parallelism = 0, 674 | }, 675 | { 676 | .fn = load_memset_libc, 677 | .base_object_size = sizeof(void *), 678 | .name = "memset-libc", 679 | .usage1 = "memset-libc", 680 | .usage2 = "0:1 rd:wr - memset() non-zero data", 681 | .requires_arg = 0, 682 | .parallelism = 0, 683 | }, 684 | { 685 | .fn = load_memsetz_libc, 686 | .base_object_size = sizeof(void *), 687 | .name = "memsetz-libc", 688 | .usage1 = "memsetz-libc", 689 | .usage2 = "0:1 rd:wr - memset() zero data", 690 | .requires_arg = 0, 691 | .parallelism = 0, 692 | }, 693 | { 694 | .fn = load_stream_copy, 695 | .base_object_size = sizeof(void *), 696 | .name = "stream-copy", 697 | .usage1 = "stream-copy", 698 | .usage2 = "1:1 rd:wr - lmbench stream copy ", 699 | .requires_arg = 0, 700 | .parallelism = 0, 701 | }, 702 | { 703 | .fn = load_stream_sum, 704 | .base_object_size = sizeof(void *), 705 | .name = "stream-sum", 706 | .usage1 = "stream-sum", 707 | .usage2 = "1:0 rd:wr - lmbench stream sum ", 708 | .requires_arg = 0, 709 | .parallelism = 0, 710 | }, 711 | { 712 | .fn = load_stream_triad, 713 | .base_object_size = sizeof(void *), 714 | .name = "stream-triad", 715 | .usage1 = "stream-triad", 716 | .usage2 = "2:1 rd:wr - lmbench stream triad a[i]=b[i]+c[i]", 717 | .requires_arg = 0, 718 | .parallelism = 0, 719 | }, 720 | { 721 | .fn = load_stream_triad_nontemporal_injection_delay, 722 | .base_object_size = sizeof(void *), 723 | .name = "stream-triad-nontemporal-injection-delay", 724 | .usage1 = "stream-triad-nontemporal-injection-delay", 725 | .usage2 = "2:1 rd:wr - lmbench stream triad with nontemporal hint", 726 | .requires_arg = 0, 727 | .parallelism = 0, 728 | }}; 729 | 730 | static pthread_mutex_t wait_mutex =
PTHREAD_MUTEX_INITIALIZER; 731 | static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER; 732 | static size_t nr_to_startup; 733 | static int set_thread_affinity = 1; 734 | 735 | static void *thread_start(void *data) { 736 | per_thread_t *args = data; 737 | 738 | // ensure every thread has a different RNG 739 | rng_init(args->x.thread_num); 740 | 741 | if (set_thread_affinity) { 742 | // find out which cpus we can run on and move us to an appropriate cpu 743 | cpu_set_t cpus; 744 | if (sched_getaffinity(0, sizeof(cpus), &cpus)) { 745 | perror("sched_getaffinity"); 746 | exit(1); 747 | } 748 | int my_cpu; 749 | unsigned num = args->x.thread_num; 750 | for (my_cpu = 0; my_cpu < CPU_SETSIZE; ++my_cpu) { 751 | if (!CPU_ISSET(my_cpu, &cpus)) continue; 752 | if (num == 0) break; 753 | --num; 754 | } 755 | if (my_cpu == CPU_SETSIZE) { 756 | fprintf(stderr, "error: more threads than cpus available\n"); 757 | exit(1); 758 | } 759 | CPU_ZERO(&cpus); 760 | CPU_SET(my_cpu, &cpus); 761 | if (sched_setaffinity(0, sizeof(cpus), &cpus)) { 762 | perror("sched_setaffinity"); 763 | exit(1); 764 | } 765 | } 766 | if (args->x.run_test_type == RUN_CHASE) { 767 | // generate chases -- using a different mixer index for every 768 | // thread and for every parallel chase within a thread 769 | unsigned parallelism = args->x.chase->parallelism; 770 | for (unsigned par = 0; par < parallelism; ++par) { 771 | if (args->x.use_longer_chase) { 772 | args->x.cycle[par] = generate_chase_long( 773 | args->x.genchase_args, parallelism * args->x.thread_num + par, 774 | parallelism); 775 | } else { 776 | args->x.cycle[par] = generate_chase( 777 | args->x.genchase_args, parallelism * args->x.thread_num + par); 778 | } 779 | } 780 | // handle critword2 chases 781 | if (strcmp(args->x.chase->name, "critword2") == 0) { 782 | size_t offset = strtoul(args->x.extra_args, 0, 0); 783 | char *p = args->x.cycle[0]; 784 | char *q = p; 785 | do { 786 | char *next = *(char **)p; 787 | *(void **)(p + offset) = next 
+ offset; 788 | p = next; 789 | } while (p != q); 790 | } 791 | // handle critword chases 792 | if (strcmp(args->x.chase->name, "critword") == 0) { 793 | size_t offset = strtoul(args->x.extra_args, 0, 0); 794 | char *p = args->x.cycle[0]; 795 | char *q = p; 796 | do { 797 | char *next = *(char **)p; 798 | *(void **)(p + offset) = next; 799 | *(void **)p = p + offset; 800 | p = next; 801 | } while (p != q); 802 | } 803 | // now flush our caches 804 | if (args->x.cache_flush_size) { 805 | size_t nr_elts = args->x.cache_flush_size / sizeof(size_t); 806 | size_t *p = args->x.flush_arena; 807 | size_t sum = 0; 808 | while (nr_elts) { 809 | sum += *p; 810 | ++p; 811 | --nr_elts; 812 | } 813 | args->x.dummy += sum; 814 | } 815 | } else { 816 | if (verbosity > 2) 817 | printf("thread_start(%d) memload generate buffers\n", args->x.thread_num); 818 | // generate buffers 819 | args->x.load_arena = (char *)alloc_arena_mmap( 820 | page_size, use_thp, 821 | args->x.load_total_memory + args->x.load_offset, 822 | -1) + args->x.load_offset; 823 | memset(args->x.load_arena, 1, 824 | args->x.load_total_memory); // ensure pages are mapped 825 | } 826 | 827 | if (verbosity > 2) 828 | printf("thread_start(%d) wait and/or wake up everyone\n", 829 | args->x.thread_num); 830 | // wait and/or wake up everyone if we're all ready 831 | pthread_mutex_lock(&wait_mutex); 832 | --nr_to_startup; 833 | if (nr_to_startup) { 834 | pthread_cond_wait(&wait_cond, &wait_mutex); 835 | } else { 836 | pthread_cond_broadcast(&wait_cond); 837 | } 838 | pthread_mutex_unlock(&wait_mutex); 839 | 840 | if (args->x.run_test_type == RUN_CHASE) { 841 | if (verbosity > 2) printf("thread_start: C(%d)\n", args->x.thread_num); 842 | args->x.chase->fn(data); 843 | } else { 844 | if (verbosity > 2) printf("thread_start: M(%d)\n", args->x.thread_num); 845 | args->x.memload->fn(data); 846 | } 847 | return NULL; 848 | } 849 | 850 | static void timestamp(void) { 851 | if (!print_timestamp) return; 852 | struct timeval tv; 853 
| gettimeofday(&tv, NULL); 854 | printf("%.6f ", tv.tv_sec + tv.tv_usec / 1000000.); 855 | } 856 | 857 | int main(int argc, char **argv) { 858 | char *p; 859 | int c; 860 | size_t i; 861 | size_t nr_threads = DEF_NR_THREADS; 862 | size_t nr_samples = DEF_NR_SAMPLES; 863 | size_t cache_flush_size = DEF_CACHE_FLUSH; 864 | size_t delay = DEF_DELAY; 865 | size_t offset = DEF_OFFSET; 866 | bool use_longer_chase = false; 867 | int print_average = 0; 868 | const char *extra_args = NULL; 869 | const char *chase_optarg = chases[0].name; 870 | const chase_t *chase = &chases[0]; 871 | const char *memload_optarg = memloads[0].name; 872 | const chase_t *memload = &memloads[0]; 873 | test_type_t run_test_type = 874 | RUN_CHASE; // RUN_CHASE, RUN_BANDWIDTH, RUN_CHASE_LOADED 875 | struct generate_chase_common_args genchase_args; 876 | 877 | default_page_size = page_size = get_native_page_size(); 878 | 879 | genchase_args.total_memory = DEF_TOTAL_MEMORY; 880 | genchase_args.stride = DEF_STRIDE; 881 | genchase_args.tlb_locality = DEF_TLB_LOCALITY * default_page_size; 882 | genchase_args.gen_permutation = gen_random_permutation; 883 | 884 | setvbuf(stdout, NULL, _IOLBF, BUFSIZ); 885 | 886 | while ((c = getopt(argc, argv, "ac:d:l:F:p:HLm:n:oO:S:s:T:t:vXyW:")) != -1) { 887 | switch (c) { 888 | case 'a': 889 | print_average = 1; 890 | break; 891 | case 'c': 892 | chase_optarg = optarg; 893 | p = strchr(optarg, ':'); 894 | if (p == NULL) p = optarg + strlen(optarg); 895 | for (i = 0; i < sizeof(chases) / sizeof(chases[0]); ++i) { 896 | if (strncmp(optarg, chases[i].name, p - optarg) == 0) { 897 | break; 898 | } 899 | } 900 | if (i == sizeof(chases) / sizeof(chases[0])) { 901 | fprintf(stderr, "Error: not a recognized chase name: %s\n", optarg); 902 | goto usage; 903 | } 904 | chase = &chases[i]; 905 | if (strncmp("chaseload", chases[i].name, 12) == 0) { 906 | run_test_type = RUN_CHASE_LOADED; 907 | if (verbosity > 0) { 908 | fprintf(stdout, 909 | "Info: Loaded Latency chase selected. 
A -l memload can be " 910 | "used to select a specific memory load\n"); 911 | } 912 | break; 913 | } 914 | if (run_test_type == RUN_BANDWIDTH) { 915 | fprintf(stderr, 916 | "Error: When using -l memload, the only valid -c selection " 917 | "is chaseload (ie. loaded latency)\n"); 918 | goto usage; 919 | } 920 | if (chase->requires_arg) { 921 | if (p[0] != ':' || p[1] == 0) { 922 | fprintf(stderr, 923 | "Error: that chase requires an argument:\n-c %s\t%s\n", 924 | chase->usage1, chase->usage2); 925 | exit(1); 926 | } 927 | extra_args = p + 1; 928 | } else if (*p != 0) { 929 | fprintf(stderr, 930 | "Error: that chase does not take an argument:\n-c %s\t%s\n", 931 | chase->usage1, chase->usage2); 932 | exit(1); 933 | } 934 | break; 935 | case 'd': 936 | delay = strtoul(optarg, &p, 0); 937 | if (*p) { 938 | fprintf(stderr, "Error: delay must be a non-negative integer\n"); 939 | exit(1); 940 | } 941 | break; 942 | case 'F': 943 | if (parse_mem_arg(optarg, &cache_flush_size)) { 944 | fprintf(stderr, 945 | "Error: cache_flush_size must be a non-negative integer " 946 | "(suffixed with k, m, or g)\n"); 947 | exit(1); 948 | } 949 | break; 950 | case 'p': 951 | if (parse_mem_arg(optarg, &page_size)) { 952 | fprintf(stderr, 953 | "Error: page_size must be a non-negative integer (suffixed " 954 | "with k, m, or g)\n"); 955 | exit(1); 956 | } 957 | break; 958 | case 'H': 959 | use_thp = true; 960 | break; 961 | case 'l': 962 | memload_optarg = optarg; 963 | p = strchr(optarg, ':'); 964 | if (p == NULL) p = optarg + strlen(optarg); 965 | for (i = 0; i < sizeof(memloads) / sizeof(memloads[0]); ++i) { 966 | if (strncmp(optarg, memloads[i].name, p - optarg) == 0) { 967 | break; 968 | } 969 | } 970 | if (i == sizeof(memloads) / sizeof(memloads[0])) { 971 | fprintf(stderr, "Error: not a recognized memload name: %s\n", optarg); 972 | goto usage; 973 | } 974 | memload = &memloads[i]; 975 | if (run_test_type != RUN_CHASE_LOADED) { 976 | run_test_type = RUN_BANDWIDTH; 977 | if (verbosity > 
0) { 978 | fprintf(stdout, 979 | "Memory Bandwidth test selected. For loaded latency, -c " 980 | "chaseload must also be selected\n"); 981 | } 982 | } 983 | if (memload->requires_arg) { 984 | if (p[0] != ':' || p[1] == 0) { 985 | fprintf(stderr, 986 | "Error: that memload requires an argument:\n-l %s\t%s\n", 987 | memload->usage1, memload->usage2); 988 | exit(1); 989 | } 990 | extra_args = p + 1; 991 | } else if (*p != 0) { 992 | fprintf(stderr, 993 | "Error: that memload does not take an argument:\n-l %s\t%s\n", 994 | memload->usage1, memload->usage2); 995 | exit(1); 996 | } 997 | break; 998 | case 'L': 999 | use_longer_chase = true; 1000 | break; 1001 | case 'm': 1002 | if (parse_mem_arg(optarg, &genchase_args.total_memory) || 1003 | genchase_args.total_memory == 0) { 1004 | fprintf(stderr, 1005 | "Error: total_memory must be a positive integer (suffixed " 1006 | "with k, m, or g)\n"); 1007 | exit(1); 1008 | } 1009 | break; 1010 | case 'n': 1011 | nr_samples = strtoul(optarg, &p, 0); 1012 | if (*p) { 1013 | fprintf(stderr, "Error: nr_samples must be a non-negative integer\n"); 1014 | exit(1); 1015 | } 1016 | break; 1017 | case 'O': 1018 | if (parse_mem_arg(optarg, &offset)) { 1019 | fprintf(stderr, 1020 | "Error: offset must be a non-negative integer (suffixed with " 1021 | "k, m, or g)\n"); 1022 | exit(1); 1023 | } 1024 | break; 1025 | case 'o': 1026 | genchase_args.gen_permutation = gen_ordered_permutation; 1027 | break; 1028 | case 's': 1029 | if (parse_mem_arg(optarg, &genchase_args.stride)) { 1030 | fprintf(stderr, 1031 | "Error: stride must be a positive integer (suffixed with k, " 1032 | "m, or g)\n"); 1033 | exit(1); 1034 | } 1035 | break; 1036 | case 'T': 1037 | if (parse_mem_arg(optarg, &genchase_args.tlb_locality)) { 1038 | fprintf(stderr, 1039 | "Error: tlb locality must be a positive integer (suffixed " 1040 | "with k, m, or g)\n"); 1041 | exit(1); 1042 | } 1043 | break; 1044 | case 't': 1045 | nr_threads = strtoul(optarg, &p, 0); 1046 | if (*p || 
nr_threads == 0) { 1047 | fprintf(stderr, "Error: nr_threads must be a positive integer\n"); 1048 | exit(1); 1049 | } 1050 | break; 1051 | case 'v': 1052 | ++verbosity; 1053 | break; 1054 | case 'W': 1055 | is_weighted_mbind = 1; 1056 | char *tok = NULL, *saveptr = NULL; 1057 | tok = strtok_r(optarg, ",", &saveptr); 1058 | while (tok != NULL) { 1059 | uint16_t node_id; 1060 | uint16_t weight; 1061 | int count = sscanf(tok, "%hu:%hu", &node_id, &weight); 1062 | if (count != 2) { 1063 | fprintf(stderr, "Error: Expecting node_id:weight\n"); 1064 | exit(1); 1065 | } 1066 | if (node_id >= sizeof(mbind_weights) / sizeof(mbind_weights[0])) { 1067 | fprintf(stderr, "Error: Maximum node_id is %lu\n", 1068 | sizeof(mbind_weights) / sizeof(mbind_weights[0]) - 1); 1069 | exit(1); 1070 | } 1071 | mbind_weights[node_id] = weight; 1072 | tok = strtok_r(NULL, ",", &saveptr); 1073 | } 1074 | break; 1075 | case 'X': 1076 | set_thread_affinity = 0; 1077 | break; 1078 | case 'y': 1079 | print_timestamp = 1; 1080 | break; 1081 | default: 1082 | goto usage; 1083 | } 1084 | } 1085 | 1086 | if (argc - optind != 0) { 1087 | usage: 1088 | fprintf(stderr, "usage: %s [options]\n", argv[0]); 1089 | fprintf(stderr, 1090 | "This program can run either read latency, memory bandwidth, or " 1091 | "loaded-latency:\n"); 1092 | fprintf(stderr, 1093 | " Latency only: -c MUST NOT be chaseload. -l memload MUST NOT " 1094 | "be used\n"); 1095 | fprintf(stderr, 1096 | " Bandwidth only: -c MUST NOT be used. 
-l memload MUST be " 1097 | "used\n"); 1098 | fprintf(stderr, 1099 | " Loaded-latency: -c MUST be chaseload, -l memload MUST be " 1100 | "used\n"); 1101 | fprintf(stderr, 1102 | "-a print average latency (default is best latency)\n"); 1103 | fprintf(stderr, "-c chase select one of several different chases:\n"); 1104 | for (i = 0; i < sizeof(chases) / sizeof(chases[0]); ++i) { 1105 | fprintf(stderr, " %-12s%s\n", chases[i].usage1, chases[i].usage2); 1106 | } 1107 | fprintf(stderr, " default: %s\n", chases[0].name); 1108 | fprintf(stderr, 1109 | "-l memload select one of several different memloads:\n"); 1110 | for (i = 0; i < sizeof(memloads) / sizeof(memloads[0]); ++i) { 1111 | fprintf(stderr, " %-12s%s\n", memloads[i].usage1, memloads[i].usage2); 1112 | } 1113 | fprintf(stderr, " default: %s\n", memloads[0].name); 1114 | fprintf(stderr, 1115 | "-d delay delay used between loads; only effective if used with " 1116 | "load pattern with suffix injection_delay. (default %zu)\n", 1117 | DEF_DELAY); 1118 | fprintf(stderr, 1119 | "-F nnnn[kmg] amount of memory to use to flush the caches after " 1120 | "constructing\n" 1121 | " the chase/memload and before starting the benchmark (use " 1122 | "with nta)\n" 1123 | " default: %zu\n", 1124 | DEF_CACHE_FLUSH); 1125 | fprintf(stderr, "-p nnnn[kmg] backing page size to use (default %zu)\n", 1126 | default_page_size); 1127 | fprintf( 1128 | stderr, 1129 | "-H use transparent hugepages (leave page size at default)\n"); 1130 | fprintf(stderr, "-m nnnn[kmg] total memory size (default %zu)\n", 1131 | DEF_TOTAL_MEMORY); 1132 | fprintf(stderr, 1133 | " NOTE: memory size will be rounded down to a multiple of " 1134 | "-T option\n"); 1135 | fprintf(stderr, 1136 | "-L use longer chase\n"); 1137 | fprintf(stderr, 1138 | "-n nr_samples nr of 0.5 second samples to use (default %zu, 0 = " 1139 | "infinite)\n", 1140 | DEF_NR_SAMPLES); 1141 | fprintf(stderr, 1142 | "-o perform an ordered traversal (rather than random)\n"); 1143 | fprintf( 1144
| stderr, 1145 | "-O nnnn[kmg] offset the entire chase by nnnn bytes (default %zu)\n", 1146 | DEF_OFFSET); 1147 | fprintf(stderr, "-s nnnn[kmg] stride size (default %zu)\n", DEF_STRIDE); 1148 | fprintf(stderr, "-T nnnn[kmg] TLB locality in bytes (default %zu)\n", 1149 | DEF_TLB_LOCALITY * default_page_size); 1150 | fprintf(stderr, 1151 | " NOTE: TLB locality will be rounded down to a multiple of " 1152 | "stride\n"); 1153 | fprintf(stderr, "-t nr_threads number of threads (default %zu)\n", 1154 | DEF_NR_THREADS); 1155 | fprintf(stderr, "-v verbose output (default %u)\n", verbosity); 1156 | fprintf( 1157 | stderr, 1158 | "-W mbind list list of node:weight,... pairs for allocating memory\n" 1159 | " has no effect if -H flag is specified\n" 1160 | " 0:10,1:90 weights it as 10%% on 0 and 90%% on 1\n"); 1161 | fprintf(stderr, "-X do not set thread affinity (default %u)\n", 1162 | set_thread_affinity); 1163 | fprintf(stderr, 1164 | "-y print timestamp in front of each line (default %u)\n", 1165 | print_timestamp); 1166 | exit(1); 1167 | } 1168 | 1169 | if (genchase_args.stride < sizeof(void *)) { 1170 | fprintf(stderr, "stride must be at least %zu\n", sizeof(void *)); 1171 | exit(1); 1172 | } 1173 | 1174 | // ensure some sanity in the various arguments 1175 | if (genchase_args.tlb_locality < genchase_args.stride) { 1176 | genchase_args.tlb_locality = genchase_args.stride; 1177 | } else { 1178 | genchase_args.tlb_locality -= 1179 | genchase_args.tlb_locality % genchase_args.stride; 1180 | } 1181 | 1182 | if (genchase_args.total_memory < genchase_args.tlb_locality) { 1183 | if (genchase_args.total_memory < genchase_args.stride) { 1184 | genchase_args.total_memory = genchase_args.stride; 1185 | } else { 1186 | genchase_args.total_memory -= 1187 | genchase_args.total_memory % genchase_args.stride; 1188 | } 1189 | genchase_args.tlb_locality = genchase_args.total_memory; 1190 | } else { 1191 | genchase_args.total_memory -= 1192 | genchase_args.total_memory % 
genchase_args.tlb_locality; 1193 | } 1194 | 1195 | genchase_args.nr_mixer_indices = 1196 | genchase_args.stride / chase->base_object_size; 1197 | if ((run_test_type == RUN_CHASE) && 1198 | (genchase_args.nr_mixer_indices < nr_threads * chase->parallelism)) { 1199 | fprintf(stderr, 1200 | "the stride is too small to interleave that many threads, need at " 1201 | "least %zu bytes\n", 1202 | nr_threads * chase->parallelism * chase->base_object_size); 1203 | exit(1); 1204 | } 1205 | 1206 | if (verbosity > 0) { 1207 | printf("nr_threads = %zu\n", nr_threads); 1208 | print_page_size(page_size, use_thp); 1209 | printf("total_memory = %zu (%.1f MiB)\n", genchase_args.total_memory, 1210 | genchase_args.total_memory / (1024. * 1024.)); 1211 | printf("stride = %zu\n", genchase_args.stride); 1212 | printf("tlb_locality = %zu\n", genchase_args.tlb_locality); 1213 | printf("chase = %s\n", chase_optarg); 1214 | printf("memload = %s\n", memload_optarg); 1215 | if (run_test_type == RUN_CHASE) printf("run_test_type = RUN_CHASE\n"); 1216 | if (run_test_type == RUN_BANDWIDTH) 1217 | printf("run_test_type = RUN_BANDWIDTH\n"); 1218 | if (run_test_type == RUN_CHASE_LOADED) 1219 | printf("run_test_type = RUN_CHASE_LOADED\n"); 1220 | } 1221 | 1222 | rng_init(1); 1223 | 1224 | if (run_test_type != RUN_BANDWIDTH) { 1225 | generate_chase_mixer(&genchase_args, nr_threads * chase->parallelism); 1226 | 1227 | // generate the chases by launching multiple threads 1228 | if (verbosity > 2) printf("allocate genchase_args.arena\n"); 1229 | genchase_args.arena = 1230 | (char *)alloc_arena_mmap(page_size, use_thp, 1231 | genchase_args.total_memory + offset, -1) + 1232 | offset; 1233 | } 1234 | per_thread_t *thread_data = alloc_arena_mmap( 1235 | default_page_size, false, nr_threads * sizeof(per_thread_t), -1); 1236 | void *flush_arena = NULL; 1237 | if (verbosity > 2) printf("allocate cache flush\n"); 1238 | if (cache_flush_size) { 1239 | flush_arena = alloc_arena_mmap(default_page_size, false, 
cache_flush_size, 1240 | -1); 1241 | memset(flush_arena, 1, cache_flush_size); // ensure pages are mapped 1242 | } 1243 | 1244 | pthread_t thread; 1245 | size_t nr_chase_threads = 0, nr_load_threads = 0; 1246 | nr_to_startup = nr_threads; 1247 | for (i = 0; i < nr_threads; ++i) { 1248 | thread_data[i].x.genchase_args = &genchase_args; 1249 | thread_data[i].x.nr_threads = nr_threads; 1250 | thread_data[i].x.thread_num = i; 1251 | thread_data[i].x.extra_args = extra_args; 1252 | thread_data[i].x.chase = chase; 1253 | thread_data[i].x.flush_arena = flush_arena; 1254 | thread_data[i].x.cache_flush_size = cache_flush_size; 1255 | thread_data[i].x.memload = memload; 1256 | thread_data[i].x.load_arena = NULL; // memory buffer used by this thread 1257 | thread_data[i].x.load_total_memory = 1258 | genchase_args.total_memory; // size of the arena 1259 | thread_data[i].x.load_offset = offset; // memory buffer offset 1260 | thread_data[i].x.delay = delay; 1261 | thread_data[i].x.use_longer_chase = use_longer_chase; 1262 | 1263 | if (run_test_type == RUN_CHASE_LOADED) { 1264 | if (i == 0) { 1265 | thread_data[i].x.run_test_type = RUN_CHASE; 1266 | nr_chase_threads++; 1267 | if (verbosity > 2) printf("main: Starting C[%ld]\n", i); 1268 | if (pthread_create(&thread, NULL, thread_start, &thread_data[i])) { 1269 | perror("pthread_create"); 1270 | exit(1); 1271 | } 1272 | } else { 1273 | thread_data[i].x.run_test_type = RUN_BANDWIDTH; 1274 | nr_load_threads++; 1275 | if (verbosity > 2) printf("main: Starting M[%ld]\n", i); 1276 | if (pthread_create(&thread, NULL, thread_start, &thread_data[i])) { 1277 | perror("pthread_create"); 1278 | exit(1); 1279 | } 1280 | } 1281 | 1282 | } else if (run_test_type == RUN_CHASE) { 1283 | thread_data[i].x.run_test_type = RUN_CHASE; 1284 | nr_chase_threads++; 1285 | if (verbosity > 2) printf("main: Starting C[%ld]\n", i); 1286 | if (pthread_create(&thread, NULL, thread_start, &thread_data[i])) { 1287 | perror("pthread_create"); 1288 | exit(1); 1289 
| } 1290 | } else { 1291 | nr_load_threads++; 1292 | thread_data[i].x.run_test_type = RUN_BANDWIDTH; 1293 | if (verbosity > 2) printf("main: Starting M[%ld]\n", i); 1294 | if (pthread_create(&thread, NULL, thread_start, &thread_data[i])) { 1295 | perror("pthread_create"); 1296 | exit(1); 1297 | } 1298 | } 1299 | } 1300 | 1301 | // now wait for them all to finish generating their chases/memloads and start 1302 | // testing 1303 | if (verbosity > 2) printf("main: waiting for threads to initialize\n"); 1304 | pthread_mutex_lock(&wait_mutex); 1305 | if (nr_to_startup) { 1306 | pthread_cond_wait(&wait_cond, &wait_mutex); 1307 | } 1308 | pthread_mutex_unlock(&wait_mutex); 1309 | usleep(LOAD_DELAY_WARMUP_uS); // Give OS scheduler thread migrations time to 1310 | // settle down. 1311 | 1312 | if (verbosity > 2) printf("main: start sampling thread progress\n"); 1313 | // now start sampling their progress 1314 | nr_samples = nr_samples + 1; // we drop the first sample 1315 | double *cur_samples = alloca(nr_threads * sizeof(*cur_samples)); 1316 | uint64_t last_sample_time, cur_sample_time; 1317 | double chase_min = 1. / 0., chase_max = 0.; 1318 | double chase_running_sum = 0., load_running_sum = 0., 1319 | chase_running_geosum = 0.; 1320 | double load_max_mibps = 0, load_min_mibps = 1. / 0.; 1321 | double chase_thd_sum = 0, load_thd_sum = 0; 1322 | uint64_t time_delta = 0; 1323 | int ready; 1324 | 1325 | last_sample_time = now_nsec(); 1326 | for (size_t sample_no = 0; nr_samples == 1 || sample_no < nr_samples; 1327 | ++sample_no) { 1328 | if (verbosity > 0) printf("main: sample_no=%ld ", sample_no); 1329 | usleep(LOAD_DELAY_RUN_uS); 1330 | // Request threads to update their sample 1331 | for (i = 0; i < nr_threads; ++i) { 1332 | thread_data[i].x.sample_no = sample_no; 1333 | } 1334 | 1335 | chase_thd_sum = 0.; 1336 | load_thd_sum = 0.; 1337 | usleep(LOAD_DELAY_SAMPLE_uS); // Give load threads time to update sample 1338 | // count. 
Chase threads are always updating 1339 | for (i = 0; i < nr_threads; ++i) { 1340 | if (verbosity > 2) printf("-"); 1341 | ready = 0; 1342 | while (ready == 0) { 1343 | cur_samples[i] = 1344 | (double)__sync_lock_test_and_set(&thread_data[i].x.count, 0); 1345 | if (cur_samples[i] != 0) { 1346 | // Chase threads start at thread 0 and should always be ready, 1347 | // therefore we read the chase timestamp as soon as we finish reading the 1348 | // last chase thread. Load threads return pre-calculated MiB/s so they 1349 | // don't use this timer. 1350 | if ((i + 1) == nr_chase_threads) { 1351 | cur_sample_time = now_nsec(); 1352 | time_delta = cur_sample_time - last_sample_time; 1353 | last_sample_time = cur_sample_time; 1354 | } 1355 | ready = 1; 1356 | } else { 1357 | if (verbosity > 2) printf("*"); 1358 | usleep(LOAD_DELAY_SAMPLE_uS); 1359 | } 1360 | } 1361 | } 1362 | 1363 | for (i = 0; i < nr_threads; ++i) { 1364 | // printf("main: thread[%d], run_mode=%i\n", t->x.thread_num, 1365 | // t->x.run_test_type); 1366 | if (thread_data[i].x.run_test_type == RUN_CHASE) { 1367 | chase_thd_sum += (double)cur_samples[i]; 1368 | if (verbosity > 1) { 1369 | double z = time_delta / (double)cur_samples[i]; 1370 | double mibps = sizeof(void *) / (z / 1000000000.0) / (1024 * 1024); 1371 | printf(" MC(%ld)%.3f, %6.1f(ns), %.3f(MiB/s)", i, cur_samples[i], z, 1372 | mibps); 1373 | } 1374 | } else { 1375 | load_thd_sum += (double)cur_samples[i]; 1376 | if (verbosity > 1) { 1377 | printf(" ML(%ld)%.0f(MiB/s)", i, cur_samples[i]); 1378 | } 1379 | } 1380 | } 1381 | 1382 | // we drop the first sample because it's fairly likely one 1383 | // thread had some advantage initially due to still having 1384 | // portions of the chase in a cache. 1385 | if (sample_no == 0) { 1386 | if (verbosity > 0) printf("\n"); 1387 | continue; 1388 | } 1389 | 1390 | // Calculate chase overall thread stats. 
1391 | if (chase_thd_sum != 0) { 1392 | double t = time_delta / (double)chase_thd_sum; 1393 | chase_running_sum += t; 1394 | chase_running_geosum += log(t); 1395 | if (t < chase_min) chase_min = t; 1396 | if (t > chase_max) chase_max = t; 1397 | if (verbosity > 0) { 1398 | double z = t * nr_chase_threads; 1399 | printf(" avg=%.1f(ns)\n", z); 1400 | } 1401 | } 1402 | 1403 | // Calculate memory load overall thread stats 1404 | if (load_thd_sum != 0) { 1405 | if (load_thd_sum > load_max_mibps) load_max_mibps = load_thd_sum; 1406 | if (load_thd_sum < load_min_mibps) load_min_mibps = load_thd_sum; 1407 | load_running_sum += load_thd_sum; 1408 | if (verbosity > 0) { 1409 | printf(" main: threads=%ld, Total(MiB/s)=%.*f, PerThread=%.f\n", 1410 | nr_load_threads, load_thd_sum < 100. ? 3 : 1, load_thd_sum, 1411 | load_thd_sum / nr_load_threads); 1412 | } 1413 | } 1414 | } 1415 | 1416 | // printf("sample_sum=%.f\n", sample_sum); 1417 | // printf("main: float=%li, void*=%li, size_t=%li, uint64_t=%li, double=%li, 1418 | // long=%li, int=%li\n", 1419 | // sizeof(float), sizeof(void*), sizeof(size_t), sizeof(uint64_t), 1420 | // sizeof(double), sizeof(long), sizeof(int) ); 1421 | double ChasNS = 0, ChasDEV = 0, ChasBEST = 0, ChasWORST = 0, ChasAVG = 0, 1422 | ChasMibs = 0, ChasGEO = 0; 1423 | double LdAvgMibs = 0, LdMibsDEV = 0; 1424 | 1425 | if (nr_chase_threads != 0) { 1426 | ChasAVG = chase_running_sum * nr_chase_threads / (nr_samples - 1); 1427 | ChasGEO = 1428 | nr_chase_threads * exp(chase_running_geosum / ((double)nr_samples - 1)); 1429 | ChasBEST = chase_min * nr_chase_threads; 1430 | ChasWORST = chase_max * nr_chase_threads; 1431 | ChasDEV = ((ChasWORST - ChasBEST) / ChasAVG); 1432 | if (verbosity > 0) { 1433 | printf( 1434 | "ChasAVG=%-8f, ChasGEO=%-8f, ChasBEST=%-8f, ChasWORST=%-8f, " 1435 | "ChasDEV=%-8.3f\n", 1436 | ChasAVG, ChasGEO, ChasBEST, ChasWORST, ChasDEV); 1437 | } 1438 | // if (print_average) ChasNS = ChasAVG; 1439 | if (print_average) 1440 | ChasNS = 
ChasGEO; 1441 | else 1442 | ChasNS = ChasBEST; 1443 | ChasMibs = nr_chase_threads * 1444 | (sizeof(void *) / (ChasNS / 1000000000.0) / (1024 * 1024)); 1445 | } 1446 | 1447 | if (nr_load_threads != 0) { 1448 | LdAvgMibs = load_running_sum / (nr_samples - 1); 1449 | LdMibsDEV = ((load_max_mibps - load_min_mibps) / LdAvgMibs); 1450 | if (verbosity > 0) { 1451 | printf( 1452 | "LdAvgMibs=%-8f, LdMaxMibs=%-8f, LdMinMibs=%-8f, LdDevMibs=%-8.3f\n", 1453 | LdAvgMibs, load_max_mibps, load_min_mibps, LdMibsDEV); 1454 | } 1455 | } 1456 | 1457 | const char *not_used = "--------"; 1458 | printf( 1459 | "Samples\t, Byte/thd\t, ChaseThds\t, ChaseNS\t, ChaseMibs\t, " 1460 | "ChDeviate\t, LoadThds\t, LdMaxMibs\t, LdAvgMibs\t, LdDeviate\t, " 1461 | "ChaseArg\t, MemLdArg\n"); 1462 | printf( 1463 | "%-6ld\t, %-11ld\t, %-8ld\t, %-8.3f\t, %-8.f\t, %-8.3f\t, %-8.f\t, " 1464 | "%-8.f\t, %-8.f\t, %-8.3f", 1465 | nr_samples - 1, thread_data[0].x.load_total_memory, nr_chase_threads, 1466 | ChasNS, ChasMibs, ChasDEV, (double)nr_load_threads, load_max_mibps, 1467 | LdAvgMibs, LdMibsDEV); 1468 | switch (run_test_type) { 1469 | case RUN_CHASE_LOADED: 1470 | printf("\t, %s\t, %s\n", chase_optarg, memload_optarg); 1471 | break; 1472 | case RUN_BANDWIDTH: 1473 | printf("\t, %s\t, %s\n", not_used, memload_optarg); 1474 | break; 1475 | default: 1476 | printf("\t, %s\t, %s\n", chase_optarg, not_used); 1477 | } 1478 | 1479 | timestamp(); 1480 | exit(0); 1481 | } 1482 | -------------------------------------------------------------------------------- /permutation.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 
4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #include "permutation.h" 15 | 16 | #include <assert.h> 17 | #include <limits.h> 18 | #include <stdint.h> 19 | #include <stdio.h> 20 | #include <stdlib.h> 21 | #include <string.h> 22 | 23 | // some asserts are more expensive than we want in general use, but there are a 24 | // few I want active even in general use. 25 | #if 1 26 | #define dassert(x) \ 27 | do { \ 28 | } while (0) 29 | #else 30 | #define dassert(x) assert(x) 31 | #endif 32 | 33 | // XXX: declare this somewhere 34 | extern int verbosity; 35 | 36 | __thread char *rng_buf; // buf size (32 for now) determines the "randomness" 37 | __thread struct random_data *rand_state; // per_thread state for random_r 38 | 39 | //============================================================================ 40 | // A random permutation generator. I think this algorithm is from Knuth. 
41 | 42 | void gen_random_permutation(perm_t *perm, size_t nr, size_t base) { 43 | size_t i; 44 | 45 | for (i = 0; i < nr; ++i) { 46 | size_t t = rng_int(i); 47 | perm[i] = perm[t]; 48 | perm[t] = base + i; 49 | } 50 | } 51 | 52 | void gen_ordered_permutation(perm_t *perm, size_t nr, size_t base) { 53 | size_t i; 54 | 55 | for (i = 0; i < nr; ++i) { 56 | perm[i] = base + i; 57 | } 58 | } 59 | 60 | int is_a_permutation(const perm_t *perm, size_t nr_elts) { 61 | uint8_t *vec; 62 | size_t vec_len = (nr_elts + 7) / 8; 63 | size_t i; 64 | 65 | vec = malloc(vec_len); 66 | memset(vec, 0, vec_len); 67 | 68 | for (i = 0; i < nr_elts; ++i) { 69 | size_t vec_elt = perm[i] / 8; 70 | size_t test_bit = 1u << (perm[i] % 8); 71 | if (vec[vec_elt] & test_bit) { 72 | free(vec); 73 | return 0; 74 | } 75 | vec[vec_elt] |= test_bit; 76 | } 77 | for (i = 0; i < nr_elts / 8; ++i) { 78 | if (vec[i] != 0xff) { 79 | free(vec); 80 | return 0; 81 | } 82 | } 83 | if (nr_elts % 8) { 84 | if (vec[vec_len - 1] != ((1u << (nr_elts % 8)) - 1)) { 85 | free(vec); 86 | return 0; 87 | } 88 | } 89 | free(vec); 90 | return 1; 91 | } 92 | 93 | void generate_chase_mixer(struct generate_chase_common_args *args, 94 | size_t nr_mixers) { 95 | size_t nr_mixer_indices = args->nr_mixer_indices; 96 | void (*gen_permutation)(perm_t *, size_t, size_t) = args->gen_permutation; 97 | 98 | /* Set number of mixers rounded up to a power of two */ 99 | if (nr_mixers > 1) { 100 | args->nr_mixers = 1 << (CHAR_BIT * sizeof(long) - 101 | __builtin_clzl(nr_mixers - 1)); 102 | } 103 | 104 | if (args->nr_mixers < 64) { 105 | args->nr_mixers = 64; 106 | } 107 | 108 | if (verbosity > 1) 109 | printf("nr_mixers = %zu\n", args->nr_mixers); 110 | perm_t *t = malloc(nr_mixer_indices * sizeof(*t)); 111 | if (t == NULL) { 112 | fprintf(stderr, "Could not allocate %lu bytes, check stride/memory size?\n", 113 | nr_mixer_indices * sizeof(*t)); 114 | exit(1); 115 | } 116 | perm_t *r = malloc(nr_mixer_indices * args->nr_mixers * 
sizeof(*r)); 117 | if (r == NULL) { 118 | fprintf(stderr, "Could not allocate %lu bytes, check stride/memory size?\n", 119 | nr_mixer_indices * args->nr_mixers * sizeof(*r)); 120 | exit(1); 121 | } 122 | size_t i; 123 | size_t j; 124 | 125 | // we arrange r in a transposed manner so that all of the 126 | // data for a particular mixer_idx is packed together. 127 | for (i = 0; i < args->nr_mixers; ++i) { 128 | gen_permutation(t, nr_mixer_indices, 0); 129 | for (j = 0; j < nr_mixer_indices; ++j) { 130 | r[j * args->nr_mixers + i] = t[j]; 131 | } 132 | } 133 | free(t); 134 | 135 | args->mixer = r; 136 | } 137 | 138 | // Generate a pointer chasing sequence according to chase args. 139 | void *generate_chase(const struct generate_chase_common_args *args, 140 | size_t mixer_idx) { 141 | char *arena = args->arena; 142 | size_t total_memory = args->total_memory; 143 | size_t stride = args->stride; 144 | size_t tlb_locality = args->tlb_locality; 145 | void (*gen_permutation)(perm_t *, size_t, size_t) = args->gen_permutation; 146 | const perm_t *mixer = args->mixer + mixer_idx * args->nr_mixers; 147 | size_t nr_mixer_indices = args->nr_mixer_indices; 148 | 149 | size_t nr_tlb_groups = total_memory / tlb_locality; 150 | size_t nr_elts_per_tlb = tlb_locality / stride; 151 | size_t nr_elts = total_memory / stride; 152 | perm_t *tlb_perm; 153 | perm_t *perm; 154 | size_t i; 155 | perm_t *perm_inverse; 156 | size_t mixer_scale = stride / nr_mixer_indices; 157 | 158 | if (verbosity > 1) 159 | printf("generating permutation of %zu elements (in %zu TLB groups)\n", 160 | nr_elts, nr_tlb_groups); 161 | tlb_perm = malloc(nr_tlb_groups * sizeof(*tlb_perm)); 162 | gen_permutation(tlb_perm, nr_tlb_groups, 0); 163 | perm = malloc(nr_elts * sizeof(*perm)); 164 | for (i = 0; i < nr_tlb_groups; ++i) { 165 | gen_permutation(&perm[i * nr_elts_per_tlb], nr_elts_per_tlb, 166 | tlb_perm[i] * nr_elts_per_tlb); 167 | } 168 | free(tlb_perm); 169 | 170 | dassert(is_a_permutation(perm, nr_elts)); 171 | 
172 | if (verbosity > 1) printf("generating inverse permutation\n"); 173 | perm_inverse = malloc(nr_elts * sizeof(*perm)); 174 | for (i = 0; i < nr_elts; ++i) { 175 | perm_inverse[perm[i]] = i; 176 | } 177 | 178 | dassert(is_a_permutation(perm_inverse, nr_elts)); 179 | 180 | #define MIXED(x) ((x)*stride + mixer[(x) & (args->nr_mixers - 1)] * mixer_scale) 181 | 182 | if (verbosity > 1) 183 | printf("threading the chase (mixer_idx = %zu)\n", mixer_idx); 184 | for (i = 0; i < nr_elts; ++i) { 185 | size_t next; 186 | dassert(perm[perm_inverse[i]] == i); 187 | next = perm_inverse[i] + 1; 188 | next = (next == nr_elts) ? 0 : next; 189 | *(void **)(arena + MIXED(i)) = (void *)(arena + MIXED(perm[next])); 190 | } 191 | 192 | free(perm); 193 | free(perm_inverse); 194 | 195 | return arena + MIXED(0); 196 | } 197 | 198 | // Generates nr_mixer_indices/total_par number of permutations and switches to 199 | // the next permutation in each iteration of the chase. 200 | // This modification is effective in getting around the CMC prefetcher. 
201 | void *generate_chase_long(const struct generate_chase_common_args *args, 202 | size_t mixer_idx, size_t total_par) { 203 | char *arena = args->arena; 204 | size_t total_memory = args->total_memory; 205 | size_t stride = args->stride; 206 | size_t tlb_locality = args->tlb_locality; 207 | void (*gen_permutation)(perm_t *, size_t, size_t) = args->gen_permutation; 208 | size_t nr_mixer_indices = args->nr_mixer_indices; 209 | size_t nr_iteration = nr_mixer_indices / total_par; 210 | const perm_t *mixer = args->mixer + mixer_idx * nr_iteration * args->nr_mixers; 211 | 212 | size_t nr_tlb_groups = total_memory / tlb_locality; 213 | size_t nr_elts_per_tlb = tlb_locality / stride; 214 | size_t nr_elts = total_memory / stride; 215 | perm_t *tlb_perm; 216 | perm_t *perm; 217 | size_t i; 218 | size_t j; 219 | size_t base; 220 | perm_t *perm_inverse; 221 | size_t mixer_scale = stride / nr_mixer_indices; 222 | 223 | if (verbosity > 1) 224 | printf("generating permutation of %zu elements (in %zu TLB groups)\n", 225 | nr_elts, nr_tlb_groups); 226 | 227 | perm = malloc(nr_iteration * nr_elts * sizeof(*perm)); 228 | if (perm == NULL) { 229 | fprintf(stderr, "Could not allocate %lu bytes\n", 230 | nr_iteration * nr_elts * sizeof(*perm)); 231 | exit(1); 232 | } 233 | 234 | // Generate nr_iteration number of permutations. 
235 | for (j = 0; j < nr_iteration; j++) { 236 | base = j * nr_elts; 237 | 238 | tlb_perm = malloc(nr_tlb_groups * sizeof(*tlb_perm)); 239 | if (tlb_perm == NULL) { 240 | fprintf(stderr, "Could not allocate %lu bytes\n", 241 | nr_tlb_groups * sizeof(*tlb_perm)); 242 | exit(1); 243 | } 244 | 245 | gen_permutation(tlb_perm, nr_tlb_groups, 0); 246 | 247 | for (i = 0; i < nr_tlb_groups; ++i) { 248 | gen_permutation(&perm[j * nr_elts + i * nr_elts_per_tlb], nr_elts_per_tlb, 249 | base + tlb_perm[i] * nr_elts_per_tlb); 250 | } 251 | free(tlb_perm); 252 | 253 | dassert(is_a_permutation(perm, nr_elts)); 254 | if (verbosity > 1) 255 | printf("generating inverse permutation\n"); 256 | } 257 | 258 | dassert(is_a_permutation(perm, nr_iteration * nr_elts)); 259 | 260 | perm_inverse = malloc(nr_iteration * nr_elts * sizeof(*perm_inverse)); 261 | if (perm_inverse == NULL) { 262 | fprintf(stderr, "Could not allocate %lu bytes\n", 263 | nr_iteration * nr_elts * sizeof(*perm_inverse)); 264 | exit(1); 265 | } 266 | 267 | for (i = 0; i < nr_iteration * nr_elts; ++i) { 268 | perm_inverse[perm[i]] = i; 269 | } 270 | 271 | dassert(is_a_permutation(perm_inverse, nr_iteration * nr_elts)); 272 | 273 | // Get the [(x mod NR_MIXER)th element in the jth row of mixer]th element 274 | // in the xth stride of the array. 275 | #define MIXED_2(x,j,n) ((x)*stride + (mixer + j*n)[(x) & (n-1)] * mixer_scale) 276 | 277 | if (verbosity > 1) 278 | printf("threading the chase (mixer_idx = %zu)\n", mixer_idx); 279 | 280 | // Generate the final permutation, which connects the nr_iteration 281 | // permutations of nr_elts elements together into one.
282 | for (i = 0; i < nr_elts * nr_iteration; ++i) { 283 | size_t next; 284 | dassert(perm[perm_inverse[i]] == i); 285 | assert(*(void **)(arena + MIXED_2(i%nr_elts,i/nr_elts, args->nr_mixers)) == NULL); 286 | next = perm_inverse[i] + 1; 287 | // If next is the position representing the start of a new iteration of 288 | // permutation, set next to be the position representing the start of 289 | // current iteration, because we want to finish current permutation before 290 | // proceeding to the next. 291 | next = (next%nr_elts == 0 && next/nr_elts > i/nr_elts) ? i/nr_elts*nr_elts : next; 292 | 293 | if (perm[next]%nr_elts == 0) { 294 | // If current iteration of permutation is finished, 295 | // new position is the start of next iteration. 296 | size_t new = (i/nr_elts + 1) * nr_elts; 297 | new = (new == nr_iteration * nr_elts) ? 0 : new; 298 | *(void **)(arena + MIXED_2(i%nr_elts,i/nr_elts, args->nr_mixers)) = 299 | (void *)(arena + MIXED_2(new%nr_elts,new/nr_elts, args->nr_mixers)); 300 | } else { 301 | *(void **)(arena + MIXED_2(i%nr_elts,i/nr_elts, args->nr_mixers)) = 302 | (void *)(arena + MIXED_2(perm[next]%nr_elts,perm[next]/nr_elts, args->nr_mixers)); 303 | } 304 | } 305 | 306 | free(perm); 307 | free(perm_inverse); 308 | 309 | return arena + MIXED_2(0,0, args->nr_mixers); 310 | } 311 | -------------------------------------------------------------------------------- /permutation.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 
4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #ifndef PERMUTATION_H_INCLUDED 15 | #define PERMUTATION_H_INCLUDED 16 | 17 | #include <assert.h> 18 | #include <stdint.h> 19 | #include <stdio.h> 20 | #include <stdlib.h> 21 | 22 | typedef size_t perm_t; 23 | 24 | void gen_random_permutation(perm_t *perm, size_t nr, size_t base); 25 | void gen_ordered_permutation(perm_t *perm, size_t nr, size_t base); 26 | int is_a_permutation(const perm_t *perm, size_t nr_elts); 27 | 28 | // regarding the mixer: suppose we have a stride of 256 ... what we want 29 | // to avoid is having the entire chase at offset 0 into the 256 byte element. 30 | // otherwise we might favour one bank/branch/etc. of the memory system. 31 | // similarly when we perform a parallel chase with multiple threads we don't 32 | // want one of the many chases to favour a particular offset into the stride. 33 | // so we have a "mixer" permutation. 34 | // 35 | // consider the arena as a large set of elements of size stride, then the naive 36 | // "mixer" would use some fixed offset into each element for a particular 37 | // (thread number, parallel chase) index. the mixer is a function on (element 38 | // index, thread number, parallel chase) which makes the offset into the element 39 | // unpredictable. 40 | // 41 | // the actual mixer is implemented as a large set of permutations on the 42 | // low bits of the element number. the details are private to the 43 | // implementation. 44 | 45 | // these are the common args required for generating all chases (and the mixer).
46 | struct generate_chase_common_args { 47 | char *arena; // memory used for all chases 48 | size_t total_memory; // size of the arena 49 | size_t stride; // size of each element 50 | size_t tlb_locality; // group accesses within this range in order to 51 | // amortize TLB fills 52 | size_t nr_mixers; // Rounded up to power of two number of mixers: 53 | // nr_threads * parallelism rounded up to power of two 54 | void (*gen_permutation)(perm_t *, size_t, 55 | size_t); // function for generating 56 | // permutations 57 | // typically gen_random_permutation 58 | size_t nr_mixer_indices; // number of mixer indices 59 | // typically stride/sizeof(void*) 60 | const perm_t *mixer; // the mixer function itself 61 | }; 62 | 63 | // create the mixer table 64 | void generate_chase_mixer(struct generate_chase_common_args *args, 65 | size_t nr_mixers); 66 | 67 | // create a chase for the given mixer_idx and return its first pointer 68 | void *generate_chase(const struct generate_chase_common_args *args, 69 | size_t mixer_idx); 70 | 71 | // create a longer chase for the given mixer_idx and total_par and 72 | // return its first pointer 73 | void *generate_chase_long(const struct generate_chase_common_args *args, 74 | size_t mixer_idx, size_t total_par); 75 | 76 | //============================================================================ 77 | // Modern multicore CPUs have increasingly large caches, so the LCRNG code 78 | // that was previously used is not sufficiently random anymore. 79 | // Now using glibc's reentrant random number generator "random_r" 80 | // still reproducible on the same platform, although not across systems/libs. 81 | 82 | // RNG_BUF_SIZE sets the size of rng_buf below, which is used by initstate_r 83 | // to decide how sophisticated a random number generator it should use: the 84 | // larger the state array, the better the random numbers will be. 85 | // 32 bytes was deemed to generate sufficient entropy. 
86 | #define RNG_BUF_SIZE 32 87 | extern __thread char *rng_buf; 88 | extern __thread struct random_data *rand_state; 89 | 90 | static inline void rng_init(unsigned thread_num) { 91 | rng_buf = (char *)calloc(1, RNG_BUF_SIZE); 92 | rand_state = (struct random_data *)calloc(1, sizeof(struct random_data)); 93 | assert(rand_state); 94 | if (initstate_r(thread_num, rng_buf, RNG_BUF_SIZE, rand_state) != 0) { 95 | perror("initstate_r"); 96 | exit(1); 97 | } 98 | } 99 | 100 | static inline perm_t rng_int(perm_t limit) { 101 | int r1, r2, r3, r4; 102 | uint64_t r; 103 | 104 | if (random_r(rand_state, &r1) || random_r(rand_state, &r2) 105 | || random_r(rand_state, &r3) || random_r(rand_state, &r4)) { 106 | perror("random_r"); 107 | exit(1); 108 | } 109 | // Assume that RAND_MAX is at least 16-bit long 110 | _Static_assert (RAND_MAX >= (1ul << 16), "RAND_MAX is too small"); 111 | r = (((uint64_t)r1 << 0) & 0x000000000000FFFFull) | 112 | (((uint64_t)r2 << 16) & 0x00000000FFFF0000ull) | 113 | (((uint64_t)r3 << 32) & 0x0000FFFF00000000ull) | 114 | (((uint64_t)r4 << 48) & 0xFFFF000000000000ull); 115 | 116 | return r % (limit + 1); 117 | } 118 | 119 | #endif 120 | -------------------------------------------------------------------------------- /pingpong.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
13 | */ 14 | #define _GNU_SOURCE 15 | #include <pthread.h> 16 | #include <sched.h> 17 | #include <stdint.h> 18 | #include <stdio.h> 19 | #include <stdlib.h> 20 | #include <string.h> 21 | #include <sys/mman.h> 22 | #include <sys/types.h> 23 | #include <unistd.h> 24 | 25 | #include "cpu_util.h" 26 | #include "timer.h" 27 | 28 | #define NR_SAMPLES (5) 29 | #define SAMPLE_US (250000) 30 | 31 | static size_t nr_relax = 10; 32 | static size_t nr_tested_cores = ~0; 33 | 34 | typedef unsigned atomic_t; 35 | 36 | // this points to the mutex which will be pingponged back and forth 37 | // from core to core. it is allocated with mmap by the even thread 38 | // so that it should be local to at least one of the two cores (and 39 | // won't have any false sharing issues). 40 | static atomic_t *pingpong_mutex; 41 | 42 | // try to avoid false sharing by padding out the atomic_t 43 | typedef union { 44 | atomic_t x; 45 | char pad[AVOID_FALSE_SHARING]; 46 | } big_atomic_t __attribute__((aligned(AVOID_FALSE_SHARING))); 47 | static big_atomic_t nr_pingpongs; 48 | 49 | // an array we optionally modify to examine the effect of passing 50 | // more dirty data between caches. 51 | size_t nr_array_elts = 0; 52 | size_t *communication_array; 53 | 54 | // 55 | static volatile int stop_loops; 56 | 57 | typedef struct { 58 | cpu_set_t cpus; 59 | atomic_t me; 60 | atomic_t buddy; 61 | } thread_args_t; 62 | 63 | static void common_setup(thread_args_t *args) { 64 | // move to our target cpu 65 | if (sched_setaffinity(0, sizeof(cpu_set_t), &args->cpus)) { 66 | perror("sched_setaffinity"); 67 | exit(1); 68 | } 69 | 70 | // test if we're supposed to allocate the pingpong_mutex memory 71 | if (args->me == 0) { 72 | pingpong_mutex = mmap(0, getpagesize(), PROT_READ | PROT_WRITE, 73 | MAP_ANON | MAP_PRIVATE, -1, 0); 74 | if (pingpong_mutex == MAP_FAILED) { 75 | perror("mmap"); 76 | exit(1); 77 | } 78 | *pingpong_mutex = args->me; 79 | } 80 | 81 | // ensure both threads are ready before we leave -- so that 82 | // both threads have a copy of pingpong_mutex.
83 | static pthread_mutex_t wait_mutex = PTHREAD_MUTEX_INITIALIZER; 84 | static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER; 85 | static int wait_for_buddy = 1; 86 | pthread_mutex_lock(&wait_mutex); 87 | if (wait_for_buddy) { 88 | wait_for_buddy = 0; 89 | pthread_cond_wait(&wait_cond, &wait_mutex); 90 | } else { 91 | wait_for_buddy = 1; // for next invocation 92 | pthread_cond_broadcast(&wait_cond); 93 | } 94 | pthread_mutex_unlock(&wait_mutex); 95 | } 96 | 97 | #define template(name, xchg) \ 98 | static void *name(void *data) { \ 99 | thread_args_t *args = (thread_args_t *)data; \ 100 | \ 101 | common_setup(args); \ 102 | \ 103 | atomic_t nr = 0; \ 104 | atomic_t me = args->me; \ 105 | atomic_t buddy = args->buddy; \ 106 | atomic_t *cache_pingpong_mutex = pingpong_mutex; \ 107 | while (1) { \ 108 | if (stop_loops) { \ 109 | pthread_exit(0); \ 110 | } \ 111 | \ 112 | if (xchg(cache_pingpong_mutex, me, buddy)) { \ 113 | for (size_t x = 0; x < nr_array_elts; ++x) { \ 114 | ++communication_array[x]; \ 115 | } \ 116 | /* don't do the atomic_add every time... it costs too much */ \ 117 | ++nr; \ 118 | if (nr == 10000 && me == 0) { \ 119 | __sync_fetch_and_add(&nr_pingpongs.x, 2 * nr); \ 120 | nr = 0; \ 121 | } \ 122 | } \ 123 | for (size_t i = 0; i < nr_relax; ++i) { \ 124 | cpu_relax(); \ 125 | } \ 126 | } \ 127 | } 128 | 129 | template(locked_loop, __sync_bool_compare_and_swap) 130 | 131 | static inline int unlocked_xchg(atomic_t *p, atomic_t old, atomic_t new) { 132 | if (*(volatile atomic_t *)p == old) { 133 | *(volatile atomic_t *)p = new; 134 | return 1; 135 | } 136 | return 0; 137 | } 138 | 139 | template(unlocked_loop, unlocked_xchg) 140 | 141 | static void *xadd_loop(void *data) { 142 | thread_args_t *args = (thread_args_t *)data; 143 | 144 | common_setup(args); 145 | uint64_t *xadder = (uint64_t *)pingpong_mutex; 146 | atomic_t me = args->me; 147 | uint64_t add_amt = (me == 0) ? 
1 : (1ull << 32); 148 | uint32_t last_lo = 0; 149 | atomic_t nr = 0; 150 | 151 | while (1) { 152 | if (stop_loops) { 153 | pthread_exit(0); 154 | } 155 | 156 | uint64_t swap = __sync_fetch_and_add(xadder, add_amt); 157 | if (me == 1 && last_lo != (uint32_t)swap) { 158 | last_lo = swap; 159 | ++nr; 160 | if (nr == 10000) { 161 | __sync_fetch_and_add(&nr_pingpongs.x, 2 * nr); 162 | nr = 0; 163 | } 164 | } 165 | for (size_t i = 0; i < nr_relax; ++i) { 166 | cpu_relax(); 167 | } 168 | } 169 | } 170 | 171 | int main(int argc, char **argv) { 172 | void *(*thread_fn)(void *data) = NULL; 173 | int c; 174 | char *p; 175 | 176 | while ((c = getopt(argc, argv, "c:lur:xs:")) != -1) { 177 | switch (c) { 178 | case 'l': 179 | if (thread_fn) goto thread_fn_error; 180 | thread_fn = locked_loop; 181 | break; 182 | case 'u': 183 | if (thread_fn) goto thread_fn_error; 184 | thread_fn = unlocked_loop; 185 | break; 186 | case 'x': 187 | if (thread_fn) goto thread_fn_error; 188 | thread_fn = xadd_loop; 189 | break; 190 | case 'r': 191 | nr_relax = strtoul(optarg, &p, 0); 192 | if (*p) { 193 | fprintf(stderr, "-r requires a numeric argument\n"); 194 | exit(1); 195 | } 196 | break; 197 | case 'c': 198 | nr_tested_cores = strtoul(optarg, &p, 0); 199 | if (*p) { 200 | fprintf(stderr, "-c requires a numeric argument\n"); 201 | exit(1); 202 | } 203 | break; 204 | case 's': 205 | nr_array_elts = strtoul(optarg, &p, 0); 206 | if (*p) { 207 | fprintf(stderr, "-s requires a numeric argument\n"); 208 | exit(1); 209 | } 210 | if (posix_memalign((void **)&communication_array, 1ull << 21, 211 | nr_array_elts * sizeof(*communication_array))) { 212 | fprintf(stderr, "posix_memalign failed\n"); 213 | exit(1); 214 | } 215 | break; 216 | default: 217 | fprintf(stderr, 218 | "usage: %s [-l | -u | -x] [-r nr_relax] [-s " 219 | "nr_array_elts_to_dirty] [-c nr_tested_cores]\n", 220 | argv[0]); 221 | exit(1); 222 | } 223 | } 224 | if (thread_fn == NULL) { 225 | thread_fn_error: 226 | fprintf(stderr, "must 
specify exactly one of -u, -l or -x\n"); 227 | exit(1); 228 | } 229 | 230 | setvbuf(stdout, NULL, _IONBF, BUFSIZ); 231 | 232 | // find the active cpus 233 | cpu_set_t cpus; 234 | if (sched_getaffinity(getpid(), sizeof(cpus), &cpus)) { 235 | perror("sched_getaffinity"); 236 | exit(1); 237 | } 238 | 239 | printf( 240 | "avg latency to communicate a modified line from one core to another\n"); 241 | printf("times are in ns\n\n"); 242 | 243 | // print top row header 244 | const int col_width = 8; 245 | size_t first_cpu = ~0; 246 | size_t last_cpu = 0; 247 | printf(" "); 248 | for (size_t j = 0; j < CPU_SETSIZE; ++j) { 249 | if (CPU_ISSET(j, &cpus)) { 250 | if (first_cpu > j) { 251 | first_cpu = j; 252 | } else { 253 | printf("%*zu", col_width, j); 254 | } 255 | if (last_cpu < j) { 256 | last_cpu = j; 257 | } 258 | } 259 | } 260 | printf("\n"); 261 | 262 | for (size_t i = 0, core = 0; i < last_cpu && core < nr_tested_cores; ++i) { 263 | if (!CPU_ISSET(i, &cpus)) { 264 | continue; 265 | } 266 | ++core; 267 | thread_args_t even; 268 | CPU_ZERO(&even.cpus); 269 | CPU_SET(i, &even.cpus); 270 | even.me = 0; 271 | even.buddy = 1; 272 | printf("%2zu:", i); 273 | for (size_t j = first_cpu + 1; j <= i; ++j) { 274 | if (CPU_ISSET(j, &cpus)) { 275 | printf("%*s", col_width, ""); 276 | } 277 | } 278 | for (size_t j = i + 1; j <= last_cpu; ++j) { 279 | if (!CPU_ISSET(j, &cpus)) { 280 | continue; 281 | } 282 | 283 | thread_args_t odd; 284 | CPU_ZERO(&odd.cpus); 285 | CPU_SET(j, &odd.cpus); 286 | odd.me = 1; 287 | odd.buddy = 0; 288 | __sync_lock_test_and_set(&nr_pingpongs.x, 0); 289 | pthread_t odd_thread; 290 | if (pthread_create(&odd_thread, NULL, thread_fn, &odd)) { 291 | perror("pthread_create odd"); 292 | exit(1); 293 | } 294 | pthread_t even_thread; 295 | if (pthread_create(&even_thread, NULL, thread_fn, &even)) { 296 | perror("pthread_create even"); 297 | exit(1); 298 | } 299 | 300 | uint64_t last_stamp = now_nsec(); 301 | double best_sample = 1. 
/ 0.; // infinity 302 | for (size_t sample_no = 0; sample_no < NR_SAMPLES; ++sample_no) { 303 | usleep(SAMPLE_US); 304 | atomic_t s = __sync_lock_test_and_set(&nr_pingpongs.x, 0); 305 | uint64_t time_stamp = now_nsec(); 306 | double sample = (time_stamp - last_stamp) / (double)s; 307 | last_stamp = time_stamp; 308 | if (sample < best_sample) { 309 | best_sample = sample; 310 | } 311 | } 312 | printf("%*.1f", col_width, best_sample); 313 | 314 | stop_loops = 1; 315 | if (pthread_join(odd_thread, NULL)) { 316 | perror("pthread_join odd_thread"); 317 | exit(1); 318 | } 319 | if (pthread_join(even_thread, NULL)) { 320 | perror("pthread_join even_thread"); 321 | exit(1); 322 | } 323 | stop_loops = 0; 324 | 325 | if (munmap(pingpong_mutex, getpagesize())) { 326 | perror("munmap"); 327 | exit(1); 328 | } 329 | pingpong_mutex = NULL; 330 | } 331 | printf("\n"); 332 | } 333 | printf("\n"); 334 | 335 | return 0; 336 | } 337 | -------------------------------------------------------------------------------- /run_multiload.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2020 Ampere Computing LLC. All Rights Reserved. 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | ARCH=$(uname -p) 16 | HOSTNAME=$(cat /etc/hostname) 17 | DATE=`date +"%Y_%m-%d_%H.%M.%S.%p"` 18 | TOPDIR=$(pwd) 19 | start_time=$(date +'%s') 20 | MULTILOAD_PATH=${TOPDIR} 21 | BINDIR="${MULTILOAD_PATH}" 22 | OUTPUT_DIR="${TOPDIR}/results" 23 | mkdir -p ${OUTPUT_DIR} 24 | LOG_FILE="${OUTPUT_DIR}/multiload_${DATE}_${HOSTNAME}_run.log" 25 | log(){ 26 | echo -e "[$(date +"%m/%d %H:%M:%S %p")] $1" | tee -a $LOG_FILE 27 | } 28 | 29 | log_without_date(){ 30 | echo -e "$1" | tee -a $LOG_FILE 31 | } 32 | 33 | ################################################################################ 34 | # Test configuration variables 35 | ################################################################################ 36 | RUN_TEST_TYPE=2 # 0=latency only, 1=load only, 2=loaded latency 37 | ITERATIONS=3 # Number of times the program is run. Due to "Samples" below, reliable data can typically be had with only 1 iteration. 38 | SAMPLES=5 # Specifies the number of data samples taken during a single run of multiload 39 | # ~2sec per sample + 4sec warmup. Duration depends on LOAD_DELAY_* defines in multiload.c 40 | # Default is to return the best latency of the samples. Command "-a" can be used to return the average. 41 | SOCKET_EVAL=1 # 1=1P testing on a 2P system. 2=2P testing on a 2P system. Does not apply to a Non-Numa system. 42 | USE_REMOTE_MEMNODE=0 # Numactl only: 0= use localalloc, 1=force SOCKET_EVAL=1 and use remote memory (ie.
2nd half of the numa nodes) 43 | THREAD_AFFINITY_ENABLED=1 # enables use of taskset/numactl for thread control 44 | MPSTAT_PROFILE_ENABLE=0 #enables mpstat data collection 45 | VMSTAT_PROFILE_ENABLE=0 #enables vmstat data collection 46 | PROFILING_INTERVAL_SEC=3 #defines sampling rate 47 | 48 | usage(){ 49 | echo "========================================================================================" 50 | echo "Multiload memory read latency, bandwidth, and loaded-latency Benchmark Test" 51 | echo "hostname is ${HOSTNAME}, $(date)" 52 | echo "========================================================================================" 53 | echo " " 54 | echo "Command args: $0 " 55 | echo "Default is: $0 $RUN_TEST_TYPE $ITERATIONS $SAMPLES $SOCKET_EVAL $USE_REMOTE_MEMNODE $THREAD_AFFINITY_ENABLED" 56 | echo " " 57 | echo "" 58 | echo " 0 = Memory Read Latency (Runs multichase \"simple\" test, but all multichase commands should work manually)" 59 | echo " 1 = Memory Bandwidth (Runs a list of bandwidth load algorithms)" 60 | echo " 2 = Loaded Latency. (\"Chaseload\" combines 1 \"simple\" latency thread and multiple Bandwidth threads)" 61 | echo " " 62 | echo "The following are optional" 63 | echo " Number of multiload test runs. (Default=$ITERATIONS)" 64 | echo " Data samples per test run. (Default=$SAMPLES)" 65 | echo " 2P system only: 1=test as 1P, 2=test as 2P. (Default=$SOCKET_EVAL)" 66 | echo " NUMA system only: 0= --localalloc, 1= -membind to 2nd half of numa nodes (Default=$USE_REMOTE_MEMNODE)" 67 | echo " 0=no affinity, 1=use taskset for No-Numa and use numactl for Numa systems (Default=$THREAD_AFFINITY_ENABLED)" 68 | echo " " 69 | echo "========================================================================================" 70 | echo "Chase algorithm list. Issue multiload -h command for full list.
The 2 used by this script are:" 71 | echo " simple - randomized pointer chaser latency" 72 | echo " chaseload - Runs 1 thread of \"simple\" latency with multiple threads using the loads below." 73 | echo " " 74 | echo "Load algorithm list to test various rd/wr ratios. More algorithms can easily be added to multiload.c. Current algorithms are:" 75 | echo " memcpy-libc 1:1 rd:wr ratio - glibc memcpy()" 76 | echo " memset-libc 0:1 rd:wr ratio - glibc memset() non-zero data" 77 | echo " memsetz-libc 0:1 rd:wr ratio - glibc memset() zero data" 78 | echo " stream-copy 1:1 rd:wr ratio - lmbench stream copy instructions b[i]=a[i] (actual binary depends on compiler & -O level)" 79 | echo " stream-sum 1:0 rd:wr ratio - lmbench stream sum instructions: a[i]+=1 (actual binary depends on compiler & -O level)" 80 | echo " stream-triad 2:1 rd:wr ratio - lmbench stream triad instructions: a[i]=b[i]+(scalar*c[i])" 81 | echo " " 82 | echo "*** Due to the complexity of other options, they can only be changed by editing this script" 83 | echo "========================================================================================" 84 | } 85 | 86 | if [ "$#" == "0" ] ; then 87 | usage 88 | exit 1 89 | fi 90 | if [ ! -z $1 ]; then 91 | RUN_TEST_TYPE=$1 92 | if [ ! -z $2 ]; then 93 | ITERATIONS=$2 94 | if [ ! -z $3 ]; then 95 | SAMPLES=$3 96 | if [ ! -z $4 ]; then 97 | SOCKET_EVAL=$4 98 | if [ ! -z $5 ]; then 99 | USE_REMOTE_MEMNODE=$5 100 | if [ ! -z $6 ]; then 101 | THREAD_AFFINITY_ENABLED=$6 102 | fi 103 | fi 104 | fi 105 | fi 106 | fi 107 | fi 108 | 109 | RUN_CHASE=0 110 | RUN_BANDWIDTH=1 111 | RUN_CHASE_LOADED=2 112 | if [ $RUN_TEST_TYPE = $RUN_CHASE ]; then 113 | PSTEP_START=1 # Parallel thread start value when running thread scaling tests. 114 | PSTEP_INC=4 # Parallel thread steps when running thread scaling tests. 115 | PSTEP_END=512 # Will be reduced to CPUTHREADS if CPUTHREADS < PSTEP_END.
116 | CHASE_ALGORITHM="simple" 117 | LOAD_ALGORITHM_LIST="none" 118 | RAND_STRIDE=16 #lmbench latmemrd uses 16 for simple chase. Other chase/mem sizes may need to be bigger (ie. 512). 119 | BUFLIST_TYPE=0 # 0=Use MEM_SIZE* to create a memory list, 1=use buflist_custom 120 | let MEM_SIZE_END_B=1*1024*1024*1024 121 | let MEM_SIZE_START_B=4*1024 122 | #let MEM_SIZE_START_B=MEM_SIZE_END_B 123 | buflist_custom=( $((32*1024)) $((512*1024)) $((16*1024*1024)) $((1*1024*1024*1024)) ) # 64K / 1M / 32M caches 124 | 125 | elif [ $RUN_TEST_TYPE = $RUN_BANDWIDTH ]; then 126 | PSTEP_START=1 # Parallel thread start value when running thread scaling tests. 127 | PSTEP_INC=4 # Parallel thread steps when running thread scaling tests. 128 | PSTEP_END=512 # Will be reduced to CPUTHREADS if CPUTHREADS < PSTEP_END. 129 | CHASE_ALGORITHM="none" 130 | LOAD_ALGORITHM_LIST="memcpy-libc memset-libc memsetz-libc stream-sum stream-triad" 131 | RAND_STRIDE=16 #not used for bandwidth test 132 | BUFLIST_TYPE=1 # 0=Use MEM_SIZE* to create a memory list, 1=use buflist_custom 133 | let MEM_SIZE_END_B=1*1024*1024*1024 134 | #let MEM_SIZE_START_B=4*1024 135 | let MEM_SIZE_START_B=MEM_SIZE_END_B 136 | buflist_custom=( $((32*1024)) $((512*1024)) $((16*1024*1024)) $((1*1024*1024*1024)) ) # 64K / 1M / 32M caches 137 | 138 | elif [ $RUN_TEST_TYPE = $RUN_CHASE_LOADED ]; then 139 | PSTEP_START=1 # Parallel thread start value when running thread scaling tests. 140 | PSTEP_INC=4 # Parallel thread steps when running thread scaling tests. 141 | PSTEP_END=512 # Will be reduced to CPUTHREADS if CPUTHREADS < PSTEP_END. 142 | CHASE_ALGORITHM="chaseload" 143 | LOAD_ALGORITHM_LIST="memcpy-libc memset-libc memsetz-libc stream-sum stream-triad" 144 | RAND_STRIDE=16 #lmbench latmemrd uses 16 for simple chase. Other chase/mem sizes may need to be bigger (ie. 512). 
145 | BUFLIST_TYPE=0 # 0=Use MEM_SIZE* to create a memory list, 1=use buflist_custom 146 | let MEM_SIZE_END_B=1*1024*1024*1024 147 | #let MEM_SIZE_START_B=4*1024 148 | let MEM_SIZE_START_B=MEM_SIZE_END_B 149 | buflist_custom=( $((32*1024)) $((512*1024)) $((16*1024*1024)) $((1*1024*1024*1024)) ) # 64K / 1M / 32M caches 150 | else 151 | echo "Found unknown RUN_TEST_TYPE=$RUN_TEST_TYPE" 152 | usage 153 | exit 154 | fi 155 | 156 | ################################################################################ 157 | # Functions 158 | ################################################################################ 159 | profiling_start(){ 160 | if [ "$MPSTAT_PROFILE_ENABLE" == "1" ] ; then 161 | echo "$1" >> ${LOG_MPSTATS_FILE} 162 | mpstat -P ALL $PROFILING_INTERVAL_SEC >> ${LOG_MPSTATS_FILE} 2>&1 & 163 | mpstat_pid=$! 164 | fi 165 | if [ "$VMSTAT_PROFILE_ENABLE" == "1" ] ; then 166 | echo "$1" >> ${LOG_VMSTATS_FILE} 167 | vmstat -t $PROFILING_INTERVAL_SEC >> ${LOG_VMSTATS_FILE} 2>&1 & 168 | vmstat_pid=$! 
169 | fi 170 | } 171 | 172 | profiling_end(){ 173 | # kill the profiling pids and try to hide the "terminated" messages 174 | if [ "$MPSTAT_PROFILE_ENABLE" == "1" ] ; then 175 | ( kill $mpstat_pid &> /dev/null ) & 176 | wait $mpstat_pid &> /dev/null 177 | fi 178 | if [ "$VMSTAT_PROFILE_ENABLE" == "1" ] ; then 179 | ( kill $vmstat_pid &> /dev/null ) & 180 | wait $vmstat_pid &> /dev/null 181 | fi 182 | } 183 | 184 | get_hardware_config () 185 | { 186 | log_without_date " " 187 | numactl --hardware | tee -a ${LOG_FILE} # display current NUMA & memory setup 188 | log_without_date " " 189 | 190 | phycore_num=`lscpu | grep "Core(s) per socket" | tr -d ' ' | cut -d':' -f2 2> /dev/null` 191 | core_threads=`lscpu | grep "Thread(s) per core:" | tr -d ' ' | cut -d':' -f2 2> /dev/null` 192 | cputhread_num=`lscpu | grep "CPU(s): " | head -n 1 | tr -d ' ' | cut -d ':' -f2 2> /dev/null` 193 | numa_num=`lscpu | grep "NUMA node(s)" | tr -d ' ' | cut -d':' -f2 2> /dev/null` 194 | socket_num=`lscpu | grep "Socket(s)" | tr -d ' ' | cut -d':' -f2 2> /dev/null` 195 | MEMBIND_LIST=`numactl --show 2> /dev/null | grep membind | cut -d':' -f2 2> /dev/null` 196 | let phycore_end=$phycore_num-1 197 | let ht_threads=$phycore_num*$core_threads 198 | #echo "get_hardware_config: phyend=$phycore_end, ht_t=$ht_threads" 199 | 200 | if [ -z $cputhread_num ]; then 201 | log_without_date "Can't find the CPU(s) core count, exiting" 202 | exit $? 
203 | else 204 | CPUTHREADS=$cputhread_num 205 | fi 206 | 207 | log "Found the following hardware:" 208 | log_without_date " sockets = $socket_num" 209 | log_without_date " physical cores = $phycore_num" 210 | log_without_date " threads per core= $core_threads" 211 | log_without_date " logical threads = $cputhread_num" 212 | if [ -z "$numa_num" ]; then 213 | NUMA_NODES=1 214 | log_without_date " NUMA nodes = none found" 215 | else 216 | NUMA_NODES=$numa_num 217 | log_without_date " NUMA nodes = $NUMA_NODES" 218 | fi 219 | 220 | if [ $USE_REMOTE_MEMNODE == "1" ] && [ $NUMA_NODES -gt "1" ] && [ $THREAD_AFFINITY_ENABLED == "1" ]; then 221 | SOCKET_EVAL=1 222 | if [ $numa_num == "2" ]; then 223 | NODE_MEMBIND="1" 224 | elif [ $numa_num == "4" ]; then 225 | NODE_MEMBIND="2,3" 226 | elif [ $numa_num == "8" ]; then 227 | NODE_MEMBIND="4,5,6,7" 228 | else 229 | NODE_MEMBIND="0" 230 | fi 231 | fi 232 | 233 | if [ $socket_num -eq "1" ]; then 234 | #Check if this is only a 1P box force SOCKET_EVAL=1 235 | SOCKET_EVAL=1 236 | elif [ $socket_num -ge "2" ] && [ $SOCKET_EVAL -eq "1" ]; then 237 | #Check if doing 1P only testing on a 2P+ box and adjust CPUTHREADS for 1P 238 | let CPUTHREADS=$cputhread_num/$socket_num 239 | fi 240 | 241 | if [ $PSTEP_END -gt $CPUTHREADS ]; then 242 | PSTEP_END=$CPUTHREADS 243 | fi 244 | } 245 | 246 | duration(){ # calculates duration in secs 247 | duration=$SECONDS 248 | log_without_date 249 | log "$1 runtime: $(($duration / 3600)) hrs, $((($duration % 3600) / 60)) mins, $(($duration % 60)) secs" 250 | } 251 | 252 | #converts an array into string and deletes spaces (can also add delimiters using $1) 253 | join_ws() { local d=$1 s=$2; shift 2 && printf %s "$s${@/#/$d}"; } 254 | 255 | create_taskset_cpulist_x86_64() 256 | { 257 | let cpu_max_4bits=$1/4+1 #need +1 in case cputhreads is not a multiple of 4. 
258 | #create base string arrays 259 | for (( cpu=0; cpu&1 470 | done 471 | elif [ $NUMA_NODES -eq 1 ] ; then 472 | profiling_start "Run $ITERATIONS iterations, taskset ${cpulist[$t-1]} $BASE_MULTILOAD_CMD -t $t -m $j ${LOAD_COMMAND}" 473 | for i in $(seq 1 $ITERATIONS) ; do 474 | log "Iteration $i of $ITERATIONS, taskset ${cpulist[$t-1]} $BASE_MULTILOAD_CMD -t $t -m $j ${LOAD_COMMAND}" 475 | taskset ${cpulist[$t-1]} ${BASE_MULTILOAD_CMD} -t $t -m $j ${LOAD_COMMAND} | tee -a ${OUTPUT_DIR}/$FILENAME.txt 2>&1 476 | done 477 | else 478 | profiling_start "Run $ITERATIONS iterations, numactl ${MEMBIND_COMMAND} -C ${cpulist[$t-1]} $BASE_MULTILOAD_CMD -t $t -m $j ${LOAD_COMMAND}" 479 | for i in $(seq 1 $ITERATIONS) ; do 480 | log "Run iter $i/$ITERATIONS, numactl ${MEMBIND_COMMAND} -C ${cpulist[$t-1]} $BASE_MULTILOAD_CMD -t $t -m $j ${LOAD_COMMAND}" 481 | numactl ${MEMBIND_COMMAND} -C ${cpulist[$t-1]} ${BASE_MULTILOAD_CMD} -t $t -m $j ${LOAD_COMMAND} | tee -a ${OUTPUT_DIR}/$FILENAME.txt 2>&1 482 | done 483 | fi 484 | profiling_end &> /dev/null 485 | duration "Total" 486 | done 487 | done 488 | done 489 | } 490 | 491 | parse(){ 492 | first=1 493 | rm -f out.txt 494 | while IFS= read -r line; do 495 | #Only keep 1st header line 496 | if [ "$first" -eq "1" ]; then 497 | echo "$line" > out.txt 498 | first=0 499 | elif [[ ! 
$line =~ "ample" ]]; then 500 | echo "$line" >> out.txt 501 | fi 502 | done < "${OUTPUT_DIR}/$1.txt" 503 | 504 | # Delete all spaces and tabs. 505 | tr -d '[:blank:]' < out.txt > "${OUTPUT_DIR}/$1.csv" 506 | rm -f out.txt 507 | } 508 | 509 | ################################################################################ 510 | # Main 511 | ################################################################################ 512 | get_hardware_config 513 | create_thdlist 514 | if [ "$BUFLIST_TYPE" == "0" ]; then 515 | create_buffer_list_lmbench 516 | else 517 | create_buffer_list_custom 518 | fi 519 | 520 | if [ "$CHASE_ALGORITHM" == "none" ]; then 521 | BASE_MULTILOAD_CMD="${BINDIR}/multiload -s ${RAND_STRIDE} -T 16g -n ${SAMPLES}" 522 | FILENAME="multiload_${DATE}_-s_${RAND_STRIDE}_-T_16g_-n_${SAMPLES}_-m_${bufsize_testlist[0]}-${bufsize_testlist[-1]}_-t_${thdcount_testlist[0]}-${thdcount_testlist[-1]}_EVAL_${SOCKET_EVAL}P" 523 | else 524 | BASE_MULTILOAD_CMD="${BINDIR}/multiload -s ${RAND_STRIDE} -T 16g -n ${SAMPLES} -c ${CHASE_ALGORITHM}" 525 | FILENAME="multiload_${DATE}_-s_${RAND_STRIDE}_-T_16g_-n_${SAMPLES}_-c_${CHASE_ALGORITHM}_-m_${bufsize_testlist[0]}-${bufsize_testlist[-1]}_-t_${thdcount_testlist[0]}-${thdcount_testlist[-1]}_EVAL_${SOCKET_EVAL}P" 526 | fi 527 | if [ "$USE_REMOTE_MEMNODE" == "1" ] && [ "$NUMA_NODES" -gt "1" ] && [ "$THREAD_AFFINITY_ENABLED" == "1" ]; then 528 | FILENAME="${FILENAME}_REMOTE" 529 | fi 530 | LOG_VMSTATS_FILE="${OUTPUT_DIR}/multiload_${DATE}_${HOSTNAME}_vmstats.log" 531 | LOG_MPSTATS_FILE="${OUTPUT_DIR}/multiload_${DATE}_${HOSTNAME}_mpstats.log" 532 | mpstat_pid="" 533 | vmstat_pid="" 534 | 535 | log_without_date "Test Parameters" 536 | log_without_date " Date: $DATE" 537 | log_without_date " Output Directory: $OUTPUT_DIR" 538 | log_without_date " Data File (txt): $FILENAME.txt" 539 | log_without_date " Data File (csv): $FILENAME.csv" 540 | log_without_date " Log File: $LOG_FILE" 541 | if [ "$MPSTAT_PROFILE_ENABLE" -eq 1 ]; then 542 | log_without_date " Mpstat File: $LOG_MPSTATS_FILE"
543 | fi 544 | if [ "$VMSTAT_PROFILE_ENABLE" -eq 1 ]; then 545 | log_without_date " Vmstat File: $LOG_VMSTATS_FILE" 546 | fi 547 | log_without_date " Iterations: $ITERATIONS" 548 | log_without_date " Thread List: ${thdcount_testlist[*]}" 549 | log_without_date " Mem Buf List: ${bufsize_testlist[*]}" 550 | log_without_date " Random Stride: $RAND_STRIDE" 551 | if [ "$THREAD_AFFINITY_ENABLED" == "0" ]; then 552 | log_without_date " Thread affinity disabled" 553 | run_test "Thread affinity disabled" 554 | elif [ "$NUMA_NODES" == "1" ] ; then 555 | log_without_date " Numa runs: No" 556 | if [ "$ARCH" == "x86_64" ] ; then 557 | create_taskset_cpulist_x86_64 "$CPUTHREADS" 558 | else 559 | create_taskset_cpulist_arm64 "$CPUTHREADS" 560 | fi 561 | run_test "NUMA=${NUMA_NODES}" 562 | else 563 | if [ "$ARCH" == "x86_64" ] ; then 564 | create_numactl_cpulist_x86_64 "$CPUTHREADS" 565 | else 566 | create_numactl_cpulist_arm64 "$CPUTHREADS" 567 | fi 568 | if [ "$USE_REMOTE_MEMNODE" == "1" ]; then 569 | #MEMBIND_COMMAND="-m ${NODE_MEMBIND}" # --membind fills the 1st node before spilling to the next, so bandwidth is limited to a single node's DDR. 570 | #MEMBIND_COMMAND_TEXT="m${NODE_MEMBIND}" 571 | MEMBIND_COMMAND="-i ${NODE_MEMBIND}" # --interleave round-robins allocation across the nodes, giving higher aggregate DDR bandwidth.
572 | MEMBIND_COMMAND_TEXT="i${NODE_MEMBIND}" 573 | else 574 | MEMBIND_COMMAND="--localalloc" # --localalloc allocates memory on the same node as the thread that calls malloc(). 575 | MEMBIND_COMMAND_TEXT="localalloc" 576 | fi 577 | log_without_date " Numa runs: Yes" 578 | log_without_date " Numa nodes: $MEMBIND_LIST" 579 | log_without_date " Remote Memory: USE_REMOTE_MEMNODE=$USE_REMOTE_MEMNODE, MEMBIND_COMMAND=$MEMBIND_COMMAND" 580 | log_without_date " Socket eval: Testing cores from $SOCKET_EVAL out of $socket_num sockets" 581 | run_test "NUMA=${NUMA_NODES}" 582 | fi 583 | 584 | parse "$FILENAME" 585 | finish_time=$(date +'%s') 586 | log_without_date 587 | log "Total eval runtime = $((($finish_time-$start_time) / 3600)) hrs.. $(((($finish_time-$start_time) % 3600) / 60)) mins.. $((($finish_time-$start_time) % 60)) secs.." 588 | log_without_date 589 | exit 0 590 | -------------------------------------------------------------------------------- /timer.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License.
13 | */ 14 | #ifndef TIMER_H_INCLUDED 15 | #define TIMER_H_INCLUDED 16 | 17 | #include <stdint.h> 18 | #include <sys/time.h> 19 | #include <time.h> 20 | 21 | static inline uint64_t now_nsec(void) { 22 | struct timespec ts; 23 | clock_gettime(CLOCK_MONOTONIC, &ts); 24 | return ts.tv_sec * ((uint64_t)1000 * 1000 * 1000) + ts.tv_nsec; 25 | } 26 | 27 | #endif 28 | -------------------------------------------------------------------------------- /util.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #include "util.h" 15 | 16 | #include <stdlib.h> 17 | 18 | int parse_mem_arg(const char *str, size_t *result) { 19 | size_t r; 20 | char *p; 21 | 22 | r = strtoull(str, &p, 0); 23 | switch (*p) { 24 | case 'k': 25 | case 'K': 26 | r *= 1024; 27 | ++p; 28 | break; 29 | case 'm': 30 | case 'M': 31 | r *= 1024 * 1024; 32 | ++p; 33 | break; 34 | case 'g': 35 | case 'G': 36 | r *= 1024 * 1024 * 1024; 37 | ++p; 38 | break; 39 | } 40 | if (*p) { 41 | return -1; 42 | } 43 | *result = r; 44 | return 0; 45 | } 46 | -------------------------------------------------------------------------------- /util.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved.
2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #ifndef UTIL_H_INCLUDED 15 | #define UTIL_H_INCLUDED 16 | 17 | #include <stddef.h> 18 | 19 | int parse_mem_arg(const char *str, size_t *result); 20 | 21 | #endif 22 | --------------------------------------------------------------------------------