├── .gitignore ├── CONTRIBUTING ├── LICENSE ├── Makefile ├── README ├── _clang-format ├── arena.c ├── arena.h ├── br_asm.c ├── br_asm.h ├── cpu_util.h ├── fairness.c ├── gen_expand ├── multichase.c ├── multiload.c ├── permutation.c ├── permutation.h ├── pingpong.c ├── run_multiload.sh ├── timer.h ├── util.c └── util.h /.gitignore: -------------------------------------------------------------------------------- 1 | /*.o 2 | /expand.h 3 | 4 | /fairness 5 | /multichase 6 | /multiload 7 | /pingpong 8 | -------------------------------------------------------------------------------- /CONTRIBUTING: -------------------------------------------------------------------------------- 1 | Want to contribute? Great! First, read this page (including the small print at the end). 2 | 3 | ### Before you contribute 4 | Before we can use your code, you must sign the 5 | [Google Individual Contributor License Agreement] 6 | (https://cla.developers.google.com/about/google-individual) 7 | (CLA), which you can do online. The CLA is necessary mainly because you own the 8 | copyright to your changes, even after your contribution becomes part of our 9 | codebase, so we need your permission to use and distribute your code. We also 10 | need to be sure of various other things—for instance that you'll tell us if you 11 | know that your code infringes on other people's patents. You don't have to sign 12 | the CLA until after you've submitted your code for review and a member has 13 | approved it, but you must do it before we can put your code into our codebase. 14 | Before you start working on a larger contribution, you should get in touch with 15 | us first through the issue tracker with your idea so that we can help out and 16 | possibly guide you. Coordinating up front makes it much easier to avoid 17 | frustration later on. 18 | 19 | ### Code reviews 20 | All submissions, including submissions by project members, require review. We 21 | use Github pull requests for this purpose. 
22 | 23 | ### The small print 24 | Contributions made by corporations are covered by a different agreement than 25 | the one above, the 26 | [Software Grant and Corporate Contributor License Agreement] 27 | (https://cla.developers.google.com/about/google-corporate). 28 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | 3 | Version 2.0, January 2004 4 | 5 | http://www.apache.org/licenses/ 6 | 7 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 8 | 9 | 1. Definitions. 10 | 11 | "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. 12 | "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. 13 | "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. 14 | "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. 15 | "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. 16 | "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. 
17 | "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). 18 | "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. 19 | "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." 20 | "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 21 | 22 | 2. Grant of Copyright License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 23 | 24 | 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 25 | 26 | 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: 27 | You must give any other recipients of the Work or Derivative Works a copy of this License; and 28 | You must cause any modified files to carry prominent notices stating that You changed the files; and 29 | You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and 30 | If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. 31 | You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 
32 | 33 | 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 34 | 35 | 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 36 | 37 | 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 38 | 39 | 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 40 | 41 | 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. 42 | 43 | END OF TERMS AND CONDITIONS 44 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Google Inc. All Rights Reserved. 2 | # Licensed under the Apache License, Version 2.0 (the "License"); 3 | # you may not use this file except in compliance with the License. 
4 | # You may obtain a copy of the License at 5 | # 6 | # http://www.apache.org/licenses/LICENSE-2.0 7 | # 8 | # Unless required by applicable law or agreed to in writing, software 9 | # distributed under the License is distributed on an "AS IS" BASIS, 10 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | # See the License for the specific language governing permissions and 12 | # limitations under the License. 13 | # 14 | CFLAGS=-std=gnu99 -g -O3 -fomit-frame-pointer -fno-unroll-loops -Wall -Wstrict-prototypes -Wmissing-prototypes -Wshadow -Wmissing-declarations -Wnested-externs -Wpointer-arith -W -Wno-unused-parameter -Werror -pthread -Wno-tautological-compare 15 | LDFLAGS=-g -O3 -static -pthread 16 | LDLIBS=-lrt -lm 17 | 18 | ARCH ?= $(shell uname -m) 19 | 20 | ifeq ($(ARCH),aarch64) 21 | CAP ?= $(shell cat /proc/cpuinfo | grep -E 'atomics|sve' | head -1) 22 | ifneq (,$(findstring sve,$(CAP))) 23 | CFLAGS+=-march=armv8.2-a+sve 24 | else ifneq (,$(findstring atomics,$(CAP))) 25 | CFLAGS+=-march=armv8.1-a+lse 26 | endif 27 | endif 28 | 29 | EXE=multichase multiload fairness pingpong 30 | 31 | all: $(EXE) 32 | 33 | clean: 34 | rm -f $(EXE) *.o expand.h 35 | 36 | .c.s: 37 | $(CC) $(CFLAGS) -S -c $< 38 | 39 | multichase: multichase.o permutation.o arena.o br_asm.o util.o 40 | 41 | multiload: multiload.o permutation.o arena.o util.o 42 | 43 | fairness: LDLIBS += -lm 44 | 45 | expand.h: gen_expand 46 | ./gen_expand 200 >expand.h.tmp 47 | mv expand.h.tmp expand.h 48 | 49 | depend: 50 | makedepend -Y -- $(CFLAGS) -- *.c 51 | 52 | # DO NOT DELETE 53 | 54 | arena.o: arena.h 55 | multichase.o: cpu_util.h timer.h expand.h permutation.h arena.h util.h 56 | multiload.o: cpu_util.h timer.h expand.h permutation.h arena.h util.h 57 | permutation.o: permutation.h 58 | util.o: util.h 59 | fairness.o: cpu_util.h expand.h timer.h 60 | pingpong.o: cpu_util.h timer.h 61 | -------------------------------------------------------------------------------- 
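The aarch64 block in the Makefile above selects `-march` flags by grepping /proc/cpuinfo. Below is a standalone sketch of that same selection logic; the `MARCH` variable and the empty fallback are illustrative, not part of the Makefile:

```shell
# Probe CPU features the way the Makefile's CAP variable does
# (empty on hosts without /proc/cpuinfo, taking the fallback branch).
CAP=$(grep -E 'atomics|sve' /proc/cpuinfo 2>/dev/null | head -1)

# Mirror the Makefile's findstring checks: SVE takes priority over LSE atomics.
case "$CAP" in
  *sve*)     MARCH="armv8.2-a+sve" ;;
  *atomics*) MARCH="armv8.1-a+lse" ;;
  *)         MARCH="" ;;  # no extra -march flags added
esac
echo "march: ${MARCH:-none}"
```

Because `CAP ?=` is a conditional assignment, the probe can also be overridden at invocation time, e.g. `make ARCH=aarch64 CAP=atomics`.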
/README: -------------------------------------------------------------------------------- 1 | Multichase - a pointer chaser benchmark 2 | Multiload - a superset of multichase which runs latency, memory bandwidth, and loaded-latency 3 | 4 | 1/ BUILD 5 | 6 | - just type: 7 | 8 | $ make 9 | 10 | 2/ INSTALL 11 | 12 | - just run from current directory or copy multichase wherever you need to 13 | 14 | 3.1/ RUN Multichase 15 | 16 | - To get help 17 | 18 | $ multichase -h 19 | 20 | - By default, multichase will perform a pointer chase through an array 21 | size of 256MB and a stride size of 256 bytes for 2.5 seconds on a single 22 | thread: 23 | 24 | $ multichase 25 | 26 | - Pointer chase through an array of 4MB with a stride size of 64 bytes: 27 | 28 | $ multichase -m 4m -s 64 29 | 30 | - Pointer chase through an array of 1GB for 10 seconds (-n is the number of 0.5 second samples): 31 | 32 | $ multichase -m 1g -n 20 33 | 34 | - Pointer chase through an array of 256KB with a stride size of 128 bytes on 2 threads. 35 | Thread 0 accesses every 128th byte, thread 1 accesses every 128th byte offset by sizeof(void*)=8 36 | on 64bit architectures: 37 | 38 | $ multichase -m 256k -s 128 -t 2 39 | 40 | 3.2/ RUN Multiload 41 | 42 | - Latency Only (simple pointer chase) 43 | In this mode, Multiload can run any of the multichase commands above. 44 | A "-c" chase arg (other than chaseload) can be used or it will default to "simple". 45 | Using either "-c chaseload" and/or the "-l" load arguments will choose a different test mode. 46 | 47 | $ multiload 48 | 49 | - Bandwidth Only 50 | Multiload can run a memory bandwidth test using the "-l" load argument. The "-c" chase argument MUST NOT be used. 51 | Below command runs 5 samples (~2.5 seconds each), using 16 threads, using the glibc memcpy() function, 52 | using a 512M buffer per thread. 53 | 54 | $ multiload -n 5 -t 16 -m 512M -l memcpy-libc 55 | 56 | - Loaded Latency. 
57 | Multiload can run 1 pointer chaser thread on logical cpu0 with multiple memory bandwidth load threads. 58 | The "-c chaseload" arg MUST be used. The "-l" arg MUST be used with one of the memory load arguments. 59 | Below command runs 5 samples (~2.5 seconds each), on 16 threads (1 chase, 15 stream-sum bandwidth loads), 60 | using a 512M buffer per thread. The chase thread uses a stride=16. 61 | 62 | $ multiload -s 16 -n 5 -t 16 -m 512M -c chaseload -l stream-sum 63 | 64 | 3.3/ RUN Pingpong & fairness 65 | 66 | - Pingpong: measure latency of exchanging a line between cores. 67 | To run, simply do: 68 | $ pingpong -u 69 | 70 | - Fairness: measure fairness with N threads competing to increment an atomic variable. 71 | To run, simply do: 72 | $ fairness 73 | -------------------------------------------------------------------------------- /_clang-format: -------------------------------------------------------------------------------- 1 | BasedOnStyle: Google 2 | IndentWidth: 2 3 | Language: Cpp 4 | -------------------------------------------------------------------------------- /arena.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
13 |  */
14 | #include "arena.h"
15 | 
16 | #include <linux/mempolicy.h>
17 | #include <stdbool.h>
18 | #include <stdint.h>
19 | #include <stdio.h>
20 | #include <stdlib.h>
21 | #include <string.h>
22 | #include <sys/mman.h>
23 | #include <sys/syscall.h>
24 | #include <unistd.h>
25 | 
26 | #include "permutation.h"
27 | 
28 | extern int verbosity;
29 | extern int is_weighted_mbind;
30 | extern uint16_t mbind_weights[MAX_MEM_NODES];
31 | 
32 | size_t get_native_page_size(void) {
33 |   long sz;
34 | 
35 |   sz = sysconf(_SC_PAGESIZE);
36 |   if (sz < 0) {
37 |     perror("failed to get native page size");
38 |     exit(1);
39 |   }
40 | 
41 |   return (size_t)sz;
42 | }
43 | 
44 | bool page_size_is_huge(size_t page_size) {
45 |   return page_size > get_native_page_size();
46 | }
47 | 
48 | void print_page_size(size_t page_size, bool use_thp) {
49 |   FILE *f;
50 |   size_t read;
51 |   /* Big enough to fit UINT64_MAX + '\n' + '\0'. */
52 |   char buf[22];
53 | 
54 |   if (!use_thp) {
55 |     printf("page_size = %zu bytes\n", page_size);
56 |     return;
57 |   }
58 | 
59 |   f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
60 |   if (!f) goto err;
61 | 
62 |   read = fread(buf, 1, sizeof(buf) - 1, f);
63 |   if (!feof(f)) goto err;
64 | 
65 |   if (fclose(f)) goto err;
66 | 
67 |   if (read && buf[read - 1] == '\n') --read;
68 |   buf[read] = '\0';
69 | 
70 |   printf("page_size = %s bytes (THP)\n", buf);
71 |   return;
72 | 
73 | err:
74 |   perror(
75 |       "page_size = <unknown>"
76 |       " bytes (THP)");
77 | }
78 | 
79 | static inline int mbind(void *addr, unsigned long len, int mode,
80 |                         unsigned long *nodemask, unsigned long maxnode,
81 |                         unsigned flags) {
82 |   return syscall(__NR_mbind, addr, len, mode, nodemask, maxnode, flags);
83 | }
84 | 
85 | static void arena_weighted_mbind(size_t page_size, void *arena,
86 |                                  size_t arena_size, uint16_t *weights,
87 |                                  size_t nr_weights) {
88 |   /* compute cumulative sum for weights
89 |    * cumulative sum starts at -1
90 |    * the method for determining a hit on a weight i is when the generated
91 |    * random number (modulo sum of weights) <= weights_cumsum[i]
92 |    */
93 |   int64_t *weights_cumsum =
      malloc(nr_weights * sizeof(int64_t));
94 |   if (!weights_cumsum) {
95 |     fprintf(stderr, "Couldn't allocate memory for weights.\n");
96 |     exit(1);
97 |   }
98 |   weights_cumsum[0] = weights[0] - 1;
99 |   for (unsigned int i = 1; i < nr_weights; i++) {
100 |     weights_cumsum[i] = weights_cumsum[i - 1] + weights[i];
101 |   }
102 |   const int32_t weight_sum = weights_cumsum[nr_weights - 1] + 1;
103 | 
104 |   uint64_t mask = 0;
105 |   char *q = (char *)arena + arena_size;
106 |   rng_init(1);
107 |   for (char *p = arena; p < q; p += page_size) {
108 |     uint32_t r = rng_int(1U << 31) % weight_sum;
109 |     unsigned int node;
110 |     for (node = 0; node < nr_weights; node++) {
111 |       if (weights_cumsum[node] >= r) {
112 |         break;
113 |       }
114 |     }
115 |     mask = 1ULL << node; /* 64-bit shift: node may exceed 31 */
116 |     if (mbind(p, page_size, MPOL_BIND, &mask, nr_weights, MPOL_MF_STRICT)) {
117 |       perror("mbind");
118 |       exit(1);
119 |     }
120 |     *p = 0;
121 |   }
122 |   free(weights_cumsum);
123 | }
124 | 
125 | static int get_page_size_flags(size_t page_size) {
126 |   int lg = 0;
127 | 
128 |   if (!page_size || (page_size & (page_size - 1))) {
129 |     fprintf(stderr, "page size must be a power of 2: %zu\n", page_size);
130 |     exit(1);
131 |   }
132 | 
133 |   if (!page_size_is_huge(page_size)) {
134 |     return 0;
135 |   }
136 | 
137 |   /*
138 |    * We need not just MAP_HUGETLB, but also a flag specifying the page size.
139 |    * mmap(2) says that these flags are defined as:
140 |    * log2(page size) << MAP_HUGE_SHIFT.
141 |    */
142 |   while (page_size >>= 1) {
143 |     ++lg;
144 |   }
145 |   return MAP_HUGETLB | (lg << MAP_HUGE_SHIFT);
146 | }
147 | 
148 | /*
149 |  * Reads a "state" file from sysfs at the given path, and returns the current
150 |  * state. The caller must free() the returned pointer when finished with it.
151 |  *
152 |  * The file must be formatted like this:
153 |  *
154 |  *   state1 state2 [state3] state4
155 |  *
156 |  * The state surrounded by []s is the currently active one. It is returned
157 |  * as-is, including the surrounding []s.
158 | */ 159 | static char *read_sysfs_state_file(char const *path) { 160 | FILE *f = fopen(path, "r"); 161 | char *token = NULL; 162 | int ret; 163 | 164 | if (f == NULL) { 165 | perror("open sysfs state file"); 166 | exit(1); 167 | } 168 | 169 | while ((ret = fscanf(f, "%ms", &token)) == 1) { 170 | if (token[0] == '[') break; 171 | 172 | free(token); 173 | token = NULL; 174 | } 175 | 176 | if (ferror(f) && !feof(f)) { 177 | perror("read sysfs state file"); 178 | exit(1); 179 | } 180 | 181 | if (fclose(f)) { 182 | perror("close sysfs state file"); 183 | exit(1); 184 | } 185 | 186 | return token; 187 | } 188 | 189 | static void write_sysfs_file(char const *path, char const *value) { 190 | FILE *f = fopen(path, "w"); 191 | 192 | if (f == NULL) { 193 | perror("open sysfs file for write"); 194 | exit(1); 195 | } 196 | 197 | if (fprintf(f, "%s\n", value) < 0) { 198 | perror("write value to sysfs file"); 199 | exit(1); 200 | } 201 | 202 | if (fclose(f)) { 203 | perror("close sysfs file"); 204 | exit(1); 205 | } 206 | } 207 | 208 | /* 209 | * In order for MADV_HUGEPAGE to work, THP configuration must be in one of 210 | * several acceptable states. Check if the existing system configuration is 211 | * acceptable, and if not, try to change the configuration. 
212 | */ 213 | static void check_thp_state(void) { 214 | char *enabled = 215 | read_sysfs_state_file("/sys/kernel/mm/transparent_hugepage/enabled"); 216 | char *defrag = 217 | read_sysfs_state_file("/sys/kernel/mm/transparent_hugepage/defrag"); 218 | 219 | if (strcmp(enabled, "[always]") && strcmp(enabled, "[madvise]")) { 220 | write_sysfs_file("/sys/kernel/mm/transparent_hugepage/enabled", "madvise"); 221 | } 222 | 223 | if (strcmp(defrag, "[always]") && strcmp(defrag, "[defer+madvise]") && 224 | strcmp(defrag, "[madvise]")) { 225 | write_sysfs_file("/sys/kernel/mm/transparent_hugepage/defrag", "madvise"); 226 | } 227 | 228 | free(enabled); 229 | free(defrag); 230 | } 231 | 232 | void *alloc_arena_mmap(size_t page_size, bool use_thp, size_t arena_size, int fd) { 233 | void *arena; 234 | size_t pagemask = page_size - 1; 235 | int flags; 236 | 237 | if (fd == -1) 238 | flags = MAP_PRIVATE | MAP_ANONYMOUS | get_page_size_flags(page_size); 239 | else 240 | flags = MAP_SHARED; 241 | 242 | arena_size = (arena_size + pagemask) & ~pagemask; 243 | arena = mmap(0, arena_size, PROT_READ | PROT_WRITE, flags, fd, 0); 244 | if (arena == MAP_FAILED) { 245 | perror("mmap"); 246 | exit(1); 247 | } 248 | 249 | if (use_thp && fd == -1) 250 | check_thp_state(); 251 | 252 | /* Explicitly disable THP for small pages. */ 253 | if (!page_size_is_huge(page_size) && fd == -1) { 254 | if (madvise(arena, arena_size, use_thp ? 
MADV_HUGEPAGE : MADV_NOHUGEPAGE)) { 255 | perror("madvise"); 256 | } 257 | } else if (use_thp) { 258 | fprintf(stderr, 259 | "Can't use transparent hugepages with a non-native page size.\n"); 260 | exit(1); 261 | } 262 | 263 | if (is_weighted_mbind) { 264 | arena_weighted_mbind(page_size, arena, arena_size, mbind_weights, 265 | MAX_MEM_NODES); 266 | } 267 | return arena; 268 | } 269 | 270 | void make_buffer_executable(void *buf, size_t len) { 271 | if (mprotect(buf, len, PROT_EXEC)) { 272 | perror("mprotect(PROT_EXEC)"); 273 | exit(1); 274 | } 275 | } 276 | 277 | -------------------------------------------------------------------------------- /arena.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
13 |  */
14 | #ifndef ARENA_H_INCLUDED
15 | #define ARENA_H_INCLUDED
16 | 
17 | #include <stdbool.h>
18 | #include <stddef.h>
19 | #include <stdint.h>
20 | 
21 | #define MAX_MEM_NODES (8 * sizeof(uint64_t))
22 | 
23 | size_t get_native_page_size(void);
24 | bool page_size_is_huge(size_t page_size);
25 | void print_page_size(size_t page_size, bool use_thp);
26 | 
27 | void *alloc_arena_mmap(size_t page_size, bool use_thp, size_t arena_size, int fd);
28 | 
29 | void make_buffer_executable(void *buf, size_t len);
30 | #endif
31 | 
-------------------------------------------------------------------------------- /br_asm.c: --------------------------------------------------------------------------------
1 | #include <math.h>
2 | #include <stdint.h>
3 | #include <stdio.h>
4 | #include <stdlib.h>
5 | 
6 | #include "br_asm.h"
7 | 
8 | static int cycle_len(void *p) {
9 |   int count = 0;
10 |   char *next = (char *)p;
11 |   do {
12 |     ++count;
13 |     next = *(char **)next;
14 |   } while (next != p);
15 |   return count;
16 | }
17 | 
18 | #if defined(__aarch64__)
19 | 
20 | static uint64_t rbits(uint64_t val, int rmask) {
21 |   return val & ((1ULL << rmask) - 1);
22 | }
23 | 
24 | static uint64_t rmask_lsh(uint64_t val, int rmask, int left_shift) {
25 |   return (val & ((1ULL << rmask) - 1)) << left_shift;
26 | }
27 | 
28 | static void arm_emit_br(char *p, int rs) {
29 |   *(int *)p = 0b11010110000111110000000000000000 | rmask_lsh(rs, 5, 5);
30 | }
31 | 
32 | static void arm_emit_movk(char *p, int rd, int imm16, int shift16) {
33 |   *(int *)p = (0b111100101 << 23) | rmask_lsh(shift16, 16, 21) |
34 |               rmask_lsh(imm16, 16, 5) | rmask_lsh(rd, 5, 0);
35 | }
36 | 
37 | static void arm_emit_movz(char *p, int rd, int imm16, int shift16) {
38 |   *(int *)p = (0b110100101 << 23) | rmask_lsh(shift16, 16, 21) |
39 |               rmask_lsh(imm16, 16, 5) | rmask_lsh(rd, 5, 0);
40 | }
41 | 
42 | static void arm_emit_ret(char *p) { *(int *)p = 0xd65f03c0; }
43 | 
44 | // Convert a pointer chase to a branch chase, returning after chunk_size
45 | // branches with a function pointer to the next branch in the chase.
46 | int convert_pointers_to_branches(void *head, int chunk_size) { 47 | int remain = cycle_len(head); 48 | chunk_size = (remain < chunk_size) 49 | ? remain 50 | : remain / (1 << lround(log2(1.0 * remain / chunk_size))); 51 | int base_chunk_size = chunk_size; 52 | int chunks_remaining = remain / chunk_size; 53 | int chunk_count = 0; 54 | const int ptr_reg = 0; // Address of next branch and return value. 55 | char *p = (char *)head; 56 | do { 57 | if (!chunk_count) chunk_count = remain / chunks_remaining; 58 | char *next = *((char **)p); 59 | const int br_code_len = 16; 60 | for (int i = 8; i < br_code_len; i++) { 61 | if (p[i]) { 62 | fprintf(stderr, "not enough space to convert a pointer to branches"); 63 | exit(1); 64 | } 65 | } 66 | // VA is 48 bits max. 67 | uint64_t shift00 = rbits((intptr_t)next, 16); 68 | uint64_t shift16 = rbits((intptr_t)next >> 16, 16); 69 | uint64_t shift32 = rbits((intptr_t)next >> 32, 16); 70 | arm_emit_movz(p, ptr_reg, shift00, 0); 71 | arm_emit_movk(p + 4, ptr_reg, shift16, 1); 72 | arm_emit_movk(p + 8, ptr_reg, shift32, 2); 73 | // At end of branch chunk, return with a pointer to the next chunk. 
74 | --remain; 75 | if (--chunk_count == 0) { 76 | arm_emit_ret(p + 12); 77 | --chunks_remaining; 78 | } else { 79 | arm_emit_br(p + 12, ptr_reg); 80 | } 81 | p = next; 82 | } while (p != head); 83 | return base_chunk_size; 84 | } 85 | 86 | #elif defined(__x86_64__) 87 | 88 | static char *x64_emit_mov_imm64_rax(char *p, uint64_t imm64) { 89 | *p++ = 0x48; // movabs 90 | *p++ = 0xb8; 91 | for (int i = 0; i < 8; i++) 92 | *p++ = imm64 >> (8 * i); // immediate value (8 bytes) 93 | return p; 94 | } 95 | 96 | static char *x64_emit_jmp_to_rax(char *p) { 97 | *p++ = 0xff; // jmp *rax 98 | *p++ = 0xe0; 99 | return p; 100 | } 101 | 102 | static char *x64_emit_ret(char *p) { 103 | *p++ = 0xc3; 104 | return p; 105 | } 106 | 107 | // Convert a pointer chase to a branch chase, returning after chunk_size 108 | // branches with a function pointer to the next branch in the chase. 109 | int convert_pointers_to_branches(void *head, int chunk_size) { 110 | int remain = cycle_len(head); 111 | chunk_size = (remain < chunk_size) 112 | ? remain 113 | : remain / (1 << lround(log2(1.0 * remain / chunk_size))); 114 | int base_chunk_size = chunk_size; 115 | int chunks_remaining = remain / chunk_size; 116 | int chunk_count = 0; 117 | const int br_code_len = 12; // len(mov_imm64) + max(len(jmp), len(ret)) 118 | char *p = (char *)head; 119 | do { 120 | if (!chunk_count) chunk_count = remain / chunks_remaining; 121 | char *next = *((char **)p); 122 | for (int i = 8; i < br_code_len; i++) { 123 | if (p[i]) { 124 | fprintf(stderr, "not enough space to convert a pointer to branches\n"); 125 | exit(1); 126 | } 127 | } 128 | p = x64_emit_mov_imm64_rax(p, (intptr_t)next); 129 | // At end of branch chunk, return with a pointer to the next chunk. 
130 | --remain; 131 | if (--chunk_count == 0) { 132 | p = x64_emit_ret(p); 133 | --chunks_remaining; 134 | } else { 135 | p = x64_emit_jmp_to_rax(p); 136 | } 137 | p = next; 138 | } while (p != head); 139 | return base_chunk_size; 140 | } 141 | 142 | #else 143 | int convert_pointers_to_branches(void *head, int chunk_size) { 144 | fprintf(stderr, "Not implemented on this architecture.\n"); 145 | exit(1); 146 | } 147 | #endif 148 | -------------------------------------------------------------------------------- /br_asm.h: -------------------------------------------------------------------------------- 1 | #ifndef ASM_H_INCLUDED 2 | #define ASM_H_INCLUDED 3 | 4 | int convert_pointers_to_branches(void *head, int chunk_size); 5 | #endif 6 | -------------------------------------------------------------------------------- /cpu_util.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #ifndef PORT_H_INCLUDED 15 | #define PORT_H_INCLUDED 16 | 17 | // We assume 1024 bytes is good enough alignment to avoid false sharing on all 18 | // architectures. 
19 | #define AVOID_FALSE_SHARING (1024) 20 | #ifndef CACHELINE_SIZE 21 | #define CACHELINE_SIZE 64 22 | #endif 23 | #ifndef SWEEP_MAX 24 | #define SWEEP_MAX 256 25 | #endif 26 | #ifndef SWEEP_SPACER 27 | #define SWEEP_SPACER (CACHELINE_SIZE - sizeof(unsigned)) 28 | #endif 29 | 30 | #if defined(__x86_64__) || defined(__i386__) 31 | static inline void cpu_relax(void) { asm volatile("rep; nop"); } 32 | #elif defined __powerpc__ 33 | static inline void cpu_relax(void) { 34 | // HMT_low() 35 | asm volatile("or 1,1,1"); 36 | // HMT_medium() 37 | asm volatile("or 2,2,2"); 38 | // barrier() 39 | asm volatile("" : : : "memory"); 40 | } 41 | #elif defined(__aarch64__) 42 | #define cpu_relax() asm volatile("yield" ::: "memory") 43 | #else 44 | #warning "no cpu_relax for your cpu" 45 | #define cpu_relax() \ 46 | do { \ 47 | } while (0) 48 | #endif 49 | 50 | #endif 51 | -------------------------------------------------------------------------------- /fairness.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
13 | */ 14 | #define _GNU_SOURCE 15 | #include <math.h> 16 | #include <pthread.h> 17 | #include <sched.h> 18 | #include <stdint.h> 19 | #include <stdio.h> 20 | #include <stdlib.h> 21 | #include <string.h> 22 | #include <sys/types.h> 23 | #include <unistd.h> 24 | 25 | #include "cpu_util.h" 26 | #include "expand.h" 27 | #include "timer.h" 28 | 29 | typedef unsigned atomic_t; 30 | 31 | typedef union { 32 | struct { 33 | atomic_t count; 34 | int cpu; 35 | } x; 36 | char pad[AVOID_FALSE_SHARING]; 37 | } per_thread_t; 38 | 39 | typedef struct { 40 | struct { 41 | atomic_t count; 42 | char spacer[SWEEP_SPACER]; 43 | } x[SWEEP_MAX]; 44 | int sweep_id; 45 | } sync_count; 46 | 47 | sync_count global_counter; 48 | 49 | static volatile int relaxed; 50 | 51 | static pthread_mutex_t wait_mutex = PTHREAD_MUTEX_INITIALIZER; 52 | static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER; 53 | static size_t nr_to_startup; 54 | static uint64_t delay_mask; 55 | 56 | static void wait_for_startup(void) { 57 | // wait for everyone to spawn 58 | pthread_mutex_lock(&wait_mutex); 59 | --nr_to_startup; 60 | if (nr_to_startup) { 61 | pthread_cond_wait(&wait_cond, &wait_mutex); 62 | } else { 63 | pthread_cond_broadcast(&wait_cond); 64 | } 65 | pthread_mutex_unlock(&wait_mutex); 66 | } 67 | 68 | static void *worker(void *_args) { 69 | per_thread_t *args = _args; 70 | 71 | // move to our target cpu 72 | cpu_set_t cpu; 73 | CPU_ZERO(&cpu); 74 | CPU_SET(args->x.cpu, &cpu); 75 | if (sched_setaffinity(0, sizeof(cpu), &cpu)) { 76 | perror("sched_setaffinity"); 77 | exit(1); 78 | } 79 | 80 | wait_for_startup(); 81 | 82 | if (delay_mask & ((uint64_t)1 << args->x.cpu)) { 83 | sleep(1); 84 | } 85 | while (!relaxed) { 86 | atomic_t *target = &(global_counter.x[global_counter.sweep_id].count); 87 | x50(__sync_fetch_and_add(target, 1);); 88 | __sync_fetch_and_add(&args->x.count, 50); 89 | } 90 | if (delay_mask & ((uint64_t)1 << args->x.cpu)) { 91 | sleep(1); 92 | } 93 | while (relaxed) { 94 | atomic_t *target = &(global_counter.x[global_counter.sweep_id].count); 95 | x50(__sync_fetch_and_add(target, 1); cpu_relax();); 96
| __sync_fetch_and_add(&args->x.count, 50); 97 | } 98 | return NULL; 99 | } 100 | 101 | int main(int argc, char **argv) { 102 | int c; 103 | int sweep_max = 1; 104 | size_t time_slice = 500000; 105 | char sep = ' '; 106 | size_t sample_size = 6; 107 | 108 | delay_mask = 0; 109 | while ((c = getopt(argc, argv, "d:n:s:t:S:")) != -1) { 110 | switch (c) { 111 | case 'd': 112 | delay_mask = strtoul(optarg, 0, 0); 113 | break; 114 | case 'n': 115 | sample_size = strtof(optarg, 0); 116 | break; 117 | case 's': 118 | sweep_max = strtoul(optarg, 0, 0); 119 | break; 120 | case 't': 121 | time_slice = strtof(optarg, 0) * 1000000; 122 | break; 123 | case 'S': 124 | sep = *optarg; 125 | break; 126 | default: 127 | goto usage; 128 | } 129 | } 130 | 131 | if (argc - optind != 0) { 132 | usage: 133 | fprintf( 134 | stderr, 135 | "usage: %s \n" 136 | " [-d delay_mask]\n" 137 | " [-n sample_nr]\n" 138 | " [-s sweep_max]\n" 139 | " [-t time]\n" 140 | "By default runs one thread on each cpu, use taskset(1) to " 141 | "restrict operation to fewer cpus/threads.\n" 142 | "The optional delay_mask specifies a mask of cpus on which to delay " 143 | "the startup.\n" 144 | "The optional sample_nr determines the number of samples taken for " 145 | "each test point.\n" 146 | "The optional sweep_max causes testing across multiple different " 147 | "cache lines.\n" 148 | "The optional time determines how often to poll results (float in " 149 | "seconds).\n", 150 | argv[0]); 151 | exit(1); 152 | } 153 | 154 | setvbuf(stdout, NULL, _IONBF, BUFSIZ); 155 | 156 | // find the active cpus 157 | cpu_set_t cpus; 158 | if (sched_getaffinity(getpid(), sizeof(cpus), &cpus)) { 159 | perror("sched_getaffinity"); 160 | exit(1); 161 | } 162 | 163 | // could do this more efficiently, but whatever 164 | size_t nr_threads = 0; 165 | int i; 166 | for (i = 0; i < CPU_SETSIZE; ++i) { 167 | if (CPU_ISSET(i, &cpus)) { 168 | ++nr_threads; 169 | } 170 | } 171 | 172 | per_thread_t *thread_args = calloc(nr_threads, 
sizeof(*thread_args)); 173 | nr_to_startup = nr_threads + 1; 174 | size_t u; 175 | i = 0; 176 | for (u = 0; u < nr_threads; ++u) { 177 | while (!CPU_ISSET(i, &cpus)) { 178 | ++i; 179 | } 180 | thread_args[u].x.cpu = i; 181 | ++i; 182 | thread_args[u].x.count = 0; 183 | pthread_t dummy; 184 | if (pthread_create(&dummy, NULL, worker, &thread_args[u])) { 185 | perror("pthread_create"); 186 | exit(1); 187 | } 188 | } 189 | 190 | wait_for_startup(); 191 | 192 | atomic_t *samples = calloc(nr_threads, sizeof(*samples)); 193 | 194 | printf( 195 | "results are avg latency per locked increment in ns, one column per " 196 | "thread\n"); 197 | char *fmt, *tail = ""; 198 | if (sep == ',') { 199 | printf("relaxed,sweep"); 200 | fmt = ",cpu-%u"; 201 | tail = ",avg,stdev,min,max"; 202 | } else { 203 | fmt = "%6u "; 204 | printf("cpu:"); 205 | } 206 | for (u = 0; u < nr_threads; ++u) { 207 | printf(fmt, thread_args[u].x.cpu); 208 | } 209 | printf("%s\n", tail); 210 | global_counter.sweep_id = 0; 211 | for (relaxed = 0; relaxed < 2; ++relaxed) { 212 | if (sep != ',') printf(relaxed ? "relaxed:\n" : "unrelaxed:\n"); 213 | for (int sweep = 0; sweep < sweep_max; sweep++) { 214 | global_counter.sweep_id = sweep; 215 | uint64_t last_stamp = now_nsec(); 216 | size_t sample_nr; 217 | // The first sample is thrown out, so increment sample_size by 1 to get 218 | // the requested number of samples. 
219 | for (sample_nr = 0; sample_nr < sample_size + 1; ++sample_nr) { 220 | double min = 1.0 / 0., max = 0.; 221 | usleep(time_slice); 222 | for (u = 0; u < nr_threads; ++u) { 223 | samples[u] = __sync_lock_test_and_set(&thread_args[u].x.count, 0); 224 | } 225 | uint64_t stamp = now_nsec(); 226 | int64_t time_delta = stamp - last_stamp; 227 | last_stamp = stamp; 228 | 229 | // throw away the first sample to avoid race issues at startup / mode 230 | // switch 231 | if (sample_nr == 0) continue; 232 | if (sep == ',') 233 | printf("%d,%p", relaxed, 234 | &(global_counter.x[global_counter.sweep_id].count)); 235 | 236 | if (sep == ',') { 237 | fmt = ",%.1f"; 238 | } else { 239 | fmt = " %6.1f"; 240 | printf(" "); 241 | } 242 | double sum = 0.; 243 | double sum_squared = 0.; 244 | for (u = 0; u < nr_threads; ++u) { 245 | double s = time_delta / (double)samples[u]; 246 | printf(fmt, s); 247 | min = min < s ? min : s; 248 | max = max > s ? max : s; 249 | sum += s; 250 | sum_squared += s * s; 251 | } 252 | if (sep == ',') { 253 | fmt = ",%.1f,%.1f,%.1f,%.1f\n"; 254 | } else { 255 | fmt = " : avg %6.1f sdev %6.1f min %6.1f max %6.1f\n"; 256 | } 257 | printf(fmt, sum / nr_threads, 258 | sqrt((sum_squared - sum * sum / nr_threads) / (nr_threads - 1)), 259 | min, max); 260 | fflush(stdout); 261 | } 262 | } 263 | } 264 | return 0; 265 | } 266 | -------------------------------------------------------------------------------- /gen_expand: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | # Copyright 2015 Google Inc. All Rights Reserved. 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | use warnings; 16 | use strict; 17 | 18 | defined(my $n = shift) 19 | or die "usage: $0 n\n"; 20 | 21 | # special cases i can't be bothered to handle otherwise 22 | print "#define x1(x) x\n"; 23 | print "#define x2(x) x x\n"; 24 | print "#define x3(x) x2(x) x\n"; 25 | 26 | sub find_largest_factors($) { 27 | my ($x) = @_; 28 | for (my $y = int(sqrt($x)); $y >= 2; --$y) { 29 | if ($x % $y == 0) { 30 | my $z = $x/$y; 31 | return "x$y(x$z(x))"; 32 | } 33 | } 34 | return undef; 35 | } 36 | 37 | foreach my $i (4..$n) { 38 | if (my $factors = find_largest_factors($i)) { 39 | print "#define x$i(x) $factors\n"; 40 | } 41 | else { 42 | my $factors = find_largest_factors($i-1); 43 | print "#define x$i(x) $factors x\n"; 44 | } 45 | } 46 | -------------------------------------------------------------------------------- /multichase.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #define _GNU_SOURCE 15 | #include <errno.h> 16 | #include <fcntl.h> 17 | #include <pthread.h> 18 | #include <sched.h> 19 | #include <stdbool.h> 20 | #include <stdint.h> 21 | #include <stdio.h> 22 | #include <stdlib.h> 23 | #include <string.h> 24 | #include <sys/time.h> 25 | #include <sys/types.h> 26 | #include <unistd.h> 27 | 28 | #include "arena.h" 29 | #include "br_asm.h" 30 | #include "cpu_util.h" 31 | #include "expand.h" 32 | #include "permutation.h" 33 | #include "timer.h" 34 | #include "util.h" 35 | 36 | // The total memory, stride, and TLB locality have been chosen carefully for 37 | // the current generation of CPUs: 38 | // 39 | // - at stride of 64 bytes the L2 next-line prefetch on p-m/core/core2 gives a 40 | //   helping hand 41 | // 42 | // - at stride of 128 bytes the stream prefetcher on various p4 decides the 43 | //   random accesses sometimes look like a stream and gives a helping hand. 44 | // 45 | // - the TLB locality could have been raised beyond 4 pages to defeat various 46 | //   stream prefetchers, but you need to get out well past 32 pages before 47 | //   all existing hw prefetchers are defeated, and then you start exceeding the 48 | //   TLB locality on several CPUs and incurring some TLB overhead. 49 | //   Hence, the default has been changed from 16 pages to 64 pages.
50 | // 51 | #define DEF_TOTAL_MEMORY ((size_t)256 * 1024 * 1024) 52 | #define DEF_STRIDE ((size_t)256) 53 | #define DEF_NR_SAMPLES ((size_t)5) 54 | #define DEF_TLB_LOCALITY ((size_t)64) 55 | #define DEF_NR_THREADS ((size_t)1) 56 | #define DEF_CACHE_FLUSH ((size_t)64 * 1024 * 1024) 57 | #define DEF_OFFSET ((size_t)0) 58 | #define DELAY_USECS 500000 59 | #define NSECS_PER_USEC 1000 60 | #define NSECS_PER_MSEC (1000 * 1000) 61 | 62 | int verbosity; 63 | int print_timestamp; 64 | int is_weighted_mbind; 65 | uint16_t mbind_weights[MAX_MEM_NODES]; 66 | 67 | #ifdef __i386__ 68 | #define MAX_PARALLEL (6) // maximum number of chases in parallel 69 | #else 70 | #define MAX_PARALLEL (10) 71 | #endif 72 | 73 | // forward declare 74 | typedef struct chase_t chase_t; 75 | 76 | // the arguments for the chase threads 77 | typedef union { 78 | char pad[AVOID_FALSE_SHARING]; 79 | struct { 80 | unsigned thread_num; // which thread is this 81 | unsigned count; // count of number of iterations 82 | void *cycle[MAX_PARALLEL]; // initial address for the chases 83 | const char *extra_args; 84 | int dummy; // useful for confusing the compiler 85 | 86 | const struct generate_chase_common_args *genchase_args; 87 | size_t nr_threads; 88 | const chase_t *chase; 89 | void *flush_arena; 90 | size_t cache_flush_size; 91 | bool use_longer_chase; 92 | int branch_chunk_size; 93 | } x; 94 | } per_thread_t; 95 | 96 | int always_zero; 97 | 98 | static void chase_simple(per_thread_t *t) { 99 | void *p = t->x.cycle[0]; 100 | 101 | do { 102 | x200(p = *(void **)p;) 103 | } while (__sync_add_and_fetch(&t->x.count, 200)); 104 | 105 | // we never actually reach here, but the compiler doesn't know that 106 | t->x.dummy = (uintptr_t)p; 107 | } 108 | 109 | // parallel chases 110 | 111 | #define declare(i) void *p##i = start[i]; 112 | #define cleanup(i) tmp += (uintptr_t)p##i; 113 | #if MAX_PARALLEL == 6 114 | #define parallel(foo) foo(0) foo(1) foo(2) foo(3) foo(4) foo(5) 115 | #else 116 | #define 
parallel(foo) \ 117 | foo(0) foo(1) foo(2) foo(3) foo(4) foo(5) foo(6) foo(7) foo(8) foo(9) 118 | #endif 119 | 120 | #define template(n, expand, inner) \ 121 | static void chase_parallel##n(per_thread_t *t) { \ 122 | void **start = t->x.cycle; \ 123 | parallel(declare) do { x##expand(inner) } \ 124 | while (__sync_add_and_fetch(&t->x.count, n * expand)) \ 125 | ; \ 126 | \ 127 | uintptr_t tmp = 0; \ 128 | parallel(cleanup) t->x.dummy = tmp; \ 129 | } 130 | 131 | #if defined(__x86_64__) || defined(__i386__) 132 | #define D(n) asm volatile("mov (%1),%0" : "=r"(p##n) : "r"(p##n)); 133 | #else 134 | #define D(n) p##n = *(void **)p##n; 135 | #endif 136 | template(2, 100, D(0) D(1)); 137 | template(3, 66, D(0) D(1) D(2)); 138 | template(4, 50, D(0) D(1) D(2) D(3)); 139 | template(5, 40, D(0) D(1) D(2) D(3) D(4)); 140 | template(6, 32, D(0) D(1) D(2) D(3) D(4) D(5)); 141 | #if MAX_PARALLEL > 6 142 | template(7, 28, D(0) D(1) D(2) D(3) D(4) D(5) D(6)); 143 | template(8, 24, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7)); 144 | template(9, 22, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7) D(8)); 145 | template(10, 20, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7) D(8) D(9)); 146 | #endif 147 | #undef D 148 | #undef parallel 149 | #undef cleanup 150 | #undef declare 151 | 152 | static void chase_work(per_thread_t *t) { 153 | void *p = t->x.cycle[0]; 154 | size_t extra_work = strtoul(t->x.extra_args, 0, 0); 155 | size_t work = 0; 156 | size_t i; 157 | 158 | // the extra work is intended to be overlapped with a dereference, 159 | // but we don't want it to skip past the next dereference. so 160 | // we fold in the value of the pointer, and launch the deref then 161 | // go into a loop performing extra work, hopefully while the 162 | // deref occurs. 
163 | do { 164 | x25(work += (uintptr_t)p; p = *(void **)p; 165 | for (i = 0; i < extra_work; ++i) { work ^= i; }) 166 | } while (__sync_add_and_fetch(&t->x.count, 25)); 167 | 168 | // we never actually reach here, but the compiler doesn't know that 169 | t->x.cycle[0] = p; 170 | t->x.dummy = work; 171 | } 172 | 173 | struct incr_struct { 174 | struct incr_struct *next; 175 | unsigned incme; 176 | }; 177 | 178 | static void chase_incr(per_thread_t *t) { 179 | struct incr_struct *p = t->x.cycle[0]; 180 | 181 | do { 182 | x50(++p->incme; p = *(void **)p;) 183 | } while (__sync_add_and_fetch(&t->x.count, 50)); 184 | 185 | // we never actually reach here, but the compiler doesn't know that 186 | t->x.cycle[0] = p; 187 | } 188 | 189 | static void chase_branch(per_thread_t *t) { 190 | void *p = t->x.cycle[0]; 191 | typedef void* (*fptr)(void); 192 | fptr fp = p; 193 | 194 | do { 195 | fp = (fptr)(fp)(); 196 | } while (__sync_add_and_fetch(&t->x.count, t->x.branch_chunk_size)); 197 | 198 | // we never actually reach here, but the compiler doesn't know that 199 | t->x.dummy = (uintptr_t)p; 200 | } 201 | 202 | #if defined(__x86_64__) || defined(__i386__) 203 | #define chase_prefetch(type) \ 204 | static void chase_prefetch##type(per_thread_t *t) { \ 205 | void *p = t->x.cycle[0]; \ 206 | \ 207 | do { \ 208 | x100(asm volatile("prefetch" #type " %0" ::"m"(*(void **)p)); \ 209 | p = *(void **)p;) \ 210 | } while (__sync_add_and_fetch(&t->x.count, 100)); \ 211 | \ 212 | /* we never actually reach here, but the compiler doesn't know that */ \ 213 | t->x.cycle[0] = p; \ 214 | } 215 | chase_prefetch(t0); 216 | chase_prefetch(t1); 217 | chase_prefetch(t2); 218 | chase_prefetch(nta); 219 | #undef chase_prefetch 220 | #endif 221 | 222 | #if defined(__x86_64__) 223 | static void chase_movdqa(per_thread_t *t) { 224 | void *p = t->x.cycle[0]; 225 | 226 | do { 227 | x100(asm volatile("\n movdqa (%%rax),%%xmm0" 228 | "\n movdqa 16(%%rax),%%xmm1" 229 | "\n paddq %%xmm1,%%xmm0" 230 | "\n 
movdqa 32(%%rax),%%xmm2" 231 | "\n paddq %%xmm2,%%xmm0" 232 | "\n movdqa 48(%%rax),%%xmm3" 233 | "\n paddq %%xmm3,%%xmm0" 234 | "\n movq %%xmm0,%%rax" 235 | : "=a"(p) 236 | : "0"(p));) 237 | } while (__sync_add_and_fetch(&t->x.count, 100)); 238 | t->x.cycle[0] = p; 239 | } 240 | 241 | static void chase_movntdqa(per_thread_t *t) { 242 | void *p = t->x.cycle[0]; 243 | 244 | do { 245 | x100(asm volatile( 246 | #ifndef BINUTILS_HAS_MOVNTDQA 247 | "\n .byte 0x66,0x0f,0x38,0x2a,0x00" 248 | "\n .byte 0x66,0x0f,0x38,0x2a,0x48,0x10" 249 | "\n paddq %%xmm1,%%xmm0" 250 | "\n .byte 0x66,0x0f,0x38,0x2a,0x50,0x20" 251 | "\n paddq %%xmm2,%%xmm0" 252 | "\n .byte 0x66,0x0f,0x38,0x2a,0x58,0x30" 253 | "\n paddq %%xmm3,%%xmm0" 254 | "\n movq %%xmm0,%%rax" 255 | #else 256 | "\n movntdqa (%%rax),%%xmm0" 257 | "\n movntdqa 16(%%rax),%%xmm1" 258 | "\n paddq %%xmm1,%%xmm0" 259 | "\n movntdqa 32(%%rax),%%xmm2" 260 | "\n paddq %%xmm2,%%xmm0" 261 | "\n movntdqa 48(%%rax),%%xmm3" 262 | "\n paddq %%xmm3,%%xmm0" 263 | "\n movq %%xmm0,%%rax" 264 | #endif 265 | : "=a"(p) 266 | : "0"(p));) 267 | } while (__sync_add_and_fetch(&t->x.count, 100)); 268 | t->x.cycle[0] = p; 269 | } 270 | 271 | static void chase_critword2(per_thread_t *t) { 272 | void *p = t->x.cycle[0]; 273 | size_t offset = strtoul(t->x.extra_args, 0, 0); 274 | void *q = (char *)p + offset; 275 | 276 | do { 277 | x100(asm volatile("mov (%1),%0" 278 | : "=r"(p) 279 | : "r"(p)); 280 | asm volatile("mov (%1),%0" 281 | : "=r"(q) 282 | : "r"(q));) 283 | } while (__sync_add_and_fetch(&t->x.count, 100)); 284 | 285 | t->x.cycle[0] = (void *)((uintptr_t)p + (uintptr_t)q); 286 | } 287 | #endif 288 | 289 | struct chase_t { 290 | void (*fn)(per_thread_t *t); 291 | size_t base_object_size; 292 | const char *name; 293 | const char *usage1; 294 | const char *usage2; 295 | int requires_arg; 296 | unsigned parallelism; // number of parallel chases (at least 1) 297 | }; 298 | static const chase_t chases[] = { 299 | // the default must be first 300 | { 
301 | .fn = chase_simple, 302 | .base_object_size = sizeof(void *), 303 | .name = "simple", 304 | .usage1 = "simple", 305 | .usage2 = "no frills pointer dereferencing", 306 | .requires_arg = 0, 307 | .parallelism = 1, 308 | }, 309 | { 310 | .fn = chase_work, 311 | .base_object_size = sizeof(void *), 312 | .name = "work", 313 | .usage1 = "work:N", 314 | .usage2 = "loop simple computation N times in between derefs", 315 | .requires_arg = 1, 316 | .parallelism = 1, 317 | }, 318 | { 319 | .fn = chase_incr, 320 | .base_object_size = sizeof(struct incr_struct), 321 | .name = "incr", 322 | .usage1 = "incr", 323 | .usage2 = "modify the cache line after each deref", 324 | .requires_arg = 0, 325 | .parallelism = 1, 326 | }, 327 | { 328 | .fn = chase_branch, 329 | .base_object_size = 16, 330 | .name = "branch", 331 | .usage1 = "branch", 332 | .usage2 = "convert pointers to branches", 333 | .requires_arg = 0, 334 | .parallelism = 1, 335 | }, 336 | #if defined(__x86_64__) || defined(__i386__) 337 | #define chase_prefetch(type) \ 338 | { \ 339 | .fn = chase_prefetch##type, .base_object_size = sizeof(void *), \ 340 | .name = #type, .usage1 = #type, \ 341 | .usage2 = "perform prefetch" #type " before each deref", \ 342 | .requires_arg = 0, .parallelism = 1, \ 343 | } 344 | chase_prefetch(t0), 345 | chase_prefetch(t1), 346 | chase_prefetch(t2), 347 | chase_prefetch(nta), 348 | #endif 349 | #if defined(__x86_64__) 350 | { 351 | .fn = chase_movdqa, 352 | .base_object_size = 64, 353 | .name = "movdqa", 354 | .usage1 = "movdqa", 355 | .usage2 = "use movdqa to read from memory", 356 | .requires_arg = 0, 357 | .parallelism = 1, 358 | }, 359 | { 360 | .fn = chase_movntdqa, 361 | .base_object_size = 64, 362 | .name = "movntdqa", 363 | .usage1 = "movntdqa", 364 | .usage2 = "use movntdqa to read from memory", 365 | .requires_arg = 0, 366 | .parallelism = 1, 367 | }, 368 | #endif 369 | #define PAR(n) \ 370 | { \ 371 | .fn = chase_parallel##n, .base_object_size = sizeof(void *), \ 372 | .name 
= "parallel" #n, .usage1 = "parallel" #n, \ 373 | .usage2 = "alternate " #n " non-dependent chases in each thread", \ 374 | .parallelism = n, \ 375 | } 376 | PAR(2), 377 | PAR(3), 378 | PAR(4), 379 | PAR(5), 380 | PAR(6), 381 | #if MAX_PARALLEL > 6 382 | PAR(7), 383 | PAR(8), 384 | PAR(9), 385 | PAR(10), 386 | #endif 387 | #undef PAR 388 | #if defined(__x86_64__) 389 | { 390 | .fn = chase_critword2, 391 | .base_object_size = 64, 392 | .name = "critword2", 393 | .usage1 = "critword2:N", 394 | .usage2 = "a two-parallel chase which reads at X and X+N", 395 | .requires_arg = 1, 396 | .parallelism = 1, 397 | }, 398 | #endif 399 | { 400 | .fn = chase_simple, 401 | .base_object_size = 64, 402 | .name = "critword", 403 | .usage1 = "critword:N", 404 | .usage2 = "a non-parallel chase which reads at X and X+N", 405 | .requires_arg = 1, 406 | .parallelism = 1, 407 | }, 408 | }; 409 | 410 | static pthread_mutex_t wait_mutex = PTHREAD_MUTEX_INITIALIZER; 411 | static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER; 412 | static size_t nr_to_startup; 413 | static int set_thread_affinity = 1; 414 | 415 | static void *thread_start(void *data) { 416 | per_thread_t *args = data; 417 | 418 | // ensure every thread has a different RNG 419 | rng_init(args->x.thread_num); 420 | 421 | if (set_thread_affinity) { 422 | // find out which cpus we can run on and move us to an appropriate cpu 423 | cpu_set_t cpus; 424 | if (sched_getaffinity(0, sizeof(cpus), &cpus)) { 425 | perror("sched_getaffinity"); 426 | exit(1); 427 | } 428 | int my_cpu; 429 | unsigned num = args->x.thread_num; 430 | for (my_cpu = 0; my_cpu < CPU_SETSIZE; ++my_cpu) { 431 | if (!CPU_ISSET(my_cpu, &cpus)) continue; 432 | if (num == 0) break; 433 | --num; 434 | } 435 | if (my_cpu == CPU_SETSIZE) { 436 | fprintf(stderr, "error: more threads than cpus available\n"); 437 | exit(1); 438 | } 439 | CPU_ZERO(&cpus); 440 | CPU_SET(my_cpu, &cpus); 441 | if (sched_setaffinity(0, sizeof(cpus), &cpus)) { 442 | 
perror("sched_setaffinity"); 443 | exit(1); 444 | } 445 | } 446 | 447 | // generate chases -- using a different mixer index for every 448 | // thread and for every parallel chase within a thread 449 | unsigned parallelism = args->x.chase->parallelism; 450 | for (unsigned par = 0; par < parallelism; ++par) { 451 | if (args->x.use_longer_chase) { 452 | args->x.cycle[par] = generate_chase_long(args->x.genchase_args, 453 | parallelism * args->x.thread_num + par, 454 | parallelism * args->x.nr_threads); 455 | } else { 456 | args->x.cycle[par] = generate_chase(args->x.genchase_args, 457 | parallelism * args->x.thread_num + par); 458 | } 459 | } 460 | 461 | // handle critword2 chases 462 | if (strcmp(args->x.chase->name, "critword2") == 0) { 463 | size_t offset = strtoul(args->x.extra_args, 0, 0); 464 | char *p = args->x.cycle[0]; 465 | char *q = p; 466 | do { 467 | char *next = *(char **)p; 468 | *(void **)(p + offset) = next + offset; 469 | p = next; 470 | } while (p != q); 471 | } 472 | 473 | // handle critword chases 474 | if (strcmp(args->x.chase->name, "critword") == 0) { 475 | size_t offset = strtoul(args->x.extra_args, 0, 0); 476 | char *p = args->x.cycle[0]; 477 | char *q = p; 478 | do { 479 | char *next = *(char **)p; 480 | *(void **)(p + offset) = next; 481 | *(void **)p = p + offset; 482 | p = next; 483 | } while (p != q); 484 | } 485 | 486 | // handle branch chases 487 | if (!strcmp(args->x.chase->name, "branch")) { 488 | void *p = args->x.cycle[0]; 489 | args->x.branch_chunk_size = convert_pointers_to_branches(p, 200); 490 | #if defined(__aarch64__) 491 | __builtin___clear_cache( 492 | args->x.genchase_args->arena, 493 | args->x.genchase_args->arena + args->x.genchase_args->total_memory); 494 | #endif 495 | } 496 | 497 | // now flush our caches 498 | if (args->x.cache_flush_size) { 499 | size_t nr_elts = args->x.cache_flush_size / sizeof(size_t); 500 | size_t *p = args->x.flush_arena; 501 | size_t sum = 0; 502 | while (nr_elts) { 503 | sum += *p; 504 | ++p; 
505 | --nr_elts; 506 | } 507 | args->x.dummy += sum; 508 | } 509 | 510 | // wait and/or wake up everyone if we're all ready 511 | pthread_mutex_lock(&wait_mutex); 512 | if (nr_to_startup == 1) { 513 | if (!strcmp(args->x.chase->name, "branch")) { 514 | // Change buffer to executable before letting the chases run. 515 | make_buffer_executable(args->x.genchase_args->arena, 516 | args->x.genchase_args->total_memory); 517 | } 518 | } 519 | --nr_to_startup; 520 | if (nr_to_startup) { 521 | pthread_cond_wait(&wait_cond, &wait_mutex); 522 | } else { 523 | pthread_cond_broadcast(&wait_cond); 524 | } 525 | pthread_mutex_unlock(&wait_mutex); 526 | 527 | args->x.chase->fn(data); 528 | return NULL; 529 | } 530 | 531 | static void timestamp(void) { 532 | if (!print_timestamp) return; 533 | struct timeval tv; 534 | gettimeofday(&tv, NULL); 535 | printf("%.6f ", tv.tv_sec + tv.tv_usec / 1000000.); 536 | } 537 | 538 | int main(int argc, char **argv) { 539 | char *p; 540 | int c; 541 | size_t i; 542 | size_t default_page_size = get_native_page_size(); 543 | size_t page_size = default_page_size; 544 | bool use_thp = false; 545 | bool use_malloc = false; 546 | bool use_longer_chase = false; 547 | size_t nr_threads = DEF_NR_THREADS; 548 | size_t nr_samples = DEF_NR_SAMPLES; 549 | size_t cache_flush_size = DEF_CACHE_FLUSH; 550 | size_t offset = DEF_OFFSET; 551 | int print_average = 0; 552 | const char *extra_args = NULL; 553 | const char *chase_optarg = chases[0].name; 554 | const chase_t *chase = &chases[0]; 555 | struct generate_chase_common_args genchase_args; 556 | int fd = -1; 557 | uint64_t duration_nsecs = -1; 558 | 559 | genchase_args.total_memory = DEF_TOTAL_MEMORY; 560 | genchase_args.stride = DEF_STRIDE; 561 | genchase_args.tlb_locality = DEF_TLB_LOCALITY * default_page_size; 562 | genchase_args.gen_permutation = gen_random_permutation; 563 | 564 | setvbuf(stdout, NULL, _IOLBF, BUFSIZ); 565 | 566 | while ((c = getopt(argc, argv, "ac:F:p:HLm:Nn:oO:S:s:T:t:vXyW:f:M")) != -1) { 
567 | switch (c) { 568 | case 'a': 569 | print_average = 1; 570 | break; 571 | case 'c': 572 | chase_optarg = optarg; 573 | p = strchr(optarg, ':'); 574 | if (p == NULL) p = optarg + strlen(optarg); 575 | for (i = 0; i < sizeof(chases) / sizeof(chases[0]); ++i) { 576 | if (strncmp(optarg, chases[i].name, p - optarg) == 0) { 577 | break; 578 | } 579 | } 580 | if (i == sizeof(chases) / sizeof(chases[0])) { 581 | fprintf(stderr, "not a recognized chase name: %s\n", optarg); 582 | goto usage; 583 | } 584 | chase = &chases[i]; 585 | if (chase->requires_arg) { 586 | if (p[0] != ':' || p[1] == 0) { 587 | fprintf(stderr, "that chase requires an argument:\n-c %s\t%s\n", 588 | chase->usage1, chase->usage2); 589 | exit(1); 590 | } 591 | extra_args = p + 1; 592 | } else if (*p != 0) { 593 | fprintf(stderr, "that chase does not take an argument:\n-c %s\t%s\n", 594 | chase->usage1, chase->usage2); 595 | exit(1); 596 | } 597 | break; 598 | case 'f': 599 | if (fd != -1) { 600 | fprintf(stderr, "only one file can be provided\n"); 601 | exit(1); 602 | } 603 | fd = open(optarg, O_RDWR); 604 | if (fd == -1) { 605 | perror("open"); 606 | exit(1); 607 | } 608 | break; 609 | case 'F': 610 | if (parse_mem_arg(optarg, &cache_flush_size)) { 611 | fprintf(stderr, 612 | "cache_flush_size must be a non-negative integer (suffixed " 613 | "with k, m, or g)\n"); 614 | exit(1); 615 | } 616 | break; 617 | case 'p': 618 | if (parse_mem_arg(optarg, &page_size)) { 619 | fprintf(stderr, 620 | "page size must be a non-negative integer (suffixed with k, " 621 | "m, or g)\n"); 622 | exit(1); 623 | } 624 | break; 625 | case 'H': 626 | use_thp = true; 627 | break; 628 | case 'M': 629 | use_malloc = true; 630 | break; 631 | case 'L': 632 | use_longer_chase = true; 633 | break; 634 | case 'm': 635 | if (parse_mem_arg(optarg, &genchase_args.total_memory) || 636 | genchase_args.total_memory == 0) { 637 | fprintf(stderr, 638 | "total_memory must be a positive integer (suffixed with k, m " 639 | "or g)\n"); 640 | 
exit(1); 641 | } 642 | break; 643 | case 'N': 644 | duration_nsecs = 0; 645 | break; 646 | case 'n': 647 | nr_samples = strtoul(optarg, &p, 0); 648 | if (*p) { 649 | fprintf(stderr, "nr_samples must be a non-negative integer\n"); 650 | exit(1); 651 | } 652 | break; 653 | case 'O': 654 | if (parse_mem_arg(optarg, &offset)) { 655 | fprintf(stderr, 656 | "offset must be a non-negative integer (suffixed with k, m, " 657 | "or g)\n"); 658 | exit(1); 659 | } 660 | break; 661 | case 'o': 662 | genchase_args.gen_permutation = gen_ordered_permutation; 663 | break; 664 | case 's': 665 | if (parse_mem_arg(optarg, &genchase_args.stride)) { 666 | fprintf( 667 | stderr, 668 | "stride must be a positive integer (suffixed with k, m, or g)\n"); 669 | exit(1); 670 | } 671 | break; 672 | case 'T': 673 | if (parse_mem_arg(optarg, &genchase_args.tlb_locality)) { 674 | fprintf(stderr, 675 | "tlb locality must be a positive integer (suffixed with k, " 676 | "m, or g)\n"); 677 | exit(1); 678 | } 679 | break; 680 | case 't': 681 | nr_threads = strtoul(optarg, &p, 0); 682 | if (*p || nr_threads == 0) { 683 | fprintf(stderr, "nr_threads must be positive integer\n"); 684 | exit(1); 685 | } 686 | break; 687 | case 'v': 688 | ++verbosity; 689 | break; 690 | case 'W': 691 | is_weighted_mbind = 1; 692 | char *tok = NULL, *saveptr = NULL; 693 | tok = strtok_r(optarg, ",", &saveptr); 694 | while (tok != NULL) { 695 | uint16_t node_id; 696 | uint16_t weight; 697 | int count = sscanf(tok, "%hu:%hu", &node_id, &weight); 698 | if (count != 2) { 699 | fprintf(stderr, "Expecting node_id:weight\n"); 700 | exit(1); 701 | } 702 | if (node_id >= sizeof(mbind_weights) / sizeof(mbind_weights[0])) { 703 | fprintf(stderr, "Maximum node_id is %lu\n", 704 | sizeof(mbind_weights) / sizeof(mbind_weights[0]) - 1); 705 | exit(1); 706 | } 707 | mbind_weights[node_id] = weight; 708 | tok = strtok_r(NULL, ",", &saveptr); 709 | } 710 | break; 711 | case 'X': 712 | set_thread_affinity = 0; 713 | break; 714 | case 'y': 715 
| print_timestamp = 1; 716 | break; 717 | default: 718 | goto usage; 719 | } 720 | } 721 | 722 | if (argc - optind != 0) { 723 | usage: 724 | fprintf(stderr, "usage: %s [options]\n", argv[0]); 725 | fprintf(stderr, 726 | "-a print average latency (default is best latency)\n"); 727 | fprintf(stderr, "-c chase select one of several different chases:\n"); 728 | for (i = 0; i < sizeof(chases) / sizeof(chases[0]); ++i) { 729 | fprintf(stderr, " %-12s%s\n", chases[i].usage1, chases[i].usage2); 730 | } 731 | fprintf(stderr, " default: %s\n", chases[0].name); 732 | fprintf(stderr, "-m nnnn[kmg] total memory size (default %zu)\n", 733 | DEF_TOTAL_MEMORY); 734 | fprintf(stderr, 735 | " NOTE: memory size will be rounded down to a " 736 | "multiple of -T option\n"); 737 | fprintf(stderr, 738 | "-N timed run (stops at nr_samples/2 seconds)\n"); 739 | fprintf(stderr, 740 | "-n nr_samples nr of 0.5 second samples to use (default %zu, 0 = " 741 | "infinite)\n", 742 | DEF_NR_SAMPLES); 743 | fprintf( 744 | stderr, 745 | "-o perform an ordered traversal (rather than random)\n"); 746 | fprintf(stderr, "-O nnnn[kmg] offset the entire chase by nnnn bytes\n"); 747 | fprintf(stderr, "-s nnnn[kmg] stride size (default %zu)\n", DEF_STRIDE); 748 | fprintf(stderr, "-T nnnn[kmg] TLB locality in bytes (default %zu)\n", 749 | DEF_TLB_LOCALITY * default_page_size); 750 | fprintf(stderr, 751 | " NOTE: TLB locality will be rounded down to a " 752 | "multiple of stride\n"); 753 | fprintf(stderr, "-t nr_threads number of threads (default %zu)\n", 754 | DEF_NR_THREADS); 755 | fprintf(stderr, "-p page_size backing page size to use (default %zu)\n", 756 | default_page_size); 757 | fprintf(stderr, "-f file mmap memory using the provided file\n"); 758 | fprintf(stderr, 759 | "-H use transparent hugepages (leave page size at " 760 | "default)\n"); 761 | fprintf(stderr, 762 | "-L use longer chase\n"); 763 | fprintf(stderr, 764 | "-M use malloc instead of mmap to allocate arena\n"); 765 | fprintf(stderr, 766 
| "-F nnnn[kmg] amount of memory to use to flush the caches after " 767 | "constructing\n" 768 | " the chase and before starting the benchmark (use " 769 | "with nta)\n" 770 | " default: %zu\n", 771 | DEF_CACHE_FLUSH); 772 | fprintf( 773 | stderr, 774 | "-W mbind list list of node:weight,... pairs for allocating memory\n" 775 | " has no effect if -H flag is specified\n" 776 | " 0:10,1:90 weights it as 10%% on 0 and 90%% on 1\n"); 777 | fprintf(stderr, "-X do not set thread affinity\n"); 778 | fprintf(stderr, "-y print timestamp in front of each line\n"); 779 | exit(1); 780 | } 781 | 782 | if (genchase_args.stride < sizeof(void *)) { 783 | fprintf(stderr, "stride must be at least %zu\n", sizeof(void *)); 784 | exit(1); 785 | } 786 | 787 | // ensure some sanity in the various arguments 788 | if (genchase_args.tlb_locality < genchase_args.stride) { 789 | genchase_args.tlb_locality = genchase_args.stride; 790 | } else { 791 | genchase_args.tlb_locality -= 792 | genchase_args.tlb_locality % genchase_args.stride; 793 | } 794 | 795 | if (genchase_args.total_memory < genchase_args.tlb_locality) { 796 | if (genchase_args.total_memory < genchase_args.stride) { 797 | genchase_args.total_memory = genchase_args.stride; 798 | } else { 799 | genchase_args.total_memory -= 800 | genchase_args.total_memory % genchase_args.stride; 801 | } 802 | genchase_args.tlb_locality = genchase_args.total_memory; 803 | } else { 804 | genchase_args.total_memory -= 805 | genchase_args.total_memory % genchase_args.tlb_locality; 806 | } 807 | 808 | genchase_args.nr_mixer_indices = 809 | genchase_args.stride / chase->base_object_size; 810 | if (genchase_args.nr_mixer_indices < nr_threads * chase->parallelism) { 811 | fprintf(stderr, 812 | "the stride is too small to interleave that many threads, need at " 813 | "least %zu bytes\n", 814 | nr_threads * chase->parallelism * chase->base_object_size); 815 | exit(1); 816 | } 817 | 818 | if (verbosity > 0) { 819 | printf("nr_mixer_indices = %zu\n", 
genchase_args.nr_mixer_indices); 820 | printf("base_object_size = %zu\n", chase->base_object_size); 821 | printf("nr_threads = %zu\n", nr_threads); 822 | print_page_size(page_size, use_thp); 823 | printf("total_memory = %zu (%.1f MiB)\n", genchase_args.total_memory, 824 | genchase_args.total_memory / (1024. * 1024.)); 825 | printf("stride = %zu\n", genchase_args.stride); 826 | printf("tlb_locality = %zu\n", genchase_args.tlb_locality); 827 | printf("chase = %s\n", chase_optarg); 828 | if (use_malloc) printf("malloc allocation\n"); 829 | } 830 | 831 | rng_init(1); 832 | 833 | generate_chase_mixer(&genchase_args, nr_threads * chase->parallelism); 834 | 835 | // generate the chases by launching multiple threads 836 | if (use_malloc) { 837 | char *buf = malloc(genchase_args.total_memory + offset); 838 | if (!buf) { // check the malloc result before applying the offset 839 | perror("malloc"); 840 | exit(1); 841 | } 842 | genchase_args.arena = buf + offset; 843 | } else { 844 | genchase_args.arena = 845 | (char *)alloc_arena_mmap(page_size, use_thp, 846 | genchase_args.total_memory + offset, fd) + 847 | offset; 848 | } 849 | per_thread_t *thread_data = alloc_arena_mmap( 850 | default_page_size, false, nr_threads * sizeof(per_thread_t), -1); 851 | void *flush_arena = NULL; 852 | if (cache_flush_size) { 853 | flush_arena = alloc_arena_mmap(default_page_size, false, cache_flush_size, 854 | -1); 855 | memset(flush_arena, 1, cache_flush_size); // ensure pages are mapped 856 | } 857 | 858 | pthread_t thread; 859 | nr_to_startup = nr_threads; 860 | for (i = 0; i < nr_threads; ++i) { 861 | thread_data[i].x.genchase_args = &genchase_args; 862 | thread_data[i].x.nr_threads = nr_threads; 863 | thread_data[i].x.thread_num = i; 864 | thread_data[i].x.extra_args = extra_args; 865 | thread_data[i].x.chase = chase; 866 | thread_data[i].x.flush_arena = flush_arena; 867 | thread_data[i].x.cache_flush_size = cache_flush_size; 868 | thread_data[i].x.use_longer_chase = use_longer_chase; 869 | if (pthread_create(&thread, NULL,
thread_start, &thread_data[i])) { 870 | perror("pthread_create"); 871 | exit(1); 872 | } 873 | } 874 | 875 | // now wait for them all to finish generating their chases and start chasing 876 | pthread_mutex_lock(&wait_mutex); 877 | if (nr_to_startup) { 878 | pthread_cond_wait(&wait_cond, &wait_mutex); 879 | } 880 | pthread_mutex_unlock(&wait_mutex); 881 | 882 | // now start sampling their progress 883 | if (!duration_nsecs) 884 | duration_nsecs = nr_samples * DELAY_USECS * NSECS_PER_USEC; 885 | nr_samples = nr_samples + 1; // we drop the first sample 886 | uint64_t *cur_samples = alloca(nr_threads * sizeof(*cur_samples)); 887 | uint64_t last_sample_time = now_nsec(); 888 | uint64_t start_time = last_sample_time; 889 | uint64_t total = 0; 890 | uint64_t stalls = 0; 891 | double best = 1. / 0.; 892 | double running_sum = 0.; 893 | if (verbosity > 0) 894 | printf("samples (one column per thread, one row per sample):\n"); 895 | for (size_t sample_no = 0; nr_samples == 1 || sample_no < nr_samples; 896 | ++sample_no) { 897 | usleep(DELAY_USECS); 898 | 899 | uint64_t sum = 0; 900 | for (i = 0; i < nr_threads; ++i) { 901 | cur_samples[i] = __sync_lock_test_and_set(&thread_data[i].x.count, 0); 902 | sum += cur_samples[i]; 903 | } 904 | 905 | uint64_t cur_sample_time = now_nsec(); 906 | uint64_t time_delta = cur_sample_time - last_sample_time; 907 | last_sample_time = cur_sample_time; 908 | 909 | // we drop the first sample because it's fairly likely one 910 | // thread had some advantage initially due to still having 911 | // portions of the chase in a cache. 912 | if (sample_no == 0) continue; 913 | 914 | if (verbosity > 0) { 915 | timestamp(); 916 | for (i = 0; i < nr_threads; ++i) { 917 | double z = time_delta / (double)cur_samples[i]; 918 | printf(" %6.*f", z < 100. ? 
3 : 1, z); 919 | } 920 | } 921 | 922 | total += sum; 923 | if (!sum) 924 | stalls++; 925 | double t = time_delta / (double)sum; 926 | running_sum += t; 927 | if (t < best) { 928 | best = t; 929 | } 930 | if (verbosity > 0) { 931 | double z = t * nr_threads; 932 | printf(" avg=%.*f\n", z < 100. ? 3 : 1, z); 933 | } 934 | 935 | if (last_sample_time - start_time > duration_nsecs) { 936 | printf("timed out: %lu samples, %lu stalls, %lu iterations per msec\n", 937 | sample_no, stalls, total * NSECS_PER_MSEC / duration_nsecs); 938 | break; 939 | } 940 | } 941 | timestamp(); 942 | double res; 943 | if (print_average) { 944 | res = running_sum * nr_threads / (nr_samples - 1); 945 | } else { 946 | res = best * nr_threads; 947 | } 948 | printf("%6.*f\n", res < 100. ? 3 : 1, res); 949 | 950 | exit(0); 951 | } 952 | -------------------------------------------------------------------------------- /multiload.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
| */ 14 | #define _GNU_SOURCE 15 | #include <alloca.h> 16 | #include <errno.h> 17 | #include <fcntl.h> 18 | #include <getopt.h> 19 | #include <inttypes.h> 20 | #include <limits.h> 21 | #include <pthread.h> 22 | #include <sched.h> 23 | #include <stdbool.h> 24 | #include <stdio.h> 25 | #include <stdlib.h> 26 | #include <string.h> 27 | #include <unistd.h> 28 | 29 | #include "arena.h" 30 | #include "cpu_util.h" 31 | #include "expand.h" 32 | #include "permutation.h" 33 | #include "timer.h" 34 | #include "util.h" 35 | 36 | #if defined(__x86_64__) 37 | #include <x86intrin.h> 38 | #endif 39 | 40 | // The total memory, stride, and TLB locality have been chosen carefully for 41 | // the current generation of CPUs: 42 | // 43 | // - at stride of 64 bytes the L2 next-line prefetch on p-m/core/core2 gives a 44 | // helping hand 45 | // 46 | // - at stride of 128 bytes the stream prefetcher on various p4 decides the 47 | // random accesses sometimes look like a stream and gives a helping hand. 48 | // 49 | // - the TLB locality could have been raised beyond 4 pages to defeat various 50 | // stream prefetchers, but you need to get out well past 32 pages before 51 | // all existing hw prefetchers are defeated, and then you start exceeding the 52 | // TLB locality on several CPUs and incurring some TLB overhead. 53 | // Hence, the default has been changed from 16 pages to 64 pages.
54 | // 55 | #define DEF_TOTAL_MEMORY ((size_t)256 * 1024 * 1024) 56 | #define DEF_STRIDE ((size_t)256) 57 | #define DEF_NR_SAMPLES ((size_t)5) 58 | #define DEF_TLB_LOCALITY ((size_t)64) 59 | #define DEF_NR_THREADS ((size_t)1) 60 | #define DEF_CACHE_FLUSH ((size_t)64 * 1024 * 1024) 61 | #define DEF_OFFSET ((size_t)0) 62 | #define DEF_DELAY ((size_t)1) 63 | #define DEF_CACHELINE ((size_t)64) 64 | 65 | #define LOAD_DELAY_WARMUP_uS \ 66 | 4000000 // Latency & Load thread warmup before data sampling starts 67 | #define LOAD_DELAY_RUN_uS 2000000 // Data sampling request frequency 68 | #define LOAD_DELAY_SAMPLE_uS \ 69 | 10000 // Data sample polling loop delay waiting for load threads to update 70 | // mutex variable 71 | 72 | typedef enum { RUN_CHASE, RUN_BANDWIDTH, RUN_CHASE_LOADED } test_type_t; 73 | static volatile uint64_t use_result_dummy = 0x0123456789abcdef; 74 | 75 | static size_t default_page_size; 76 | static size_t page_size; 77 | static bool use_thp; 78 | int verbosity; 79 | int print_timestamp; 80 | int is_weighted_mbind; 81 | uint16_t mbind_weights[MAX_MEM_NODES]; 82 | 83 | #ifdef __i386__ 84 | #define MAX_PARALLEL (6) // maximum number of chases in parallel 85 | #else 86 | #define MAX_PARALLEL (10) 87 | #endif 88 | 89 | // forward declare 90 | typedef struct chase_t chase_t; 91 | 92 | // the arguments for the chase threads 93 | typedef union { 94 | char pad[AVOID_FALSE_SHARING]; 95 | struct { 96 | unsigned thread_num; // which thread is this 97 | volatile uint64_t count; // return thread measurement - need 64 bits when 98 | // passing bandwidth 99 | void *cycle[MAX_PARALLEL]; // initial address for the chases 100 | const char *extra_args; 101 | int dummy; // useful for confusing the compiler 102 | 103 | const struct generate_chase_common_args *genchase_args; 104 | size_t nr_threads; 105 | const chase_t *chase; 106 | void *flush_arena; 107 | size_t cache_flush_size; 108 | bool use_longer_chase; 109 | test_type_t run_test_type; // test type: chase or memory 
bandwidth 110 | const chase_t *memload; // memory bandwidth function 111 | char *load_arena; // load memory buffer used by this thread 112 | size_t load_total_memory; // load size of the arena 113 | size_t load_offset; // load offset of the arena 114 | size_t load_tlb_locality; // group accesses within this range in order to 115 | // amortize TLB fills 116 | volatile size_t sample_no; // flag from main thread to tell bandwidth 117 | // thread to start the next sample. 118 | size_t delay; // injection delay for load 119 | } x; 120 | } per_thread_t; 121 | 122 | int always_zero; 123 | 124 | static void chase_simple(per_thread_t *t) { 125 | void *p = t->x.cycle[0]; 126 | 127 | do { 128 | x200(p = *(void **)p;) 129 | } while (__sync_add_and_fetch(&t->x.count, 200)); 130 | 131 | // we never actually reach here, but the compiler doesn't know that 132 | t->x.dummy = (uintptr_t)p; 133 | } 134 | 135 | // parallel chases 136 | 137 | #define declare(i) void *p##i = start[i]; 138 | #define cleanup(i) tmp += (uintptr_t)p##i; 139 | #if MAX_PARALLEL == 6 140 | #define parallel(foo) foo(0) foo(1) foo(2) foo(3) foo(4) foo(5) 141 | #else 142 | #define parallel(foo) \ 143 | foo(0) foo(1) foo(2) foo(3) foo(4) foo(5) foo(6) foo(7) foo(8) foo(9) 144 | #endif 145 | 146 | #define template(n, expand, inner) \ 147 | static void chase_parallel##n(per_thread_t *t) { \ 148 | void **start = t->x.cycle; \ 149 | parallel(declare) do { x##expand(inner) } \ 150 | while (__sync_add_and_fetch(&t->x.count, n * expand)) \ 151 | ; \ 152 | \ 153 | uintptr_t tmp = 0; \ 154 | parallel(cleanup) t->x.dummy = tmp; \ 155 | } 156 | 157 | #if defined(__x86_64__) || defined(__i386__) 158 | #define D(n) asm volatile("mov (%1),%0" : "=r"(p##n) : "r"(p##n)); 159 | #else 160 | #define D(n) p##n = *(void **)p##n; 161 | #endif 162 | template(2, 100, D(0) D(1)); 163 | template(3, 66, D(0) D(1) D(2)); 164 | template(4, 50, D(0) D(1) D(2) D(3)); 165 | template(5, 40, D(0) D(1) D(2) D(3) D(4)); 166 | template(6, 32, D(0
D(1) D(2) D(3) D(4) D(5)); 167 | #if MAX_PARALLEL > 6 168 | template(7, 28, D(0) D(1) D(2) D(3) D(4) D(5) D(6)); 169 | template(8, 24, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7)); 170 | template(9, 22, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7) D(8)); 171 | template(10, 20, D(0) D(1) D(2) D(3) D(4) D(5) D(6) D(7) D(8) D(9)); 172 | #endif 173 | #undef D 174 | #undef parallel 175 | #undef cleanup 176 | #undef declare 177 | 178 | static void chase_work(per_thread_t *t) { 179 | void *p = t->x.cycle[0]; 180 | size_t extra_work = strtoul(t->x.extra_args, 0, 0); 181 | size_t work = 0; 182 | size_t i; 183 | 184 | // the extra work is intended to be overlapped with a dereference, 185 | // but we don't want it to skip past the next dereference. so 186 | // we fold in the value of the pointer, and launch the deref then 187 | // go into a loop performing extra work, hopefully while the 188 | // deref occurs. 189 | do { 190 | x25(work += (uintptr_t)p; p = *(void **)p; 191 | for (i = 0; i < extra_work; ++i) { work ^= i; }) 192 | } while (__sync_add_and_fetch(&t->x.count, 25)); 193 | 194 | // we never actually reach here, but the compiler doesn't know that 195 | t->x.cycle[0] = p; 196 | t->x.dummy = work; 197 | } 198 | 199 | struct incr_struct { 200 | struct incr_struct *next; 201 | unsigned incme; 202 | }; 203 | 204 | static void chase_incr(per_thread_t *t) { 205 | struct incr_struct *p = t->x.cycle[0]; 206 | 207 | do { 208 | x50(++p->incme; p = *(void **)p;) 209 | } while (__sync_add_and_fetch(&t->x.count, 50)); 210 | 211 | // we never actually reach here, but the compiler doesn't know that 212 | t->x.cycle[0] = p; 213 | } 214 | 215 | #if defined(__x86_64__) || defined(__i386__) 216 | #define chase_prefetch(type) \ 217 | static void chase_prefetch##type(per_thread_t *t) { \ 218 | void *p = t->x.cycle[0]; \ 219 | \ 220 | do { \ 221 | x100(asm volatile("prefetch" #type " %0" ::"m"(*(void **)p)); \ 222 | p = *(void **)p;) \ 223 | } while (__sync_add_and_fetch(&t->x.count, 100)); \ 224 | 
\ 225 | /* we never actually reach here, but the compiler doesn't know that */ \ 226 | t->x.cycle[0] = p; \ 227 | } 228 | chase_prefetch(t0); 229 | chase_prefetch(t1); 230 | chase_prefetch(t2); 231 | chase_prefetch(nta); 232 | #undef chase_prefetch 233 | #endif 234 | 235 | #if defined(__x86_64__) 236 | static void chase_movdqa(per_thread_t *t) { 237 | void *p = t->x.cycle[0]; 238 | 239 | do { 240 | x100(asm volatile("\n movdqa (%%rax),%%xmm0" 241 | "\n movdqa 16(%%rax),%%xmm1" 242 | "\n paddq %%xmm1,%%xmm0" 243 | "\n movdqa 32(%%rax),%%xmm2" 244 | "\n paddq %%xmm2,%%xmm0" 245 | "\n movdqa 48(%%rax),%%xmm3" 246 | "\n paddq %%xmm3,%%xmm0" 247 | "\n movq %%xmm0,%%rax" 248 | : "=a"(p) 249 | : "0"(p));) 250 | } while (__sync_add_and_fetch(&t->x.count, 100)); 251 | t->x.cycle[0] = p; 252 | } 253 | 254 | static void chase_movntdqa(per_thread_t *t) { 255 | void *p = t->x.cycle[0]; 256 | 257 | do { 258 | x100(asm volatile( 259 | #ifndef BINUTILS_HAS_MOVNTDQA 260 | "\n .byte 0x66,0x0f,0x38,0x2a,0x00" 261 | "\n .byte 0x66,0x0f,0x38,0x2a,0x48,0x10" 262 | "\n paddq %%xmm1,%%xmm0" 263 | "\n .byte 0x66,0x0f,0x38,0x2a,0x50,0x20" 264 | "\n paddq %%xmm2,%%xmm0" 265 | "\n .byte 0x66,0x0f,0x38,0x2a,0x58,0x30" 266 | "\n paddq %%xmm3,%%xmm0" 267 | "\n movq %%xmm0,%%rax" 268 | #else 269 | "\n movntdqa (%%rax),%%xmm0" 270 | "\n movntdqa 16(%%rax),%%xmm1" 271 | "\n paddq %%xmm1,%%xmm0" 272 | "\n movntdqa 32(%%rax),%%xmm2" 273 | "\n paddq %%xmm2,%%xmm0" 274 | "\n movntdqa 48(%%rax),%%xmm3" 275 | "\n paddq %%xmm3,%%xmm0" 276 | "\n movq %%xmm0,%%rax" 277 | #endif 278 | : "=a"(p) 279 | : "0"(p));) 280 | } while (__sync_add_and_fetch(&t->x.count, 100)); 281 | t->x.cycle[0] = p; 282 | } 283 | 284 | static void chase_critword2(per_thread_t *t) { 285 | void *p = t->x.cycle[0]; 286 | size_t offset = strtoul(t->x.extra_args, 0, 0); 287 | void *q = (char *)p + offset; 288 | 289 | do { 290 | x100(asm volatile("mov (%1),%0" 291 | : "=r"(p) 292 | : "r"(p)); 293 | asm volatile("mov (%1),%0" 294 | : 
"=r"(q) 295 | : "r"(q));) 296 | } while (__sync_add_and_fetch(&t->x.count, 100)); 297 | 298 | t->x.cycle[0] = (void *)((uintptr_t)p + (uintptr_t)q); 299 | } 300 | #endif 301 | 302 | struct chase_t { 303 | void (*fn)(per_thread_t *t); 304 | size_t base_object_size; 305 | const char *name; 306 | const char *usage1; 307 | const char *usage2; 308 | int requires_arg; 309 | unsigned parallelism; // number of parallel chases (at least 1) 310 | }; 311 | static const chase_t chases[] = { 312 | // the default must be first 313 | { 314 | .fn = chase_simple, 315 | .base_object_size = sizeof(void *), 316 | .name = "simple ", 317 | .usage1 = "simple", 318 | .usage2 = "no frills pointer dereferencing", 319 | .requires_arg = 0, 320 | .parallelism = 1, 321 | }, 322 | { 323 | .fn = chase_simple, 324 | .base_object_size = sizeof(void *), 325 | .name = "chaseload", 326 | .usage1 = "chaseload", 327 | .usage2 = "runs simple chase with multiple memory bandwidth loads", 328 | .requires_arg = 0, 329 | .parallelism = 1, 330 | }, 331 | { 332 | .fn = chase_work, 333 | .base_object_size = sizeof(void *), 334 | .name = "work ", 335 | .usage1 = "work:N", 336 | .usage2 = "loop simple computation N times in between derefs", 337 | .requires_arg = 1, 338 | .parallelism = 1, 339 | }, 340 | { 341 | .fn = chase_incr, 342 | .base_object_size = sizeof(struct incr_struct), 343 | .name = "incr ", 344 | .usage1 = "incr", 345 | .usage2 = "modify the cache line after each deref", 346 | .requires_arg = 0, 347 | .parallelism = 1, 348 | }, 349 | #if defined(__x86_64__) || defined(__i386__) 350 | #define chase_prefetch(type) \ 351 | { \ 352 | .fn = chase_prefetch##type, .base_object_size = sizeof(void *), \ 353 | .name = #type, .usage1 = #type, \ 354 | .usage2 = "perform prefetch" #type " before each deref", \ 355 | .requires_arg = 0, .parallelism = 1, \ 356 | } 357 | chase_prefetch(t0), 358 | chase_prefetch(t1), 359 | chase_prefetch(t2), 360 | chase_prefetch(nta), 361 | #endif 362 | #if defined(__x86_64__) 363 | 
{ 364 | .fn = chase_movdqa, 365 | .base_object_size = 64, 366 | .name = "movdqa", 367 | .usage1 = "movdqa", 368 | .usage2 = "use movdqa to read from memory", 369 | .requires_arg = 0, 370 | .parallelism = 1, 371 | }, 372 | { 373 | .fn = chase_movntdqa, 374 | .base_object_size = 64, 375 | .name = "movntdqa", 376 | .usage1 = "movntdqa", 377 | .usage2 = "use movntdqa to read from memory", 378 | .requires_arg = 0, 379 | .parallelism = 1, 380 | }, 381 | #endif 382 | #define PAR(n) \ 383 | { \ 384 | .fn = chase_parallel##n, .base_object_size = sizeof(void *), \ 385 | .name = "parallel" #n, .usage1 = "parallel" #n, \ 386 | .usage2 = "alternate " #n " non-dependent chases in each thread", \ 387 | .parallelism = n, \ 388 | } 389 | PAR(2), 390 | PAR(3), 391 | PAR(4), 392 | PAR(5), 393 | PAR(6), 394 | #if MAX_PARALLEL > 6 395 | PAR(7), 396 | PAR(8), 397 | PAR(9), 398 | PAR(10), 399 | #endif 400 | #undef PAR 401 | #if defined(__x86_64__) 402 | { 403 | .fn = chase_critword2, 404 | .base_object_size = 64, 405 | .name = "critword2", 406 | .usage1 = "critword2:N", 407 | .usage2 = "a two-parallel chase which reads at X and X+N", 408 | .requires_arg = 1, 409 | .parallelism = 1, 410 | }, 411 | #endif 412 | { 413 | .fn = chase_simple, 414 | .base_object_size = 64, 415 | .name = "critword", 416 | .usage1 = "critword:N", 417 | .usage2 = "a non-parallel chase which reads at X and X+N", 418 | .requires_arg = 1, 419 | .parallelism = 1, 420 | }, 421 | }; 422 | 423 | //======================================================================================================== 424 | // Memory Bandwidth load generation 425 | //======================================================================================================== 426 | #define LOAD_MEMORY_INIT_MIBPS \ 427 | register uint64_t loops = 0; \ 428 | size_t cur_sample = -1, nxt_sample = 0; \ 429 | double time0, time1, timetot; \ 430 | double bite_sum = 0, mibps = 0; /*mibps = MiB per sec.*/ \ 431 | time0 = (double)now_nsec(); 432 | 433 | 
#define LOAD_MEMORY_SAMPLE_MIBPS \ 434 | loops++; \ 435 | /* Main thread will increment x.sample_no when it wants a sample. */ \ 436 | nxt_sample = t->x.sample_no; \ 437 | /* printf("T(%d)%ld:%ld,%ld ", t->x.thread_num, cur_sample, nxt_sample, \ 438 | * t->x.count ); */ \ 439 | /* if ( (loops&0xF)==0xf ) printf("T(%d)%ld:%ld,%ld ", t->x.thread_num, \ 440 | * cur_sample, nxt_sample, t->x.count ); */ \ 441 | if ((cur_sample != nxt_sample) && (t->x.count == 0)) { \ 442 | time1 = (double)now_nsec(); \ 443 | bite_sum = loops * load_bites; \ 444 | timetot = time1 - time0; \ 445 | mibps = (bite_sum * 1000000000.0) / (timetot * 1024 * 1024); \ 446 | /* printf(" T(%d)=%.0f:%ld:%ld ", t->x.thread_num, mibps, cur_sample, \ 447 | * nxt_sample ); */ \ 448 | /* printf(" %ld,%ld,%ld,%.0f:%.0f(%.1fMiBs)\n", cur_sample, loops, \ 449 | * t->x.count, bite_sum, timetot, mibps); */ \ 450 | /* update the MiB/s count. Main thread will read and set to 0 so we know \ 451 | * this sample is done. */ \ 452 | __sync_add_and_fetch(&t->x.count, (uint64_t)mibps); \ 453 | cur_sample = nxt_sample; \ 454 | loops = 0; \ 455 | time0 = (double)now_nsec(); \ 456 | } 457 | 458 | //-------------------------------------------------------------------------------------------------------- 459 | static void load_memcpy_libc(per_thread_t *t) { 460 | #define LOOP_OPS 2 461 | uint64_t load_loop = t->x.load_total_memory / LOOP_OPS; 462 | uint64_t load_bites = load_loop * LOOP_OPS; 463 | register char *a = (char *)t->x.load_arena; 464 | register char *b = a + load_loop; 465 | register char *tmp; 466 | 467 | LOAD_MEMORY_INIT_MIBPS 468 | do { 469 | tmp = a; 470 | a = b; 471 | b = tmp; 472 | memcpy((void *)a, (void *)b, load_loop); 473 | LOAD_MEMORY_SAMPLE_MIBPS 474 | } while (1); 475 | #undef LOOP_OPS 476 | } 477 | 478 | //-------------------------------------------------------------------------------------------------------- 479 | static void load_memset_libc(per_thread_t *t) { 480 | #define LOOP_OPS 1 481 | 
uint64_t load_bites = t->x.load_total_memory * LOOP_OPS; 482 | register char *a = (char *)t->x.load_arena; 483 | 484 | LOAD_MEMORY_INIT_MIBPS 485 | do { 486 | memset((void *)a, 0xdeadbeef, load_bites); 487 | LOAD_MEMORY_SAMPLE_MIBPS 488 | } while (1); 489 | #undef LOOP_OPS 490 | } 491 | 492 | //-------------------------------------------------------------------------------------------------------- 493 | static void load_memsetz_libc(per_thread_t *t) { 494 | #define LOOP_OPS 1 495 | uint64_t load_bites = t->x.load_total_memory * LOOP_OPS; 496 | register char *a = (char *)t->x.load_arena; 497 | 498 | LOAD_MEMORY_INIT_MIBPS 499 | do { 500 | memset((void *)a, 0, load_bites); 501 | LOAD_MEMORY_SAMPLE_MIBPS 502 | } while (1); 503 | #undef LOOP_OPS 504 | } 505 | 506 | //-------------------------------------------------------------------------------------------------------- 507 | static void load_stream_triad(per_thread_t *t) { 508 | #define LOOP_OPS 3 509 | #define LOOP_ALIGN 16 510 | uint64_t load_loop, load_bites; 511 | register uint64_t N, i; 512 | register double *a; 513 | register double *b; 514 | register double *c; 515 | register double *tmp; 516 | 517 | load_loop = 518 | t->x.load_total_memory - 519 | (LOOP_OPS * LOOP_ALIGN); // subtract to allow aligning count/addresses 520 | load_loop = (load_loop / LOOP_OPS) & 521 | ~(LOOP_ALIGN - 1); // divide by 3 buffers and align byte count on 522 | // LOOP_ALIGN byte multiple 523 | N = load_loop / sizeof(double); 524 | load_bites = N * sizeof(double) * LOOP_OPS; 525 | size_t aa = (((size_t)t->x.load_arena + LOOP_ALIGN) & 526 | ~(LOOP_ALIGN - 1)); // align on 16 byte address 527 | a = (double *)aa; 528 | b = a + N; 529 | c = b + N; 530 | if (verbosity > 1) { 531 | printf( 532 | "load_arena=%p, load_total_memory=0x%lX, load_loop=0x%lX, N=0x%lX, " 533 | "a=%p, b=%p, c=%p\n", 534 | (char *)t->x.load_arena, t->x.load_total_memory, load_loop, N, a, b, c); 535 | } 536 | 537 | LOAD_MEMORY_INIT_MIBPS 538 | do { 539 | tmp = a; 540 | a =
b; 541 | b = c; 542 | c = tmp; 543 | 544 | for (i = 0; i < N; ++i) { 545 | a[i] = b[i] + c[i]; 546 | } 547 | LOAD_MEMORY_SAMPLE_MIBPS 548 | } while (1); 549 | #undef LOOP_OPS 550 | #undef LOOP_ALIGN 551 | } 552 | 553 | static inline void delay_until_iteration(uint64_t iteration) { 554 | while (iteration--) { 555 | asm volatile("nop"); 556 | } 557 | } 558 | 559 | //-------------------------------------------------------------------------------------------------------- 560 | // Stream triad pattern with nontemporal hint and adjustable injection delay 561 | static void load_stream_triad_nontemporal_injection_delay(per_thread_t *t) { 562 | #define LOOP_OPS 3 563 | #define LOOP_ALIGN 16 564 | uint64_t load_loop, load_bites; 565 | register uint64_t N, i; 566 | register uint64_t *a; 567 | register uint64_t *b; 568 | register uint64_t *c; 569 | register uint64_t *tmp; 570 | const uint64_t num_elem_twocachelines = DEF_CACHELINE / sizeof(uint64_t) * 2; 571 | 572 | load_loop = 573 | t->x.load_total_memory - 574 | (LOOP_OPS * LOOP_ALIGN); // subtract to allow aligning count/addresses 575 | load_loop = (load_loop / LOOP_OPS) & 576 | ~(LOOP_ALIGN - 1); // divide by 3 buffers and align byte count on 577 | // LOOP_ALIGN byte multiple 578 | N = load_loop / sizeof(uint64_t); 579 | load_bites = N * sizeof(uint64_t) * LOOP_OPS; 580 | size_t aa = (((size_t)t->x.load_arena + LOOP_ALIGN) & 581 | ~(LOOP_ALIGN - 1)); // align on 16 byte address 582 | a = (uint64_t *)aa; 583 | b = a + N; 584 | c = b + N; 585 | 586 | if (verbosity > 1) { 587 | printf( 588 | "load_arena=%p, load_total_memory=0x%lX, load_loop=0x%lX, N=0x%lX, " 589 | "a=%p, b=%p, c=%p\n", 590 | (char *)t->x.load_arena, t->x.load_total_memory, load_loop, N, a, b, c); 591 | } 592 | 593 | LOAD_MEMORY_INIT_MIBPS 594 | do { 595 | tmp = a; 596 | a = b; 597 | b = c; 598 | c = tmp; 599 | 600 | for (i = 0; i < N; i += 2) { 601 | if (i % num_elem_twocachelines == 0) delay_until_iteration(t->x.delay); 602 | #if defined(__aarch64__) 603 | asm
volatile ("stnp %0, %1, [%2]" :: "r"(b[i]+c[i]), "r"(b[i+1]+c[i+1]), "r" (a+i)); 604 | #elif defined(__x86_64__) 605 | _mm_stream_si64(&((long long*)a)[i], b[i]+c[i]); 606 | _mm_stream_si64(&((long long*)a)[i+1], b[i+1]+c[i+1]); 607 | #else 608 | a[i] = b[i] + c[i]; 609 | a[i+1] = b[i+1] + c[i+1]; 610 | #endif 611 | } 612 | LOAD_MEMORY_SAMPLE_MIBPS 613 | } while (1); 614 | #undef LOOP_OPS 615 | #undef LOOP_ALIGN 616 | } 617 | 618 | //-------------------------------------------------------------------------------------------------------- 619 | static void load_stream_copy(per_thread_t *t) { 620 | #define LOOP_OPS 2 621 | uint64_t load_loop = t->x.load_total_memory / LOOP_OPS; 622 | register uint64_t N = load_loop / sizeof(double); 623 | uint64_t load_bites = N * sizeof(double) * LOOP_OPS; 624 | register uint64_t i; 625 | register double *a = (double *)t->x.load_arena; 626 | register double *b = a + N; 627 | register double *tmp; 628 | 629 | LOAD_MEMORY_INIT_MIBPS 630 | do { 631 | tmp = a; 632 | a = b; 633 | b = tmp; 634 | for (i = 0; i < N; ++i) { 635 | b[i] = a[i]; 636 | } 637 | LOAD_MEMORY_SAMPLE_MIBPS 638 | } while (1); 639 | #undef LOOP_OPS 640 | } 641 | 642 | //-------------------------------------------------------------------------------------------------------- 643 | static void load_stream_sum(per_thread_t *t) { 644 | #define LOOP_OPS 1 645 | uint64_t load_loop = t->x.load_total_memory / LOOP_OPS; 646 | register uint64_t N = load_loop / sizeof(uint64_t); 647 | uint64_t load_bites = N * sizeof(uint64_t) * LOOP_OPS; 648 | register uint64_t i; 649 | register uint64_t *a = (uint64_t *)t->x.load_arena; 650 | register uint64_t s = 0; 651 | 652 | LOAD_MEMORY_INIT_MIBPS 653 | do { 654 | for (i = 0; i < N; ++i) { 655 | s += a[i]; 656 | } 657 | LOAD_MEMORY_SAMPLE_MIBPS 658 | use_result_dummy += s; 659 | } while (1); 660 | #undef LOOP_OPS 661 | } 662 | 663 | //-------------------------------------------------------------------------------------------------------- 664 
| static const chase_t memloads[] = { 665 | // the default must be first 666 | { 667 | .fn = load_memcpy_libc, 668 | .base_object_size = sizeof(void *), 669 | .name = "memcpy-libc", 670 | .usage1 = "memcpy-libc", 671 | .usage2 = "1:1 rd:wr - memcpy()", 672 | .requires_arg = 0, 673 | .parallelism = 0, 674 | }, 675 | { 676 | .fn = load_memset_libc, 677 | .base_object_size = sizeof(void *), 678 | .name = "memset-libc", 679 | .usage1 = "memset-libc", 680 | .usage2 = "0:1 rd:wr - memset() non-zero data", 681 | .requires_arg = 0, 682 | .parallelism = 0, 683 | }, 684 | { 685 | .fn = load_memsetz_libc, 686 | .base_object_size = sizeof(void *), 687 | .name = "memsetz-libc", 688 | .usage1 = "memsetz-libc", 689 | .usage2 = "0:1 rd:wr - memset() zero data", 690 | .requires_arg = 0, 691 | .parallelism = 0, 692 | }, 693 | { 694 | .fn = load_stream_copy, 695 | .base_object_size = sizeof(void *), 696 | .name = "stream-copy", 697 | .usage1 = "stream-copy", 698 | .usage2 = "1:1 rd:wr - lmbench stream copy ", 699 | .requires_arg = 0, 700 | .parallelism = 0, 701 | }, 702 | { 703 | .fn = load_stream_sum, 704 | .base_object_size = sizeof(void *), 705 | .name = "stream-sum", 706 | .usage1 = "stream-sum", 707 | .usage2 = "1:0 rd:wr - lmbench stream sum ", 708 | .requires_arg = 0, 709 | .parallelism = 0, 710 | }, 711 | { 712 | .fn = load_stream_triad, 713 | .base_object_size = sizeof(void *), 714 | .name = "stream-triad", 715 | .usage1 = "stream-triad", 716 | .usage2 = "2:1 rd:wr - lmbench stream triad a[i]=b[i]+c[i]", 717 | .requires_arg = 0, 718 | .parallelism = 0, 719 | }, 720 | { 721 | .fn = load_stream_triad_nontemporal_injection_delay, 722 | .base_object_size = sizeof(void *), 723 | .name = "stream-triad-nontemporal-injection-delay", 724 | .usage1 = "stream-triad-nontemporal-injection-delay", 725 | .usage2 = "2:1 rd:wr - lmbench stream triad with nontemporal hint", 726 | .requires_arg = 0, 727 | .parallelism = 0, 728 | }}; 729 | 730 | static pthread_mutex_t wait_mutex =
PTHREAD_MUTEX_INITIALIZER; 731 | static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER; 732 | static size_t nr_to_startup; 733 | static int set_thread_affinity = 1; 734 | 735 | static void *thread_start(void *data) { 736 | per_thread_t *args = data; 737 | 738 | // ensure every thread has a different RNG 739 | rng_init(args->x.thread_num); 740 | 741 | if (set_thread_affinity) { 742 | // find out which cpus we can run on and move us to an appropriate cpu 743 | cpu_set_t cpus; 744 | if (sched_getaffinity(0, sizeof(cpus), &cpus)) { 745 | perror("sched_getaffinity"); 746 | exit(1); 747 | } 748 | int my_cpu; 749 | unsigned num = args->x.thread_num; 750 | for (my_cpu = 0; my_cpu < CPU_SETSIZE; ++my_cpu) { 751 | if (!CPU_ISSET(my_cpu, &cpus)) continue; 752 | if (num == 0) break; 753 | --num; 754 | } 755 | if (my_cpu == CPU_SETSIZE) { 756 | fprintf(stderr, "error: more threads than cpus available\n"); 757 | exit(1); 758 | } 759 | CPU_ZERO(&cpus); 760 | CPU_SET(my_cpu, &cpus); 761 | if (sched_setaffinity(0, sizeof(cpus), &cpus)) { 762 | perror("sched_setaffinity"); 763 | exit(1); 764 | } 765 | } 766 | if (args->x.run_test_type == RUN_CHASE) { 767 | // generate chases -- using a different mixer index for every 768 | // thread and for every parallel chase within a thread 769 | unsigned parallelism = args->x.chase->parallelism; 770 | for (unsigned par = 0; par < parallelism; ++par) { 771 | if (args->x.use_longer_chase) { 772 | args->x.cycle[par] = generate_chase_long( 773 | args->x.genchase_args, parallelism * args->x.thread_num + par, 774 | parallelism); 775 | } else { 776 | args->x.cycle[par] = generate_chase( 777 | args->x.genchase_args, parallelism * args->x.thread_num + par); 778 | } 779 | } 780 | // handle critword2 chases 781 | if (strcmp(args->x.chase->name, "critword2") == 0) { 782 | size_t offset = strtoul(args->x.extra_args, 0, 0); 783 | char *p = args->x.cycle[0]; 784 | char *q = p; 785 | do { 786 | char *next = *(char **)p; 787 | *(void **)(p + offset) = next 
+ offset; 788 | p = next; 789 | } while (p != q); 790 | } 791 | // handle critword chases 792 | if (strcmp(args->x.chase->name, "critword") == 0) { 793 | size_t offset = strtoul(args->x.extra_args, 0, 0); 794 | char *p = args->x.cycle[0]; 795 | char *q = p; 796 | do { 797 | char *next = *(char **)p; 798 | *(void **)(p + offset) = next; 799 | *(void **)p = p + offset; 800 | p = next; 801 | } while (p != q); 802 | } 803 | // now flush our caches 804 | if (args->x.cache_flush_size) { 805 | size_t nr_elts = args->x.cache_flush_size / sizeof(size_t); 806 | size_t *p = args->x.flush_arena; 807 | size_t sum = 0; 808 | while (nr_elts) { 809 | sum += *p; 810 | ++p; 811 | --nr_elts; 812 | } 813 | args->x.dummy += sum; 814 | } 815 | } else { 816 | if (verbosity > 2) 817 | printf("thread_start(%d) memload generate buffers\n", args->x.thread_num); 818 | // generate buffers 819 | args->x.load_arena = (char *)alloc_arena_mmap( 820 | page_size, use_thp, 821 | args->x.load_total_memory + args->x.load_offset, 822 | -1) + args->x.load_offset; 823 | memset(args->x.load_arena, 1, 824 | args->x.load_total_memory); // ensure pages are mapped 825 | } 826 | 827 | if (verbosity > 2) 828 | printf("thread_start(%d) wait and/or wake up everyone\n", 829 | args->x.thread_num); 830 | // wait and/or wake up everyone if we're all ready 831 | pthread_mutex_lock(&wait_mutex); 832 | --nr_to_startup; 833 | if (nr_to_startup) { 834 | pthread_cond_wait(&wait_cond, &wait_mutex); 835 | } else { 836 | pthread_cond_broadcast(&wait_cond); 837 | } 838 | pthread_mutex_unlock(&wait_mutex); 839 | 840 | if (args->x.run_test_type == RUN_CHASE) { 841 | if (verbosity > 2) printf("thread_start: C(%d)\n", args->x.thread_num); 842 | args->x.chase->fn(data); 843 | } else { 844 | if (verbosity > 2) printf("thread_start: M(%d)\n", args->x.thread_num); 845 | args->x.memload->fn(data); 846 | } 847 | return NULL; 848 | } 849 | 850 | static void timestamp(void) { 851 | if (!print_timestamp) return; 852 | struct timeval tv; 853 
| gettimeofday(&tv, NULL); 854 | printf("%.6f ", tv.tv_sec + tv.tv_usec / 1000000.); 855 | } 856 | 857 | int main(int argc, char **argv) { 858 | char *p; 859 | int c; 860 | size_t i; 861 | size_t nr_threads = DEF_NR_THREADS; 862 | size_t nr_samples = DEF_NR_SAMPLES; 863 | size_t cache_flush_size = DEF_CACHE_FLUSH; 864 | size_t delay = DEF_DELAY; 865 | size_t offset = DEF_OFFSET; 866 | bool use_longer_chase = false; 867 | int print_average = 0; 868 | const char *extra_args = NULL; 869 | const char *chase_optarg = chases[0].name; 870 | const chase_t *chase = &chases[0]; 871 | const char *memload_optarg = memloads[0].name; 872 | const chase_t *memload = &memloads[0]; 873 | test_type_t run_test_type = 874 | RUN_CHASE; // RUN_CHASE, RUN_BANDWIDTH, RUN_CHASE_LOADED 875 | struct generate_chase_common_args genchase_args; 876 | 877 | default_page_size = page_size = get_native_page_size(); 878 | 879 | genchase_args.total_memory = DEF_TOTAL_MEMORY; 880 | genchase_args.stride = DEF_STRIDE; 881 | genchase_args.tlb_locality = DEF_TLB_LOCALITY * default_page_size; 882 | genchase_args.gen_permutation = gen_random_permutation; 883 | 884 | setvbuf(stdout, NULL, _IOLBF, BUFSIZ); 885 | 886 | while ((c = getopt(argc, argv, "ac:d:l:F:p:HLm:n:oO:S:s:T:t:vXyW:")) != -1) { 887 | switch (c) { 888 | case 'a': 889 | print_average = 1; 890 | break; 891 | case 'c': 892 | chase_optarg = optarg; 893 | p = strchr(optarg, ':'); 894 | if (p == NULL) p = optarg + strlen(optarg); 895 | for (i = 0; i < sizeof(chases) / sizeof(chases[0]); ++i) { 896 | if (strncmp(optarg, chases[i].name, p - optarg) == 0) { 897 | break; 898 | } 899 | } 900 | if (i == sizeof(chases) / sizeof(chases[0])) { 901 | fprintf(stderr, "Error: not a recognized chase name: %s\n", optarg); 902 | goto usage; 903 | } 904 | chase = &chases[i]; 905 | if (strncmp("chaseload", chases[i].name, 12) == 0) { 906 | run_test_type = RUN_CHASE_LOADED; 907 | if (verbosity > 0) { 908 | fprintf(stdout, 909 | "Info: Loaded Latency chase selected. 
A -l memload can be " 910 | "used to select a specific memory load\n"); 911 | } 912 | break; 913 | } 914 | if (run_test_type == RUN_BANDWIDTH) { 915 | fprintf(stderr, 916 | "Error: When using -l memload, the only valid -c selection " 917 | "is chaseload (ie. loaded latency)\n"); 918 | goto usage; 919 | } 920 | if (chase->requires_arg) { 921 | if (p[0] != ':' || p[1] == 0) { 922 | fprintf(stderr, 923 | "Error: that chase requires an argument:\n-c %s\t%s\n", 924 | chase->usage1, chase->usage2); 925 | exit(1); 926 | } 927 | extra_args = p + 1; 928 | } else if (*p != 0) { 929 | fprintf(stderr, 930 | "Error: that chase does not take an argument:\n-c %s\t%s\n", 931 | chase->usage1, chase->usage2); 932 | exit(1); 933 | } 934 | break; 935 | case 'd': 936 | delay = strtoul(optarg, &p, 0); 937 | if (*p) { 938 | fprintf(stderr, "Error: delay must be a non-negative integer\n"); 939 | exit(1); 940 | } 941 | break; 942 | case 'F': 943 | if (parse_mem_arg(optarg, &cache_flush_size)) { 944 | fprintf(stderr, 945 | "Error: cache_flush_size must be a non-negative integer " 946 | "(suffixed with k, m, or g)\n"); 947 | exit(1); 948 | } 949 | break; 950 | case 'p': 951 | if (parse_mem_arg(optarg, &page_size)) { 952 | fprintf(stderr, 953 | "Error: page_size must be a non-negative integer (suffixed " 954 | "with k, m, or g)\n"); 955 | exit(1); 956 | } 957 | break; 958 | case 'H': 959 | use_thp = true; 960 | break; 961 | case 'l': 962 | memload_optarg = optarg; 963 | p = strchr(optarg, ':'); 964 | if (p == NULL) p = optarg + strlen(optarg); 965 | for (i = 0; i < sizeof(memloads) / sizeof(memloads[0]); ++i) { 966 | if (strncmp(optarg, memloads[i].name, p - optarg) == 0) { 967 | break; 968 | } 969 | } 970 | if (i == sizeof(memloads) / sizeof(memloads[0])) { 971 | fprintf(stderr, "Error: not a recognized memload name: %s\n", optarg); 972 | goto usage; 973 | } 974 | memload = &memloads[i]; 975 | if (run_test_type != RUN_CHASE_LOADED) { 976 | run_test_type = RUN_BANDWIDTH; 977 | if (verbosity > 
0) { 978 | fprintf(stdout, 979 | "Memory Bandwidth test selected. For loaded latency, -c " 980 | "chaseload must also be selected\n"); 981 | } 982 | } 983 | if (memload->requires_arg) { 984 | if (p[0] != ':' || p[1] == 0) { 985 | fprintf(stderr, 986 | "Error: that memload requires an argument:\n-l %s\t%s\n", 987 | memload->usage1, memload->usage2); 988 | exit(1); 989 | } 990 | extra_args = p + 1; 991 | } else if (*p != 0) { 992 | fprintf(stderr, 993 | "Error: that memload does not take an argument:\n-l %s\t%s\n", 994 | memload->usage1, memload->usage2); 995 | exit(1); 996 | } 997 | break; 998 | case 'L': 999 | use_longer_chase = true; 1000 | break; 1001 | case 'm': 1002 | if (parse_mem_arg(optarg, &genchase_args.total_memory) || 1003 | genchase_args.total_memory == 0) { 1004 | fprintf(stderr, 1005 | "Error: total_memory must be a positive integer (suffixed " 1006 | "with k, m, or g)\n"); 1007 | exit(1); 1008 | } 1009 | break; 1010 | case 'n': 1011 | nr_samples = strtoul(optarg, &p, 0); 1012 | if (*p) { 1013 | fprintf(stderr, "Error: nr_samples must be a non-negative integer\n"); 1014 | exit(1); 1015 | } 1016 | break; 1017 | case 'O': 1018 | if (parse_mem_arg(optarg, &offset)) { 1019 | fprintf(stderr, 1020 | "Error: offset must be a non-negative integer (suffixed with " 1021 | "k, m, or g)\n"); 1022 | exit(1); 1023 | } 1024 | break; 1025 | case 'o': 1026 | genchase_args.gen_permutation = gen_ordered_permutation; 1027 | break; 1028 | case 's': 1029 | if (parse_mem_arg(optarg, &genchase_args.stride)) { 1030 | fprintf(stderr, 1031 | "Error: stride must be a positive integer (suffixed with k, " 1032 | "m, or g)\n"); 1033 | exit(1); 1034 | } 1035 | break; 1036 | case 'T': 1037 | if (parse_mem_arg(optarg, &genchase_args.tlb_locality)) { 1038 | fprintf(stderr, 1039 | "Error: tlb locality must be a positive integer (suffixed " 1040 | "with k, m, or g)\n"); 1041 | exit(1); 1042 | } 1043 | break; 1044 | case 't': 1045 | nr_threads = strtoul(optarg, &p, 0); 1046 | if (*p || 
nr_threads == 0) { 1047 | fprintf(stderr, "Error: nr_threads must be a positive integer\n"); 1048 | exit(1); 1049 | } 1050 | break; 1051 | case 'v': 1052 | ++verbosity; 1053 | break; 1054 | case 'W': 1055 | is_weighted_mbind = 1; 1056 | char *tok = NULL, *saveptr = NULL; 1057 | tok = strtok_r(optarg, ",", &saveptr); 1058 | while (tok != NULL) { 1059 | uint16_t node_id; 1060 | uint16_t weight; 1061 | int count = sscanf(tok, "%hu:%hu", &node_id, &weight); 1062 | if (count != 2) { 1063 | fprintf(stderr, "Error: Expecting node_id:weight\n"); 1064 | exit(1); 1065 | } 1066 | if (node_id >= sizeof(mbind_weights) / sizeof(mbind_weights[0])) { 1067 | fprintf(stderr, "Error: Maximum node_id is %lu\n", 1068 | sizeof(mbind_weights) / sizeof(mbind_weights[0]) - 1); 1069 | exit(1); 1070 | } 1071 | mbind_weights[node_id] = weight; 1072 | tok = strtok_r(NULL, ",", &saveptr); 1073 | } 1074 | break; 1075 | case 'X': 1076 | set_thread_affinity = 0; 1077 | break; 1078 | case 'y': 1079 | print_timestamp = 1; 1080 | break; 1081 | default: 1082 | goto usage; 1083 | } 1084 | } 1085 | 1086 | if (argc - optind != 0) { 1087 | usage: 1088 | fprintf(stderr, "usage: %s [options]\n", argv[0]); 1089 | fprintf(stderr, 1090 | "This program can run either read latency, memory bandwidth, or " 1091 | "loaded-latency:\n"); 1092 | fprintf(stderr, 1093 | " Latency only: -c MUST NOT be chaseload. -l memload MUST NOT " 1094 | "be used\n"); 1095 | fprintf(stderr, 1096 | " Bandwidth only: -c MUST NOT be used. 
-l memload MUST be " 1097 | "used\n"); 1098 | fprintf(stderr, 1099 | " Loaded-latency: -c MUST be chaseload, -l memload MUST be " 1100 | "used\n"); 1101 | fprintf(stderr, 1102 | "-a print average latency (default is best latency)\n"); 1103 | fprintf(stderr, "-c chase select one of several different chases:\n"); 1104 | for (i = 0; i < sizeof(chases) / sizeof(chases[0]); ++i) { 1105 | fprintf(stderr, " %-12s%s\n", chases[i].usage1, chases[i].usage2); 1106 | } 1107 | fprintf(stderr, " default: %s\n", chases[0].name); 1108 | fprintf(stderr, 1109 | "-l memload select one of several different memloads:\n"); 1110 | for (i = 0; i < sizeof(memloads) / sizeof(memloads[0]); ++i) { 1111 | fprintf(stderr, " %-12s%s\n", memloads[i].usage1, memloads[i].usage2); 1112 | } 1113 | fprintf(stderr, " default: %s\n", memloads[0].name); 1114 | fprintf(stderr, 1115 | "-d delay delay used between loads; only effective if used with " 1116 | "load pattern with suffix injection_delay. (default %zu)\n", 1117 | DEF_DELAY); 1118 | fprintf(stderr, 1119 | "-F nnnn[kmg] amount of memory to use to flush the caches after " 1120 | "constructing\n" 1121 | " the chase/memload and before starting the benchmark (use " 1122 | "with nta)\n" 1123 | " default: %zu\n", 1124 | DEF_CACHE_FLUSH); 1125 | fprintf(stderr, "-p nnnn[kmg] backing page size to use (default %zu)\n", 1126 | default_page_size); 1127 | fprintf( 1128 | stderr, 1129 | "-H use transparent hugepages (leave page size at default)\n"); 1130 | fprintf(stderr, "-m nnnn[kmg] total memory size (default %zu)\n", 1131 | DEF_TOTAL_MEMORY); 1132 | fprintf(stderr, 1133 | " NOTE: memory size will be rounded down to a multiple of " 1134 | "-T option\n"); 1135 | fprintf(stderr, 1136 | "-L use longer chase\n"); 1137 | fprintf(stderr, 1138 | "-n nr_samples nr of 0.5 second samples to use (default %zu, 0 = " 1139 | "infinite)\n", 1140 | DEF_NR_SAMPLES); 1141 | fprintf(stderr, 1142 | "-o perform an ordered traversal (rather than random)\n"); 1143 | fprintf( 1144
| stderr, 1145 | "-O nnnn[kmg] offset the entire chase by nnnn bytes (default %zu)\n", 1146 | DEF_OFFSET); 1147 | fprintf(stderr, "-s nnnn[kmg] stride size (default %zu)\n", DEF_STRIDE); 1148 | fprintf(stderr, "-T nnnn[kmg] TLB locality in bytes (default %zu)\n", 1149 | DEF_TLB_LOCALITY * default_page_size); 1150 | fprintf(stderr, 1151 | " NOTE: TLB locality will be rounded down to a multiple of " 1152 | "stride\n"); 1153 | fprintf(stderr, "-t nr_threads number of threads (default %zu)\n", 1154 | DEF_NR_THREADS); 1155 | fprintf(stderr, "-v verbose output (default %u)\n", verbosity); 1156 | fprintf( 1157 | stderr, 1158 | "-W mbind list list of node:weight,... pairs for allocating memory\n" 1159 | " has no effect if -H flag is specified\n" 1160 | " 0:10,1:90 weights it as 10%% on 0 and 90%% on 1\n"); 1161 | fprintf(stderr, "-X do not set thread affinity (default %u)\n", 1162 | set_thread_affinity); 1163 | fprintf(stderr, 1164 | "-y print timestamp in front of each line (default %u)\n", 1165 | print_timestamp); 1166 | exit(1); 1167 | } 1168 | 1169 | if (genchase_args.stride < sizeof(void *)) { 1170 | fprintf(stderr, "stride must be at least %zu\n", sizeof(void *)); 1171 | exit(1); 1172 | } 1173 | 1174 | // ensure some sanity in the various arguments 1175 | if (genchase_args.tlb_locality < genchase_args.stride) { 1176 | genchase_args.tlb_locality = genchase_args.stride; 1177 | } else { 1178 | genchase_args.tlb_locality -= 1179 | genchase_args.tlb_locality % genchase_args.stride; 1180 | } 1181 | 1182 | if (genchase_args.total_memory < genchase_args.tlb_locality) { 1183 | if (genchase_args.total_memory < genchase_args.stride) { 1184 | genchase_args.total_memory = genchase_args.stride; 1185 | } else { 1186 | genchase_args.total_memory -= 1187 | genchase_args.total_memory % genchase_args.stride; 1188 | } 1189 | genchase_args.tlb_locality = genchase_args.total_memory; 1190 | } else { 1191 | genchase_args.total_memory -= 1192 | genchase_args.total_memory % 
genchase_args.tlb_locality; 1193 | } 1194 | 1195 | genchase_args.nr_mixer_indices = 1196 | genchase_args.stride / chase->base_object_size; 1197 | if ((run_test_type == RUN_CHASE) && 1198 | (genchase_args.nr_mixer_indices < nr_threads * chase->parallelism)) { 1199 | fprintf(stderr, 1200 | "the stride is too small to interleave that many threads, need at " 1201 | "least %zu bytes\n", 1202 | nr_threads * chase->parallelism * chase->base_object_size); 1203 | exit(1); 1204 | } 1205 | 1206 | if (verbosity > 0) { 1207 | printf("nr_threads = %zu\n", nr_threads); 1208 | print_page_size(page_size, use_thp); 1209 | printf("total_memory = %zu (%.1f MiB)\n", genchase_args.total_memory, 1210 | genchase_args.total_memory / (1024. * 1024.)); 1211 | printf("stride = %zu\n", genchase_args.stride); 1212 | printf("tlb_locality = %zu\n", genchase_args.tlb_locality); 1213 | printf("chase = %s\n", chase_optarg); 1214 | printf("memload = %s\n", memload_optarg); 1215 | if (run_test_type == RUN_CHASE) printf("run_test_type = RUN_CHASE\n"); 1216 | if (run_test_type == RUN_BANDWIDTH) 1217 | printf("run_test_type = RUN_BANDWIDTH\n"); 1218 | if (run_test_type == RUN_CHASE_LOADED) 1219 | printf("run_test_type = RUN_CHASE_LOADED\n"); 1220 | } 1221 | 1222 | rng_init(1); 1223 | 1224 | if (run_test_type != RUN_BANDWIDTH) { 1225 | generate_chase_mixer(&genchase_args, nr_threads * chase->parallelism); 1226 | 1227 | // generate the chases by launching multiple threads 1228 | if (verbosity > 2) printf("allocate genchase_args.arena\n"); 1229 | genchase_args.arena = 1230 | (char *)alloc_arena_mmap(page_size, use_thp, 1231 | genchase_args.total_memory + offset, -1) + 1232 | offset; 1233 | } 1234 | per_thread_t *thread_data = alloc_arena_mmap( 1235 | default_page_size, false, nr_threads * sizeof(per_thread_t), -1); 1236 | void *flush_arena = NULL; 1237 | if (verbosity > 2) printf("allocate cache flush\n"); 1238 | if (cache_flush_size) { 1239 | flush_arena = alloc_arena_mmap(default_page_size, false, 
cache_flush_size, 1240 | -1); 1241 | memset(flush_arena, 1, cache_flush_size); // ensure pages are mapped 1242 | } 1243 | 1244 | pthread_t thread; 1245 | size_t nr_chase_threads = 0, nr_load_threads = 0; 1246 | nr_to_startup = nr_threads; 1247 | for (i = 0; i < nr_threads; ++i) { 1248 | thread_data[i].x.genchase_args = &genchase_args; 1249 | thread_data[i].x.nr_threads = nr_threads; 1250 | thread_data[i].x.thread_num = i; 1251 | thread_data[i].x.extra_args = extra_args; 1252 | thread_data[i].x.chase = chase; 1253 | thread_data[i].x.flush_arena = flush_arena; 1254 | thread_data[i].x.cache_flush_size = cache_flush_size; 1255 | thread_data[i].x.memload = memload; 1256 | thread_data[i].x.load_arena = NULL; // memory buffer used by this thread 1257 | thread_data[i].x.load_total_memory = 1258 | genchase_args.total_memory; // size of the arena 1259 | thread_data[i].x.load_offset = offset; // memory buffer offset 1260 | thread_data[i].x.delay = delay; 1261 | thread_data[i].x.use_longer_chase = use_longer_chase; 1262 | 1263 | if (run_test_type == RUN_CHASE_LOADED) { 1264 | if (i == 0) { 1265 | thread_data[i].x.run_test_type = RUN_CHASE; 1266 | nr_chase_threads++; 1267 | if (verbosity > 2) printf("main: Starting C[%ld]\n", i); 1268 | if (pthread_create(&thread, NULL, thread_start, &thread_data[i])) { 1269 | perror("pthread_create"); 1270 | exit(1); 1271 | } 1272 | } else { 1273 | thread_data[i].x.run_test_type = RUN_BANDWIDTH; 1274 | nr_load_threads++; 1275 | if (verbosity > 2) printf("main: Starting M[%ld]\n", i); 1276 | if (pthread_create(&thread, NULL, thread_start, &thread_data[i])) { 1277 | perror("pthread_create"); 1278 | exit(1); 1279 | } 1280 | } 1281 | 1282 | } else if (run_test_type == RUN_CHASE) { 1283 | thread_data[i].x.run_test_type = RUN_CHASE; 1284 | nr_chase_threads++; 1285 | if (verbosity > 2) printf("main: Starting C[%ld]\n", i); 1286 | if (pthread_create(&thread, NULL, thread_start, &thread_data[i])) { 1287 | perror("pthread_create"); 1288 | exit(1); 1289 
| } 1290 | } else { 1291 | nr_load_threads++; 1292 | thread_data[i].x.run_test_type = RUN_BANDWIDTH; 1293 | if (verbosity > 2) printf("main: Starting M[%ld]\n", i); 1294 | if (pthread_create(&thread, NULL, thread_start, &thread_data[i])) { 1295 | perror("pthread_create"); 1296 | exit(1); 1297 | } 1298 | } 1299 | } 1300 | 1301 | // now wait for them all to finish generating their chases/memloads and start 1302 | // testing 1303 | if (verbosity > 2) printf("main: waiting for threads to initialize\n"); 1304 | pthread_mutex_lock(&wait_mutex); 1305 | if (nr_to_startup) { 1306 | pthread_cond_wait(&wait_cond, &wait_mutex); 1307 | } 1308 | pthread_mutex_unlock(&wait_mutex); 1309 | usleep(LOAD_DELAY_WARMUP_uS); // Give OS scheduler thread migrations time to 1310 | // settle down. 1311 | 1312 | if (verbosity > 2) printf("main: start sampling thread progress\n"); 1313 | // now start sampling their progress 1314 | nr_samples = nr_samples + 1; // we drop the first sample 1315 | double *cur_samples = alloca(nr_threads * sizeof(*cur_samples)); 1316 | uint64_t last_sample_time, cur_sample_time; 1317 | double chase_min = 1. / 0., chase_max = 0.; 1318 | double chase_running_sum = 0., load_running_sum = 0., 1319 | chase_running_geosum = 0.; 1320 | double load_max_mibps = 0, load_min_mibps = 1. / 0.; 1321 | double chase_thd_sum = 0, load_thd_sum = 0; 1322 | uint64_t time_delta = 0; 1323 | int ready; 1324 | 1325 | last_sample_time = now_nsec(); 1326 | for (size_t sample_no = 0; nr_samples == 1 || sample_no < nr_samples; 1327 | ++sample_no) { 1328 | if (verbosity > 0) printf("main: sample_no=%ld ", sample_no); 1329 | usleep(LOAD_DELAY_RUN_uS); 1330 | // Request threads to update their sample 1331 | for (i = 0; i < nr_threads; ++i) { 1332 | thread_data[i].x.sample_no = sample_no; 1333 | } 1334 | 1335 | chase_thd_sum = 0.; 1336 | load_thd_sum = 0.; 1337 | usleep(LOAD_DELAY_SAMPLE_uS); // Give load threads time to update sample 1338 | // count. 
Chase threads are always updating 1339 | for (i = 0; i < nr_threads; ++i) { 1340 | if (verbosity > 2) printf("-"); 1341 | ready = 0; 1342 | while (ready == 0) { 1343 | cur_samples[i] = 1344 | (double)__sync_lock_test_and_set(&thread_data[i].x.count, 0); 1345 | if (cur_samples[i] != 0) { 1346 | // Chase threads start at thread 0 and should always be ready, 1347 | // therefore we read the chase timestamp as soon as we finish reading the 1348 | // last chase thread. Load threads return pre-calculated MiB/s so they 1349 | // don't use this timer. 1350 | if ((i + 1) == nr_chase_threads) { 1351 | cur_sample_time = now_nsec(); 1352 | time_delta = cur_sample_time - last_sample_time; 1353 | last_sample_time = cur_sample_time; 1354 | } 1355 | ready = 1; 1356 | } else { 1357 | if (verbosity > 2) printf("*"); 1358 | usleep(LOAD_DELAY_SAMPLE_uS); 1359 | } 1360 | } 1361 | } 1362 | 1363 | for (i = 0; i < nr_threads; ++i) { 1364 | // printf("main: thread[%d], run_mode=%i\n", t->x.thread_num, 1365 | // t->x.run_test_type); 1366 | if (thread_data[i].x.run_test_type == RUN_CHASE) { 1367 | chase_thd_sum += (double)cur_samples[i]; 1368 | if (verbosity > 1) { 1369 | double z = time_delta / (double)cur_samples[i]; 1370 | double mibps = sizeof(void *) / (z / 1000000000.0) / (1024 * 1024); 1371 | printf(" MC(%ld)%.3f, %6.1f(ns), %.3f(MiB/s)", i, cur_samples[i], z, 1372 | mibps); 1373 | } 1374 | } else { 1375 | load_thd_sum += (double)cur_samples[i]; 1376 | if (verbosity > 1) { 1377 | printf(" ML(%ld)%.0f(MiB/s)", i, cur_samples[i]); 1378 | } 1379 | } 1380 | } 1381 | 1382 | // we drop the first sample because it's fairly likely one 1383 | // thread had some advantage initially due to still having 1384 | // portions of the chase in a cache. 1385 | if (sample_no == 0) { 1386 | if (verbosity > 0) printf("\n"); 1387 | continue; 1388 | } 1389 | 1390 | // Calculate chase overall thread stats. 
1391 | if (chase_thd_sum != 0) { 1392 | double t = time_delta / (double)chase_thd_sum; 1393 | chase_running_sum += t; 1394 | chase_running_geosum += log(t); 1395 | if (t < chase_min) chase_min = t; 1396 | if (t > chase_max) chase_max = t; 1397 | if (verbosity > 0) { 1398 | double z = t * nr_chase_threads; 1399 | printf(" avg=%.1f(ns)\n", z); 1400 | } 1401 | } 1402 | 1403 | // Calculate memory load overall thread stats 1404 | if (load_thd_sum != 0) { 1405 | if (load_thd_sum > load_max_mibps) load_max_mibps = load_thd_sum; 1406 | if (load_thd_sum < load_min_mibps) load_min_mibps = load_thd_sum; 1407 | load_running_sum += load_thd_sum; 1408 | if (verbosity > 0) { 1409 | printf(" main: threads=%ld, Total(MiB/s)=%.*f, PerThread=%.f\n", 1410 | nr_load_threads, load_thd_sum < 100. ? 3 : 1, load_thd_sum, 1411 | load_thd_sum / nr_load_threads); 1412 | } 1413 | } 1414 | } 1415 | 1416 | // printf("sample_sum=%.f\n", sample_sum); 1417 | // printf("main: float=%li, void*=%li, size_t=%li, uint64_t=%li, double=%li, 1418 | // long=%li, int=%li\n", 1419 | // sizeof(float), sizeof(void*), sizeof(size_t), sizeof(uint64_t), 1420 | // sizeof(double), sizeof(long), sizeof(int) ); 1421 | double ChasNS = 0, ChasDEV = 0, ChasBEST = 0, ChasWORST = 0, ChasAVG = 0, 1422 | ChasMibs = 0, ChasGEO = 0; 1423 | double LdAvgMibs = 0, LdMibsDEV = 0; 1424 | 1425 | if (nr_chase_threads != 0) { 1426 | ChasAVG = chase_running_sum * nr_chase_threads / (nr_samples - 1); 1427 | ChasGEO = 1428 | nr_chase_threads * exp(chase_running_geosum / ((double)nr_samples - 1)); 1429 | ChasBEST = chase_min * nr_chase_threads; 1430 | ChasWORST = chase_max * nr_chase_threads; 1431 | ChasDEV = ((ChasWORST - ChasBEST) / ChasAVG); 1432 | if (verbosity > 0) { 1433 | printf( 1434 | "ChasAVG=%-8f, ChasGEO=%-8f, ChasBEST=%-8f, ChasWORST=%-8f, " 1435 | "ChasDEV=%-8.3f\n", 1436 | ChasAVG, ChasGEO, ChasBEST, ChasWORST, ChasDEV); 1437 | } 1438 | // if (print_average) ChasNS = ChasAVG; 1439 | if (print_average) 1440 | ChasNS = 
ChasGEO; 1441 | else 1442 | ChasNS = ChasBEST; 1443 | ChasMibs = nr_chase_threads * 1444 | (sizeof(void *) / (ChasNS / 1000000000.0) / (1024 * 1024)); 1445 | } 1446 | 1447 | if (nr_load_threads != 0) { 1448 | LdAvgMibs = load_running_sum / (nr_samples - 1); 1449 | LdMibsDEV = ((load_max_mibps - load_min_mibps) / LdAvgMibs); 1450 | if (verbosity > 0) { 1451 | printf( 1452 | "LdAvgMibs=%-8f, LdMaxMibs=%-8f, LdMinMibs=%-8f, LdDevMibs=%-8.3f\n", 1453 | LdAvgMibs, load_max_mibps, load_min_mibps, LdMibsDEV); 1454 | } 1455 | } 1456 | 1457 | const char *not_used = "--------"; 1458 | printf( 1459 | "Samples\t, Byte/thd\t, ChaseThds\t, ChaseNS\t, ChaseMibs\t, " 1460 | "ChDeviate\t, LoadThds\t, LdMaxMibs\t, LdAvgMibs\t, LdDeviate\t, " 1461 | "ChaseArg\t, MemLdArg\n"); 1462 | printf( 1463 | "%-6ld\t, %-11ld\t, %-8ld\t, %-8.3f\t, %-8.f\t, %-8.3f\t, %-8.f\t, " 1464 | "%-8.f\t, %-8.f\t, %-8.3f", 1465 | nr_samples - 1, thread_data[0].x.load_total_memory, nr_chase_threads, 1466 | ChasNS, ChasMibs, ChasDEV, (double)nr_load_threads, load_max_mibps, 1467 | LdAvgMibs, LdMibsDEV); 1468 | switch (run_test_type) { 1469 | case RUN_CHASE_LOADED: 1470 | printf("\t, %s\t, %s\n", chase_optarg, memload_optarg); 1471 | break; 1472 | case RUN_BANDWIDTH: 1473 | printf("\t, %s\t, %s\n", not_used, memload_optarg); 1474 | break; 1475 | default: 1476 | printf("\t, %s\t, %s\n", chase_optarg, not_used); 1477 | } 1478 | 1479 | timestamp(); 1480 | exit(0); 1481 | } 1482 | -------------------------------------------------------------------------------- /permutation.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 
4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #include "permutation.h" 15 | 16 | #include <assert.h> 17 | #include <limits.h> 18 | #include <stdint.h> 19 | #include <stdio.h> 20 | #include <stdlib.h> 21 | #include <string.h> 22 | 23 | // some asserts are more expensive than we want in general use, but there are a 24 | // few I want active even in general use. 25 | #if 1 26 | #define dassert(x) \ 27 | do { \ 28 | } while (0) 29 | #else 30 | #define dassert(x) assert(x) 31 | #endif 32 | 33 | // XXX: declare this somewhere 34 | extern int verbosity; 35 | 36 | __thread char *rng_buf; // buf size (32 for now) determines the "randomness" 37 | __thread struct random_data *rand_state; // per_thread state for random_r 38 | 39 | //============================================================================ 40 | // A random permutation generator. I think this algorithm is from Knuth. 
41 | 42 | void gen_random_permutation(perm_t *perm, size_t nr, size_t base) { 43 | size_t i; 44 | 45 | for (i = 0; i < nr; ++i) { 46 | size_t t = rng_int(i); 47 | perm[i] = perm[t]; 48 | perm[t] = base + i; 49 | } 50 | } 51 | 52 | void gen_ordered_permutation(perm_t *perm, size_t nr, size_t base) { 53 | size_t i; 54 | 55 | for (i = 0; i < nr; ++i) { 56 | perm[i] = base + i; 57 | } 58 | } 59 | 60 | int is_a_permutation(const perm_t *perm, size_t nr_elts) { 61 | uint8_t *vec; 62 | size_t vec_len = (nr_elts + 7) / 8; 63 | size_t i; 64 | 65 | vec = malloc(vec_len); 66 | memset(vec, 0, vec_len); 67 | 68 | for (i = 0; i < nr_elts; ++i) { 69 | size_t vec_elt = perm[i] / 8; 70 | size_t test_bit = 1u << (perm[i] % 8); 71 | if (vec[vec_elt] & test_bit) { 72 | free(vec); 73 | return 0; 74 | } 75 | vec[vec_elt] |= test_bit; 76 | } 77 | for (i = 0; i < nr_elts / 8; ++i) { 78 | if (vec[i] != 0xff) { 79 | free(vec); 80 | return 0; 81 | } 82 | } 83 | if (nr_elts % 8) { 84 | if (vec[vec_len - 1] != ((1u << (nr_elts % 8)) - 1)) { 85 | free(vec); 86 | return 0; 87 | } 88 | } 89 | free(vec); 90 | return 1; 91 | } 92 | 93 | void generate_chase_mixer(struct generate_chase_common_args *args, 94 | size_t nr_mixers) { 95 | size_t nr_mixer_indices = args->nr_mixer_indices; 96 | void (*gen_permutation)(perm_t *, size_t, size_t) = args->gen_permutation; 97 | 98 | /* Set number of mixers rounded up to a power of two */ 99 | if (nr_mixers > 1) { 100 | args->nr_mixers = 1 << (CHAR_BIT * sizeof(long) - 101 | __builtin_clzl(nr_mixers - 1)); 102 | } 103 | 104 | if (args->nr_mixers < 64) { 105 | args->nr_mixers = 64; 106 | } 107 | 108 | if (verbosity > 1) 109 | printf("nr_mixers = %zu\n", args->nr_mixers); 110 | perm_t *t = malloc(nr_mixer_indices * sizeof(*t)); 111 | if (t == NULL) { 112 | fprintf(stderr, "Could not allocate %lu bytes, check stride/memory size?\n", 113 | nr_mixer_indices * sizeof(*t)); 114 | exit(1); 115 | } 116 | perm_t *r = malloc(nr_mixer_indices * args->nr_mixers * 
sizeof(*r)); 117 | if (r == NULL) { 118 | fprintf(stderr, "Could not allocate %lu bytes, check stride/memory size?\n", 119 | nr_mixer_indices * args->nr_mixers * sizeof(*r)); 120 | exit(1); 121 | } 122 | size_t i; 123 | size_t j; 124 | 125 | // we arrange r in a transposed manner so that all of the 126 | // data for a particular mixer_idx is packed together. 127 | for (i = 0; i < args->nr_mixers; ++i) { 128 | gen_permutation(t, nr_mixer_indices, 0); 129 | for (j = 0; j < nr_mixer_indices; ++j) { 130 | r[j * args->nr_mixers + i] = t[j]; 131 | } 132 | } 133 | free(t); 134 | 135 | args->mixer = r; 136 | } 137 | 138 | // Generate a pointer chasing sequence according to chase args. 139 | void *generate_chase(const struct generate_chase_common_args *args, 140 | size_t mixer_idx) { 141 | char *arena = args->arena; 142 | size_t total_memory = args->total_memory; 143 | size_t stride = args->stride; 144 | size_t tlb_locality = args->tlb_locality; 145 | void (*gen_permutation)(perm_t *, size_t, size_t) = args->gen_permutation; 146 | const perm_t *mixer = args->mixer + mixer_idx * args->nr_mixers; 147 | size_t nr_mixer_indices = args->nr_mixer_indices; 148 | 149 | size_t nr_tlb_groups = total_memory / tlb_locality; 150 | size_t nr_elts_per_tlb = tlb_locality / stride; 151 | size_t nr_elts = total_memory / stride; 152 | perm_t *tlb_perm; 153 | perm_t *perm; 154 | size_t i; 155 | perm_t *perm_inverse; 156 | size_t mixer_scale = stride / nr_mixer_indices; 157 | 158 | if (verbosity > 1) 159 | printf("generating permutation of %zu elements (in %zu TLB groups)\n", 160 | nr_elts, nr_tlb_groups); 161 | tlb_perm = malloc(nr_tlb_groups * sizeof(*tlb_perm)); 162 | gen_permutation(tlb_perm, nr_tlb_groups, 0); 163 | perm = malloc(nr_elts * sizeof(*perm)); 164 | for (i = 0; i < nr_tlb_groups; ++i) { 165 | gen_permutation(&perm[i * nr_elts_per_tlb], nr_elts_per_tlb, 166 | tlb_perm[i] * nr_elts_per_tlb); 167 | } 168 | free(tlb_perm); 169 | 170 | dassert(is_a_permutation(perm, nr_elts)); 171 | 
172 | if (verbosity > 1) printf("generating inverse permutation\n"); 173 | perm_inverse = malloc(nr_elts * sizeof(*perm)); 174 | for (i = 0; i < nr_elts; ++i) { 175 | perm_inverse[perm[i]] = i; 176 | } 177 | 178 | dassert(is_a_permutation(perm_inverse, nr_elts)); 179 | 180 | #define MIXED(x) ((x)*stride + mixer[(x) & (args->nr_mixers - 1)] * mixer_scale) 181 | 182 | if (verbosity > 1) 183 | printf("threading the chase (mixer_idx = %zu)\n", mixer_idx); 184 | for (i = 0; i < nr_elts; ++i) { 185 | size_t next; 186 | dassert(perm[perm_inverse[i]] == i); 187 | next = perm_inverse[i] + 1; 188 | next = (next == nr_elts) ? 0 : next; 189 | *(void **)(arena + MIXED(i)) = (void *)(arena + MIXED(perm[next])); 190 | } 191 | 192 | free(perm); 193 | free(perm_inverse); 194 | 195 | return arena + MIXED(0); 196 | } 197 | 198 | // Generates nr_mixer_indices/total_par number of permutations and switches to 199 | // the next permutation in each iteration of the chase. 200 | // This modification is effective in getting around the CMC prefetcher. 
201 | void *generate_chase_long(const struct generate_chase_common_args *args, 202 | size_t mixer_idx, size_t total_par) { 203 | char *arena = args->arena; 204 | size_t total_memory = args->total_memory; 205 | size_t stride = args->stride; 206 | size_t tlb_locality = args->tlb_locality; 207 | void (*gen_permutation)(perm_t *, size_t, size_t) = args->gen_permutation; 208 | size_t nr_mixer_indices = args->nr_mixer_indices; 209 | size_t nr_iteration = nr_mixer_indices / total_par; 210 | const perm_t *mixer = args->mixer + mixer_idx * nr_iteration * args->nr_mixers; 211 | 212 | size_t nr_tlb_groups = total_memory / tlb_locality; 213 | size_t nr_elts_per_tlb = tlb_locality / stride; 214 | size_t nr_elts = total_memory / stride; 215 | perm_t *tlb_perm; 216 | perm_t *perm; 217 | size_t i; 218 | size_t j; 219 | size_t base; 220 | perm_t *perm_inverse; 221 | size_t mixer_scale = stride / nr_mixer_indices; 222 | 223 | if (verbosity > 1) 224 | printf("generating permutation of %zu elements (in %zu TLB groups)\n", 225 | nr_elts, nr_tlb_groups); 226 | 227 | perm = malloc(nr_iteration * nr_elts * sizeof(*perm)); 228 | if (perm == NULL) { 229 | fprintf(stderr, "Could not allocate %lu bytes\n", 230 | nr_iteration * nr_elts * sizeof(*perm)); 231 | exit(1); 232 | } 233 | 234 | // Generate nr_iteration number of permutations. 
235 | for (j = 0; j < nr_iteration; j++) { 236 | base = j * nr_elts; 237 | 238 | tlb_perm = malloc(nr_tlb_groups * sizeof(*tlb_perm)); 239 | if (tlb_perm == NULL) { 240 | fprintf(stderr, "Could not allocate %lu bytes\n", 241 | nr_tlb_groups * sizeof(*tlb_perm)); 242 | exit(1); 243 | } 244 | 245 | gen_permutation(tlb_perm, nr_tlb_groups, 0); 246 | 247 | for (i = 0; i < nr_tlb_groups; ++i) { 248 | gen_permutation(&perm[j * nr_elts + i * nr_elts_per_tlb], nr_elts_per_tlb, 249 | base + tlb_perm[i] * nr_elts_per_tlb); 250 | } 251 | free(tlb_perm); 252 | 253 | dassert(is_a_permutation(perm, nr_elts)); 254 | if (verbosity > 1) 255 | printf("generating inverse permutation\n"); 256 | } 257 | 258 | dassert(is_a_permutation(perm, nr_iteration * nr_elts)); 259 | 260 | perm_inverse = malloc(nr_iteration * nr_elts * sizeof(*perm_inverse)); 261 | if (perm_inverse == NULL) { 262 | fprintf(stderr, "Could not allocate %lu bytes\n", 263 | nr_iteration * nr_elts * sizeof(*perm_inverse)); 264 | exit(1); 265 | } 266 | 267 | for (i = 0; i < nr_iteration * nr_elts; ++i) { 268 | perm_inverse[perm[i]] = i; 269 | } 270 | 271 | dassert(is_a_permutation(perm_inverse, nr_iteration * nr_elts)); 272 | 273 | // Get the [(x mod NR_MIXER)th element in the jth row of mixer]th element 274 | // in the xth stride of the array. 275 | #define MIXED_2(x,j,n) ((x)*stride + (mixer + j*n)[(x) & (n-1)] * mixer_scale) 276 | 277 | if (verbosity > 1) 278 | printf("threading the chase (mixer_idx = %zu)\n", mixer_idx); 279 | 280 | // Generate the final permutation, which connects the nr_iteration 281 | // permutations of nr_elts elements together into one.
282 | for (i = 0; i < nr_elts * nr_iteration; ++i) { 283 | size_t next; 284 | dassert(perm[perm_inverse[i]] == i); 285 | assert(*(void **)(arena + MIXED_2(i%nr_elts,i/nr_elts, args->nr_mixers)) == NULL); 286 | next = perm_inverse[i] + 1; 287 | // If next is the position representing the start of a new iteration of 288 | // permutation, set next to be the position representing the start of 289 | // current iteration, because we want to finish current permutation before 290 | // proceeding to the next. 291 | next = (next%nr_elts == 0 && next/nr_elts > i/nr_elts) ? i/nr_elts*nr_elts : next; 292 | 293 | if (perm[next]%nr_elts == 0) { 294 | // If current iteration of permutation is finished, 295 | // new position is the start of next iteration. 296 | size_t new = (i/nr_elts + 1) * nr_elts; 297 | new = (new == nr_iteration * nr_elts) ? 0 : new; 298 | *(void **)(arena + MIXED_2(i%nr_elts,i/nr_elts, args->nr_mixers)) = 299 | (void *)(arena + MIXED_2(new%nr_elts,new/nr_elts, args->nr_mixers)); 300 | } else { 301 | *(void **)(arena + MIXED_2(i%nr_elts,i/nr_elts, args->nr_mixers)) = 302 | (void *)(arena + MIXED_2(perm[next]%nr_elts,perm[next]/nr_elts, args->nr_mixers)); 303 | } 304 | } 305 | 306 | free(perm); 307 | free(perm_inverse); 308 | 309 | return arena + MIXED_2(0,0, args->nr_mixers); 310 | } 311 | -------------------------------------------------------------------------------- /permutation.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 
4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #ifndef PERMUTATION_H_INCLUDED 15 | #define PERMUTATION_H_INCLUDED 16 | 17 | #include <assert.h> 18 | #include <stdint.h> 19 | #include <stdio.h> 20 | #include <stdlib.h> 21 | 22 | typedef size_t perm_t; 23 | 24 | void gen_random_permutation(perm_t *perm, size_t nr, size_t base); 25 | void gen_ordered_permutation(perm_t *perm, size_t nr, size_t base); 26 | int is_a_permutation(const perm_t *perm, size_t nr_elts); 27 | 28 | // regarding the mixer: suppose we have a stride of 256 ... what we want 29 | // to avoid is having the entire chase at offset 0 into the 256 byte element. 30 | // otherwise we might favour one bank/branch/etc. of the memory system. 31 | // similarly when we perform a parallel chase with multiple threads we don't 32 | // want one of the many chases to favour a particular offset into the stride. 33 | // so we have a "mixer" permutation. 34 | // 35 | // consider the arena as a large set of elements of size stride, then the naive 36 | // "mixer" would use some fixed offset into each element for a particular 37 | // (thread number, parallel chase) index. the mixer is a function on (element 38 | // index, thread number, parallel chase) which makes the offset into the element 39 | // unpredictable. 40 | // 41 | // the actual mixer is implemented as a large set of permutations on the 42 | // low bits of the element number. the details are private to the 43 | // implementation. 44 | 45 | // these are the common args required for generating all chases (and the mixer).
46 | struct generate_chase_common_args { 47 | char *arena; // memory used for all chases 48 | size_t total_memory; // size of the arena 49 | size_t stride; // size of each element 50 | size_t tlb_locality; // group accesses within this range in order to 51 | // amortize TLB fills 52 | size_t nr_mixers; // Rounded up to power of two number of mixers: 53 | // nr_threads * parallelism rounded up to power of two 54 | void (*gen_permutation)(perm_t *, size_t, 55 | size_t); // function for generating 56 | // permutations 57 | // typically gen_random_permutation 58 | size_t nr_mixer_indices; // number of mixer indices 59 | // typically stride/sizeof(void*) 60 | const perm_t *mixer; // the mixer function itself 61 | }; 62 | 63 | // create the mixer table 64 | void generate_chase_mixer(struct generate_chase_common_args *args, 65 | size_t nr_mixers); 66 | 67 | // create a chase for the given mixer_idx and return its first pointer 68 | void *generate_chase(const struct generate_chase_common_args *args, 69 | size_t mixer_idx); 70 | 71 | // create a longer chase for the given mixer_idx and total_par and 72 | // return its first pointer 73 | void *generate_chase_long(const struct generate_chase_common_args *args, 74 | size_t mixer_idx, size_t total_par); 75 | 76 | //============================================================================ 77 | // Modern multicore CPUs have increasingly large caches, so the LCRNG code 78 | // that was previously used is not sufficiently random anymore. 79 | // Now using glibc's reentrant random number generator "random_r" 80 | // still reproducible on the same platform, although not across systems/libs. 81 | 82 | // RNG_BUF_SIZE sets the size of rng_buf below, which is used by initstate_r 83 | // to decide how sophisticated a random number generator it should use: the 84 | // larger the state array, the better the random numbers will be. 85 | // 32 bytes was deemed to generate sufficient entropy. 
86 | #define RNG_BUF_SIZE 32 87 | extern __thread char *rng_buf; 88 | extern __thread struct random_data *rand_state; 89 | 90 | static inline void rng_init(unsigned thread_num) { 91 | rng_buf = (char *)calloc(1, RNG_BUF_SIZE); 92 | rand_state = (struct random_data *)calloc(1, sizeof(struct random_data)); 93 | assert(rand_state); 94 | if (initstate_r(thread_num, rng_buf, RNG_BUF_SIZE, rand_state) != 0) { 95 | perror("initstate_r"); 96 | exit(1); 97 | } 98 | } 99 | 100 | static inline perm_t rng_int(perm_t limit) { 101 | int r1, r2, r3, r4; 102 | uint64_t r; 103 | 104 | if (random_r(rand_state, &r1) || random_r(rand_state, &r2) 105 | || random_r(rand_state, &r3) || random_r(rand_state, &r4)) { 106 | perror("random_r"); 107 | exit(1); 108 | } 109 | // Assume that RAND_MAX is at least 16-bit long 110 | _Static_assert (RAND_MAX >= (1ul << 16), "RAND_MAX is too small"); 111 | r = (((uint64_t)r1 << 0) & 0x000000000000FFFFull) | 112 | (((uint64_t)r2 << 16) & 0x00000000FFFF0000ull) | 113 | (((uint64_t)r3 << 32) & 0x0000FFFF00000000ull) | 114 | (((uint64_t)r4 << 48) & 0xFFFF000000000000ull); 115 | 116 | return r % (limit + 1); 117 | } 118 | 119 | #endif 120 | -------------------------------------------------------------------------------- /pingpong.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 
13 | */ 14 | #define _GNU_SOURCE 15 | #include <pthread.h> 16 | #include <sched.h> 17 | #include <stdint.h> 18 | #include <stdio.h> 19 | #include <stdlib.h> 20 | #include <string.h> 21 | #include <sys/mman.h> 22 | #include <sys/types.h> 23 | #include <unistd.h> 24 | 25 | #include "cpu_util.h" 26 | #include "timer.h" 27 | 28 | #define NR_SAMPLES (5) 29 | #define SAMPLE_US (250000) 30 | 31 | static size_t nr_relax = 10; 32 | static size_t nr_tested_cores = ~0; 33 | 34 | typedef unsigned atomic_t; 35 | 36 | // this points to the mutex which will be pingponged back and forth 37 | // from core to core. it is allocated with mmap by the even thread 38 | // so that it should be local to at least one of the two cores (and 39 | // won't have any false sharing issues). 40 | static atomic_t *pingpong_mutex; 41 | 42 | // try to avoid false sharing by padding out the atomic_t 43 | typedef union { 44 | atomic_t x; 45 | char pad[AVOID_FALSE_SHARING]; 46 | } big_atomic_t __attribute__((aligned(AVOID_FALSE_SHARING))); 47 | static big_atomic_t nr_pingpongs; 48 | 49 | // an array we optionally modify to examine the effect of passing 50 | // more dirty data between caches. 51 | size_t nr_array_elts = 0; 52 | size_t *communication_array; 53 | 54 | // 55 | static volatile int stop_loops; 56 | 57 | typedef struct { 58 | cpu_set_t cpus; 59 | atomic_t me; 60 | atomic_t buddy; 61 | } thread_args_t; 62 | 63 | static void common_setup(thread_args_t *args) { 64 | // move to our target cpu 65 | if (sched_setaffinity(0, sizeof(cpu_set_t), &args->cpus)) { 66 | perror("sched_setaffinity"); 67 | exit(1); 68 | } 69 | 70 | // test if we're supposed to allocate the pingpong_mutex memory 71 | if (args->me == 0) { 72 | pingpong_mutex = mmap(0, getpagesize(), PROT_READ | PROT_WRITE, 73 | MAP_ANON | MAP_PRIVATE, -1, 0); 74 | if (pingpong_mutex == MAP_FAILED) { 75 | perror("mmap"); 76 | exit(1); 77 | } 78 | *pingpong_mutex = args->me; 79 | } 80 | 81 | // ensure both threads are ready before we leave -- so that 82 | // both threads have a copy of pingpong_mutex.
83 | static pthread_mutex_t wait_mutex = PTHREAD_MUTEX_INITIALIZER; 84 | static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER; 85 | static int wait_for_buddy = 1; 86 | pthread_mutex_lock(&wait_mutex); 87 | if (wait_for_buddy) { 88 | wait_for_buddy = 0; 89 | pthread_cond_wait(&wait_cond, &wait_mutex); 90 | } else { 91 | wait_for_buddy = 1; // for next invocation 92 | pthread_cond_broadcast(&wait_cond); 93 | } 94 | pthread_mutex_unlock(&wait_mutex); 95 | } 96 | 97 | #define template(name, xchg) \ 98 | static void *name(void *data) { \ 99 | thread_args_t *args = (thread_args_t *)data; \ 100 | \ 101 | common_setup(args); \ 102 | \ 103 | atomic_t nr = 0; \ 104 | atomic_t me = args->me; \ 105 | atomic_t buddy = args->buddy; \ 106 | atomic_t *cache_pingpong_mutex = pingpong_mutex; \ 107 | while (1) { \ 108 | if (stop_loops) { \ 109 | pthread_exit(0); \ 110 | } \ 111 | \ 112 | if (xchg(cache_pingpong_mutex, me, buddy)) { \ 113 | for (size_t x = 0; x < nr_array_elts; ++x) { \ 114 | ++communication_array[x]; \ 115 | } \ 116 | /* don't do the atomic_add every time... it costs too much */ \ 117 | ++nr; \ 118 | if (nr == 10000 && me == 0) { \ 119 | __sync_fetch_and_add(&nr_pingpongs.x, 2 * nr); \ 120 | nr = 0; \ 121 | } \ 122 | } \ 123 | for (size_t i = 0; i < nr_relax; ++i) { \ 124 | cpu_relax(); \ 125 | } \ 126 | } \ 127 | } 128 | 129 | template(locked_loop, __sync_bool_compare_and_swap) 130 | 131 | static inline int unlocked_xchg(atomic_t *p, atomic_t old, atomic_t new) { 132 | if (*(volatile atomic_t *)p == old) { 133 | *(volatile atomic_t *)p = new; 134 | return 1; 135 | } 136 | return 0; 137 | } 138 | 139 | template(unlocked_loop, unlocked_xchg) 140 | 141 | static void *xadd_loop(void *data) { 142 | thread_args_t *args = (thread_args_t *)data; 143 | 144 | common_setup(args); 145 | uint64_t *xadder = (uint64_t *)pingpong_mutex; 146 | atomic_t me = args->me; 147 | uint64_t add_amt = (me == 0) ? 
1 : (1ull << 32); 148 | uint32_t last_lo = 0; 149 | atomic_t nr = 0; 150 | 151 | while (1) { 152 | if (stop_loops) { 153 | pthread_exit(0); 154 | } 155 | 156 | uint64_t swap = __sync_fetch_and_add(xadder, add_amt); 157 | if (me == 1 && last_lo != (uint32_t)swap) { 158 | last_lo = swap; 159 | ++nr; 160 | if (nr == 10000) { 161 | __sync_fetch_and_add(&nr_pingpongs.x, 2 * nr); 162 | nr = 0; 163 | } 164 | } 165 | for (size_t i = 0; i < nr_relax; ++i) { 166 | cpu_relax(); 167 | } 168 | } 169 | } 170 | 171 | int main(int argc, char **argv) { 172 | void *(*thread_fn)(void *data) = NULL; 173 | int c; 174 | char *p; 175 | 176 | while ((c = getopt(argc, argv, "c:lur:xs:")) != -1) { 177 | switch (c) { 178 | case 'l': 179 | if (thread_fn) goto thread_fn_error; 180 | thread_fn = locked_loop; 181 | break; 182 | case 'u': 183 | if (thread_fn) goto thread_fn_error; 184 | thread_fn = unlocked_loop; 185 | break; 186 | case 'x': 187 | if (thread_fn) goto thread_fn_error; 188 | thread_fn = xadd_loop; 189 | break; 190 | case 'r': 191 | nr_relax = strtoul(optarg, &p, 0); 192 | if (*p) { 193 | fprintf(stderr, "-r requires a numeric argument\n"); 194 | exit(1); 195 | } 196 | break; 197 | case 'c': 198 | nr_tested_cores = strtoul(optarg, &p, 0); 199 | if (*p) { 200 | fprintf(stderr, "-c requires a numeric argument\n"); 201 | exit(1); 202 | } 203 | break; 204 | case 's': 205 | nr_array_elts = strtoul(optarg, &p, 0); 206 | if (*p) { 207 | fprintf(stderr, "-s requires a numeric argument\n"); 208 | exit(1); 209 | } 210 | if (posix_memalign((void **)&communication_array, 1ull << 21, 211 | nr_array_elts * sizeof(*communication_array))) { 212 | fprintf(stderr, "posix_memalign failed\n"); 213 | exit(1); 214 | } 215 | break; 216 | default: 217 | fprintf(stderr, 218 | "usage: %s [-l | -u | -x] [-r nr_relax] [-s " 219 | "nr_array_elts_to_dirty] [-c nr_tested_cores]\n", 220 | argv[0]); 221 | exit(1); 222 | } 223 | } 224 | if (thread_fn == NULL) { 225 | thread_fn_error: 226 | fprintf(stderr, "must 
specify exactly one of -u, -l or -x\n"); 227 | exit(1); 228 | } 229 | 230 | setvbuf(stdout, NULL, _IONBF, BUFSIZ); 231 | 232 | // find the active cpus 233 | cpu_set_t cpus; 234 | if (sched_getaffinity(getpid(), sizeof(cpus), &cpus)) { 235 | perror("sched_getaffinity"); 236 | exit(1); 237 | } 238 | 239 | printf( 240 | "avg latency to communicate a modified line from one core to another\n"); 241 | printf("times are in ns\n\n"); 242 | 243 | // print top row header 244 | const int col_width = 8; 245 | size_t first_cpu = ~0; 246 | size_t last_cpu = 0; 247 | printf(" "); 248 | for (size_t j = 0; j < CPU_SETSIZE; ++j) { 249 | if (CPU_ISSET(j, &cpus)) { 250 | if (first_cpu > j) { 251 | first_cpu = j; 252 | } else { 253 | printf("%*zu", col_width, j); 254 | } 255 | if (last_cpu < j) { 256 | last_cpu = j; 257 | } 258 | } 259 | } 260 | printf("\n"); 261 | 262 | for (size_t i = 0, core = 0; i < last_cpu && core < nr_tested_cores; ++i) { 263 | if (!CPU_ISSET(i, &cpus)) { 264 | continue; 265 | } 266 | ++core; 267 | thread_args_t even; 268 | CPU_ZERO(&even.cpus); 269 | CPU_SET(i, &even.cpus); 270 | even.me = 0; 271 | even.buddy = 1; 272 | printf("%2zu:", i); 273 | for (size_t j = first_cpu + 1; j <= i; ++j) { 274 | if (CPU_ISSET(j, &cpus)) { 275 | printf("%*s", col_width, ""); 276 | } 277 | } 278 | for (size_t j = i + 1; j <= last_cpu; ++j) { 279 | if (!CPU_ISSET(j, &cpus)) { 280 | continue; 281 | } 282 | 283 | thread_args_t odd; 284 | CPU_ZERO(&odd.cpus); 285 | CPU_SET(j, &odd.cpus); 286 | odd.me = 1; 287 | odd.buddy = 0; 288 | __sync_lock_test_and_set(&nr_pingpongs.x, 0); 289 | pthread_t odd_thread; 290 | if (pthread_create(&odd_thread, NULL, thread_fn, &odd)) { 291 | perror("pthread_create odd"); 292 | exit(1); 293 | } 294 | pthread_t even_thread; 295 | if (pthread_create(&even_thread, NULL, thread_fn, &even)) { 296 | perror("pthread_create even"); 297 | exit(1); 298 | } 299 | 300 | uint64_t last_stamp = now_nsec(); 301 | double best_sample = 1. 
/ 0.; // infinity 302 | for (size_t sample_no = 0; sample_no < NR_SAMPLES; ++sample_no) { 303 | usleep(SAMPLE_US); 304 | atomic_t s = __sync_lock_test_and_set(&nr_pingpongs.x, 0); 305 | uint64_t time_stamp = now_nsec(); 306 | double sample = (time_stamp - last_stamp) / (double)s; 307 | last_stamp = time_stamp; 308 | if (sample < best_sample) { 309 | best_sample = sample; 310 | } 311 | } 312 | printf("%*.1f", col_width, best_sample); 313 | 314 | stop_loops = 1; 315 | if (pthread_join(odd_thread, NULL)) { 316 | perror("pthread_join odd_thread"); 317 | exit(1); 318 | } 319 | if (pthread_join(even_thread, NULL)) { 320 | perror("pthread_join even_thread"); 321 | exit(1); 322 | } 323 | stop_loops = 0; 324 | 325 | if (munmap(pingpong_mutex, getpagesize())) { 326 | perror("munmap"); 327 | exit(1); 328 | } 329 | pingpong_mutex = NULL; 330 | } 331 | printf("\n"); 332 | } 333 | printf("\n"); 334 | 335 | return 0; 336 | } 337 | -------------------------------------------------------------------------------- /run_multiload.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2020 Ampere Computing LLC. All Rights Reserved. 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | ARCH=$(uname -p) 16 | HOSTNAME=$(cat /etc/hostname) 17 | DATE=`date +"%Y_%m-%d_%H.%M.%S.%p"` 18 | TOPDIR=$(pwd) 19 | start_time=$(date +'%s') 20 | MULTILOAD_PATH=${TOPDIR} 21 | BINDIR="${MULTILOAD_PATH}" 22 | OUTPUT_DIR="${TOPDIR}/results" 23 | mkdir -p ${OUTPUT_DIR} 24 | LOG_FILE="${OUTPUT_DIR}/multiload_${DATE}_${HOSTNAME}_run.log" 25 | log(){ 26 | echo -e "[$(date +"%m/%d %H:%M:%S %p")] $1" | tee -a $LOG_FILE 27 | } 28 | 29 | log_without_date(){ 30 | echo -e "$1" | tee -a $LOG_FILE 31 | } 32 | 33 | ################################################################################ 34 | # Test configuration variables 35 | ################################################################################ 36 | RUN_TEST_TYPE=2 # 0=latency only, 1=load only, 2=loaded latency 37 | ITERATIONS=3 # Number of times the program is run. Due to "Samples" below, reliable data can typically be had with only 1 iteration. 38 | SAMPLES=5 # Specifies the number of data samples taken during a single run of multiload 39 | # ~2sec per sample + 4sec warmup. Duration depends on LOAD_DELAY_* defines in multiload.c 40 | # Default is to return the best latency of the samples. Command "-a" can be used to return the average. 41 | SOCKET_EVAL=1 # 1=1P testing on a 2P system. 2=2P testing on a 2P system. Does not apply to a Non-Numa system. 42 | USE_REMOTE_MEMNODE=0 # Numactl only: 0= use localalloc, 1=force SOCKET_EVAL=1 and use remote memory (ie.
2nd half of the numa nodes) 43 | THREAD_AFFINITY_ENABLED=1 # enables use of taskset/numactl for thread control 44 | MPSTAT_PROFILE_ENABLE=0 #enables mpstat data collection 45 | VMSTAT_PROFILE_ENABLE=0 #enables vmstat data collection 46 | PROFILING_INTERVAL_SEC=3 #defines sampling rate 47 | 48 | usage(){ 49 | echo "========================================================================================" 50 | echo "Multiload memory read latency, bandwidth, and loaded-latency Benchmark Test" 51 | echo "hostname is ${HOSTNAME}, $(date)" 52 | echo "========================================================================================" 53 | echo " " 54 | echo "Command args: $0 " 55 | echo "Default is: $0 $RUN_TEST_TYPE $ITERATIONS $SAMPLES $SOCKET_EVAL $USE_REMOTE_MEMNODE $THREAD_AFFINITY_ENABLED" 56 | echo " " 57 | echo "" 58 | echo " 0 = Memory Read Latency (Runs multichase \"simple\" test, but all multichase commands should work manually)" 59 | echo " 1 = Memory Bandwidth (Runs a list of bandwidth load algorithms)" 60 | echo " 2 = Loaded Latency. (\"Chaseload\" combines 1 \"simple\" latency thread and multiple Bandwidth threads)" 61 | echo " " 62 | echo "The following are optional" 63 | echo " Number of multiload test runs. (Default=$ITERATIONS)" 64 | echo " Data samples per test run. (Default=$SAMPLES)" 65 | echo " 2P system only: 1=test as 1P, 2=test as 2P. (Default=$SOCKET_EVAL)" 66 | echo " NUMA system only: 0= --localalloc, 1= -membind to 2nd half of numa nodes (Default=$USE_REMOTE_MEMNODE)" 67 | echo " 0=no affinity, 1=use taskset for No-Numa and use numactl for Numa systems (Default=$THREAD_AFFINITY_ENABLED)" 68 | echo " " 69 | echo "========================================================================================" 70 | echo "Chase algorithm list. Issue multiload -h command for full list.
The 2 used by this script are:" 71 | echo " simple - randomized pointer chaser latency" 72 | echo " chaseload - Runs 1 thread of \"simple\" latency with multiple threads using the loads below." 73 | echo " " 74 | echo "Load algorithm list to test various rd/wr ratios. More algorithms can easily be added to multiload.c. Current algorithms are:" 75 | echo " memcpy-libc 1:1 rd:wr ratio - glibc memcpy()" 76 | echo " memset-libc 0:1 rd:wr ratio - glibc memset() non-zero data" 77 | echo " memsetz-libc 0:1 rd:wr ratio - glibc memset() zero data" 78 | echo " stream-copy 1:1 rd:wr ratio - lmbench stream copy instructions b[i]=a[i] (actual binary depends on compiler & -O level)" 79 | echo " stream-sum 1:0 rd:wr ratio - lmbench stream sum instructions: a[i]+=1 (actual binary depends on compiler & -O level)" 80 | echo " stream-triad 2:1 rd:wr ratio - lmbench stream triad instructions: a[i]=b[i]+(scalar*c[i])" 81 | echo " " 82 | echo "*** Due to the complexity of other options, they can only be changed by editing this script" 83 | echo "========================================================================================" 84 | } 85 | 86 | if [ "$#" == "0" ] ; then 87 | usage 88 | exit 1 89 | fi 90 | if [ ! -z $1 ]; then 91 | RUN_TEST_TYPE=$1 92 | if [ ! -z $2 ]; then 93 | ITERATIONS=$2 94 | if [ ! -z $3 ]; then 95 | SAMPLES=$3 96 | if [ ! -z $4 ]; then 97 | SOCKET_EVAL=$4 98 | if [ ! -z $5 ]; then 99 | USE_REMOTE_MEMNODE=$5 100 | if [ ! -z $6 ]; then 101 | THREAD_AFFINITY_ENABLED=$6 102 | fi 103 | fi 104 | fi 105 | fi 106 | fi 107 | fi 108 | 109 | RUN_CHASE=0 110 | RUN_BANDWIDTH=1 111 | RUN_CHASE_LOADED=2 112 | if [ $RUN_TEST_TYPE = $RUN_CHASE ]; then 113 | PSTEP_START=1 # Parallel thread start value when running thread scaling tests. 114 | PSTEP_INC=4 # Parallel thread steps when running thread scaling tests. 115 | PSTEP_END=512 # Will be reduced to CPUTHREADS if CPUTHREADS < PSTEP_END.
116 | CHASE_ALGORITHM="simple" 117 | LOAD_ALGORITHM_LIST="none" 118 | RAND_STRIDE=16 #lmbench latmemrd uses 16 for simple chase. Other chase/mem sizes may need to be bigger (ie. 512). 119 | BUFLIST_TYPE=0 # 0=Use MEM_SIZE* to create a memory list, 1=use buflist_custom 120 | let MEM_SIZE_END_B=1*1024*1024*1024 121 | let MEM_SIZE_START_B=4*1024 122 | #let MEM_SIZE_START_B=MEM_SIZE_END_B 123 | buflist_custom=( $((32*1024)) $((512*1024)) $((16*1024*1024)) $((1*1024*1024*1024)) ) # 64K / 1M / 32M caches 124 | 125 | elif [ $RUN_TEST_TYPE = $RUN_BANDWIDTH ]; then 126 | PSTEP_START=1 # Parallel thread start value when running thread scaling tests. 127 | PSTEP_INC=4 # Parallel thread steps when running thread scaling tests. 128 | PSTEP_END=512 # Will be reduced to CPUTHREADS if CPUTHREADS < PSTEP_END. 129 | CHASE_ALGORITHM="none" 130 | LOAD_ALGORITHM_LIST="memcpy-libc memset-libc memsetz-libc stream-sum stream-triad" 131 | RAND_STRIDE=16 #not used for bandwidth test 132 | BUFLIST_TYPE=1 # 0=Use MEM_SIZE* to create a memory list, 1=use buflist_custom 133 | let MEM_SIZE_END_B=1*1024*1024*1024 134 | #let MEM_SIZE_START_B=4*1024 135 | let MEM_SIZE_START_B=MEM_SIZE_END_B 136 | buflist_custom=( $((32*1024)) $((512*1024)) $((16*1024*1024)) $((1*1024*1024*1024)) ) # 64K / 1M / 32M caches 137 | 138 | elif [ $RUN_TEST_TYPE = $RUN_CHASE_LOADED ]; then 139 | PSTEP_START=1 # Parallel thread start value when running thread scaling tests. 140 | PSTEP_INC=4 # Parallel thread steps when running thread scaling tests. 141 | PSTEP_END=512 # Will be reduced to CPUTHREADS if CPUTHREADS < PSTEP_END. 142 | CHASE_ALGORITHM="chaseload" 143 | LOAD_ALGORITHM_LIST="memcpy-libc memset-libc memsetz-libc stream-sum stream-triad" 144 | RAND_STRIDE=16 #lmbench latmemrd uses 16 for simple chase. Other chase/mem sizes may need to be bigger (ie. 512). 
145 | BUFLIST_TYPE=0 # 0=Use MEM_SIZE* to create a memory list, 1=use buflist_custom 146 | let MEM_SIZE_END_B=1*1024*1024*1024 147 | #let MEM_SIZE_START_B=4*1024 148 | let MEM_SIZE_START_B=MEM_SIZE_END_B 149 | buflist_custom=( $((32*1024)) $((512*1024)) $((16*1024*1024)) $((1*1024*1024*1024)) ) # 64K / 1M / 32M caches 150 | else 151 | echo "Found unknown RUN_TEST_TYPE=$RUN_TEST_TYPE" 152 | usage 153 | exit 154 | fi 155 | 156 | ################################################################################ 157 | # Functions 158 | ################################################################################ 159 | profiling_start(){ 160 | if [ "$MPSTAT_PROFILE_ENABLE" == "1" ] ; then 161 | echo "$1" >> ${LOG_MPSTATS_FILE} 162 | mpstat -P ALL $PROFILING_INTERVAL_SEC >> ${LOG_MPSTATS_FILE} 2>&1 & 163 | mpstat_pid=$! 164 | fi 165 | if [ "$VMSTAT_PROFILE_ENABLE" == "1" ] ; then 166 | echo "$1" >> ${LOG_VMSTATS_FILE} 167 | vmstat -t $PROFILING_INTERVAL_SEC >> ${LOG_VMSTATS_FILE} 2>&1 & 168 | vmstat_pid=$! 
169 | fi 170 | } 171 | 172 | profiling_end(){ 173 | # kill the profiling pids and try to hide the "terminated" messages 174 | if [ "$MPSTAT_PROFILE_ENABLE" == "1" ] ; then 175 | ( kill $mpstat_pid &> /dev/null ) & 176 | wait $mpstat_pid &> /dev/null 177 | fi 178 | if [ "$VMSTAT_PROFILE_ENABLE" == "1" ] ; then 179 | ( kill $vmstat_pid &> /dev/null ) & 180 | wait $vmstat_pid &> /dev/null 181 | fi 182 | } 183 | 184 | get_hardware_config () 185 | { 186 | log_without_date " " 187 | numactl --hardware | tee -a ${LOG_FILE} # display current NUMA & memory setup 188 | log_without_date " " 189 | 190 | phycore_num=`lscpu | grep "Core(s) per socket" | tr -d ' ' | cut -d':' -f2 2> /dev/null` 191 | core_threads=`lscpu | grep "Thread(s) per core:" | tr -d ' ' | cut -d':' -f2 2> /dev/null` 192 | cputhread_num=`lscpu | grep "CPU(s): " | head -n 1 | tr -d ' ' | cut -d ':' -f2 2> /dev/null` 193 | numa_num=`lscpu | grep "NUMA node(s)" | tr -d ' ' | cut -d':' -f2 2> /dev/null` 194 | socket_num=`lscpu | grep "Socket(s)" | tr -d ' ' | cut -d':' -f2 2> /dev/null` 195 | MEMBIND_LIST=`numactl --show 2> /dev/null | grep membind | cut -d':' -f2 2> /dev/null` 196 | let phycore_end=$phycore_num-1 197 | let ht_threads=$phycore_num*$core_threads 198 | #echo "get_hardware_config: phyend=$phycore_end, ht_t=$ht_threads" 199 | 200 | if [ -z $cputhread_num ]; then 201 | log_without_date "Can't find the CPU(s) core count, exiting" 202 | exit $? 
203 | else 204 | CPUTHREADS=$cputhread_num 205 | fi 206 | 207 | log "Found the following hardware:" 208 | log_without_date " sockets = $socket_num" 209 | log_without_date " physical cores = $phycore_num" 210 | log_without_date " threads per core= $core_threads" 211 | log_without_date " logical threads = $cputhread_num" 212 | if [ -z "$numa_num" ]; then 213 | NUMA_NODES=1 214 | log_without_date " NUMA nodes = none found" 215 | else 216 | NUMA_NODES=$numa_num 217 | log_without_date " NUMA nodes = $NUMA_NODES" 218 | fi 219 | 220 | if [ $USE_REMOTE_MEMNODE == "1" ] && [ $NUMA_NODES -gt "1" ] && [ $THREAD_AFFINITY_ENABLED == "1" ]; then 221 | SOCKET_EVAL=1 222 | if [ $numa_num == "2" ]; then 223 | NODE_MEMBIND="1" 224 | elif [ $numa_num == "4" ]; then 225 | NODE_MEMBIND="2,3" 226 | elif [ $numa_num == "8" ]; then 227 | NODE_MEMBIND="4,5,6,7" 228 | else 229 | NODE_MEMBIND="0" 230 | fi 231 | fi 232 | 233 | if [ $socket_num -eq "1" ]; then 234 | #Check if this is only a 1P box force SOCKET_EVAL=1 235 | SOCKET_EVAL=1 236 | elif [ $socket_num -ge "2" ] && [ $SOCKET_EVAL -eq "1" ]; then 237 | #Check if doing 1P only testing on a 2P+ box and adjust CPUTHREADS for 1P 238 | let CPUTHREADS=$cputhread_num/$socket_num 239 | fi 240 | 241 | if [ $PSTEP_END -gt $CPUTHREADS ]; then 242 | PSTEP_END=$CPUTHREADS 243 | fi 244 | } 245 | 246 | duration(){ # calculates duration in secs 247 | duration=$SECONDS 248 | log_without_date 249 | log "$1 runtime: $(($duration / 3600)) hrs, $((($duration % 3600) / 60)) mins, $(($duration % 60)) secs" 250 | } 251 | 252 | #converts an array into string and deletes spaces (can also add delimiters using $1) 253 | join_ws() { local d=$1 s=$2; shift 2 && printf %s "$s${@/#/$d}"; } 254 | 255 | create_taskset_cpulist_x86_64() 256 | { 257 | let cpu_max_4bits=$1/4+1 #need +1 in case cputhreads is not a multiple of 4. 
258 | #create base string arrays 259 | for (( cpu=0; cpu&1 470 | done 471 | elif [ $NUMA_NODES -eq 1 ] ; then 472 | profiling_start "Run $ITERATIONS iterations, taskset ${cpulist[$t-1]} $BASE_MULTILOAD_CMD -t $t -m $j ${LOAD_COMMAND}" 473 | for i in $(seq 1 $ITERATIONS) ; do 474 | log "Iteration $i of $ITERATIONS, taskset ${cpulist[$t-1]} $BASE_MULTILOAD_CMD -t $t -m $j ${LOAD_COMMAND}" 475 | taskset ${cpulist[$t-1]} ${BASE_MULTILOAD_CMD} -t $t -m $j ${LOAD_COMMAND} | tee -a ${OUTPUT_DIR}/$FILENAME.txt 2>&1 476 | done 477 | else 478 | profiling_start "Run $ITERATIONS iterations, numactl ${MEMBIND_COMMAND} -C ${cpulist[$t-1]} $BASE_MULTILOAD_CMD -t $t -m $j ${LOAD_COMMAND}" 479 | for i in $(seq 1 $ITERATIONS) ; do 480 | log "Run iter $i/$ITERATIONS, numactl ${MEMBIND_COMMAND} -C ${cpulist[$t-1]} $BASE_MULTILOAD_CMD -t $t -m $j ${LOAD_COMMAND}" 481 | numactl ${MEMBIND_COMMAND} -C ${cpulist[$t-1]} ${BASE_MULTILOAD_CMD} -t $t -m $j ${LOAD_COMMAND} | tee -a ${OUTPUT_DIR}/$FILENAME.txt 2>&1 482 | done 483 | fi 484 | profiling_end &> /dev/null 485 | duration "Total" 486 | done 487 | done 488 | done 489 | } 490 | 491 | parse(){ 492 | first=1 493 | rm -f out.txt 494 | while IFS= read -r line; do 495 | #Only keep 1st header line 496 | if [ "$first" -eq "1" ]; then 497 | echo "$line" > out.txt 498 | first=0 499 | elif [[ ! 
$line =~ "ample" ]]; then 500 | echo "$line" >> out.txt 501 | fi 502 | done < "${OUTPUT_DIR}/$1.txt" 503 | 504 | # Delete all spaces and tabs. 505 | tr -d '[:blank:]' < out.txt > "${OUTPUT_DIR}/$1.csv" 506 | rm -f out.txt 507 | } 508 | 509 | ################################################################################ 510 | # Main 511 | ################################################################################ 512 | get_hardware_config 513 | create_thdlist 514 | if [ "$BUFLIST_TYPE" == "0" ]; then 515 | create_buffer_list_lmbench 516 | else 517 | create_buffer_list_custom 518 | fi 519 | 520 | if [ "$CHASE_ALGORITHM" == "none" ]; then 521 | BASE_MULTILOAD_CMD="${BINDIR}/multiload -s ${RAND_STRIDE} -T 16g -n ${SAMPLES}" 522 | FILENAME="multiload_${DATE}_-s_${RAND_STRIDE}_-T_16g_-n_${SAMPLES}_-m_${bufsize_testlist[0]}-${bufsize_testlist[-1]}_-t_${thdcount_testlist[0]}-${thdcount_testlist[-1]}_EVAL_${SOCKET_EVAL}P" 523 | else 524 | BASE_MULTILOAD_CMD="${BINDIR}/multiload -s ${RAND_STRIDE} -T 16g -n ${SAMPLES} -c ${CHASE_ALGORITHM}" 525 | FILENAME="multiload_${DATE}_-s_${RAND_STRIDE}_-T_16g_-n_${SAMPLES}_-c_${CHASE_ALGORITHM}_-m_${bufsize_testlist[0]}-${bufsize_testlist[-1]}_-t_${thdcount_testlist[0]}-${thdcount_testlist[-1]}_EVAL_${SOCKET_EVAL}P" 526 | fi 527 | if [ "$USE_REMOTE_MEMNODE" == "1" ] && [ "$NUMA_NODES" -gt "1" ] && [ "$THREAD_AFFINITY_ENABLED" == "1" ]; then 528 | FILENAME="${FILENAME}_REMOTE" 529 | fi 530 | LOG_VMSTATS_FILE="${OUTPUT_DIR}/multiload_${DATE}_${HOSTNAME}_vmstats.log" 531 | LOG_MPSTATS_FILE="${OUTPUT_DIR}/multiload_${DATE}_${HOSTNAME}_mpstats.log" 532 | mpstat_pid="" 533 | vmstat_pid="" 534 | 535 | log_without_date "Test Parameters" 536 | log_without_date " Date: $DATE" 537 | log_without_date " Output Directory: $OUTPUT_DIR" 538 | log_without_date " Data File (txt): $FILENAME.txt" 539 | log_without_date " Data File (csv): $FILENAME.csv" 540 | log_without_date " Log File: $LOG_FILE" 541 | if [ "$MPSTAT_PROFILE_ENABLE" -eq 1 ]; then 542 | log_without_date " Mpstat File: $LOG_MPSTATS_FILE"
543 | fi 544 | if [ "$VMSTAT_PROFILE_ENABLE" -eq 1 ]; then 545 | log_without_date " Vmstat File: $LOG_VMSTATS_FILE" 546 | fi 547 | log_without_date " Iterations: $ITERATIONS" 548 | log_without_date " Thread List: ${thdcount_testlist[*]}" 549 | log_without_date " Mem Buf List: ${bufsize_testlist[*]}" 550 | log_without_date " Random Stride: $RAND_STRIDE" 551 | if [ "$THREAD_AFFINITY_ENABLED" == "0" ]; then 552 | log_without_date " Thread affinity disabled" 553 | run_test "Thread affinity disabled" 554 | elif [ "$NUMA_NODES" == "1" ] ; then 555 | log_without_date " Numa runs: No" 556 | if [ "$ARCH" == "x86_64" ] ; then 557 | create_taskset_cpulist_x86_64 "$CPUTHREADS" 558 | else 559 | create_taskset_cpulist_arm64 "$CPUTHREADS" 560 | fi 561 | run_test "NUMA=${NUMA_NODES}" 562 | else 563 | if [ "$ARCH" == "x86_64" ] ; then 564 | create_numactl_cpulist_x86_64 "$CPUTHREADS" 565 | else 566 | create_numactl_cpulist_arm64 "$CPUTHREADS" 567 | fi 568 | if [ "$USE_REMOTE_MEMNODE" == "1" ]; then 569 | #MEMBIND_COMMAND="-m ${NODE_MEMBIND}" # --membind fills the 1st node before spilling to the next, so bandwidth is limited to a single node's DDR. 570 | #MEMBIND_COMMAND_TEXT="m${NODE_MEMBIND}" 571 | MEMBIND_COMMAND="-i ${NODE_MEMBIND}" # --interleave round-robins allocation across the nodes, giving higher aggregate DDR bandwidth.
572 | MEMBIND_COMMAND_TEXT="i${NODE_MEMBIND}" 573 | else 574 | MEMBIND_COMMAND="--localalloc" # --localalloc allocates memory on the same node as the thread that calls malloc(). 575 | MEMBIND_COMMAND_TEXT="localalloc" 576 | fi 577 | log_without_date " Numa runs: Yes" 578 | log_without_date " Numa nodes: $MEMBIND_LIST" 579 | log_without_date " Remote Memory: USE_REMOTE_MEMNODE=$USE_REMOTE_MEMNODE, MEMBIND_COMMAND=$MEMBIND_COMMAND" 580 | log_without_date " Socket eval: Testing cores from $SOCKET_EVAL out of $socket_num sockets" 581 | run_test "NUMA=${NUMA_NODES}" 582 | fi 583 | 584 | parse "$FILENAME" 585 | finish_time=$(date +'%s') 586 | log_without_date 587 | log "Total eval runtime = $((($finish_time-$start_time) / 3600)) hrs.. $(((($finish_time-$start_time) % 3600) / 60)) mins.. $((($finish_time-$start_time) % 60)) secs.." 588 | log_without_date 589 | exit 0 590 | -------------------------------------------------------------------------------- /timer.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License.
13 | */ 14 | #ifndef TIMER_H_INCLUDED 15 | #define TIMER_H_INCLUDED 16 | 17 | #include <stdint.h> 18 | #include <sys/time.h> 19 | #include <time.h> 20 | 21 | static inline uint64_t now_nsec(void) { 22 | struct timespec ts; 23 | clock_gettime(CLOCK_MONOTONIC, &ts); 24 | return ts.tv_sec * ((uint64_t)1000 * 1000 * 1000) + ts.tv_nsec; 25 | } 26 | 27 | #endif 28 | -------------------------------------------------------------------------------- /util.c: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #include "util.h" 15 | 16 | #include <stdlib.h> 17 | 18 | int parse_mem_arg(const char *str, size_t *result) { 19 | size_t r; 20 | char *p; 21 | 22 | r = strtoull(str, &p, 0); 23 | switch (*p) { 24 | case 'k': 25 | case 'K': 26 | r *= 1024; 27 | ++p; 28 | break; 29 | case 'm': 30 | case 'M': 31 | r *= 1024 * 1024; 32 | ++p; 33 | break; 34 | case 'g': 35 | case 'G': 36 | r *= 1024 * 1024 * 1024; 37 | ++p; 38 | break; 39 | } 40 | if (*p) { 41 | return -1; 42 | } 43 | *result = r; 44 | return 0; 45 | } 46 | -------------------------------------------------------------------------------- /util.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved.
2 | * Licensed under the Apache License, Version 2.0 (the "License"); 3 | * you may not use this file except in compliance with the License. 4 | * You may obtain a copy of the License at 5 | * 6 | * http://www.apache.org/licenses/LICENSE-2.0 7 | * 8 | * Unless required by applicable law or agreed to in writing, software 9 | * distributed under the License is distributed on an "AS IS" BASIS, 10 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 11 | * See the License for the specific language governing permissions and 12 | * limitations under the License. 13 | */ 14 | #ifndef UTIL_H_INCLUDED 15 | #define UTIL_H_INCLUDED 16 | 17 | #include <stddef.h> 18 | 19 | int parse_mem_arg(const char *str, size_t *result); 20 | 21 | #endif 22 | --------------------------------------------------------------------------------