├── README.md
├── apps
│   ├── Makefile
│   └── src
│       ├── primary
│       │   ├── ptrchase.cpp
│       │   └── template.cpp
│       └── scavengers
│           └── compute.cpp
├── do_prof_primary.sh
├── do_prof_scavenger.sh
├── libmsh
│   ├── Makefile
│   ├── msh.c
│   ├── msh.h
│   ├── pthread.c
│   └── spinlock.h
├── msh_bolt.diff
└── profile
    └── scripts
        ├── capture_cm_inst_primary.sh
        ├── capture_cm_inst_process.sh
        ├── capture_cm_inst_scav.sh
        ├── compute_ld_prob.py
        ├── compute_ld_prob.sh
        ├── llc_missed_pcs_rfile.py
        ├── prepare_tbench.sh
        ├── read_func.py
        ├── read_func_process.py
        ├── read_lats.py
        └── scav_profile.py
/README.md:
--------------------------------------------------------------------------------
1 | # Memory Stall Software Harvester
2 | This repository contains the prototype of the Memory Stall Software Harvester (MSH), the first system designed to transparently and efficiently harvest memory-bound CPU stall cycles in software. Why harvest memory-bound stalls through a software mechanism when there are well-known hardware harvesting mechanisms like Intel Hyper-Threading? The answer is that hardware mechanisms are inflexible: they cannot differentiate between latency-sensitive applications and others, and they only provide limited concurrency (e.g., 2 threads), often harvesting too much or too little. MSH allows for adjusting the length and frequency of cycle harvesting more precisely, providing a unique opportunity to utilize stalled cycles of latency-sensitive applications while meeting different latency SLOs. For more details about MSH, please take a look at our OSDI'24 [paper](https://www.usenix.org/system/files/osdi24-luo.pdf).
3 | 
4 | ## Limitations in the Prototype
5 | Our prototype makes some assumptions about primary and scavenger applications to simplify the implementation:
6 | - The primary application must use the `pthread` library.
7 | - To use a lightweight context switch, our prototype assumes that scavenger applications are given as shared objects and include several symbols and an entry point.
Thus, you need to modify and recompile your scavenger application. The required transformation is straightforward and short.
8 | - We assume an x86-64 platform and use gcc/g++ with the `-fno-omit-frame-pointer` and `-mno-red-zone` flags enabled.
9 | - Tested with gcc/g++ 11.4 on Ubuntu 22.04.1 (kernel version 6.5.0).
10 | 
11 | ## Basic Usage
12 | ### Workflow
13 | The use of MSH involves three pieces of software: a profiler (scripts based on `perf`), binary instrumentation (`llvm-bolt`), and the MSH runtime (`libmsh`). We assume that a user has a primary application and a set of scavenger applications that will run when there are stalled cycles in the primary application. One can use MSH in the following way:
14 | - Modify the scavenger applications.
15 | - Profile the primary/scavenger applications.
16 | - Modify the primary/scavenger binaries through binary instrumentation with the profile results.
17 | - Finally, run the instrumented binaries with the MSH runtime.
18 | 
19 | We'll show you how this workflow works with one simple primary (`ptrchase`) and scavenger (`compute.so`).
20 | 
21 | ### Prerequisite: Modify Scavenger
22 | MSH requires a scavenger to have the following symbols in the file containing the `main` function.
23 | ```
24 | extern "C" {
25 | int crt_pos = 0;
26 | int argc = 0;
27 | char **argv = 0;
28 | }
29 | ```
30 | Then, rename the `main` function to `entry` as shown below.
31 | 
32 | ```
33 | extern "C" int
34 | entry(void) {
35 | ...
36 | }
37 | ```
38 | Lastly, compile the scavenger to a shared object file.
39 | 
40 | ### Prerequisite: BOLT
41 | We implemented all the binary-level instrumentation in BOLT. Here are the patch instructions:
42 | ```
43 | git clone https://github.com/llvm/llvm-project.git
44 | cd llvm-project && git checkout 30c1f31
45 | patch -p1 < <path-to-this-repo>/msh_bolt.diff
46 | ```
47 | Then, compile BOLT by following the instructions on the [BOLT page](https://github.com/llvm/llvm-project/tree/main/bolt). We'll assume that `llvm-bolt` is in `PATH` from now on.
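Putting the two transformations from "Prerequisite: Modify Scavenger" together, a complete minimal scavenger could look like the sketch below (a hypothetical `toy_scav.cpp`; only the `crt_pos`/`argc`/`argv` symbols and the `entry` name are required by the runtime, the loop body is a placeholder for real scavenger work):

```cpp
#include <cstdio>

// Symbols the MSH runtime fills in through dlsym() before the
// scavenger coroutine is first scheduled (see libmsh/msh.c).
extern "C" {
int crt_pos = 0;
int argc = 0;
char **argv = 0;
}

// The former main(); the runtime looks up and jumps to "entry".
extern "C" int
entry(void) {
    long sum = 0;
    for (int i = 0; i < 1000; i++) // stand-in for real scavenger work
        sum += i;
    fprintf(stderr, "toy scavenger done: sum=%ld, crt_pos=%d\n", sum, crt_pos);
    // the coroutine hands crt_pos back to the scheduler on return
    return crt_pos;
}
```

Compile it with the repository's flags, e.g. `g++ -std=c++17 -fno-omit-frame-pointer -mno-red-zone -shared -fPIC -o toy_scav.so toy_scav.cpp`, and list the absolute path of the resulting `.so` in `scav.txt`.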
48 | 
49 | ### Profile and Instrument Primary
50 | ```
51 | # Compile ptrchase
52 | cd apps
53 | mkdir build
54 | make primary
55 | ```
56 | 
57 | ```
58 | # usage: ./do_prof_primary.sh [binary] [args]
59 | # Pointer chase 50MB array
60 | cd ${HOME}
61 | ./do_prof_primary.sh ./apps/build/ptrchase 13107200
62 | ```
63 | 
64 | The `Makefile` in `apps` has rules that use BOLT to perform the binary instrumentation. We assume that `llvm-bolt` is in `PATH`.
65 | ```
66 | cd apps
67 | make build/ptrchase.bolt
68 | ```
69 | 
70 | ### Profile and Instrument Scavenger
71 | ```
72 | # Compile compute.so
73 | cd apps
74 | make scav
75 | ```
76 | 
77 | ```
78 | cd ${HOME}
79 | echo "$(pwd)/apps/build/compute.so" > scav.txt
80 | ./do_prof_scavenger.sh
81 | ```
82 | 
83 | You can instrument the scavenger in a similar way. However, you need to specify the average yield distance (in nanoseconds) to bound how long the scavenger runs between yields.
84 | ```
85 | cd apps
86 | YIELD_DISTANCE=100 make build/compute.so.bolt
87 | ```
88 | 
89 | ### Run with MSH Runtime
90 | We use the `LD_PRELOAD` trick to attach the runtime to the primary without recompilation.
91 | ```
92 | cd libmsh
93 | make all
94 | 
95 | cd ..
96 | export LD_PRELOAD=$(pwd)/libmsh/build/libmsh.so:${LD_PRELOAD}
97 | export LD_LIBRARY_PATH=$(pwd)/libmsh/build:${LD_LIBRARY_PATH}
98 | 
99 | echo "$(pwd)/apps/build/compute.so.bolt" > scav.txt
100 | export MSH_SCAV_POOL_PATH=$(pwd)/scav.txt
101 | export SKIP_FIRST_THREAD=1
102 | 
103 | ./apps/build/ptrchase.bolt 13107200
104 | ```
105 | Note: don't forget to unset `LD_PRELOAD` when you finish testing MSH; otherwise, the MSH runtime will keep intercepting `pthread` functions in every command you run afterwards.
106 | 
107 | ## TODO
108 | - Add the use of a more complex primary (e.g., tailbench) and scavenger (e.g., a graph algorithm).
109 | - Explain the tuning knobs.
110 | 
111 | ## Developed by
112 | Sam Son and Zhihong Luo
113 | 
--------------------------------------------------------------------------------
/apps/Makefile:
--------------------------------------------------------------------------------
1 | CCP=g++
2 | 
3 | PRM_SRC=src/primary
4 | SCAV_SRC=src/scavengers
5 | BUILD=build
6 | CMPC_LIST=../profile/results/cmpc_list.txt
7 | PRED_PROF=../profile/results/pred_prof.txt
8 | LAT_PROF=../profile/results/lat_prof.txt
9 | CPPFLAGS=-Wall -std=c++17 -fno-omit-frame-pointer -mno-red-zone -fpermissive
10 | 
11 | all: pre-build primary scav
12 | 
13 | pre-build:
14 | 	@mkdir -p $(BUILD)
15 | 	@mkdir -p $(BUILD)/asms
16 | 
17 | primary: $(BUILD)/ptrchase $(BUILD)/template
18 | 
19 | scav: $(BUILD)/compute.so
20 | 
21 | $(BUILD)/ptrchase: $(PRM_SRC)/ptrchase.cpp
22 | 	$(CCP) $(CPPFLAGS) -o $@ $< -lpthread
23 | 
24 | $(BUILD)/template: $(PRM_SRC)/template.cpp
25 | 	$(CCP) $(CPPFLAGS) -o $@ $< -lpthread
26 | 
27 | $(BUILD)/%.so: $(SCAV_SRC)/%.cpp
28 | 	$(CCP) $(CPPFLAGS) -shared -fPIC -o $@ $^
29 | 
30 | 
31 | $(BUILD)/%.so.bolt: $(BUILD)/%.so
32 | 	llvm-bolt $< -o $@ \
33 | 		--clh-cmpc-list=$(CMPC_LIST) \
34 | 		--disable-ctx-prefetch \
35 | 		--disable-pseudo-inline \
36 | 		--disable-asympp \
37 | 		--assume-abi \
38 | 		--clh-opt-yield \
39 | 		--test-special-yield \
40 | 		--inst-scav \
41 | 		--lat-prof=$(LAT_PROF) \
42 | 		--pred-prof=$(PRED_PROF) \
43 | 		--bound-yield-distance=$(YIELD_DISTANCE) \
44 | 		--clh-prob-th=200 > out.txt 2> err.txt
45 | 
46 | $(BUILD)/%.bolt: $(BUILD)/%
47 | 	llvm-bolt $< -o $@ \
48 | 		--clh-cmpc-list=$(CMPC_LIST) \
49 | 		--disable-ctx-prefetch \
50 | 		--assume-abi \
51 | 		--clh-opt-yield \
52 | 		--clh-prob-th=75 > out.txt 2> err.txt
53 | 
54 | $(BUILD)/asms/%.asm: $(BUILD)/%
55 | 	objdump -s -S $< > $@
--------------------------------------------------------------------------------
/apps/src/primary/ptrchase.cpp:
--------------------------------------------------------------------------------
1 | /* check if /proc/sys/vm/nr_hugepages > 0 */
2 | 
3 | #ifndef _GNU_SOURCE
4 | #define _GNU_SOURCE
5 | #endif
6 | 
7 | #include <stdio.h>
8 | #include <stdlib.h>
9 | #include <stdint.h>
10 | 
11 | #include <string.h>
12 | #include <time.h>
13 | #include <unistd.h>
14 | #include <sys/mman.h>
15 | 
16 | #include <pthread.h>
17 | 
18 | /**
19 |  * system configurations
20 |  * netsys-c27
21 |  */
22 | #define CORE_BASE 28 // make sure each core is on a different physical core
23 | #define CORE_NUM 1
24 | #define JOBS_PER_CORE 1
25 | #define CACHE_LINE_SZ 64
26 | #define NODE_ALIGN 64
27 | #define FREQ_MHZ 2100
28 | 
29 | #define ARR_LEN 1024 * 64
30 | #define REPS 10
31 | 
32 | #define RDTSC_TO_NS(x) ((x) * 1000 / FREQ_MHZ)
33 | 
34 | #define THREAD_NUM 1
35 | // #define POINTER_CHASE
36 | // #define SHUFFLE
37 | 
38 | #pragma GCC push_options
39 | #pragma GCC optimize("O0")
40 | 
41 | static inline uint64_t
42 | start64_rdtsc() {
43 |     uint64_t t;
44 |     __asm__ volatile("lfence\n\t"
45 |                      "rdtsc\n\t"
46 |                      "shl $32, %%rdx\n\t"
47 |                      "or %%rdx, %0\n\t"
48 |                      "lfence"
49 |                      : "=a"(t)
50 |                      :
51 |                      : "rdx", "memory", "cc");
52 |     return t;
53 | }
54 | 
55 | static inline uint64_t
56 | stop64_rdtsc() {
57 |     uint64_t t;
58 |     __asm__ volatile("rdtscp\n\t"
59 |                      "shl $32, %%rdx\n\t"
60 |                      "or %%rdx, %0\n\t"
61 |                      "lfence"
62 |                      : "=a"(t)
63 |                      :
64 |                      : "rcx", "rdx", "memory", "cc");
65 |     return t;
66 | }
67 | 
68 | static inline uint64_t
69 | start64_ts() {
70 |     return start64_rdtsc();
71 | }
72 | 
73 | static inline uint64_t
74 | stop64_ts() {
75 |     return stop64_rdtsc();
76 | }
77 | #pragma GCC pop_options
78 | 
79 | #define handle_error(msg) \
80 |     do { \
81 |         perror(msg); \
82 |         exit(EXIT_FAILURE); \
83 |     } while (0)
84 | 
85 | struct node {
86 |     struct node *next;
87 |     char pad[NODE_ALIGN - sizeof(struct node *)];
88 | };
89 | 
90 | struct status {
91 |     int cur_tid;
92 |     struct node *last_pos;
93 | 
94 | #ifndef POINTER_CHASE
95 |     uint64_t last_idx;
96 | #endif
97 | };
98 | 
99 | struct thread_arg {
100 |     int tid;
101 |     uint64_t arr_len;
102 |     uint64_t rep_per_task;
103 |     int is_tq;
104 | };
105 | 
106 | struct node *data_arr = NULL;
107 | 
108 | static int seed_randomizer = 0;
109 | void
110 | gen_cycle(int *out, uint64_t out_len) {
111 |     uint64_t idx, dst_idx;
112 |     int tmp;
113 |     srand((unsigned) time(NULL) + seed_randomizer++);
114 | 
115 |     int *perm = (int *) malloc(sizeof(int) * out_len);
116 | 
117 |     for (idx = 0; idx < out_len; idx++)
118 |         perm[idx] = idx;
119 | 
120 |     for (idx = 0; idx < out_len; idx++) {
121 |         dst_idx = rand() % out_len;
122 |         tmp = perm[idx];
123 |         perm[idx] = perm[dst_idx];
124 |         perm[dst_idx] = tmp;
125 |     }
126 | 
127 |     for (idx = 0; idx < out_len; idx++)
128 |         out[perm[idx]] = perm[(idx + 1) % out_len];
129 | }
130 | 
131 | void
132 | init_run(uint64_t arr_len) {
133 |     int *cycle = (int *) malloc(sizeof(int) * arr_len);
134 | 
135 |     if (data_arr == NULL) {
136 |         data_arr = (struct node *) mmap(
137 |             NULL, sizeof(struct node) * arr_len * CORE_NUM * JOBS_PER_CORE,
138 |             PROT_READ | PROT_WRITE,
139 |             MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
140 |         if (data_arr == MAP_FAILED)
141 |             handle_error("mmap");
142 |         for (int job_id = 0; job_id < CORE_NUM * JOBS_PER_CORE; job_id++) {
143 |             int cur = 0;
144 |             gen_cycle(cycle, arr_len);
145 |             do {
146 |                 data_arr[job_id * arr_len + cur].next =
147 |                     &data_arr[job_id * arr_len + cycle[cur]];
148 |                 cur = cycle[cur];
149 |             } while (cur != 0);
150 |         }
151 |     }
152 | 
153 |     free(cycle);
154 | }
155 | 
156 | uint64_t run_time[THREAD_NUM];
157 | void *
158 | ptrchase_tq(void *arg) {
159 |     int tid = ((struct thread_arg *) arg)->tid;
160 |     uint64_t rep_per_task = ((struct thread_arg *) arg)->rep_per_task;
161 |     uint64_t start, end;
162 | 
163 |     struct node *cur = data_arr;
164 |     uint64_t reps = 0;
165 | 
166 |     start = start64_ts();
167 |     while (reps++ < rep_per_task) {
168 |         // here should be the coroutine yield point
169 |         cur = cur->next;
170 |     }
171 |     end = stop64_ts();
172 | 
173 |     run_time[tid] = end -
start;
174 | 
175 |     fprintf(stderr, "[%d] ptrchase finishes\n", tid);
176 |     fprintf(stderr, "[%d] AMAT: %lu ns\n", tid, RDTSC_TO_NS(run_time[tid] / REPS / (rep_per_task / REPS)));
177 | 
178 |     return NULL;
179 | }
180 | 
181 | void
182 | run(uint64_t arr_len) {
183 |     thread_arg targ[THREAD_NUM];
184 | 
185 |     init_run(arr_len);
186 | 
187 |     pthread_t t[THREAD_NUM];
188 |     for (int i = 0; i < THREAD_NUM; i++) {
189 |         targ[i].rep_per_task = arr_len * REPS;
190 |         targ[i].tid = i;
191 |         if (pthread_create(&t[i], NULL, ptrchase_tq, &targ[i]))
192 |             handle_error("pthread_create");
193 |     }
194 | 
195 |     for (int i = 0; i < THREAD_NUM; i++)
196 |         pthread_join(t[i], NULL);
197 | 
198 |     //ptrchase_tq(&targ);
199 | 
200 |     return;
201 | }
202 | 
203 | int main(int argc, char **argv) {
204 |     uint64_t arr_len = 0;
205 |     if (argc > 1)
206 |         arr_len = atoi(argv[1]);
207 |     else
208 |         arr_len = ARR_LEN;
209 | 
210 |     fprintf(stderr, "ptrchase %lu %d\n", arr_len, REPS);
211 | 
212 |     run(arr_len);
213 | 
214 |     return 0;
215 | }
216 | 
--------------------------------------------------------------------------------
/apps/src/primary/template.cpp:
--------------------------------------------------------------------------------
1 | #include <pthread.h>
2 | 
3 | #define THREAD_NUM 1
4 | 
5 | void *do_nothing(void *arg) {
6 |     return NULL;
7 | }
8 | 
9 | int main() {
10 |     pthread_t t[THREAD_NUM];
11 |     for (int i = 0; i < THREAD_NUM; i++)
12 |         pthread_create(&t[i], NULL, do_nothing, NULL);
13 | 
14 |     for (int i = 0; i < THREAD_NUM; i++)
15 |         pthread_join(t[i], NULL);
16 |     return 0;
17 | }
--------------------------------------------------------------------------------
/apps/src/scavengers/compute.cpp:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include <stdlib.h>
3 | #include <stdint.h>
4 | #include <string>
5 | 
6 | static uint64_t total_reps = 100 * 1000 * 1000;
7 | static uint64_t gsum = 0;
8 | 
9 | #define FREQ_MHZ 2100
10 | #define RDTSC_TO_NS(x) ((x) * 1000 / FREQ_MHZ)
11 | 
12 | #pragma GCC
push_options 13 | #pragma GCC optimize("O0") 14 | 15 | static inline uint64_t 16 | start64_rdtsc() { 17 | uint64_t t; 18 | __asm__ volatile("lfence\n\t" 19 | "rdtsc\n\t" 20 | "shl $32, %%rdx\n\t" 21 | "or %%rdx, %0\n\t" 22 | "lfence" 23 | : "=a"(t) 24 | : 25 | : "rdx", "memory", "cc"); 26 | return t; 27 | } 28 | 29 | static inline uint64_t 30 | stop64_rdtsc() { 31 | uint64_t t; 32 | __asm__ volatile("rdtscp\n\t" 33 | "shl $32, %%rdx\n\t" 34 | "or %%rdx, %0\n\t" 35 | "lfence" 36 | : "=a"(t) 37 | : 38 | : "rcx", "rdx", "memory", "cc"); 39 | return t; 40 | } 41 | 42 | static inline uint64_t 43 | start64_ts() { 44 | return start64_rdtsc(); 45 | } 46 | 47 | static inline uint64_t 48 | stop64_ts() { 49 | return stop64_rdtsc(); 50 | } 51 | #pragma GCC pop_options 52 | 53 | uint64_t run_time = 0; 54 | int comp_len = 10; 55 | void 56 | compute() { 57 | uint64_t start, end; 58 | uint64_t rep = 0; 59 | 60 | start = start64_ts(); 61 | while (++rep < total_reps) { 62 | // Here should be coroutine yield point 63 | int x = 0; 64 | for (int i = 0; i < comp_len; i++) { 65 | x += i; 66 | } 67 | 68 | gsum += x; 69 | } 70 | end = stop64_ts(); 71 | run_time = end - start; 72 | } 73 | 74 | #pragma GCC push_options 75 | #pragma GCC optimize("O0") 76 | extern "C" { 77 | int crt_pos = 0; 78 | int argc = 0; 79 | char **argv = 0; 80 | } 81 | 82 | extern "C" int 83 | entry(void) { 84 | if (argc > 1) { 85 | comp_len = atoi(argv[1]); 86 | total_reps = std::stoull(argv[2]); 87 | } 88 | 89 | fprintf(stderr, "compute starts -- crt_pos = %d\n", crt_pos); 90 | 91 | compute(); 92 | 93 | fprintf(stderr, "compute finishes %d\n", crt_pos); 94 | fprintf(stderr, "per task time: %lu ns\n", 95 | RDTSC_TO_NS(run_time) / total_reps); 96 | 97 | // coroutine context is designed to return to 98 | // "ret_to_scheduler" 99 | 100 | // set %rax = crt_pos 101 | return crt_pos; 102 | } 103 | #pragma GCC pop_options 104 | -------------------------------------------------------------------------------- 
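As the comments at the end of `entry` in compute.cpp hint, a finished coroutine hands `crt_pos` back to the scheduler in `%rax`, and `ret_to_scheduler` in `libmsh/msh.c` recovers the slot index by dividing by `sizeof(struct yield_ctx)`; conversely, `set_scav_symbol` stores `crt_pos` pre-multiplied by that size so the instrumented yield path can use it directly as a byte offset off `%gs`. A toy model of that encoding, under the assumption of a hypothetical 64-byte (one cache line) `yield_ctx` layout:

```cpp
#include <cstdio>

// Hypothetical stand-in for the runtime's per-coroutine context;
// only its size matters for the offset arithmetic shown here.
struct yield_ctx {
    long sp, ip;                    // saved stack/instruction pointers
    short normal_next, special_next; // next-coroutine links (byte offsets)
    char pad[64 - 2 * sizeof(long) - 2 * sizeof(short)];
};
static_assert(sizeof(yield_ctx) == 64, "assumed: one cache line per context");

// What set_scav_symbol writes into the scavenger's crt_pos symbol:
// the slot index pre-multiplied by the context size.
int encode_crt_pos(int slot) { return slot * (int) sizeof(yield_ctx); }

// What tl_sched_next_crt does with the value returned in %rax:
// divide to recover the slot index.
int decode_crt_pos(int crt_pos) { return crt_pos / (int) sizeof(yield_ctx); }
```

Storing the offset rather than the index trades one multiply per yield for a single divide on the rare finish path.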
/do_prof_primary.sh: -------------------------------------------------------------------------------- 1 | export LD_PRELOAD="" 2 | 3 | PROF_DIR=profile 4 | SAMPLING_RATE_IN_CNT=102400 5 | BIN=$1 6 | shift 1 7 | ARGS="$*" 8 | echo $ARGS 9 | 10 | ${PROF_DIR}/scripts/capture_cm_inst_primary.sh ${SAMPLING_RATE_IN_CNT} ${BIN} ${ARGS} 11 | ${PROF_DIR}/scripts/compute_ld_prob.sh profile/results/perf.data 0 $SAMPLING_RATE_IN_CNT 12 | -------------------------------------------------------------------------------- /do_prof_scavenger.sh: -------------------------------------------------------------------------------- 1 | export LD_PRELOAD=$(pwd)/libmsh/build/libmsh.so:${LD_PRELOAD} 2 | export LD_LIBRARY_PATH=$(pwd)/libmsh/build:${LD_LIBRARY_PATH} 3 | 4 | PROF_DIR=profile 5 | TEMPLATE_BIN=apps/build/template 6 | SAMPLING_RATE_IN_CNT=102400 7 | 8 | # set scav.txt on your own 9 | 10 | ${PROF_DIR}/scripts/capture_cm_inst_scav.sh ${SAMPLING_RATE_IN_CNT} ${TEMPLATE_BIN} 11 | ${PROF_DIR}/scripts/compute_ld_prob.sh profile/results/perf.data 1 ${SAMPLING_RATE_IN_CNT} 12 | -------------------------------------------------------------------------------- /libmsh/Makefile: -------------------------------------------------------------------------------- 1 | CC=gcc 2 | 3 | BUILD=build 4 | SRC=. 
5 | 
6 | TARGET=$(BUILD)/libmsh.so
7 | 
8 | INCLUDES=-I$(SRC)/
9 | CFLAGS=-Wall $(INCLUDES) -std=c11 -fno-omit-frame-pointer -mno-red-zone
10 | HEADERS=$(wildcard $(SRC)/*.h)
11 | 
12 | all: prebuild $(TARGET)
13 | 
14 | prebuild:
15 | 	@mkdir -p $(BUILD)
16 | 
17 | $(TARGET): $(BUILD)/msh.o $(BUILD)/pthread.o
18 | 	$(CC) $(INCLUDES) $(CFLAGS) -shared -fPIC -o $@ $^ -ldl -mfsgsbase
19 | 
20 | $(BUILD)/msh.o: $(SRC)/msh.c $(HEADERS)
21 | 	$(CC) $(INCLUDES) $(CFLAGS) -shared -fPIC -c $< -o $@ -mfsgsbase
22 | 
23 | $(BUILD)/pthread.o: $(SRC)/pthread.c $(HEADERS)
24 | 	$(CC) $(INCLUDES) $(CFLAGS) -shared -fPIC -c $< -o $@ -mfsgsbase
25 | 
26 | .PHONY: clean
27 | clean:
28 | 	rm -rf $(BUILD)
--------------------------------------------------------------------------------
/libmsh/msh.c:
--------------------------------------------------------------------------------
1 | #ifndef _GNU_SOURCE
2 | #define _GNU_SOURCE
3 | #endif
4 | #include <stdio.h>
5 | #include <stdlib.h>
6 | #include <string.h>
7 | #include <stdint.h>
8 | #include <stdbool.h>
9 | #include <stddef.h>
10 | #include <assert.h>
11 | #include <setjmp.h>
12 | #include <dlfcn.h>
13 | #include <wordexp.h>
14 | 
15 | #include <asm/prctl.h> /* Definition of ARCH_* constants */
16 | #include <sys/syscall.h> /* Definition of SYS_* constants */
17 | #include <unistd.h>
18 | 
19 | 
20 | #include "msh.h"
21 | #include "spinlock.h"
22 | 
23 | #define BOLDBLUE "\033[1m\033[34m"
24 | #define BLUE "\033[34m"
25 | #define RESET "\033[0m"
26 | 
27 | #define PRINT(str, ...) fprintf(stderr, BOLDBLUE "[MSH] " RESET str, ##__VA_ARGS__)
28 | 
29 | #define err_check(err_cond, ...)
\ 30 | if (err_cond) { \ 31 | PRINT("%s:%d: ", __FILE__, __LINE__); \ 32 | PRINT(__VA_ARGS__); \ 33 | exit(1); \ 34 | } 35 | 36 | // 0-th index of lists is reserved for primary thread 37 | #define push_back_to_list(list, size, item, pos) \ 38 | for (int i = 1; i < size; i++) { \ 39 | if (list[i] == -1) { \ 40 | list[i] = item; \ 41 | pos = i; \ 42 | break; \ 43 | } \ 44 | } 45 | 46 | #define pop_back_from_list(list, size, item, pos) \ 47 | for (int i = size - 1; i > 0; i--) { \ 48 | if (list[i] != -1) { \ 49 | item = list[i]; \ 50 | list[i] = -1; \ 51 | pos = i; \ 52 | break; \ 53 | } \ 54 | } 55 | 56 | #define CAS(ptr, old_val, new_val) \ 57 | __sync_bool_compare_and_swap(ptr, old_val, new_val) 58 | 59 | // Static allocation of per Application Tables 60 | static struct coroutine_ctx SCAV_TABLE[MAX_ACTIVE_CRT_NUM_PER_APPLICATION]; 61 | static struct primary_thread_ctx PRIM_TABLE[MAX_THREAD_NUM]; 62 | spinlock scav_table_lock; 63 | spinlock prim_table_lock; 64 | 65 | static __thread char *gsbase = 0; 66 | #define tl_yield_ctx_at(pos) \ 67 | (struct yield_ctx *) (gsbase + sizeof(struct yield_ctx) * pos) 68 | 69 | 70 | #define tl_get_crt_idx_at(pos) (PRIM_TABLE[msh_tid].idx_list[pos]) 71 | #define tl_set_crt_id_at(pos, val) (PRIM_TABLE[msh_tid].idx_list[pos] = val) 72 | 73 | static __thread int cur_crt_pos = 0; 74 | static __thread int msh_tid = -1; 75 | static bool dummy; 76 | static __thread bool *msh_reallocatable = &dummy; 77 | static __thread bool *msh_reallocated = &dummy; 78 | static _Atomic int finished_scav_cnt = 0; 79 | static int max_scav_per_thread = 1; 80 | 81 | // Scavenger pool management 82 | static char scav_pool_path[256]; 83 | static FILE *scav_pool_fp = NULL; 84 | static const int MAX_SCV_CMD_LEN = 256; 85 | 86 | // jmp_buf for scheduler 87 | jmp_buf cleanup_jmpbuf[MAX_THREAD_NUM]; 88 | 89 | static void tl_sched_next_crt(int ret_crt_pos); 90 | static int allocate_scav(struct primary_thread_ctx *ctx); 91 | 92 | #pragma GCC push_options 93 | #pragma 
GCC optimize("O0")
94 | void
95 | ret_to_scheduler(void) {
96 |     int ret_crt_pos;
97 |     __asm__ volatile(
98 |         "testq $15, %%rsp\n\t"
99 |         "jz _no_adjust\n\t"
100 |         "subq $8, %%rsp\n\t" // after the coroutine returns, rsp might not be
101 |                              // 16-byte aligned (ABI requirement)
102 |         "_no_adjust:\n"
103 |         "movl %%eax, %0\n\t"
104 |         : "=rm"(ret_crt_pos)
105 |         :);
106 |     tl_sched_next_crt(ret_crt_pos);
107 | 
108 |     // this program point shouldn't be reached
109 |     assert(false);
110 | }
111 | #pragma GCC pop_options
112 | 
113 | static void
114 | print_prim_ctx(struct primary_thread_ctx *ctx) {
115 |     PRINT("Primary %d: \n", msh_tid);
116 |     PRINT("  idx_list:");
117 |     for (int i = 0; i < MAX_CRT_PER_THREAD + 1; i++) {
118 |         PRINT("%d ", ctx->idx_list[i]);
119 |     }
120 |     PRINT("\n");
121 | 
122 |     PRINT("  ycs-normal:");
123 |     for (int i = 0; i < MAX_CRT_PER_THREAD + 1; i++) {
124 |         PRINT("%d ", ctx->ycs[i].normal_next);
125 |     }
126 | 
127 |     PRINT("  ycs-special:");
128 |     for (int i = 0; i < MAX_CRT_PER_THREAD + 1; i++) {
129 |         PRINT("%d ", ctx->ycs[i].special_next);
130 |     }
131 | 
132 |     PRINT("\n");
133 |     fflush(stderr);
134 | }
135 | 
136 | static void
137 | parse_command(char *string, int *argc, char ***argv) {
138 |     *argc = 0;
139 |     wordexp_t p;
140 |     wordexp(string, &p, 0);
141 |     *argc = p.we_wordc;
142 |     *argv = p.we_wordv;
143 | }
144 | 
145 | static void
146 | init_scav_table_entry(struct coroutine_ctx *ctx, char *cmd) {
147 |     cmd[strcspn(cmd, "\n")] = 0;
148 |     parse_command(cmd, &ctx->argc, &ctx->argv);
149 | 
150 |     void *lib_handle = dlopen(ctx->argv[0], RTLD_NOW | RTLD_DEEPBIND);
151 | 
152 |     err_check(!lib_handle, "Failed to dlopen file %s %s\n", cmd, dlerror());
153 | 
154 |     memset(ctx->stack, 0, CRT_STACK_SZ);
155 |     ctx->lib_handle = lib_handle;
156 |     ctx->entry = (void (*)(void)) dlsym(lib_handle, "entry");
157 |     ctx->ret_to_sched = (void *) ret_to_scheduler;
158 |     ctx->yc = NULL;
159 |     ctx->finished = false;
160 |     err_check(dlerror(), "Failed to dlsym entry in file\n");
161 | 
162 |     ctx->init = (void (*)(void)) dlsym(lib_handle, "init");
163 |     ctx->inited = false;
164 |     dlerror(); // reset error. init may not exist, but that's fine
165 |     return;
166 | }
167 | 
168 | static void
169 | init_prim_table() {
170 |     for (int i = 0; i < MAX_THREAD_NUM; i++) {
171 |         // Cache-line alignment check
172 |         assert((uint64_t) &PRIM_TABLE[i].ycs % 64 == 0);
173 |         for (int j = 0; j < MAX_CRT_PER_THREAD + 1; j++)
174 |             PRIM_TABLE[i].ycs[j].normal_next = -1;
175 |         for (int j = 0; j < MAX_CRT_PER_THREAD + 1; j++)
176 |             PRIM_TABLE[i].idx_list[j] = -1;
177 |         PRIM_TABLE[i].reallocatable = false;
178 |         PRIM_TABLE[i].reallocated = false;
179 |         PRIM_TABLE[i].scav_num = 0;
180 |     }
181 | }
182 | 
183 | static void
184 | init_scav_table(char *local_pool_path) {
185 |     spin_lock(&scav_table_lock);
186 |     scav_pool_fp = fopen(local_pool_path, "r");
187 |     err_check(!scav_pool_fp, "Failed to open file\n");
188 | 
189 |     for (int i = 0; i < MAX_ACTIVE_CRT_NUM_PER_APPLICATION; i++) {
190 |         SCAV_TABLE[i].lib_handle = NULL;
191 |     }
192 | 
193 |     char line_buf[MAX_SCV_CMD_LEN];
194 |     int scav_counts = 0;
195 |     while (fgets(line_buf, MAX_SCV_CMD_LEN, scav_pool_fp)) {
196 |         if (line_buf[0] == '#') { // # is a comment
197 |             continue;
198 |         }
199 | 
200 |         err_check(scav_counts >= MAX_ACTIVE_CRT_NUM_PER_APPLICATION, // >= keeps the write below in bounds
201 |                   "Too many scavengers in the pool\n");
202 | 
203 |         init_scav_table_entry(&SCAV_TABLE[scav_counts], line_buf);
204 |         scav_counts++;
205 |     }
206 | 
207 |     PRINT("Scavenger pool has %d entries\n", scav_counts);
208 |     spin_unlock(&scav_table_lock);
209 | }
210 | 
211 | static void
212 | set_scav_symbol(int scav_idx, int crt_pos) {
213 |     struct coroutine_ctx *ctx = &SCAV_TABLE[scav_idx];
214 | 
215 |     void *sym = dlsym(ctx->lib_handle, "argc");
216 |     err_check(dlerror(), "Failed to dlsym argc in file\n");
217 |     int *entry_argc = (int *) sym;
218 |     *entry_argc = ctx->argc;
219 | 
220 |     sym = dlsym(ctx->lib_handle, "argv");
221 |     err_check(dlerror(), "Failed to dlsym argv in file\n");
222 | char ***entry_argv = (char ***) sym; 223 | *entry_argv = ctx->argv; 224 | 225 | sym = dlsym(ctx->lib_handle, "crt_pos"); 226 | err_check(dlerror(), "Failed to dlsym crt_pos in file\n"); 227 | int *entry_crt_pos = (int *) sym; 228 | *entry_crt_pos = crt_pos * sizeof(struct yield_ctx); 229 | 230 | sym = dlsym(ctx->lib_handle, "loop_counter_loc"); 231 | if (dlerror()) { 232 | PRINT("This scavenger doesn't have loop counter\n"); 233 | } else { 234 | long *entry_loop_counter_loc = (long *) sym; 235 | *entry_loop_counter_loc = offsetof(struct primary_thread_ctx,loop_counter) + (NUM_OF_LINES_FOR_LOOP_COUNTER * 64 * crt_pos); 236 | memset(PRIM_TABLE[msh_tid].loop_counter[crt_pos], 0, NUM_OF_LINES_FOR_LOOP_COUNTER * 64); 237 | } 238 | } 239 | 240 | // swap coroutines at a_pos and b_pos in the index list 241 | static void 242 | tl_swap_crts(int a_pos, int b_pos) { 243 | spin_lock(&scav_table_lock); 244 | 245 | int a_idx = tl_get_crt_idx_at(a_pos); 246 | int b_idx = tl_get_crt_idx_at(b_pos); 247 | tl_set_crt_id_at(a_pos, b_idx); 248 | tl_set_crt_id_at(b_pos, a_idx); 249 | 250 | // swap yc 251 | struct yield_ctx *a_yc = tl_yield_ctx_at(a_pos); 252 | struct yield_ctx *b_yc = tl_yield_ctx_at(b_pos); 253 | struct yield_ctx tmp; 254 | memcpy(&tmp, a_yc, sizeof(struct yield_ctx)); 255 | memcpy(a_yc, b_yc, sizeof(struct yield_ctx)); 256 | memcpy(b_yc, &tmp, sizeof(struct yield_ctx)); 257 | 258 | SCAV_TABLE[a_idx].yc = b_yc; 259 | SCAV_TABLE[b_idx].yc = a_yc; 260 | 261 | // update next map 262 | for (int pos = 0; pos < MAX_CRT_PER_THREAD + 1; pos++) { 263 | struct yield_ctx *yc = tl_yield_ctx_at(pos); 264 | if (yc->normal_next == a_pos) { 265 | yc->normal_next = b_pos; 266 | } else if (yc->normal_next == b_pos) { 267 | yc->normal_next = a_pos; 268 | } 269 | } 270 | 271 | // swap register set 272 | struct x86_registers *a_regs = PRIM_TABLE[msh_tid].regs[a_pos]; 273 | struct x86_registers *b_regs = PRIM_TABLE[msh_tid].regs[b_pos]; 274 | struct x86_registers 
tmp_regs[MAX_REG_SET_NUM]; 275 | memcpy(tmp_regs, a_regs, sizeof(struct x86_registers) * MAX_REG_SET_NUM); 276 | memcpy(a_regs, b_regs, sizeof(struct x86_registers) * MAX_REG_SET_NUM); 277 | memcpy(b_regs, tmp_regs, sizeof(struct x86_registers) * MAX_REG_SET_NUM); 278 | 279 | // swap binary symbol 280 | // handle only if the symbol is defined at pos 281 | if (a_idx != -1) { 282 | void *a_lib_handle = SCAV_TABLE[a_idx].lib_handle; 283 | assert(a_lib_handle); 284 | void *a_sym = dlsym(a_lib_handle, "crt_pos"); 285 | int *a_entry_crt_pos = (int *) a_sym; 286 | *a_entry_crt_pos = b_pos; 287 | } 288 | 289 | if (b_idx != -1) { 290 | void *b_lib_handle = SCAV_TABLE[b_idx].lib_handle; 291 | assert(b_lib_handle); 292 | void *b_sym = dlsym(b_lib_handle, "crt_pos"); 293 | int *b_entry_crt_pos = (int *) b_sym; 294 | *b_entry_crt_pos = a_pos; 295 | } 296 | 297 | spin_unlock(&scav_table_lock); 298 | 299 | } 300 | 301 | static void 302 | tl_finish_crt(int crt_pos) { 303 | struct yield_ctx *yc = tl_yield_ctx_at(crt_pos); 304 | int crt_idx = tl_get_crt_idx_at(crt_pos); 305 | yc->normal_next = -1; 306 | 307 | if (crt_idx == -1) { 308 | // primary thread is finished 309 | return; 310 | } 311 | 312 | spin_lock(&scav_table_lock); 313 | tl_set_crt_id_at(crt_pos, -1); 314 | SCAV_TABLE[crt_idx].yc = NULL; 315 | SCAV_TABLE[crt_idx].finished = true; 316 | dlclose(SCAV_TABLE[crt_idx].lib_handle); 317 | spin_unlock(&scav_table_lock); 318 | } 319 | 320 | static void 321 | set_special_next(struct primary_thread_ctx *ctx) { 322 | // this function is called every time new scavenger is allocated 323 | // it keeps the link made through special_next to be a ring 324 | 325 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 326 | struct yield_ctx *yc = tl_yield_ctx_at(pos); 327 | if (pos == MAX_CRT_PER_THREAD) 328 | yc->special_next = 0; 329 | 330 | int idx_next = tl_get_crt_idx_at(pos + 1); 331 | if (idx_next == -1) { 332 | yc->special_next = 0; 333 | } else { 334 | yc->special_next = (pos + 
1) * sizeof(struct yield_ctx); 335 | } 336 | } 337 | } 338 | 339 | static int 340 | tl_next_available_crt() { 341 | struct yield_ctx *ycs = PRIM_TABLE[msh_tid].ycs; 342 | // find if there is any available scavenger coroutine 343 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 344 | if (ycs[pos].normal_next != -1) 345 | return pos; 346 | } 347 | 348 | // if none, set to primary 349 | return 0; 350 | } 351 | 352 | // TODO: currently we only update normal_next of primary 353 | // we will have to handle special_next of scavenger coroutines later 354 | static void 355 | tl_update_next_map() { 356 | struct yield_ctx *ycs = PRIM_TABLE[msh_tid].ycs; 357 | int crt_next_to_prim = ycs[0].normal_next; 358 | if (crt_next_to_prim == -1) { 359 | // primary thread is finished 360 | return; 361 | } 362 | 363 | // next coroutine is finished 364 | ycs[0].normal_next = sizeof(struct yield_ctx) * tl_next_available_crt(); 365 | } 366 | 367 | static int 368 | tl_find_unfinished_crt() { 369 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 370 | int crt_idx = tl_get_crt_idx_at(pos); 371 | if (crt_idx != -1) { 372 | return pos; 373 | } 374 | } 375 | return 0; 376 | } 377 | 378 | static void 379 | tl_pack_crts() { 380 | // if primary is finished, skip 381 | struct yield_ctx *primary_yc = tl_yield_ctx_at(0); 382 | if (primary_yc->normal_next == -1) { 383 | return; 384 | } 385 | 386 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 387 | if (tl_get_crt_idx_at(pos) == -1) { 388 | for (int next_pos = pos + 1; next_pos < MAX_CRT_PER_THREAD + 1; 389 | next_pos++) { 390 | if (tl_get_crt_idx_at(next_pos) != -1) { 391 | tl_swap_crts(pos, next_pos); 392 | break; 393 | } 394 | } 395 | } 396 | } 397 | 398 | set_special_next(&PRIM_TABLE[msh_tid]); 399 | } 400 | 401 | static void 402 | tl_sched_next_crt(int ret_crt_pos) { 403 | ret_crt_pos = ret_crt_pos / sizeof(struct yield_ctx); // change to idx (from pointer) 404 | 405 | PRINT( 406 | BOLDBLUE "[%d] Coroutine at pos %d is 
finished -- " 407 | "Scavenger index = %d\n" RESET, 408 | msh_tid, ret_crt_pos, tl_get_crt_idx_at(ret_crt_pos)); 409 | if (ret_crt_pos != 0) 410 | PRINT("FinishedCoroutines: %d \n", ++finished_scav_cnt); 411 | struct yield_ctx *ret_yield_ctx = tl_yield_ctx_at(ret_crt_pos); 412 | struct yield_ctx *primary_yield_ctx = tl_yield_ctx_at(0); 413 | 414 | uint64_t sp; 415 | uint64_t ip; 416 | short next = ret_yield_ctx->normal_next / sizeof(struct yield_ctx); 417 | // print_prim_ctx(&PRIM_TABLE[msh_tid]); 418 | // PRINT("[%d] next of retcrt = %d\n", msh_tid, next); 419 | tl_finish_crt(ret_crt_pos); 420 | tl_update_next_map(); 421 | tl_pack_crts(); 422 | 423 | // print_prim_ctx(&PRIM_TABLE[msh_tid]); 424 | if (primary_yield_ctx->normal_next != -1) { // primary is not finished 425 | allocate_scav(&PRIM_TABLE[msh_tid]); // get new one if needed 426 | tl_update_next_map(); 427 | } 428 | 429 | // prefetch yield context 430 | __asm__ volatile("prefetcht0 %%gs:0\n\t" : :); 431 | 432 | if (primary_yield_ctx->normal_next == -1) { 433 | // primary is finished 434 | // we are running cleanup routine 435 | cur_crt_pos = tl_find_unfinished_crt(); 436 | if (cur_crt_pos == 0) { 437 | longjmp(cleanup_jmpbuf[msh_tid], 0); 438 | } 439 | struct yield_ctx *yc = tl_yield_ctx_at(cur_crt_pos); 440 | sp = yc->sp; 441 | ip = yc->ip; 442 | } else { 443 | cur_crt_pos = next; 444 | struct yield_ctx *yc = tl_yield_ctx_at(cur_crt_pos); 445 | sp = yc->sp; 446 | ip = yc->ip; 447 | } 448 | 449 | PRINT("[%d] Coroutine %d is scheduled %p %p\n", msh_tid, 450 | cur_crt_pos, (void *) sp, (void *) ip); 451 | __asm__ volatile("movq %1, %%rsp\n\t" 452 | "jmp *%0\n\t" 453 | : 454 | : "m"(ip), "mr"(sp)); 455 | // should never reach here 456 | assert(false); 457 | } 458 | 459 | // The design of this function is undecided. 460 | // This static cap version is tentative. 
461 | static int 462 | need_more_scav(struct primary_thread_ctx *ctx) { 463 | int counts = 0; 464 | for (int i = 1; i < MAX_CRT_PER_THREAD + 1; i++) { 465 | if (ctx->idx_list[i] != -1) { 466 | counts++; 467 | } 468 | } 469 | 470 | if (counts < max_scav_per_thread) { 471 | return max_scav_per_thread - counts; 472 | } else { 473 | return 0; 474 | } 475 | } 476 | 477 | static int 478 | realloc_scav(struct primary_thread_ctx *from, struct primary_thread_ctx *to, 479 | int demands) { 480 | int ret_counts = 0; 481 | while (ret_counts < demands) { 482 | int crt_idx_to_realloc, from_pos = -1, to_pos = -1; 483 | pop_back_from_list(from->idx_list, MAX_CRT_PER_THREAD + 1, 484 | crt_idx_to_realloc, 485 | from_pos); // pos = idx in idx_list 486 | if (from_pos == -1) 487 | break; // no more scavengers in *from 488 | 489 | push_back_to_list(to->idx_list, MAX_CRT_PER_THREAD + 1, 490 | crt_idx_to_realloc, to_pos); 491 | if (to_pos == -1) { 492 | // no more space, rollback and return 493 | push_back_to_list(from->idx_list, MAX_CRT_PER_THREAD + 1, 494 | crt_idx_to_realloc, from_pos); 495 | return ret_counts; 496 | } 497 | 498 | // no rollback from this point 499 | // yc update 500 | struct yield_ctx *from_yc = from->ycs + from_pos; 501 | struct yield_ctx *to_yc = to->ycs + to_pos; 502 | to_yc->sp = from_yc->sp; 503 | to_yc->ip = from_yc->ip; 504 | to_yc->normal_next = 0; 505 | 506 | from_yc->normal_next = -1; 507 | tl_update_next_map(); 508 | to->ycs[0].normal_next = sizeof(struct yield_ctx); 509 | 510 | // TODO: handle special next 511 | 512 | // regset update 513 | struct x86_registers *from_regs = from->regs[from_pos]; 514 | struct x86_registers *to_regs = to->regs[to_pos]; 515 | memcpy(to_regs, from_regs, 516 | sizeof(struct x86_registers) * MAX_REG_SET_NUM); 517 | 518 | // binary symbol update 519 | void *lib_handle = SCAV_TABLE[crt_idx_to_realloc].lib_handle; 520 | void *sym = dlsym(lib_handle, "crt_pos"); 521 | int *entry_crt_pos = (int *) sym; 522 | *entry_crt_pos = to_pos; 
523 | 524 | // finalize 525 | ret_counts++; 526 | SCAV_TABLE[crt_idx_to_realloc].yc = to_yc; 527 | __asm__ __volatile__( 528 | "" :: 529 | : "memory"); // prevent compiler from reordering 530 | from->reallocated = true; 531 | 532 | int from_tid = ((uint64_t) from - (uint64_t) PRIM_TABLE) / 533 | sizeof(struct primary_thread_ctx); 534 | int to_tid = ((uint64_t) to - (uint64_t) PRIM_TABLE) / 535 | sizeof(struct primary_thread_ctx); 536 | PRINT("Scav %d is reallocated [%d] --> [%d] %d %d %d %d\n", 537 | crt_idx_to_realloc, from_tid, to_tid, from->idx_list[1], 538 | from->idx_list[2], from->idx_list[3], from->idx_list[4]); 539 | } 540 | 541 | return ret_counts; 542 | } 543 | 544 | static int 545 | _allocate_scav(struct primary_thread_ctx *ctx, int demands) { 546 | int ret_counts = 0; 547 | 548 | // Check if there is a reallocatable scavenger 549 | // Note for concurrency: this is the only place where primary threads may 550 | // access the context of other primary threads. The access must be protected 551 | // by the "reallocatable" flag.
552 | for (int tid = 0; tid < MAX_THREAD_NUM; tid++) { 553 | struct primary_thread_ctx *t = &PRIM_TABLE[tid]; 554 | if (CAS(&t->reallocatable, true, false)) { 555 | // reallocation for t begins -- t will busy-wait until it's done 556 | int cnt = realloc_scav(t /*from*/, ctx /*to*/, demands); 557 | __asm__ __volatile__( 558 | "" :: 559 | : "memory"); // prevent compiler from reordering 560 | t->reallocatable = true; // unlock 561 | ret_counts += cnt; 562 | if (ret_counts >= demands) 563 | return ret_counts; 564 | } 565 | } 566 | 567 | // check scav pool 568 | for (int idx = 0; idx < MAX_ACTIVE_CRT_NUM_PER_APPLICATION; idx++) { 569 | if (!SCAV_TABLE[idx].lib_handle) // empty entry 570 | continue; 571 | 572 | if (SCAV_TABLE[idx].yc || 573 | SCAV_TABLE[idx].finished) // already allocated 574 | continue; 575 | 576 | // allocate idx-th scavenger to the primary 577 | int pos = -1; 578 | push_back_to_list(ctx->idx_list, MAX_CRT_PER_THREAD + 1, idx, pos); 579 | if (pos == -1) 580 | break; // no more space 581 | 582 | ctx->ycs[pos].sp = (uint64_t) &SCAV_TABLE[idx].ret_to_sched; 583 | ctx->ycs[pos].ip = (uint64_t) SCAV_TABLE[idx].entry; 584 | ctx->ycs[pos].normal_next = 0; 585 | 586 | set_scav_symbol(idx, pos); 587 | 588 | SCAV_TABLE[idx].yc = &ctx->ycs[pos]; 589 | 590 | ret_counts++; 591 | if (ret_counts >= demands) 592 | return ret_counts; 593 | } 594 | 595 | return ret_counts; 596 | } 597 | 598 | static void 599 | call_init_for_new_scav(struct primary_thread_ctx *ctx) { 600 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 601 | int crt_idx = tl_get_crt_idx_at(pos); 602 | 603 | if (crt_idx == -1) 604 | continue; 605 | 606 | struct coroutine_ctx *crt_ctx = &SCAV_TABLE[crt_idx]; 607 | if (crt_ctx->inited) 608 | continue; 609 | 610 | // a scavenger is assigned uniquely to this primary at this moment, so 611 | // no need to think about concurrency 612 | if (crt_ctx->init) { 613 | struct yield_ctx *yc = tl_yield_ctx_at(pos); 614 | 615 | // yc of primary is not set at 
this moment, so make the coroutine jump to itself during init 616 | uint64_t sp_tmp = yc->sp; 617 | uint64_t ip_tmp = yc->ip; 618 | short normal_tmp = yc->normal_next; 619 | short special_tmp = yc->special_next; 620 | yc->normal_next = pos * sizeof(struct yield_ctx); // set to itself temporarily 621 | yc->special_next = pos * sizeof(struct yield_ctx); // set to itself temporarily 622 | 623 | crt_ctx->init(); 624 | crt_ctx->inited = true; 625 | 626 | yc->sp = sp_tmp; 627 | yc->ip = ip_tmp; 628 | yc->normal_next = normal_tmp; 629 | yc->special_next = special_tmp; 630 | } 631 | } 632 | } 633 | 634 | static int 635 | allocate_scav(struct primary_thread_ctx *ctx) { 636 | int counts = need_more_scav(ctx); 637 | if (counts == 0) 638 | return 0; 639 | 640 | spin_lock(&scav_table_lock); 641 | int ret = _allocate_scav(ctx, counts); 642 | spin_unlock(&scav_table_lock); 643 | 644 | // scav init -- just call it!! 645 | call_init_for_new_scav(ctx); 646 | set_special_next(ctx); 647 | return ret; 648 | } 649 | 650 | static void 651 | init_primary_ctx(struct primary_thread_ctx *ctx, int counts) { 652 | // sp/ip will be initialized at the first yield 653 | ctx->ycs[0].normal_next = counts > 0 ? 
sizeof(struct yield_ctx) : 0; 654 | // special_next shouldn't be used in the primary 655 | ctx->scav_num = counts; 656 | ctx->reallocatable = 0; 657 | ctx->reallocated = 0; 658 | msh_reallocatable = &ctx->reallocatable; 659 | msh_reallocated = &ctx->reallocated; 660 | } 661 | 662 | int 663 | msh_init(int max_scav) { 664 | 665 | int fd = open("/dev/cpu_dma_latency", O_WRONLY); 666 | if (fd < 0) { 667 | perror("open /dev/cpu_dma_latency"); 668 | // keep going without setting c-state 669 | // return 1; 670 | } 671 | int l = 0; // note: fd intentionally stays open; closing it would undo the latency setting 672 | if (fd >= 0 && write(fd, &l, sizeof(l)) != sizeof(l)) { 673 | perror("write to /dev/cpu_dma_latency"); 674 | // return 1; 675 | } 676 | 677 | PRINT("\n\n MSH Init\n"); 678 | 679 | // print offset information 680 | PRINT(" offset to ycs: %lu\n", 681 | offsetof(struct primary_thread_ctx, ycs)); 682 | PRINT(" offset to regs: %lu\n", 683 | offsetof(struct primary_thread_ctx, regs)); 684 | PRINT(" offset to idx_list: %lu\n", 685 | offsetof(struct primary_thread_ctx, idx_list)); 686 | PRINT(" offset to regset: %lu\n", 687 | offsetof(struct primary_thread_ctx, regs[0])); 688 | 689 | PRINT(" size of x86_registers: %lu\n", 690 | sizeof(struct x86_registers)); 691 | PRINT(" size of regset: %lu\n", 692 | sizeof(struct x86_registers) * MAX_REG_SET_NUM); 693 | PRINT(" size of yield context: %lu\n", 694 | sizeof(struct yield_ctx)); 695 | PRINT(" offset of sp: %lu\n", 696 | offsetof(struct yield_ctx, sp)); 697 | PRINT(" offset of ip: %lu\n", 698 | offsetof(struct yield_ctx, ip)); 699 | PRINT(" offset of next: %lu\n", 700 | offsetof(struct yield_ctx, normal_next)); 701 | 702 | PRINT("Warning: this number must match the value in binary " 703 | "instrumentation\n"); 704 | max_scav_per_thread = max_scav; 705 | 706 | init_prim_table(); 707 | 708 | char *path = getenv("MSH_SCAV_POOL_PATH"); 709 | if (path == NULL) { 710 | fprintf(stderr, 711 | "Scavenger pool path should be given in MSH_SCAV_POOL_PATH\n"); 712 | return -1; 713 | } 714 | 715 | strcpy(scav_pool_path, path); 
716 | init_scav_table(scav_pool_path); 717 | 718 | PRINT("MSH is successfully initialized\n"); 719 | return 0; 720 | } 721 | 722 | // identify unallocated scavengers and allocate them 723 | int 724 | msh_alloc_ctx(int tid) { 725 | // set per-thread variables 726 | msh_tid = tid; // PRIM_TABLE[tid] is yours from now on 727 | cur_crt_pos = 0; // current coroutine = primary 728 | 729 | if (tid >= MAX_THREAD_NUM) { 730 | PRINT("libmsh: thread id %d exceeds the limit %d\n", tid, MAX_THREAD_NUM); 731 | return -1; 732 | } 733 | 734 | // use ctx, not tid if possible 735 | struct primary_thread_ctx *ctx = &PRIM_TABLE[tid]; 736 | 737 | syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long *) ctx); 738 | gsbase = (char *) ctx; 739 | //_writegsbase_u64( 740 | // (uint64_t) ctx); // set gs. this MUST be done before allocate_scav 741 | // because allocate_scav depends on gs 742 | 743 | int cnt = allocate_scav(ctx); 744 | 745 | init_primary_ctx(ctx, cnt); 746 | 747 | PRINT("Primary thread %d has %d scavengers\n", tid, cnt); 748 | //print_prim_ctx(ctx); 749 | 750 | return 0; 751 | } 752 | 753 | int 754 | msh_cleanup() { 755 | if (getenv("SKIP_CLEANUP")) { 756 | fflush(stderr); 757 | fflush(stdout); 758 | return 0; 759 | } 760 | 761 | PRINT("Cleanup thread %d\n", msh_tid); 762 | // setup cleanup 763 | // primary->next = 0 will be done in tl_sched_next_crt 764 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 765 | struct yield_ctx *yc = tl_yield_ctx_at(pos); 766 | if (yc->normal_next != -1) { 767 | // set it to itself so that it runs to completion 768 | yc->normal_next = pos * sizeof(struct yield_ctx); 769 | yc->special_next = pos * sizeof(struct yield_ctx); 770 | } 771 | } 772 | 773 | // finish remaining scavengers using scheduler code 774 | setjmp(cleanup_jmpbuf[msh_tid]); 775 | int unfinished_crt_pos = tl_find_unfinished_crt(); 776 | if (unfinished_crt_pos == 0) { 777 | fprintf( 778 | stderr, 779 | BOLDBLUE 780 | "[%d] ***Primary thread [%d] is finished***\n" RESET, 781 | msh_tid, 
msh_tid); 782 | struct yield_ctx *yc = 783 | tl_yield_ctx_at(0); // reset yc[0].next = 0 to avoid error 784 | yc->normal_next = 0; // do this just in case there is instrumented 785 | // code after cleanup 786 | return 0; 787 | } 788 | 789 | tl_sched_next_crt(0); 790 | assert(false); // shouldn't reach here 791 | } 792 | 793 | void 794 | msh_enter_blockable_call() { 795 | *msh_reallocatable = true; 796 | } 797 | 798 | void 799 | msh_exit_blockable_call() { 800 | while (!CAS(msh_reallocatable, true, false)) { 801 | // someone's trying to reallocate scavs 802 | // busy wait 803 | } 804 | 805 | __asm__ __volatile__("" ::: "memory"); // prevent compiler from reordering 806 | if (*msh_reallocated) { 807 | struct primary_thread_ctx *t = &PRIM_TABLE[msh_tid]; 808 | // slow-path 809 | PRINT("[%d] my scavs are reallocated...\n", msh_tid); 810 | allocate_scav(t); 811 | tl_update_next_map(); 812 | t->reallocated = false; 813 | } 814 | } 815 | -------------------------------------------------------------------------------- /libmsh/msh.h: -------------------------------------------------------------------------------- 1 | #ifndef __MSH_H__ 2 | #define __MSH_H__ 3 | #include 4 | #include 5 | #include 6 | 7 | #define CRT_STACK_SZ (1 << 20) // 1MB 8 | 9 | // These numbers are over-provisioned to avoid out-of-resource issues. 
10 | #define MAX_THREAD_NUM 128 11 | #define MAX_ACTIVE_CRT_NUM_PER_APPLICATION 512 12 | #define MAX_CRT_PER_THREAD 128 13 | #define MAX_REG_SET_NUM 15 // You may need more register sets depending on the number of 14 | #define NUM_OF_LINES_FOR_LOOP_COUNTER 4 15 | 16 | // x86 GPR + floating-point registers 17 | struct x86_registers { 18 | void *rax; 19 | void *rbx; 20 | void *rcx; 21 | void *rdx; 22 | void *rsi; 23 | void *rdi; 24 | void *rbp; 25 | void *rsp; 26 | void *r8; 27 | void *r9; 28 | void *r10; 29 | void *r11; 30 | void *r12; 31 | void *r13; 32 | void *r14; 33 | void *r15; 34 | void *XMM0[2]; 35 | void *XMM1[2]; 36 | void *XMM2[2]; 37 | void *XMM3[2]; 38 | }; 39 | 40 | // performance critical contexts 41 | // the field order must be unchanged -- the binary instrumenter relies on the field offsets 42 | struct __attribute__((aligned(4), packed)) yield_ctx { 43 | // We hope the first instance of this struct starts at a cache line boundary 44 | // If so, we can pack three yield_ctx per cache line 45 | uint64_t sp; // last stack pointer 46 | uint64_t ip; // last instruction pointer 47 | short special_next; 48 | short normal_next; // normal_next = -1 means the coroutine is finished 49 | }; 50 | 51 | // Coroutine Context represents a scavenger thread 52 | // the field order must be respected 53 | struct __attribute__((aligned(16), packed)) coroutine_ctx { 54 | void *lib_handle; // 0 55 | void (*entry)(void); // 8 56 | char pad[8]; // 16, padded for stack alignment 57 | char stack[CRT_STACK_SZ]; // 24 58 | void *ret_to_sched; // this must be at the top of the stack 59 | char pad2[32]; 60 | // entry arguments 61 | int argc; 62 | char **argv; 63 | struct yield_ctx *yc; // pointer to current yc 64 | bool finished; 65 | void (*init)(void); // scavenger init function 66 | bool inited; 67 | }; 68 | 69 | // gs points to primary_thread_ctx of the current thread 70 | struct __attribute__((aligned(64))) primary_thread_ctx { 71 | struct yield_ctx ycs[MAX_CRT_PER_THREAD + 1]; // 
should be contiguous, 0 72 | // is reserved for primary 73 | char loop_counter[MAX_CRT_PER_THREAD + 1][NUM_OF_LINES_FOR_LOOP_COUNTER * 64]; 74 | struct x86_registers 75 | regs[MAX_CRT_PER_THREAD + 1] 76 | [MAX_REG_SET_NUM]; // manipulated by instrumented code 77 | // space for register saving for 78 | // optimizations that are unable to use the stack 79 | // - FUR 80 | // - Loop Optimization 81 | int idx_list[MAX_CRT_PER_THREAD + 1]; // list of scavenger thread indices that 82 | // this primary hosts 83 | int scav_num; 84 | bool reallocatable; // whether the scavengers allocated to this primary 85 | // can be reallocated to other primaries 86 | bool reallocated; // whether the scavengers allocated to this primary 87 | // have been reallocated to other primaries 88 | }; 89 | 90 | /** 91 | * Interfaces 92 | * Pthread wrappers use these calls. 93 | * WARNING: all functions work in the context of the current thread 94 | * You should assume that every interface includes the thread context 95 | * as the first argument, and every function except msh_init() is reentrant. 
96 | */ 97 | int msh_init(int max_scav); 98 | int msh_alloc_ctx(int tid); 99 | int msh_cleanup(); 100 | 101 | void msh_enter_blockable_call(); 102 | void msh_exit_blockable_call(); 103 | 104 | #endif /* __MSH_H__ */ 105 | -------------------------------------------------------------------------------- /libmsh/pthread.c: -------------------------------------------------------------------------------- 1 | /** 2 | * This library will be preloaded into the primary's address space 3 | * Its main purposes are 4 | * 1) overriding some pthread functions 5 | * 2) declaring some global symbols 6 | * 7 | * This code assumes C11 standard 8 | */ 9 | #define _GNU_SOURCE 10 | 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | 17 | #include "msh.h" 18 | 19 | static __thread int msh_thread_id; 20 | static _Atomic int thread_cnt = 0; 21 | 22 | struct msh_start_routine_arg { 23 | void *(*start_routine)(void *); 24 | void *arg; 25 | int thread_id; 26 | }; 27 | 28 | 29 | // this routine is invoked in the context of the new thread 30 | void * 31 | msh_start_routine(void *arg) { 32 | msh_thread_id = ((struct msh_start_routine_arg *) arg)->thread_id; 33 | 34 | struct timespec start, end; 35 | clock_gettime(CLOCK_MONOTONIC, &start); 36 | int err = msh_alloc_ctx(msh_thread_id); 37 | clock_gettime(CLOCK_MONOTONIC, &end); 38 | fprintf(stdout, "msh_alloc_time= %ld ms\n", (end.tv_sec - start.tv_sec) * 1000 + (end.tv_nsec - start.tv_nsec) / 1000000); 39 | 40 | if (err) { 41 | fprintf(stderr, "msh_alloc_ctx() failed\n"); 42 | return NULL; 43 | } 44 | 45 | void *ret = 46 | ((struct msh_start_routine_arg *) arg) 47 | ->start_routine(((struct msh_start_routine_arg *) arg)->arg); 48 | 49 | err = msh_cleanup(); // msh_cleanup() takes no arguments; it acts on the calling thread 50 | if (err) { 51 | fprintf(stderr, "msh_cleanup() failed\n"); 52 | return NULL; 53 | } 54 | 55 | return ret; 56 | } 57 | 58 | // overridden pthread_create 59 | int (*original_pthread_create)(pthread_t *restrict, 60 | const pthread_attr_t *restrict, 61 | void *(*) 
(void *), void *restrict) = NULL; 62 | 63 | int 64 | pthread_create(pthread_t *restrict thread, const pthread_attr_t *restrict attr, 65 | void *(*start_routine)(void *), void *restrict arg) { 66 | if (thread_cnt == 0) { 67 | char *max_scav_str = getenv("MAX_SCAV_PER_THREAD"); 68 | int max_scav = 1; 69 | if (max_scav_str) 70 | max_scav = atoi(max_scav_str); 71 | 72 | if (msh_init(max_scav)) { 73 | fprintf(stderr, "msh_init() failed\n"); 74 | return -1; 75 | } 76 | 77 | 78 | if (!getenv("SKIP_FIRST_THREAD")) { 79 | int err = msh_alloc_ctx(thread_cnt++); 80 | if (err) { 81 | fprintf(stderr, "msh_alloc_ctx() failed\n"); 82 | return -1; 83 | } 84 | } 85 | } 86 | 87 | struct msh_start_routine_arg *msh_arg = 88 | (struct msh_start_routine_arg *) malloc( 89 | sizeof(struct msh_start_routine_arg)); 90 | msh_arg->start_routine = start_routine; 91 | msh_arg->arg = arg; 92 | msh_arg->thread_id = thread_cnt++; 93 | if (!original_pthread_create) 94 | original_pthread_create = dlsym(RTLD_NEXT, "pthread_create"); 95 | 96 | int ret = original_pthread_create(thread, attr, msh_start_routine, 97 | (void *) msh_arg); 98 | return ret; 99 | } 100 | 101 | /** 102 | * pthread blockable calls 103 | */ 104 | 105 | int (*original_pthread_mutex_lock)(pthread_mutex_t *mutex) = NULL; 106 | 107 | int 108 | pthread_mutex_lock(pthread_mutex_t *mutex) { 109 | msh_enter_blockable_call(); 110 | if (!original_pthread_mutex_lock) { 111 | original_pthread_mutex_lock = dlsym(RTLD_NEXT, "pthread_mutex_lock"); 112 | } 113 | int ret = original_pthread_mutex_lock(mutex); 114 | msh_exit_blockable_call(); 115 | 116 | return ret; 117 | } 118 | 119 | 120 | int (*original_pthread_cond_wait)(pthread_cond_t *cond, 121 | pthread_mutex_t *mutex) = NULL; 122 | 123 | int 124 | pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex) { 125 | msh_enter_blockable_call(); 126 | if (!original_pthread_cond_wait) { 127 | original_pthread_cond_wait = dlsym(RTLD_NEXT, "pthread_cond_wait"); 128 | } 129 | int ret = 
original_pthread_cond_wait(cond, mutex); 130 | msh_exit_blockable_call(); 131 | 132 | return ret; 133 | } 134 | 135 | int (*original_pthread_cond_timedwait)( 136 | pthread_cond_t *restrict cond, pthread_mutex_t *restrict mutex, 137 | const struct timespec *restrict abstime) = NULL; 138 | 139 | int 140 | pthread_cond_timedwait(pthread_cond_t *restrict cond, 141 | pthread_mutex_t *restrict mutex, 142 | const struct timespec *restrict abstime) { 143 | msh_enter_blockable_call(); 144 | if (!original_pthread_cond_timedwait) 145 | original_pthread_cond_timedwait = 146 | dlsym(RTLD_NEXT, "pthread_cond_timedwait"); 147 | int ret = original_pthread_cond_timedwait(cond, mutex, abstime); 148 | msh_exit_blockable_call(); 149 | 150 | return ret; 151 | } 152 | 153 | int (*original_pthread_barrier_wait)(pthread_barrier_t *barrier) = NULL; 154 | 155 | int 156 | pthread_barrier_wait(pthread_barrier_t *barrier) { 157 | msh_enter_blockable_call(); 158 | if (!original_pthread_barrier_wait) 159 | original_pthread_barrier_wait = 160 | dlsym(RTLD_NEXT, "pthread_barrier_wait"); 161 | int ret = original_pthread_barrier_wait(barrier); 162 | msh_exit_blockable_call(); 163 | 164 | return ret; 165 | } 166 | 167 | int (*original_pthread_join) (pthread_t thread, void **retval) = NULL; 168 | 169 | bool first_join = true; 170 | #define CAS(ptr, old_val, new_val) \ 171 | __sync_bool_compare_and_swap(ptr, old_val, new_val) 172 | int 173 | pthread_join(pthread_t thread, void **retval) { 174 | if (!getenv("SKIP_FIRST_THREAD")) { 175 | if (CAS(&first_join, true, false)) { 176 | int err = msh_cleanup(); 177 | if (err) { 178 | fprintf(stderr, "msh_cleanup() failed\n"); 179 | return -1; 180 | } 181 | } 182 | } 183 | if (!original_pthread_join) 184 | original_pthread_join = dlsym(RTLD_NEXT, "pthread_join"); 185 | int ret = original_pthread_join(thread, retval); 186 | return ret; 187 | } 188 | 189 | /** 190 | * pthread non-blockable calls -- we need to intercept these calls to match the version 191 | */ 192 
| int (*original_pthread_cond_signal)(pthread_cond_t *cond) = NULL; 193 | 194 | int 195 | pthread_cond_signal(pthread_cond_t *cond) { 196 | if (!original_pthread_cond_signal) { 197 | original_pthread_cond_signal = dlsym(RTLD_NEXT, "pthread_cond_signal"); 198 | } 199 | int ret = original_pthread_cond_signal(cond); 200 | 201 | return ret; 202 | } 203 | 204 | int (*original_pthread_mutex_unlock)(pthread_mutex_t *mutex) = NULL; 205 | 206 | int 207 | pthread_mutex_unlock(pthread_mutex_t *mutex) { 208 | if (!original_pthread_mutex_unlock) { 209 | original_pthread_mutex_unlock = 210 | dlsym(RTLD_NEXT, "pthread_mutex_unlock"); 211 | } 212 | int ret = original_pthread_mutex_unlock(mutex); 213 | 214 | return ret; 215 | } 216 | -------------------------------------------------------------------------------- /libmsh/spinlock.h: -------------------------------------------------------------------------------- 1 | #ifndef _SPINLOCK_XCHG_H 2 | #define _SPINLOCK_XCHG_H 3 | 4 | /* Spin lock using xchg. 5 | * Copied from http://locklessinc.com/articles/locks/ 6 | */ 7 | 8 | /* Compiler read-write barrier */ 9 | #define barrier() __asm__ volatile("": : :"memory") 10 | 11 | /* Pause instruction to prevent excess processor bus usage */ 12 | #define cpu_relax() __asm__ volatile("pause\n": : :"memory") 13 | 14 | static inline unsigned short xchg_8(void *ptr, unsigned char x) 15 | { 16 | __asm__ __volatile__("xchgb %0,%1" 17 | :"=r" (x) 18 | :"m" (*(volatile unsigned char *)ptr), "0" (x) 19 | :"memory"); 20 | 21 | return x; 22 | } 23 | 24 | #define BUSY 1 25 | typedef unsigned char spinlock; 26 | 27 | #define SPINLOCK_INITIALIZER 0 28 | 29 | static inline void spin_lock(spinlock *lock) 30 | { 31 | while (1) { 32 | if (!xchg_8(lock, BUSY)) return; 33 | 34 | while (*lock) cpu_relax(); 35 | } 36 | } 37 | 38 | static inline void spin_unlock(spinlock *lock) 39 | { 40 | barrier(); 41 | *lock = 0; 42 | } 43 | 44 | static inline int spin_trylock(spinlock *lock) 45 | { 46 | return xchg_8(lock, 
BUSY); 47 | } 48 | 49 | #endif /* _SPINLOCK_XCHG_H */ -------------------------------------------------------------------------------- /profile/scripts/capture_cm_inst_primary.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SAMPLING_RATE_IN_CNTS=$1 3 | TARGET=$2 4 | shift 2 5 | ARGS="$*" 6 | 7 | TARGET_ABS_PATH=$(readlink -f $TARGET) 8 | TARGET_BASENAME=$(basename $TARGET_ABS_PATH) 9 | 10 | cd "$(dirname "$0")" 11 | 12 | MAIN_DIR=../.. 13 | BUILD_DIR=${MAIN_DIR}/build 14 | CRLIST=${MAIN_DIR}/crlist.txt 15 | RES_DIR=../results 16 | TMP_DIR=../results/tmp 17 | SCHED_NAME=sched.out 18 | #TARGET=$(cat ${CRLIST} | awk '{print $1}') 19 | 20 | echo "Capture deliquent load PCs ... " 21 | echo "Target object: " 22 | #echo -n "$TARGET_ABS_PATH " > $CRLIST 23 | #echo $ARGS >> $CRLIST 24 | #cat $CRLIST 25 | 26 | mkdir -p ${RES_DIR} 27 | 28 | rm -r ${TMP_DIR} 2> /dev/null 29 | mkdir -p ${TMP_DIR} 30 | 31 | echo " 1) perf record L2/L3 misses" 32 | # TODO: determine sampling freq, type of events 33 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -- $benchmark_path/$benchmark_name 1 1 $input_graphs_path/$g 34 | EVENT1=cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp 35 | EVENT2=cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp 36 | EVENT3=BR_INST_RETIRED.ALL_BRANCHES 37 | sudo perf record -e ${EVENT1},${EVENT2},${EVENT3}\ 38 | -b \ 39 | -c${SAMPLING_RATE_IN_CNTS} \ 40 | -o ${RES_DIR}/perf.data -- sudo sh -c ". 
${MAIN_DIR}/config/set_env.sh; chrt -r 99 time -p ${TARGET_ABS_PATH} ${ARGS}" 41 | 42 | sudo chown ${USER} ${RES_DIR}/perf.data 43 | 44 | #perf record -e cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp,cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp,cycles:u -b -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${CRLIST} 45 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${CRLIST} 46 | 47 | echo " 2) perf report" 48 | perf report -i ${RES_DIR}/perf.data --sort comm,dso,symbol -t"\$" | 49 | sed '/BR_INST_RETIRED.ALL_BRANCHES/q' | grep "%" > ${TMP_DIR}/perf_report.txt 50 | 51 | 52 | echo " 3) Capture Functions that cause most L2/L3 misses" 53 | python3 read_func.py ${TMP_DIR}/perf_report.txt ${TMP_DIR}/fn_list.txt ${TARGET_BASENAME} #${SCHED_NAME} 54 | 55 | echo " 4) perf annotate --stdio & Capturing all deliquent load PCs for each function ...." 56 | echo -n > ${RES_DIR}/all_PClist.txt 57 | 58 | IFS=$'\n' 59 | for FUNC in $(cat ${TMP_DIR}/fn_list.txt | awk -F'[\t]' '{print $1}'); do 60 | echo " Function Name: $FUNC" 61 | echo " L2 misses " 62 | FUNC_NO_SPACE=$(echo $FUNC | sed 's/ /_/g') 63 | 64 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 65 | sed -n '/MEM_LOAD_RETIRED.L2_MISS/,/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/{p;/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/q}' | 66 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt 67 | 68 | echo " L3 misses " 69 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 70 | sed -n '/MEM_LOAD_RETIRED.L3_MISS/,/BR_INST_RETIRED/{p;/BR_INST_RETIRED/q}' | 71 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt 72 | 73 | 74 | python3 llc_missed_pcs_rfile.py ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 75 | python3 llc_missed_pcs_rfile.py 
${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 76 | 77 | unset IFS 78 | for PC in $(cat ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt); do 79 | echo -n "${TARGET_ABS_PATH} " >> ${RES_DIR}/all_PClist.txt 80 | echo ${PC} >> ${RES_DIR}/all_PClist.txt 81 | done 82 | IFS=$'\n' 83 | 84 | sort -u ${RES_DIR}/all_PClist.txt -o ${RES_DIR}/all_PClist.txt 85 | done 86 | unset IFS 87 | 88 | #rm -r ${TMP_DIR} 2> /dev/null 89 | echo "Instruction capture done!" 90 | -------------------------------------------------------------------------------- /profile/scripts/capture_cm_inst_process.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SAMPLING_RATE_IN_CNTS=$1 3 | TARGET=$2 4 | TARGET_PID=$3 5 | 6 | TARGET_ABS_PATH=$(readlink -f $TARGET) 7 | TARGET_BASENAME=$(basename $TARGET_ABS_PATH) 8 | 9 | cd "$(dirname "$0")" 10 | 11 | MAIN_DIR=../../ 12 | BUILD_DIR=${MAIN_DIR}/build 13 | CRLIST=${MAIN_DIR}/crlist.txt 14 | RES_DIR=../results 15 | TMP_DIR=../results/tmp 16 | SCHED_NAME=sched.out 17 | #TARGET=$(cat ${CRLIST} | awk '{print $1}') 18 | 19 | echo "Capture deliquent load PCs ... 
" 20 | echo "Target object: " 21 | #echo -n "$TARGET_ABS_PATH " > $CRLIST 22 | #echo $ARGS >> $CRLIST 23 | cat $CRLIST 24 | 25 | mkdir -p ${RES_DIR} 26 | 27 | rm -r ${TMP_DIR} 2> /dev/null 28 | mkdir -p ${TMP_DIR} 29 | 30 | echo " 1) perf record L2/L3 misses" 31 | # TODO: determine sampling freq, type of events 32 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -- $benchmark_path/$benchmark_name 1 1 $input_graphs_path/$g 33 | EVENT1=cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp 34 | EVENT2=cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp 35 | EVENT3=BR_INST_RETIRED.ALL_BRANCHES 36 | sudo perf record -e ${EVENT1},${EVENT2},${EVENT3}\ 37 | -b \ 38 | -c${SAMPLING_RATE_IN_CNTS} \ 39 | -o ${RES_DIR}/perf.data \ 40 | -p ${TARGET_PID} -- sleep 20 41 | echo "done" 42 | sudo chown ${USER} ${RES_DIR}/perf.data 43 | echo "hmm" 44 | #perf record -e cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp,cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp,cycles:u -b -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${CRLIST} 45 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${CRLIST} 46 | 47 | echo " 2) perf report" 48 | perf report -i ${RES_DIR}/perf.data --sort comm,dso,symbol -t"\$" | 49 | sed '/BR_INST_RETIRED.ALL_BRANCHES/q' | grep "%" > ${TMP_DIR}/perf_report.txt 50 | 51 | 52 | echo " 3) Capture Functions that cause most L2/L3 misses" 53 | python3 read_func_process.py ${TMP_DIR}/perf_report.txt ${TMP_DIR}/fn_list.txt ${TARGET_BASENAME} #${SCHED_NAME} 54 | 55 | echo " 4) perf annotate --stdio & Capturing all deliquent load PCs for each function ...." 
56 | echo -n > ${RES_DIR}/all_PClist.txt 57 | 58 | IFS=$'\n' 59 | for FUNC in $(cat ${TMP_DIR}/fn_list.txt | awk -F'[\t]' '{print $1}'); do 60 | echo " Function Name: $FUNC" 61 | echo " L2 misses " 62 | FUNC_NO_SPACE=$(echo $FUNC | sed 's/ /_/g') 63 | 64 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 65 | sed -n '/MEM_LOAD_RETIRED.L2_MISS/,/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/{p;/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/q}' | 66 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt 67 | 68 | echo " L3 misses " 69 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 70 | sed -n '/MEM_LOAD_RETIRED.L3_MISS/,/BR_INST_RETIRED/{p;/BR_INST_RETIRED/q}' | 71 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt 72 | 73 | 74 | python3 llc_missed_pcs_rfile.py ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 75 | python3 llc_missed_pcs_rfile.py ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 76 | 77 | unset IFS 78 | for PC in $(cat ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt); do 79 | echo -n "${TARGET_ABS_PATH} " >> ${RES_DIR}/all_PClist.txt 80 | echo ${PC} >> ${RES_DIR}/all_PClist.txt 81 | done 82 | IFS=$'\n' 83 | 84 | sort -u ${RES_DIR}/all_PClist.txt -o ${RES_DIR}/all_PClist.txt 85 | done 86 | unset IFS 87 | 88 | #rm -r ${TMP_DIR} 2> /dev/null 89 | echo "Instruction capture done!" 
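The capture scripts all rely on the same `sed -n '/start/,/end/p'` range trick to slice one event's section out of the combined `perf annotate --stdio` output. Below is a minimal, self-contained sketch of that extraction on fabricated annotate-style input (real output has more columns, and the scripts additionally accept `BR_INST_RETIRED` as an alternate end marker via GNU sed alternation; this sketch uses the simpler single end marker):

```shell
# Fabricated stand-in for `perf annotate --stdio` output: one line of
# annotation under each of three event-section headers.
annotate='MEM_LOAD_RETIRED.L2_MISS
 12.34 : 4005d0: mov rax, qword ptr [rbx]
MEM_LOAD_RETIRED.L3_MISS
 56.78 : 4005e0: mov rcx, qword ptr [rdx]
BR_INST_RETIRED
  0.00 : 4005f0: jmp 4005d0'

# Print from the L2_MISS header through the next header, then drop both
# boundary lines -- only the L2-miss annotation body remains.
l2_section=$(printf '%s\n' "$annotate" |
    sed -n '/MEM_LOAD_RETIRED.L2_MISS/,/MEM_LOAD_RETIRED.L3_MISS/p' |
    sed '1d' | sed '$d')

printf '%s\n' "$l2_section"
```

In the real scripts the extracted body lines are then handed to `llc_missed_pcs_rfile.py`, which pulls out the PC column for `all_PClist.txt`.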
90 | -------------------------------------------------------------------------------- /profile/scripts/capture_cm_inst_scav.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SAMPLING_RATE_IN_CNTS=$1 3 | 4 | cd "$(dirname "$0")" 5 | 6 | MAIN_DIR=../../ 7 | BUILD_DIR=${MAIN_DIR}/apps/build 8 | SCAVLIST=${MAIN_DIR}/scav.txt 9 | RES_DIR=../results 10 | TMP_DIR=../results/tmp 11 | SCHED_NAME=template 12 | TARGET=$(cat ${SCAVLIST} | awk '{print $1}') 13 | 14 | echo "Capture deliquent load PCs ... " 15 | echo "Target object: " 16 | #echo -n "$TARGET_ABS_PATH " > $SCAVLIST 17 | #echo $ARGS >> $SCAVLIST 18 | cat $SCAVLIST 19 | 20 | mkdir -p ${RES_DIR} 21 | 22 | rm -r ${TMP_DIR} 2> /dev/null 23 | mkdir -p ${TMP_DIR} 24 | 25 | echo " 1) perf record L2/L3 misses" 26 | # TODO: determine sampling freq, type of events 27 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -- $benchmark_path/$benchmark_name 1 1 $input_graphs_path/$g 28 | EVENT1=cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp 29 | EVENT2=cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp 30 | EVENT3=BR_INST_RETIRED.ALL_BRANCHES 31 | sudo perf record -e ${EVENT1},${EVENT2},${EVENT3}\ 32 | -b \ 33 | -c${SAMPLING_RATE_IN_CNTS} \ 34 | -o ${RES_DIR}/perf.data -- sudo sh -c ". 
${MAIN_DIR}/config/set_env.sh; chrt -r 99 ${BUILD_DIR}/${SCHED_NAME} 1" 35 | 36 | sudo chown ${USER} ${RES_DIR}/perf.data 37 | 38 | #perf record -e cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp,cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp,cycles:u -b -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${SCAVLIST} 39 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${SCAVLIST} 40 | 41 | echo " 2) perf report" 42 | perf report -i ${RES_DIR}/perf.data --sort comm,dso,symbol -t"\$" | 43 | sed '/BR_INST_RETIRED.ALL_BRANCHES/q' | grep "%" > ${TMP_DIR}/perf_report.txt 44 | 45 | 46 | echo " 3) Capture Functions that cause most L2/L3 misses" 47 | python3 read_func.py ${TMP_DIR}/perf_report.txt ${TMP_DIR}/fn_list.txt ${SCHED_NAME} 48 | 49 | echo " 4) perf annotate --stdio & Capturing all deliquent load PCs for each function ...." 50 | echo -n > ${RES_DIR}/all_PClist.txt 51 | 52 | IFS=$'\n' 53 | for FUNC in $(cat ${TMP_DIR}/fn_list.txt | awk -F'[\t]' '{print $1}'); do 54 | echo " Function Name: $FUNC" 55 | echo " L2 misses " 56 | FUNC_NO_SPACE=$(echo $FUNC | sed 's/ /_/g') 57 | 58 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 59 | sed -n '/MEM_LOAD_RETIRED.L2_MISS/,/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/{p;/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/q}' | 60 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt 61 | 62 | echo " L3 misses " 63 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 64 | sed -n '/MEM_LOAD_RETIRED.L3_MISS/,/BR_INST_RETIRED/{p;/BR_INST_RETIRED/q}' | 65 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt 66 | 67 | 68 | python3 llc_missed_pcs_rfile.py ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 69 | python3 llc_missed_pcs_rfile.py 
${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 70 | 71 | unset IFS 72 | for PC in $(cat ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt); do 73 | echo -n "${TARGET} " >> ${RES_DIR}/all_PClist.txt 74 | echo ${PC} >> ${RES_DIR}/all_PClist.txt 75 | done 76 | IFS=$'\n' 77 | 78 | sort -u ${RES_DIR}/all_PClist.txt -o ${RES_DIR}/all_PClist.txt 79 | done 80 | unset IFS 81 | 82 | #rm -r ${TMP_DIR} 2> /dev/null 83 | echo "Instruction capture done!" 84 | -------------------------------------------------------------------------------- /profile/scripts/compute_ld_prob.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import time 3 | sub_cf_dict_list = [] 4 | cf_dict = {} 5 | lat_dict = {} 6 | pred_dict = {} 7 | l2m_dict = {} 8 | l3m_dict = {} 9 | symbol_map = {} 10 | 11 | SAMPLING_RATE_IN_CNT=1 12 | SYS_LBR_ENTRIES=32 13 | LBR_SUBSAMPLE_SIZE=32 14 | is_scav_profile = 0 15 | 16 | last_match = None 17 | def __symbol_map_lookup(addr: int) -> tuple: 18 | global last_match 19 | if last_match is not None: 20 | if last_match[1][1] <= addr and addr <= last_match[1][2]: 21 | return last_match[0], addr - last_match[1][0] 22 | 23 | for key, value in symbol_map.items(): 24 | if value[1] <= addr and addr <= value[2]: # addr is in the range of the symbol 25 | last_match = (key, value) 26 | return key, addr - value[0] # compute offset from the object local address (not symbol local address) 27 | return None, 0 28 | 29 | def __localize_address(range: tuple, allow_diff_obj=False) -> tuple: 30 | assert(len(range) == 2) 31 | 32 | src_pc = range[0] 33 | dst_pc = range[1] 34 | 35 | # check address range of src, dst using symbol map 36 | # then do two things 37 | # 1) filter out if src, dst are not in the same symbol 38 | # 2) if src, dst are in the same symbol, then translate them to symbol-local address 39 | # by subtracting the symbol start address from src, dst 40 
| src_obj_name, src_local_addr = __symbol_map_lookup(src_pc) 41 | dst_obj_name, dst_local_addr = __symbol_map_lookup(dst_pc) 42 | 43 | if src_obj_name is None or dst_obj_name is None: 44 | return None 45 | if (src_obj_name != dst_obj_name) and (not allow_diff_obj): 46 | return None 47 | 48 | return (src_obj_name, src_local_addr, dst_local_addr) 49 | 50 | def create_control_flow_tuples(sub_cf_dict, sub_lat_dict, branch_src_to_dst_list: list) -> None: 51 | src, dst = None, None 52 | for i, src_to_dst in enumerate(branch_src_to_dst_list): 53 | if i == 0: # first item 54 | src = src_to_dst[0] 55 | else: 56 | dst = src_to_dst[1] 57 | cycles = src_to_dst[2] 58 | triple = __localize_address((dst, src)) 59 | if triple is None: 60 | src = src_to_dst[0] 61 | continue 62 | 63 | if triple in sub_cf_dict: 64 | sub_cf_dict[triple] += 1 65 | sub_lat_dict[triple] = (sub_lat_dict[triple][0] + cycles, sub_lat_dict[triple][1] + 1) 66 | else: 67 | sub_cf_dict[triple] = 1 68 | sub_lat_dict[triple] = (cycles, 1) 69 | 70 | src = src_to_dst[0] 71 | 72 | 73 | def create_predecessor_profile(sub_pred_dict, branch_src_to_dst_list: list) -> None: 74 | for i, src_to_dst in enumerate(branch_src_to_dst_list): 75 | src = src_to_dst[0] 76 | dst = src_to_dst[1] 77 | triple = __localize_address((src, dst), allow_diff_obj=True) 78 | if triple is None: 79 | continue 80 | 81 | local_addr_src = (triple[0], triple[1]) 82 | local_addr_dst = (triple[0], triple[2]) 83 | 84 | if local_addr_dst in sub_pred_dict: 85 | #do something 86 | if local_addr_src in sub_pred_dict[local_addr_dst]: 87 | sub_pred_dict[local_addr_dst][local_addr_src] += 1 88 | else: 89 | sub_pred_dict[local_addr_dst][local_addr_src] = 1 90 | else: 91 | sub_pred_dict[local_addr_dst] = {} 92 | sub_pred_dict[local_addr_dst][local_addr_src] = 1 93 | 94 | def aggregate_sub_dicts(sub_dicts: list) -> None: 95 | sub_cf_dicts = [sub_dict[0] for sub_dict in sub_dicts] 96 | sub_lat_dicts = [sub_dict[1] for sub_dict in sub_dicts] 97 | sub_pred_dicts = 
[sub_dict[2] for sub_dict in sub_dicts] 98 | 99 | for sub_cf_dict in sub_cf_dicts: 100 | for key, value in sub_cf_dict.items(): 101 | if key in cf_dict: 102 | cf_dict[key] += value 103 | else: 104 | cf_dict[key] = value 105 | 106 | if is_scav_profile == 0: 107 | return 108 | 109 | for sub_lat_dict in sub_lat_dicts: 110 | for key, value in sub_lat_dict.items(): 111 | if key in lat_dict: 112 | lat_dict[key] = (lat_dict[key][0] + value[0], lat_dict[key][1] + value[1]) 113 | else: 114 | lat_dict[key] = value 115 | 116 | for key, value in lat_dict.items(): 117 | lat_dict[key] = (value[0] / value[1]) # cycles / cnt 118 | 119 | for sub_pred_dict in sub_pred_dicts: 120 | for key, value in sub_pred_dict.items(): 121 | if key in pred_dict: 122 | for key2, value2 in value.items(): 123 | if key2 in pred_dict[key]: 124 | pred_dict[key][key2] += value2 125 | else: 126 | pred_dict[key][key2] = value2 127 | else: 128 | pred_dict[key] = value 129 | 130 | 131 | def sub_process_lbr(lines): 132 | sub_cf_dict = {} 133 | sub_lat_dict = {} 134 | sub_pred_dict = {} 135 | for line in lines: 136 | tokens = line.split() 137 | if len(tokens) < 2: 138 | continue 139 | 140 | # Note: Python 3 integers are arbitrary precision, 141 | # so no need to worry about overflow in conversion 142 | # (branch src, branch dst, cycles elapsed since the last LBR) 143 | branch_src_to_dst_list = [(int(token.split("/")[0].strip(),16), \ 144 | int(token.split("/")[1].strip(),16), \ 145 | int(token.split("/")[5].strip())) 146 | for token in tokens] 147 | 148 | create_control_flow_tuples(sub_cf_dict, sub_lat_dict, branch_src_to_dst_list[:LBR_SUBSAMPLE_SIZE]) 149 | if is_scav_profile == 1: 150 | create_predecessor_profile(sub_pred_dict, branch_src_to_dst_list[:LBR_SUBSAMPLE_SIZE]) 151 | return sub_cf_dict, sub_lat_dict, sub_pred_dict 152 | 153 | 154 | def process_lbr(trace_filename: str) -> None: 155 | f = open(trace_filename, "r") 156 | lines = f.readlines() 157 | 158 | f.close() 159 | src_to_dst_count = {} 
160 | 161 | NUM_CORES = 56 162 | # split lines into N processes 163 | # each process generate control_flow_tuples and aggregate later 164 | sublines = [[] for i in range(NUM_CORES)] 165 | for i in range(NUM_CORES): 166 | sublines[i] = lines[i::NUM_CORES] 167 | 168 | import multiprocessing 169 | results = [] 170 | with multiprocessing.Pool(NUM_CORES) as p: 171 | results = p.map(sub_process_lbr, sublines) 172 | 173 | p.close() 174 | p.join() 175 | print([len(dic) if dic is not None else 0 for dic in results]) 176 | aggregate_sub_dicts(results) 177 | 178 | 179 | #for line in lines: 180 | # tokens = line.split() 181 | # if len(tokens) < 2: 182 | # continue 183 | # 184 | # # Note: python3 uses 8-byte integers by default (64-bit) 185 | # # So no need to worry about overflow in conversion 186 | # branch_src_to_dst_list = [(int(token.split("/")[0].strip(),16), \ 187 | # int(token.split("/")[1].strip(),16)) 188 | # for token in tokens] 189 | # 190 | # create_control_flow_tuples(branch_src_to_dst_list[:LBR_SUBSAMPLE_SIZE]) 191 | 192 | 193 | print("BB execution summary:") 194 | for key, value in cf_dict.items(): 195 | print(f"{key[0]} {hex(key[1])} {hex(key[2])} {value}") 196 | print(" ") 197 | 198 | print("BB latency summary:") 199 | for key, value in lat_dict.items(): 200 | print(f"LAT_PROF {key[0]} {key[1]} {key[2]} {value}") 201 | print(" ") 202 | 203 | print("BB predecessor summary:") 204 | for key, value in pred_dict.items(): 205 | for key2, value2 in value.items(): 206 | print(f"PRED_PROF {key[0]} {key[1]} {key2[0]} {key2[1]} {value2}") 207 | print(" ") 208 | 209 | #with open("cf_dict.txt", "w") as f: 210 | # for key, value in cf_dict.items(): 211 | # f.write(f"{key[0]} {key[1]} {value}\n") 212 | 213 | def sample_to_approx_total(exec_cnt: int, scale:int = 1) -> int: 214 | return exec_cnt / scale * SAMPLING_RATE_IN_CNT 215 | 216 | 217 | def create_cm_dict(summary_filename: str, level: int) -> None: 218 | dic = l2m_dict if level == 2 else l3m_dict 219 | 220 | f = 
open(summary_filename, "r") 221 | lines = f.readlines() 222 | for line in lines: 223 | # L2/L3 summary file format: 224 | # <pc> <sample_cnt> per line 225 | pc = int(line.split()[0].strip(), 16) 226 | sample_cnt = int(line.split()[1].strip()) 227 | 228 | obj_name, addr = __symbol_map_lookup(pc) 229 | if obj_name is None: 230 | continue 231 | dic[(obj_name, addr)] = sample_to_approx_total(sample_cnt) 232 | 233 | f.close() 234 | 235 | # Compute the probability that each load instruction causes a cache miss 236 | # Assumption: addresses are given in hex as object-local addresses. 237 | def compute_prob(addrlist_filename: str) -> None: 238 | f = open(addrlist_filename, "r") 239 | #out = open("exec_counts.txt", "w") 240 | lines = f.readlines() 241 | 242 | for line in lines: 243 | # Address list file format: 244 | # <obj_name> <addr> per line 245 | obj_name = line.split()[0].strip() 246 | addr = int(line.split()[1].strip(), 16) 247 | sample_cnt = 0 248 | 249 | # Loop through all control flow tuples to find the ones that include the given address 250 | for key, value in cf_dict.items(): 251 | if key[0] == obj_name and key[1] <= addr and addr <= key[2]: 252 | sample_cnt += value 253 | exec_cnt = sample_to_approx_total(sample_cnt, LBR_SUBSAMPLE_SIZE) 254 | 255 | l2_miss_cnt = l2m_dict[(obj_name, addr)] if (obj_name, addr) in l2m_dict else 0 256 | l3_miss_cnt = l3m_dict[(obj_name, addr)] if (obj_name, addr) in l3m_dict else 0 257 | 258 | # print prob up to two decimal digits after the point 259 | if exec_cnt == 0: 260 | print(f"LOAD_PROB {obj_name} {addr} {l2_miss_cnt} {l3_miss_cnt} {exec_cnt} 0.00% 0.00%") 261 | continue 262 | 263 | print(f"LOAD_PROB {obj_name} {addr} {l2_miss_cnt} {l3_miss_cnt} {exec_cnt}" + \ 264 | " {0:.2f}% {1:.2f}%".format(100*l2_miss_cnt/exec_cnt, 100*l3_miss_cnt/exec_cnt)) 265 | f.close() 266 | 267 | 268 | def run(lbr_trace_filename: str, 269 | l2_summary_filename: str, 270 | l3_summary_filename: str, 271 | addrlist_filename: str) -> None: 272 | 273 | t1 = time.time() 274 | 
process_lbr(lbr_trace_filename) 275 | t2 = time.time() 276 | create_cm_dict(l2_summary_filename, 2) 277 | t3 = time.time() 278 | create_cm_dict(l3_summary_filename, 3) 279 | t4 = time.time() 280 | compute_prob(addrlist_filename) 281 | t5 = time.time() 282 | 283 | print(f"process_lbr: {t2 - t1} seconds") 284 | print(f"create_cm_dictl2: {t3 - t2} seconds") 285 | print(f"create_cm_dictl3: {t4 - t3} seconds") 286 | print(f"compute_prob: {t5 - t4} seconds") 287 | 288 | 289 | def build_symbol_map(symbolmap_filename: str) -> None: 290 | f = open(symbolmap_filename, "r") 291 | lines = f.readlines() 292 | for line in lines: 293 | #Symbol map file format: 294 | # <obj_name> <start_addr> <len> <pgoff> per line 295 | #we want to maintain (object start address, symbol start address, symbol end address) 296 | tokens = line.split() 297 | if len(tokens) < 4: 298 | continue 299 | #Assuming a dumb linear search 300 | symbol_map[tokens[0]] = (int(tokens[1], 16) - int(tokens[3], 16), int(tokens[1], 16), int(tokens[1], 16) + int(tokens[2], 16)) 301 | f.close() 302 | 303 | if __name__ == "__main__": 304 | if len(sys.argv) < 8: 305 | print("usage: " + sys.argv[0] + " <lbr_trace> <symbol_map> <addr_list> <l2_summary> <l3_summary> <sampling_rate_in_cnt> <is_scav_profile>") 306 | sys.exit(1) 307 | lbr_trace_filename = sys.argv[1] 308 | symbolmap_filename = sys.argv[2] 309 | addrlist_filename = sys.argv[3] 310 | l2_summary_filename = sys.argv[4] 311 | l3_summary_filename = sys.argv[5] 312 | SAMPLING_RATE_IN_CNT = int(sys.argv[6]) 313 | is_scav_profile = int(sys.argv[7]) 314 | 315 | start = time.time() 316 | build_symbol_map(symbolmap_filename) 317 | end = time.time() 318 | print(f"Symbol map built in {end - start} seconds") 319 | 320 | run(lbr_trace_filename, l2_summary_filename, l3_summary_filename, addrlist_filename) 321 | -------------------------------------------------------------------------------- /profile/scripts/compute_ld_prob.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | if [ $# -ne 3 ]; then 3 | echo "Usage: $0 <perf_data_path> <is_scav> <sampling_rate_in_cnts>" 4 | exit 1 5 | fi 6 | 7 | PERF_DATA_PATH=$1 8 | 
IS_SCAV=$2 9 | PERF_ABS_PATH=$(readlink -f $PERF_DATA_PATH) 10 | SAMPLING_RATE_IN_CNTS=$3 11 | #SAMPLING_RATE_IN_CNTS=$(perf script -i ${PERF_ABS_PATH} --header | grep BR_INST | grep "sample_freq" | awk -F'[, ]' '{print $256}') 12 | echo ${SAMPLING_RATE_IN_CNTS} 13 | cd "$(dirname "$0")" 14 | 15 | MAIN_DIR=../../ 16 | BUILD_DIR=${MAIN_DIR}/build 17 | RES_DIR=../results 18 | TMP_DIR=../results/tmp 19 | 20 | mkdir -p ${TMP_DIR} 21 | 22 | # symbol map 23 | perf report -i ${PERF_ABS_PATH} --dump-raw-trace | grep PERF_RECORD_MMAP2 | \ 24 | awk -F'[][() ]' '{print $NF, $9, $10, $13}' > ${TMP_DIR}/symbol.map # 25 | 26 | # LBR samples 27 | perf script -i ${PERF_ABS_PATH} -F event,brstack | grep BRANCH | \ 28 | awk -F':' '{print $2}' > ${TMP_DIR}/lbr.samples 29 | 30 | # L2, L3 summary 31 | perf script -i ${PERF_ABS_PATH} -F event,ip | grep "MEM_LOAD_RETIRED.L2_MISS" | \ 32 | awk '{cnt[$2]+=1} END {for (ip in cnt) print ip, cnt[ip]}' > ${TMP_DIR}/l2.summary 33 | 34 | perf script -i ${PERF_ABS_PATH} -F event,ip | grep "MEM_LOAD_RETIRED.L3_MISS" | \ 35 | awk '{cnt[$2]+=1} END {for (ip in cnt) print ip, cnt[ip]}' > ${TMP_DIR}/l3.summary 36 | 37 | # finally, compute prob 38 | python3 compute_ld_prob.py ${TMP_DIR}/lbr.samples \ 39 | ${TMP_DIR}/symbol.map \ 40 | ${RES_DIR}/all_PClist.txt \ 41 | ${TMP_DIR}/l2.summary \ 42 | ${TMP_DIR}/l3.summary \ 43 | ${SAMPLING_RATE_IN_CNTS} \ 44 | $IS_SCAV > ${RES_DIR}/ld_prob.txt 45 | 46 | cat ${RES_DIR}/ld_prob.txt | grep "LOAD_PROB" | awk -F'[% ]' '{print $2, $3, $7, $9}' > ${RES_DIR}/cmpc_list.txt 47 | if [ $IS_SCAV -eq 1 ]; then 48 | cat ${RES_DIR}/ld_prob.txt | grep "LAT_PROF" > ${RES_DIR}/lat_prof.txt 49 | cat ${RES_DIR}/ld_prob.txt | grep "PRED_PROF" > ${RES_DIR}/pred_prof.txt 50 | fi 51 | 52 | #rm -r ${TMP_DIR} 2> /dev/null 53 | -------------------------------------------------------------------------------- /profile/scripts/llc_missed_pcs_rfile.py: -------------------------------------------------------------------------------- 1 | 
import sys 2 | 3 | # IN:  perf-annotate output for one function (argv[1]) 4 | # OUT: argv[2]: per-PC miss percentages, argv[3]: PCs that cover ~70% of misses 5 | percent = [] 6 | pc_list = [] 7 | 8 | with open(sys.argv[1]) as file_in: 9 | for line in file_in: 10 | if '.' not in line: 11 | continue 12 | if line.split()[0] == ":": 13 | continue 14 | percent_value = int(float(line.split()[0])) 15 | if percent_value > 0: 16 | percent.append(percent_value) 17 | pc_list.append(line.split()[2][:-1])  # strip the trailing ':' from the PC 18 | 19 | sum_percent = 0 20 | output_file_pc_list = open(sys.argv[2], "a") 21 | output_file_most_missed_pc = open(sys.argv[3], "a") 22 | 23 | # Emit PCs in decreasing order of miss share until ~70% of the misses are covered 24 | for pct, pc in sorted(zip(percent, pc_list), reverse=True): 25 | if sum_percent >= 70: 26 | break 27 | print(" PC: ", pc, " percent: ", pct) 28 | sum_percent = sum_percent + pct 29 | output_file_pc_list.write("percent: " + str(pct) + " PC: " + str(pc) + "\n") 30 | output_file_most_missed_pc.write(str(pc) + "\n") 31 | 32 | output_file_pc_list.close() 33 | output_file_most_missed_pc.close() 34 | -------------------------------------------------------------------------------- /profile/scripts/prepare_tbench.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | cd $(dirname "$0") 3 | 4 | 
TBENCH_DIR=/home/sam/clh/tailbench-clh 5 | TBENCH_APPS="img-dnn masstree sphinx xapian moses" 6 | 7 | BUILD_DIR=$(readlink -f ../../apps/build) 8 | 9 | if [ -z $1 ]; then 10 | echo "Usage: ./prepare_tbench.sh [build|install]" 11 | exit 12 | fi 13 | 14 | MODE=$1 15 | if [ $MODE == "build" ]; then 16 | for app in $TBENCH_APPS; do 17 | if [ $app == "moses" ]; then 18 | cd $TBENCH_DIR/${app}/moses-cmd 19 | else 20 | cd $TBENCH_DIR/${app} 21 | fi 22 | make clean 23 | if [ $app == "sphinx" ]; then 24 | PKG_CONFIG_PATH=${TBENCH_DIR}/${app}/sphinx-install/lib/pkgconfig make -j16 25 | elif [ $app == "silo" ]; then 26 | ./build.sh 27 | else 28 | make -j16 29 | fi 30 | done 31 | fi 32 | 33 | if [ $MODE == "install" ]; then 34 | for app in $TBENCH_APPS; do 35 | target=$app 36 | if [ $app == "masstree" ]; then 37 | target="mttest" 38 | elif [ $app == "sphinx" ]; then 39 | target="decoder" 40 | elif [ $app == "moses" ]; then 41 | target="moses-cmd/moses" 42 | elif [ $app == "silo" ]; then 43 | target="out-perf.masstree/benchmarks/dbtest" 44 | fi 45 | 46 | cd $TBENCH_DIR/${app} 47 | cp $TBENCH_DIR/${app}/${target}_integrated ${BUILD_DIR} 48 | done 49 | fi 50 | -------------------------------------------------------------------------------- /profile/scripts/read_func.py: -------------------------------------------------------------------------------- 1 | """ 2 | IN: perf report of L2, L3 cache misses 3 | OUT: 1)fn_list: list of functions that are responsible for most cache misses 4 | 2)fn_percent_list: list of percentages of cache misses for each function 5 | """ 6 | import sys 7 | import re 8 | 9 | funcNum = 0 10 | percentSum = 0 11 | #funcList = [] 12 | #percentList = [] 13 | fn_to_pct_map = {} 14 | 15 | bench = sys.argv[3] 16 | 17 | MAX_FUNC_NUM = 20 18 | with open(sys.argv[1]) as file_in: 19 | lines = [] 20 | for line in file_in: 21 | # check if it is substr 22 | if line.split('$')[1] in bench: # command = bench 23 | fn_name_prim = line.split('$')[3] 24 | fn_name = 
fn_name_prim.split()[1] 25 | for i in range(2, len(fn_name_prim.split())): 26 | fn_name += " " 27 | fn_name += fn_name_prim.split()[i] 28 | 29 | overhead_pct = float(line.split("%")[0]) 30 | 31 | if overhead_pct < 1: 32 | continue 33 | 34 | if fn_name in fn_to_pct_map: 35 | fn_to_pct_map[fn_name] += overhead_pct 36 | else: 37 | if funcNum >= MAX_FUNC_NUM: 38 | continue 39 | funcNum=funcNum+1 40 | fn_to_pct_map[fn_name] = overhead_pct 41 | 42 | 43 | #output_file_func_percent_list = open(sys.argv[2], "a") 44 | output_file_func_list = open(sys.argv[2], "a") 45 | 46 | 47 | if fn_to_pct_map: 48 | print("Functions that cause most cache misses (L2, L3 aggregated, total 200%):") 49 | for fn, pct in fn_to_pct_map.items(): 50 | print(f"{fn} {str(pct)}%"); 51 | output_file_func_list.write(f"{fn}\t{str(pct)}%\n") 52 | -------------------------------------------------------------------------------- /profile/scripts/read_func_process.py: -------------------------------------------------------------------------------- 1 | """ 2 | IN: perf report of L2, L3 cache misses 3 | OUT: 1)fn_list: list of functions that are responsible for most cache misses 4 | 2)fn_percent_list: list of percentages of cache misses for each function 5 | """ 6 | import sys 7 | import re 8 | 9 | funcNum = 0 10 | percentSum = 0 11 | #funcList = [] 12 | #percentList = [] 13 | fn_to_pct_map = {} 14 | 15 | bench = sys.argv[3] 16 | 17 | MAX_FUNC_NUM = 20 18 | with open(sys.argv[1]) as file_in: 19 | lines = [] 20 | for line in file_in: 21 | # check if it is substr 22 | print(line.split('$')[2], bench) 23 | if line.split('$')[2] in bench or bench in line.split('$')[2]: # command = bench 24 | fn_name_prim = line.split('$')[3] 25 | fn_name = fn_name_prim.split()[1] 26 | for i in range(2, len(fn_name_prim.split())): 27 | fn_name += " " 28 | fn_name += fn_name_prim.split()[i] 29 | 30 | overhead_pct = float(line.split("%")[0]) 31 | if overhead_pct < 1: 32 | continue 33 | 34 | if fn_name in fn_to_pct_map: 35 | 
fn_to_pct_map[fn_name] += overhead_pct 36 | else: 37 | if funcNum >= MAX_FUNC_NUM: 38 | continue 39 | funcNum=funcNum+1 40 | fn_to_pct_map[fn_name] = overhead_pct 41 | 42 | 43 | #output_file_func_percent_list = open(sys.argv[2], "a") 44 | output_file_func_list = open(sys.argv[2], "a") 45 | 46 | 47 | if fn_to_pct_map: 48 | print("Functions that cause most cache misses (L2, L3 aggregated, total 200%):") 49 | for fn, pct in fn_to_pct_map.items(): 50 | print(f"{fn} {str(pct)}%"); 51 | output_file_func_list.write(f"{fn}\t{str(pct)}%\n") 52 | -------------------------------------------------------------------------------- /profile/scripts/read_lats.py: -------------------------------------------------------------------------------- 1 | import struct 2 | import numpy as np 3 | import sys 4 | 5 | def main(): 6 | fileName = 'lats.bin' 7 | with open(fileName, mode='rb') as file: 8 | fileContent = file.read() 9 | 10 | results = [ r for r in struct.iter_unpack("QQQ", fileContent) ] 11 | queue_times = [ v[0]/1e6 for v in results ] 12 | svc_times = [ v[1]/1e6 for v in results ] 13 | sjrn_times = [ v[2]/1e6 for v in results ] 14 | 15 | if len(sys.argv) > 1: 16 | for i, _ in enumerate(queue_times): 17 | print('{} {} {}'.format(queue_times[i], svc_times[i], sjrn_times[i])) 18 | 19 | print('query count: {}'.format(len(queue_times))) 20 | print('queue time: average {}'.format(np.mean(queue_times))) 21 | print('service time: average {}'.format(np.mean(svc_times))) 22 | print('sojourn time: average {}'.format(np.mean(sjrn_times))) 23 | 24 | print('queue time: median {}, 90% {}, 95% {}, 99% {}'.format(np.median(queue_times), np.percentile(queue_times, 90), np.percentile(queue_times, 95), np.percentile(queue_times, 99))) 25 | print('service time: median {}, 90% {}, 95% {}, 99% {}'.format(np.median(svc_times), np.percentile(svc_times, 90), np.percentile(svc_times, 95), np.percentile(svc_times, 99))) 26 | print('sojourn time: median {}, 90% {}, 95% {}, 99% {}'.format(np.median(sjrn_times), 
np.percentile(sjrn_times, 90), np.percentile(sjrn_times, 95), np.percentile(sjrn_times, 99))) 27 | 28 | if __name__ == "__main__": 29 | main() 30 | -------------------------------------------------------------------------------- /profile/scripts/scav_profile.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sosson97/msh/5cdde8a2fe8f8eee103caf296ff21e170b465c6b/profile/scripts/scav_profile.py --------------------------------------------------------------------------------
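A note on how these scripts fit together: `compute_ld_prob.py` estimates each delinquent load's cache-miss probability by scaling two sampled counts back to approximate totals — LBR samples give basic-block execution counts (each sample carries `LBR_SUBSAMPLE_SIZE` branch records), and PEBS samples give per-PC L2/L3 miss counts — then dividing. Below is a minimal sketch of that arithmetic; the `miss_prob` helper and the event counts are made-up illustration values, not taken from the repository:

```python
# Sketch of the scaling used by compute_ld_prob.py (hypothetical numbers).
# Both the LBR samples and the PEBS miss samples are taken once every
# SAMPLING_RATE_IN_CNT events (perf's -c period), so each sample is scaled
# back to an approximate total before dividing.

SAMPLING_RATE_IN_CNT = 100003   # hypothetical perf -c value
LBR_SUBSAMPLE_SIZE = 32         # branch records consumed per LBR sample

def sample_to_approx_total(sample_cnt: int, scale: int = 1) -> float:
    # One perf sample stands for SAMPLING_RATE_IN_CNT events; LBR samples
    # additionally carry `scale` branch records each.
    return sample_cnt / scale * SAMPLING_RATE_IN_CNT

def miss_prob(bb_record_cnt: int, miss_sample_cnt: int) -> float:
    # bb_record_cnt: LBR branch records covering the load's basic block
    # miss_sample_cnt: PEBS samples that hit the load's PC
    exec_cnt = sample_to_approx_total(bb_record_cnt, LBR_SUBSAMPLE_SIZE)
    miss_cnt = sample_to_approx_total(miss_sample_cnt)
    return 0.0 if exec_cnt == 0 else miss_cnt / exec_cnt

# e.g. a load whose basic block showed up in 6400 LBR records and whose PC
# got 120 L3-miss samples:
print(f"{100 * miss_prob(6400, 120):.2f}%")  # -> 60.00%
```

Since both event types are recorded with the same `-c` period, the period cancels out of the ratio; the scripts keep the scaling explicit because the `LOAD_PROB` lines also report the approximate absolute counts.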