├── README.md
├── apps
│   ├── Makefile
│   └── src
│       ├── primary
│       │   ├── ptrchase.cpp
│       │   └── template.cpp
│       └── scavengers
│           └── compute.cpp
├── do_prof_primary.sh
├── do_prof_scavenger.sh
├── libmsh
│   ├── Makefile
│   ├── msh.c
│   ├── msh.h
│   ├── pthread.c
│   └── spinlock.h
├── msh_bolt.diff
└── profile
    └── scripts
        ├── capture_cm_inst_primary.sh
        ├── capture_cm_inst_process.sh
        ├── capture_cm_inst_scav.sh
        ├── compute_ld_prob.py
        ├── compute_ld_prob.sh
        ├── llc_missed_pcs_rfile.py
        ├── prepare_tbench.sh
        ├── read_func.py
        ├── read_func_process.py
        ├── read_lats.py
        └── scav_profile.py
/README.md:
--------------------------------------------------------------------------------
1 | # Memory Stall Software Harvester
2 | This repository contains the prototype of the Memory Stall Software Harvester (MSH), the first system designed to transparently and efficiently harvest memory-bound CPU stall cycles in software. Why harvest memory-bound stalls through a software mechanism when there are well-known hardware harvesting mechanisms like Intel Hyper-Threading? The answer is that hardware mechanisms are inflexible: they cannot differentiate between latency-sensitive applications and others, and they only provide limited concurrency (e.g., 2 threads), often harvesting too much or too little. MSH allows for adjusting the length and frequency of cycle harvesting more precisely, providing a unique opportunity to utilize stalled cycles of latency-sensitive applications while meeting different latency SLOs. For more details about MSH, please take a look at our OSDI'24 [paper](https://www.usenix.org/system/files/osdi24-luo.pdf).
3 | 
4 | ## Limitations in the Prototype
5 | Our prototype makes some assumptions about primary and scavenger applications to simplify the implementation:
6 | - The primary application must use the `pthread` library.
7 | - To use a lightweight context switch, our prototype assumes that scavenger applications are given as shared objects and include several symbols and an entry point.
Thus, you need to modify and recompile your scavenger application. The required transformation is straightforward and short.
8 | - We assume an x86-64 platform and use gcc/g++ with the `-fno-omit-frame-pointer` and `-mno-red-zone` flags enabled.
9 | - Tested with gcc/g++ 11.4 on Ubuntu 22.04.1 (kernel version 6.5.0).
10 | 
11 | ## Basic Usage
12 | ### Workflow
13 | The use of MSH involves three pieces of software: a profiler (scripts based on `perf`), binary instrumentation (`llvm-bolt`), and the MSH runtime (`libmsh`). We assume that a user has a primary application and a set of scavenger applications that will run when there are stalled cycles in the primary application. One can use MSH in the following way:
14 | - Modify the scavenger applications.
15 | - Profile the primary/scavenger applications.
16 | - Modify the primary/scavenger binaries through binary instrumentation with the profile results.
17 | - Finally, run the instrumented binaries with the MSH runtime.
18 | 
19 | We'll show you how this workflow works with one simple primary (`ptrchase`) and scavenger (`compute.so`).
20 | 
21 | ### Prerequisite: Modify Scavenger
22 | MSH requires a scavenger to have the following symbols in the file containing the `main` function.
23 | ```
24 | extern "C" {
25 | int crt_pos = 0;
26 | int argc = 0;
27 | char **argv = 0;
28 | }
29 | ```
30 | Then, rename the `main` function to `entry` as shown below.
31 | 
32 | ```
33 | extern "C" int
34 | entry(void) {
35 | ...
36 | }
37 | ```
38 | Lastly, compile the scavenger to a shared object file.
39 | 
40 | ### Prerequisite: BOLT
41 | We implemented all the binary-level instrumentation in BOLT. Here are the patch instructions:
42 | ```
43 | git clone https://github.com/llvm/llvm-project.git
44 | cd llvm-project && git checkout 30c1f31
45 | patch -p1 < <path-to-this-repo>/msh_bolt.diff
46 | ```
47 | Then, compile BOLT by following the instructions on the [BOLT page](https://github.com/llvm/llvm-project/tree/main/bolt). We'll assume that `llvm-bolt` is in `PATH` from now on.
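Putting the two transformations from "Prerequisite: Modify Scavenger" together, a complete minimal scavenger could look like the sketch below (a hypothetical `toy_scav.cpp`; only the `crt_pos`/`argc`/`argv` symbols and the `entry` name are required by the runtime, the loop body is a placeholder for real scavenger work):

```cpp
#include <cstdio>

// Symbols the MSH runtime fills in through dlsym() before the
// scavenger coroutine is first scheduled (see libmsh/msh.c).
extern "C" {
int crt_pos = 0;
int argc = 0;
char **argv = 0;
}

// The former main(); the runtime looks up and jumps to "entry".
extern "C" int
entry(void) {
    long sum = 0;
    for (int i = 0; i < 1000; i++) // stand-in for real scavenger work
        sum += i;
    fprintf(stderr, "toy scavenger done: sum=%ld, crt_pos=%d\n", sum, crt_pos);
    // the coroutine hands crt_pos back to the scheduler on return
    return crt_pos;
}
```

Compile it with the repository's flags, e.g. `g++ -std=c++17 -fno-omit-frame-pointer -mno-red-zone -shared -fPIC -o toy_scav.so toy_scav.cpp`, and list the absolute path of the resulting `.so` in `scav.txt`.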
48 | 
49 | ### Profile and Instrument Primary
50 | ```
51 | # Compile ptrchase
52 | cd apps
53 | mkdir build
54 | make primary
55 | ```
56 | 
57 | ```
58 | # usage: ./do_prof_primary.sh [binary] [args]
59 | # Pointer chase 50MB array
60 | cd ${HOME}
61 | ./do_prof_primary.sh ./apps/build/ptrchase 13107200
62 | ```
63 | 
64 | The `Makefile` in `apps` has rules that use BOLT to perform the binary instrumentation. We assume that `llvm-bolt` is in `PATH`.
65 | ```
66 | cd apps
67 | make build/ptrchase.bolt
68 | ```
69 | 
70 | ### Profile and Instrument Scavenger
71 | ```
72 | # Compile compute.so
73 | cd apps
74 | make scav
75 | ```
76 | 
77 | ```
78 | cd ${HOME}
79 | echo "$(pwd)/apps/build/compute.so" > scav.txt
80 | ./do_prof_scavenger.sh
81 | ```
82 | 
83 | You can instrument the scavenger in a similar way. However, you need to specify the average yield distance (in nanoseconds) to bound how long the scavenger runs between yields.
84 | ```
85 | cd apps
86 | YIELD_DISTANCE=100 make build/compute.so.bolt
87 | ```
88 | 
89 | ### Run with MSH Runtime
90 | We use the `LD_PRELOAD` trick to attach the runtime to the primary without recompilation.
91 | ```
92 | cd libmsh
93 | make all
94 | 
95 | cd ..
96 | export LD_PRELOAD=$(pwd)/libmsh/build/libmsh.so:${LD_PRELOAD}
97 | export LD_LIBRARY_PATH=$(pwd)/libmsh/build:${LD_LIBRARY_PATH}
98 | 
99 | echo "$(pwd)/apps/build/compute.so.bolt" > scav.txt
100 | export MSH_SCAV_POOL_PATH=$(pwd)/scav.txt
101 | export SKIP_FIRST_THREAD=1
102 | 
103 | ./apps/build/ptrchase.bolt 13107200
104 | ```
105 | Note: don't forget to unset `LD_PRELOAD` when you finish testing MSH; otherwise, the MSH runtime will keep intercepting `pthread` functions in every command you run afterwards.
106 | 
107 | ## TODO
108 | - Add the use of a more complex primary (e.g., tailbench) and scavenger (e.g., a graph algorithm).
109 | - Explain the tuning knobs.
110 | 
111 | ## Developed by
112 | Sam Son and Zhihong Luo
113 | 
--------------------------------------------------------------------------------
/apps/Makefile:
--------------------------------------------------------------------------------
1 | CCP=g++
2 | 
3 | PRM_SRC=src/primary
4 | SCAV_SRC=src/scavengers
5 | BUILD=build
6 | CMPC_LIST=../profile/results/cmpc_list.txt
7 | PRED_PROF=../profile/results/pred_prof.txt
8 | LAT_PROF=../profile/results/lat_prof.txt
9 | CPPFLAGS=-Wall -std=c++17 -fno-omit-frame-pointer -mno-red-zone -fpermissive
10 | 
11 | all: pre-build primary scav
12 | 
13 | pre-build:
14 | 	@mkdir -p $(BUILD)
15 | 	@mkdir -p $(BUILD)/asms
16 | 
17 | primary: $(BUILD)/ptrchase $(BUILD)/template
18 | 
19 | scav: $(BUILD)/compute.so
20 | 
21 | $(BUILD)/ptrchase: $(PRM_SRC)/ptrchase.cpp
22 | 	$(CCP) $(CPPFLAGS) -o $@ $< -lpthread
23 | 
24 | $(BUILD)/template: $(PRM_SRC)/template.cpp
25 | 	$(CCP) $(CPPFLAGS) -o $@ $< -lpthread
26 | 
27 | $(BUILD)/%.so: $(SCAV_SRC)/%.cpp
28 | 	$(CCP) $(CPPFLAGS) -shared -fPIC -o $@ $^
29 | 
30 | 
31 | $(BUILD)/%.so.bolt: $(BUILD)/%.so
32 | 	llvm-bolt $< -o $@ \
33 | 		--clh-cmpc-list=$(CMPC_LIST) \
34 | 		--disable-ctx-prefetch \
35 | 		--disable-pseudo-inline \
36 | 		--disable-asympp \
37 | 		--assume-abi \
38 | 		--clh-opt-yield \
39 | 		--test-special-yield \
40 | 		--inst-scav \
41 | 		--lat-prof=$(LAT_PROF) \
42 | 		--pred-prof=$(PRED_PROF) \
43 | 		--bound-yield-distance=$(YIELD_DISTANCE) \
44 | 		--clh-prob-th=200 > out.txt 2> err.txt
45 | 
46 | $(BUILD)/%.bolt: $(BUILD)/%
47 | 	llvm-bolt $< -o $@ \
48 | 		--clh-cmpc-list=$(CMPC_LIST) \
49 | 		--disable-ctx-prefetch \
50 | 		--assume-abi \
51 | 		--clh-opt-yield \
52 | 		--clh-prob-th=75 > out.txt 2> err.txt
53 | 
54 | $(BUILD)/asms/%.asm: $(BUILD)/%
55 | 	objdump -s -S $< > $@
--------------------------------------------------------------------------------
/apps/src/primary/ptrchase.cpp:
--------------------------------------------------------------------------------
1 | /* check if /proc/sys/vm/nr_hugepages > 0 */
2 | 
3 | #ifndef _GNU_SOURCE
4 | #define _GNU_SOURCE
5 | #endif
6 | 
7 | #include <stdio.h>
8 | #include <stdlib.h>
9 | #include <stdint.h>
10 | 
11 | #include <string.h>
12 | #include <time.h>
13 | #include <unistd.h>
14 | #include <sys/mman.h>
15 | 
16 | #include <pthread.h>
17 | 
18 | /**
19 |  * system configurations
20 |  * netsys-c27
21 |  */
22 | #define CORE_BASE 28 // make sure each core is on a different physical core
23 | #define CORE_NUM 1
24 | #define JOBS_PER_CORE 1
25 | #define CACHE_LINE_SZ 64
26 | #define NODE_ALIGN 64
27 | #define FREQ_MHZ 2100
28 | 
29 | #define ARR_LEN 1024 * 64
30 | #define REPS 10
31 | 
32 | #define RDTSC_TO_NS(x) ((x) * 1000 / FREQ_MHZ)
33 | 
34 | #define THREAD_NUM 1
35 | // #define POINTER_CHASE
36 | // #define SHUFFLE
37 | 
38 | #pragma GCC push_options
39 | #pragma GCC optimize("O0")
40 | 
41 | static inline uint64_t
42 | start64_rdtsc() {
43 |     uint64_t t;
44 |     __asm__ volatile("lfence\n\t"
45 |                      "rdtsc\n\t"
46 |                      "shl $32, %%rdx\n\t"
47 |                      "or %%rdx, %0\n\t"
48 |                      "lfence"
49 |                      : "=a"(t)
50 |                      :
51 |                      : "rdx", "memory", "cc");
52 |     return t;
53 | }
54 | 
55 | static inline uint64_t
56 | stop64_rdtsc() {
57 |     uint64_t t;
58 |     __asm__ volatile("rdtscp\n\t"
59 |                      "shl $32, %%rdx\n\t"
60 |                      "or %%rdx, %0\n\t"
61 |                      "lfence"
62 |                      : "=a"(t)
63 |                      :
64 |                      : "rcx", "rdx", "memory", "cc");
65 |     return t;
66 | }
67 | 
68 | static inline uint64_t
69 | start64_ts() {
70 |     return start64_rdtsc();
71 | }
72 | 
73 | static inline uint64_t
74 | stop64_ts() {
75 |     return stop64_rdtsc();
76 | }
77 | #pragma GCC pop_options
78 | 
79 | #define handle_error(msg) \
80 |     do { \
81 |         perror(msg); \
82 |         exit(EXIT_FAILURE); \
83 |     } while (0)
84 | 
85 | struct node {
86 |     struct node *next;
87 |     char pad[NODE_ALIGN - sizeof(struct node *)];
88 | };
89 | 
90 | struct status {
91 |     int cur_tid;
92 |     struct node *last_pos;
93 | 
94 | #ifndef POINTER_CHASE
95 |     uint64_t last_idx;
96 | #endif
97 | };
98 | 
99 | struct thread_arg {
100 |     int tid;
101 |     uint64_t arr_len;
102 |     uint64_t rep_per_task;
103 |     int is_tq;
104 | };
105 | 
106 | struct node *data_arr = NULL;
107 | 
108 | static int seed_randomizer = 0;
109 | void
110 | gen_cycle(int *out, uint64_t out_len) {
111 |     uint64_t idx, dst_idx;
112 |     int tmp;
113 |     srand((unsigned) time(NULL) + seed_randomizer++);
114 | 
115 |     int *perm = (int *) malloc(sizeof(int) * out_len);
116 | 
117 |     for (idx = 0; idx < out_len; idx++)
118 |         perm[idx] = idx;
119 | 
120 |     for (idx = 0; idx < out_len; idx++) {
121 |         dst_idx = rand() % out_len;
122 |         tmp = perm[idx];
123 |         perm[idx] = perm[dst_idx];
124 |         perm[dst_idx] = tmp;
125 |     }
126 | 
127 |     for (idx = 0; idx < out_len; idx++)
128 |         out[perm[idx]] = perm[(idx + 1) % out_len];
129 | }
130 | 
131 | void
132 | init_run(uint64_t arr_len) {
133 |     int *cycle = (int *) malloc(sizeof(int) * arr_len);
134 | 
135 |     if (data_arr == NULL) {
136 |         data_arr = (struct node *) mmap(
137 |             NULL, sizeof(struct node) * arr_len * CORE_NUM * JOBS_PER_CORE,
138 |             PROT_READ | PROT_WRITE,
139 |             MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
140 |         if (data_arr == MAP_FAILED)
141 |             handle_error("mmap");
142 |         for (int job_id = 0; job_id < CORE_NUM * JOBS_PER_CORE; job_id++) {
143 |             int cur = 0;
144 |             gen_cycle(cycle, arr_len);
145 |             do {
146 |                 data_arr[job_id * arr_len + cur].next =
147 |                     &data_arr[job_id * arr_len + cycle[cur]];
148 |                 cur = cycle[cur];
149 |             } while (cur != 0);
150 |         }
151 |     }
152 | 
153 |     free(cycle);
154 | }
155 | 
156 | uint64_t run_time[THREAD_NUM];
157 | void *
158 | ptrchase_tq(void *arg) {
159 |     int tid = ((struct thread_arg *) arg)->tid;
160 |     uint64_t rep_per_task = ((struct thread_arg *) arg)->rep_per_task;
161 |     uint64_t start, end;
162 | 
163 |     struct node *cur = data_arr;
164 |     uint64_t reps = 0;
165 | 
166 |     start = start64_ts();
167 |     while (reps++ < rep_per_task) {
168 |         // here should be the coroutine yield point
169 |         cur = cur->next;
170 |     }
171 |     end = stop64_ts();
172 | 
173 |     run_time[tid] = end -
start;
174 | 
175 |     fprintf(stderr, "[%d] ptrchase finishes\n", tid);
176 |     fprintf(stderr, "[%d] AMAT: %lu ns\n", tid, RDTSC_TO_NS(run_time[tid] / REPS / (rep_per_task / REPS)));
177 | 
178 |     return NULL;
179 | }
180 | 
181 | void
182 | run(uint64_t arr_len) {
183 |     thread_arg targ[THREAD_NUM];
184 | 
185 |     init_run(arr_len);
186 | 
187 |     pthread_t t[THREAD_NUM];
188 |     for (int i = 0; i < THREAD_NUM; i++) {
189 |         targ[i].rep_per_task = arr_len * REPS;
190 |         targ[i].tid = i;
191 |         if (pthread_create(&t[i], NULL, ptrchase_tq, &targ[i]))
192 |             handle_error("pthread_create");
193 |     }
194 | 
195 |     for (int i = 0; i < THREAD_NUM; i++)
196 |         pthread_join(t[i], NULL);
197 | 
198 |     //ptrchase_tq(&targ);
199 | 
200 |     return;
201 | }
202 | 
203 | int main(int argc, char **argv) {
204 |     uint64_t arr_len = 0;
205 |     if (argc > 1)
206 |         arr_len = atoi(argv[1]);
207 |     else
208 |         arr_len = ARR_LEN;
209 | 
210 |     fprintf(stderr, "ptrchase %lu %d\n", arr_len, REPS);
211 | 
212 |     run(arr_len);
213 | 
214 |     return 0;
215 | }
216 | 
--------------------------------------------------------------------------------
/apps/src/primary/template.cpp:
--------------------------------------------------------------------------------
1 | #include <pthread.h>
2 | 
3 | #define THREAD_NUM 1
4 | 
5 | void *do_nothing(void *arg) {
6 |     return NULL;
7 | }
8 | 
9 | int main() {
10 |     pthread_t t[THREAD_NUM];
11 |     for (int i = 0; i < THREAD_NUM; i++)
12 |         pthread_create(&t[i], NULL, do_nothing, NULL);
13 | 
14 |     for (int i = 0; i < THREAD_NUM; i++)
15 |         pthread_join(t[i], NULL);
16 |     return 0;
17 | }
--------------------------------------------------------------------------------
/apps/src/scavengers/compute.cpp:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include <stdlib.h>
3 | #include <stdint.h>
4 | #include <string>
5 | 
6 | static uint64_t total_reps = 100 * 1000 * 1000;
7 | static uint64_t gsum = 0;
8 | 
9 | #define FREQ_MHZ 2100
10 | #define RDTSC_TO_NS(x) ((x) * 1000 / FREQ_MHZ)
11 | 
12 | #pragma GCC
push_options 13 | #pragma GCC optimize("O0") 14 | 15 | static inline uint64_t 16 | start64_rdtsc() { 17 | uint64_t t; 18 | __asm__ volatile("lfence\n\t" 19 | "rdtsc\n\t" 20 | "shl $32, %%rdx\n\t" 21 | "or %%rdx, %0\n\t" 22 | "lfence" 23 | : "=a"(t) 24 | : 25 | : "rdx", "memory", "cc"); 26 | return t; 27 | } 28 | 29 | static inline uint64_t 30 | stop64_rdtsc() { 31 | uint64_t t; 32 | __asm__ volatile("rdtscp\n\t" 33 | "shl $32, %%rdx\n\t" 34 | "or %%rdx, %0\n\t" 35 | "lfence" 36 | : "=a"(t) 37 | : 38 | : "rcx", "rdx", "memory", "cc"); 39 | return t; 40 | } 41 | 42 | static inline uint64_t 43 | start64_ts() { 44 | return start64_rdtsc(); 45 | } 46 | 47 | static inline uint64_t 48 | stop64_ts() { 49 | return stop64_rdtsc(); 50 | } 51 | #pragma GCC pop_options 52 | 53 | uint64_t run_time = 0; 54 | int comp_len = 10; 55 | void 56 | compute() { 57 | uint64_t start, end; 58 | uint64_t rep = 0; 59 | 60 | start = start64_ts(); 61 | while (++rep < total_reps) { 62 | // Here should be coroutine yield point 63 | int x = 0; 64 | for (int i = 0; i < comp_len; i++) { 65 | x += i; 66 | } 67 | 68 | gsum += x; 69 | } 70 | end = stop64_ts(); 71 | run_time = end - start; 72 | } 73 | 74 | #pragma GCC push_options 75 | #pragma GCC optimize("O0") 76 | extern "C" { 77 | int crt_pos = 0; 78 | int argc = 0; 79 | char **argv = 0; 80 | } 81 | 82 | extern "C" int 83 | entry(void) { 84 | if (argc > 1) { 85 | comp_len = atoi(argv[1]); 86 | total_reps = std::stoull(argv[2]); 87 | } 88 | 89 | fprintf(stderr, "compute starts -- crt_pos = %d\n", crt_pos); 90 | 91 | compute(); 92 | 93 | fprintf(stderr, "compute finishes %d\n", crt_pos); 94 | fprintf(stderr, "per task time: %lu ns\n", 95 | RDTSC_TO_NS(run_time) / total_reps); 96 | 97 | // coroutine context is designed to return to 98 | // "ret_to_scheduler" 99 | 100 | // set %rax = crt_pos 101 | return crt_pos; 102 | } 103 | #pragma GCC pop_options 104 | -------------------------------------------------------------------------------- 
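As the comments at the end of `entry` in compute.cpp hint, a finished coroutine hands `crt_pos` back to the scheduler in `%rax`, and `ret_to_scheduler` in `libmsh/msh.c` recovers the slot index by dividing by `sizeof(struct yield_ctx)`; conversely, `set_scav_symbol` stores `crt_pos` pre-multiplied by that size so the instrumented yield path can use it directly as a byte offset off `%gs`. A toy model of that encoding, under the assumption of a hypothetical 64-byte (one cache line) `yield_ctx` layout:

```cpp
#include <cstdio>

// Hypothetical stand-in for the runtime's per-coroutine context;
// only its size matters for the offset arithmetic shown here.
struct yield_ctx {
    long sp, ip;                    // saved stack/instruction pointers
    short normal_next, special_next; // next-coroutine links (byte offsets)
    char pad[64 - 2 * sizeof(long) - 2 * sizeof(short)];
};
static_assert(sizeof(yield_ctx) == 64, "assumed: one cache line per context");

// What set_scav_symbol writes into the scavenger's crt_pos symbol:
// the slot index pre-multiplied by the context size.
int encode_crt_pos(int slot) { return slot * (int) sizeof(yield_ctx); }

// What tl_sched_next_crt does with the value returned in %rax:
// divide to recover the slot index.
int decode_crt_pos(int crt_pos) { return crt_pos / (int) sizeof(yield_ctx); }
```

Storing the offset rather than the index trades one multiply per yield for a single divide on the rare finish path.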
/do_prof_primary.sh: -------------------------------------------------------------------------------- 1 | export LD_PRELOAD="" 2 | 3 | PROF_DIR=profile 4 | SAMPLING_RATE_IN_CNT=102400 5 | BIN=$1 6 | shift 1 7 | ARGS="$*" 8 | echo $ARGS 9 | 10 | ${PROF_DIR}/scripts/capture_cm_inst_primary.sh ${SAMPLING_RATE_IN_CNT} ${BIN} ${ARGS} 11 | ${PROF_DIR}/scripts/compute_ld_prob.sh profile/results/perf.data 0 $SAMPLING_RATE_IN_CNT 12 | -------------------------------------------------------------------------------- /do_prof_scavenger.sh: -------------------------------------------------------------------------------- 1 | export LD_PRELOAD=$(pwd)/libmsh/build/libmsh.so:${LD_PRELOAD} 2 | export LD_LIBRARY_PATH=$(pwd)/libmsh/build:${LD_LIBRARY_PATH} 3 | 4 | PROF_DIR=profile 5 | TEMPLATE_BIN=apps/build/template 6 | SAMPLING_RATE_IN_CNT=102400 7 | 8 | # set scav.txt on your own 9 | 10 | ${PROF_DIR}/scripts/capture_cm_inst_scav.sh ${SAMPLING_RATE_IN_CNT} ${TEMPLATE_BIN} 11 | ${PROF_DIR}/scripts/compute_ld_prob.sh profile/results/perf.data 1 ${SAMPLING_RATE_IN_CNT} 12 | -------------------------------------------------------------------------------- /libmsh/Makefile: -------------------------------------------------------------------------------- 1 | CC=gcc 2 | 3 | BUILD=build 4 | SRC=. 
5 | 
6 | TARGET=$(BUILD)/libmsh.so
7 | 
8 | INCLUDES=-I$(SRC)/
9 | CFLAGS=-Wall $(INCLUDES) -std=c11 -fno-omit-frame-pointer -mno-red-zone
10 | HEADERS=$(wildcard $(SRC)/*.h)
11 | 
12 | all: prebuild $(TARGET)
13 | 
14 | prebuild:
15 | 	@mkdir -p $(BUILD)
16 | 
17 | $(TARGET): $(BUILD)/msh.o $(BUILD)/pthread.o
18 | 	$(CC) $(INCLUDES) $(CFLAGS) -shared -fPIC -o $@ $^ -ldl -mfsgsbase
19 | 
20 | $(BUILD)/msh.o: $(SRC)/msh.c $(HEADERS)
21 | 	$(CC) $(INCLUDES) $(CFLAGS) -shared -fPIC -c $< -o $@ -mfsgsbase
22 | 
23 | $(BUILD)/pthread.o: $(SRC)/pthread.c $(HEADERS)
24 | 	$(CC) $(INCLUDES) $(CFLAGS) -shared -fPIC -c $< -o $@ -mfsgsbase
25 | 
26 | .PHONY: clean
27 | clean:
28 | 	rm -rf $(BUILD)
--------------------------------------------------------------------------------
/libmsh/msh.c:
--------------------------------------------------------------------------------
1 | #ifndef _GNU_SOURCE
2 | #define _GNU_SOURCE
3 | #endif
4 | #include <stdio.h>
5 | #include <stdlib.h>
6 | #include <string.h>
7 | #include <stdint.h>
8 | #include <stdbool.h>
9 | #include <stddef.h>
10 | #include <assert.h>
11 | #include <setjmp.h>
12 | #include <dlfcn.h>
13 | #include <wordexp.h>
14 | 
15 | #include <asm/prctl.h> /* Definition of ARCH_* constants */
16 | #include <sys/syscall.h> /* Definition of SYS_* constants */
17 | #include <unistd.h>
18 | 
19 | 
20 | #include "msh.h"
21 | #include "spinlock.h"
22 | 
23 | #define BOLDBLUE "\033[1m\033[34m"
24 | #define BLUE "\033[34m"
25 | #define RESET "\033[0m"
26 | 
27 | #define PRINT(str, ...) fprintf(stderr, BOLDBLUE "[MSH] " RESET str, ##__VA_ARGS__)
28 | 
29 | #define err_check(err_cond, ...)
\ 30 | if (err_cond) { \ 31 | PRINT("%s:%d: ", __FILE__, __LINE__); \ 32 | PRINT(__VA_ARGS__); \ 33 | exit(1); \ 34 | } 35 | 36 | // 0-th index of lists is reserved for primary thread 37 | #define push_back_to_list(list, size, item, pos) \ 38 | for (int i = 1; i < size; i++) { \ 39 | if (list[i] == -1) { \ 40 | list[i] = item; \ 41 | pos = i; \ 42 | break; \ 43 | } \ 44 | } 45 | 46 | #define pop_back_from_list(list, size, item, pos) \ 47 | for (int i = size - 1; i > 0; i--) { \ 48 | if (list[i] != -1) { \ 49 | item = list[i]; \ 50 | list[i] = -1; \ 51 | pos = i; \ 52 | break; \ 53 | } \ 54 | } 55 | 56 | #define CAS(ptr, old_val, new_val) \ 57 | __sync_bool_compare_and_swap(ptr, old_val, new_val) 58 | 59 | // Static allocation of per Application Tables 60 | static struct coroutine_ctx SCAV_TABLE[MAX_ACTIVE_CRT_NUM_PER_APPLICATION]; 61 | static struct primary_thread_ctx PRIM_TABLE[MAX_THREAD_NUM]; 62 | spinlock scav_table_lock; 63 | spinlock prim_table_lock; 64 | 65 | static __thread char *gsbase = 0; 66 | #define tl_yield_ctx_at(pos) \ 67 | (struct yield_ctx *) (gsbase + sizeof(struct yield_ctx) * pos) 68 | 69 | 70 | #define tl_get_crt_idx_at(pos) (PRIM_TABLE[msh_tid].idx_list[pos]) 71 | #define tl_set_crt_id_at(pos, val) (PRIM_TABLE[msh_tid].idx_list[pos] = val) 72 | 73 | static __thread int cur_crt_pos = 0; 74 | static __thread int msh_tid = -1; 75 | static bool dummy; 76 | static __thread bool *msh_reallocatable = &dummy; 77 | static __thread bool *msh_reallocated = &dummy; 78 | static _Atomic int finished_scav_cnt = 0; 79 | static int max_scav_per_thread = 1; 80 | 81 | // Scavenger pool management 82 | static char scav_pool_path[256]; 83 | static FILE *scav_pool_fp = NULL; 84 | static const int MAX_SCV_CMD_LEN = 256; 85 | 86 | // jmp_buf for scheduler 87 | jmp_buf cleanup_jmpbuf[MAX_THREAD_NUM]; 88 | 89 | static void tl_sched_next_crt(int ret_crt_pos); 90 | static int allocate_scav(struct primary_thread_ctx *ctx); 91 | 92 | #pragma GCC push_options 93 | #pragma 
GCC optimize("O0")
94 | void
95 | ret_to_scheduler(void) {
96 |     int ret_crt_pos;
97 |     __asm__ volatile(
98 |         "testq $15, %%rsp\n\t"
99 |         "jz _no_adjust\n\t"
100 |         "subq $8, %%rsp\n\t" // after the coroutine returns, rsp might not be
101 |                              // 16-byte aligned (ABI requirement)
102 |         "_no_adjust:\n"
103 |         "movl %%eax, %0\n\t"
104 |         : "=rm"(ret_crt_pos)
105 |         :);
106 |     tl_sched_next_crt(ret_crt_pos);
107 | 
108 |     // this program point shouldn't be reached
109 |     assert(false);
110 | }
111 | #pragma GCC pop_options
112 | 
113 | static void
114 | print_prim_ctx(struct primary_thread_ctx *ctx) {
115 |     PRINT("Primary %d: \n", msh_tid);
116 |     PRINT("  idx_list:");
117 |     for (int i = 0; i < MAX_CRT_PER_THREAD + 1; i++) {
118 |         PRINT("%d ", ctx->idx_list[i]);
119 |     }
120 |     PRINT("\n");
121 | 
122 |     PRINT("  ycs-normal:");
123 |     for (int i = 0; i < MAX_CRT_PER_THREAD + 1; i++) {
124 |         PRINT("%d ", ctx->ycs[i].normal_next);
125 |     }
126 | 
127 |     PRINT("  ycs-special:");
128 |     for (int i = 0; i < MAX_CRT_PER_THREAD + 1; i++) {
129 |         PRINT("%d ", ctx->ycs[i].special_next);
130 |     }
131 | 
132 |     PRINT("\n");
133 |     fflush(stderr);
134 | }
135 | 
136 | static void
137 | parse_command(char *string, int *argc, char ***argv) {
138 |     *argc = 0;
139 |     wordexp_t p;
140 |     wordexp(string, &p, 0);
141 |     *argc = p.we_wordc;
142 |     *argv = p.we_wordv;
143 | }
144 | 
145 | static void
146 | init_scav_table_entry(struct coroutine_ctx *ctx, char *cmd) {
147 |     cmd[strcspn(cmd, "\n")] = 0;
148 |     parse_command(cmd, &ctx->argc, &ctx->argv);
149 | 
150 |     void *lib_handle = dlopen(ctx->argv[0], RTLD_NOW | RTLD_DEEPBIND);
151 | 
152 |     err_check(!lib_handle, "Failed to dlopen file %s %s\n", cmd, dlerror());
153 | 
154 |     memset(ctx->stack, 0, CRT_STACK_SZ);
155 |     ctx->lib_handle = lib_handle;
156 |     ctx->entry = (void (*)(void)) dlsym(lib_handle, "entry");
157 |     ctx->ret_to_sched = (void *) ret_to_scheduler;
158 |     ctx->yc = NULL;
159 |     ctx->finished = false;
160 |     err_check(dlerror(), "Failed to dlsym entry in file\n");
161 | 
162 |     ctx->init = (void (*)(void)) dlsym(lib_handle, "init");
163 |     ctx->inited = false;
164 |     dlerror(); // reset error. init may not exist, but that's fine
165 |     return;
166 | }
167 | 
168 | static void
169 | init_prim_table() {
170 |     for (int i = 0; i < MAX_THREAD_NUM; i++) {
171 |         // Cache-line alignment check
172 |         assert((uint64_t) &PRIM_TABLE[i].ycs % 64 == 0);
173 |         for (int j = 0; j < MAX_CRT_PER_THREAD + 1; j++)
174 |             PRIM_TABLE[i].ycs[j].normal_next = -1;
175 |         for (int j = 0; j < MAX_CRT_PER_THREAD + 1; j++)
176 |             PRIM_TABLE[i].idx_list[j] = -1;
177 |         PRIM_TABLE[i].reallocatable = false;
178 |         PRIM_TABLE[i].reallocated = false;
179 |         PRIM_TABLE[i].scav_num = 0;
180 |     }
181 | }
182 | 
183 | static void
184 | init_scav_table(char *local_pool_path) {
185 |     spin_lock(&scav_table_lock);
186 |     scav_pool_fp = fopen(local_pool_path, "r");
187 |     err_check(!scav_pool_fp, "Failed to open file\n");
188 | 
189 |     for (int i = 0; i < MAX_ACTIVE_CRT_NUM_PER_APPLICATION; i++) {
190 |         SCAV_TABLE[i].lib_handle = NULL;
191 |     }
192 | 
193 |     char line_buf[MAX_SCV_CMD_LEN];
194 |     int scav_counts = 0;
195 |     while (fgets(line_buf, MAX_SCV_CMD_LEN, scav_pool_fp)) {
196 |         if (line_buf[0] == '#') { // # is a comment
197 |             continue;
198 |         }
199 | 
200 |         err_check(scav_counts >= MAX_ACTIVE_CRT_NUM_PER_APPLICATION, // >= keeps the write below in bounds
201 |                   "Too many scavengers in the pool\n");
202 | 
203 |         init_scav_table_entry(&SCAV_TABLE[scav_counts], line_buf);
204 |         scav_counts++;
205 |     }
206 | 
207 |     PRINT("Scavenger pool has %d entries\n", scav_counts);
208 |     spin_unlock(&scav_table_lock);
209 | }
210 | 
211 | static void
212 | set_scav_symbol(int scav_idx, int crt_pos) {
213 |     struct coroutine_ctx *ctx = &SCAV_TABLE[scav_idx];
214 | 
215 |     void *sym = dlsym(ctx->lib_handle, "argc");
216 |     err_check(dlerror(), "Failed to dlsym argc in file\n");
217 |     int *entry_argc = (int *) sym;
218 |     *entry_argc = ctx->argc;
219 | 
220 |     sym = dlsym(ctx->lib_handle, "argv");
221 |     err_check(dlerror(), "Failed to dlsym argv in file\n");
222 | char ***entry_argv = (char ***) sym; 223 | *entry_argv = ctx->argv; 224 | 225 | sym = dlsym(ctx->lib_handle, "crt_pos"); 226 | err_check(dlerror(), "Failed to dlsym crt_pos in file\n"); 227 | int *entry_crt_pos = (int *) sym; 228 | *entry_crt_pos = crt_pos * sizeof(struct yield_ctx); 229 | 230 | sym = dlsym(ctx->lib_handle, "loop_counter_loc"); 231 | if (dlerror()) { 232 | PRINT("This scavenger doesn't have loop counter\n"); 233 | } else { 234 | long *entry_loop_counter_loc = (long *) sym; 235 | *entry_loop_counter_loc = offsetof(struct primary_thread_ctx,loop_counter) + (NUM_OF_LINES_FOR_LOOP_COUNTER * 64 * crt_pos); 236 | memset(PRIM_TABLE[msh_tid].loop_counter[crt_pos], 0, NUM_OF_LINES_FOR_LOOP_COUNTER * 64); 237 | } 238 | } 239 | 240 | // swap coroutines at a_pos and b_pos in the index list 241 | static void 242 | tl_swap_crts(int a_pos, int b_pos) { 243 | spin_lock(&scav_table_lock); 244 | 245 | int a_idx = tl_get_crt_idx_at(a_pos); 246 | int b_idx = tl_get_crt_idx_at(b_pos); 247 | tl_set_crt_id_at(a_pos, b_idx); 248 | tl_set_crt_id_at(b_pos, a_idx); 249 | 250 | // swap yc 251 | struct yield_ctx *a_yc = tl_yield_ctx_at(a_pos); 252 | struct yield_ctx *b_yc = tl_yield_ctx_at(b_pos); 253 | struct yield_ctx tmp; 254 | memcpy(&tmp, a_yc, sizeof(struct yield_ctx)); 255 | memcpy(a_yc, b_yc, sizeof(struct yield_ctx)); 256 | memcpy(b_yc, &tmp, sizeof(struct yield_ctx)); 257 | 258 | SCAV_TABLE[a_idx].yc = b_yc; 259 | SCAV_TABLE[b_idx].yc = a_yc; 260 | 261 | // update next map 262 | for (int pos = 0; pos < MAX_CRT_PER_THREAD + 1; pos++) { 263 | struct yield_ctx *yc = tl_yield_ctx_at(pos); 264 | if (yc->normal_next == a_pos) { 265 | yc->normal_next = b_pos; 266 | } else if (yc->normal_next == b_pos) { 267 | yc->normal_next = a_pos; 268 | } 269 | } 270 | 271 | // swap register set 272 | struct x86_registers *a_regs = PRIM_TABLE[msh_tid].regs[a_pos]; 273 | struct x86_registers *b_regs = PRIM_TABLE[msh_tid].regs[b_pos]; 274 | struct x86_registers 
tmp_regs[MAX_REG_SET_NUM]; 275 | memcpy(tmp_regs, a_regs, sizeof(struct x86_registers) * MAX_REG_SET_NUM); 276 | memcpy(a_regs, b_regs, sizeof(struct x86_registers) * MAX_REG_SET_NUM); 277 | memcpy(b_regs, tmp_regs, sizeof(struct x86_registers) * MAX_REG_SET_NUM); 278 | 279 | // swap binary symbol 280 | // handle only if the symbol is defined at pos 281 | if (a_idx != -1) { 282 | void *a_lib_handle = SCAV_TABLE[a_idx].lib_handle; 283 | assert(a_lib_handle); 284 | void *a_sym = dlsym(a_lib_handle, "crt_pos"); 285 | int *a_entry_crt_pos = (int *) a_sym; 286 | *a_entry_crt_pos = b_pos; 287 | } 288 | 289 | if (b_idx != -1) { 290 | void *b_lib_handle = SCAV_TABLE[b_idx].lib_handle; 291 | assert(b_lib_handle); 292 | void *b_sym = dlsym(b_lib_handle, "crt_pos"); 293 | int *b_entry_crt_pos = (int *) b_sym; 294 | *b_entry_crt_pos = a_pos; 295 | } 296 | 297 | spin_unlock(&scav_table_lock); 298 | 299 | } 300 | 301 | static void 302 | tl_finish_crt(int crt_pos) { 303 | struct yield_ctx *yc = tl_yield_ctx_at(crt_pos); 304 | int crt_idx = tl_get_crt_idx_at(crt_pos); 305 | yc->normal_next = -1; 306 | 307 | if (crt_idx == -1) { 308 | // primary thread is finished 309 | return; 310 | } 311 | 312 | spin_lock(&scav_table_lock); 313 | tl_set_crt_id_at(crt_pos, -1); 314 | SCAV_TABLE[crt_idx].yc = NULL; 315 | SCAV_TABLE[crt_idx].finished = true; 316 | dlclose(SCAV_TABLE[crt_idx].lib_handle); 317 | spin_unlock(&scav_table_lock); 318 | } 319 | 320 | static void 321 | set_special_next(struct primary_thread_ctx *ctx) { 322 | // this function is called every time new scavenger is allocated 323 | // it keeps the link made through special_next to be a ring 324 | 325 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 326 | struct yield_ctx *yc = tl_yield_ctx_at(pos); 327 | if (pos == MAX_CRT_PER_THREAD) 328 | yc->special_next = 0; 329 | 330 | int idx_next = tl_get_crt_idx_at(pos + 1); 331 | if (idx_next == -1) { 332 | yc->special_next = 0; 333 | } else { 334 | yc->special_next = (pos + 
1) * sizeof(struct yield_ctx); 335 | } 336 | } 337 | } 338 | 339 | static int 340 | tl_next_available_crt() { 341 | struct yield_ctx *ycs = PRIM_TABLE[msh_tid].ycs; 342 | // find if there is any available scavenger coroutine 343 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 344 | if (ycs[pos].normal_next != -1) 345 | return pos; 346 | } 347 | 348 | // if none, set to primary 349 | return 0; 350 | } 351 | 352 | // TODO: currently we only update normal_next of primary 353 | // we will have to handle special_next of scavenger coroutines later 354 | static void 355 | tl_update_next_map() { 356 | struct yield_ctx *ycs = PRIM_TABLE[msh_tid].ycs; 357 | int crt_next_to_prim = ycs[0].normal_next; 358 | if (crt_next_to_prim == -1) { 359 | // primary thread is finished 360 | return; 361 | } 362 | 363 | // next coroutine is finished 364 | ycs[0].normal_next = sizeof(struct yield_ctx) * tl_next_available_crt(); 365 | } 366 | 367 | static int 368 | tl_find_unfinished_crt() { 369 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 370 | int crt_idx = tl_get_crt_idx_at(pos); 371 | if (crt_idx != -1) { 372 | return pos; 373 | } 374 | } 375 | return 0; 376 | } 377 | 378 | static void 379 | tl_pack_crts() { 380 | // if primary is finished, skip 381 | struct yield_ctx *primary_yc = tl_yield_ctx_at(0); 382 | if (primary_yc->normal_next == -1) { 383 | return; 384 | } 385 | 386 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 387 | if (tl_get_crt_idx_at(pos) == -1) { 388 | for (int next_pos = pos + 1; next_pos < MAX_CRT_PER_THREAD + 1; 389 | next_pos++) { 390 | if (tl_get_crt_idx_at(next_pos) != -1) { 391 | tl_swap_crts(pos, next_pos); 392 | break; 393 | } 394 | } 395 | } 396 | } 397 | 398 | set_special_next(&PRIM_TABLE[msh_tid]); 399 | } 400 | 401 | static void 402 | tl_sched_next_crt(int ret_crt_pos) { 403 | ret_crt_pos = ret_crt_pos / sizeof(struct yield_ctx); // change to idx (from pointer) 404 | 405 | PRINT( 406 | BOLDBLUE "[%d] Coroutine at pos %d is 
finished -- " 407 | "Scavenger index = %d\n" RESET, 408 | msh_tid, ret_crt_pos, tl_get_crt_idx_at(ret_crt_pos)); 409 | if (ret_crt_pos != 0) 410 | PRINT("FinishedCoroutines: %d \n", ++finished_scav_cnt); 411 | struct yield_ctx *ret_yield_ctx = tl_yield_ctx_at(ret_crt_pos); 412 | struct yield_ctx *primary_yield_ctx = tl_yield_ctx_at(0); 413 | 414 | uint64_t sp; 415 | uint64_t ip; 416 | short next = ret_yield_ctx->normal_next / sizeof(struct yield_ctx); 417 | // print_prim_ctx(&PRIM_TABLE[msh_tid]); 418 | // PRINT("[%d] next of retcrt = %d\n", msh_tid, next); 419 | tl_finish_crt(ret_crt_pos); 420 | tl_update_next_map(); 421 | tl_pack_crts(); 422 | 423 | // print_prim_ctx(&PRIM_TABLE[msh_tid]); 424 | if (primary_yield_ctx->normal_next != -1) { // primary is not finished 425 | allocate_scav(&PRIM_TABLE[msh_tid]); // get new one if needed 426 | tl_update_next_map(); 427 | } 428 | 429 | // prefetch yield context 430 | __asm__ volatile("prefetcht0 %%gs:0\n\t" : :); 431 | 432 | if (primary_yield_ctx->normal_next == -1) { 433 | // primary is finished 434 | // we are running cleanup routine 435 | cur_crt_pos = tl_find_unfinished_crt(); 436 | if (cur_crt_pos == 0) { 437 | longjmp(cleanup_jmpbuf[msh_tid], 0); 438 | } 439 | struct yield_ctx *yc = tl_yield_ctx_at(cur_crt_pos); 440 | sp = yc->sp; 441 | ip = yc->ip; 442 | } else { 443 | cur_crt_pos = next; 444 | struct yield_ctx *yc = tl_yield_ctx_at(cur_crt_pos); 445 | sp = yc->sp; 446 | ip = yc->ip; 447 | } 448 | 449 | PRINT("[%d] Coroutine %d is scheduled %p %p\n", msh_tid, 450 | cur_crt_pos, (void *) sp, (void *) ip); 451 | __asm__ volatile("movq %1, %%rsp\n\t" 452 | "jmp *%0\n\t" 453 | : 454 | : "m"(ip), "mr"(sp)); 455 | // should never reach here 456 | assert(false); 457 | } 458 | 459 | // The design of this function is undecided. 460 | // This static cap version is tentative. 
461 | static int 462 | need_more_scav(struct primary_thread_ctx *ctx) { 463 | int counts = 0; 464 | for (int i = 1; i < MAX_CRT_PER_THREAD + 1; i++) { 465 | if (ctx->idx_list[i] != -1) { 466 | counts++; 467 | } 468 | } 469 | 470 | if (counts < max_scav_per_thread) { 471 | return max_scav_per_thread - counts; 472 | } else { 473 | return 0; 474 | } 475 | } 476 | 477 | static int 478 | realloc_scav(struct primary_thread_ctx *from, struct primary_thread_ctx *to, 479 | int demands) { 480 | int ret_counts = 0; 481 | while (ret_counts < demands) { 482 | int crt_idx_to_realloc, from_pos = -1, to_pos = -1; 483 | pop_back_from_list(from->idx_list, MAX_CRT_PER_THREAD + 1, 484 | crt_idx_to_realloc, 485 | from_pos); // pos = idx in idx_list 486 | if (from_pos == -1) 487 | break; // no more scavengers in *from 488 | 489 | push_back_to_list(to->idx_list, MAX_CRT_PER_THREAD + 1, 490 | crt_idx_to_realloc, to_pos); 491 | if (to_pos == -1) { 492 | // no more space, rollback and return 493 | push_back_to_list(from->idx_list, MAX_CRT_PER_THREAD + 1, 494 | crt_idx_to_realloc, from_pos); 495 | return ret_counts; 496 | } 497 | 498 | // no rollback from this point 499 | // yc update 500 | struct yield_ctx *from_yc = from->ycs + from_pos; 501 | struct yield_ctx *to_yc = to->ycs + to_pos; 502 | to_yc->sp = from_yc->sp; 503 | to_yc->ip = from_yc->ip; 504 | to_yc->normal_next = 0; 505 | 506 | from_yc->normal_next = -1; 507 | tl_update_next_map(); 508 | to->ycs[0].normal_next = sizeof(struct yield_ctx); 509 | 510 | // TODO: handle special next 511 | 512 | // regset update 513 | struct x86_registers *from_regs = from->regs[from_pos]; 514 | struct x86_registers *to_regs = to->regs[to_pos]; 515 | memcpy(to_regs, from_regs, 516 | sizeof(struct x86_registers) * MAX_REG_SET_NUM); 517 | 518 | // binary symbol update 519 | void *lib_handle = SCAV_TABLE[crt_idx_to_realloc].lib_handle; 520 | void *sym = dlsym(lib_handle, "crt_pos"); 521 | int *entry_crt_pos = (int *) sym; 522 | *entry_crt_pos = to_pos; 
523 | 524 | // finalize 525 | ret_counts++; 526 | SCAV_TABLE[crt_idx_to_realloc].yc = to_yc; 527 | __asm__ __volatile__( 528 | "" :: 529 | : "memory"); // prevent compiler from reordering 530 | from->reallocated = true; 531 | 532 | int from_tid = ((uint64_t) from - (uint64_t) PRIM_TABLE) / 533 | sizeof(struct primary_thread_ctx); 534 | int to_tid = ((uint64_t) to - (uint64_t) PRIM_TABLE) / 535 | sizeof(struct primary_thread_ctx); 536 | PRINT("Scav %d is reallocated [%d] --> [%d] %d %d %d %d\n", 537 | crt_idx_to_realloc, from_tid, to_tid, from->idx_list[1], 538 | from->idx_list[2], from->idx_list[3], from->idx_list[4]); 539 | } 540 | 541 | return ret_counts; 542 | } 543 | 544 | static int 545 | _allocate_scav(struct primary_thread_ctx *ctx, int demands) { 546 | int ret_counts = 0; 547 | 548 | // Check if there is a reallocatable scavenger 549 | // Note for concurrency: this is the only place where primary threads may 550 | // access the context of other primary threads. The access must be protected 551 | // by the "reallocatable" flag.
552 | for (int tid = 0; tid < MAX_THREAD_NUM; tid++) { 553 | struct primary_thread_ctx *t = &PRIM_TABLE[tid]; 554 | if (CAS(&t->reallocatable, true, false)) { 555 | // reallocation for t begins -- t will busy-wait until it's done 556 | int cnt = realloc_scav(t /*from*/, ctx /*to*/, demands); 557 | __asm__ __volatile__( 558 | "" :: 559 | : "memory"); // prevent compiler from reordering 560 | t->reallocatable = true; // unlock 561 | ret_counts += cnt; 562 | if (ret_counts >= demands) 563 | return ret_counts; 564 | } 565 | } 566 | 567 | // check scav pool 568 | for (int idx = 0; idx < MAX_ACTIVE_CRT_NUM_PER_APPLICATION; idx++) { 569 | if (!SCAV_TABLE[idx].lib_handle) // empty entry 570 | continue; 571 | 572 | if (SCAV_TABLE[idx].yc || 573 | SCAV_TABLE[idx].finished) // already allocated 574 | continue; 575 | 576 | // allocate idx-th scavenger to the primary 577 | int pos = -1; 578 | push_back_to_list(ctx->idx_list, MAX_CRT_PER_THREAD + 1, idx, pos); 579 | if (pos == -1) 580 | break; // no more space 581 | 582 | ctx->ycs[pos].sp = (uint64_t) &SCAV_TABLE[idx].ret_to_sched; 583 | ctx->ycs[pos].ip = (uint64_t) SCAV_TABLE[idx].entry; 584 | ctx->ycs[pos].normal_next = 0; 585 | 586 | set_scav_symbol(idx, pos); 587 | 588 | SCAV_TABLE[idx].yc = &ctx->ycs[pos]; 589 | 590 | ret_counts++; 591 | if (ret_counts >= demands) 592 | return ret_counts; 593 | } 594 | 595 | return ret_counts; 596 | } 597 | 598 | static void 599 | call_init_for_new_scav(struct primary_thread_ctx *ctx) { 600 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 601 | int crt_idx = tl_get_crt_idx_at(pos); 602 | 603 | if (crt_idx == -1) 604 | continue; 605 | 606 | struct coroutine_ctx *crt_ctx = &SCAV_TABLE[crt_idx]; 607 | if (crt_ctx->inited) 608 | continue; 609 | 610 | // a scavenger is assigned uniquely to this primary at this moment, so 611 | // no need to think about concurrency 612 | if (crt_ctx->init) { 613 | struct yield_ctx *yc = tl_yield_ctx_at(pos); 614 | 615 | // yc of primary is not set at 
this moment, so make the coroutine jump to itself during init 616 | uint64_t sp_tmp = yc->sp; 617 | uint64_t ip_tmp = yc->ip; 618 | short normal_tmp = yc->normal_next; 619 | short special_tmp = yc->special_next; 620 | yc->normal_next = pos * sizeof(struct yield_ctx); // set to itself temporarily 621 | yc->special_next = pos * sizeof(struct yield_ctx); // set to itself temporarily 622 | 623 | crt_ctx->init(); 624 | crt_ctx->inited = true; 625 | 626 | yc->sp = sp_tmp; 627 | yc->ip = ip_tmp; 628 | yc->normal_next = normal_tmp; 629 | yc->special_next = special_tmp; 630 | } 631 | } 632 | } 633 | 634 | static int 635 | allocate_scav(struct primary_thread_ctx *ctx) { 636 | int counts = need_more_scav(ctx); 637 | if (counts == 0) 638 | return 0; 639 | 640 | spin_lock(&scav_table_lock); 641 | int ret = _allocate_scav(ctx, counts); 642 | spin_unlock(&scav_table_lock); 643 | 644 | // scav init -- just call it!! 645 | call_init_for_new_scav(ctx); 646 | set_special_next(ctx); 647 | return ret; 648 | } 649 | 650 | static void 651 | init_primary_ctx(struct primary_thread_ctx *ctx, int counts) { 652 | // sp/ip will be initialized at the first yield 653 | ctx->ycs[0].normal_next = counts > 0 ? 
sizeof(struct yield_ctx) : 0; 654 | // special_next shouldn't be used in the primary 655 | ctx->scav_num = counts; 656 | ctx->reallocatable = 0; 657 | ctx->reallocated = 0; 658 | msh_reallocatable = &ctx->reallocatable; 659 | msh_reallocated = &ctx->reallocated; 660 | } 661 | 662 | int 663 | msh_init(int max_scav) { 664 | 665 | int fd = open("/dev/cpu_dma_latency", O_WRONLY); 666 | if (fd < 0) { 667 | perror("open /dev/cpu_dma_latency"); 668 | // keep going without setting c-state 669 | // return 1; 670 | } 671 | int l = 0; // note: fd intentionally stays open; closing it would undo the latency setting 672 | if (fd >= 0 && write(fd, &l, sizeof(l)) != sizeof(l)) { 673 | perror("write to /dev/cpu_dma_latency"); 674 | // return 1; 675 | } 676 | 677 | PRINT("\n\n MSH Init\n"); 678 | 679 | // print offset information 680 | PRINT(" offset to ycs: %lu\n", 681 | offsetof(struct primary_thread_ctx, ycs)); 682 | PRINT(" offset to regs: %lu\n", 683 | offsetof(struct primary_thread_ctx, regs)); 684 | PRINT(" offset to idx_list: %lu\n", 685 | offsetof(struct primary_thread_ctx, idx_list)); 686 | PRINT(" offset to regset: %lu\n", 687 | offsetof(struct primary_thread_ctx, regs[0])); 688 | 689 | PRINT(" size of x86_registers: %lu\n", 690 | sizeof(struct x86_registers)); 691 | PRINT(" size of regset: %lu\n", 692 | sizeof(struct x86_registers) * MAX_REG_SET_NUM); 693 | PRINT(" size of yield context: %lu\n", 694 | sizeof(struct yield_ctx)); 695 | PRINT(" offset of sp: %lu\n", 696 | offsetof(struct yield_ctx, sp)); 697 | PRINT(" offset of ip: %lu\n", 698 | offsetof(struct yield_ctx, ip)); 699 | PRINT(" offset of next: %lu\n", 700 | offsetof(struct yield_ctx, normal_next)); 701 | 702 | PRINT("Warning: this number must match the value in binary " 703 | "instrumentation\n"); 704 | max_scav_per_thread = max_scav; 705 | 706 | init_prim_table(); 707 | 708 | char *path = getenv("MSH_SCAV_POOL_PATH"); 709 | if (path == NULL) { 710 | fprintf(stderr, 711 | "Scavenger pool path should be given in MSH_SCAV_POOL_PATH\n"); 712 | return -1; 713 | } 714 | 715 | strcpy(scav_pool_path, path); 
716 | init_scav_table(scav_pool_path); 717 | 718 | PRINT("MSH is successfully initialized\n"); 719 | return 0; 720 | } 721 | 722 | // identify unallocated scavengers and allocate them 723 | int 724 | msh_alloc_ctx(int tid) { 725 | // set per-thread variables 726 | msh_tid = tid; // PRIM_TABLE[tid] is yours from now on 727 | cur_crt_pos = 0; // current coroutine = primary 728 | 729 | if (tid >= MAX_THREAD_NUM) { 730 | PRINT("libmsh: thread id %d exceeds the limit %d\n", tid, MAX_THREAD_NUM); 731 | return -1; 732 | } 733 | 734 | // use ctx, not tid if possible 735 | struct primary_thread_ctx *ctx = &PRIM_TABLE[tid]; 736 | 737 | syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long *) ctx); 738 | gsbase = (char *) ctx; 739 | //_writegsbase_u64( 740 | // (uint64_t) ctx); // set gs. this MUST be done before allocate_scav 741 | // because allocate_scav depends on gs 742 | 743 | int cnt = allocate_scav(ctx); 744 | 745 | init_primary_ctx(ctx, cnt); 746 | 747 | PRINT("Primary thread %d has %d scavengers\n", tid, cnt); 748 | //print_prim_ctx(ctx); 749 | 750 | return 0; 751 | } 752 | 753 | int 754 | msh_cleanup() { 755 | if (getenv("SKIP_CLEANUP")) { 756 | fflush(stderr); 757 | fflush(stdout); 758 | return 0; 759 | } 760 | 761 | PRINT("Cleanup thread %d\n", msh_tid); 762 | // setup cleanup 763 | // primary->next = 0 will be done in tl_sched_next_crt 764 | for (int pos = 1; pos < MAX_CRT_PER_THREAD + 1; pos++) { 765 | struct yield_ctx *yc = tl_yield_ctx_at(pos); 766 | if (yc->normal_next != -1) { 767 | // set it to itself so that it runs to completion 768 | yc->normal_next = pos * sizeof(struct yield_ctx); 769 | yc->special_next = pos * sizeof(struct yield_ctx); 770 | } 771 | } 772 | 773 | // finish remaining scavengers using scheduler code 774 | setjmp(cleanup_jmpbuf[msh_tid]); 775 | int unfinished_crt_pos = tl_find_unfinished_crt(); 776 | if (unfinished_crt_pos == 0) { 777 | fprintf( 778 | stderr, 779 | BOLDBLUE 780 | "[%d] ***Primary thread [%d] is finished***\n" RESET, 781 | msh_tid, 
msh_tid); 782 | struct yield_ctx *yc = 783 | tl_yield_ctx_at(0); // reset yc[0].next = 0 to avoid error 784 | yc->normal_next = 0; // do this just in case there is instrumented 785 | // code after cleanup 786 | return 0; 787 | } 788 | 789 | tl_sched_next_crt(0); 790 | assert(false); // shouldn't reach here 791 | } 792 | 793 | void 794 | msh_enter_blockable_call() { 795 | *msh_reallocatable = true; 796 | } 797 | 798 | void 799 | msh_exit_blockable_call() { 800 | while (!CAS(msh_reallocatable, true, false)) { 801 | // someone's trying to reallocate scavs 802 | // busy wait 803 | } 804 | 805 | __asm__ __volatile__("" ::: "memory"); // prevent compiler from reordering 806 | if (*msh_reallocated) { 807 | struct primary_thread_ctx *t = &PRIM_TABLE[msh_tid]; 808 | // slow-path 809 | PRINT("[%d] my scavs are reallocated...\n", msh_tid); 810 | allocate_scav(t); 811 | tl_update_next_map(); 812 | t->reallocated = false; 813 | } 814 | } 815 | -------------------------------------------------------------------------------- /libmsh/msh.h: -------------------------------------------------------------------------------- 1 | #ifndef __MSH_H__ 2 | #define __MSH_H__ 3 | #include 4 | #include 5 | #include 6 | 7 | #define CRT_STACK_SZ (1 << 20) // 1MB 8 | 9 | // These numbers are over-provisioned to avoid out-of-resource issues. 
10 | #define MAX_THREAD_NUM 128 11 | #define MAX_ACTIVE_CRT_NUM_PER_APPLICATION 512 12 | #define MAX_CRT_PER_THREAD 128 13 | #define MAX_REG_SET_NUM 15 // You may need more register sets depending on the number of 14 | #define NUM_OF_LINES_FOR_LOOP_COUNTER 4 15 | 16 | // x86 GPR + floating-point registers 17 | struct x86_registers { 18 | void *rax; 19 | void *rbx; 20 | void *rcx; 21 | void *rdx; 22 | void *rsi; 23 | void *rdi; 24 | void *rbp; 25 | void *rsp; 26 | void *r8; 27 | void *r9; 28 | void *r10; 29 | void *r11; 30 | void *r12; 31 | void *r13; 32 | void *r14; 33 | void *r15; 34 | void *XMM0[2]; 35 | void *XMM1[2]; 36 | void *XMM2[2]; 37 | void *XMM3[2]; 38 | }; 39 | 40 | // performance critical contexts 41 | // the field order must be unchanged -- the binary instrumenter relies on the field offsets 42 | struct __attribute__((aligned(4), packed)) yield_ctx { 43 | // We hope the first instance of this struct starts at a cache line boundary 44 | // If so, we can pack three yield_ctx per cache line 45 | uint64_t sp; // last stack pointer 46 | uint64_t ip; // last instruction pointer 47 | short special_next; 48 | short normal_next; // normal_next = -1 means the coroutine is finished 49 | }; 50 | 51 | // Coroutine Context represents a scavenger thread 52 | // the field order must be respected 53 | struct __attribute__((aligned(16), packed)) coroutine_ctx { 54 | void *lib_handle; // 0 55 | void (*entry)(void); // 8 56 | char pad[8]; // 16, padded for stack alignment 57 | char stack[CRT_STACK_SZ]; // 24 58 | void *ret_to_sched; // this must be at the top of the stack 59 | char pad2[32]; 60 | // entry arguments 61 | int argc; 62 | char **argv; 63 | struct yield_ctx *yc; // pointer to current yc 64 | bool finished; 65 | void (*init)(void); // scavenger init function 66 | bool inited; 67 | }; 68 | 69 | // gs points to primary_thread_ctx of the current thread 70 | struct __attribute__((aligned(64))) primary_thread_ctx { 71 | struct yield_ctx ycs[MAX_CRT_PER_THREAD + 1]; // 
should be contiguous, 0 72 | // is reserved for primary 73 | char loop_counter[MAX_CRT_PER_THREAD + 1][NUM_OF_LINES_FOR_LOOP_COUNTER * 64]; 74 | struct x86_registers 75 | regs[MAX_CRT_PER_THREAD + 1] 76 | [MAX_REG_SET_NUM]; // manipulated by instrumented code 77 | // space for register saving for 78 | // optimizations that are unable to use the stack 79 | // - FUR 80 | // - Loop Optimization 81 | int idx_list[MAX_CRT_PER_THREAD + 1]; // list of scavenger thread indices that 82 | // this primary hosts 83 | int scav_num; 84 | bool reallocatable; // whether the scavengers allocated to this primary 85 | // can be reallocated to other primaries 86 | bool reallocated; // whether the scavengers allocated to this primary 87 | // have been reallocated to other primaries 88 | }; 89 | 90 | /** 91 | * Interfaces 92 | * Pthread wrappers use these calls. 93 | * WARNING: all functions work in the context of the current thread 94 | * You should assume that every interface includes the thread context 95 | * as the first argument, and every function except msh_init() is reentrant. 
96 | */ 97 | int msh_init(int max_scav); 98 | int msh_alloc_ctx(int tid); 99 | int msh_cleanup(); 100 | 101 | void msh_enter_blockable_call(); 102 | void msh_exit_blockable_call(); 103 | 104 | #endif /* __MSH_H__ */ 105 | -------------------------------------------------------------------------------- /libmsh/pthread.c: -------------------------------------------------------------------------------- 1 | /** 2 | * This library will be preloaded into the primary's address space 3 | * Its main purposes are 4 | * 1) overriding some pthread functions 5 | * 2) declaring some global symbols 6 | * 7 | * This code assumes C11 standard 8 | */ 9 | #define _GNU_SOURCE 10 | 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | 17 | #include "msh.h" 18 | 19 | static __thread int msh_thread_id; 20 | static _Atomic int thread_cnt = 0; 21 | 22 | struct msh_start_routine_arg { 23 | void *(*start_routine)(void *); 24 | void *arg; 25 | int thread_id; 26 | }; 27 | 28 | 29 | // this routine is invoked in the context of the new thread 30 | void * 31 | msh_start_routine(void *arg) { 32 | msh_thread_id = ((struct msh_start_routine_arg *) arg)->thread_id; 33 | 34 | struct timespec start, end; 35 | clock_gettime(CLOCK_MONOTONIC, &start); 36 | int err = msh_alloc_ctx(msh_thread_id); 37 | clock_gettime(CLOCK_MONOTONIC, &end); 38 | fprintf(stdout, "msh_alloc_time= %ld ms\n", (end.tv_sec - start.tv_sec) * 1000 + (end.tv_nsec - start.tv_nsec) / 1000000); 39 | 40 | if (err) { 41 | fprintf(stderr, "msh_alloc_ctx() failed\n"); 42 | return NULL; 43 | } 44 | 45 | void *ret = 46 | ((struct msh_start_routine_arg *) arg) 47 | ->start_routine(((struct msh_start_routine_arg *) arg)->arg); 48 | 49 | err = msh_cleanup(); // msh_cleanup() takes no arguments; it acts on the calling thread 50 | if (err) { 51 | fprintf(stderr, "msh_cleanup() failed\n"); 52 | return NULL; 53 | } 54 | 55 | return ret; 56 | } 57 | 58 | // overridden pthread_create 59 | int (*original_pthread_create)(pthread_t *restrict, 60 | const pthread_attr_t *restrict, 61 | void *(*) 
(void *), void *restrict) = NULL; 62 | 63 | int 64 | pthread_create(pthread_t *restrict thread, const pthread_attr_t *restrict attr, 65 | void *(*start_routine)(void *), void *restrict arg) { 66 | if (thread_cnt == 0) { 67 | char *max_scav_str = getenv("MAX_SCAV_PER_THREAD"); 68 | int max_scav = 1; 69 | if (max_scav_str) 70 | max_scav = atoi(max_scav_str); 71 | 72 | if (msh_init(max_scav)) { 73 | fprintf(stderr, "msh_init() failed\n"); 74 | return -1; 75 | } 76 | 77 | 78 | if (!getenv("SKIP_FIRST_THREAD")) { 79 | int err = msh_alloc_ctx(thread_cnt++); 80 | if (err) { 81 | fprintf(stderr, "msh_alloc_ctx() failed\n"); 82 | return -1; 83 | } 84 | } 85 | } 86 | 87 | struct msh_start_routine_arg *msh_arg = 88 | (struct msh_start_routine_arg *) malloc( 89 | sizeof(struct msh_start_routine_arg)); 90 | msh_arg->start_routine = start_routine; 91 | msh_arg->arg = arg; 92 | msh_arg->thread_id = thread_cnt++; 93 | if (!original_pthread_create) 94 | original_pthread_create = dlsym(RTLD_NEXT, "pthread_create"); 95 | 96 | int ret = original_pthread_create(thread, attr, msh_start_routine, 97 | (void *) msh_arg); 98 | return ret; 99 | } 100 | 101 | /** 102 | * pthread blockable calls 103 | */ 104 | 105 | int (*original_pthread_mutex_lock)(pthread_mutex_t *mutex) = NULL; 106 | 107 | int 108 | pthread_mutex_lock(pthread_mutex_t *mutex) { 109 | msh_enter_blockable_call(); 110 | if (!original_pthread_mutex_lock) { 111 | original_pthread_mutex_lock = dlsym(RTLD_NEXT, "pthread_mutex_lock"); 112 | } 113 | int ret = original_pthread_mutex_lock(mutex); 114 | msh_exit_blockable_call(); 115 | 116 | return ret; 117 | } 118 | 119 | 120 | int (*original_pthread_cond_wait)(pthread_cond_t *cond, 121 | pthread_mutex_t *mutex) = NULL; 122 | 123 | int 124 | pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex) { 125 | msh_enter_blockable_call(); 126 | if (!original_pthread_cond_wait) { 127 | original_pthread_cond_wait = dlsym(RTLD_NEXT, "pthread_cond_wait"); 128 | } 129 | int ret = 
original_pthread_cond_wait(cond, mutex); 130 | msh_exit_blockable_call(); 131 | 132 | return ret; 133 | } 134 | 135 | int (*original_pthread_cond_timedwait)( 136 | pthread_cond_t *restrict cond, pthread_mutex_t *restrict mutex, 137 | const struct timespec *restrict abstime) = NULL; 138 | 139 | int 140 | pthread_cond_timedwait(pthread_cond_t *restrict cond, 141 | pthread_mutex_t *restrict mutex, 142 | const struct timespec *restrict abstime) { 143 | msh_enter_blockable_call(); 144 | if (!original_pthread_cond_timedwait) 145 | original_pthread_cond_timedwait = 146 | dlsym(RTLD_NEXT, "pthread_cond_timedwait"); 147 | int ret = original_pthread_cond_timedwait(cond, mutex, abstime); 148 | msh_exit_blockable_call(); 149 | 150 | return ret; 151 | } 152 | 153 | int (*original_pthread_barrier_wait)(pthread_barrier_t *barrier) = NULL; 154 | 155 | int 156 | pthread_barrier_wait(pthread_barrier_t *barrier) { 157 | msh_enter_blockable_call(); 158 | if (!original_pthread_barrier_wait) 159 | original_pthread_barrier_wait = 160 | dlsym(RTLD_NEXT, "pthread_barrier_wait"); 161 | int ret = original_pthread_barrier_wait(barrier); 162 | msh_exit_blockable_call(); 163 | 164 | return ret; 165 | } 166 | 167 | int (*original_pthread_join) (pthread_t thread, void **retval) = NULL; 168 | 169 | bool first_join = true; 170 | #define CAS(ptr, old_val, new_val) \ 171 | __sync_bool_compare_and_swap(ptr, old_val, new_val) 172 | int 173 | pthread_join(pthread_t thread, void **retval) { 174 | if (!getenv("SKIP_FIRST_THREAD")) { 175 | if (CAS(&first_join, true, false)) { 176 | int err = msh_cleanup(); 177 | if (err) { 178 | fprintf(stderr, "msh_cleanup() failed\n"); 179 | return -1; 180 | } 181 | } 182 | } 183 | if (!original_pthread_join) 184 | original_pthread_join = dlsym(RTLD_NEXT, "pthread_join"); 185 | int ret = original_pthread_join(thread, retval); 186 | return ret; 187 | } 188 | 189 | /** 190 | * pthread non-blockable calls -- we need to intercept these calls to match the version 191 | */ 192 
| int (*original_pthread_cond_signal)(pthread_cond_t *cond) = NULL; 193 | 194 | int 195 | pthread_cond_signal(pthread_cond_t *cond) { 196 | if (!original_pthread_cond_signal) { 197 | original_pthread_cond_signal = dlsym(RTLD_NEXT, "pthread_cond_signal"); 198 | } 199 | int ret = original_pthread_cond_signal(cond); 200 | 201 | return ret; 202 | } 203 | 204 | int (*original_pthread_mutex_unlock)(pthread_mutex_t *mutex) = NULL; 205 | 206 | int 207 | pthread_mutex_unlock(pthread_mutex_t *mutex) { 208 | if (!original_pthread_mutex_unlock) { 209 | original_pthread_mutex_unlock = 210 | dlsym(RTLD_NEXT, "pthread_mutex_unlock"); 211 | } 212 | int ret = original_pthread_mutex_unlock(mutex); 213 | 214 | return ret; 215 | } 216 | -------------------------------------------------------------------------------- /libmsh/spinlock.h: -------------------------------------------------------------------------------- 1 | #ifndef _SPINLOCK_XCHG_H 2 | #define _SPINLOCK_XCHG_H 3 | 4 | /* Spin lock using xchg. 5 | * Copied from http://locklessinc.com/articles/locks/ 6 | */ 7 | 8 | /* Compiler read-write barrier */ 9 | #define barrier() __asm__ volatile("": : :"memory") 10 | 11 | /* Pause instruction to prevent excess processor bus usage */ 12 | #define cpu_relax() __asm__ volatile("pause\n": : :"memory") 13 | 14 | static inline unsigned short xchg_8(void *ptr, unsigned char x) 15 | { 16 | __asm__ __volatile__("xchgb %0,%1" 17 | :"=r" (x) 18 | :"m" (*(volatile unsigned char *)ptr), "0" (x) 19 | :"memory"); 20 | 21 | return x; 22 | } 23 | 24 | #define BUSY 1 25 | typedef unsigned char spinlock; 26 | 27 | #define SPINLOCK_INITIALIZER 0 28 | 29 | static inline void spin_lock(spinlock *lock) 30 | { 31 | while (1) { 32 | if (!xchg_8(lock, BUSY)) return; 33 | 34 | while (*lock) cpu_relax(); 35 | } 36 | } 37 | 38 | static inline void spin_unlock(spinlock *lock) 39 | { 40 | barrier(); 41 | *lock = 0; 42 | } 43 | 44 | static inline int spin_trylock(spinlock *lock) 45 | { 46 | return xchg_8(lock, 
BUSY); 47 | } 48 | 49 | #endif /* _SPINLOCK_XCHG_H */ -------------------------------------------------------------------------------- /profile/scripts/capture_cm_inst_primary.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SAMPLING_RATE_IN_CNTS=$1 3 | TARGET=$2 4 | shift 2 5 | ARGS="$*" 6 | 7 | TARGET_ABS_PATH=$(readlink -f $TARGET) 8 | TARGET_BASENAME=$(basename $TARGET_ABS_PATH) 9 | 10 | cd "$(dirname "$0")" 11 | 12 | MAIN_DIR=../.. 13 | BUILD_DIR=${MAIN_DIR}/build 14 | CRLIST=${MAIN_DIR}/crlist.txt 15 | RES_DIR=../results 16 | TMP_DIR=../results/tmp 17 | SCHED_NAME=sched.out 18 | #TARGET=$(cat ${CRLIST} | awk '{print $1}') 19 | 20 | echo "Capture deliquent load PCs ... " 21 | echo "Target object: " 22 | #echo -n "$TARGET_ABS_PATH " > $CRLIST 23 | #echo $ARGS >> $CRLIST 24 | #cat $CRLIST 25 | 26 | mkdir -p ${RES_DIR} 27 | 28 | rm -r ${TMP_DIR} 2> /dev/null 29 | mkdir -p ${TMP_DIR} 30 | 31 | echo " 1) perf record L2/L3 misses" 32 | # TODO: determine sampling freq, type of events 33 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -- $benchmark_path/$benchmark_name 1 1 $input_graphs_path/$g 34 | EVENT1=cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp 35 | EVENT2=cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp 36 | EVENT3=BR_INST_RETIRED.ALL_BRANCHES 37 | sudo perf record -e ${EVENT1},${EVENT2},${EVENT3}\ 38 | -b \ 39 | -c${SAMPLING_RATE_IN_CNTS} \ 40 | -o ${RES_DIR}/perf.data -- sudo sh -c ". 
${MAIN_DIR}/config/set_env.sh; chrt -r 99 time -p ${TARGET_ABS_PATH} ${ARGS}" 41 | 42 | sudo chown ${USER} ${RES_DIR}/perf.data 43 | 44 | #perf record -e cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp,cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp,cycles:u -b -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${CRLIST} 45 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${CRLIST} 46 | 47 | echo " 2) perf report" 48 | perf report -i ${RES_DIR}/perf.data --sort comm,dso,symbol -t"\$" | 49 | sed '/BR_INST_RETIRED.ALL_BRANCHES/q' | grep "%" > ${TMP_DIR}/perf_report.txt 50 | 51 | 52 | echo " 3) Capture Functions that cause most L2/L3 misses" 53 | python3 read_func.py ${TMP_DIR}/perf_report.txt ${TMP_DIR}/fn_list.txt ${TARGET_BASENAME} #${SCHED_NAME} 54 | 55 | echo " 4) perf annotate --stdio & Capturing all deliquent load PCs for each function ...." 56 | echo -n > ${RES_DIR}/all_PClist.txt 57 | 58 | IFS=$'\n' 59 | for FUNC in $(cat ${TMP_DIR}/fn_list.txt | awk -F'[\t]' '{print $1}'); do 60 | echo " Function Name: $FUNC" 61 | echo " L2 misses " 62 | FUNC_NO_SPACE=$(echo $FUNC | sed 's/ /_/g') 63 | 64 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 65 | sed -n '/MEM_LOAD_RETIRED.L2_MISS/,/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/{p;/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/q}' | 66 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt 67 | 68 | echo " L3 misses " 69 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 70 | sed -n '/MEM_LOAD_RETIRED.L3_MISS/,/BR_INST_RETIRED/{p;/BR_INST_RETIRED/q}' | 71 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt 72 | 73 | 74 | python3 llc_missed_pcs_rfile.py ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 75 | python3 llc_missed_pcs_rfile.py 
${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 76 | 77 | unset IFS 78 | for PC in $(cat ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt); do 79 | echo -n "${TARGET_ABS_PATH} " >> ${RES_DIR}/all_PClist.txt 80 | echo ${PC} >> ${RES_DIR}/all_PClist.txt 81 | done 82 | IFS=$'\n' 83 | 84 | sort -u ${RES_DIR}/all_PClist.txt -o ${RES_DIR}/all_PClist.txt 85 | done 86 | unset IFS 87 | 88 | #rm -r ${TMP_DIR} 2> /dev/null 89 | echo "Instruction capture done!" 90 | -------------------------------------------------------------------------------- /profile/scripts/capture_cm_inst_process.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SAMPLING_RATE_IN_CNTS=$1 3 | TARGET=$2 4 | TARGET_PID=$3 5 | 6 | TARGET_ABS_PATH=$(readlink -f $TARGET) 7 | TARGET_BASENAME=$(basename $TARGET_ABS_PATH) 8 | 9 | cd "$(dirname "$0")" 10 | 11 | MAIN_DIR=../../ 12 | BUILD_DIR=${MAIN_DIR}/build 13 | CRLIST=${MAIN_DIR}/crlist.txt 14 | RES_DIR=../results 15 | TMP_DIR=../results/tmp 16 | SCHED_NAME=sched.out 17 | #TARGET=$(cat ${CRLIST} | awk '{print $1}') 18 | 19 | echo "Capture deliquent load PCs ... 
" 20 | echo "Target object: " 21 | #echo -n "$TARGET_ABS_PATH " > $CRLIST 22 | #echo $ARGS >> $CRLIST 23 | cat $CRLIST 24 | 25 | mkdir -p ${RES_DIR} 26 | 27 | rm -r ${TMP_DIR} 2> /dev/null 28 | mkdir -p ${TMP_DIR} 29 | 30 | echo " 1) perf record L2/L3 misses" 31 | # TODO: determine sampling freq, type of events 32 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -- $benchmark_path/$benchmark_name 1 1 $input_graphs_path/$g 33 | EVENT1=cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp 34 | EVENT2=cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp 35 | EVENT3=BR_INST_RETIRED.ALL_BRANCHES 36 | sudo perf record -e ${EVENT1},${EVENT2},${EVENT3}\ 37 | -b \ 38 | -c${SAMPLING_RATE_IN_CNTS} \ 39 | -o ${RES_DIR}/perf.data \ 40 | -p ${TARGET_PID} -- sleep 20 41 | echo "done" 42 | sudo chown ${USER} ${RES_DIR}/perf.data 43 | echo "hmm" 44 | #perf record -e cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp,cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp,cycles:u -b -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${CRLIST} 45 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${CRLIST} 46 | 47 | echo " 2) perf report" 48 | perf report -i ${RES_DIR}/perf.data --sort comm,dso,symbol -t"\$" | 49 | sed '/BR_INST_RETIRED.ALL_BRANCHES/q' | grep "%" > ${TMP_DIR}/perf_report.txt 50 | 51 | 52 | echo " 3) Capture Functions that cause most L2/L3 misses" 53 | python3 read_func_process.py ${TMP_DIR}/perf_report.txt ${TMP_DIR}/fn_list.txt ${TARGET_BASENAME} #${SCHED_NAME} 54 | 55 | echo " 4) perf annotate --stdio & Capturing all deliquent load PCs for each function ...." 
56 | echo -n > ${RES_DIR}/all_PClist.txt 57 | 58 | IFS=$'\n' 59 | for FUNC in $(cat ${TMP_DIR}/fn_list.txt | awk -F'[\t]' '{print $1}'); do 60 | echo " Function Name: $FUNC" 61 | echo " L2 misses " 62 | FUNC_NO_SPACE=$(echo $FUNC | sed 's/ /_/g') 63 | 64 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 65 | sed -n '/MEM_LOAD_RETIRED.L2_MISS/,/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/{p;/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/q}' | 66 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt 67 | 68 | echo " L3 misses " 69 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 70 | sed -n '/MEM_LOAD_RETIRED.L3_MISS/,/BR_INST_RETIRED/{p;/BR_INST_RETIRED/q}' | 71 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt 72 | 73 | 74 | python3 llc_missed_pcs_rfile.py ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 75 | python3 llc_missed_pcs_rfile.py ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 76 | 77 | unset IFS 78 | for PC in $(cat ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt); do 79 | echo -n "${TARGET_ABS_PATH} " >> ${RES_DIR}/all_PClist.txt 80 | echo ${PC} >> ${RES_DIR}/all_PClist.txt 81 | done 82 | IFS=$'\n' 83 | 84 | sort -u ${RES_DIR}/all_PClist.txt -o ${RES_DIR}/all_PClist.txt 85 | done 86 | unset IFS 87 | 88 | #rm -r ${TMP_DIR} 2> /dev/null 89 | echo "Instruction capture done!" 
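The capture scripts all rely on the same `sed -n '/start/,/end/p'` range trick to slice one event's section out of the combined `perf annotate --stdio` output. Below is a minimal, self-contained sketch of that extraction on fabricated annotate-style input (real output has more columns, and the scripts additionally accept `BR_INST_RETIRED` as an alternate end marker via GNU sed alternation; this sketch uses the simpler single end marker):

```shell
# Fabricated stand-in for `perf annotate --stdio` output: one line of
# annotation under each of three event-section headers.
annotate='MEM_LOAD_RETIRED.L2_MISS
 12.34 : 4005d0: mov rax, qword ptr [rbx]
MEM_LOAD_RETIRED.L3_MISS
 56.78 : 4005e0: mov rcx, qword ptr [rdx]
BR_INST_RETIRED
  0.00 : 4005f0: jmp 4005d0'

# Print from the L2_MISS header through the next header, then drop both
# boundary lines -- only the L2-miss annotation body remains.
l2_section=$(printf '%s\n' "$annotate" |
    sed -n '/MEM_LOAD_RETIRED.L2_MISS/,/MEM_LOAD_RETIRED.L3_MISS/p' |
    sed '1d' | sed '$d')

printf '%s\n' "$l2_section"
```

In the real scripts the extracted body lines are then handed to `llc_missed_pcs_rfile.py`, which pulls out the PC column for `all_PClist.txt`.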
90 | -------------------------------------------------------------------------------- /profile/scripts/capture_cm_inst_scav.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SAMPLING_RATE_IN_CNTS=$1 3 | 4 | cd "$(dirname "$0")" 5 | 6 | MAIN_DIR=../../ 7 | BUILD_DIR=${MAIN_DIR}/apps/build 8 | SCAVLIST=${MAIN_DIR}/scav.txt 9 | RES_DIR=../results 10 | TMP_DIR=../results/tmp 11 | SCHED_NAME=template 12 | TARGET=$(cat ${SCAVLIST} | awk '{print $1}') 13 | 14 | echo "Capture deliquent load PCs ... " 15 | echo "Target object: " 16 | #echo -n "$TARGET_ABS_PATH " > $SCAVLIST 17 | #echo $ARGS >> $SCAVLIST 18 | cat $SCAVLIST 19 | 20 | mkdir -p ${RES_DIR} 21 | 22 | rm -r ${TMP_DIR} 2> /dev/null 23 | mkdir -p ${TMP_DIR} 24 | 25 | echo " 1) perf record L2/L3 misses" 26 | # TODO: determine sampling freq, type of events 27 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -- $benchmark_path/$benchmark_name 1 1 $input_graphs_path/$g 28 | EVENT1=cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp 29 | EVENT2=cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp 30 | EVENT3=BR_INST_RETIRED.ALL_BRANCHES 31 | sudo perf record -e ${EVENT1},${EVENT2},${EVENT3}\ 32 | -b \ 33 | -c${SAMPLING_RATE_IN_CNTS} \ 34 | -o ${RES_DIR}/perf.data -- sudo sh -c ". 
${MAIN_DIR}/config/set_env.sh; chrt -r 99 ${BUILD_DIR}/${SCHED_NAME} 1" 35 | 36 | sudo chown ${USER} ${RES_DIR}/perf.data 37 | 38 | #perf record -e cpu/event=0xd1,umask=0x10,name=MEM_LOAD_RETIRED.L2_MISS/ppp,cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp,cycles:u -b -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${SCAVLIST} 39 | #perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp -F1000 -o ${RES_DIR}/perf.data -- ${BUILD_DIR}/${SCHED_NAME} ${SCAVLIST} 40 | 41 | echo " 2) perf report" 42 | perf report -i ${RES_DIR}/perf.data --sort comm,dso,symbol -t"\$" | 43 | sed '/BR_INST_RETIRED.ALL_BRANCHES/q' | grep "%" > ${TMP_DIR}/perf_report.txt 44 | 45 | 46 | echo " 3) Capture Functions that cause most L2/L3 misses" 47 | python3 read_func.py ${TMP_DIR}/perf_report.txt ${TMP_DIR}/fn_list.txt ${SCHED_NAME} 48 | 49 | echo " 4) perf annotate --stdio & Capturing all deliquent load PCs for each function ...." 50 | echo -n > ${RES_DIR}/all_PClist.txt 51 | 52 | IFS=$'\n' 53 | for FUNC in $(cat ${TMP_DIR}/fn_list.txt | awk -F'[\t]' '{print $1}'); do 54 | echo " Function Name: $FUNC" 55 | echo " L2 misses " 56 | FUNC_NO_SPACE=$(echo $FUNC | sed 's/ /_/g') 57 | 58 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 59 | sed -n '/MEM_LOAD_RETIRED.L2_MISS/,/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/{p;/\(MEM_LOAD_RETIRED.L3_MISS\|BR_INST_RETIRED\)/q}' | 60 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt 61 | 62 | echo " L3 misses " 63 | perf annotate -i ${RES_DIR}/perf.data --stdio -M intel "${FUNC}" | 64 | sed -n '/MEM_LOAD_RETIRED.L3_MISS/,/BR_INST_RETIRED/{p;/BR_INST_RETIRED/q}' | 65 | sed '1d' | sed '$d' > ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt 66 | 67 | 68 | python3 llc_missed_pcs_rfile.py ${TMP_DIR}/fn_${FUNC_NO_SPACE}_l2_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 69 | python3 llc_missed_pcs_rfile.py 
${TMP_DIR}/fn_${FUNC_NO_SPACE}_l3_annotate.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_percent.txt ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt 70 | 71 | unset IFS 72 | for PC in $(cat ${TMP_DIR}/fn_${FUNC_NO_SPACE}_PClist.txt); do 73 | echo -n "${TARGET} " >> ${RES_DIR}/all_PClist.txt 74 | echo ${PC} >> ${RES_DIR}/all_PClist.txt 75 | done 76 | IFS=$'\n' 77 | 78 | sort -u ${RES_DIR}/all_PClist.txt -o ${RES_DIR}/all_PClist.txt 79 | done 80 | unset IFS 81 | 82 | #rm -r ${TMP_DIR} 2> /dev/null 83 | echo "Instruction capture done!" 84 | -------------------------------------------------------------------------------- /profile/scripts/compute_ld_prob.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import time 3 | sub_cf_dict_list = [] 4 | cf_dict = {} 5 | lat_dict = {} 6 | pred_dict = {} 7 | l2m_dict = {} 8 | l3m_dict = {} 9 | symbol_map = {} 10 | 11 | SAMPLING_RATE_IN_CNT=1 12 | SYS_LBR_ENTRIES=32 13 | LBR_SUBSAMPLE_SIZE=32 14 | is_scav_profile = 0 15 | 16 | last_match = None 17 | def __symbol_map_lookup(addr: int) -> tuple: 18 | global last_match 19 | if last_match is not None: 20 | if last_match[1][1] <= addr and addr <= last_match[1][2]: 21 | return last_match[0], addr - last_match[1][0] 22 | 23 | for key, value in symbol_map.items(): 24 | if value[1] <= addr and addr <= value[2]: # addr is in the range of the symbol 25 | last_match = (key, value) 26 | return key, addr - value[0] # compute offset from the object local address (not symbol local address) 27 | return None, 0 28 | 29 | def __localize_address(range: tuple, allow_diff_obj=False) -> tuple: 30 | assert(len(range) == 2) 31 | 32 | src_pc = range[0] 33 | dst_pc = range[1] 34 | 35 | # check address range of src, dst using symbol map 36 | # then do two things 37 | # 1) filter out if src, dst are not in the same symbol 38 | # 2) if src, dst are in the same symbol, then translate them to symbol-local address 39 | # by subtracting the symbol start address from src, dst 40 
| src_obj_name, src_local_addr = __symbol_map_lookup(src_pc) 41 | dst_obj_name, dst_local_addr = __symbol_map_lookup(dst_pc) 42 | 43 | if src_obj_name is None or dst_obj_name is None: 44 | return None 45 | if (src_obj_name != dst_obj_name) and (not allow_diff_obj): 46 | return None 47 | 48 | return (src_obj_name, src_local_addr, dst_local_addr) 49 | 50 | def create_control_flow_tuples(sub_cf_dict, sub_lat_dict, branch_src_to_dst_list: list) -> None: 51 | src, dst = None, None 52 | for i, src_to_dst in enumerate(branch_src_to_dst_list): 53 | if i == 0: # first item 54 | src = src_to_dst[0] 55 | else: 56 | dst = src_to_dst[1] 57 | cycles = src_to_dst[2] 58 | triple = __localize_address((dst, src)) 59 | if triple is None: 60 | src = src_to_dst[0] 61 | continue 62 | 63 | if triple in sub_cf_dict: 64 | sub_cf_dict[triple] += 1 65 | sub_lat_dict[triple] = (sub_lat_dict[triple][0] + cycles, sub_lat_dict[triple][1] + 1) 66 | else: 67 | sub_cf_dict[triple] = 1 68 | sub_lat_dict[triple] = (cycles, 1) 69 | 70 | src = src_to_dst[0] 71 | 72 | 73 | def create_predecessor_profile(sub_pred_dict, branch_src_to_dst_list: list) -> None: 74 | for i, src_to_dst in enumerate(branch_src_to_dst_list): 75 | src = src_to_dst[0] 76 | dst = src_to_dst[1] 77 | triple = __localize_address((src, dst), allow_diff_obj=True) 78 | if triple is None: 79 | continue 80 | 81 | local_addr_src = (triple[0], triple[1]) 82 | local_addr_dst = (triple[0], triple[2]) 83 | 84 | if local_addr_dst in sub_pred_dict: 85 | #do something 86 | if local_addr_src in sub_pred_dict[local_addr_dst]: 87 | sub_pred_dict[local_addr_dst][local_addr_src] += 1 88 | else: 89 | sub_pred_dict[local_addr_dst][local_addr_src] = 1 90 | else: 91 | sub_pred_dict[local_addr_dst] = {} 92 | sub_pred_dict[local_addr_dst][local_addr_src] = 1 93 | 94 | def aggregate_sub_dicts(sub_dicts: list) -> None: 95 | sub_cf_dicts = [sub_dict[0] for sub_dict in sub_dicts] 96 | sub_lat_dicts = [sub_dict[1] for sub_dict in sub_dicts] 97 | sub_pred_dicts = 
[sub_dict[2] for sub_dict in sub_dicts] 98 | 99 | for sub_cf_dict in sub_cf_dicts: 100 | for key, value in sub_cf_dict.items(): 101 | if key in cf_dict: 102 | cf_dict[key] += value 103 | else: 104 | cf_dict[key] = value 105 | 106 | if is_scav_profile == 0: 107 | return 108 | 109 | for sub_lat_dict in sub_lat_dicts: 110 | for key, value in sub_lat_dict.items(): 111 | if key in lat_dict: 112 | lat_dict[key] = (lat_dict[key][0] + value[0], lat_dict[key][1] + value[1]) 113 | else: 114 | lat_dict[key] = value 115 | 116 | for key, value in lat_dict.items(): 117 | lat_dict[key] = (value[0] / value[1]) # cycles / cnt 118 | 119 | for sub_pred_dict in sub_pred_dicts: 120 | for key, value in sub_pred_dict.items(): 121 | if key in pred_dict: 122 | for key2, value2 in value.items(): 123 | if key2 in pred_dict[key]: 124 | pred_dict[key][key2] += value2 125 | else: 126 | pred_dict[key][key2] = value2 127 | else: 128 | pred_dict[key] = value 129 | 130 | 131 | def sub_process_lbr(lines): 132 | sub_cf_dict = {} 133 | sub_lat_dict = {} 134 | sub_pred_dict = {} 135 | for line in lines: 136 | tokens = line.split() 137 | if len(tokens) < 2: 138 | continue 139 | 140 | # Note: Python 3 integers are arbitrary precision, 141 | # so no need to worry about overflow in conversion 142 | # (branch src, branch dst, cycles elapsed since the last LBR) 143 | branch_src_to_dst_list = [(int(token.split("/")[0].strip(),16), \ 144 | int(token.split("/")[1].strip(),16), \ 145 | int(token.split("/")[5].strip())) 146 | for token in tokens] 147 | 148 | create_control_flow_tuples(sub_cf_dict, sub_lat_dict, branch_src_to_dst_list[:LBR_SUBSAMPLE_SIZE]) 149 | if is_scav_profile == 1: 150 | create_predecessor_profile(sub_pred_dict, branch_src_to_dst_list[:LBR_SUBSAMPLE_SIZE]) 151 | return sub_cf_dict, sub_lat_dict, sub_pred_dict 152 | 153 | 154 | def process_lbr(trace_filename: str) -> None: 155 | f = open(trace_filename, "r") 156 | lines = f.readlines() 157 | 158 | f.close() 159 | src_to_dst_count = {} 
160 | 161 | NUM_CORES = 56 162 | # split lines into N processes 163 | # each process generate control_flow_tuples and aggregate later 164 | sublines = [[] for i in range(NUM_CORES)] 165 | for i in range(NUM_CORES): 166 | sublines[i] = lines[i::NUM_CORES] 167 | 168 | import multiprocessing 169 | results = [] 170 | with multiprocessing.Pool(NUM_CORES) as p: 171 | results = p.map(sub_process_lbr, sublines) 172 | 173 | p.close() 174 | p.join() 175 | print([len(dic) if dic is not None else 0 for dic in results]) 176 | aggregate_sub_dicts(results) 177 | 178 | 179 | #for line in lines: 180 | # tokens = line.split() 181 | # if len(tokens) < 2: 182 | # continue 183 | # 184 | # # Note: python3 uses 8-byte integers by default (64-bit) 185 | # # So no need to worry about overflow in conversion 186 | # branch_src_to_dst_list = [(int(token.split("/")[0].strip(),16), \ 187 | # int(token.split("/")[1].strip(),16)) 188 | # for token in tokens] 189 | # 190 | # create_control_flow_tuples(branch_src_to_dst_list[:LBR_SUBSAMPLE_SIZE]) 191 | 192 | 193 | print("BB execution summary:") 194 | for key, value in cf_dict.items(): 195 | print(f"{key[0]} {hex(key[1])} {hex(key[2])} {value}") 196 | print(" ") 197 | 198 | print("BB latency summary:") 199 | for key, value in lat_dict.items(): 200 | print(f"LAT_PROF {key[0]} {key[1]} {key[2]} {value}") 201 | print(" ") 202 | 203 | print("BB predecessor summary:") 204 | for key, value in pred_dict.items(): 205 | for key2, value2 in value.items(): 206 | print(f"PRED_PROF {key[0]} {key[1]} {key2[0]} {key2[1]} {value2}") 207 | print(" ") 208 | 209 | #with open("cf_dict.txt", "w") as f: 210 | # for key, value in cf_dict.items(): 211 | # f.write(f"{key[0]} {key[1]} {value}\n") 212 | 213 | def sample_to_approx_total(exec_cnt: int, scale:int = 1) -> int: 214 | return exec_cnt / scale * SAMPLING_RATE_IN_CNT 215 | 216 | 217 | def create_cm_dict(summary_filename: str, level: int) -> None: 218 | dic = l2m_dict if level == 2 else l3m_dict 219 | 220 | f = 
open(summary_filename, "r") 221 | lines = f.readlines() 222 | for line in lines: 223 | # L2/L3 summary file format: 224 | # <pc> <sample_cnt> per line 225 | pc = int(line.split()[0].strip(), 16) 226 | sample_cnt = int(line.split()[1].strip()) 227 | 228 | obj_name, addr = __symbol_map_lookup(pc) 229 | if obj_name is None: 230 | continue 231 | dic[(obj_name, addr)] = sample_to_approx_total(sample_cnt) 232 | 233 | f.close() 234 | 235 | # Compute the probability that each load instruction causes a cache miss 236 | # Assumption: addresses are given in hex as object-local addresses. 237 | def compute_prob(addrlist_filename: str) -> None: 238 | f = open(addrlist_filename, "r") 239 | #out = open("exec_counts.txt", "w") 240 | lines = f.readlines() 241 | 242 | for line in lines: 243 | # Address list file format: 244 | # <obj_name> <addr> per line 245 | obj_name = line.split()[0].strip() 246 | addr = int(line.split()[1].strip(), 16) 247 | sample_cnt = 0 248 | 249 | # Loop through all control flow tuples to find the ones that include the given address 250 | for key, value in cf_dict.items(): 251 | if key[0] == obj_name and key[1] <= addr and addr <= key[2]: 252 | sample_cnt += value 253 | exec_cnt = sample_to_approx_total(sample_cnt, LBR_SUBSAMPLE_SIZE) 254 | 255 | l2_miss_cnt = l2m_dict[(obj_name, addr)] if (obj_name, addr) in l2m_dict else 0 256 | l3_miss_cnt = l3m_dict[(obj_name, addr)] if (obj_name, addr) in l3m_dict else 0 257 | 258 | # print prob up to two decimal digits after the point 259 | if exec_cnt == 0: 260 | print(f"LOAD_PROB {obj_name} {addr} {l2_miss_cnt} {l3_miss_cnt} {exec_cnt} 0.00% 0.00%") 261 | continue 262 | 263 | print(f"LOAD_PROB {obj_name} {addr} {l2_miss_cnt} {l3_miss_cnt} {exec_cnt}" + \ 264 | " {0:.2f}% {1:.2f}%".format(100*l2_miss_cnt/exec_cnt, 100*l3_miss_cnt/exec_cnt)) 265 | f.close() 266 | 267 | 268 | def run(lbr_trace_filename: str, 269 | l2_summary_filename: str, 270 | l3_summary_filename: str, 271 | addrlist_filename: str) -> None: 272 | 273 | t1 = time.time() 274 | 
process_lbr(lbr_trace_filename) 275 | t2 = time.time() 276 | create_cm_dict(l2_summary_filename, 2) 277 | t3 = time.time() 278 | create_cm_dict(l3_summary_filename, 3) 279 | t4 = time.time() 280 | compute_prob(addrlist_filename) 281 | t5 = time.time() 282 | 283 | print(f"process_lbr: {t2 - t1} seconds") 284 | print(f"create_cm_dictl2: {t3 - t2} seconds") 285 | print(f"create_cm_dictl3: {t4 - t3} seconds") 286 | print(f"compute_prob: {t5 - t4} seconds") 287 | 288 | 289 | def build_symbol_map(symbolmap_filename: str) -> None: 290 | f = open(symbolmap_filename, "r") 291 | lines = f.readlines() 292 | for line in lines: 293 | #Symbol map file format: 294 | # <obj_name> <start_addr> <len> <pgoff> per line 295 | #we want to maintain (object start address, symbol start address, symbol end address) 296 | tokens = line.split() 297 | if len(tokens) < 4: 298 | continue 299 | #Assuming a dumb linear search 300 | symbol_map[tokens[0]] = (int(tokens[1], 16) - int(tokens[3], 16), int(tokens[1], 16), int(tokens[1], 16) + int(tokens[2], 16)) 301 | f.close() 302 | 303 | if __name__ == "__main__": 304 | if len(sys.argv) < 8: 305 | print("usage: " + sys.argv[0] + " <lbr_trace> <symbol_map> <addr_list> <l2_summary> <l3_summary> <sampling_rate_in_cnt> <is_scav_profile>") 306 | sys.exit(1) 307 | lbr_trace_filename = sys.argv[1] 308 | symbolmap_filename = sys.argv[2] 309 | addrlist_filename = sys.argv[3] 310 | l2_summary_filename = sys.argv[4] 311 | l3_summary_filename = sys.argv[5] 312 | SAMPLING_RATE_IN_CNT = int(sys.argv[6]) 313 | is_scav_profile = int(sys.argv[7]) 314 | 315 | start = time.time() 316 | build_symbol_map(symbolmap_filename) 317 | end = time.time() 318 | print(f"Symbol map built in {end - start} seconds") 319 | 320 | run(lbr_trace_filename, l2_summary_filename, l3_summary_filename, addrlist_filename) 321 | -------------------------------------------------------------------------------- /profile/scripts/compute_ld_prob.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | if [ $# -ne 3 ]; then 3 | echo "Usage: $0 <perf_data_path> <is_scav> <sampling_rate_in_cnts>" 4 | exit 1 5 | fi 6 | 7 | PERF_DATA_PATH=$1 8 | 
IS_SCAV=$2 9 | PERF_ABS_PATH=$(readlink -f $PERF_DATA_PATH) 10 | SAMPLING_RATE_IN_CNTS=$3 11 | #SAMPLING_RATE_IN_CNTS=$(perf script -i ${PERF_ABS_PATH} --header | grep BR_INST | grep "sample_freq" | awk -F'[, ]' '{print $256}') 12 | echo ${SAMPLING_RATE_IN_CNTS} 13 | cd "$(dirname "$0")" 14 | 15 | MAIN_DIR=../../ 16 | BUILD_DIR=${MAIN_DIR}/build 17 | RES_DIR=../results 18 | TMP_DIR=../results/tmp 19 | 20 | mkdir -p ${TMP_DIR} 21 | 22 | # symbol map 23 | perf report -i ${PERF_ABS_PATH} --dump-raw-trace | grep PERF_RECORD_MMAP2 | \ 24 | awk -F'[][() ]' '{print $NF, $9, $10, $13}' > ${TMP_DIR}/symbol.map # 25 | 26 | # LBR samples 27 | perf script -i ${PERF_ABS_PATH} -F event,brstack | grep BRANCH | \ 28 | awk -F':' '{print $2}' > ${TMP_DIR}/lbr.samples 29 | 30 | # L2, L3 summary 31 | perf script -i ${PERF_ABS_PATH} -F event,ip | grep "MEM_LOAD_RETIRED.L2_MISS" | \ 32 | awk '{cnt[$2]+=1} END {for (ip in cnt) print ip, cnt[ip]}' > ${TMP_DIR}/l2.summary 33 | 34 | perf script -i ${PERF_ABS_PATH} -F event,ip | grep "MEM_LOAD_RETIRED.L3_MISS" | \ 35 | awk '{cnt[$2]+=1} END {for (ip in cnt) print ip, cnt[ip]}' > ${TMP_DIR}/l3.summary 36 | 37 | # finally, compute prob 38 | python3 compute_ld_prob.py ${TMP_DIR}/lbr.samples \ 39 | ${TMP_DIR}/symbol.map \ 40 | ${RES_DIR}/all_PClist.txt \ 41 | ${TMP_DIR}/l2.summary \ 42 | ${TMP_DIR}/l3.summary \ 43 | ${SAMPLING_RATE_IN_CNTS} \ 44 | $IS_SCAV > ${RES_DIR}/ld_prob.txt 45 | 46 | cat ${RES_DIR}/ld_prob.txt | grep "LOAD_PROB" | awk -F'[% ]' '{print $2, $3, $7, $9}' > ${RES_DIR}/cmpc_list.txt 47 | if [ $IS_SCAV -eq 1 ]; then 48 | cat ${RES_DIR}/ld_prob.txt | grep "LAT_PROF" > ${RES_DIR}/lat_prof.txt 49 | cat ${RES_DIR}/ld_prob.txt | grep "PRED_PROF" > ${RES_DIR}/pred_prof.txt 50 | fi 51 | 52 | #rm -r ${TMP_DIR} 2> /dev/null 53 | -------------------------------------------------------------------------------- /profile/scripts/llc_missed_pcs_rfile.py: -------------------------------------------------------------------------------- 1 | 
import sys 2 | 3 | # IN:  perf-annotate output for one function (argv[1]) 4 | # OUT: argv[2]: per-PC miss percentages, argv[3]: PCs that cover ~70% of misses 5 | percent = [] 6 | pc_list = [] 7 | 8 | with open(sys.argv[1]) as file_in: 9 | for line in file_in: 10 | if '.' not in line: 11 | continue 12 | if line.split()[0] == ":": 13 | continue 14 | percent_value = int(float(line.split()[0])) 15 | if percent_value > 0: 16 | percent.append(percent_value) 17 | pc_list.append(line.split()[2][:-1])  # strip the trailing ':' from the PC 18 | 19 | sum_percent = 0 20 | output_file_pc_list = open(sys.argv[2], "a") 21 | output_file_most_missed_pc = open(sys.argv[3], "a") 22 | 23 | # Emit PCs in decreasing order of miss share until ~70% of the misses are covered 24 | for pct, pc in sorted(zip(percent, pc_list), reverse=True): 25 | if sum_percent >= 70: 26 | break 27 | print(" PC: ", pc, " percent: ", pct) 28 | sum_percent = sum_percent + pct 29 | output_file_pc_list.write("percent: " + str(pct) + " PC: " + str(pc) + "\n") 30 | output_file_most_missed_pc.write(str(pc) + "\n") 31 | 32 | output_file_pc_list.close() 33 | output_file_most_missed_pc.close() 34 | -------------------------------------------------------------------------------- /profile/scripts/prepare_tbench.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | cd $(dirname "$0") 3 | 4 | 
TBENCH_DIR=/home/sam/clh/tailbench-clh 5 | TBENCH_APPS="img-dnn masstree sphinx xapian moses" 6 | 7 | BUILD_DIR=$(readlink -f ../../apps/build) 8 | 9 | if [ -z $1 ]; then 10 | echo "Usage: ./prepare_tbench.sh [build|install]" 11 | exit 12 | fi 13 | 14 | MODE=$1 15 | if [ $MODE == "build" ]; then 16 | for app in $TBENCH_APPS; do 17 | if [ $app == "moses" ]; then 18 | cd $TBENCH_DIR/${app}/moses-cmd 19 | else 20 | cd $TBENCH_DIR/${app} 21 | fi 22 | make clean 23 | if [ $app == "sphinx" ]; then 24 | PKG_CONFIG_PATH=${TBENCH_DIR}/${app}/sphinx-install/lib/pkgconfig make -j16 25 | elif [ $app == "silo" ]; then 26 | ./build.sh 27 | else 28 | make -j16 29 | fi 30 | done 31 | fi 32 | 33 | if [ $MODE == "install" ]; then 34 | for app in $TBENCH_APPS; do 35 | target=$app 36 | if [ $app == "masstree" ]; then 37 | target="mttest" 38 | elif [ $app == "sphinx" ]; then 39 | target="decoder" 40 | elif [ $app == "moses" ]; then 41 | target="moses-cmd/moses" 42 | elif [ $app == "silo" ]; then 43 | target="out-perf.masstree/benchmarks/dbtest" 44 | fi 45 | 46 | cd $TBENCH_DIR/${app} 47 | cp $TBENCH_DIR/${app}/${target}_integrated ${BUILD_DIR} 48 | done 49 | fi 50 | -------------------------------------------------------------------------------- /profile/scripts/read_func.py: -------------------------------------------------------------------------------- 1 | """ 2 | IN: perf report of L2, L3 cache misses 3 | OUT: 1)fn_list: list of functions that are responsible for most cache misses 4 | 2)fn_percent_list: list of percentages of cache misses for each function 5 | """ 6 | import sys 7 | import re 8 | 9 | funcNum = 0 10 | percentSum = 0 11 | #funcList = [] 12 | #percentList = [] 13 | fn_to_pct_map = {} 14 | 15 | bench = sys.argv[3] 16 | 17 | MAX_FUNC_NUM = 20 18 | with open(sys.argv[1]) as file_in: 19 | lines = [] 20 | for line in file_in: 21 | # check if it is substr 22 | if line.split('$')[1] in bench: # command = bench 23 | fn_name_prim = line.split('$')[3] 24 | fn_name = 
fn_name_prim.split()[1] 25 | for i in range(2, len(fn_name_prim.split())): 26 | fn_name += " " 27 | fn_name += fn_name_prim.split()[i] 28 | 29 | overhead_pct = float(line.split("%")[0]) 30 | 31 | if overhead_pct < 1: 32 | continue 33 | 34 | if fn_name in fn_to_pct_map: 35 | fn_to_pct_map[fn_name] += overhead_pct 36 | else: 37 | if funcNum >= MAX_FUNC_NUM: 38 | continue 39 | funcNum=funcNum+1 40 | fn_to_pct_map[fn_name] = overhead_pct 41 | 42 | 43 | #output_file_func_percent_list = open(sys.argv[2], "a") 44 | output_file_func_list = open(sys.argv[2], "a") 45 | 46 | 47 | if fn_to_pct_map: 48 | print("Functions that cause most cache misses (L2, L3 aggregated, total 200%):") 49 | for fn, pct in fn_to_pct_map.items(): 50 | print(f"{fn} {str(pct)}%"); 51 | output_file_func_list.write(f"{fn}\t{str(pct)}%\n") 52 | -------------------------------------------------------------------------------- /profile/scripts/read_func_process.py: -------------------------------------------------------------------------------- 1 | """ 2 | IN: perf report of L2, L3 cache misses 3 | OUT: 1)fn_list: list of functions that are responsible for most cache misses 4 | 2)fn_percent_list: list of percentages of cache misses for each function 5 | """ 6 | import sys 7 | import re 8 | 9 | funcNum = 0 10 | percentSum = 0 11 | #funcList = [] 12 | #percentList = [] 13 | fn_to_pct_map = {} 14 | 15 | bench = sys.argv[3] 16 | 17 | MAX_FUNC_NUM = 20 18 | with open(sys.argv[1]) as file_in: 19 | lines = [] 20 | for line in file_in: 21 | # check if it is substr 22 | print(line.split('$')[2], bench) 23 | if line.split('$')[2] in bench or bench in line.split('$')[2]: # command = bench 24 | fn_name_prim = line.split('$')[3] 25 | fn_name = fn_name_prim.split()[1] 26 | for i in range(2, len(fn_name_prim.split())): 27 | fn_name += " " 28 | fn_name += fn_name_prim.split()[i] 29 | 30 | overhead_pct = float(line.split("%")[0]) 31 | if overhead_pct < 1: 32 | continue 33 | 34 | if fn_name in fn_to_pct_map: 35 | 
fn_to_pct_map[fn_name] += overhead_pct 36 | else: 37 | if funcNum >= MAX_FUNC_NUM: 38 | continue 39 | funcNum=funcNum+1 40 | fn_to_pct_map[fn_name] = overhead_pct 41 | 42 | 43 | #output_file_func_percent_list = open(sys.argv[2], "a") 44 | output_file_func_list = open(sys.argv[2], "a") 45 | 46 | 47 | if fn_to_pct_map: 48 | print("Functions that cause most cache misses (L2, L3 aggregated, total 200%):") 49 | for fn, pct in fn_to_pct_map.items(): 50 | print(f"{fn} {str(pct)}%"); 51 | output_file_func_list.write(f"{fn}\t{str(pct)}%\n") 52 | -------------------------------------------------------------------------------- /profile/scripts/read_lats.py: -------------------------------------------------------------------------------- 1 | import struct 2 | import numpy as np 3 | import sys 4 | 5 | def main(): 6 | fileName = 'lats.bin' 7 | with open(fileName, mode='rb') as file: 8 | fileContent = file.read() 9 | 10 | results = [ r for r in struct.iter_unpack("QQQ", fileContent) ] 11 | queue_times = [ v[0]/1e6 for v in results ] 12 | svc_times = [ v[1]/1e6 for v in results ] 13 | sjrn_times = [ v[2]/1e6 for v in results ] 14 | 15 | if len(sys.argv) > 1: 16 | for i, _ in enumerate(queue_times): 17 | print('{} {} {}'.format(queue_times[i], svc_times[i], sjrn_times[i])) 18 | 19 | print('query count: {}'.format(len(queue_times))) 20 | print('queue time: average {}'.format(np.mean(queue_times))) 21 | print('service time: average {}'.format(np.mean(svc_times))) 22 | print('sojourn time: average {}'.format(np.mean(sjrn_times))) 23 | 24 | print('queue time: median {}, 90% {}, 95% {}, 99% {}'.format(np.median(queue_times), np.percentile(queue_times, 90), np.percentile(queue_times, 95), np.percentile(queue_times, 99))) 25 | print('service time: median {}, 90% {}, 95% {}, 99% {}'.format(np.median(svc_times), np.percentile(svc_times, 90), np.percentile(svc_times, 95), np.percentile(svc_times, 99))) 26 | print('sojourn time: median {}, 90% {}, 95% {}, 99% {}'.format(np.median(sjrn_times), 
np.percentile(sjrn_times, 90), np.percentile(sjrn_times, 95), np.percentile(sjrn_times, 99))) 27 | 28 | if __name__ == "__main__": 29 | main() 30 | -------------------------------------------------------------------------------- /profile/scripts/scav_profile.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sosson97/msh/5cdde8a2fe8f8eee103caf296ff21e170b465c6b/profile/scripts/scav_profile.py --------------------------------------------------------------------------------
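A note on how these scripts fit together: `compute_ld_prob.py` estimates each delinquent load's cache-miss probability by scaling two sampled counts back to approximate totals — LBR samples give basic-block execution counts (each sample carries `LBR_SUBSAMPLE_SIZE` branch records), and PEBS samples give per-PC L2/L3 miss counts — then dividing. Below is a minimal sketch of that arithmetic; the `miss_prob` helper and the event counts are made-up illustration values, not taken from the repository:

```python
# Sketch of the scaling used by compute_ld_prob.py (hypothetical numbers).
# Both the LBR samples and the PEBS miss samples are taken once every
# SAMPLING_RATE_IN_CNT events (perf's -c period), so each sample is scaled
# back to an approximate total before dividing.

SAMPLING_RATE_IN_CNT = 100003   # hypothetical perf -c value
LBR_SUBSAMPLE_SIZE = 32         # branch records consumed per LBR sample

def sample_to_approx_total(sample_cnt: int, scale: int = 1) -> float:
    # One perf sample stands for SAMPLING_RATE_IN_CNT events; LBR samples
    # additionally carry `scale` branch records each.
    return sample_cnt / scale * SAMPLING_RATE_IN_CNT

def miss_prob(bb_record_cnt: int, miss_sample_cnt: int) -> float:
    # bb_record_cnt: LBR branch records covering the load's basic block
    # miss_sample_cnt: PEBS samples that hit the load's PC
    exec_cnt = sample_to_approx_total(bb_record_cnt, LBR_SUBSAMPLE_SIZE)
    miss_cnt = sample_to_approx_total(miss_sample_cnt)
    return 0.0 if exec_cnt == 0 else miss_cnt / exec_cnt

# e.g. a load whose basic block showed up in 6400 LBR records and whose PC
# got 120 L3-miss samples:
print(f"{100 * miss_prob(6400, 120):.2f}%")  # -> 60.00%
```

Since both event types are recorded with the same `-c` period, the period cancels out of the ratio; the scripts keep the scaling explicit because the `LOAD_PROB` lines also report the approximate absolute counts.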