├── .DS_Store ├── README.md ├── blog ├── 2021-07-25-meltdown.md ├── 2021-07-29-user-level-thread-switch.md ├── 2021-08-11-how-linux-finds-physical-address-through-virtual-address.md ├── 2022-02-06-tcp-congestion-control.md ├── 2022-05-03-slack-incident-reading-note.md ├── 2022-10-05-jemalloc.md ├── 2025-01-01.md ├── 2025-01-04-linear-hash.md ├── 2025-01-15-non-blocking-stack-profiler.md └── 2025-01-27-memory-barrier.md ├── images ├── memory-barrier-0.png ├── memory-barrier-invalid-queue.png └── memory-barrier-store-buffer.png └── slides ├── Concurrency-task-schedule-brief-introduction@RubyConf-China-2020.key ├── Regression-Test-Selection-for-Rails-Project.key ├── cow-in-xv6.key ├── draft └── .keep ├── erlang-message-passing.key ├── sdb-a-new-ruby-stack-profiler@RubyConf-China-2024.key └── sdb-rubykaigi-2025.key /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/.DS_Store -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Personal Blog 2 | ## Articles and Slides 3 | ## 2025 4 | [Understanding Memory Barriers Step by Step - A Summary of "Memory Barriers: A Hardware View for Software Hackers"](./blog/2025-01-27-memory-barrier.md) 2025-01-29 5 | 6 | [Understanding Linear Hash Step by Step](./blog/2025-01-04-linear-hash.md) 2025-01-04 7 | 8 | [Understanding the Page Table Step by Step](./blog/2025-01-01.md) 2025-01-01 9 | 10 | ## 2024 11 | [SDB: a New Ruby Stack Profiler @ RubyConf China 2024](./slides/slides/sdb-a-new-ruby-stack-profiler@RubyConf-China-2024.key) 2024-11-30 12 | 13 | [无 root 权限查看 Ruby HTTPS 请求内容](https://ruby-china.org/topics/43886) 2024-09-14 14 | 15 | [Detect Ruby GVL contention through dynamic link library functions](https://ruby-china.org/topics/43883) 2024-09-11 16 | 17 | [Ruby Garbage Collection 101 and Ruby's RGenGC (Restricted Generational GC)](https://ruby-china.org/topics/43798) 2024-07-06 18 | 19 | [Rust2go: calls Go from Rust](https://ruby-china.org/topics/43765) 2024-06-27 20 | 21 | [Use eBPF USDT in Rust](https://github.com/yfractal/blog/issues/15) 2024-05-21 22 | 23 | ## 2023 24 | [How Meltdown Works](https://github.com/yfractal/blog/issues/14) 2023-12-04 25 | 26 | [Exercise Snacks: A Feasible Exercising Strategy for Office Workers](https://github.com/yfractal/blog/issues/12) 2023-05-08 27 | 28 | ## 2022 29 | 30 | [How JeMalloc Works](./blog/2022-10-05-jemalloc.md) 2022-10-05 31 | 32 | MIT 6.824 Distributed Systems Reading Notes 33 | - [Scaling Memcache at Facebook Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1100841630) 2022-04-17 34 | - [The Google File System Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1115903733) 2022-05-03 35 | - [Amazon Aurora Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1140428705) 2022-05-29 36 | - [COPS Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1207530839) 2022-05-30 37 | - [Spark Paper Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1193444846) 2022-06-25 38 | - [Google Spanner Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1225041302) 2022-08-24 39 | 40 | [Regression Test Selection for Rails Project @ internal sharing](./slides/Regression-Test-Selection-for-Rails-Project.key) 2022-06-09 41 | 42 | [Slack’s Incident on 2-22-22 Reading 
Note](./blog/2022-05-03-slack-incident-reading-note.md) 2022-05-03 43 | 44 | [闲来无事,用 Ruby 撸了个 LSM-Tree](https://ruby-china.org/topics/42363) 2022-05-02 45 | 46 | [How Copy on Write Works in Xv6](./slides/cow-in-xv6.key) 2022-02-06 47 | 48 | [TCP Congestion Control Brief Description by Pseudo Code](./blog/2022-02-06-tcp-congestion-control.md) 2022-02-06 49 | 50 | ### 2021 51 | [How Linux finds physical address through virtual memory](./blog/2021-08-11-how-linux-finds-physical-address-through-virtual-address.md) 2021-08-11 52 | 53 | [User Level Thread Switch](./blog/2021-07-29-user-level-thread-switch.md) 2021-07-29 54 | 55 | [Meltdown Notes](./blog/2021-07-25-meltdown.md) 2021-07-25 56 | 57 | [MIT 6.S081 学习总结](https://ruby-china.org/topics/41485) 2021-07-17 58 | 59 | [Linear Hash Revisit](https://ruby-china.org/topics/40930) 2021-02-22 60 | 61 | [How DBMS Memory Buffer Works](https://ruby-china.org/topics/40932) 2021-02-21 62 | 63 | ### 2020 64 | [Erlang message passing brief introduction @ Beijing Elixir meetup](./slides/erlang-message-passing.key) 2020-10-24 65 | 66 | [Concurrency task schedule brief introduction @ RubyConf China 2020](./slides/Concurrency-task-schedule-brief-introduction@RubyConf-China-2020.key) 2020-08-15 67 | 68 | [Raft 笔记](https://ruby-china.org/topics/40018) 2020-06-26 69 | 70 | [Awesome OTP Learning](https://github.com/yfractal/awesome-otp-learning) 2020-02-22 71 | 72 | [Linear Hash 原理及实现](https://ruby-china.org/topics/39466) 2020-01-27 73 | 74 | [Erlang ETS Linear Hash Implementation](https://ruby-china.org/topics/39470) 2020-01-29 75 | 76 | ["Memory Barriers: a Hardware View for Software Hackers" 笔记](https://ruby-china.org/topics/39474) 2020-01-31 77 | 78 | ### 2019 79 | [Mnesia Transaction and Locker @ ShenZhen Elixir Meetup](https://github.com/Pragmatic-Elixir-Meetup/shenzhen-meetup/tree/master/2019-10-27/mnesia-transaction-and-locker) 2019-10-27 80 | 81 | [Erlang 虚拟机窥探 - process 和 scheduler @ ShenZhen Elixir Meetup](https://github.com/Pragmatic-Elixir-Meetup/shenzhen-meetup/tree/master/2019-08-04/erlang%20%E8%99%9A%E6%8B%9F%E6%9C%BA%E7%AA%A5%E6%8E%A2%20-%20%E4%BD%BF%E7%94%A8%20ruby%20%E6%A8%A1%E6%8B%9F%20erlang%20process%20%E5%92%8C%20scheduler) 2019-08-05 82 | 83 | [A Basic Paxos Algorithm Demo Using Erlang](https://ruby-china.org/topics/38909) 2019-08-05 84 | 85 | [分布式共识算法 Paxos -- 如何让所有程序员认可 PHP 才是最好的语言](https://ruby-china.org/topics/38833) 2019-07-12 86 | 87 | [Understand lock free queue algorithm as a concurrency beginner](https://ruby-china.org/topics/38086) 2019-02-04 88 | 89 | [使用 Fiber 实现简单的 CSP (Goroutine channel)](https://ruby-china.org/topics/38041) 2019-01-24 90 | 91 | [HTTP Request Demo by Future and Nio4r](https://ruby-china.org/topics/38404) 2019-04-14 92 | 93 | ### 2018 94 | [自旋锁及 Nginx 实现](https://ruby-china.org/topics/37916) 2018-12-18 95 | 96 | [Erlang 源码阅读 -- scheduler](https://ruby-china.org/topics/37840) 2018-12-01 97 | 98 | [Erlang 源码阅读 -- Number of Active Schedulers](https://ruby-china.org/topics/37874) 2018-12-09 99 | 100 | [使用 Ruby 实现 Erlang "Process"](https://ruby-china.org/topics/37750) 2018-11-11 101 | -------------------------------------------------------------------------------- /blog/2021-07-25-meltdown.md: -------------------------------------------------------------------------------- 1 | # Meltdown Notes 2 | 3 | Meltdown can let us read kernel memory from user space which breaks OS' isolation. 4 | 5 | That means we can use meltdown to read some privilege data such I/O or network buffer. 
6 | 7 | Before explain how it works let us to see how OS prevent such things happen. 8 | 9 | ## Memory Isolation 10 | 11 | In user mode data can be accessed only through virtual memory. 12 | 13 | Virtual memory will map virtual address to real(physical) address. 14 | 15 | Before hardware do the mapping will check permission first, if some memory belongs to kernel that memory can't be read in user mode. 16 | 17 | If program dose such thing will cause page fault exception. 18 | 19 | The permission is recored by PTE(page table entry). 20 | 21 | Below picture is RISC-V's PTE. 22 | 23 | ![Screen Shot 2021-07-24 at 11 25 01 PM](https://user-images.githubusercontent.com/3775525/126873177-04e72b82-9e31-4163-a3b7-e53de305c675.png) 24 | 25 | If the U bit is set to 1 means the memory belongs to user. 26 | 27 | Only when program in kernel mode can create or modify the PTE. 28 | 29 | This is how OS provide memory isolation. 30 | 31 | ## Out-of-order Execution 32 | 33 | Out-of-order execution can let us to read kernel memory from user model but the result is invisible in architectural level. 34 | 35 | When CPU run some slow instructions such as load from memory, CPU will run the following instructions first for performance reason. 36 | 37 | Before the slow instruction is finished, the following instructions' changes are invisible in architectural level. This is out-of-order execution. 38 | 39 | Then after the slow instruction has finished, then CPU will make the state changes visible or rollback if some things go wrong. They call this retirement. 40 | 41 | In out-of-order execution hardware will not check exception such as page fault until when CPU do retirement. 42 | 43 | let's see one example: 44 | ``` c 45 | 1 r0 = 46 | 2 r1 = valid // r1 is a register; valid is in RAM 47 | 3 if(r1 == 1){ 48 | 4 r2 = *r0 49 | 5 r3 = r2 + 1 50 | 6 } else { 51 | 7 r3 = 0 52 | 8 } 53 | ``` 54 | 55 | For line 1 needs to load value from memory to register and that may take hundreds of cycles. 56 | 57 | CPU will not wait idle for the load and it will run lien 4 ~ 5 in parallel. 58 | 59 | Suppose `r0` is some kernel address, we know when CPU exec instructions out of order, it will not check exceptions. 60 | 61 | So `r0`'s value has been loaded in some invisible place. 62 | 63 | After `r1` has been loaded, CPU will do retirement. Hardware will see current mode is user mode and the program is reading a kernel address. 64 | 65 | So we should rollback changes and raise page fault exception. 66 | 67 | Hardware designer thought that kind of changes are invisible to program. 68 | 69 | ## Flush-Reload 70 | 71 | In the Out-of-order execution section, we demonstrated how to read kernel memory in user mode but the result is invisible to us. 72 | 73 | In this section will see how to make the result visible. 74 | 75 | ### Cache 76 | CPU has cache for storing recent accessed data. 77 | 78 | The speed for CPU to read data from cache is much more faster than read data from RAM. 79 | 80 | For example, read from cache needs a dozen cycles, read from RAM needs hundreds cycles. 81 | 82 | If we read some address fast means the address has been used recently and slow means it has not been used recently. 83 | 84 | ### Flush-Reload 85 | 86 | Flush-reload will allow you to check a function used the memory at an address `x` or not. 87 | 88 | The steps are: 89 | 90 | 1. ensure x is not cached 91 | 2. call `f()` 92 | 3. record the time takes, say t1 93 | 4. load a byte from address x 94 | 5. record the time again, say t2 95 | 6. 
if the difference between t1 and t2 is small we can infer x has been used in `f()` 96 | 97 | When CPU do retirement, it will cancel register state but will not flush the cache. 98 | 99 | So we can use flush+reload to get out-of-order execution's internal state. 100 | 101 | ## Meltdown 102 | 103 | Let's see how to read 1 bit at kernel address `r1` 104 | 105 | ``` 106 | 1 char buf[8192] 107 | 2 108 | 3 // the Flush of Flush+Reload 109 | 4 clflush buf[0] 110 | 5 clflush buf[4096] 111 | 6 112 | 7 113 | 8 114 | 9 r1 = 115 | 11 r2 = *r1 // out-of-order 116 | 12 r2 = r2 & 1 // out-of-order 117 | 13 r2 = r2 * 4096 // out-of-order 118 | 14 r3 = buf[r2] // out-of-order 119 | 15 120 | 16 121 | 17 122 | 18 // the Reload of Flush+Reload 123 | 19 a = rdtsc // get time 124 | 20 r0 = buf[0] 125 | 21 b = rdtsc 126 | 22 r1 = buf[4096] 127 | 23 c = rdtsc 128 | 24 if b-a < c-b: 129 | 25 low bit was probably a 0 130 | ``` 131 | 132 | As line 7 is an expensive instruction, CPU will execute lien 9 ~ 14 in parallel. 133 | 134 | `r2 = *r1` will read r1's content. 135 | 136 | Then get the lowest bit through `r2 = r2 & 1`. 137 | 138 | ``` c 139 | r2 = r2 * 4096 140 | r3 = buf[r2] 141 | ``` 142 | 143 | Will cause a read at `buf[0]` or `buf[4096]`. 144 | 145 | For line 19 ~ 25 will do reload part of the flush-reload attack. 146 | 147 | The code will read` r[0]` and get time by `b - a`. 148 | 149 | Then read `r[4096]` and get time takes by `c - b`. 150 | 151 | if `b - a < c - b` means `r[0]` has been read recently and then we can imply r2 == 0, so the lowest bit of address r1 is 0. 152 | 153 | That's how we read one bit of kernel address in user mode. 154 | 155 | ## Additional 156 | 157 | Meltdown only affects some of Intel x86 CPU and ARM CPU and has been fixed. 158 | 159 | Meltdown can get kernel memory relies on user address and kernel address are in same page table. 160 | 161 | OS can fix it by "KAISER"/"KPTI", it doesn't map the kernel in user page table. 162 | 163 | And L1 cache cached virtual address and L3 cache cached physical address. 164 | 165 | When CPU load memory into cache, it loads a trunk of memory. That's why we use ` buf[0]` and `buf[4096]` for checking. 166 | 167 | This article mainly based on 6.S081 2020 Lecture 22 and the meltdown paper. 168 | 169 | For how hardware handle user mode and kernel mode, you can read "The RISC-V Reader" chapter 10. 170 | 171 | Retired explains in stack overflow https://stackoverflow.com/questions/22368835/what-does-intel-mean-by-retired 172 | 173 | Others explains about Meltdown https://www.bilibili.com/video/BV1nb4y1D7ii?from=search&seid=12911120057259798460 174 | 175 | Please let me know if I make any mistakes :pray: 176 | -------------------------------------------------------------------------------- /blog/2021-07-29-user-level-thread-switch.md: -------------------------------------------------------------------------------- 1 | # User Level Thread Switch 2 | 3 | This article will introduce how CPU handle function call and how to do user level thread switch. 4 | 5 | ## Function Call 6 | Let's image a simple computer, all instructions are stored in memory, and CPU will execute the instruction on by one in sequence. 7 | 8 | As we want to pick an instruction to execute, we need to know the instruction's address in memory. 9 | 10 | So we have to find a place to hold such info for CPU, CPU uses registers for this purpose. 
11 | 12 | The registers are much more faster than memory, accessing a register may need one CPU cycle and accessing a memory location may need hundred CPU cycles. 13 | 14 | But CPU just has limited registers, may 32 or 64 or more, but can't get as much as we want. 15 | 16 | The register for storing next instruction's address is `PC`. 17 | 18 | After we executed one instruction, we move the `PC` 4 byte forward (suppose 32 bits for each instruction). 19 | 20 | Then CPU will execute the instruction which is point by `PC`. 21 | 22 | We can describe this in code 23 | 24 | ``` c 25 | instructions = [i0, i1, ....in] 26 | 27 | pc = 0 28 | while true 29 | CPU_exec(instructions[pc]) 30 | pc += 4 31 | ``` 32 | Screen Shot 2021-07-28 at 8 45 34 AM 33 | 34 | Now let's to see how loop works. 35 | 36 | For loop we need exec some instructions again and agin, such as instruction-0, instruction-1, .... instruction-n, then instruction-0, instruction-1, .... instruction-n... 37 | 38 | As we know `PC` is pointed to the next instruction to execute, so after we executed instruction-n, we just need to set `PC` to the instruction-0's address, then cpu will loop again. 39 | 40 | Screen Shot 2021-07-28 at 8 52 01 AM 41 | 42 | It's time for function call. 43 | 44 | Suppose there is some code as below: 45 | 46 | ``` c 47 | 1 foo := 48 | 2 instruction0 49 | 3 call bar 50 | 4 instruction1 51 | 5 instruction2 52 | 6 53 | 7 bar := 54 | 8 instruction0 55 | 9 ret 56 | ``` 57 | 58 | For function `foo`, we need execute some instruction then call `bar` function and execute remain instructions. 59 | 60 | For function `bar`, we need execute some instruction and return. 61 | 62 | Screen Shot 2021-07-28 at 9 10 21 AM 63 | 64 | So we need jump to `bar` and jump back. 65 | 66 | When we jump to `bar`, we just need to update `PC` to `bar`'s location. Then CPU will execute `bar`'s instructions. 67 | 68 | After we finished `bar`'s execution code we need jump back to line 4's address. 69 | 70 | For doing this we need one place to hold the line 4's address (for jumping back). 71 | 72 | We can store it in memory but memory is slow, so we store it in register. This special purpose register is called `ra` usually. 73 | 74 | so the call and ret can be defined as below: 75 | 76 | ``` c 77 | call label := 78 | ra <- pc + 4 // assign next instruction for ret, which is line 4's address in our example 79 | pc <- label // jump to the callee, for foo is bar's address 80 | 81 | ret := 82 | pc <- ra // jump back 83 | ``` 84 | 85 | It works for on level function call. 86 | 87 | But what if we have 3 functions? Likes function `f` calls `foo` and `foo` calls `bar`. 88 | 89 | ``` c 90 | 1 0000 f := exec order ra's value description 91 | 2 0004 instruction-x 0 0 92 | 3 0008 call foo 1 000a 93 | 4 000a instruction-x 94 | 5 000e ret 95 | 6 0010 foo := 96 | 7 0014 instruction-x 2 000a 97 | 8 0016 call bar 3 0001e 98 | 9 001e instruction-x 6 001e 99 | 10 0010 ret 7 001e jump to 001e, line 9 again... 100 | 11 0014 bar := 101 | 12 0018 instruction-x 4 001e 102 | 13 001a ret 5 001e jump to 001e, line 9 103 | ``` 104 | 105 | Let's walk above code step by step. 106 | 107 | For line 3, the ra's value is 000a(at line 3), then we jump to foo function(at line 6). 108 | 109 | Then we execute `call bar` at line 8 and the ra's value will be updated to 001e(at line 9). 110 | 111 | After `bar` has been executed, `ra`'s value is 001e(at line 9), so we jump to line 9. 
112 | 113 | But when we execute line 10, current `ra`'s value is 001e(at line 9), we jump back to line 9 again. 114 | 115 | It's an infinite loop. 116 | 117 | That happens because at line 8 we overwrite the original `pa`'s value. 118 | 119 | We need many places to store the ongoing functions' return addresses but we only have limited register. 120 | 121 | It is impossible to store all those return address into registers, so we store them in memory. 122 | 123 | For each function call we need some small memory to save `ra` and after the function has been executed we will need get the data back and free the memory. 124 | 125 | For function `f` we need allocate memory and store 000e(line 3) to it and for `foo` we need allocate memory and store 0001e(at line 9), 126 | 127 | after `bar` has been executed then we need get last value(0001e at line 9) back and deallocate memory 128 | 129 | then do same thing for value 000e(line 3). 130 | 131 | The operations are push, push and pop, pop, so we can use stack. 132 | 133 | For stack we need allocates some memory(eg: 4kb) and one pointer which points to the top of the stack. 134 | 135 | When we do push, we move the pointer forward and store `ra` to the last location. 136 | 137 | After a function has been called, we pop the value and move the pointer backward. 138 | 139 | As this happens so often, hardware designer provides one register for us, it is called `sp` usually. 140 | 141 | Register `ra`'s value only has meaning in current function, when we executed an inner function such as `bar`, the `ra` should have different value. 142 | 143 | It is a temporary register, such registers should be saved by caller. 144 | 145 | And there is another kind of register which are preserved across function calls, 146 | 147 | so if callee needs to use such register, he needs to save them and restore them back before return to caller. 148 | 149 | This kind of registers are callee saved registers. 150 | 151 | Above is how we handle function call. 152 | 153 | ## Switch Threads 154 | 155 | NOTICE: Those are based on MIT 6.S081 2020 Multithreading lab. 156 | 157 | We need build a function for switching two thread. 158 | 159 | And the function will be called likes `switch(thread1, thread2)`, thread1 is a variable which stores thread 1's state. 160 | 161 | When call `switch` function we will switch from `thread1` to `thread2`. 162 | 163 | ![Screen Shot 2021-07-28 at 9 54 12 PM](https://user-images.githubusercontent.com/3775525/127334554-fa2caa8f-dc62-41f0-a6ae-17078391a623.png) 164 | 165 | When we call `switch`, CPU will begin execute `switch` instructions. 166 | 167 | And after all `switch`'s instructions have been executed, switch will not jump back to is callee place in thread 1, but will jump to other place in thread 2. 168 | 169 | It works as same as function call except it doesn't return back. 170 | 171 | For normal function all, compiler will help us to save register and handle return address. 172 | 173 | For switch, we need handle those stuff by myself. 174 | 175 | First thing is store current thread's state. Suppose we have structures as below: 176 | 177 | ```c 178 | struct context { 179 | uint64 ra; 180 | uint64 register1; 181 | uint64 register2; 182 | ......... 183 | } 184 | 185 | struct thread { 186 | struct context context; 187 | }; 188 | ``` 189 | 190 | and we call `switch` by `switch(&threa_1->context, &thread_2->context)`. 
191 | 192 | We need use assembly code to save thread_1's return address and other registers, 193 | 194 | ```c 195 | switch: 196 | // a0 is thread 1's context address 197 | sd ra, 0(a0) // save thread_1->context.ra into ra register 198 | sd register1, 8(a0) // save thread_2->context.register1 into register1 199 | sd register2, 16(a0) // save thread_2->context.register2 into register2 200 | ... 201 | ``` 202 | 203 | Then we need to load thread_2's context into those registers 204 | ``` 205 | // a1 is thread 2's context address 206 | ld ra, 0(a1) // load thread_2->context.ra into ra register 207 | ld register1, 8(a1) // load thread_2->context.register1 into register1 208 | ld register2, 16(a1) // load thread_2->context.register2 into register2 209 | ``` 210 | 211 | Above code will allow us switch thread1 to thread2, but it doesn't work correlty. 212 | 213 | Because we can have many threads but only have one stack. 214 | 215 | We need different stacks for different user level threads and when we switch thread we need switch stack too. 216 | 217 | So let add one field for storing the stack, the thread structure becomes: 218 | 219 | ```c 220 | struct thread { 221 | char stack[MAX_STACK_SIZE]; // toy code, do not handle stack overflow 222 | struct context context; 223 | }; 224 | ``` 225 | 226 | and as we have different stack, we need add sp to `context` 227 | 228 | ```c 229 | struct context { 230 | uint64 ra; 231 | uint64 sp; 232 | uint64 register1; 233 | uint64 register2; 234 | ......... 235 | } 236 | ``` 237 | 238 | and `switch` becomes: 239 | 240 | ```c 241 | switch: 242 | // a0 is thread 1's context address 243 | sd ra, 0(a0) // save thread_1->context.ra into ra register 244 | sd sp, 8(a0) // save thread_1->context.sp into sp register 245 | sd register1, 16(a0) // save thread_1->context.register1 into register1 246 | 247 | ... 248 | 249 | // a1 is thread 2's context address 250 | ld ra, 0(a1) // load thread_2->context.ra into ra register 251 | ld sp, 8(a1) // load thread_2->context.ra into sp register 252 | ld register1, 16(a1) // load thread_2->context.register1 into register1 253 | 254 | ... 255 | ``` 256 | 257 | When we create a user level thread, we need set the stack's address for the thread's `context.sp` field and set `context.ra` to the function we want to execute. 258 | 259 | So when we switch to the thread, CPU will jump to the `thread->context.ra` and use `thread->context.sp` as its stack. 260 | 261 | The memory layout as below: 262 | 263 | ![Screen Shot 2021-07-29 at 9 20 34 AM](https://user-images.githubusercontent.com/3775525/127416803-5ca48c79-d27e-43e2-9027-5d77f9b157b8.png) 264 | 265 | From this section, we know if we want to have user level thread, 266 | 267 | we need to allocate memory for each user level thread's stack and handle `ra` and other callee registers in `switch` function. 268 | 269 | That's how we handle user level thread switching. 
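To make the creation step described above concrete, here is a minimal sketch of setting up a new user-level thread. It assumes the toy `struct thread` and `struct context` shown earlier; the function name `thread_create` and the constant `MAX_STACK_SIZE` are illustrative, not taken from the xv6 lab code.

```c
// Minimal sketch, assuming the toy `struct thread` / `struct context` above.
// When switch() later loads this context, its final `ret` jumps to `func`,
// so the new thread starts running there on its own stack.
void thread_create(struct thread *t, void (*func)(void)) {
    t->context.ra = (uint64)func;                         // where switch() will "return" to
    t->context.sp = (uint64)(t->stack + MAX_STACK_SIZE);  // stack grows downward from the top
}
```

If `func` ever returns there is no caller to return to, so a real implementation would also arrange for `ra` to eventually reach an exit/cleanup routine; the sketch ignores that detail.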
270 | 271 | ## References 272 | 273 | - [What are callee and caller saved registers?](https://stackoverflow.com/questions/9268586/what-are-callee-and-caller-saved-registers/16265609#16265609) 274 | 275 | - [MIT 6.S081 2020 Multithreading lab](https://pdos.csail.mit.edu/6.S081/2020/labs/thread.html) 276 | -------------------------------------------------------------------------------- /blog/2021-08-11-how-linux-finds-physical-address-through-virtual-address.md: -------------------------------------------------------------------------------- 1 | # How Linux finds physical address through virtual memory 2 | 3 | ## Why 4 | 5 | Hardware can map virtual address to physical address or use physical address directly. 6 | 7 | The virtual address to physical address mapping is stored in memory. 8 | 9 | When virtual address is enabled, hardware will help us do the virtual address to physical address mapping. 10 | 11 | Sometimes, kernel needs do same thing. 12 | 13 | For example when kernel allocates new memory for process kernel needs walk through and setup the mapping. 14 | 15 | So kernel needs to know how page table works. 16 | 17 | Now let's see how the hardware handle the virtual addresses. 18 | 19 | ## x86 4-level paging 20 | 21 | X86 supports different kinds of pages, they are similar. So let's consider 4 levels paging and 4KB page only. 22 | 23 | ![Screen Shot 2021-08-11 at 9 14 57 PM](https://user-images.githubusercontent.com/3775525/129035231-13bdac79-ebf8-4b1f-8e51-e988ddfa3eee.png) 24 | 25 | As the image above, register `CR3` is point to start address of global(L3) directory page, and virtual address' 47 ~ 39 bits will used for global(L3) directory page(L3)'s offset. 26 | 27 | Then the content will point to next level directory's start address, then we can use 38 ~ 30 bits of virtual address for offset of upper(L2) directory page. 28 | 29 | Then middle(L1) directory page and finally we arrive at page table(L0). 30 | 31 | ### Linux `follow_page` method 32 | 33 | Linux `follow_page` is used for doing same thing. 34 | 35 | Main follow as before: 36 | 37 | ``` c# 38 | // in mm/gup.c 39 | follow_page(vma, address, flags) 40 | pgd = pgd_offset(mm, address); 41 | follow_p4d_mask(vma, address, pgd, flags, ctx); 42 | p4d = p4d_offset(pgdp, address); // level 2 43 | follow_pud_mask(vma, address, p4d, flags, ctx); 44 | pud = pud_offset(p4dp, address); 45 | follow_pmd_mask(vma, address, pud, flags, ctx); 46 | pmd = pmd_offset(pudp, address); // level 1 47 | follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); 48 | ptep = pte_offset_map(mm, pmd, address, &ptl); // level 0 49 | pte = *ptep; 50 | page = vm_normal_page(vma, address, pte); 51 | ``` 52 | 53 | `pgd_offset`'s defination is: 54 | 55 | ``` c 56 | #define PGDIR_SHIFT 39 57 | #define PTRS_PER_PGD 512 58 | 59 | #define pgd_offset(mm, address) pgd_offset_pgd((mm)->pgd, (address)) 60 | 61 | static inline pgd_t *pgd_offset_pgd(pgd_t *pgd, unsigned long address) 62 | { 63 | return (pgd + pgd_index(address)); 64 | }; 65 | 66 | #define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)) 67 | 68 | ``` 69 | 70 | `#define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))` will shift right 39 bits then mark out 9 bits which will give us 47 ~ 39 bits of a virtual address. 71 | 72 | The result is the offset of the page global directory's offset. 73 | 74 | We add it to `pgd` by `pgd + pgd_index(address)` and the result is the next level directory's start address. 
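To make the bit arithmetic concrete, here is a small standalone program (not from the kernel; the address is arbitrary) that extracts the four indices plus the page offset exactly the way the macros above do for 4-level paging:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t addr = 0x00007f1234567abcULL;      /* an arbitrary user-space address */

    uint64_t pgd_index = (addr >> 39) & 0x1ff;  /* bits 47..39, like pgd_index() */
    uint64_t pud_index = (addr >> 30) & 0x1ff;  /* bits 38..30 */
    uint64_t pmd_index = (addr >> 21) & 0x1ff;  /* bits 29..21 */
    uint64_t pte_index = (addr >> 12) & 0x1ff;  /* bits 20..12 */
    uint64_t offset    = addr & 0xfff;          /* bits 11..0  */

    printf("pgd=%llu pud=%llu pmd=%llu pte=%llu offset=0x%llx\n",
           (unsigned long long)pgd_index, (unsigned long long)pud_index,
           (unsigned long long)pmd_index, (unsigned long long)pte_index,
           (unsigned long long)offset);
    return 0;
}
```

Each `*_offset` helper below adds one of these indices to the table address obtained from the previous level, which is the walk `follow_page` performs.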
75 | 76 | 77 | `p4d_offset` is used for 5-level paging, code as below: 78 | 79 | ``` c 80 | static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) 81 | { 82 | if (!pgtable_l5_enabled()) 83 | return (p4d_t *)pgd; 84 | return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address); 85 | } 86 | ``` 87 | 88 | `pgtable_l5_enabled()` will return `false` we use level 4 paging, so `p4d_offset` will just return `pgd` back. 89 | 90 | Then `pud_offset` for the upper directory, then `pmd_offset` for the middle directory and finally reached the last level by calling `pte_offset`. 91 | 92 | The calling is follow_page: pgd_offset -> p4d_offset -> pud_offset -> pmd_offset -> pte_offset. 93 | 94 | ## Paging 95 | There are different kinds of paging in x84: 32-bit paging, PAE paging, 4-level paging and 5-level paging. 96 | 97 | Basically they are same, but just uses different bits for finding pages. 98 | 99 | More detail can be fond in `Intel® 64 and IA-32 Architectures Software Developer’s Manual` or `Understanding Linux Kernel`. 100 | 101 | So I will not explain all of them. 102 | 103 | There are many interesting things about page table or addressing such as copy on write and mmap. 104 | 105 | I will explain some of them in the following articles. 106 | -------------------------------------------------------------------------------- /blog/2022-02-06-tcp-congestion-control.md: -------------------------------------------------------------------------------- 1 | # TCP Congestion Control Brief Description by Pseudo Code 2 | 3 | ## Introduction 4 | TCP 需要在异构,动态的网络环境下,通过不可靠的下层服务(IP),提供可靠(reliable)的服务,而拥塞控制是 TCP 必不可少的机制。 5 | 6 | TCP 拥塞控制本身是一个非常困难,也非常有趣的问题。 7 | 8 | 本文主要根据 RFC 2001[1] 对 TCP 拥塞控制的状态变化进行一个简单的描述。 9 | 10 | ## TCP Congestion Control Brief Description 11 | TCP 拥塞控制通过调整 cwnd(congestion window) 对流量进行控制,未经过确认(ack)的数据总和不能超过 cwnd 的大小。 12 | 13 | TCP 拥塞控制有三个状态,在不同的状态下,会使用不同的方式调整 cwnd 的大小。 14 | 15 | ### State Changes 16 | #### Slow Start and Congestion Avoidance 17 | slow start 和 congestion avoidance 这两个状态的转换由 cwnd 和 ssthresh(slow start threshold) 控制。 18 | 19 | 当 cwnd > ssthresh 则由 slow start 转换为 congestion avoidance。 20 | 21 | 在 slow start 状态下,当收到新的 ack 的时候,cwnd 会成指数增长,而 congestion avoidance 状态下,则成线性增长。 22 | 23 | 在当前的状态是 slow start 或者是 congestion avoidance 的时候,如果收到 duplicate ACK, 则会将 ssthresh 设置为 window size( min{cwnd, rwnd} )的一半。 24 | 25 | #### Fast Recovery 26 | 当收到 3 个或者更多的重复的 ACK 的时候,则进入 Fast Recovery 状态,在收到新的 ACK 的时候,则 Fast Recovery 结束,进入 Congestion Avoidance。 27 | 28 | 当进入 Fast Recovery 的时候,会将 threshold 设置为 window size( min{cwnd, rwnd} )的一半,重传丢失的 segment,将 cwnd 设置为 ssthresh + 3 倍的 segment size。 29 | 30 | #### Timeout 31 | 当 retransmit timer timeout 的时候,则会将 ssthresh 设置为 window size( min{cwnd, rwnd} )的一半,并将 cwnd 设置为 segment size。即从新开始 slow start。 32 | 33 | ### How Congestion Window Changes 34 | 图片截取自 MIT 6829 Computer Networking L8 [4] 35 | ![Screen Shot 2022-02-06 at 6 26 13 PM](https://user-images.githubusercontent.com/3775525/152676638-f346dd6d-7d2c-4d8c-984c-d1b618085d94.png) 36 | 37 | ### Pseudo Code 38 | TCPCongestionControl#receive 在收到 ack 或者 retransmit timer timeout 的时候会被触发。 39 | 40 | 可以理解成一种 callbck,在有收到 ack 的时候会被调用,在 timeout 事件发生的时候,也会被调用。类似于 Erlang 的 receive。 41 | 42 | ```ruby 43 | class TCPCongestionControl 44 | # rwnd receiver's advertised window 45 | def initialize(segment_size) 46 | # segment size announced by the other end, or the default, 47 | # typically 536 or 512 48 | @segment_size = segment_size 49 | @cwnd = segment_size 50 | @ssthresh = 65535 # bytes 51 | @in_fast_recovery = 
false 52 | # other logic .... 53 | end 54 | 55 | def receive 56 | # indicate congestion 57 | if current_algorithm == :slow_start && new_ack? 58 | # exponential growth 59 | @cwnd += @segment_size 60 | # may go_to congestion_avoidance state 61 | # Slow start continues until TCP is halfway to where it was when congestion 62 | # occurred (since it recorded half of the window size that caused 63 | # the problem in step 2), and then congestion avoidance takes over. 64 | elsif current_algorithm == :congestion_avoidance && new_ack? 65 | # linear growth 66 | @cwnd += segsize * segsize / @cwnd 67 | elsif three_or_more_duplicate_ack? 68 | # TCP does not know whether a duplicate ACK is caused by a lost segment or just a reordering of segments 69 | # it waits for a small number and assume it is just a a reordering of segments 70 | # 3 or more duplicate ACKs are received in a row, 71 | # it is a strong indication that a segment has been lost 72 | # go_to fast recovery 73 | @in_fast_recovery = true 74 | @ssthresh = current_window_size / 2 75 | retransmit_missing # retransmit directly without waiting timer 76 | @cwnd = @ssthresh + 3 * @segment_size 77 | # This inflates the congestion window by the number of segments that have left the network and which the other end has cached (3). 78 | elsif current_algorithm == :fast_recovery && new_ack? 79 | @cwnd = @ssthresh # go_to congestion_avoidance 80 | @in_fast_recovery = false 81 | elsif (current_algorithm == :slow_start || current_algorithm == :congestion_avoidance) && duplicate_ack? 82 | @ssthresh = current_window_size / 2 # go_to congestion avoidance 83 | elsif current_algorithm == :fast_recovery && duplicate_ack? 84 | @cwnd += @segment_size 85 | # This inflates the congestion window for the additional segment that has left the network. Transmit a packet, if allowed by the new value of cwnd. 86 | elsif timeout? 87 | # slow_star and congestion_avoidance should come into here 88 | # not sure when timout occurs when fast_recevery should come into here too... 89 | @ssthresh = current_window_size / 2 90 | @cwnd = segment_size # go_to slow start 91 | end 92 | 93 | maybe_transmit_packet 94 | end 95 | 96 | private 97 | def current_algorithm 98 | return :fast_recovery if @in_fast_recovery 99 | 100 | @cwnd <= @ssthresh ? :slow_start : :congestion_avoidance 101 | end 102 | 103 | def current_window_size 104 | min(@cwnd, @rwnd) 105 | end 106 | end 107 | ``` 108 | 109 | ## Relative 110 | Congestion Avoidance and Control[2],对用塞控制背后的原理有详细的解释,不过比较难读(我只读懂了部分)。 111 | 112 | Computer Networking: A Top Down Approach 和 Principles of Computer System 都对 TCP 拥塞控有所描述。更深入的可以参考 MIT 的 Computer Networking 课程[3]。 113 | 114 | 在操作系统中,也有类似的问题。之前,在 package 超过系统负载能力的时候,CPU 都被用在了处理中断,而没有时间执行应用层的逻辑,导致没办法完整的处理一个 package。和 Congestion Avoidance and Control 有相似的思路,但是用了不同的方法[4]。 115 | 116 | 工业上,类似的机制有:流量控制(rate limiter),背压(Back pressure)。Facebook 在 memcache cluster 中有使用类似的机制[5]。 117 | 118 | Netfix 有开源 Hystrix[6],但现在已经被 concurrency-limits[7] 所替代,而 concurrency-limits 使用类似 TCP congestion control 的机制。 119 | 120 | ## 参考 121 | - [1] W. Stevens. RFC2001: TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms, 1997 122 | - [2] V. Jacobson, Congestion Avoidance and Control 123 | - [3] MIT 6829 Computer Networking L8 End-to-End Congestion Control 124 | https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-829-computer-networks-fall-2002/lecture-notes/ 125 | - [4] Jeffrey Mogul. 
Eliminating Receive Livelock in an Interrupt-driven Kernel, 1996 126 | - [5] Rajesh Nishtala. Scaling Memcache at Facebook, 2013 127 | - [6] Netflix Hystrix https://github.com/Netflix/Hystrix 128 | - [7] Netflix concurrency-limits https://github.com/Netflix/concurrency-limits 129 | 130 | 131 | 132 | 133 | -------------------------------------------------------------------------------- /blog/2022-05-03-slack-incident-reading-note.md: -------------------------------------------------------------------------------- 1 | # Slack’s Incident on 2-22-22 Reading Note 2 | 3 | ## Why reading incident report is interesting? 4 | 5 | We can learn many things from an incident report, such as how to react to an incident, review architecture, and analyze problems in a real case. 6 | 7 | Have to say, it is much more enjoyable to learn from others' incident report. 8 | 9 | ## Affect 10 | 11 | On February 22, 2022 from 6:00 AM PST to 9:14 AM PST, some customers can't access Slack[2]. 12 | 13 | ## Rough Timeline 14 | 15 | 1. received user tickets 16 | 2. received alarms 17 | 3. detected database suffers higher load than usual 18 | 4. query timeout, DB overloaded 19 | 5. decreased the requests acceptance rate => DB back to normal 20 | 6. increased the rate => DB overloaded again 21 | 7. decreased the rate => back to normal => increase by small steps 22 | 23 | ## The Architecture 24 | 25 | The backend server will query the cache first, if the cache missed, query the DB. 26 | 27 | ``` 28 | v = get(k) 29 | 30 | if v is nil { 31 | v = fetch from DB 32 | set(k, v) 33 | } 34 | ``` 35 | 36 | ![Screen Shot 2022-04-29 at 12 26 31 AM](https://user-images.githubusercontent.com/3775525/165799822-50bc0477-60ce-49ee-98b2-4efd6ae6c91d.png) 37 | 38 | Clients(app servers) will send requests to the Mcrouter, then the Mcrouter will proxy the requests to Memcached nodes. 39 | 40 | Consul is used for discovering Memcached nodes. And mcrib will watch Consul for getting alive Memcached nodes. 41 | 42 | Then the info will be used by mcrouter for selecting memcached nodes. 43 | 44 | > When the agent(Consul) restart occurs on a memcached node, the node that leaves the service catalog gets replaced by Mcrib. The new cache node will be empty. 45 | 46 | I'm not really understand this sentence. 47 | 48 | It seems that the consul agent will be deployed on some Memcached nodes, and when the Cousul agent restarts, the Memcached node will be replaced by a new empty node. 49 | 50 | Screen Shot 2022-07-02 at 22 22 52 51 | 52 | ## What happened 53 | 54 | ### Cache failure causes hit rate to decrease 55 | 56 | Upgrade Consul agent => memcached nodes are replaced by empty nodes => cache hit decreased 57 | 58 | ### Cache failure => DB heavy read load + read amplification => DB overload 59 | 60 | A Cache can reduce a lot of loads for DB. When cache failure happens, DB will suffer much more load than usual. 61 | 62 | For Slack, one function/API needs to query many DB shards as it is not sharded well. 63 | 64 | Those caused the DB to be overloaded. 65 | 66 | ### Cascading failure 67 | Then cascading failure happened. 68 | 69 | ![Screen Shot 2022-04-29 at 9 35 05 AM](https://user-images.githubusercontent.com/3775525/165872687-9f9966e7-d2bb-41bc-a095-bae4e874c179.png) 70 | 71 | 72 | ## Thinking in general 73 | 74 | Instead of focusing on the incident closely we can make the problem more general and see what we can learn from previous experiences. 
75 | 76 | ### Cache failure 77 | 78 | After we add a cache layer it will have two effects: speed up requests and reduce DB read load[4]. 79 | 80 | Most applications are read-heavy, for example in Facebook, they have two orders of magnitude more reads than writes[3]. 81 | 82 | Another fact is cache is much faster say 10 times than DB usually. 83 | 84 | So if the cache crashed, DB will suffer a really heavy load most time. That means we need to take cache failure seriously. 85 | 86 | For caching, Facebook has published a paper[3] about how they design their cache system and make it available all the time. 87 | 88 | ### High availability 89 | 90 | For achieving high availability, we need to ask "what happens if it fails?"[8]. 91 | 92 | The system may get stuck(livelock[6]) or crash(cascading failure) by any component failure. 93 | 94 | We have two options, fix all potential issues or make the system recover fastly. 95 | 96 | Options one is infeasible because there are too many components and too many potential issues. 97 | 98 | But we can achieve fast recovery, that coverts unspecific problems into specific ones. 99 | 100 | For Google GFS[7], they think "component failures are the norm rather than the exception.", and they handle availability problems through fast recovery and replication. 101 | 102 | ### Overload control 103 | 104 | Most problems are caused by overload and most time the loads are not predictable. 105 | 106 | To avoid such problems, we can design some mechanisms to protect our systems. 107 | 108 | The most obvious way is rate-limiter, it sets a hard value and if throughput is over the value, the rate-limter will drop the over-part. 109 | 110 | One inconvenience of the rate-limiter is we need to change it manually, as we can see in this incident. 111 | 112 | Microservices have complexity dependency whose parts are dynamic changed(software upgrade, hardware config update, or something else), so it's impossible to find a good value for all situations. 113 | 114 | Netflix created [concurrency-limits](https://github.com/Netflix/concurrency-limits) based on such observations. 115 | 116 | Another thing we need to consider is how to provide the best experience for users when we start to drop their requests. 117 | 118 | "Overload Control for Scaling WeChat Microservices"[5] has a good answer to this question. 119 | 120 | In the paper, they drop requests based on features and users, and they use queue time to detect overload config the limit value dynamically. 121 | 122 | ## References 123 | 1. Slack’s Incident on 2-22-22 124 | https://slack.engineering/slacks-incident-on-2-22-22/ 125 | 2. Slack status history 126 | https://status.slack.com/2022-02-22 127 | 3. Scaling Memcache at Facebook at NSDI '13 128 | https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/nishtala 129 | 4. MIT 6.824 Spring 2022 LEC 16 130 | https://pdos.csail.mit.edu/6.824/schedule.html 131 | 5. Hao Zhou. Overload Control for Scaling WeChat Microservices, 2018 132 | 6. J. C. Mogul and K. Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. 1997. 133 | 7. Sanjay Ghemawat. The Google File System 2003 134 | 8. J Armstrong. A history of Erlang. 
2007 135 | -------------------------------------------------------------------------------- /blog/2022-10-05-jemalloc.md: -------------------------------------------------------------------------------- 1 | # How JeMalloc Works 2 | 3 | ## Introduction 4 | 5 | A memory allocator needs to suit hardware(multi-processor + cache line) and use memory efficiently(fast + less fragment). 6 | 7 | By default, JeMalloc uses Thread-Specific-Data and mutiple arenas for every CPU for reducing lock contention. 8 | 9 | For handling cache line related things, JeMalloc relies on multiple arenas and allows applications to pad the allocations. 10 | 11 | JeMalloc allocates different size memory from different memory blocks(slab/extent) which can reduce fragmentation and locate memory quickly. 12 | 13 | This article will introduce how JeMalloc 5.2.1[1] allocates and deallocates memory. And for simplicity, I will focus on small memory allocation and will not cover all details. 14 | 15 | ## Big Picture -- The Hierarchy 16 | 17 | We can image the memory managed in 3 levels: `tcache`, `arena`, and OS. It works like CPU cache -- memory will be allocated from a higher level, if the higher level doesn't have available memory, memory will be allocated from the next level and fill in the higher level, now the memory can be allocated. 18 | 19 | Screen Shot 2022-09-25 at 11 31 51 20 | 21 | `tcache` has many cache_bins, `cache_bin`'s `avail` field will point to regions -- `region` is a unit of memory and will be returned when an application calls `malloc`. 22 | 23 | `arena` has many extents/slabs, a slab will have many regions for small memory. When there is no available memory in a `cache_bin`, the `cache_bin` will try to get memory from `extent/slab.bin` through the current `arena`. If the arena doesn't have an available extent/slab too, JeMalloc will try to allocate memory from OS. 24 | 25 | Screen Shot 2022-09-25 at 11 44 50 26 | 27 | Now let's see how JeMalloc's every component works. 28 | 29 | ## Small Memory Allocation 30 | 31 | ### TCache 32 | 33 | Screen Shot 2022-09-26 at 08 36 28 34 | 35 | When an application allocates memory, JeMalloc will try to find available memory from `tcache`, `tcache` is thread-specific data, as the data is owned by a thread, we do not need to lock it when we use it. The thread-specific data is get by `pthread_setspecific` and set by `pthread_getspecific`. 36 | 37 | Relate data structures are `tsd`, `tcache` and `cache_bin`. `tsd` is used for fast access `tcache` and others. `tcache` has many `cache_bin`, different `cache_bin` is used for allocating different size memory. 38 | 39 | For example, `tcache.bins_small[0]` is used for allocating 0 ~ 8 bytes memory, `tcache.bins_small[1]` is for 9 ~ 16 bytes, the calculation is in `sz_size2index_compute`. 40 | 41 | After finding the right `cache_bin` through size, we can find the available memory address through `cache_bin.avail`, it contains the same size memory(JeMalloc calls this region) addresses, the code is in `cache_bin_alloc_easy`. 42 | 43 | If the `cache_bin` doesn't have available memory, JeMalloc needs to get memory from the next level -- arena, fill the `cache_bin`, and allocate the memory again. Simplified code as below: 44 | 45 | ``` c 46 | JEMALLOC_ALWAYS_INLINE void * 47 | tcache_alloc_small(tsd_t *tsd, arena_t *arena, tcache_t *tcache, 48 | size_t size, szind_t binind, bool zero, bool slow_path) { 49 | // ... 
50 | bin = tcache_small_bin_get(tcache, binind); 51 | ret = cache_bin_alloc_easy(bin, &tcache_success); 52 | assert(tcache_success == (ret != NULL)); 53 | if (unlikely(!tcache_success)) { 54 | bool tcache_hard_success; 55 | arena = arena_choose(tsd, arena); 56 | 57 | ret = tcache_alloc_small_hard(tsd_tsdn(tsd), arena, tcache, 58 | bin, binind, &tcache_hard_success); 59 | } 60 | // ... 61 | 62 | return ret; 63 | } 64 | ``` 65 | 66 | ### Arena 67 | 68 | Screen Shot 2022-09-26 at 08 37 57 69 | 70 | `arena` is used for managing `extent`, `extent` is a unit of memory, it may contain multiple OS pages. `extent` is allocated from or deallocated to OS through OS provides system call such as `mmap`, relates code is in `src/pages.c`. And small size `extent` is called `slab` in JeMalloc. 71 | 72 | Likes `tcache`, `arena` has many `bin` which is used for managing different size memory for the application. `bin` has a `slab` pointer( `bin.slabcur`) and a collections of non-full slab/extent(`bin.slabs_nonfull`). 73 | 74 | When JeMalloc trying to find available memory in a `bin`, JeMalloc will try `bin.slabcur` first, if fails it will try to get a `slab` in `bin.slabs_nonfull` and assign it to the `bin.slabcur`. 75 | 76 | And if `bin.slabs_nonfull` has no available extent/slab too, JeMalloc will try to find the memory in `arena`'s extents(the order is `extents_dirty` -> `extents_muzzy` -> `extents_retained` -> `extent_avail`, `extents_dirty`, `extents_muzzy` and `extents_retained` are used for purging, will cover in other sections). If all those do not have available extents, JeMalloc will allocate memory from the OS. 77 | 78 | ## Purging 79 | When an application frees memory, JeMalloc needs to determine how to return the memory. JeMalloc will return memory to `tcache` first. When `tcache` is full, JeMalloc will return free memory to `extent`, and arena will handle how and when returning the unused extents to the system. 80 | 81 | ### Application -> `tcache` -> extent 82 | When an application frees a small memory, the trace will be `je_free` -> `ifree` -> `idalloctm` -> `tcache_dalloc_small`. 83 | 84 | In `tcache_dalloc_small`, the deallocation relates code is: 85 | 86 | ```c 87 | JEMALLOC_ALWAYS_INLINE void 88 | tcache_dalloc_small(tsd_t *tsd, tcache_t *tcache, void *ptr, szind_t binind, 89 | bool slow_path) { 90 | // ... 91 | if (unlikely(!cache_bin_dalloc_easy(bin, bin_info, ptr))) { 92 | tcache_bin_flush_small(tsd, tcache, bin, binind, 93 | (bin_info->ncached_max >> 1)); 94 | bool ret = cache_bin_dalloc_easy(bin, bin_info, ptr); 95 | assert(ret); 96 | } 97 | // ... 98 | } 99 | ``` 100 | 101 | `cache_bin_dalloc_easy` will retun memory to current `cache_bin`. If `cache_bin` is full (`bin->ncached == bin_info->ncached_max`), JeMalloc will call `tcache_bin_flush_small`. 102 | 103 | In `tcache_bin_flush_small`, it will return half(`bin_info->ncached_max >> 1` in tcache_dalloc_small) to extent, `arena_dalloc_bin_locked_impl` will do this work: 104 | 105 | 106 | ```c 107 | static void 108 | arena_dalloc_bin_locked_impl(tsdn_t *tsdn, arena_t *arena, bin_t *bin, 109 | szind_t binind, extent_t *slab, void *ptr, bool junked) { 110 | // ... 111 | if (nfree == bin_info->nregs) { 112 | arena_dissociate_bin_slab(arena, slab, bin); 113 | arena_dalloc_bin_slab(tsdn, arena, slab, bin); 114 | } else if (nfree == 1 && slab != bin->slabcur) { 115 | arena_bin_slabs_full_remove(arena, bin, slab); 116 | arena_bin_lower_slab(tsdn, arena, slab, bin); 117 | } 118 | // ... 
119 | } 120 | ``` 121 | 122 | If `nfree == bin_info->nregs`, it means slab is empty, so need to call `arena_dissociate_bin_slab` and then `arena_dalloc_bin_slab`. `arena_dalloc_bin_slab` will append the free slab(extent) to current `arena->extents_dirty` by calling `arena_extents_dirty_dalloc`. 123 | 124 | 125 | ## extent -> system 126 | 127 | JeMalloc is a user level memory allocator, we need to consider other applications' memory usage in the current system, so JeMalloc needs to return free memory back when necessary. And after an arena returns free memory to the system, other arenas can use it too. 128 | 129 | For returning memory to system, we need to call a system call and in the system call, the system needs to zero the memory and mark the memory as unused. If we return too much memory at one step, it may hurt performance. 130 | 131 | And when an `extent` is free, the application may use it in a very short time. For such a case, we'd better not return to the system too quickly. 132 | 133 | JeMalloc uses "decay" to handle those things. We already know free slab(extent) is appended to `arena->extents_dirty` by `arena_extents_dirty_dalloc`. Extents in `arena->extents_dirty` will be moved to `arena->extents_muzzy` and the extents in `arena->extents_muzzy` will be returned to the system. 134 | 135 | Let's see how the "decay" works and then see how the extents are returned from `arena->extents_dirty` to `arena->extents_muzzy`, and then returned to the system. 136 | 137 | ### Decay 138 | JeMalloc uses a `ticker` to count down how many times an arena allocated/deallocated memory. And when the `ticker` reaches 0, it will trigger `arena_decay` which will calculate how long passed since the last time and call `arena_decay` to do purging work when needed. 139 | 140 | The ticker is initialized by `ticker_init(&arenas_tdata[i].decay_ticker, DECAY_NTICKS_PER_UPDATE);` in `arena_tdata_get_hard`, the value of `DECAY_NTICKS_PER_UPDATE` is 1000. 141 | 142 | `arena_decay_ticks` is used for handling tick related thing and it will be called when memory is allocated or deallocated, such as by `arena_malloc_small`. 143 | 144 | For `arena_decay_ticks`, it will call `ticker_ticks(decay_ticker, nticks)` for updating the count. After subtract `nticks` (`ticker->tick -= nticks`), if the `ticker->tick` is less than 0, JeMalloc will reset it to 0 and return true. Then `arena_decay` will be called. 145 | 146 | ``` c 147 | JEMALLOC_ALWAYS_INLINE void 148 | arena_decay_ticks(tsdn_t *tsdn, arena_t *arena, unsigned nticks) { 149 | // ... 150 | if (unlikely(ticker_ticks(decay_ticker, nticks))) { 151 | arena_decay(tsdn, arena, false, false); 152 | } 153 | } 154 | ``` 155 | 156 | After we know how the `arena_decay` is trigged, let's see how `arena_decay` works. 157 | 158 | For high level, JeMalloc wants to return memory back smoothly, it means when JeMalloc find n unused pages, instead of returning the n pages in one time, it will return them in serval seconds(default dirty decay is 10 seconds and stored in `DIRTY_DECAY_MS_DEFAULT`) and in several steps(default is 200 and it is in `SMOOTHSTEP_NSTEPS`.) 159 | 160 | In `struct arena_decay`, `backlog` is used for recording how many unused dirty pages were generated during each of the past SMOOTHSTEP_NSTEPS decay epochs. `nunpurged` is used for recording the number of unpurged pages at beginning of current epoch. 
161 | 162 | The current epoch `backlog` is calculated by `current_npages - decay->nunpurged` and `backlog` saves the number in reverse order, so the current epoch number is stored in the last index, see `arena_decay_backlog_update_last`. 163 | 164 | And when epochs passed, JeMalloc will move the backlog forward by `memmove(decay->backlog, &decay->backlog[nadvance_z], (SMOOTHSTEP_NSTEPS - nadvance_z) * sizeof(size_t));`, and as time may pass more than one epoch, needs set uncalculated backlog to 0 by `memset(&decay->backlog[SMOOTHSTEP_NSTEPS - nadvance_z], 0, (nadvance_z-1) * sizeof(size_t));`, all those are in `arena_decay_backlog_update`. 165 | 166 | JeMalloc uses `Smoothstep`[2] function to calculate how many pages need to remain. For example, when x = 1, y = 1, that means no pages need to be purged, and when x = 0, y = 0, that means all the pages need to be purged. 167 | 168 | ![220px-Smoothstep_and_Smootherstep svg](https://user-images.githubusercontent.com/3775525/193720890-e69d2517-2560-4f76-a2f7-1d2067805dfd.png) 169 | 170 | 171 | For avoiding calculation, JeMalloc generates those values by `smoothstep.sh`, and the value is encoded by binary fixed point representation and then they are stored in `h_steps`. 172 | 173 | `arena_decay_backlog_npages_limit` is used for calculating how many pages need to remain. 174 | 175 | ``` c 176 | static size_t 177 | arena_decay_backlog_npages_limit(const arena_decay_t *decay) { 178 | // ... 179 | /* 180 | * For each element of decay_backlog, multiply by the corresponding 181 | * fixed-point smoothstep decay factor. Sum the products, then divide 182 | * to round down to the nearest whole number of pages. 183 | */ 184 | sum = 0; 185 | for (i = 0; i < SMOOTHSTEP_NSTEPS; i++) { 186 | sum += decay->backlog[i] * h_steps[i]; 187 | } 188 | npages_limit_backlog = (size_t)(sum >> SMOOTHSTEP_BFP); 189 | 190 | return npages_limit_backlog; 191 | } 192 | 193 | ``` 194 | 195 | ### dirty -> muzzy and muzzy -> system 196 | 197 | Now let's see how the extents are moved from `dirt` to` muzzy` and how `muzzy` returns them to the system. 198 | 199 | `arena_decay` will handle dirty and muzzy extents by calling `arena_decay_dirty` and `arena_decay_muzzy`. `arena_maybe_decay` will do the decay job for `arena_decay_dirty` and `arena_decay_muzzy`. 200 | 201 | In `arena_maybe_decay`, if `decay->time_ms` is 0, will call `arena_decay_to_limit` and the `limit` params is 0, that means deallocate all free extents. 202 | 203 | And if `decay->time_ms` is not zero, will handle epoch advance related things, calculate limit by `arena_decay_backlog_npages_limit(decay)` and use the limit to call `arena_decay_to_limit`. 204 | 205 | `arena_decay_stashed` will do purging work for `arena_decay_to_limit`. 206 | 207 | In `arena_decay_stashed`, it will check ‘extents' state, if the state is `extent_state_dirty`, it will add the extents to `arena->extents_muzzy`, and if the state is `extent_state_muzzy`, the extent will return to system by `extent_dalloc_wrapper`. 208 | 209 | Default dirty decay time 10 seconds(`DIRTY_DECAY_MS_DEFAULT`) and default muzzy decay time is 0(MUZZY_DECAY_MS_DEFAULT). 210 | 211 | ## Others 212 | 213 | Memory allocation is important for application performance, there are many different implementation[3][4][5][6] or designed for special purpose[7][8]. 214 | 215 | JeMalloc provides per CPU arena but this feature has been disabled by default. 216 | 217 | TCMalloc uses per-thread mode first and it doesn't work well when an application has a large thread count. 
Then TCMalloc migrated to the per-CPU approach[8]. 218 | 219 | Slab allocator[5] is used for kernel object caching which has specific consideration about cache line usage. JeMalloc relies on multiple arenas and allows applications to pad the memory[3]. 220 | 221 | The cache line will be helpfull in some cases but not all cases. CPHash[9] does some comparison about this, CPHash is a hash table which designed for reducing cache missing. In its benchamrk, when the hash table's memory usage is bigger than L3 cache, the performance wil drop and is limitted by DRAM. 222 | 223 | Erlang VM creates one thread for every CPU which is used for scheduling Erlang processes, so it's easier to do per-CPU related things. And Erlang has different allocators for different usages, such as `ets_alloc` for in-memory key-value storage, `temp_alloc` for temporary allocations, and it can specify different fit algorithms for different allocators[6]. 224 | 225 | MICA[7] is in-memory key-value storage. In its cache mode, memory management is very interesting. It organizes its memory as a fix-sized circular log, and key-value items will be appended to the tail of the log. When the circular log is full, items will be evicted from the head. So no need for garbage collection and defragmentation. 226 | 227 | LAMA is designed for solving slab calcification[11], this problem may occur in the default Memcached server, Memcached allocates different size memory from the different slabs. The memory allocator needs to consider how to evict item(s) when no free memory. LAMA is designed for distributing memory to different slabs dynamically based on the access pattern for reducing the miss rate. 228 | 229 | ## Summary 230 | In this article, instead of describing the detail such as each structure's fields or how its extent rtree works, I focus on introducing the relationship between different structures and how those structures work together for allocating and deallocating memory. Hope it is helpful for people who want to understand how JeMalloc works. 231 | 232 | 233 | ## References 234 | 1. JeMalloc 235 | https://github.com/jemalloc/jemalloc/releases/tag/5.2.0 236 | 2. Smoothstep 237 | https://en.wikipedia.org/wiki/Smoothstep 238 | 3. A Scable Concurrent malloc(3) Implementation for FreeBSD 239 | 4. TCMalloc 240 | https://github.com/google/tcmalloc 241 | 5. The Slab Allocator: An Object-Caching Kernel Memory Allocator 242 | 6. Erlang memory allocator 243 | https://github.com/erlang/otp/blob//OTP-25.1.1/erts/emulator/beam/erl_alloc.c 244 | 7. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage 245 | 8. TCMalloc : Thread-Caching Malloc 246 | https://google.github.io/tcmalloc/design.html 247 | 9. CPHash: A Cache-Partitioned Hash Table 248 | 10. LAMA: Optimized Locality-aware Memory Allocation for Key-value Cache 249 | 11. Caching with twemcache. 250 | https://blog.twitter.com/2012/caching-with-twemcache 251 | -------------------------------------------------------------------------------- /blog/2025-01-01.md: -------------------------------------------------------------------------------- 1 | # Understanding the Page Table Step by Step 2 | 3 | The concept of the Page Table can be challenging to grasp, and it has puzzled me for a long time. However, during a recent revisit to xv6, a simple Unix-like teaching operating system, I realized that it becomes much easier to understand by thinking of it as a specialized hash map and breaking it down into smaller steps. 4 | 5 | 6 | ## What is a Page Table? 
7 | 8 | A page table is a specialized hash map that maps contiguous virtual memory to contiguous physical memory. We can store this map as a large array, and multi-level page tables reduce the memory usage by allowing lazy allocation. 9 | 10 | 11 | ## Step 1: Mapping Contiguous Virtual Memory to Contiguous Physical Memory 12 | 13 | Let’s simplify the problem further: mapping virtual memory to physical memory. In programming, we can use a hash map to achieve this. 14 | 15 | ```c 16 | uint64 virtual_memory_to_physical_memory(uint64 virtual_addr) { 17 | // Each process has its own hash map, mapping the same virtual memory 18 | // to different physical memory. 19 | HashMap map = current_process_map(); 20 | 21 | return map[virtual_addr]; 22 | } 23 | ``` 24 | 25 | Due to spatial locality, we prefer to map contiguous virtual memory to contiguous physical memory. Instead of requiring the entire virtual memory to be contiguous in physical memory, we divide it into smaller units called pages. Within a page, all the virtual memory is mapped to the same physical memory page, which fulfills the performance requirement and allows us to allocate physical memory lazily. 26 | 27 | Screenshot 2025-01-01 at 15 10 32 28 | 29 | 30 | Now, each page table entry corresponds to the starting address of a page. Assuming a page size of 4KB, the last 12 bits of the virtual address are irrelevant for determining the index. 31 | 32 | `uint64 index = virtual_addr >> 12;` 33 | 34 | 35 | The last 12 bits of the virtual address act as an offset. The mapping function now looks like this: 36 | 37 | ```c 38 | uint64 virtual_addr_to_physical_addr(uint64 virtual_addr) { 39 | HashMap page_table = current_process_map(); 40 | uint64 index = virtual_addr >> 12; 41 | uint64 offset = virtual_addr & 0xFFF; 42 | 43 | return page_table[index] + offset; 44 | } 45 | ``` 46 | 47 | ## Step 2: Representing the Page Table in Memory 48 | 49 | Now let’s consider how to represent the hash map in memory. Since `virtual_addr_to_physical_addr` maps different virtual addresses to the starting addresses of physical pages and the indices are continuous, we can use an array to represent the hash map. Each array item is a page table entry (PTE). 50 | 51 | In memory, the structure looks like this: 52 | 53 | Screenshot 2025-01-01 at 15 09 56 54 | 55 | ## Step 3: Multi-Level Page Tables 56 | 57 | From the above image, we notice an obvious problem: we need to allocate a large page table even though most of the virtual memory might not be in use. For example, to map 64GB of virtual memory with 4KB pages, we need an array of length 64 * 1024 * 1024 / 4 = 2 ^ 24. If each entry requires 8 bytes, the page table would require 128MB of memory. 58 | 59 | To reduce memory usage, we can use multi-level page tables. For example, a two-level setup allows the root page table's entries to point to secondary page tables. Using a 4KB page to store a page table with 8-byte entries, each table can contain 4096 / 8 = 512 entries pointing to other page tables. Each of these secondary page tables can map 60 | 512 × 4 KB = 2 MB of virtual memory. Thus, one root page table can map 512 × 2 MB = 1024 MB, or 1 GB, of virtual memory. Memory for secondary page tables is allocated only when necessary.
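As a quick sanity check on these numbers, here is a small sketch of the arithmetic only, assuming 4 KB pages and 8-byte PTEs as above:

```c
#include <stdio.h>

int main(void) {
    unsigned long page_size = 4096;                 // bytes per page
    unsigned long pte_size  = 8;                    // bytes per page table entry
    unsigned long entries   = page_size / pte_size; // 512 entries per table
    unsigned long leaf_map  = entries * page_size;  // 512 * 4 KB = 2 MB per secondary table
    unsigned long root_map  = entries * leaf_map;   // 512 * 2 MB = 1 GB per root table

    printf("entries per table: %lu\n", entries);
    printf("one secondary table maps: %lu MB\n", leaf_map >> 20);
    printf("one root table maps: %lu GB\n", root_map >> 30);
    return 0;
}
```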
61 | 62 | Screenshot 2025-01-01 at 16 54 33 63 | 64 | The function for two-level page tables is as follows: 65 | ```c 66 | uint64 virtual_addr_to_physical_addr(uint64 virtual_addr) { 67 | uint64 *root_page_table = current_process_map(); 68 | uint64 root_index = (virtual_addr >> (9 + 12)) & 0x1FF; 69 | 70 | uint64 *page_table = (uint64 *)root_page_table[root_index]; 71 | uint64 page_table_index = (virtual_addr >> 12) & 0x1FF; 72 | 73 | uint64 offset = virtual_addr & 0xFFF; 74 | 75 | return page_table[page_table_index] + offset; 76 | } 77 | ``` 78 | 79 | ## Others 80 | 81 | There are many other aspects of page tables, such as storing page permissions in PTEs and using the Translation Lookaside Buffer (TLB) to cache PTEs. These topics are beyond the scope of this article but are covered in the xv6 textbook and the MIT operating systems course. 82 | 83 | In computer science, many concepts appear complex at first. However, by breaking them into smaller steps, their underlying ideas become clear and manageable. Taking baby steps is often the golden rule. 84 | -------------------------------------------------------------------------------- /blog/2025-01-04-linear-hash.md: -------------------------------------------------------------------------------- 1 | # Understanding Linear Hash Step by Step 2 | 3 | Many computer science concepts seem complex at first; however, if we break them down step by step, they become much clearer. In this article, I will describe how a hash table works, how resizing works, and how linear hashing improves resizing. By the end, you will see that linear hashing is straightforward to understand. 4 | 5 | ## Hash Table 6 | A hash table uses an array to store key-value pairs. When we insert a pair, we use a hash function to calculate the key’s index and then insert the pair at that index in the array. 7 | 8 | ```ruby 9 | def insert(key, val) 10 | index = hash(key) % @array.length 11 | @array[index] = [key, val] 12 | end 13 | ``` 14 | 15 | Since two different keys may have the same hash value (`hash(key1) == hash(key2)`), we handle this by using a linked list to store multiple pairs at the same index. 16 | 17 | ```ruby 18 | def insert(key, val) 19 | index = hash(key) % @array.length 20 | node = Node.new(key, val) 21 | if @array[index] 22 | # link the new node to the existing chain at this index 23 | node.next = @array[index] 24 | end 25 | @array[index] = node 26 | end 27 | ``` 28 | 29 | ## Resizing & Linear Hash 30 | From the code above, we see that if a newly inserted key’s hash value maps to the same index as an existing key’s, the new pair is added to the linked list at that index. If such collisions occur frequently, the linked list grows longer, and the hash table becomes slower. 31 | 32 | The simplest solution is to double the array size. But after doubling the array size, we must rearrange the key-value pairs. 33 | 34 | Suppose the array’s length is 4, with key1’s hash value being 0 and key2’s hash value being 4. Since the array’s length is 4, both of these keys are stored at index 0. After doubling the array size, key1 should be stored at index 0 and key2 at index 4. After resizing, we must rearrange these pairs. This process is called resizing or rehashing. 35 | 36 | Screenshot 2025-01-03 at 00 27 45 37 | 38 | This basic resizing algorithm resolves the conflict issue, but if the array becomes very large, we must move many pairs to their new locations, which can consume a lot of CPU in a short time. 39 | 40 | To improve this, instead of doubling the array size, we could increase the size one step at a time.
For example, we add one element to the array at a time and move the corresponding pairs to the new array item. Now, we need to consider how to calculate the index. 41 | 42 | Suppose we use the old hash function `hash(key) % 4` and get an index of 1, 2, or 3. In this case, we know there is no new array item for those keys—they are correctly indexed by the old hash function. However, if the index is 0, after adding the new item, its correct index could be 0 or 4, so we need to use a new hash function `hash(key) % 8` to find the correct index. 43 | 44 | With two hash functions (`hash(key) % 4` and `hash(key) % 8`), we need to determine which one to use. The solution is simple: we can maintain a cursor to track which elements have already been resized. First, we use the old hash function to calculate the index. If the index is less than the cursor, it means we have already created a new array element for it, so we use the new hash function. If the index is greater than or equal to the cursor, we know it doesn't have a new array element, so the old hash function is correct. 45 | 46 | Screenshot 2025-01-03 at 00 30 34 47 | 48 | ```ruby 49 | def index(key) 50 | i = hash(key) % @n 51 | 52 | if i < @cursor 53 | i = hash(key) % (@n * 2) 54 | end 55 | 56 | i 57 | end 58 | ``` 59 | 60 | ## Summary 61 | This article doesn't cover all the details, such as how to maintain the `@n`, but I hope it gives you a clear understanding of linear hashing. Learning an algorithm’s implementation may seem complex at first, but once we understand the ideas behind it, everything becomes much clearer. 62 | -------------------------------------------------------------------------------- /blog/2025-01-15-non-blocking-stack-profiler.md: -------------------------------------------------------------------------------- 1 | # How SDB Scans the Ruby Stack Without the GVL 2 | 3 | ## Introduction 4 | 5 | Several Ruby stack profilers already exist, but most of them depend on the Ruby Global VM Lock (GVL), which blocks applications while scanning the Ruby stack. Based on my testing, a Ruby stack profiler can increase request latency by up to 10% when sampling at a 1ms interval. Rbspy's async mode is an exception, as it scans the Ruby stack from a separate process. However, this approach can lead to errors[1] because it accesses complex Ruby data structures without acquiring the GVL. 6 | 7 | In this article, I will introduce how SDB scans Ruby stacks without the GVL while maintaining safety and delivering the results we want. It allows us to scan the Ruby stack without increasing application delay. I believe this implementation is interesting as it demonstrates the benefits of releasing the GVL in a performance-sensitive library. 8 | 9 | 10 | ## Scanning Stacks Without GVL 11 | 12 | The Ruby GVL ensures the integrity of Ruby data but impacts concurrency. 13 | 14 | It seems that we must hold the GVL to access Ruby data. However, when we look closely, if the data itself can guarantee atomic, it doesn’t need the GVL’s protection. For example, it is safe to read 64-bit aligned data since its updates are atomic — there are no partial updates[2][3]. An ISeq’s address is an example of such data. If we know where an ISeq’s address is stored, it is safe to read it without holding the GVL. 15 | 16 | The Ruby stack contains ISeq data in an array of `rb_control_frame_struct` objects, allocated when Ruby creates a thread. It is safe to read it as long as Ruby has not reclaimed the stack. 
Without the GVL, we might encounter empty or outdated data (due to lack of memory barriers), but we will not read invalid data (e.g., addresses from a different ISeq). 17 | 18 | To ensure we stop scanning a thread's stack once it has been reclaimed, we can use meta-programming to hook into Ruby's thread lifecycle. The following example demonstrates how to wrap thread creation and deletion: 19 | 20 | ```ruby 21 | module ThreadInitializePatch 22 | def initialize(*args, &block) 23 | old_block = block 24 | 25 | block = ->() do 26 | Sdb.thread_created(Thread.current) 27 | result = old_block.call(*args) 28 | Sdb.thread_deleted(Thread.current) 29 | result 30 | end 31 | 32 | super(&block) 33 | end 34 | end 35 | 36 | Thread.prepend(ThreadInitializePatch) 37 | ``` 38 | 39 | Before Ruby reclaims a thread, it calls `Sdb.thread_deleted`, which removes the thread from SDB's scanning list. This ensures safe stack scanning without the GVL. 40 | 41 | ## Symbolization 42 | 43 | The data we collected are ISeq addresses, which are not readable to humans.SDB uses eBPF to capture ISeq creation events during compilation or loading (e.g., via bootsnap). Then, we can use an offline program to translate these addresses into human-readable symbols. 44 | Ruby's memory compaction can move ISeq objects to new addresses. To handle this, we can use eBPF to capture memory compaction events too. This feature is still in progress, as Ruby's memory compaction is not enabled by default and is rarely used in production applications. 45 | 46 | SDB’s architecture looks like this: 47 | image 48 | 49 | 50 | ## Ensuring Fully Correctness 51 | 52 | Data races may occur if the Ruby VM updates the stack while SDB is reading it. However, resolving this issue completely is not the primary goal of a stack profiler, as minor inaccuracies do not typically affect its usage in identifying performance bottlenecks. 53 | 54 | Similar issues exist in other profilers like rbspy[4], py-spy[5], and async-profiler[6]. The key question is whether the profiler can identify slow paths in the application. Data races can only occur if the stack is updated faster than the scanner can read it, which typically happens with extremely fast functions that have minimal impact on overall performance. 55 | 56 | To manage this, we can use a generation number for optimistic concurrency control. When a stack is pushed, we increment the generation number by one and check this number before and after scanning the stack.[7] 57 | 58 | ## Summary 59 | 60 | This article introduces how SDB scans the Ruby stack without relying on the GVL. By not blocking the application, it enables faster stack scanning and more accurate latency measurements, making a true always-on stack profiler possible. While releasing the GVL resolves the main challenges faced by Ruby stack profilers, it is not sufficient to create a fully non-blocking stack profiler. SDB employs additional concurrency techniques to achieve this, such as spinlocks for synchronizing threads between the Ruby VM and SDB, memory barriers for trace IDs, and a left-right style [7] message queue for its new symbolizer. SDB is still under development, with features such as memory compaction support and detailed evaluations and comparisons based on a Rails application coming soon. 61 | 62 | ## References 63 | 64 | 1. [Should I pause a Ruby process to collect its stack? 65 | ](https://jvns.ca/blog/2018/01/15/should-i-pause-a-ruby-process-to-collect-its-stack/#what-happens-if-i-don-t-pause-the-ruby-process-i-m-profiling) 66 | 2. 
https://pdos.csail.mit.edu/6.1810/2024/lec/l-rcu.txt 67 | 3. [Ruby Memory Model](https://docs.google.com/document/d/1pVzU8w_QF44YzUCCab990Q_WZOdhpKolCIHaiXG-sPw/edit?tab=t.0) 68 | 4. https://github.com/rbspy/rbspy 69 | 5. https://github.com/benfred/py-spy 70 | 6. https://github.com/async-profiler/async-profiler 71 | 7. Left-Right: A Concurrency Control Technique with Wait-Free Population Oblivious Reads 72 | -------------------------------------------------------------------------------- /blog/2025-01-27-memory-barrier.md: -------------------------------------------------------------------------------- 1 | # Understanding Memory Barriers Step by Step - A Summary of "Memory Barriers: A Hardware View for Software Hackers" 2 | 3 | ## Introduction 4 | 5 | This article is a summary of "Memory Barriers: A Hardware View for Software Hackers." and instead of covering all the details, I will focus on explaining how things come. Since memory barriers are quite complex, it is highly recommended to read the original paper for a more thorough understanding. 6 | 7 | CPUs reorder memory references to improve performance. However, in certain situations—such as synchronization primitives—we need memory barriers to ensure correctness. 8 | 9 | A highly simplified story is (The "→" represents causation): 10 | 11 | *Multiple CPU cores + cache lines → data is replicated across different cache lines → Cache-Coherence Protocols are used to guarantee consistency → these protocols sometimes require blocking CPU execution → asynchronous messaging is introduced but causes state inconsistencies → memory barriers are used to guarantee partial ordering.* 12 | 13 | Modern computers have multiple CPU cores, and each core has its own cache. This means a variable(address) might be replicated across different CPU caches. To prevent inconsistencies or data loss for a single variable across multiple caches, cache-coherence protocols are used. 14 | 15 | Cache-coherence protocols rely on message exchanges between caches and main memory. These exchanges introduce latency, as CPUs must wait for acknowledgments, which is extremely slow compared to CPU processing speeds. For example, when a CPU performs a store operation, it might need to wait for acknowledgments from other CPUs, blocking its execution. To mitigate such stalls, CPUs proceed with local state updates without waiting for responses or quickly send acknowledgments without fully applying changes to the cache line. This behavior is similar to how changes within a database transaction remain visible only to that transaction until it is committed. Once the CPU receives all necessary responses, the updated state becomes globally visible to all CPUs. 16 | 17 | While this approach works well for single variables, problems arise when multiple memory addresses and concurrent operations across CPUs are involved. **Memory barriers** address these issues by providing ordering guarantees. When a memory barrier is inserted, all memory operations (reads or/and writes) before it are guaranteed to complete before any memory operations after it begin. Using a database analogy, a memory barrier acts like a "commit," ensuring that all prior store and/or load operations are finalized before continuing. This enforces a *happens-before* relationship. 18 | 19 | In the following sections, I will introduce the Cache-Coherence Protocols, store buffers and invalidate queues. 
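To make the guarantee concrete before diving into the hardware, here is a minimal sketch of the flag-publishing pattern in portable C11. The paper and the examples below use the Linux-kernel-style `smp_mb()`; C11's `atomic_thread_fence` expresses a similar idea, and the release/acquire fences used here are weaker than a full barrier but sufficient for this pattern.

```c
#include <stdatomic.h>

int data = 0;
atomic_int ready = 0;

void publish(void) {   /* writer */
    data = 42;
    /* make the store to data visible before the store to ready */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consume(void) {    /* reader */
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;              /* spin until the flag is published */
    /* do not let the read of data move above the read of ready */
    atomic_thread_fence(memory_order_acquire);
    return data;       /* guaranteed to observe 42 */
}
```

This is exactly the situation the `foo`/`bar` example below builds up to, so it may help to keep this pattern in mind while reading the hardware-level explanation.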
20 | 21 | ## Cache-Coherence Protocols 22 | 23 | memory-barrier 24 | 25 | As data can be replicated to different CPU caches, cache-coherence protocols are required to ensure consistency. 26 | 27 | Basically, a CPU can update a cache line only when that cache line has exclusive ownership of the cached data. For example, if a cache line is in a "shared" state (i.e., multiple caches have replicas of the same address), the CPU must send an "invalidate" message to all other CPUs and wait for their responses. Upon receiving this message, the other CPUs remove the corresponding data from their caches and respond. Once the CPU that initiated the invalidation receives all the responses, it gains exclusive ownership of the cache line and can modify it. This process is described in Transition (h) and Transition (b) of the paper. 28 | 29 | While there are more details in the paper, the basic idea is the same: when a CPU wants to modify a cache line, it must have the exclusive ownership of the data. 30 | 31 | ## Store Buffers 32 | 33 | ### Why We Need Store Buffers 34 | 35 | 1. Cache-coherence protocols ensure consistency but can be slow. 36 | For example, a CPU must wait for responses from all other CPUs when writing to a shared cache line. 37 | 2. Store buffers are added to avoid blocking. 38 | The CPU can record its write operation in the buffer and continue executing other instructions. 39 | 3. The data is removed from the store buffer after the CPU receives all necessary acknowledgments. 40 | 41 | Now the CPU becomes: 42 | 43 | memory-barrier 44 | 45 | ### Store Forwarding 46 | 47 | As a CPU writes data to both the cache line and store buffer, for reading its own update, it needs to read from the store buffer (not directly from the cache). 48 | 49 | 50 | For example, the following code is executed by a single CPU: 51 | 52 | ```c 53 | 1 a = 1 54 | 2 b = a + 1 55 | ``` 56 | 57 | At line 1(`a = 1`), the value 1 is stored in the store buffer. At line 2(`b= a + 1`), the CPU may need to read a. If it reads `a` from the cache directly, it would see the old value (0) because the new value hasn't been applied to the cache yet. To avoid this inconsistency, the CPU reads `a` from the store buffer, which has the updated value. This guarantee is called self-consistency, and the mechanism is referred to as store forwarding. 58 | 59 | 60 | ### Store Buffers and Memory Barriers 61 | 62 | The next issue arises from the interaction of multiple variables and multiple CPUs. 63 | 64 | ```c 65 | 1. void foo(void) { 66 | 2. a = 1; 67 | 3. b = 1; 68 | 4. } 69 | 5. 70 | 6. void bar(void) { 71 | 7. while (b == 0) continue; 72 | 8. assert(a == 1); 73 | 9. } 74 | ``` 75 | 76 | Here, CPU 0 executes `foo`, and CPU 1 executes `bar`. If all instructions are executed in order, the assertion (`assert(a == 1)`) cannot fail. Surprisingly, hardware does not always guarantee this ordering. 77 | 78 | 79 | Assume CPU 0 owns b in its cache but not a, and CPU 1 has a in its cache. When CPU 0 executes line 2 (`a = 1`), it stores the value in its store buffer because it doesn’t yet own `a`. When CPU 0 executes line 3 (`b = 1`), it updates b directly in its cache, as it already owns it. CPU 1 then executes line 7 (`while (b == 0)`), and upon detecting that `b` is 1, it proceeds to line 8 (`assert(a == 1)`). Since CPU 1 hasn’t received an "invalidate" message for `a`, it reads the old value (0) from its cache, causing the assertion to fail. 
80 | 81 | The failure occurs because after CPU 0 executes line 2 (`a = 1`), the value is still in its store buffer and not visible to other CPUs. However, line 3 (`b = 1`) is directly applied to the cache, making it visible. As a result, from CPU 1’s perspective, `b = 1` happens before `a = 1`, leading to an incorrect ordering. 82 | 83 | 84 | To prevent such issues, a memory barrier need to be inserted to enforce the correct ordering of operations. This guarantees that all store operations before the barrier are visible to other CPUs before any store operations after the barrier are applied. The modified code would look like this: 85 | 86 | ```c 87 | 1. void foo(void) { 88 | 2. a = 1; 89 | 3. smp_mb(); 90 | 4. b = 1; 91 | 5. } 92 | 6. 93 | 7. void bar(void) { 94 | 8. while (b == 0) continue; 95 | 9. assert(a == 1); 96 | 10. } 97 | ``` 98 | 99 | In this case, when CPU 0 executes line 3 (`smp_mb()`), it marks all current store-buffer entries (namely, `a = 1`). Then, when it executes line 4 (`b = 1`), since there is a marked entry, instead of immediately applying `b = 1` to its cache, it saves it in its store buffer (but as an unmarked entry), even if the CPU already owns the cache line for b. Thus, the memory barrier at line 3 ensures that other CPUs observe `a = 1` happening before `b = 1`. 100 | 101 | ## Invalidate Queues 102 | ### Why We Need Invalidate Queues 103 | - When a cache line is busy, the CPU might fall behind in processing "invalidate" messages. 104 | - Then the store buffer could be full and it stalls the CPU's execution. 105 | - To address the speed discrepancy, a new queue is added to each CPU. 106 | When an "invalidate" message arrives at the queue, the CPU can immediately acknowledge it. 107 | - The CPU ensures that, when placing an entry into the invalidate queue, it processes the entry before transmitting any protocol messages related to that cache line. 108 | 109 | Now the CPU looks like this: 110 | 111 | memory-barrier 112 | 113 | ### Invalidate Queues and Memory Barriers 114 | 115 | After the new queue has been added, it introduces additional inconsistency. As the state has been applied to the store buffer, not the cache line, the CPU can see "outdated" data. 116 | 117 | And this inconsistency causes our example to fail again. Let’s examine how this happens: 118 | 119 | ```c 120 | 1. void foo(void) { 121 | 2. a = 1; 122 | 3. smp_mb(); 123 | 4. b = 1; 124 | 5. } 125 | 6. 126 | 7. void bar(void) { 127 | 8. while (b == 0) continue; 128 | 9. assert(a == 1); 129 | 10. } 130 | ``` 131 | 132 | 133 | Suppose `a` is in the "shared" state and resides in the caches of both CPU 0 and CPU 1. Meanwhile, `b` is owned by CPU 0. 134 | 135 | When CPU 1 executes line 8 (`while (b == 0) continue;`), it doesn’t have `b`'s cache line and sends a read message to CPU 0. CPU 0 then executes line 2 (`a = 1`) and line 3 (`smp_mb()`) and waits for a response from CPU 1. CPU 1 receives CPU 0’s “invalidate” message for `a`, queues it, and immediately sends an acknowledgment (step 3 in section 4.3 of the paper). However, note that CPU 1 has only queued the "invalidate" message, meaning `a = 1` is not yet visible to CPU 1. 136 | 137 | Next, CPU 0 executes line 4 (`b = 1`). Since it owns `b`, it updates the cache line with `b = 1`. Then CPU 0 receives the read message for `b` from CPU 1 and responds with `b = 1` (which has already been applied to CPU 0’s cache line, so it's visible to CPU 1). After CPU 1 receives `b = 1`, it exits the while loop and executes line 9 (`assert(a == 1)`). 
However, since CPU 1 has not yet processed the "invalidate" message for `a`, `a` is still 0, and the assertion fails. 138 | 139 | 140 | The solution is to add an additional memory barrier between lines 8 and 9. The updated code is: 141 | 142 | ```c 143 | 1. void foo(void) { 144 | 2. a = 1; 145 | 3. smp_mb(); 146 | 4. b = 1; 147 | 5. } 148 | 6. 149 | 7. void bar(void) { 150 | 8. while (b == 0) continue; 151 | 9 smp_mb(); 152 | 10. assert(a == 1); 153 | 11. } 154 | ``` 155 | 156 | Now, after CPU 1 finishes line 8(`while (b == 0) continue`) and executes line 9(`smp_mb()`), the memory barrier ensures that CPU 1 must stall until it processes all preexisting messages in its invalidation queue. Therefore, by the time CPU 1 reaches line 10, `a = 1` has already been applied, and the assertion succeeds as expected. 157 | 158 | ## Summary 159 | 160 | This article explains the basic concepts from "Memory Barriers: A Hardware View for Software Hackers," focusing on the problems and solutions in memory consistency for systems with multiple CPU cores and caches. 161 | 162 | Multiple CPU cores and cache lines are introduced to improve parallel performance. Then, cache-coherence protocols are used to guarantee correctness. However, as these protocols are slow in certain situations, new buffers/queues are added. Finally, memory barriers are used to provide ordering. It is similar to a database transaction, with internal state, order, and the point of visibility for other CPU cores. 163 | 164 | It doesn't cover many details, such as the specifics of the cache-coherence protocols, but I believe that to understand a concept, it's important to grasp the big picture—what the problem is and how it is solved. 165 | -------------------------------------------------------------------------------- /images/memory-barrier-0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/images/memory-barrier-0.png -------------------------------------------------------------------------------- /images/memory-barrier-invalid-queue.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/images/memory-barrier-invalid-queue.png -------------------------------------------------------------------------------- /images/memory-barrier-store-buffer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/images/memory-barrier-store-buffer.png -------------------------------------------------------------------------------- /slides/Concurrency-task-schedule-brief-introduction@RubyConf-China-2020.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/Concurrency-task-schedule-brief-introduction@RubyConf-China-2020.key -------------------------------------------------------------------------------- /slides/Regression-Test-Selection-for-Rails-Project.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/Regression-Test-Selection-for-Rails-Project.key -------------------------------------------------------------------------------- 
/slides/cow-in-xv6.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/cow-in-xv6.key -------------------------------------------------------------------------------- /slides/draft/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/draft/.keep -------------------------------------------------------------------------------- /slides/erlang-message-passing.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/erlang-message-passing.key -------------------------------------------------------------------------------- /slides/sdb-a-new-ruby-stack-profiler@RubyConf-China-2024.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/sdb-a-new-ruby-stack-profiler@RubyConf-China-2024.key -------------------------------------------------------------------------------- /slides/sdb-rubykaigi-2025.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/sdb-rubykaigi-2025.key --------------------------------------------------------------------------------