├── .DS_Store ├── README.md ├── blog ├── 2021-07-25-meltdown.md ├── 2021-07-29-user-level-thread-switch.md ├── 2021-08-11-how-linux-finds-physical-address-through-virtual-address.md ├── 2022-02-06-tcp-congestion-control.md ├── 2022-05-03-slack-incident-reading-note.md ├── 2022-10-05-jemalloc.md ├── 2025-01-01.md ├── 2025-01-04-linear-hash.md ├── 2025-01-15-non-blocking-stack-profiler.md └── 2025-01-27-memory-barrier.md ├── images ├── memory-barrier-0.png ├── memory-barrier-invalid-queue.png └── memory-barrier-store-buffer.png └── slides ├── Concurrency-task-schedule-brief-introduction@RubyConf-China-2020.key ├── Regression-Test-Selection-for-Rails-Project.key ├── cow-in-xv6.key ├── draft └── .keep ├── erlang-message-passing.key ├── sdb-a-new-ruby-stack-profiler@RubyConf-China-2024.key └── sdb-rubykaigi-2025.key /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/.DS_Store -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Personal Blog 2 | ## Articles and Slides 3 | ## 2025 4 | [Understanding Memory Barriers Step by Step - A Summary of "Memory Barriers: A Hardware View for Software Hackers"](./blog/2025-01-27-memory-barrier.md) 2025-01-29 5 | 6 | [Understanding Linear Hash Step by Step](./blog/2025-01-04-linear-hash.md) 2025-01-04 7 | 8 | [Understanding the Page Table Step by Step](./blog/2025-01-01.md) 2025-01-01 9 | 10 | ## 2024 11 | [SDB: a New Ruby Stack Profiler @ RubyConf China 2024](./slides/slides/sdb-a-new-ruby-stack-profiler@RubyConf-China-2024.key) 2024-11-30 12 | 13 | [无 root 权限查看 Ruby HTTPS 请求内容](https://ruby-china.org/topics/43886) 2024-09-14 14 | 15 | [Detect Ruby GVL contention through dynamic link library functions](https://ruby-china.org/topics/43883) 2024-09-11 16 | 17 | [Ruby Garbage Collection 101 and Ruby's RGenGC (Restricted Generational GC)](https://ruby-china.org/topics/43798) 2024-07-06 18 | 19 | [Rust2go: calls Go from Rust](https://ruby-china.org/topics/43765) 2024-06-27 20 | 21 | [Use eBPF USDT in Rust](https://github.com/yfractal/blog/issues/15) 2024-05-21 22 | 23 | ## 2023 24 | [How Meltdown Works](https://github.com/yfractal/blog/issues/14) 2023-12-04 25 | 26 | [Exercise Snacks: A Feasible Exercising Strategy for Office Workers](https://github.com/yfractal/blog/issues/12) 2023-05-08 27 | 28 | ## 2022 29 | 30 | [How JeMalloc Works](./blog/2022-10-05-jemalloc.md) 2022-10-05 31 | 32 | MIT 6.824 Distributed Systems Reading Notes 33 | - [Scaling Memcache at Facebook Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1100841630) 2022-04-17 34 | - [The Google File System Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1115903733) 2022-05-03 35 | - [Amazon Aurora Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1140428705) 2022-05-29 36 | - [COPS Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1207530839) 2022-05-30 37 | - [Spark Paper Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1193444846) 2022-06-25 38 | - [Google Spanner Reading Note](https://github.com/yfractal/blog/issues/8#issuecomment-1225041302) 2022-08-24 39 | 40 | [Regression Test Selection for Rails Project @ internal sharing](./slides/Regression-Test-Selection-for-Rails-Project.key) 2022-06-09 41 | 42 | [Slack’s Incident on 2-22-22 Reading 
Note](./blog/2022-05-03-slack-incident-reading-note.md) 2022-05-03 43 | 44 | [闲来无事,用 Ruby 撸了个 LSM-Tree](https://ruby-china.org/topics/42363) 2022-05-02 45 | 46 | [How Copy on Write Works in Xv6](./slides/cow-in-xv6.key) 2022-02-06 47 | 48 | [TCP Congestion Control Brief Description by Pseudo Code](./blog/2022-02-06-tcp-congestion-control.md) 2022-02-06 49 | 50 | ### 2021 51 | [How Linux finds physical address through virtual memory](./blog/2021-08-11-how-linux-finds-physical-address-through-virtual-address.md) 2021-08-11 52 | 53 | [User Level Thread Switch](./blog/2021-07-29-user-level-thread-switch.md) 2021-07-29 54 | 55 | [Meltdown Notes](./blog/2021-07-25-meltdown.md) 2021-07-25 56 | 57 | [MIT 6.S081 学习总结](https://ruby-china.org/topics/41485) 2021-07-17 58 | 59 | [Linear Hash Revisit](https://ruby-china.org/topics/40930) 2021-02-22 60 | 61 | [How DBMS Memory Buffer Works](https://ruby-china.org/topics/40932) 2021-02-21 62 | 63 | ### 2020 64 | [Erlang message passing brief introduction @ Beijing Elixir meetup](./slides/erlang-message-passing.key) 2020-10-24 65 | 66 | [Concurrency task schedule brief introduction @ RubyConf China 2020](./slides/Concurrency-task-schedule-brief-introduction@RubyConf-China-2020.key) 2020-08-15 67 | 68 | [Raft 笔记](https://ruby-china.org/topics/40018) 2020-06-26 69 | 70 | [Awesome OTP Learning](https://github.com/yfractal/awesome-otp-learning) 2020-02-22 71 | 72 | [Linear Hash 原理及实现](https://ruby-china.org/topics/39466) 2020-01-27 73 | 74 | [Erlang ETS Linear Hash Implementation](https://ruby-china.org/topics/39470) 2020-01-29 75 | 76 | ["Memory Barriers: a Hardware View for Software Hackers" 笔记](https://ruby-china.org/topics/39474) 2020-01-31 77 | 78 | ### 2019 79 | [Mnesia Transaction and Locker @ ShenZhen Elixir Meetup](https://github.com/Pragmatic-Elixir-Meetup/shenzhen-meetup/tree/master/2019-10-27/mnesia-transaction-and-locker) 2019-10-27 80 | 81 | [Erlang 虚拟机窥探 - process 和 scheduler @ ShenZhen Elixir Meetup](https://github.com/Pragmatic-Elixir-Meetup/shenzhen-meetup/tree/master/2019-08-04/erlang%20%E8%99%9A%E6%8B%9F%E6%9C%BA%E7%AA%A5%E6%8E%A2%20-%20%E4%BD%BF%E7%94%A8%20ruby%20%E6%A8%A1%E6%8B%9F%20erlang%20process%20%E5%92%8C%20scheduler) 2019-08-05 82 | 83 | [A Basic Paxos Algorithm Demo Using Erlang](https://ruby-china.org/topics/38909) 2019-08-05 84 | 85 | [分布式共识算法 Paxos -- 如何让所有程序员认可 PHP 才是最好的语言](https://ruby-china.org/topics/38833) 2019-07-12 86 | 87 | [Understand lock free queue algorithm as a concurrency beginner](https://ruby-china.org/topics/38086) 2019-02-04 88 | 89 | [使用 Fiber 实现简单的 CSP (Goroutine channel)](https://ruby-china.org/topics/38041) 2019-01-24 90 | 91 | [HTTP Request Demo by Future and Nio4r](https://ruby-china.org/topics/38404) 2019-04-14 92 | 93 | ### 2018 94 | [自旋锁及 Nginx 实现](https://ruby-china.org/topics/37916) 2018-12-18 95 | 96 | [Erlang 源码阅读 -- scheduler](https://ruby-china.org/topics/37840) 2018-12-01 97 | 98 | [Erlang 源码阅读 -- Number of Active Schedulers](https://ruby-china.org/topics/37874) 2018-12-09 99 | 100 | [使用 Ruby 实现 Erlang "Process"](https://ruby-china.org/topics/37750) 2018-11-11 101 | -------------------------------------------------------------------------------- /blog/2021-07-25-meltdown.md: -------------------------------------------------------------------------------- 1 | # Meltdown Notes 2 | 3 | Meltdown can let us read kernel memory from user space which breaks OS' isolation. 4 | 5 | That means we can use meltdown to read some privilege data such I/O or network buffer. 
6 | 7 | Before explain how it works let us to see how OS prevent such things happen. 8 | 9 | ## Memory Isolation 10 | 11 | In user mode data can be accessed only through virtual memory. 12 | 13 | Virtual memory will map virtual address to real(physical) address. 14 | 15 | Before hardware do the mapping will check permission first, if some memory belongs to kernel that memory can't be read in user mode. 16 | 17 | If program dose such thing will cause page fault exception. 18 | 19 | The permission is recored by PTE(page table entry). 20 | 21 | Below picture is RISC-V's PTE. 22 | 23 | ![Screen Shot 2021-07-24 at 11 25 01 PM](https://user-images.githubusercontent.com/3775525/126873177-04e72b82-9e31-4163-a3b7-e53de305c675.png) 24 | 25 | If the U bit is set to 1 means the memory belongs to user. 26 | 27 | Only when program in kernel mode can create or modify the PTE. 28 | 29 | This is how OS provide memory isolation. 30 | 31 | ## Out-of-order Execution 32 | 33 | Out-of-order execution can let us to read kernel memory from user model but the result is invisible in architectural level. 34 | 35 | When CPU run some slow instructions such as load from memory, CPU will run the following instructions first for performance reason. 36 | 37 | Before the slow instruction is finished, the following instructions' changes are invisible in architectural level. This is out-of-order execution. 38 | 39 | Then after the slow instruction has finished, then CPU will make the state changes visible or rollback if some things go wrong. They call this retirement. 40 | 41 | In out-of-order execution hardware will not check exception such as page fault until when CPU do retirement. 42 | 43 | let's see one example: 44 | ``` c 45 | 1 r0 = 46 | 2 r1 = valid // r1 is a register; valid is in RAM 47 | 3 if(r1 == 1){ 48 | 4 r2 = *r0 49 | 5 r3 = r2 + 1 50 | 6 } else { 51 | 7 r3 = 0 52 | 8 } 53 | ``` 54 | 55 | For line 1 needs to load value from memory to register and that may take hundreds of cycles. 56 | 57 | CPU will not wait idle for the load and it will run lien 4 ~ 5 in parallel. 58 | 59 | Suppose `r0` is some kernel address, we know when CPU exec instructions out of order, it will not check exceptions. 60 | 61 | So `r0`'s value has been loaded in some invisible place. 62 | 63 | After `r1` has been loaded, CPU will do retirement. Hardware will see current mode is user mode and the program is reading a kernel address. 64 | 65 | So we should rollback changes and raise page fault exception. 66 | 67 | Hardware designer thought that kind of changes are invisible to program. 68 | 69 | ## Flush-Reload 70 | 71 | In the Out-of-order execution section, we demonstrated how to read kernel memory in user mode but the result is invisible to us. 72 | 73 | In this section will see how to make the result visible. 74 | 75 | ### Cache 76 | CPU has cache for storing recent accessed data. 77 | 78 | The speed for CPU to read data from cache is much more faster than read data from RAM. 79 | 80 | For example, read from cache needs a dozen cycles, read from RAM needs hundreds cycles. 81 | 82 | If we read some address fast means the address has been used recently and slow means it has not been used recently. 83 | 84 | ### Flush-Reload 85 | 86 | Flush-reload will allow you to check a function used the memory at an address `x` or not. 87 | 88 | The steps are: 89 | 90 | 1. ensure x is not cached 91 | 2. call `f()` 92 | 3. record the time takes, say t1 93 | 4. load a byte from address x 94 | 5. record the time again, say t2 95 | 6. 
if the difference between t1 and t2 is small we can infer x has been used in `f()` 96 | 97 | When CPU do retirement, it will cancel register state but will not flush the cache. 98 | 99 | So we can use flush+reload to get out-of-order execution's internal state. 100 | 101 | ## Meltdown 102 | 103 | Let's see how to read 1 bit at kernel address `r1` 104 | 105 | ``` 106 | 1 char buf[8192] 107 | 2 108 | 3 // the Flush of Flush+Reload 109 | 4 clflush buf[0] 110 | 5 clflush buf[4096] 111 | 6 112 | 7 113 | 8 114 | 9 r1 = 115 | 11 r2 = *r1 // out-of-order 116 | 12 r2 = r2 & 1 // out-of-order 117 | 13 r2 = r2 * 4096 // out-of-order 118 | 14 r3 = buf[r2] // out-of-order 119 | 15 120 | 16 121 | 17 122 | 18 // the Reload of Flush+Reload 123 | 19 a = rdtsc // get time 124 | 20 r0 = buf[0] 125 | 21 b = rdtsc 126 | 22 r1 = buf[4096] 127 | 23 c = rdtsc 128 | 24 if b-a < c-b: 129 | 25 low bit was probably a 0 130 | ``` 131 | 132 | As line 7 is an expensive instruction, CPU will execute lien 9 ~ 14 in parallel. 133 | 134 | `r2 = *r1` will read r1's content. 135 | 136 | Then get the lowest bit through `r2 = r2 & 1`. 137 | 138 | ``` c 139 | r2 = r2 * 4096 140 | r3 = buf[r2] 141 | ``` 142 | 143 | Will cause a read at `buf[0]` or `buf[4096]`. 144 | 145 | For line 19 ~ 25 will do reload part of the flush-reload attack. 146 | 147 | The code will read` r[0]` and get time by `b - a`. 148 | 149 | Then read `r[4096]` and get time takes by `c - b`. 150 | 151 | if `b - a < c - b` means `r[0]` has been read recently and then we can imply r2 == 0, so the lowest bit of address r1 is 0. 152 | 153 | That's how we read one bit of kernel address in user mode. 154 | 155 | ## Additional 156 | 157 | Meltdown only affects some of Intel x86 CPU and ARM CPU and has been fixed. 158 | 159 | Meltdown can get kernel memory relies on user address and kernel address are in same page table. 160 | 161 | OS can fix it by "KAISER"/"KPTI", it doesn't map the kernel in user page table. 162 | 163 | And L1 cache cached virtual address and L3 cache cached physical address. 164 | 165 | When CPU load memory into cache, it loads a trunk of memory. That's why we use ` buf[0]` and `buf[4096]` for checking. 166 | 167 | This article mainly based on 6.S081 2020 Lecture 22 and the meltdown paper. 168 | 169 | For how hardware handle user mode and kernel mode, you can read "The RISC-V Reader" chapter 10. 170 | 171 | Retired explains in stack overflow https://stackoverflow.com/questions/22368835/what-does-intel-mean-by-retired 172 | 173 | Others explains about Meltdown https://www.bilibili.com/video/BV1nb4y1D7ii?from=search&seid=12911120057259798460 174 | 175 | Please let me know if I make any mistakes :pray: 176 | -------------------------------------------------------------------------------- /blog/2021-07-29-user-level-thread-switch.md: -------------------------------------------------------------------------------- 1 | # User Level Thread Switch 2 | 3 | This article will introduce how CPU handle function call and how to do user level thread switch. 4 | 5 | ## Function Call 6 | Let's image a simple computer, all instructions are stored in memory, and CPU will execute the instruction on by one in sequence. 7 | 8 | As we want to pick an instruction to execute, we need to know the instruction's address in memory. 9 | 10 | So we have to find a place to hold such info for CPU, CPU uses registers for this purpose. 
11 | 12 | The registers are much more faster than memory, accessing a register may need one CPU cycle and accessing a memory location may need hundred CPU cycles. 13 | 14 | But CPU just has limited registers, may 32 or 64 or more, but can't get as much as we want. 15 | 16 | The register for storing next instruction's address is `PC`. 17 | 18 | After we executed one instruction, we move the `PC` 4 byte forward (suppose 32 bits for each instruction). 19 | 20 | Then CPU will execute the instruction which is point by `PC`. 21 | 22 | We can describe this in code 23 | 24 | ``` c 25 | instructions = [i0, i1, ....in] 26 | 27 | pc = 0 28 | while true 29 | CPU_exec(instructions[pc]) 30 | pc += 4 31 | ``` 32 | Screen Shot 2021-07-28 at 8 45 34 AM 33 | 34 | Now let's to see how loop works. 35 | 36 | For loop we need exec some instructions again and agin, such as instruction-0, instruction-1, .... instruction-n, then instruction-0, instruction-1, .... instruction-n... 37 | 38 | As we know `PC` is pointed to the next instruction to execute, so after we executed instruction-n, we just need to set `PC` to the instruction-0's address, then cpu will loop again. 39 | 40 | Screen Shot 2021-07-28 at 8 52 01 AM 41 | 42 | It's time for function call. 43 | 44 | Suppose there is some code as below: 45 | 46 | ``` c 47 | 1 foo := 48 | 2 instruction0 49 | 3 call bar 50 | 4 instruction1 51 | 5 instruction2 52 | 6 53 | 7 bar := 54 | 8 instruction0 55 | 9 ret 56 | ``` 57 | 58 | For function `foo`, we need execute some instruction then call `bar` function and execute remain instructions. 59 | 60 | For function `bar`, we need execute some instruction and return. 61 | 62 | Screen Shot 2021-07-28 at 9 10 21 AM 63 | 64 | So we need jump to `bar` and jump back. 65 | 66 | When we jump to `bar`, we just need to update `PC` to `bar`'s location. Then CPU will execute `bar`'s instructions. 67 | 68 | After we finished `bar`'s execution code we need jump back to line 4's address. 69 | 70 | For doing this we need one place to hold the line 4's address (for jumping back). 71 | 72 | We can store it in memory but memory is slow, so we store it in register. This special purpose register is called `ra` usually. 73 | 74 | so the call and ret can be defined as below: 75 | 76 | ``` c 77 | call label := 78 | ra <- pc + 4 // assign next instruction for ret, which is line 4's address in our example 79 | pc <- label // jump to the callee, for foo is bar's address 80 | 81 | ret := 82 | pc <- ra // jump back 83 | ``` 84 | 85 | It works for on level function call. 86 | 87 | But what if we have 3 functions? Likes function `f` calls `foo` and `foo` calls `bar`. 88 | 89 | ``` c 90 | 1 0000 f := exec order ra's value description 91 | 2 0004 instruction-x 0 0 92 | 3 0008 call foo 1 000a 93 | 4 000a instruction-x 94 | 5 000e ret 95 | 6 0010 foo := 96 | 7 0014 instruction-x 2 000a 97 | 8 0016 call bar 3 0001e 98 | 9 001e instruction-x 6 001e 99 | 10 0010 ret 7 001e jump to 001e, line 9 again... 100 | 11 0014 bar := 101 | 12 0018 instruction-x 4 001e 102 | 13 001a ret 5 001e jump to 001e, line 9 103 | ``` 104 | 105 | Let's walk above code step by step. 106 | 107 | For line 3, the ra's value is 000a(at line 3), then we jump to foo function(at line 6). 108 | 109 | Then we execute `call bar` at line 8 and the ra's value will be updated to 001e(at line 9). 110 | 111 | After `bar` has been executed, `ra`'s value is 001e(at line 9), so we jump to line 9. 
112 | 113 | But when we execute line 10, current `ra`'s value is 001e(at line 9), we jump back to line 9 again. 114 | 115 | It's an infinite loop. 116 | 117 | That happens because at line 8 we overwrite the original `pa`'s value. 118 | 119 | We need many places to store the ongoing functions' return addresses but we only have limited register. 120 | 121 | It is impossible to store all those return address into registers, so we store them in memory. 122 | 123 | For each function call we need some small memory to save `ra` and after the function has been executed we will need get the data back and free the memory. 124 | 125 | For function `f` we need allocate memory and store 000e(line 3) to it and for `foo` we need allocate memory and store 0001e(at line 9), 126 | 127 | after `bar` has been executed then we need get last value(0001e at line 9) back and deallocate memory 128 | 129 | then do same thing for value 000e(line 3). 130 | 131 | The operations are push, push and pop, pop, so we can use stack. 132 | 133 | For stack we need allocates some memory(eg: 4kb) and one pointer which points to the top of the stack. 134 | 135 | When we do push, we move the pointer forward and store `ra` to the last location. 136 | 137 | After a function has been called, we pop the value and move the pointer backward. 138 | 139 | As this happens so often, hardware designer provides one register for us, it is called `sp` usually. 140 | 141 | Register `ra`'s value only has meaning in current function, when we executed an inner function such as `bar`, the `ra` should have different value. 142 | 143 | It is a temporary register, such registers should be saved by caller. 144 | 145 | And there is another kind of register which are preserved across function calls, 146 | 147 | so if callee needs to use such register, he needs to save them and restore them back before return to caller. 148 | 149 | This kind of registers are callee saved registers. 150 | 151 | Above is how we handle function call. 152 | 153 | ## Switch Threads 154 | 155 | NOTICE: Those are based on MIT 6.S081 2020 Multithreading lab. 156 | 157 | We need build a function for switching two thread. 158 | 159 | And the function will be called likes `switch(thread1, thread2)`, thread1 is a variable which stores thread 1's state. 160 | 161 | When call `switch` function we will switch from `thread1` to `thread2`. 162 | 163 | ![Screen Shot 2021-07-28 at 9 54 12 PM](https://user-images.githubusercontent.com/3775525/127334554-fa2caa8f-dc62-41f0-a6ae-17078391a623.png) 164 | 165 | When we call `switch`, CPU will begin execute `switch` instructions. 166 | 167 | And after all `switch`'s instructions have been executed, switch will not jump back to is callee place in thread 1, but will jump to other place in thread 2. 168 | 169 | It works as same as function call except it doesn't return back. 170 | 171 | For normal function all, compiler will help us to save register and handle return address. 172 | 173 | For switch, we need handle those stuff by myself. 174 | 175 | First thing is store current thread's state. Suppose we have structures as below: 176 | 177 | ```c 178 | struct context { 179 | uint64 ra; 180 | uint64 register1; 181 | uint64 register2; 182 | ......... 183 | } 184 | 185 | struct thread { 186 | struct context context; 187 | }; 188 | ``` 189 | 190 | and we call `switch` by `switch(&threa_1->context, &thread_2->context)`. 
191 | 192 | We need use assembly code to save thread_1's return address and other registers, 193 | 194 | ```c 195 | switch: 196 | // a0 is thread 1's context address 197 | sd ra, 0(a0) // save thread_1->context.ra into ra register 198 | sd register1, 8(a0) // save thread_2->context.register1 into register1 199 | sd register2, 16(a0) // save thread_2->context.register2 into register2 200 | ... 201 | ``` 202 | 203 | Then we need to load thread_2's context into those registers 204 | ``` 205 | // a1 is thread 2's context address 206 | ld ra, 0(a1) // load thread_2->context.ra into ra register 207 | ld register1, 8(a1) // load thread_2->context.register1 into register1 208 | ld register2, 16(a1) // load thread_2->context.register2 into register2 209 | ``` 210 | 211 | Above code will allow us switch thread1 to thread2, but it doesn't work correlty. 212 | 213 | Because we can have many threads but only have one stack. 214 | 215 | We need different stacks for different user level threads and when we switch thread we need switch stack too. 216 | 217 | So let add one field for storing the stack, the thread structure becomes: 218 | 219 | ```c 220 | struct thread { 221 | char stack[MAX_STACK_SIZE]; // toy code, do not handle stack overflow 222 | struct context context; 223 | }; 224 | ``` 225 | 226 | and as we have different stack, we need add sp to `context` 227 | 228 | ```c 229 | struct context { 230 | uint64 ra; 231 | uint64 sp; 232 | uint64 register1; 233 | uint64 register2; 234 | ......... 235 | } 236 | ``` 237 | 238 | and `switch` becomes: 239 | 240 | ```c 241 | switch: 242 | // a0 is thread 1's context address 243 | sd ra, 0(a0) // save thread_1->context.ra into ra register 244 | sd sp, 8(a0) // save thread_1->context.sp into sp register 245 | sd register1, 16(a0) // save thread_1->context.register1 into register1 246 | 247 | ... 248 | 249 | // a1 is thread 2's context address 250 | ld ra, 0(a1) // load thread_2->context.ra into ra register 251 | ld sp, 8(a1) // load thread_2->context.ra into sp register 252 | ld register1, 16(a1) // load thread_2->context.register1 into register1 253 | 254 | ... 255 | ``` 256 | 257 | When we create a user level thread, we need set the stack's address for the thread's `context.sp` field and set `context.ra` to the function we want to execute. 258 | 259 | So when we switch to the thread, CPU will jump to the `thread->context.ra` and use `thread->context.sp` as its stack. 260 | 261 | The memory layout as below: 262 | 263 | ![Screen Shot 2021-07-29 at 9 20 34 AM](https://user-images.githubusercontent.com/3775525/127416803-5ca48c79-d27e-43e2-9027-5d77f9b157b8.png) 264 | 265 | From this section, we know if we want to have user level thread, 266 | 267 | we need to allocate memory for each user level thread's stack and handle `ra` and other callee registers in `switch` function. 268 | 269 | That's how we handle user level thread switching. 
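To make the creation step described above concrete, here is a minimal sketch of setting up a new user-level thread. It assumes the toy `struct thread` and `struct context` shown earlier; the function name `thread_create` and the constant `MAX_STACK_SIZE` are illustrative, not taken from the xv6 lab code.

```c
// Minimal sketch, assuming the toy `struct thread` / `struct context` above.
// When switch() later loads this context, its final `ret` jumps to `func`,
// so the new thread starts running there on its own stack.
void thread_create(struct thread *t, void (*func)(void)) {
    t->context.ra = (uint64)func;                         // where switch() will "return" to
    t->context.sp = (uint64)(t->stack + MAX_STACK_SIZE);  // stack grows downward from the top
}
```

If `func` ever returns there is no caller to return to, so a real implementation would also arrange for `ra` to eventually reach an exit/cleanup routine; the sketch ignores that detail.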
270 | 271 | ## References 272 | 273 | - [What are callee and caller saved registers?](https://stackoverflow.com/questions/9268586/what-are-callee-and-caller-saved-registers/16265609#16265609) 274 | 275 | - [MIT 6.S081 2020 Multithreading lab](https://pdos.csail.mit.edu/6.S081/2020/labs/thread.html) 276 | -------------------------------------------------------------------------------- /blog/2021-08-11-how-linux-finds-physical-address-through-virtual-address.md: -------------------------------------------------------------------------------- 1 | # How Linux finds physical address through virtual memory 2 | 3 | ## Why 4 | 5 | Hardware can map virtual address to physical address or use physical address directly. 6 | 7 | The virtual address to physical address mapping is stored in memory. 8 | 9 | When virtual address is enabled, hardware will help us do the virtual address to physical address mapping. 10 | 11 | Sometimes, kernel needs do same thing. 12 | 13 | For example when kernel allocates new memory for process kernel needs walk through and setup the mapping. 14 | 15 | So kernel needs to know how page table works. 16 | 17 | Now let's see how the hardware handle the virtual addresses. 18 | 19 | ## x86 4-level paging 20 | 21 | X86 supports different kinds of pages, they are similar. So let's consider 4 levels paging and 4KB page only. 22 | 23 | ![Screen Shot 2021-08-11 at 9 14 57 PM](https://user-images.githubusercontent.com/3775525/129035231-13bdac79-ebf8-4b1f-8e51-e988ddfa3eee.png) 24 | 25 | As the image above, register `CR3` is point to start address of global(L3) directory page, and virtual address' 47 ~ 39 bits will used for global(L3) directory page(L3)'s offset. 26 | 27 | Then the content will point to next level directory's start address, then we can use 38 ~ 30 bits of virtual address for offset of upper(L2) directory page. 28 | 29 | Then middle(L1) directory page and finally we arrive at page table(L0). 30 | 31 | ### Linux `follow_page` method 32 | 33 | Linux `follow_page` is used for doing same thing. 34 | 35 | Main follow as before: 36 | 37 | ``` c# 38 | // in mm/gup.c 39 | follow_page(vma, address, flags) 40 | pgd = pgd_offset(mm, address); 41 | follow_p4d_mask(vma, address, pgd, flags, ctx); 42 | p4d = p4d_offset(pgdp, address); // level 2 43 | follow_pud_mask(vma, address, p4d, flags, ctx); 44 | pud = pud_offset(p4dp, address); 45 | follow_pmd_mask(vma, address, pud, flags, ctx); 46 | pmd = pmd_offset(pudp, address); // level 1 47 | follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); 48 | ptep = pte_offset_map(mm, pmd, address, &ptl); // level 0 49 | pte = *ptep; 50 | page = vm_normal_page(vma, address, pte); 51 | ``` 52 | 53 | `pgd_offset`'s defination is: 54 | 55 | ``` c 56 | #define PGDIR_SHIFT 39 57 | #define PTRS_PER_PGD 512 58 | 59 | #define pgd_offset(mm, address) pgd_offset_pgd((mm)->pgd, (address)) 60 | 61 | static inline pgd_t *pgd_offset_pgd(pgd_t *pgd, unsigned long address) 62 | { 63 | return (pgd + pgd_index(address)); 64 | }; 65 | 66 | #define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)) 67 | 68 | ``` 69 | 70 | `#define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))` will shift right 39 bits then mark out 9 bits which will give us 47 ~ 39 bits of a virtual address. 71 | 72 | The result is the offset of the page global directory's offset. 73 | 74 | We add it to `pgd` by `pgd + pgd_index(address)` and the result is the next level directory's start address. 
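To make the bit arithmetic concrete, here is a small standalone program (not from the kernel; the address is arbitrary) that extracts the four indices plus the page offset exactly the way the macros above do for 4-level paging:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t addr = 0x00007f1234567abcULL;      /* an arbitrary user-space address */

    uint64_t pgd_index = (addr >> 39) & 0x1ff;  /* bits 47..39, like pgd_index() */
    uint64_t pud_index = (addr >> 30) & 0x1ff;  /* bits 38..30 */
    uint64_t pmd_index = (addr >> 21) & 0x1ff;  /* bits 29..21 */
    uint64_t pte_index = (addr >> 12) & 0x1ff;  /* bits 20..12 */
    uint64_t offset    = addr & 0xfff;          /* bits 11..0  */

    printf("pgd=%llu pud=%llu pmd=%llu pte=%llu offset=0x%llx\n",
           (unsigned long long)pgd_index, (unsigned long long)pud_index,
           (unsigned long long)pmd_index, (unsigned long long)pte_index,
           (unsigned long long)offset);
    return 0;
}
```

Each `*_offset` helper below adds one of these indices to the table address obtained from the previous level, which is the walk `follow_page` performs.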
75 | 76 | 77 | `p4d_offset` is used for 5-level paging, code as below: 78 | 79 | ``` c 80 | static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) 81 | { 82 | if (!pgtable_l5_enabled()) 83 | return (p4d_t *)pgd; 84 | return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address); 85 | } 86 | ``` 87 | 88 | `pgtable_l5_enabled()` will return `false` we use level 4 paging, so `p4d_offset` will just return `pgd` back. 89 | 90 | Then `pud_offset` for the upper directory, then `pmd_offset` for the middle directory and finally reached the last level by calling `pte_offset`. 91 | 92 | The calling is follow_page: pgd_offset -> p4d_offset -> pud_offset -> pmd_offset -> pte_offset. 93 | 94 | ## Paging 95 | There are different kinds of paging in x84: 32-bit paging, PAE paging, 4-level paging and 5-level paging. 96 | 97 | Basically they are same, but just uses different bits for finding pages. 98 | 99 | More detail can be fond in `Intel® 64 and IA-32 Architectures Software Developer’s Manual` or `Understanding Linux Kernel`. 100 | 101 | So I will not explain all of them. 102 | 103 | There are many interesting things about page table or addressing such as copy on write and mmap. 104 | 105 | I will explain some of them in the following articles. 106 | -------------------------------------------------------------------------------- /blog/2022-02-06-tcp-congestion-control.md: -------------------------------------------------------------------------------- 1 | # TCP Congestion Control Brief Description by Pseudo Code 2 | 3 | ## Introduction 4 | TCP 需要在异构,动态的网络环境下,通过不可靠的下层服务(IP),提供可靠(reliable)的服务,而拥塞控制是 TCP 必不可少的机制。 5 | 6 | TCP 拥塞控制本身是一个非常困难,也非常有趣的问题。 7 | 8 | 本文主要根据 RFC 2001[1] 对 TCP 拥塞控制的状态变化进行一个简单的描述。 9 | 10 | ## TCP Congestion Control Brief Description 11 | TCP 拥塞控制通过调整 cwnd(congestion window) 对流量进行控制,未经过确认(ack)的数据总和不能超过 cwnd 的大小。 12 | 13 | TCP 拥塞控制有三个状态,在不同的状态下,会使用不同的方式调整 cwnd 的大小。 14 | 15 | ### State Changes 16 | #### Slow Start and Congestion Avoidance 17 | slow start 和 congestion avoidance 这两个状态的转换由 cwnd 和 ssthresh(slow start threshold) 控制。 18 | 19 | 当 cwnd > ssthresh 则由 slow start 转换为 congestion avoidance。 20 | 21 | 在 slow start 状态下,当收到新的 ack 的时候,cwnd 会成指数增长,而 congestion avoidance 状态下,则成线性增长。 22 | 23 | 在当前的状态是 slow start 或者是 congestion avoidance 的时候,如果收到 duplicate ACK, 则会将 ssthresh 设置为 window size( min{cwnd, rwnd} )的一半。 24 | 25 | #### Fast Recovery 26 | 当收到 3 个或者更多的重复的 ACK 的时候,则进入 Fast Recovery 状态,在收到新的 ACK 的时候,则 Fast Recovery 结束,进入 Congestion Avoidance。 27 | 28 | 当进入 Fast Recovery 的时候,会将 threshold 设置为 window size( min{cwnd, rwnd} )的一半,重传丢失的 segment,将 cwnd 设置为 ssthresh + 3 倍的 segment size。 29 | 30 | #### Timeout 31 | 当 retransmit timer timeout 的时候,则会将 ssthresh 设置为 window size( min{cwnd, rwnd} )的一半,并将 cwnd 设置为 segment size。即从新开始 slow start。 32 | 33 | ### How Congestion Window Changes 34 | 图片截取自 MIT 6829 Computer Networking L8 [4] 35 | ![Screen Shot 2022-02-06 at 6 26 13 PM](https://user-images.githubusercontent.com/3775525/152676638-f346dd6d-7d2c-4d8c-984c-d1b618085d94.png) 36 | 37 | ### Pseudo Code 38 | TCPCongestionControl#receive 在收到 ack 或者 retransmit timer timeout 的时候会被触发。 39 | 40 | 可以理解成一种 callbck,在有收到 ack 的时候会被调用,在 timeout 事件发生的时候,也会被调用。类似于 Erlang 的 receive。 41 | 42 | ```ruby 43 | class TCPCongestionControl 44 | # rwnd receiver's advertised window 45 | def initialize(segment_size) 46 | # segment size announced by the other end, or the default, 47 | # typically 536 or 512 48 | @segment_size = segment_size 49 | @cwnd = segment_size 50 | @ssthresh = 65535 # bytes 51 | @in_fast_recovery = 
false 52 | # other logic .... 53 | end 54 | 55 | def receive 56 | # indicate congestion 57 | if current_algorithm == :slow_start && new_ack? 58 | # exponential growth 59 | @cwnd += @segment_size 60 | # may go_to congestion_avoidance state 61 | # Slow start continues until TCP is halfway to where it was when congestion 62 | # occurred (since it recorded half of the window size that caused 63 | # the problem in step 2), and then congestion avoidance takes over. 64 | elsif current_algorithm == :congestion_avoidance && new_ack? 65 | # linear growth 66 | @cwnd += segsize * segsize / @cwnd 67 | elsif three_or_more_duplicate_ack? 68 | # TCP does not know whether a duplicate ACK is caused by a lost segment or just a reordering of segments 69 | # it waits for a small number and assume it is just a a reordering of segments 70 | # 3 or more duplicate ACKs are received in a row, 71 | # it is a strong indication that a segment has been lost 72 | # go_to fast recovery 73 | @in_fast_recovery = true 74 | @ssthresh = current_window_size / 2 75 | retransmit_missing # retransmit directly without waiting timer 76 | @cwnd = @ssthresh + 3 * @segment_size 77 | # This inflates the congestion window by the number of segments that have left the network and which the other end has cached (3). 78 | elsif current_algorithm == :fast_recovery && new_ack? 79 | @cwnd = @ssthresh # go_to congestion_avoidance 80 | @in_fast_recovery = false 81 | elsif (current_algorithm == :slow_start || current_algorithm == :congestion_avoidance) && duplicate_ack? 82 | @ssthresh = current_window_size / 2 # go_to congestion avoidance 83 | elsif current_algorithm == :fast_recovery && duplicate_ack? 84 | @cwnd += @segment_size 85 | # This inflates the congestion window for the additional segment that has left the network. Transmit a packet, if allowed by the new value of cwnd. 86 | elsif timeout? 87 | # slow_star and congestion_avoidance should come into here 88 | # not sure when timout occurs when fast_recevery should come into here too... 89 | @ssthresh = current_window_size / 2 90 | @cwnd = segment_size # go_to slow start 91 | end 92 | 93 | maybe_transmit_packet 94 | end 95 | 96 | private 97 | def current_algorithm 98 | return :fast_recovery if @in_fast_recovery 99 | 100 | @cwnd <= @ssthresh ? :slow_start : :congestion_avoidance 101 | end 102 | 103 | def current_window_size 104 | min(@cwnd, @rwnd) 105 | end 106 | end 107 | ``` 108 | 109 | ## Relative 110 | Congestion Avoidance and Control[2],对用塞控制背后的原理有详细的解释,不过比较难读(我只读懂了部分)。 111 | 112 | Computer Networking: A Top Down Approach 和 Principles of Computer System 都对 TCP 拥塞控有所描述。更深入的可以参考 MIT 的 Computer Networking 课程[3]。 113 | 114 | 在操作系统中,也有类似的问题。之前,在 package 超过系统负载能力的时候,CPU 都被用在了处理中断,而没有时间执行应用层的逻辑,导致没办法完整的处理一个 package。和 Congestion Avoidance and Control 有相似的思路,但是用了不同的方法[4]。 115 | 116 | 工业上,类似的机制有:流量控制(rate limiter),背压(Back pressure)。Facebook 在 memcache cluster 中有使用类似的机制[5]。 117 | 118 | Netfix 有开源 Hystrix[6],但现在已经被 concurrency-limits[7] 所替代,而 concurrency-limits 使用类似 TCP congestion control 的机制。 119 | 120 | ## 参考 121 | - [1] W. Stevens. RFC2001: TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms, 1997 122 | - [2] V. Jacobson, Congestion Avoidance and Control 123 | - [3] MIT 6829 Computer Networking L8 End-to-End Congestion Control 124 | https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-829-computer-networks-fall-2002/lecture-notes/ 125 | - [4] Jeffrey Mogul. 
Eliminating Receive Livelock in an Interrupt-driven Kernel, 1996 126 | - [5] Rajesh Nishtala. Scaling Memcache at Facebook, 2013 127 | - [6] Netflix Hystrix https://github.com/Netflix/Hystrix 128 | - [7] Netflix concurrency-limits https://github.com/Netflix/concurrency-limits 129 | 130 | 131 | 132 | 133 | -------------------------------------------------------------------------------- /blog/2022-05-03-slack-incident-reading-note.md: -------------------------------------------------------------------------------- 1 | # Slack’s Incident on 2-22-22 Reading Note 2 | 3 | ## Why reading incident report is interesting? 4 | 5 | We can learn many things from an incident report, such as how to react to an incident, review architecture, and analyze problems in a real case. 6 | 7 | Have to say, it is much more enjoyable to learn from others' incident report. 8 | 9 | ## Affect 10 | 11 | On February 22, 2022 from 6:00 AM PST to 9:14 AM PST, some customers can't access Slack[2]. 12 | 13 | ## Rough Timeline 14 | 15 | 1. received user tickets 16 | 2. received alarms 17 | 3. detected database suffers higher load than usual 18 | 4. query timeout, DB overloaded 19 | 5. decreased the requests acceptance rate => DB back to normal 20 | 6. increased the rate => DB overloaded again 21 | 7. decreased the rate => back to normal => increase by small steps 22 | 23 | ## The Architecture 24 | 25 | The backend server will query the cache first, if the cache missed, query the DB. 26 | 27 | ``` 28 | v = get(k) 29 | 30 | if v is nil { 31 | v = fetch from DB 32 | set(k, v) 33 | } 34 | ``` 35 | 36 | ![Screen Shot 2022-04-29 at 12 26 31 AM](https://user-images.githubusercontent.com/3775525/165799822-50bc0477-60ce-49ee-98b2-4efd6ae6c91d.png) 37 | 38 | Clients(app servers) will send requests to the Mcrouter, then the Mcrouter will proxy the requests to Memcached nodes. 39 | 40 | Consul is used for discovering Memcached nodes. And mcrib will watch Consul for getting alive Memcached nodes. 41 | 42 | Then the info will be used by mcrouter for selecting memcached nodes. 43 | 44 | > When the agent(Consul) restart occurs on a memcached node, the node that leaves the service catalog gets replaced by Mcrib. The new cache node will be empty. 45 | 46 | I'm not really understand this sentence. 47 | 48 | It seems that the consul agent will be deployed on some Memcached nodes, and when the Cousul agent restarts, the Memcached node will be replaced by a new empty node. 49 | 50 | Screen Shot 2022-07-02 at 22 22 52 51 | 52 | ## What happened 53 | 54 | ### Cache failure causes hit rate to decrease 55 | 56 | Upgrade Consul agent => memcached nodes are replaced by empty nodes => cache hit decreased 57 | 58 | ### Cache failure => DB heavy read load + read amplification => DB overload 59 | 60 | A Cache can reduce a lot of loads for DB. When cache failure happens, DB will suffer much more load than usual. 61 | 62 | For Slack, one function/API needs to query many DB shards as it is not sharded well. 63 | 64 | Those caused the DB to be overloaded. 65 | 66 | ### Cascading failure 67 | Then cascading failure happened. 68 | 69 | ![Screen Shot 2022-04-29 at 9 35 05 AM](https://user-images.githubusercontent.com/3775525/165872687-9f9966e7-d2bb-41bc-a095-bae4e874c179.png) 70 | 71 | 72 | ## Thinking in general 73 | 74 | Instead of focusing on the incident closely we can make the problem more general and see what we can learn from previous experiences. 
75 | 76 | ### Cache failure 77 | 78 | After we add a cache layer it will have two effects: speed up requests and reduce DB read load[4]. 79 | 80 | Most applications are read-heavy, for example in Facebook, they have two orders of magnitude more reads than writes[3]. 81 | 82 | Another fact is cache is much faster say 10 times than DB usually. 83 | 84 | So if the cache crashed, DB will suffer a really heavy load most time. That means we need to take cache failure seriously. 85 | 86 | For caching, Facebook has published a paper[3] about how they design their cache system and make it available all the time. 87 | 88 | ### High availability 89 | 90 | For achieving high availability, we need to ask "what happens if it fails?"[8]. 91 | 92 | The system may get stuck(livelock[6]) or crash(cascading failure) by any component failure. 93 | 94 | We have two options, fix all potential issues or make the system recover fastly. 95 | 96 | Options one is infeasible because there are too many components and too many potential issues. 97 | 98 | But we can achieve fast recovery, that coverts unspecific problems into specific ones. 99 | 100 | For Google GFS[7], they think "component failures are the norm rather than the exception.", and they handle availability problems through fast recovery and replication. 101 | 102 | ### Overload control 103 | 104 | Most problems are caused by overload and most time the loads are not predictable. 105 | 106 | To avoid such problems, we can design some mechanisms to protect our systems. 107 | 108 | The most obvious way is rate-limiter, it sets a hard value and if throughput is over the value, the rate-limter will drop the over-part. 109 | 110 | One inconvenience of the rate-limiter is we need to change it manually, as we can see in this incident. 111 | 112 | Microservices have complexity dependency whose parts are dynamic changed(software upgrade, hardware config update, or something else), so it's impossible to find a good value for all situations. 113 | 114 | Netflix created [concurrency-limits](https://github.com/Netflix/concurrency-limits) based on such observations. 115 | 116 | Another thing we need to consider is how to provide the best experience for users when we start to drop their requests. 117 | 118 | "Overload Control for Scaling WeChat Microservices"[5] has a good answer to this question. 119 | 120 | In the paper, they drop requests based on features and users, and they use queue time to detect overload config the limit value dynamically. 121 | 122 | ## References 123 | 1. Slack’s Incident on 2-22-22 124 | https://slack.engineering/slacks-incident-on-2-22-22/ 125 | 2. Slack status history 126 | https://status.slack.com/2022-02-22 127 | 3. Scaling Memcache at Facebook at NSDI '13 128 | https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/nishtala 129 | 4. MIT 6.824 Spring 2022 LEC 16 130 | https://pdos.csail.mit.edu/6.824/schedule.html 131 | 5. Hao Zhou. Overload Control for Scaling WeChat Microservices, 2018 132 | 6. J. C. Mogul and K. Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. 1997. 133 | 7. Sanjay Ghemawat. The Google File System 2003 134 | 8. J Armstrong. A history of Erlang. 
2007 135 | -------------------------------------------------------------------------------- /blog/2022-10-05-jemalloc.md: -------------------------------------------------------------------------------- 1 | # How JeMalloc Works 2 | 3 | ## Introduction 4 | 5 | A memory allocator needs to suit hardware(multi-processor + cache line) and use memory efficiently(fast + less fragment). 6 | 7 | By default, JeMalloc uses Thread-Specific-Data and mutiple arenas for every CPU for reducing lock contention. 8 | 9 | For handling cache line related things, JeMalloc relies on multiple arenas and allows applications to pad the allocations. 10 | 11 | JeMalloc allocates different size memory from different memory blocks(slab/extent) which can reduce fragmentation and locate memory quickly. 12 | 13 | This article will introduce how JeMalloc 5.2.1[1] allocates and deallocates memory. And for simplicity, I will focus on small memory allocation and will not cover all details. 14 | 15 | ## Big Picture -- The Hierarchy 16 | 17 | We can image the memory managed in 3 levels: `tcache`, `arena`, and OS. It works like CPU cache -- memory will be allocated from a higher level, if the higher level doesn't have available memory, memory will be allocated from the next level and fill in the higher level, now the memory can be allocated. 18 | 19 | Screen Shot 2022-09-25 at 11 31 51 20 | 21 | `tcache` has many cache_bins, `cache_bin`'s `avail` field will point to regions -- `region` is a unit of memory and will be returned when an application calls `malloc`. 22 | 23 | `arena` has many extents/slabs, a slab will have many regions for small memory. When there is no available memory in a `cache_bin`, the `cache_bin` will try to get memory from `extent/slab.bin` through the current `arena`. If the arena doesn't have an available extent/slab too, JeMalloc will try to allocate memory from OS. 24 | 25 | Screen Shot 2022-09-25 at 11 44 50 26 | 27 | Now let's see how JeMalloc's every component works. 28 | 29 | ## Small Memory Allocation 30 | 31 | ### TCache 32 | 33 | Screen Shot 2022-09-26 at 08 36 28 34 | 35 | When an application allocates memory, JeMalloc will try to find available memory from `tcache`, `tcache` is thread-specific data, as the data is owned by a thread, we do not need to lock it when we use it. The thread-specific data is get by `pthread_setspecific` and set by `pthread_getspecific`. 36 | 37 | Relate data structures are `tsd`, `tcache` and `cache_bin`. `tsd` is used for fast access `tcache` and others. `tcache` has many `cache_bin`, different `cache_bin` is used for allocating different size memory. 38 | 39 | For example, `tcache.bins_small[0]` is used for allocating 0 ~ 8 bytes memory, `tcache.bins_small[1]` is for 9 ~ 16 bytes, the calculation is in `sz_size2index_compute`. 40 | 41 | After finding the right `cache_bin` through size, we can find the available memory address through `cache_bin.avail`, it contains the same size memory(JeMalloc calls this region) addresses, the code is in `cache_bin_alloc_easy`. 42 | 43 | If the `cache_bin` doesn't have available memory, JeMalloc needs to get memory from the next level -- arena, fill the `cache_bin`, and allocate the memory again. Simplified code as below: 44 | 45 | ``` c 46 | JEMALLOC_ALWAYS_INLINE void * 47 | tcache_alloc_small(tsd_t *tsd, arena_t *arena, tcache_t *tcache, 48 | size_t size, szind_t binind, bool zero, bool slow_path) { 49 | // ... 
50 | bin = tcache_small_bin_get(tcache, binind); 51 | ret = cache_bin_alloc_easy(bin, &tcache_success); 52 | assert(tcache_success == (ret != NULL)); 53 | if (unlikely(!tcache_success)) { 54 | bool tcache_hard_success; 55 | arena = arena_choose(tsd, arena); 56 | 57 | ret = tcache_alloc_small_hard(tsd_tsdn(tsd), arena, tcache, 58 | bin, binind, &tcache_hard_success); 59 | } 60 | // ... 61 | 62 | return ret; 63 | } 64 | ``` 65 | 66 | ### Arena 67 | 68 | Screen Shot 2022-09-26 at 08 37 57 69 | 70 | `arena` is used for managing `extent`, `extent` is a unit of memory, it may contain multiple OS pages. `extent` is allocated from or deallocated to OS through OS provides system call such as `mmap`, relates code is in `src/pages.c`. And small size `extent` is called `slab` in JeMalloc. 71 | 72 | Likes `tcache`, `arena` has many `bin` which is used for managing different size memory for the application. `bin` has a `slab` pointer( `bin.slabcur`) and a collections of non-full slab/extent(`bin.slabs_nonfull`). 73 | 74 | When JeMalloc trying to find available memory in a `bin`, JeMalloc will try `bin.slabcur` first, if fails it will try to get a `slab` in `bin.slabs_nonfull` and assign it to the `bin.slabcur`. 75 | 76 | And if `bin.slabs_nonfull` has no available extent/slab too, JeMalloc will try to find the memory in `arena`'s extents(the order is `extents_dirty` -> `extents_muzzy` -> `extents_retained` -> `extent_avail`, `extents_dirty`, `extents_muzzy` and `extents_retained` are used for purging, will cover in other sections). If all those do not have available extents, JeMalloc will allocate memory from the OS. 77 | 78 | ## Purging 79 | When an application frees memory, JeMalloc needs to determine how to return the memory. JeMalloc will return memory to `tcache` first. When `tcache` is full, JeMalloc will return free memory to `extent`, and arena will handle how and when returning the unused extents to the system. 80 | 81 | ### Application -> `tcache` -> extent 82 | When an application frees a small memory, the trace will be `je_free` -> `ifree` -> `idalloctm` -> `tcache_dalloc_small`. 83 | 84 | In `tcache_dalloc_small`, the deallocation relates code is: 85 | 86 | ```c 87 | JEMALLOC_ALWAYS_INLINE void 88 | tcache_dalloc_small(tsd_t *tsd, tcache_t *tcache, void *ptr, szind_t binind, 89 | bool slow_path) { 90 | // ... 91 | if (unlikely(!cache_bin_dalloc_easy(bin, bin_info, ptr))) { 92 | tcache_bin_flush_small(tsd, tcache, bin, binind, 93 | (bin_info->ncached_max >> 1)); 94 | bool ret = cache_bin_dalloc_easy(bin, bin_info, ptr); 95 | assert(ret); 96 | } 97 | // ... 98 | } 99 | ``` 100 | 101 | `cache_bin_dalloc_easy` will retun memory to current `cache_bin`. If `cache_bin` is full (`bin->ncached == bin_info->ncached_max`), JeMalloc will call `tcache_bin_flush_small`. 102 | 103 | In `tcache_bin_flush_small`, it will return half(`bin_info->ncached_max >> 1` in tcache_dalloc_small) to extent, `arena_dalloc_bin_locked_impl` will do this work: 104 | 105 | 106 | ```c 107 | static void 108 | arena_dalloc_bin_locked_impl(tsdn_t *tsdn, arena_t *arena, bin_t *bin, 109 | szind_t binind, extent_t *slab, void *ptr, bool junked) { 110 | // ... 111 | if (nfree == bin_info->nregs) { 112 | arena_dissociate_bin_slab(arena, slab, bin); 113 | arena_dalloc_bin_slab(tsdn, arena, slab, bin); 114 | } else if (nfree == 1 && slab != bin->slabcur) { 115 | arena_bin_slabs_full_remove(arena, bin, slab); 116 | arena_bin_lower_slab(tsdn, arena, slab, bin); 117 | } 118 | // ... 
119 | } 120 | ``` 121 | 122 | If `nfree == bin_info->nregs`, it means slab is empty, so need to call `arena_dissociate_bin_slab` and then `arena_dalloc_bin_slab`. `arena_dalloc_bin_slab` will append the free slab(extent) to current `arena->extents_dirty` by calling `arena_extents_dirty_dalloc`. 123 | 124 | 125 | ## extent -> system 126 | 127 | JeMalloc is a user level memory allocator, we need to consider other applications' memory usage in the current system, so JeMalloc needs to return free memory back when necessary. And after an arena returns free memory to the system, other arenas can use it too. 128 | 129 | For returning memory to system, we need to call a system call and in the system call, the system needs to zero the memory and mark the memory as unused. If we return too much memory at one step, it may hurt performance. 130 | 131 | And when an `extent` is free, the application may use it in a very short time. For such a case, we'd better not return to the system too quickly. 132 | 133 | JeMalloc uses "decay" to handle those things. We already know free slab(extent) is appended to `arena->extents_dirty` by `arena_extents_dirty_dalloc`. Extents in `arena->extents_dirty` will be moved to `arena->extents_muzzy` and the extents in `arena->extents_muzzy` will be returned to the system. 134 | 135 | Let's see how the "decay" works and then see how the extents are returned from `arena->extents_dirty` to `arena->extents_muzzy`, and then returned to the system. 136 | 137 | ### Decay 138 | JeMalloc uses a `ticker` to count down how many times an arena allocated/deallocated memory. And when the `ticker` reaches 0, it will trigger `arena_decay` which will calculate how long passed since the last time and call `arena_decay` to do purging work when needed. 139 | 140 | The ticker is initialized by `ticker_init(&arenas_tdata[i].decay_ticker, DECAY_NTICKS_PER_UPDATE);` in `arena_tdata_get_hard`, the value of `DECAY_NTICKS_PER_UPDATE` is 1000. 141 | 142 | `arena_decay_ticks` is used for handling tick related thing and it will be called when memory is allocated or deallocated, such as by `arena_malloc_small`. 143 | 144 | For `arena_decay_ticks`, it will call `ticker_ticks(decay_ticker, nticks)` for updating the count. After subtract `nticks` (`ticker->tick -= nticks`), if the `ticker->tick` is less than 0, JeMalloc will reset it to 0 and return true. Then `arena_decay` will be called. 145 | 146 | ``` c 147 | JEMALLOC_ALWAYS_INLINE void 148 | arena_decay_ticks(tsdn_t *tsdn, arena_t *arena, unsigned nticks) { 149 | // ... 150 | if (unlikely(ticker_ticks(decay_ticker, nticks))) { 151 | arena_decay(tsdn, arena, false, false); 152 | } 153 | } 154 | ``` 155 | 156 | After we know how the `arena_decay` is trigged, let's see how `arena_decay` works. 157 | 158 | For high level, JeMalloc wants to return memory back smoothly, it means when JeMalloc find n unused pages, instead of returning the n pages in one time, it will return them in serval seconds(default dirty decay is 10 seconds and stored in `DIRTY_DECAY_MS_DEFAULT`) and in several steps(default is 200 and it is in `SMOOTHSTEP_NSTEPS`.) 159 | 160 | In `struct arena_decay`, `backlog` is used for recording how many unused dirty pages were generated during each of the past SMOOTHSTEP_NSTEPS decay epochs. `nunpurged` is used for recording the number of unpurged pages at beginning of current epoch. 
161 | 162 | The current epoch `backlog` is calculated by `current_npages - decay->nunpurged` and `backlog` saves the number in reverse order, so the current epoch number is stored in the last index, see `arena_decay_backlog_update_last`. 163 | 164 | And when epochs passed, JeMalloc will move the backlog forward by `memmove(decay->backlog, &decay->backlog[nadvance_z], (SMOOTHSTEP_NSTEPS - nadvance_z) * sizeof(size_t));`, and as time may pass more than one epoch, needs set uncalculated backlog to 0 by `memset(&decay->backlog[SMOOTHSTEP_NSTEPS - nadvance_z], 0, (nadvance_z-1) * sizeof(size_t));`, all those are in `arena_decay_backlog_update`. 165 | 166 | JeMalloc uses `Smoothstep`[2] function to calculate how many pages need to remain. For example, when x = 1, y = 1, that means no pages need to be purged, and when x = 0, y = 0, that means all the pages need to be purged. 167 | 168 | ![220px-Smoothstep_and_Smootherstep svg](https://user-images.githubusercontent.com/3775525/193720890-e69d2517-2560-4f76-a2f7-1d2067805dfd.png) 169 | 170 | 171 | For avoiding calculation, JeMalloc generates those values by `smoothstep.sh`, and the value is encoded by binary fixed point representation and then they are stored in `h_steps`. 172 | 173 | `arena_decay_backlog_npages_limit` is used for calculating how many pages need to remain. 174 | 175 | ``` c 176 | static size_t 177 | arena_decay_backlog_npages_limit(const arena_decay_t *decay) { 178 | // ... 179 | /* 180 | * For each element of decay_backlog, multiply by the corresponding 181 | * fixed-point smoothstep decay factor. Sum the products, then divide 182 | * to round down to the nearest whole number of pages. 183 | */ 184 | sum = 0; 185 | for (i = 0; i < SMOOTHSTEP_NSTEPS; i++) { 186 | sum += decay->backlog[i] * h_steps[i]; 187 | } 188 | npages_limit_backlog = (size_t)(sum >> SMOOTHSTEP_BFP); 189 | 190 | return npages_limit_backlog; 191 | } 192 | 193 | ``` 194 | 195 | ### dirty -> muzzy and muzzy -> system 196 | 197 | Now let's see how the extents are moved from `dirt` to` muzzy` and how `muzzy` returns them to the system. 198 | 199 | `arena_decay` will handle dirty and muzzy extents by calling `arena_decay_dirty` and `arena_decay_muzzy`. `arena_maybe_decay` will do the decay job for `arena_decay_dirty` and `arena_decay_muzzy`. 200 | 201 | In `arena_maybe_decay`, if `decay->time_ms` is 0, will call `arena_decay_to_limit` and the `limit` params is 0, that means deallocate all free extents. 202 | 203 | And if `decay->time_ms` is not zero, will handle epoch advance related things, calculate limit by `arena_decay_backlog_npages_limit(decay)` and use the limit to call `arena_decay_to_limit`. 204 | 205 | `arena_decay_stashed` will do purging work for `arena_decay_to_limit`. 206 | 207 | In `arena_decay_stashed`, it will check ‘extents' state, if the state is `extent_state_dirty`, it will add the extents to `arena->extents_muzzy`, and if the state is `extent_state_muzzy`, the extent will return to system by `extent_dalloc_wrapper`. 208 | 209 | Default dirty decay time 10 seconds(`DIRTY_DECAY_MS_DEFAULT`) and default muzzy decay time is 0(MUZZY_DECAY_MS_DEFAULT). 210 | 211 | ## Others 212 | 213 | Memory allocation is important for application performance, there are many different implementation[3][4][5][6] or designed for special purpose[7][8]. 214 | 215 | JeMalloc provides per CPU arena but this feature has been disabled by default. 216 | 217 | TCMalloc uses per-thread mode first and it doesn't work well when an application has a large thread count. 
Then TCMalloc migrated to the per-CPU approach[8]. 218 | 219 | Slab allocator[5] is used for kernel object caching which has specific consideration about cache line usage. JeMalloc relies on multiple arenas and allows applications to pad the memory[3]. 220 | 221 | The cache line will be helpfull in some cases but not all cases. CPHash[9] does some comparison about this, CPHash is a hash table which designed for reducing cache missing. In its benchamrk, when the hash table's memory usage is bigger than L3 cache, the performance wil drop and is limitted by DRAM. 222 | 223 | Erlang VM creates one thread for every CPU which is used for scheduling Erlang processes, so it's easier to do per-CPU related things. And Erlang has different allocators for different usages, such as `ets_alloc` for in-memory key-value storage, `temp_alloc` for temporary allocations, and it can specify different fit algorithms for different allocators[6]. 224 | 225 | MICA[7] is in-memory key-value storage. In its cache mode, memory management is very interesting. It organizes its memory as a fix-sized circular log, and key-value items will be appended to the tail of the log. When the circular log is full, items will be evicted from the head. So no need for garbage collection and defragmentation. 226 | 227 | LAMA is designed for solving slab calcification[11], this problem may occur in the default Memcached server, Memcached allocates different size memory from the different slabs. The memory allocator needs to consider how to evict item(s) when no free memory. LAMA is designed for distributing memory to different slabs dynamically based on the access pattern for reducing the miss rate. 228 | 229 | ## Summary 230 | In this article, instead of describing the detail such as each structure's fields or how its extent rtree works, I focus on introducing the relationship between different structures and how those structures work together for allocating and deallocating memory. Hope it is helpful for people who want to understand how JeMalloc works. 231 | 232 | 233 | ## References 234 | 1. JeMalloc 235 | https://github.com/jemalloc/jemalloc/releases/tag/5.2.0 236 | 2. Smoothstep 237 | https://en.wikipedia.org/wiki/Smoothstep 238 | 3. A Scable Concurrent malloc(3) Implementation for FreeBSD 239 | 4. TCMalloc 240 | https://github.com/google/tcmalloc 241 | 5. The Slab Allocator: An Object-Caching Kernel Memory Allocator 242 | 6. Erlang memory allocator 243 | https://github.com/erlang/otp/blob//OTP-25.1.1/erts/emulator/beam/erl_alloc.c 244 | 7. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage 245 | 8. TCMalloc : Thread-Caching Malloc 246 | https://google.github.io/tcmalloc/design.html 247 | 9. CPHash: A Cache-Partitioned Hash Table 248 | 10. LAMA: Optimized Locality-aware Memory Allocation for Key-value Cache 249 | 11. Caching with twemcache. 250 | https://blog.twitter.com/2012/caching-with-twemcache 251 | -------------------------------------------------------------------------------- /blog/2025-01-01.md: -------------------------------------------------------------------------------- 1 | # Understanding the Page Table Step by Step 2 | 3 | The concept of the Page Table can be challenging to grasp, and it has puzzled me for a long time. However, during a recent revisit to xv6, a simple Unix-like teaching operating system, I realized that it becomes much easier to understand by thinking of it as a specialized hash map and breaking it down into smaller steps. 4 | 5 | 6 | ## What is a Page Table? 
7 | 8 | A page table is a specialized hash map that maps contiguous virtual memory to contiguous physical memory. We can store this map as a large array, and multi-level page tables reduce the memory usage by allowing lazy allocation. 9 | 10 | 11 | ## Step 1: Mapping Contiguous Virtual Memory to Contiguous Physical Memory 12 | 13 | Let’s simplify the problem further: mapping virtual memory to physical memory. In programming, we can use a hash map to achieve this. 14 | 15 | ```c 16 | uint64 virtual_memory_to_physical_memory(uint64 virtual_addr) { 17 | // Each process has its own hash map, mapping the same virtual memory 18 | // to different physical memory. 19 | HashMap map = current_process_map(); 20 | 21 | return map[virtual_addr]; 22 | } 23 | ``` 24 | 25 | Due to spatial locality, we prefer to map contiguous virtual memory to contiguous physical memory. Instead of requiring the entire virtual memory to be contiguous in physical memory, we divide it into smaller units called pages. Within a page, all the virtual memory is mapped to the same physical memory page, which fulfills the performance requirement and allows us to allocate physical memory lazily. 26 | 27 | Screenshot 2025-01-01 at 15 10 32 28 | 29 | 30 | Now, each page table entry corresponds to the starting address of a page. Assuming a page size of 4KB, the last 12 bits of the virtual address are irrelevant for determining the index. 31 | 32 | `uint64 index = virtual_addr >> 12;` 33 | 34 | 35 | The last 12 bits of the virtual address act as an offset. The mapping function now looks like this: 36 | 37 | ```c 38 | uint64 virtual_addr_to_physical_addr(uint64 virtual_addr) { 39 | HashMap page_table = current_process_map(); 40 | uint64 index = virtual_addr >> 12; 41 | uint64 offset = virtual_addr & 0xFFF; 42 | 43 | return page_table[index] + offset; 44 | } 45 | ``` 46 | 47 | ## Step 2: Representing the Page Table in Memory 48 | 49 | Now let’s consider how to represent the hash map in memory. Since `virtual_addr_to_physical_addr` maps different virtual addresses to the starting addresses of physical pages and the indices are continuous, we can use an array to represent the hash map. Each array item is a page table entry (PTE). 50 | 51 | In memory, the structure looks like this: 52 | 53 | Screenshot 2025-01-01 at 15 09 56 54 | 55 | ## Step 3: Multi-Level Page Tables 56 | 57 | From the above image, we notice an obvious problem: we need to allocate a large page table even though most of the virtual memory might not be in use. For example, to map 64GB of virtual memory with 4KB pages, we need an array of length 64 * 1024 * 1024 / 4 = 2 ^ 24. If each entry requires 8 bytes, the page table would require 128MB of memory. 58 | 59 | To reduce memory usage, we can use multi-level page tables. For example, a two-level setup allows the root page table's entries to point to secondary page tables. Using a 4KB page to store a page table with 8-byte entries, each table can contain 4096 / 8 = 512 entries pointing to other page tables. Each of these secondary page tables can map 60 | 512 × 4 KB = 2 MB of virtual memory. Thus, one root page table can map 512 × 2 MB = 1024 MB, or 1 GB, of virtual memory. Memory for secondary page tables is allocated only when necessary.
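As a quick sanity check on these numbers, here is a small sketch of the arithmetic only, assuming 4 KB pages and 8-byte PTEs as above:

```c
#include <stdio.h>

int main(void) {
    unsigned long page_size = 4096;                 // bytes per page
    unsigned long pte_size  = 8;                    // bytes per page table entry
    unsigned long entries   = page_size / pte_size; // 512 entries per table
    unsigned long leaf_map  = entries * page_size;  // 512 * 4 KB = 2 MB per secondary table
    unsigned long root_map  = entries * leaf_map;   // 512 * 2 MB = 1 GB per root table

    printf("entries per table: %lu\n", entries);
    printf("one secondary table maps: %lu MB\n", leaf_map >> 20);
    printf("one root table maps: %lu GB\n", root_map >> 30);
    return 0;
}
```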
61 | 62 | Screenshot 2025-01-01 at 16 54 33 63 | 64 | The function for two-level page tables is as follows: 65 | ```c 66 | uint64 virtual_addr_to_physical_addr(uint64 virtual_addr) { 67 | uint64 *root_page_table = current_process_map(); 68 | uint64 root_index = (virtual_addr >> (9 + 12)) & 0x1FF; 69 | 70 | uint64 *page_table = (uint64 *)root_page_table[root_index]; 71 | uint64 page_table_index = (virtual_addr >> 12) & 0x1FF; 72 | 73 | uint64 offset = virtual_addr & 0xFFF; 74 | 75 | return page_table[page_table_index] + offset; 76 | } 77 | ``` 78 | 79 | ## Others 80 | 81 | There are many other aspects of page tables, such as storing page permissions in PTEs and using the Translation Lookaside Buffer (TLB) to cache PTEs. These topics are beyond the scope of this article but are covered in the xv6 textbook and the MIT operating systems course. 82 | 83 | In computer science, many concepts appear complex at first. However, by breaking them into smaller steps, their underlying ideas become clear and manageable. Taking baby steps is often the golden rule. 84 | -------------------------------------------------------------------------------- /blog/2025-01-04-linear-hash.md: -------------------------------------------------------------------------------- 1 | # Understanding Linear Hash Step by Step 2 | 3 | Many computer science concepts seem complex at first; however, if we break them down step by step, they become much clearer. In this article, I will describe how a hash table works, how resizing works, and how linear hashing improves resizing. By the end, you will see that linear hashing is straightforward to understand. 4 | 5 | ## Hash Table 6 | A hash table uses an array to store key-value pairs. When we insert a pair, we use a hash function to calculate the key’s index and then insert the pair at that index in the array. 7 | 8 | ```ruby 9 | def insert(key, val) 10 | index = hash(key) % @array.length 11 | @array[index] = [key, val] 12 | end 13 | ``` 14 | 15 | Since two different keys may have the same hash value (`hash(key1) == hash(key2)`), we handle this by using a linked list to store multiple pairs at the same index. 16 | 17 | ```ruby 18 | def insert(key, val) 19 | index = hash(key) % @array.length 20 | node = Node.new(key, val) 21 | if @array[index] 22 | # link the new node to the existing chain at this index 23 | node.next = @array[index] 24 | end 25 | @array[index] = node 26 | end 27 | ``` 28 | 29 | ## Resizing & Linear Hash 30 | From the code above, we see that if a newly inserted key’s hash value maps to the same index as an existing key’s, the new pair is added to the linked list at that index. If such collisions occur frequently, the linked list grows longer, and the hash table becomes slower. 31 | 32 | The simplest solution is to double the array size. But after doubling the array size, we must rearrange the key-value pairs. 33 | 34 | Suppose the array’s length is 4, with key1’s hash value being 0 and key2’s hash value being 4. Since the array’s length is 4, both of these keys are stored at index 0. After doubling the array size, key1 should be stored at index 0 and key2 at index 4. After resizing, we must rearrange these pairs. This process is called resizing or rehashing. 35 | 36 | Screenshot 2025-01-03 at 00 27 45 37 | 38 | This basic resizing algorithm resolves the conflict issue, but if the array becomes very large, we must move many pairs to their new locations, which can consume a lot of CPU in a short time. 39 | 40 | To improve this, instead of doubling the array size, we could increase the size one step at a time.
For example, we add one element to the array at a time and move the corresponding pairs to the new array item. Now, we need to consider how to calculate the index. 41 | 42 | Suppose we use the old hash function `hash(key) % 4` and get an index of 1, 2, or 3. In this case, we know there is no new array item for those keys—they are correctly indexed by the old hash function. However, if the index is 0, after adding the new item, its correct index could be 0 or 4, so we need to use a new hash function `hash(key) % 8` to find the correct index. 43 | 44 | With two hash functions (`hash(key) % 4` and `hash(key) % 8`), we need to determine which one to use. The solution is simple: we can maintain a cursor to track which elements have already been resized. First, we use the old hash function to calculate the index. If the index is less than the cursor, it means we have already created a new array element for it, so we use the new hash function. If the index is greater than or equal to the cursor, we know it doesn't have a new array element, so the old hash function is correct. 45 | 46 | Screenshot 2025-01-03 at 00 30 34 47 | 48 | ```ruby 49 | def index(key) 50 | i = hash(key) % @n 51 | 52 | if i < @cursor 53 | i = hash(key) % (@n * 2) 54 | end 55 | 56 | i 57 | end 58 | ``` 59 | 60 | ## Summary 61 | This article doesn't cover all the details, such as how to maintain the `@n`, but I hope it gives you a clear understanding of linear hashing. Learning an algorithm’s implementation may seem complex at first, but once we understand the ideas behind it, everything becomes much clearer. 62 | -------------------------------------------------------------------------------- /blog/2025-01-15-non-blocking-stack-profiler.md: -------------------------------------------------------------------------------- 1 | # How SDB Scans the Ruby Stack Without the GVL 2 | 3 | ## Introduction 4 | 5 | Several Ruby stack profilers already exist, but most of them depend on the Ruby Global VM Lock (GVL), which blocks applications while scanning the Ruby stack. Based on my testing, a Ruby stack profiler can increase request latency by up to 10% when sampling at a 1ms interval. Rbspy's async mode is an exception, as it scans the Ruby stack from a separate process. However, this approach can lead to errors[1] because it accesses complex Ruby data structures without acquiring the GVL. 6 | 7 | In this article, I will introduce how SDB scans Ruby stacks without the GVL while maintaining safety and delivering the results we want. It allows us to scan the Ruby stack without increasing application delay. I believe this implementation is interesting as it demonstrates the benefits of releasing the GVL in a performance-sensitive library. 8 | 9 | 10 | ## Scanning Stacks Without GVL 11 | 12 | The Ruby GVL ensures the integrity of Ruby data but impacts concurrency. 13 | 14 | It seems that we must hold the GVL to access Ruby data. However, when we look closely, if the data itself can guarantee atomic, it doesn’t need the GVL’s protection. For example, it is safe to read 64-bit aligned data since its updates are atomic — there are no partial updates[2][3]. An ISeq’s address is an example of such data. If we know where an ISeq’s address is stored, it is safe to read it without holding the GVL. 15 | 16 | The Ruby stack contains ISeq data in an array of `rb_control_frame_struct` objects, allocated when Ruby creates a thread. It is safe to read it as long as Ruby has not reclaimed the stack. 
Without the GVL, we might encounter empty or outdated data (due to lack of memory barriers), but we will not read invalid data (e.g., addresses from a different ISeq). 17 | 18 | To ensure we stop scanning a thread's stack once it has been reclaimed, we can use meta-programming to hook into Ruby's thread lifecycle. The following example demonstrates how to wrap thread creation and deletion: 19 | 20 | ```ruby 21 | module ThreadInitializePatch 22 | def initialize(*args, &block) 23 | old_block = block 24 | 25 | block = ->() do 26 | Sdb.thread_created(Thread.current) 27 | result = old_block.call(*args) 28 | Sdb.thread_deleted(Thread.current) 29 | result 30 | end 31 | 32 | super(&block) 33 | end 34 | end 35 | 36 | Thread.prepend(ThreadInitializePatch) 37 | ``` 38 | 39 | Before Ruby reclaims a thread, it calls `Sdb.thread_deleted`, which removes the thread from SDB's scanning list. This ensures safe stack scanning without the GVL. 40 | 41 | ## Symbolization 42 | 43 | The data we collected are ISeq addresses, which are not readable to humans.SDB uses eBPF to capture ISeq creation events during compilation or loading (e.g., via bootsnap). Then, we can use an offline program to translate these addresses into human-readable symbols. 44 | Ruby's memory compaction can move ISeq objects to new addresses. To handle this, we can use eBPF to capture memory compaction events too. This feature is still in progress, as Ruby's memory compaction is not enabled by default and is rarely used in production applications. 45 | 46 | SDB’s architecture looks like this: 47 | image 48 | 49 | 50 | ## Ensuring Fully Correctness 51 | 52 | Data races may occur if the Ruby VM updates the stack while SDB is reading it. However, resolving this issue completely is not the primary goal of a stack profiler, as minor inaccuracies do not typically affect its usage in identifying performance bottlenecks. 53 | 54 | Similar issues exist in other profilers like rbspy[4], py-spy[5], and async-profiler[6]. The key question is whether the profiler can identify slow paths in the application. Data races can only occur if the stack is updated faster than the scanner can read it, which typically happens with extremely fast functions that have minimal impact on overall performance. 55 | 56 | To manage this, we can use a generation number for optimistic concurrency control. When a stack is pushed, we increment the generation number by one and check this number before and after scanning the stack.[7] 57 | 58 | ## Summary 59 | 60 | This article introduces how SDB scans the Ruby stack without relying on the GVL. By not blocking the application, it enables faster stack scanning and more accurate latency measurements, making a true always-on stack profiler possible. While releasing the GVL resolves the main challenges faced by Ruby stack profilers, it is not sufficient to create a fully non-blocking stack profiler. SDB employs additional concurrency techniques to achieve this, such as spinlocks for synchronizing threads between the Ruby VM and SDB, memory barriers for trace IDs, and a left-right style [7] message queue for its new symbolizer. SDB is still under development, with features such as memory compaction support and detailed evaluations and comparisons based on a Rails application coming soon. 61 | 62 | ## References 63 | 64 | 1. [Should I pause a Ruby process to collect its stack? 65 | ](https://jvns.ca/blog/2018/01/15/should-i-pause-a-ruby-process-to-collect-its-stack/#what-happens-if-i-don-t-pause-the-ruby-process-i-m-profiling) 66 | 2. 
https://pdos.csail.mit.edu/6.1810/2024/lec/l-rcu.txt 67 | 3. [Ruby Memory Model](https://docs.google.com/document/d/1pVzU8w_QF44YzUCCab990Q_WZOdhpKolCIHaiXG-sPw/edit?tab=t.0) 68 | 4. https://github.com/rbspy/rbspy 69 | 5. https://github.com/benfred/py-spy 70 | 6. https://github.com/async-profiler/async-profiler 71 | 7. Left-Right: A Concurrency Control Technique with Wait-Free Population Oblivious Reads 72 | -------------------------------------------------------------------------------- /blog/2025-01-27-memory-barrier.md: -------------------------------------------------------------------------------- 1 | # Understanding Memory Barriers Step by Step - A Summary of "Memory Barriers: A Hardware View for Software Hackers" 2 | 3 | ## Introduction 4 | 5 | This article is a summary of "Memory Barriers: A Hardware View for Software Hackers." and instead of covering all the details, I will focus on explaining how things come. Since memory barriers are quite complex, it is highly recommended to read the original paper for a more thorough understanding. 6 | 7 | CPUs reorder memory references to improve performance. However, in certain situations—such as synchronization primitives—we need memory barriers to ensure correctness. 8 | 9 | A highly simplified story is (The "→" represents causation): 10 | 11 | *Multiple CPU cores + cache lines → data is replicated across different cache lines → Cache-Coherence Protocols are used to guarantee consistency → these protocols sometimes require blocking CPU execution → asynchronous messaging is introduced but causes state inconsistencies → memory barriers are used to guarantee partial ordering.* 12 | 13 | Modern computers have multiple CPU cores, and each core has its own cache. This means a variable(address) might be replicated across different CPU caches. To prevent inconsistencies or data loss for a single variable across multiple caches, cache-coherence protocols are used. 14 | 15 | Cache-coherence protocols rely on message exchanges between caches and main memory. These exchanges introduce latency, as CPUs must wait for acknowledgments, which is extremely slow compared to CPU processing speeds. For example, when a CPU performs a store operation, it might need to wait for acknowledgments from other CPUs, blocking its execution. To mitigate such stalls, CPUs proceed with local state updates without waiting for responses or quickly send acknowledgments without fully applying changes to the cache line. This behavior is similar to how changes within a database transaction remain visible only to that transaction until it is committed. Once the CPU receives all necessary responses, the updated state becomes globally visible to all CPUs. 16 | 17 | While this approach works well for single variables, problems arise when multiple memory addresses and concurrent operations across CPUs are involved. **Memory barriers** address these issues by providing ordering guarantees. When a memory barrier is inserted, all memory operations (reads or/and writes) before it are guaranteed to complete before any memory operations after it begin. Using a database analogy, a memory barrier acts like a "commit," ensuring that all prior store and/or load operations are finalized before continuing. This enforces a *happens-before* relationship. 18 | 19 | In the following sections, I will introduce the Cache-Coherence Protocols, store buffers and invalidate queues. 
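To make the guarantee concrete before diving into the hardware, here is a minimal sketch of the flag-publishing pattern in portable C11. The paper and the examples below use the Linux-kernel-style `smp_mb()`; C11's `atomic_thread_fence` expresses a similar idea, and the release/acquire fences used here are weaker than a full barrier but sufficient for this pattern.

```c
#include <stdatomic.h>

int data = 0;
atomic_int ready = 0;

void publish(void) {   /* writer */
    data = 42;
    /* make the store to data visible before the store to ready */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consume(void) {    /* reader */
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;              /* spin until the flag is published */
    /* do not let the read of data move above the read of ready */
    atomic_thread_fence(memory_order_acquire);
    return data;       /* guaranteed to observe 42 */
}
```

This is exactly the situation the `foo`/`bar` example below builds up to, so it may help to keep this pattern in mind while reading the hardware-level explanation.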
20 | 21 | ## Cache-Coherence Protocols 22 | 23 | memory-barrier 24 | 25 | As data can be replicated to different CPU caches, cache-coherence protocols are required to ensure consistency. 26 | 27 | Basically, a CPU can update a cache line only when that cache line has exclusive ownership of the cached data. For example, if a cache line is in a "shared" state (i.e., multiple caches have replicas of the same address), the CPU must send an "invalidate" message to all other CPUs and wait for their responses. Upon receiving this message, the other CPUs remove the corresponding data from their caches and respond. Once the CPU that initiated the invalidation receives all the responses, it gains exclusive ownership of the cache line and can modify it. This process is described in Transition (h) and Transition (b) of the paper. 28 | 29 | While there are more details in the paper, the basic idea is the same: when a CPU wants to modify a cache line, it must have the exclusive ownership of the data. 30 | 31 | ## Store Buffers 32 | 33 | ### Why We Need Store Buffers 34 | 35 | 1. Cache-coherence protocols ensure consistency but can be slow. 36 | For example, a CPU must wait for responses from all other CPUs when writing to a shared cache line. 37 | 2. Store buffers are added to avoid blocking. 38 | The CPU can record its write operation in the buffer and continue executing other instructions. 39 | 3. The data is removed from the store buffer after the CPU receives all necessary acknowledgments. 40 | 41 | Now the CPU becomes: 42 | 43 | memory-barrier 44 | 45 | ### Store Forwarding 46 | 47 | As a CPU writes data to both the cache line and store buffer, for reading its own update, it needs to read from the store buffer (not directly from the cache). 48 | 49 | 50 | For example, the following code is executed by a single CPU: 51 | 52 | ```c 53 | 1 a = 1 54 | 2 b = a + 1 55 | ``` 56 | 57 | At line 1(`a = 1`), the value 1 is stored in the store buffer. At line 2(`b= a + 1`), the CPU may need to read a. If it reads `a` from the cache directly, it would see the old value (0) because the new value hasn't been applied to the cache yet. To avoid this inconsistency, the CPU reads `a` from the store buffer, which has the updated value. This guarantee is called self-consistency, and the mechanism is referred to as store forwarding. 58 | 59 | 60 | ### Store Buffers and Memory Barriers 61 | 62 | The next issue arises from the interaction of multiple variables and multiple CPUs. 63 | 64 | ```c 65 | 1. void foo(void) { 66 | 2. a = 1; 67 | 3. b = 1; 68 | 4. } 69 | 5. 70 | 6. void bar(void) { 71 | 7. while (b == 0) continue; 72 | 8. assert(a == 1); 73 | 9. } 74 | ``` 75 | 76 | Here, CPU 0 executes `foo`, and CPU 1 executes `bar`. If all instructions are executed in order, the assertion (`assert(a == 1)`) cannot fail. Surprisingly, hardware does not always guarantee this ordering. 77 | 78 | 79 | Assume CPU 0 owns b in its cache but not a, and CPU 1 has a in its cache. When CPU 0 executes line 2 (`a = 1`), it stores the value in its store buffer because it doesn’t yet own `a`. When CPU 0 executes line 3 (`b = 1`), it updates b directly in its cache, as it already owns it. CPU 1 then executes line 7 (`while (b == 0)`), and upon detecting that `b` is 1, it proceeds to line 8 (`assert(a == 1)`). Since CPU 1 hasn’t received an "invalidate" message for `a`, it reads the old value (0) from its cache, causing the assertion to fail. 
80 | 81 | The failure occurs because after CPU 0 executes line 2 (`a = 1`), the value is still in its store buffer and not visible to other CPUs. However, line 3 (`b = 1`) is directly applied to the cache, making it visible. As a result, from CPU 1’s perspective, `b = 1` happens before `a = 1`, leading to an incorrect ordering. 82 | 83 | 84 | To prevent such issues, a memory barrier need to be inserted to enforce the correct ordering of operations. This guarantees that all store operations before the barrier are visible to other CPUs before any store operations after the barrier are applied. The modified code would look like this: 85 | 86 | ```c 87 | 1. void foo(void) { 88 | 2. a = 1; 89 | 3. smp_mb(); 90 | 4. b = 1; 91 | 5. } 92 | 6. 93 | 7. void bar(void) { 94 | 8. while (b == 0) continue; 95 | 9. assert(a == 1); 96 | 10. } 97 | ``` 98 | 99 | In this case, when CPU 0 executes line 3 (`smp_mb()`), it marks all current store-buffer entries (namely, `a = 1`). Then, when it executes line 4 (`b = 1`), since there is a marked entry, instead of immediately applying `b = 1` to its cache, it saves it in its store buffer (but as an unmarked entry), even if the CPU already owns the cache line for b. Thus, the memory barrier at line 3 ensures that other CPUs observe `a = 1` happening before `b = 1`. 100 | 101 | ## Invalidate Queues 102 | ### Why We Need Invalidate Queues 103 | - When a cache line is busy, the CPU might fall behind in processing "invalidate" messages. 104 | - Then the store buffer could be full and it stalls the CPU's execution. 105 | - To address the speed discrepancy, a new queue is added to each CPU. 106 | When an "invalidate" message arrives at the queue, the CPU can immediately acknowledge it. 107 | - The CPU ensures that, when placing an entry into the invalidate queue, it processes the entry before transmitting any protocol messages related to that cache line. 108 | 109 | Now the CPU looks like this: 110 | 111 | memory-barrier 112 | 113 | ### Invalidate Queues and Memory Barriers 114 | 115 | After the new queue has been added, it introduces additional inconsistency. As the state has been applied to the store buffer, not the cache line, the CPU can see "outdated" data. 116 | 117 | And this inconsistency causes our example to fail again. Let’s examine how this happens: 118 | 119 | ```c 120 | 1. void foo(void) { 121 | 2. a = 1; 122 | 3. smp_mb(); 123 | 4. b = 1; 124 | 5. } 125 | 6. 126 | 7. void bar(void) { 127 | 8. while (b == 0) continue; 128 | 9. assert(a == 1); 129 | 10. } 130 | ``` 131 | 132 | 133 | Suppose `a` is in the "shared" state and resides in the caches of both CPU 0 and CPU 1. Meanwhile, `b` is owned by CPU 0. 134 | 135 | When CPU 1 executes line 8 (`while (b == 0) continue;`), it doesn’t have `b`'s cache line and sends a read message to CPU 0. CPU 0 then executes line 2 (`a = 1`) and line 3 (`smp_mb()`) and waits for a response from CPU 1. CPU 1 receives CPU 0’s “invalidate” message for `a`, queues it, and immediately sends an acknowledgment (step 3 in section 4.3 of the paper). However, note that CPU 1 has only queued the "invalidate" message, meaning `a = 1` is not yet visible to CPU 1. 136 | 137 | Next, CPU 0 executes line 4 (`b = 1`). Since it owns `b`, it updates the cache line with `b = 1`. Then CPU 0 receives the read message for `b` from CPU 1 and responds with `b = 1` (which has already been applied to CPU 0’s cache line, so it's visible to CPU 1). After CPU 1 receives `b = 1`, it exits the while loop and executes line 9 (`assert(a == 1)`). 
However, since CPU 1 has not yet processed the "invalidate" message for `a`, `a` is still 0, and the assertion fails. 138 | 139 | 140 | The solution is to add an additional memory barrier between lines 8 and 9. The updated code is: 141 | 142 | ```c 143 | 1. void foo(void) { 144 | 2. a = 1; 145 | 3. smp_mb(); 146 | 4. b = 1; 147 | 5. } 148 | 6. 149 | 7. void bar(void) { 150 | 8. while (b == 0) continue; 151 | 9 smp_mb(); 152 | 10. assert(a == 1); 153 | 11. } 154 | ``` 155 | 156 | Now, after CPU 1 finishes line 8(`while (b == 0) continue`) and executes line 9(`smp_mb()`), the memory barrier ensures that CPU 1 must stall until it processes all preexisting messages in its invalidation queue. Therefore, by the time CPU 1 reaches line 10, `a = 1` has already been applied, and the assertion succeeds as expected. 157 | 158 | ## Summary 159 | 160 | This article explains the basic concepts from "Memory Barriers: A Hardware View for Software Hackers," focusing on the problems and solutions in memory consistency for systems with multiple CPU cores and caches. 161 | 162 | Multiple CPU cores and cache lines are introduced to improve parallel performance. Then, cache-coherence protocols are used to guarantee correctness. However, as these protocols are slow in certain situations, new buffers/queues are added. Finally, memory barriers are used to provide ordering. It is similar to a database transaction, with internal state, order, and the point of visibility for other CPU cores. 163 | 164 | It doesn't cover many details, such as the specifics of the cache-coherence protocols, but I believe that to understand a concept, it's important to grasp the big picture—what the problem is and how it is solved. 165 | -------------------------------------------------------------------------------- /images/memory-barrier-0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/images/memory-barrier-0.png -------------------------------------------------------------------------------- /images/memory-barrier-invalid-queue.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/images/memory-barrier-invalid-queue.png -------------------------------------------------------------------------------- /images/memory-barrier-store-buffer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/images/memory-barrier-store-buffer.png -------------------------------------------------------------------------------- /slides/Concurrency-task-schedule-brief-introduction@RubyConf-China-2020.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/Concurrency-task-schedule-brief-introduction@RubyConf-China-2020.key -------------------------------------------------------------------------------- /slides/Regression-Test-Selection-for-Rails-Project.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/Regression-Test-Selection-for-Rails-Project.key -------------------------------------------------------------------------------- 
/slides/cow-in-xv6.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/cow-in-xv6.key -------------------------------------------------------------------------------- /slides/draft/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/draft/.keep -------------------------------------------------------------------------------- /slides/erlang-message-passing.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/erlang-message-passing.key -------------------------------------------------------------------------------- /slides/sdb-a-new-ruby-stack-profiler@RubyConf-China-2024.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/sdb-a-new-ruby-stack-profiler@RubyConf-China-2024.key -------------------------------------------------------------------------------- /slides/sdb-rubykaigi-2025.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yfractal/blog/c6b6c8132db61ef1b595e89a0918e23c4276d3b2/slides/sdb-rubykaigi-2025.key --------------------------------------------------------------------------------