├── README.md ├── docs ├── bthread_basis.md ├── bthread_schedule.md ├── butex.md ├── client_bthread_sync.md ├── client_retry.md ├── client_rpc_exception.md ├── client_rpc_normal.md ├── futex.md ├── io_protobuf.md ├── io_read.md ├── io_write.md ├── linkedlist.md ├── resource_pool.md └── thread_model.md └── images ├── bthread_basis_1.png ├── bthread_basis_2.png ├── bthread_basis_3.png ├── bthread_basis_4.png ├── bthread_schedule_1.png ├── butex_1.png ├── client_bthread_sync_1.png ├── client_send_req_1.png ├── client_send_req_2.png ├── io_write_linklist_1.png ├── io_write_linklist_2.png ├── io_write_linklist_3.png ├── io_write_linklist_4.png └── linklist_1.png /README.md: -------------------------------------------------------------------------------- 1 | # 目录 2 | * brpc的M:N线程模型 3 | * [bthread基础](docs/bthread_basis.md) 4 | * [多核环境下pthread调度执行bthread的过程](docs/bthread_schedule.md) 5 | * [pthread线程间的Futex同步](docs/futex.md) 6 | * [Butex机制：bthread粒度的挂起与唤醒](docs/butex.md) 7 | * Client端执行流程 8 | * [无异常状态下的一次完整RPC请求过程](docs/client_rpc_normal.md) 9 | * [RPC请求可能遇到的多种异常及应对策略](docs/client_rpc_exception.md) 10 | * 重试&Backup Request 11 | * [同一RPC过程中各个bthread间的互斥](docs/client_bthread_sync.md) 12 | * Server端执行流程 13 | * 处理一次RPC请求的完整过程 14 | * 服务器自动限流 15 | * 防雪崩 16 | * 并发读写TCP连接上的数据 17 | * protobuf编程模式 18 | * [多线程向同一TCP连接写入数据](docs/io_write.md) 19 | * [从TCP连接读取数据的并发处理](docs/io_read.md) 20 | * 内存管理 21 | * [ResourcePool：多线程下高效的内存分配与回收](docs/resource_pool.md) 22 | * I/O读写缓冲区 23 | * brpc的实时监控 24 | * bvar库 25 | * 常用性能监控指标 26 | * 基础库 27 | * [侵入式双向链表](docs/linkedlist.md) 28 | * FlatMap哈希表 29 | * 多线程框架下的定时器 30 | -------------------------------------------------------------------------------- /docs/bthread_basis.md: -------------------------------------------------------------------------------- 1 | [线程执行协程与线程调用函数的不同](#线程执行协程与线程调用函数的不同) 2 | 3 | [协程的原理与实现方式](#协程的原理与实现方式) 4 | 5 | [系统线程执行多个协程时的内存布局变化过程](#系统线程执行多个协程时的内存布局变化过程) 6 | 7 | [brpc的bthread任务定义](#brpc的bthread任务定义) 8 | 9 | ## 线程执行协程与线程调用函数的不同 
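When a pthread (system thread) executes a function, it creates a stack frame for that function on the pthread's own stack; the function's parameters and local variables all live inside that frame. When the function finishes, its local variables are destroyed in reverse order, and then the stack frame itself is released. Suppose thread A starts executing the function foo below:

```c++
void bar(int m) {
    // Execution point 1
    // ...
}

void foo(int a, int b) {
    int c = a + b;
    bar(c);
    // Execution point 2
    // ...
}
```

When execution reaches point 1 (inside bar(), which was called from foo()), thread A's stack frames look like the figure below:
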

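
When thread A returns from bar() and reaches execution point 2, bar()'s parameter m is destroyed first and then bar()'s stack frame. When foo() returns, the local variable c is destroyed first, then the parameters b and a, and finally foo()'s stack frame.

In a call like this, where foo() calls bar(), foo() can only resume from the return point of bar() after bar() has returned.

A coroutine can be viewed as a standalone task with its own task function. The two most important coroutine operations are yield and resume (their implementation is described below). yield suspends a coroutine that is currently being executed by a pthread: the coroutine gives up the CPU and the pthread goes on to run the task function of another coroutine. From the coroutine's point of view it has stopped running; from the system thread's point of view the pthread keeps running. resume makes a pthread execute a suspended coroutine's task function again, continuing from the return point of its last yield.

With these two primitives, a pthread's flow of execution can jump between functions, as long as those functions are used as coroutine task functions. While a thread is executing taskFunc_A, the task function of coroutine A, it does not have to wait for taskFunc_A to reach its return statement before running taskFunc_B, the task function of coroutine B. taskFunc_A can execute a yield, the thread leaves taskFunc_A, and it starts executing taskFunc_B. To continue taskFunc_A later, a resume is issued and taskFunc_A carries on from the return point of its yield. The result computed by taskFunc_A is not affected by the yield.

## 协程的原理与实现方式
A coroutine has three components: a task function, a structure that stores register state, and a private stack (usually a block of memory obtained from malloc, or a block in static storage).

Coroutines are called user-level threads precisely because each one has a private stack. When a pthread calls an ordinary function, the function's stack frame, parameters and local variables are placed on the pthread's stack. When a pthread executes a coroutine's task function, the stack frame, parameters and local variables of that function are placed on the coroutine's private stack.

yield and resume can be implemented in several ways: with POSIX ucontext, with Boost's fcontext, or directly in assembly. The following uses ucontext to show how yield and resume can be built:

1. The POSIX ucontext data structure is defined as follows:

```c++
typedef struct ucontext
{
    unsigned long int uc_flags;
    // uc_link points to the ucontext of the successor coroutine. When the current
    // coroutine's task function returns, the register values cached in the successor's
    // ucontext are loaded into the CPU registers and the CPU executes the successor's
    // task function (either from the function entry point or from the return point
    // of its last yield).
    struct ucontext *uc_link;
    // Private stack of the coroutine.
    stack_t uc_stack;
    // uc_mcontext caches the current values of the CPU registers when the coroutine yields.
    mcontext_t uc_mcontext;
    __sigset_t uc_sigmask;
} ucontext_t;
```

2.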
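The ucontext API consists of four functions:

   - int getcontext(ucontext_t *ucp)

     Stores the current values of the CPU registers into the uc_mcontext member of the ucontext_t of coroutine A, the coroutine currently being executed. The important registers are the frame (stack base) pointer, the stack pointer, and the instruction pointer of the next instruction of coroutine A's task function.

   - int setcontext(const ucontext_t *ucp)

     Loads the register values cached in the uc_mcontext member of coroutine A's ucontext_t into the CPU registers. Through the frame pointer and stack pointer the CPU locates coroutine A's private stack, and through the instruction pointer it locates the execution point in coroutine A's task function (either the function entry point or the return point of a yield). The CPU then executes coroutine A's task function, and local variables created during execution are placed on coroutine A's private stack.

   - void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...)

     Sets func as the task function of a coroutine. argc is the number of integer arguments that follow and are passed to func. ucp must first be initialized with getcontext() and given a stack (uc_stack) and a successor (uc_link).

   - int swapcontext(ucontext_t *oucp, ucontext_t *ucp)

     Behaves like getcontext(oucp) followed by setcontext(ucp), performed in a single call: the current CPU register values are stored into the uc_mcontext of the ucontext_t pointed to by oucp, and the register values cached in the uc_mcontext of the ucontext_t pointed to by ucp are loaded into the CPU. The purpose is to make the current coroutine yield and let the coroutine that owns ucp start or resume.

The memory layout of a coroutine implemented with ucontext is shown in the figure below:
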

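The four calls above are enough to build explicit yield and resume operations. The following is a minimal sketch, assuming Linux with glibc's `<ucontext.h>`. The `Coroutine` struct and the helpers `co_create`, `co_resume` and `co_yield_now` are hypothetical names introduced only for illustration; they are not part of POSIX or brpc.

```c++
#include <ucontext.h>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical illustration type: a coroutine is saved registers plus a private stack.
struct Coroutine {
    ucontext_t ctx;     // registers of the coroutine while it is suspended
    ucontext_t caller;  // registers of the code that last resumed it
    char* stack;        // private stack, allocated with malloc
};

static Coroutine* g_current = nullptr;  // coroutine currently running, if any
static std::string g_trace;             // records the order of execution

// Prepare a coroutine whose task function is fn.
static void co_create(Coroutine* co, void (*fn)()) {
    const std::size_t kStackSize = 64 * 1024;
    co->stack = static_cast<char*>(std::malloc(kStackSize));
    assert(co->stack != nullptr);
    getcontext(&co->ctx);
    co->ctx.uc_stack.ss_sp = co->stack;
    co->ctx.uc_stack.ss_size = kStackSize;
    co->ctx.uc_link = &co->caller;  // context to load when fn returns
    makecontext(&co->ctx, fn, 0);
}

// resume: save the caller's registers, then load the coroutine's registers.
static void co_resume(Coroutine* co) {
    g_current = co;
    swapcontext(&co->caller, &co->ctx);
}

// yield: save the coroutine's registers, then load the caller's registers.
static void co_yield_now() {
    Coroutine* co = g_current;
    swapcontext(&co->ctx, &co->caller);
}

// Task function: runs in two pieces, separated by a yield.
static void task() {
    g_trace += 'A';
    co_yield_now();
    g_trace += 'C';
}

// Drives the coroutine and returns the recorded execution order.
std::string run_demo() {
    g_trace.clear();
    Coroutine co;
    co_create(&co, task);
    co_resume(&co);   // task runs until its yield
    g_trace += 'B';
    co_resume(&co);   // task continues after the yield and returns
    g_trace += 'D';
    std::free(co.stack);
    return g_trace;
}
```

Storing the caller's context inside the coroutine keeps the sketch short: every resume overwrites `caller`, so both a yield and a return from the task function go back to whoever resumed the coroutine last.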

## 系统线程执行多个协程时的内存布局变化过程
The following example program shows how memory changes while one pthread executes several coroutines:

```c++
#include <ucontext.h>

static ucontext_t ctx[3];

static void func_1(void) {
    int a;
    // Execution point 2: coroutine 1 yields here and the pthread goes back to
    // coroutine 2's task function, i.e. coroutine 2 is resumed.
    swapcontext(&ctx[1], &ctx[2]);
    // Execution point 4: coroutine 1 resumes here.
    // After func_1 returns, main is resumed because ctx[1].uc_link = &ctx[0].
}

static void func_2(void) {
    int b;
    // Coroutine 2 yields here and the pthread starts executing func_1,
    // the task function of coroutine 1.
    swapcontext(&ctx[2], &ctx[1]);
    // Execution point 3: coroutine 2 resumes here.
    // After func_2 returns, coroutine 1 is resumed because ctx[2].uc_link = &ctx[1].
}

int main(int argc, char **argv) {
    // Private stacks of coroutine 1 and coroutine 2.
    // In this program both coroutines finish before main returns, so it is safe
    // to define their stacks as local variables of main.
    char stack_1[1024] = { 0 };
    char stack_2[1024] = { 0 };

    // Initialize ctx[1], the ucontext_t of coroutine 1.
    getcontext(&ctx[1]);
    // Use stack_1 as the private stack of coroutine 1.
    ctx[1].uc_stack.ss_sp = stack_1;
    ctx[1].uc_stack.ss_size = sizeof(stack_1);
    // ctx[0] stores the CPU register values of the thread that runs main.
    // With the assignment below, when coroutine 1's task function returns, the register
    // values stored in ctx[0] are loaded into the CPU, i.e. the pthread continues
    // from the return point of main's earlier yield.
    ctx[1].uc_link = &ctx[0];
    // Set func_1 as the task function of coroutine 1.
    makecontext(&ctx[1], func_1, 0);

    // Initialize ctx[2], the ucontext_t of coroutine 2.
    getcontext(&ctx[2]);
    // Use stack_2 as the private stack of coroutine 2.
    ctx[2].uc_stack.ss_sp = stack_2;
    ctx[2].uc_stack.ss_size = sizeof(stack_2);
    // When coroutine 2's task function returns, the pthread continues from the
    // return point of coroutine 1's yield.
    ctx[2].uc_link = &ctx[1];
    // Set func_2 as the task function of coroutine 2.
    makecontext(&ctx[2], func_2, 0);

    // Execution point 1: store the current CPU register values into ctx[0] and load the
    // register values stored in ctx[2]. In other words, main yields here and the
    // pthread starts executing func_2, the task function of coroutine 2.
    swapcontext(&ctx[0], &ctx[2]);
    // Execution point 5: main resumes here.
    return 0;
}
```

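In the program above, when the pthread reaches execution point 1 in main, the memory layout is as shown in the figure below. The private stacks of coroutine 1 and coroutine 2 are local variables of main, so both are located on the stack of the pthread that runs main. swapcontext(&ctx[0], &ctx[2]) has not been executed yet at this point, so ctx[0] is still empty:
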

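
After the pthread executes swapcontext(&ctx[0], &ctx[2]) in main, main (which can itself be regarded as a coroutine) yields and the pthread starts executing func_2, the task function of coroutine 2. After swapcontext(&ctx[2], &ctx[1]) is executed in func_2, coroutine 2 yields and the pthread starts executing func_1, the task function of coroutine 1. When the pthread reaches execution point 2 in func_1, the memory layout is as shown in the figure below. Both main and coroutine 2 are suspended at this point:

- ctx[0] holds the base and top addresses of the pthread's stack, and the address of the code at execution point 5 in main.
- ctx[2] holds the base and top addresses of stack_2, and the address of the code at execution point 3 in func_2.
- Coroutine 1 is currently running and has not been suspended, so ctx[1] is unchanged.
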

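
After the pthread executes swapcontext(&ctx[1], &ctx[2]) in func_1, coroutine 1 is suspended: ctx[1] now holds the base and top addresses of stack_1 and the address of the code at execution point 4 in func_1. The pthread moves on to the next instruction of coroutine 2's task function, that is, coroutine 2 is resumed at execution point 3 in func_2. func_2 then returns. Because ctx[2].uc_link = &ctx[1], the pthread switches back to coroutine 1, which is resumed at execution point 4 in func_1. func_1 then returns as well. Because ctx[1].uc_link = &ctx[0], the pthread continues with main, which is resumed at execution point 5. At this point coroutine 1 and coroutine 2 have both finished and main is about to return. This whole process can be described as main, coroutine 1 and coroutine 2 each being scheduled on a single pthread.

## brpc的bthread任务定义
The example program above can be regarded as an implementation of N:1 user-level threads: all coroutines are scheduled on one system thread (pthread). One problem of N:1 coroutines is that if a single coroutine's task function performs blocking network I/O or waits for a mutex, the whole pthread is suspended and none of the other coroutines can run.

brpc extends the N:1 model to M:N user-level threads: N pthreads schedule and execute M coroutines, with M much larger than N. Each pthread has a private task queue that stores the coroutines waiting to run. When a pthread has finished all coroutines in its own queue, it can take coroutine tasks from the queues of other pthreads (work stealing). As a result, when one coroutine performs a time-consuming operation, the other coroutines in the same queue can be scheduled on other pthreads, which maximizes overall concurrency.

brpc also implements coroutine-level mutual exclusion and wakeup, the Butex mechanism. With Butex, a coroutine that waits for network I/O or for a mutex automatically yields the CPU and is later woken by another coroutine at the appropriate time. For details on Butex see [this article](butex.md).

In brpc a coroutine task is called a bthread. In memory a bthread is represented by a TaskMeta object, which is allocated from a ResourcePool. The main members of the TaskMeta class are:

- fn & arg: the task function of the bthread and its argument, as set by the application.

- ContextualStack* stack: ContextualStack is defined as:

  ```c++
  struct ContextualStack {
      // Caches the CPU register context; corresponds to the POSIX ucontext structure.
      bthread_fcontext_t context;
      StackType stacktype;
      // Private stack of the bthread.
      StackStorage storage;
  };
  ```

- local_storage: a block of memory that records runtime state of the bthread (for example various statistics). Not to be confused with ContextualStack::storage.

- version_butex: a pointer to the head of a Butex object.

--------------------------------------------------------------------------------
/docs/bthread_schedule.md:
--------------------------------------------------------------------------------
[Main data structures for scheduling bthreads](#main-data-structures-for-scheduling-bthreads)

[How a pthread schedules the bthreads in the task queue of its private TaskGroup](#一个pthread调度执行私有TaskGroup的任务队列中各个bthread的过程)

## Main data structures for scheduling bthreads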
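A production system creates a large number of bthreads, while the number of CPU cores is limited. Getting this many bthreads fully scheduled on a limited number of cores, so that overall concurrency is maximized, is mainly the job of the TaskGroup and TaskControl objects.

1. Each TaskGroup object is private to one pthread. It contains the task queues and controls how the pthread executes the bthread tasks in them. The main members of TaskGroup are:

   - _remote_rq: if pthread 1 wants pthread 2 to execute bthread 1, pthread 1 pushes the tid of bthread 1 onto the _remote_rq queue of pthread 2's TaskGroup.

   - _rq: while pthread 1 is executing bthread 1, taken from its own TaskGroup, bthread 1 may create a new bthread 2. In that case bthread 1 pushes the tid of bthread 2 onto the _rq queue of pthread 1's TaskGroup.

   - _main_tid & _main_stack: a pthread runs a while() loop in TaskGroup::run_main_task(), repeatedly fetching and executing bthread tasks. The pthread's flow of execution is not always inside a bthread: while it waits for a task, for example, it executes no bthread and the flow of execution is directly on the pthread. The phase "wait for a bthread, obtain a bthread, up to the point just before entering the bthread's task function" can itself be abstracted as a bthread, called the "scheduling bthread" or "main bthread" of the pthread. Its tid and private stack are _main_tid and _main_stack.

   - _cur_meta: the address of the TaskMeta object of the bthread that is currently running.

2. The TaskControl object is a global singleton. Its main members are:

   - _pl: an array of ParkingLot objects. A ParkingLot is used for wait/notify on bthread tasks.

   - _workers: an array of pthread identifiers. It records how many pthread worker threads have been created; each worker owns one thread-private TaskGroup object.

   - _groups: an array of pointers to TaskGroup objects.

The memory relationship between TaskControl and TaskGroup is shown in the figure below:
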

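The relationship between the per-worker queues and work stealing can be sketched as follows. This is a deliberately simplified, single-threaded model: `ToyTaskGroup`, `ToyTaskControl` and `next_task` are hypothetical names, a plain `std::deque` stands in for brpc's concurrent queues, and the lookup order is an assumption made for the example. It is not brpc's actual implementation.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

typedef uint64_t ToyTid;  // stands in for a bthread id

// Hypothetical, simplified counterpart of a per-worker TaskGroup.
struct ToyTaskGroup {
    std::deque<ToyTid> rq;         // bthreads created by bthreads running on this worker
    std::deque<ToyTid> remote_rq;  // bthreads handed over by other threads
};

// Hypothetical, simplified counterpart of the global TaskControl.
struct ToyTaskControl {
    std::vector<ToyTaskGroup> groups;  // one group per worker

    explicit ToyTaskControl(std::size_t nworkers) : groups(nworkers) {}

    // Pop the oldest id from q into *tid; returns false if q is empty.
    static bool pop_front(std::deque<ToyTid>& q, ToyTid* tid) {
        if (q.empty()) {
            return false;
        }
        *tid = q.front();
        q.pop_front();
        return true;
    }

    // Find the next bthread for worker `self`: first its own queues,
    // then the queues of every other worker (work stealing).
    bool next_task(std::size_t self, ToyTid* tid) {
        assert(self < groups.size());
        ToyTaskGroup& own = groups[self];
        if (pop_front(own.rq, tid) || pop_front(own.remote_rq, tid)) {
            return true;
        }
        for (std::size_t i = 0; i < groups.size(); ++i) {
            if (i == self) {
                continue;
            }
            if (pop_front(groups[i].rq, tid) ||
                pop_front(groups[i].remote_rq, tid)) {
                return true;
            }
        }
        return false;  // nothing to run anywhere: a real worker would wait here
    }
};
```

With two workers, where worker 0 holds bthreads 1 and 2 and worker 1 holds nothing, a call to `next_task(1, &tid)` returns bthread 1 taken from worker 0's queue, so an idle worker does not stay idle while another worker still has queued tasks.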

## 一个pthread调度执行私有TaskGroup的任务队列中各个bthread的过程
A pthread executes the bthreads in the task queue of its private TaskGroup one after another, so these bthreads run serially on that pthread and do not compete with each other. During its execution a bthread can be in one of three situations:

1. The task function of the bthread finishes. The bthread is then responsible for checking whether the TaskGroup's task queue contains more bthreads. If it does, the pthread's flow of execution goes directly into the task function of the next bthread. If it does not, the flow of execution returns to the pthread's scheduling bthread, which waits for other pthreads to hand over new bthreads.

2. The bthread yields in the middle of its task function. The pthread then executes the next bthread in the task queue; if the queue is empty, the flow of execution returns to the pthread's scheduling bthread, which waits for other pthreads to hand over new bthreads. When the suspended bthread runs again depends on the business scenario: it has to be woken by some other bthread, independently of the pthread's scheduling. Examples are a bthread that writes to a TCP connection and yields while waiting for the inode output buffer to become writable, and a bthread that yields while waiting for a Butex mutex.

3. The bthread creates a new bthread in the middle of its task function. Because the new bthread usually has higher priority, the pthread's flow of execution immediately enters the task function of the new bthread, and the original bthread is appended to the tail of the task queue, so it will be executed again soon. Because of work stealing it does not necessarily continue on the original pthread; it may be stolen and executed by another pthread.

With these principles in mind, the implementation in brpc works as follows.

1. When TaskControl creates a pthread worker thread and its private TaskGroup object, the pthread enters an endless loop in TaskGroup::run_main_task:

```c++
void TaskGroup::run_main_task() {
    bvar::PassiveStatus<double> cumulated_cputime(
        get_cumulated_cputime_from_this, this);
    std::unique_ptr<bvar::PerSecond<bvar::PassiveStatus<double> > > usage_bvar;

    TaskGroup* dummy = this;
    bthread_t tid;
    // Wait for a runnable bthread. Its id may have been pushed onto this TaskGroup's
    // queue by another pthread, or it may be stolen from the TaskGroup of another
    // pthread worker.
    while (wait_task(&tid)) {
        // Got a bthread: the flow of execution enters its task function.
        TaskGroup::sched_to(&dummy, tid);
        // run_main_task() resumes here.
        DCHECK_EQ(this, dummy);
        DCHECK_EQ(_cur_meta->stack, _main_stack);
        // Open question: it is not yet clear in which situation the code below runs.
        if (_cur_meta->tid != _main_tid) {
            TaskGroup::task_runner(1/*skip remained*/);
        }
        if (FLAGS_show_per_worker_usage_in_vars && !usage_bvar) {
            char name[32];
#if defined(OS_MACOSX)
            snprintf(name, sizeof(name), "bthread_worker_usage_%" PRIu64,
                     pthread_numeric_id());
#else
            snprintf(name, sizeof(name), "bthread_worker_usage_%ld",
                     (long)syscall(SYS_gettid));
#endif
            usage_bvar.reset(new bvar::PerSecond<bvar::PassiveStatus<double> >
                             (name, &cumulated_cputime, 1));
        }
    }
    // stop_main_task() was called.
    // Don't forget to add elapse of last wait_task.
    current_task()->stat.cputime_ns += butil::cpuwide_time_ns() - _last_run_ns;
}
```

2. wait_task() waits for a bthread. If no bthread is currently runnable, the pthread is suspended.

3. TaskGroup::sched_to(TaskGroup** pg, bthread_t next_tid) uses the tid of the bthread that is about to run to locate the address of its TaskMeta object in O(1) time (TaskMeta objects are allocated from a ResourcePool; see [this article](resource_pool.md)). It makes sure that the bthread's private stack has been created and its context structure has been allocated, and then calls TaskGroup::sched_to(TaskGroup** pg, TaskMeta* next_meta):

```c++
void TaskGroup::sched_to(TaskGroup** pg, TaskMeta* next_meta) {
    TaskGroup* g = *pg;
#ifndef NDEBUG
    if ((++g->_sched_recursive_guard) > 1) {
        LOG(FATAL) << "Recursively(" << g->_sched_recursive_guard - 1
                   << ") call sched_to(" << g << ")";
    }
#endif
    // Save errno so that errno is bthread-specific.
    const int saved_errno = errno;
    void* saved_unique_user_ptr = tls_unique_user_ptr;

    // Address of the TaskMeta object of the bthread that is currently running.
    TaskMeta* const cur_meta = g->_cur_meta;
    const int64_t now = butil::cpuwide_time_ns();
    const int64_t elp_ns = now - g->_last_run_ns;
    g->_last_run_ns = now;
    cur_meta->stat.cputime_ns += elp_ns;
    if (cur_meta->tid != g->main_tid()) {
        // Reached when the task being switched out is an ordinary bthread rather than
        // the worker's scheduling bthread, e.g. a bthread that created a new bthread.
        g->_cumulated_cputime_ns += elp_ns;
    }
    // Increment the switch count of the current bthread.
    ++cur_meta->stat.nswitch;
    // Increment the bthread switch count of the worker pthread.
    ++ g->_nswitch;
    // Switch to the task
    if (__builtin_expect(next_meta != cur_meta, 1)) {
        // Point _cur_meta at the TaskMeta object of the bthread that runs next.
        g->_cur_meta = next_meta;
        // Switch tls_bls
        // tls_bls holds runtime data of the current bthread (statistics etc.). Before
        // switching, save tls_bls into the local_storage of the current bthread, then
        // load tls_bls from the local_storage of the bthread that runs next.
        cur_meta->local_storage = tls_bls;
        tls_bls = next_meta->local_storage;

        // Logging must be done after switching the local storage, since the logging lib
        // use bthread local storage internally, or will cause memory leak.
        if ((cur_meta->attr.flags & BTHREAD_LOG_CONTEXT_SWITCH) ||
            (next_meta->attr.flags & BTHREAD_LOG_CONTEXT_SWITCH)) {
            LOG(INFO) << "Switch bthread: " << cur_meta->tid << " -> "
                      << next_meta->tid;
        }

        if (cur_meta->stack != NULL) {
            if (next_meta->stack != cur_meta->stack) {
                // The actual bthread switch happens here: the current state of the CPU
                // registers is stored into cur_meta's context, the data in next_meta's
                // context is loaded into the registers, and next_meta's task function
                // starts (or continues) to run.
                jump_stack(cur_meta->stack, next_meta->stack);
                // The bthread represented by cur_meta resumes here. By then it may have
                // been stolen by another pthread, so the TaskGroup pointer g is reloaded.
                // probably went to another group, need to assign g again.
                g = tls_task_group;
            }
#ifndef NDEBUG
            else {
                // else pthread_task is switching to another pthread_task, sc
                // can only equal when they're both _main_stack
                CHECK(cur_meta->stack == g->_main_stack);
            }
#endif
        }
        // else because of ending_sched(including pthread_task->pthread_task)
    } else {
        LOG(FATAL) << "bthread=" << g->current_tid() << " sched_to itself!";
    }

    while (g->_last_context_remained) {
        RemainedFn fn = g->_last_context_remained;
        g->_last_context_remained = NULL;
        fn(g->_last_context_remained_arg);
        g = tls_task_group;
    }

    // Restore errno
    errno = saved_errno;
    tls_unique_user_ptr = saved_unique_user_ptr;

#ifndef NDEBUG
    --g->_sched_recursive_guard;
#endif
    *pg = g;
}
```

4.
一个bthread被执行时，pthread将执行TaskGroup::task_runner()，在这个函数中会去执行TaskMeta对象的fn()，即应用程序设置的bthread任务函数。task_runner()的关键代码如下： 174 | 175 | ```c++ 176 | void TaskGroup::task_runner() { 177 | TaskMeta* const m = g->_cur_meta; 178 | // 执行应用程序设置的任务函数，在任务函数中可能yield让出cpu，也可能产生新的bthread。 179 | m->fn(m->arg); 180 | // 任务函数执行完成后，需要唤起等待该任务函数执行结束的pthread/bthread。 181 | butex_wake_except(m->version_butex, 0); 182 | // 将pthread线程执行流转入下一个可执行的bthread（普通bthread或pthread的调度bthread）。 183 | ending_sched(&g); 184 | } 185 | ``` 186 | 187 | bthread任务函数结束完后会调用ending_sched()，在ending_sched()内会尝试从本地TaskGroup的任务队列中找出下一个bthread，或者从其他pthread的TaskGroup上steal一个bthread，如果没有bthread可用则下一个被执行的就是pthread的“调度bthread”，通过sched_to()将pthread的执行流转入下一个bthread的任务函数。 188 | 189 | 5. 一个bthread在自己的任务函数执行过程中想要挂起时，调用TaskGroup::yield(TaskGroup** pg)，yield()内部会调用TaskGroup::sched(TaskGroup** pg)，sched()也是负责将pthread的执行流转入下一个bthread（普通bthread或调度bthread）的任务函数。挂起的bthread在适当的时候会被其他bthread唤醒，即某个bthread会负责将挂起的bthread的tid重新加入TaskGroup的任务队列。 190 | 191 | 6. 一个bthread 1在自己的任务函数执行过程中需要创建新的bthread 2时，会调用TaskGroup::start_foreground()，在start_foreground()内完成bthread 2的TaskMeta对象的创建，并调用sched_to()让pthread去执行bthread 2的任务函数。pthread在真正执行bthread 2的任务函数前会将bthread 1的tid重新压入TaskGroup的任务队列，bthread 1不久之后会再次被调度执行。 192 | -------------------------------------------------------------------------------- /docs/butex.md: -------------------------------------------------------------------------------- 1 | [bthread粒度挂起与唤醒的设计原理](#bthread粒度挂起与唤醒的设计原理) 2 | 3 | [brpc中Butex的源码解释](#brpc中Butex的源码解释) 4 | 5 | ## bthread粒度挂起与唤醒的设计原理 6 | 由于bthread任务是在pthread系统线程中执行，在需要bthread间互斥的场景下不能使用pthread级别的锁（如pthread_mutex_lock或者C++的unique_lock等），否则pthread会被挂起，不仅当前的bthread中止执行，pthread私有的TaskGroup的任务队列中其他bthread也无法在该pthread上调度执行。因此需要在应用层实现bthread粒度的互斥机制，一个bthread被挂起时，pthread仍然要保持运行状态，保证TaskGroup任务队列中的其他bthread的正常执行不受影响。 7 | 8 | 要实现bthread粒度的互斥，方案如下： 9 | 10 | 1. 在同一个pthread上执行的多个bthread是串行执行的，不需要考虑互斥； 11 | 12 | 2. 
如果位于heap内存上或static静态区上的一个对象A可能会被在不同pthread执行的多个bthread同时访问，则为对象A维护一个互斥锁（一般是一个原子变量）和等待队列，同时访问对象A的多个bthread首先要竞争锁，假设三个bthread 1、2、3分别在pthread 1、2、3上执行，bthread 1、bthread 2、bthread 3同时访问heap内存上的一个对象A，这时就产生了竞态，假设bthread 1获取到锁，可以去访问对象A，bthread 2、bthread 3先将自身必要的信息（bthread的tid等）存入等待队列，然后自动yiled，让出cpu，让pthread 2、pthread 3继续去执行各自私有TaskGroup的任务队列中的下一个bthread，这就实现了bthread粒度的挂起； 13 | 14 | 3. bthread 1访问完对象A后，通过查询对象A的互斥锁的等待队列，能够得知bthread 2、bthread 3因等待锁而被挂起，bthread 1负责将bthread 2、3的tid重新压入某个pthread（不一定是之前执行执行bthread 2、3的pthread 2、3）的TaskGroup的任务队列，bthread 2、3就能够再次被pthread执行，这就实现了bthread粒度的唤醒。 15 | 16 | 下面分析下brpc是如何实现bthread粒度的挂起与唤醒的。 17 | 18 | ## brpc中Butex的源码解释 19 | brpc实现bthread互斥的主要代码在src/bthread/butex.cpp中： 20 | 21 | 1. 首先解释下Butex、ButexBthreadWaiter等主要的数据结构： 22 | 23 | ```c++ 24 | struct BAIDU_CACHELINE_ALIGNMENT Butex { 25 | Butex() {} 26 | ~Butex() {} 27 | 28 | // 锁变量的值。 29 | butil::atomic value; 30 | // 等待队列，存储等待互斥锁的各个bthread的信息。 31 | ButexWaiterList waiters; 32 | internal::FastPthreadMutex waiter_lock; 33 | }; 34 | ``` 35 | 36 | ```c++ 37 | // 等待队列实际上是个侵入式双向链表，增减元素的操作都可在O(1)时间内完成。 38 | typedef butil::LinkedList ButexWaiterList; 39 | ``` 40 | 41 | ```c++ 42 | // ButexWaiter是LinkNode的子类，LinkNode里只定义了指向前后节点的指针。 43 | struct ButexWaiter : public butil::LinkNode { 44 | // tids of pthreads are 0 45 | // tid就是64位的bthread id。 46 | // Butex实现了bthread间的挂起&唤醒，也实现了bthread和pthread间的挂起&唤醒， 47 | // 一个pthread在需要的时候可以挂起，等待适当的时候被一个bthread唤醒，线程挂起不需要tid，填0即可。 48 | // pthread被bthread唤醒的例子可参考brpc的example目录下的一些client.cpp示例程序，执行main函数的pthread 49 | // 会被挂起，某个bthread执行完自己的任务后会去唤醒pthread。 50 | bthread_t tid; 51 | 52 | // Erasing node from middle of LinkedList is thread-unsafe, we need 53 | // to hold its container's lock. 
54 | butil::atomic container; 55 | }; 56 | ``` 57 | 58 | ```c++ 59 | // bthread需要挂起时，会在栈上创建一个ButexBthreadWaiter对象（对象存储在bthread的私有栈空间内）并加入等待队列。 60 | struct ButexBthreadWaiter : public ButexWaiter { 61 | // 执行bthread的TaskMeta结构的指针。 62 | TaskMeta* task_meta; 63 | TimerThread::TaskId sleep_id; 64 | // 状态标记，根据锁变量当前状态是否发生改变，waiter_state会被设为不同的值。 65 | WaiterState waiter_state; 66 | // expected_value存储的是当bthread竞争互斥锁失败时锁变量的值，由于从bthread竞争互斥锁失败到bthread挂起 67 | // 有一定的时间间隔，在这个时间间隔内锁变量的值可能会发生变化，也许锁已经被释放了，那么之前竞争锁失败的bthread 68 | // 就不应挂起，否则可能永远不会被唤醒了，它应该放弃挂起动作，再去竞争互斥锁。所以一个bthread在执行挂起动作前 69 | // 一定要再次去查看锁变量的当前最新值，只有锁变量当前最新值等于expected_value时才能真正执行挂起动作。 70 | int expected_value; 71 | Butex* initial_butex; 72 | // 指向全局唯一的TaskControl单例对象的指针。 73 | TaskControl* control; 74 | }; 75 | ``` 76 | 77 | ```c++ 78 | // 如果是pthread挂起，则创建ButexPthreadWaiter对象并加入等待队列。 79 | struct ButexPthreadWaiter : public ButexWaiter { 80 | butil::atomic sig; 81 | }; 82 | ``` 83 | 84 | 以上面的bthread 1获得了互斥锁、bthread 2和bthread 3因等待互斥锁而被挂起的场景为例，Butex的内存布局如下图所示，展现了主要的对象间的内存关系，注意ButexBthreadWaiter变量是分配在bthread的私有栈上的： 85 | 86 |

87 | 88 | 2. 执行bthread挂起的函数是butex_wait： 89 | 90 | ```c++ 91 | // arg是指向Butex::value锁变量的指针，expected_value是bthread竞争锁失败时锁变量的值。 92 | int butex_wait(void* arg, int expected_value, const timespec* abstime) { 93 | // 通过arg定位到Butex对象的地址。 94 | Butex* b = container_of(static_cast*>(arg), Butex, value); 95 | // 如果锁变量当前最新值不等于expected_value，则锁的状态发生了变化，当前bthread不再执行挂起动作， 96 | // 直接返回，在外层代码中继续去竞争锁。 97 | if (b->value.load(butil::memory_order_relaxed) != expected_value) { 98 | errno = EWOULDBLOCK; 99 | // Sometimes we may take actions immediately after unmatched butex, 100 | // this fence makes sure that we see changes before changing butex. 101 | butil::atomic_thread_fence(butil::memory_order_acquire); 102 | return -1; 103 | } 104 | TaskGroup* g = tls_task_group; 105 | if (NULL == g || g->is_current_pthread_task()) { 106 | // 当前代码不在bthread中执行而是在直接在pthread上执行，调用butex_wait_from_pthread让pthread挂起。 107 | return butex_wait_from_pthread(g, b, expected_value, abstime); 108 | } 109 | // 创建ButexBthreadWaiter类型的局部变量bbw，bbw是分配在bthread的私有栈空间上的。 110 | ButexBthreadWaiter bbw; 111 | // tid is 0 iff the thread is non-bthread 112 | bbw.tid = g->current_tid(); 113 | bbw.container.store(NULL, butil::memory_order_relaxed); 114 | bbw.task_meta = g->current_task(); 115 | bbw.sleep_id = 0; 116 | bbw.waiter_state = WAITER_STATE_READY; 117 | bbw.expected_value = expected_value; 118 | bbw.initial_butex = b; 119 | bbw.control = g->control(); 120 | 121 | if (abstime != NULL) { 122 | // Schedule timer before queueing. If the timer is triggered before 123 | // queueing, cancel queueing. This is a kind of optimistic locking. 124 | if (butil::timespec_to_microseconds(*abstime) < 125 | (butil::gettimeofday_us() + MIN_SLEEP_US)) { 126 | // Already timed out. 127 | errno = ETIMEDOUT; 128 | return -1; 129 | } 130 | bbw.sleep_id = get_global_timer_thread()->schedule( 131 | erase_from_butex_and_wakeup, &bbw, *abstime); 132 | if (!bbw.sleep_id) { // TimerThread stopped. 
133 | errno = ESTOP; 134 | return -1; 135 | } 136 | } 137 | #ifdef SHOW_BTHREAD_BUTEX_WAITER_COUNT_IN_VARS 138 | bvar::Adder& num_waiters = butex_waiter_count(); 139 | num_waiters << 1; 140 | #endif 141 | 142 | // release fence matches with acquire fence in interrupt_and_consume_waiters 143 | // in task_group.cpp to guarantee visibility of `interrupted'. 144 | bbw.task_meta->current_waiter.store(&bbw, butil::memory_order_release); 145 | // pthread在执行任务队列中下一个bthread前，会先执行wait_for_butex()将刚创建的bbw对象放入锁的等待队列。 146 | g->set_remained(wait_for_butex, &bbw); 147 | // 当前bthread yield让出cpu，pthread会从TaskGroup的任务队列中取出下一个bthread去执行。 148 | TaskGroup::sched(&g); 149 | 150 | // 这里是butex_wait()恢复执行时的开始执行点。 151 | 152 | // erase_from_butex_and_wakeup (called by TimerThread) is possibly still 153 | // running and using bbw. The chance is small, just spin until it's done. 154 | BT_LOOP_WHEN(unsleep_if_necessary(&bbw, get_global_timer_thread()) < 0, 155 | 30/*nops before sched_yield*/); 156 | 157 | // If current_waiter is NULL, TaskGroup::interrupt() is running and using bbw. 158 | // Spin until current_waiter != NULL. 159 | BT_LOOP_WHEN(bbw.task_meta->current_waiter.exchange( 160 | NULL, butil::memory_order_acquire) == NULL, 161 | 30/*nops before sched_yield*/); 162 | #ifdef SHOW_BTHREAD_BUTEX_WAITER_COUNT_IN_VARS 163 | num_waiters << -1; 164 | #endif 165 | 166 | bool is_interrupted = false; 167 | if (bbw.task_meta->interrupted) { 168 | // Race with set and may consume multiple interruptions, which are OK. 169 | bbw.task_meta->interrupted = false; 170 | is_interrupted = true; 171 | } 172 | // If timed out as well as value unmatched, return ETIMEDOUT. 
173 | if (WAITER_STATE_TIMEDOUT == bbw.waiter_state) { 174 | errno = ETIMEDOUT; 175 | return -1; 176 | } else if (WAITER_STATE_UNMATCHEDVALUE == bbw.waiter_state) { 177 | errno = EWOULDBLOCK; 178 | return -1; 179 | } else if (is_interrupted) { 180 | errno = EINTR; 181 | return -1; 182 | } 183 | return 0; 184 | } 185 | ``` 186 | 187 | ```c++ 188 | static void wait_for_butex(void* arg) { 189 | ButexBthreadWaiter* const bw = static_cast(arg); 190 | Butex* const b = bw->initial_butex; 191 | { 192 | BAIDU_SCOPED_LOCK(b->waiter_lock); 193 | // 再次判断锁变量的当前最新值是否与expected_value相等。 194 | if (b->value.load(butil::memory_order_relaxed) != bw->expected_value) { 195 | // 锁变量的状态发生了变化，bw代表的bthread不能挂起，要去重新竞争锁。 196 | // 但bw代表的bthread之前已经yield让出cpu了，所以下面要将bw代表的bthread的id再次放入TaskGroup的 197 | // 任务队列，让它恢复执行。 198 | // 将bw的waiter_state更改为WAITER_STATE_UNMATCHEDVALUE，表示锁的状态已发生改变。 199 | bw->waiter_state = WAITER_STATE_UNMATCHEDVALUE; 200 | } else if (bw->waiter_state == WAITER_STATE_READY/*1*/ && 201 | !bw->task_meta->interrupted) { 202 | // 将bw加入到锁的等待队列，这才真正完成bthread的挂起，然后直接返回。 203 | b->waiters.Append(bw); 204 | bw->container.store(b, butil::memory_order_relaxed); 205 | return; 206 | } 207 | } 208 | 209 | // 锁状态发生变化的情况下，才执行后面的代码。 210 | 211 | // b->container is NULL which makes erase_from_butex_and_wakeup() and 212 | // TaskGroup::interrupt() no-op, there's no race between following code and 213 | // the two functions. The on-stack ButexBthreadWaiter is safe to use and 214 | // bw->waiter_state will not change again. 215 | unsleep_if_necessary(bw, get_global_timer_thread()); 216 | // 将bw代表的bthread的tid重新加入TaskGroup的任务队列。 217 | tls_task_group->ready_to_run(bw->tid); 218 | // FIXME: jump back to original thread is buggy. 219 | 220 | // // Value unmatched or waiter is already woken up by TimerThread, jump 221 | // // back to original bthread. 
222 | // TaskGroup* g = tls_task_group; 223 | // ReadyToRunArgs args = { g->current_tid(), false }; 224 | // g->set_remained(TaskGroup::ready_to_run_in_worker, &args); 225 | // // 2: Don't run remained because we're already in a remained function 226 | // // otherwise stack may overflow. 227 | // TaskGroup::sched_to(&g, bw->tid, false/*2*/); 228 | } 229 | ``` 230 | 231 | 3. 执行bthread唤醒的函数有butex_wake（唤醒正在等待一个互斥锁的一个bthread）、butex_wake_all（唤醒正在等待一个互斥锁的所有bthread）、butex_wake_except（唤醒正在等待一个互斥锁的除了指定bthread外的其他bthread），下面解释butex_wake的源码： 232 | 233 | ```c++ 234 | // arg是Butex::value锁变量的地址。 235 | int butex_wake(void* arg) { 236 | // 通过arg定位到Butex对象的地址。 237 | Butex* b = container_of(static_cast*>(arg), Butex, value); 238 | ButexWaiter* front = NULL; 239 | { 240 | BAIDU_SCOPED_LOCK(b->waiter_lock); 241 | // 如果锁的等待队列为空，直接返回。 242 | if (b->waiters.empty()) { 243 | return 0; 244 | } 245 | // 取出锁的等待队列中第一个ButexWaiter对象的指针，并将该ButexWaiter对象从等待队列中移除。 246 | front = b->waiters.head()->value(); 247 | front->RemoveFromList(); 248 | front->container.store(NULL, butil::memory_order_relaxed); 249 | } 250 | if (front->tid == 0) { 251 | // ButexWaiter对象的tid=0说明挂起的是pthread系统线程，调用内核提供的系统调用将pthread线程唤醒。 252 | wakeup_pthread(static_cast(front)); 253 | return 1; 254 | } 255 | ButexBthreadWaiter* bbw = static_cast(front); 256 | unsleep_if_necessary(bbw, get_global_timer_thread()); 257 | // 将挂起的bthread的tid压入TaskGroup的任务队列，实现了将挂起的bthread唤醒。 258 | TaskGroup* g = tls_task_group; 259 | if (g) { 260 | TaskGroup::exchange(&g, bbw->tid); 261 | } else { 262 | bbw->control->choose_one_group()->ready_to_run_remote(bbw->tid); 263 | } 264 | return 1; 265 | } 266 | ``` 267 | -------------------------------------------------------------------------------- /docs/client_bthread_sync.md: -------------------------------------------------------------------------------- 1 | [一次RPC过程中需要bthread互斥的场景](#一次RPC过程中需要bthread互斥的场景) 2 | 3 | [bthread互斥过程涉及到的数据结构](#bthread互斥过程涉及到的数据结构) 4 | 5 | 
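[How brpc implements bthread mutual exclusion](#how-brpc-implements-bthread-mutual-exclusion)

## Scenarios in one RPC that require bthread mutual exclusion
Within a single RPC, the RPC timeout timer and the Backup Request mechanism mean that different bthreads may operate on the RPC's own Controller structure at the same time. The following races can occur:

1. After the first request is sent, no response arrives within backup_request_ms and the Backup Request timer fires. While one bthread tries to send the Backup Request, the response to the first request may arrive, so the bthread sending the Backup Request and the bthread handling the response must exclude each other.

2. With Backup Request disabled, a response (to the first request or to any retry) may be handled just as the RPC timeout expires, so the bthread handling the response and the bthread running the timeout task must exclude each other.

3. The first request (or any retry) and the Backup Request may receive responses at the same moment, so the two bthreads handling those responses must exclude each other.

4. The first request (or any retry) and the Backup Request may receive responses at the same moment, just as the RPC timeout expires, so the two response-handling bthreads and the timeout bthread must all exclude one another.

## Data structures involved in bthread mutual exclusion
One Controller object lives for the whole RPC and stores all of its data and state. Mutual exclusion between bthreads means that at most one bthread accesses the Controller at any moment. It is implemented mainly with the Id object and the Butex object.

1. Id object

   At the start of each RPC, brpc creates one Id object unique to that RPC. It protects the Controller by serializing the bthreads that try to access it at the same time. Its main members are:

   - first_ver & locked_ver

     If the butex lock variable of the Id (the value of the Butex object that the butex pointer refers to) equals first_ver, no bthread is currently accessing the Controller. A bthread that wants access at this point obtains it: it first sets the lock variable to locked_ver and then accesses the Controller.

     Three more values are derived from locked_ver: contended_ver = locked_ver + 1, unlockable_ver = locked_ver + 2, end_ver = locked_ver + 3.

     contended_ver means that several bthreads trying to access the Controller have raced. Suppose bthread 1 finishes accessing the Controller and sees that the lock variable is still locked_ver: no other bthread is waiting, so bthread 1 sets the lock variable back to first_ver and does nothing more. If instead bthread 2 tries to access the Controller while bthread 1 holds it, bthread 2 sees locked_ver and first changes the lock variable to contended_ver, which tells bthread 1 that there is contention. bthread 2 then stores its bthread id and related information in the waiters queue of the butex and yields the CPU, so that its pthread can run the next bthread in the TaskGroup's task queue. When bthread 1 finishes, it sees contended_ver, resets the lock variable to first_ver, finds the suspended bthread 2 in the waiters queue, and asks TaskControl to push bthread 2's id into the task queue of some TaskGroup. This is how bthread 2 is woken. When bthread 2 is scheduled again by some pthread, and no other bthread has taken the Controller in the meantime, it sees first_ver and acquires access to the Controller.

     unlockable_ver means the RPC is about to complete and no bthread may access the Controller any more.

     end_ver is the value of *butex and *join_butex after the RPC has ended and the Id object has been returned to the object pool. (butex and join_butex each point to the first member of a Butex object, value, which is an integer, so both can be treated as pointers to an integer.) end_ver is also the first_ver of the recycled Id object the next time it is assigned to an RPC.

     locked_ver depends on the number of retries allowed for the RPC: locked_ver = first_ver + retry count + 2. For example, if the Id assigned to an RPC has first_ver = 1 and the RPC allows at most 3 retries, then locked_ver = 6, the RPC's unique id _correlation_id = 1, the call_id of the first request is 2, the call_ids of the first, second and third retries are 3, 4 and 5, contended_ver = 7, unlockable_ver = 8, and end_ver = 9. These _correlation_id and call_id values are simplified: both are really 64-bit integers whose high 32 bits are the slot offset of the Id object in the object pool and whose low 32 bits are the version numbers 1, 2, 3... above. The server echoes the call_id in its Response, so the high 32 bits locate the RPC's Id object in O(1) time for the subsequent mutual exclusion steps.

   - mutex

     A futex-like thread lock. The bthreads competing for the same Controller may run on different pthreads, so a bthread must first take this pthread-level lock before modifying fields of the Id object.

   - data

     Pointer to the RPC's one and only Controller object.

   - butex

     Pointer to the head of a Butex object. This Butex serializes the bthreads that access the Controller at the same time; its waiters queue stores the tid and related information of the bthreads suspended while waiting for access to the Controller.

   - join_butex

     Pointer to the head of another Butex object. For a synchronous RPC, this Butex synchronizes the bthread that started the RPC with the bthread that successfully stores the server's Response in the Controller. The bthread that started the RPC (bthread 1) sends the first request, stores its bthread id and related information in the waiters queue of join_butex, and yields the CPU until it is woken by whichever bthread manages to store the Response in the Controller. Whether that bthread is handling the Response to the first request or to a retry, once it has stored the Response it takes bthread 1's id from the waiters queue of join_butex and asks TaskControl to push it into the task queue of some TaskGroup. That wakes bthread 1, which, once rescheduled on a pthread, reads the Response from the Controller and continues processing it.

2. Butex object

   The main member of a Butex is waiters, of type ButexWaiterList. waiters is a wait queue (in fact an intrusive linked list, so insertion and removal are O(1)). A bthread waiting for access to the Controller creates a ButexBthreadWaiter object on its private stack and appends it to waiters. The ButexBthreadWaiter holds the tid and related information of the suspended bthread. The bthread that releases the Controller takes the tid of a suspended bthread from waiters and is responsible for pushing it into the task queue of some TaskGroup so that a pthread schedules it again.

   Suppose bthread 1 starts an RPC synchronously. After sending the request it is suspended, waiting for some bthread to fill the Response into the Controller or for the timeout. Later, bthread 2 (handling the server's Response) and bthread 3 (handling the timeout) run at the same time; bthread 2 wins access to the Controller and bthread 3 is suspended. The memory relationship between the Controller, Id and Butex objects at this moment is shown in the figure below. Note that the Controller is allocated on the private stack of bthread 1, and the two ButexBthreadWaiter objects are allocated on the private stacks of their bthreads. The Id and Butex objects are allocated through ResourcePool and live on the heap. The value of Butex 2 is contended_ver because bthread 3 is queued while bthread 2 accesses the Controller, so bthread 2 must wake bthread 3 when it releases the Controller. bthread 2 has also written the server's Response into the Controller, which satisfies the wake-up condition of bthread 1, so bthread 2 must wake bthread 1 as well.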
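The version arithmetic and the 64-bit id layout described above can be sketched in a few lines. This is only an illustration of the numbers in the text, not brpc's code: `IdVersions`, `make_versions`, `pack_id`, `slot_of` and `version_of` are hypothetical names, and the validity check in `has_version` is an assumption inferred from the example (call_id versions 1 to 5 are valid when locked_ver is 6).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a brpc type): the version values derived from
// first_ver and the retry budget, as described in the text.
struct IdVersions {
    uint32_t first_ver;
    uint32_t locked_ver;

    uint32_t contended_ver() const { return locked_ver + 1; }
    uint32_t unlockable_ver() const { return locked_ver + 2; }
    uint32_t end_ver() const { return locked_ver + 3; }

    // Assumption: a call_id version is valid while first_ver <= v < locked_ver.
    bool has_version(uint32_t v) const {
        return v >= first_ver && v < locked_ver;
    }
};

// locked_ver = first_ver + retry count + 2, as stated in the text.
inline IdVersions make_versions(uint32_t first_ver, uint32_t max_retry) {
    return IdVersions{first_ver, first_ver + max_retry + 2};
}

// Pack the object-pool slot into the high 32 bits and the version into the
// low 32 bits of a 64-bit id.
inline uint64_t pack_id(uint32_t slot, uint32_t version) {
    return (static_cast<uint64_t>(slot) << 32) | version;
}

inline uint32_t slot_of(uint64_t id) {
    return static_cast<uint32_t>(id >> 32);
}

inline uint32_t version_of(uint64_t id) {
    return static_cast<uint32_t>(id & 0xFFFFFFFFu);
}
```

With first_ver = 1 and 3 retries this reproduces the values from the example: locked_ver 6, contended_ver 7, unlockable_ver 8, end_ver 9. Keeping the slot in the high bits is what lets a returned call_id be mapped back to its Id object without any lookup table.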


## How brpc implements bthread mutual exclusion
The main structures behind bthread mutual exclusion in brpc are Id and Butex. For the details of Butex see [this article](butex.md). The Id code lives in src/bthread/id.cpp; its key functions are:

- bthread_id_lock_and_reset_range_verbose: compete for the butex lock, or wait for it, before accessing the shared data:

```c++
// A bthread must call bthread_id_lock before accessing the Controller; that call
// ends up in bthread_id_lock_and_reset_range_verbose.
// The function decides what to do from the current value of the lock variable
// (the value field of the Butex that Id::butex points to):
// 1. value == first_ver: no bthread is accessing the Controller. Set the value to
//    locked_ver to tell later bthreads "I am using the Controller, wait for me",
//    then access the Controller.
// 2. value == locked_ver or contended_ver: the current bthread must suspend. The
//    bthread holding the Controller wakes it when it is done.
// Parameters: id is the call_id of the request (not to be confused with the RPC's
// correlation_id), *pdata receives the address of the shared object (e.g. the
// Controller), range = RPC retry count + 2.
int bthread_id_lock_and_reset_range_verbose(
    bthread_id_t id, void **pdata, int range, const char *location) {
    // Locate the Id object in O(1) through the high 32 bits of id.
    bthread::Id* const meta = address_resource(bthread::get_slot(id));
    if (!meta) {
        return EINVAL;
    }
    // id_ver is the version of the call_id (retries etc. can produce several
    // calls within one RPC, each with its own id).
    const uint32_t id_ver = bthread::get_version(id);
    // butex points to the first member of the Butex structure, the integer
    // value. This is the lock variable.
    uint32_t* butex = meta->butex;
    // Locals live on each bthread's private stack, so every bthread has its own
    // ever_contended.
    bool ever_contended = false;
    // This code can run concurrently in bthreads on different pthreads, so take
    // the thread lock first.
    meta->mutex.lock();
    while (meta->has_version(id_ver)) {
        if (*butex == meta->first_ver) {
            // No other bthread is accessing the Controller.
            // contended locker always wakes up the butex at unlock.
            meta->lock_location = location;
            if (range == 0) {
                // fast path
            } else if (range < 0 ||
                       range > bthread::ID_MAX_RANGE ||
                       range + meta->first_ver <= meta->locked_ver) {
                LOG_IF(FATAL, range < 0) << "range must be positive, actually "
                                         << range;
                LOG_IF(FATAL, range > bthread::ID_MAX_RANGE)
                    << "max range is " << bthread::ID_MAX_RANGE
                    << ", actually " << range;
            } else {
                // range is "RPC retry count + 2". With first_ver = 1 and 3
                // retries allowed before the timeout, locked_ver = 6.
                meta->locked_ver = meta->first_ver + range;
            }
            // 1. The first bthread to access the Controller sets the lock
            //    variable to locked_ver.
            // 2. A bthread that was suspended earlier while waiting for the lock
            //    sets it to contended_ver.
            *butex = (ever_contended ? meta->contended_ver() : meta->locked_ver);
            // The lock variable is updated; later bthreads will see that the
            // Controller is taken. The pthread-level lock can be released.
            meta->mutex.unlock();
            if (pdata) {
                // Return the pointer to the Controller object.
                *pdata = meta->data;
            }
            return 0;
        } else if (*butex != meta->unlockable_ver()) {
            // 1. When a bthread (call it C) gets here, the lock variable (the
            //    Butex value) is either locked_ver or contended_ver:
            //    a. locked_ver: bthread A is still accessing the Controller and
            //       no other bthread is suspended in the wait queue.
            //    b. contended_ver: bthread A is still accessing the Controller
            //       and one or more bthreads (B, D, E...) are suspended,
            //       waiting to be woken.
            // 2. The bthread executing this branch must suspend. Before doing so
            //    it sets the lock variable to contended_ver, telling the holder
            //    that it must wake the suspended bthreads when it is done.
            // 3. Suspending means: the bthread saves the CPU register context
            //    into its context structure and yields; the pthread running it
            //    takes the next bthread from the TaskGroup's task queue.

            // Set the lock variable to contended_ver.
            *butex = meta->contended_ver();
            // Remember the value seen when the lock attempt failed. Just before
            // really suspending, the latest value is checked again: the bthread
            // suspends only if it still equals expected_ver. Otherwise the lock
            // may already have been released and suspending could mean never
            // being woken, so the bthread skips the wait and competes again.
            uint32_t expected_ver = *butex;
            // The key fields are updated; release the pthread-level lock.
            meta->mutex.unlock();
            // Contention between bthreads has occurred.
            ever_contended = true;
            // butex_wait creates a ButexWaiter holding the bthread's key
            // information, appends it to the lock's waiters list, then yields.
            // Before really suspending it re-checks that the lock variable still
            // equals expected_ver.
            if (bthread::butex_wait(butex, expected_ver, NULL) < 0 &&
                errno != EWOULDBLOCK && errno != EINTR) {
                return errno;
            }

            // Execution resumes here after the bthread is woken.

            // The resumed bthread must take the pthread-level lock again. It may
            // not win the butex lock, so the enclosing while loop re-examines
            // whatever value of the lock variable it observes this time.
            meta->mutex.lock();
        } else { // bthread_id_about_to_destroy was called.
            // Another bthread set the Butex value to unlockable_ver: the Id is
            // about to return to the resource pool and the Controller is about
            // to be destroyed, i.e. the RPC has completed. Return immediately.
            meta->mutex.unlock();
            return EPERM;
        }
    }
    meta->mutex.unlock();
    return EINVAL;
}
```

- bthread_id_unlock: release the butex lock and wake one bthread from the lock's wait queue:

```c++
int bthread_id_unlock(bthread_id_t id) {
    // Locate the Id object in O(1) through the high 32 bits of id.
    bthread::Id* const meta = address_resource(bthread::get_slot(id));
    if (!meta) {
        return EINVAL;
    }
    uint32_t* butex = meta->butex;
    // Release fence makes sure all changes made before signal visible to
    // woken-up waiters.
    const uint32_t id_ver = bthread::get_version(id);
    // Take the pthread-level lock.
    meta->mutex.lock();
    if (!meta->has_version(id_ver)) {
        // Invalid call_id: a serious error.
        meta->mutex.unlock();
        LOG(FATAL) << "Invalid bthread_id=" << id.value;
        return EINVAL;
    }
    if (*butex == meta->first_ver) {
        // A bthread that reaches unlock must observe locked_ver or
        // contended_ver. Seeing first_ver is a serious error.
        meta->mutex.unlock();
        LOG(FATAL) << "bthread_id=" << id.value << " is not locked!";
        return EPERM;
    }
    bthread::PendingError front;
    if (meta->pending_q.pop(&front)) {
        meta->lock_location = front.location;
        meta->mutex.unlock();
        // An error was queued while the id was locked: keep the butex lock
        // and run the error callback instead of waking a waiter.
        if (meta->on_error) {
            return meta->on_error(front.id, meta->data, front.error_code);
        } else {
            return meta->on_error2(front.id, meta->data, front.error_code,
                                   front.error_text);
        }
    } else {
        // If the lock variable is contended_ver, N (N >= 1) bthreads are
        // suspended in the lock's waiters queue.
        const bool contended = (*butex == meta->contended_ver());
        // Reset the lock variable to first_ver: this bthread's exclusive access
        // to the Controller is over, and woken bthreads may compete for it.
        *butex = meta->first_ver;
        // The key fields are updated; release the thread lock.
        meta->mutex.unlock();
        if (contended) {
            // We may wake up already-reused id, but that's OK.
            // Wake one bthread from the waiters queue.
            bthread::butex_wake(butex);
        }
        return 0;
    }
}
```

- bthread_id_join: after the bthread responsible for sending finishes the send, it calls bthread_id_join to suspend itself until the bthread handling the server's Response wakes it. It waits in the queue of join_butex, which is unrelated to the butex lock.

```c++
int bthread_id_join(bthread_id_t id) {
    // Locate the Id object in O(1) through the high 32 bits of id.
    const bthread::IdResourceId slot = bthread::get_slot(id);
    bthread::Id* const meta = address_resource(slot);
    if (!meta) {
        // The id is not created yet, this join is definitely wrong.
        return EINVAL;
    }
    const uint32_t id_ver = bthread::get_version(id);
    uint32_t* join_butex = meta->join_butex;
    while (1) {
        meta->mutex.lock();
        const bool has_ver = meta->has_version(id_ver);
        const uint32_t expected_ver = *join_butex;
        meta->mutex.unlock();
        if (!has_ver) {
            break;
        }
        // Append a ButexWaiter to the wait queue of join_butex and yield.
        // The bthread usually resumes when the RPC has completed.
        if (bthread::butex_wait(join_butex, expected_ver, NULL) < 0 &&
            errno != EWOULDBLOCK && errno != EINTR) {
            return errno;
        }
    }
    return 0;
}
```

--------------------------------------------------------------------------------
/docs/client_retry.md:
--------------------------------------------------------------------------------
[Send retries under abnormal conditions](#send-retries-under-abnormal-conditions)

[The brpc send retry policy](#the-brpc-send-retry-policy)

[The brpc Backup-Request mechanism](#the-brpc-backup-request-mechanism)

## Send retries under abnormal conditions
During an RPC, when the client sends a request to the server, it may

The following analyzes how brpc handles

## The brpc send retry policy

## The brpc Backup-Request mechanism



--------------------------------------------------------------------------------
/docs/client_rpc_exception.md:
--------------------------------------------------------------------------------
[Exceptions that may occur while sending an RPC request](#exceptions-that-may-occur-while-sending-an-rpc-request)

[How brpc handles each kind of send exception](#how-brpc-handles-each-kind-of-send-exception)

## Exceptions that may occur while sending an RPC request
During an RPC (assuming a client fd and a server fd have already established a long-lived TCP connection), the client may run into the following exceptions when sending a request to the server:

1. Server rate limiting

   The server's request queue has grown beyond its threshold. The server cannot take on new requests, otherwise it may collapse under the load.

2. Server shutting down

   The server is withdrawing from service. Its process is still alive but cannot answer client requests.

3. Server process closes the fd normally

   A normal close of the fd by the server process covers three cases:

   - the server process calls close on the fd;

   - the server process calls exit;

   - the server process is killed from a shell with kill or ctrl+c.

   In all three cases the server-side fd is closed normally and a FIN is sent to the client over the TCP connection. The client's epoll reports the fd as readable with the events EPOLLIN and EPOLLRDHUP, and a read on the fd in the client application returns 0.

4. 
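Abnormal termination of the server process caused by a machine power failure

   The machine hosting the server process may lose power at any of these moments:

   - While the client is sending data

     "Sending" here means the kernel TCP/IP stack of the client machine transmitting data to the server. The client application only writes data into the kernel send buffer of the fd; the kernel TCP/IP stack is what actually puts that data on the network. If the server machine loses power and the client does nothing, the client is never told that the peer has failed. Only when the client's TCP/IP stack tries to send data does it notice the failure, because the sliding window cannot advance and no ACKs arrive for the data already sent. epoll then raises EPOLLERR to tell the application that the TCP connection has been broken (this logic still needs to be confirmed against the Linux kernel network stack source).

   - After the client has put the data on the network, but before the server process receives the request

     The client receives no response within the timeout and follows the timeout handling path.

   - While the server process is handling the request

     Same as above: the client detects a timeout.

5. Network disconnection

   Disconnection includes an unplugged cable, a failed router between client and server, and so on. It can occur in these situations:

   - The network breaks while the client is sending data

     The logic is the same as a server power failure during sending: the kernel detects the broken TCP connection while sending, and epoll raises EPOLLERR to notify the application (to be confirmed against the Linux kernel network stack source).

   - The client sends the request successfully, the network breaks while the request is in transit, and the server never receives it

     The client receives no response within the timeout and follows the timeout handling path. The server's kernel stack performs no reads or writes, so it does not learn that the TCP connection is broken.

   - The client sends the request successfully and the network breaks while the server is handling it

     The client receives no response within the timeout and follows the timeout handling path. The server's kernel stack detects the broken TCP connection when it sends the response.

   - The client sends the request successfully, the server handles it and sends the response successfully, the network breaks while the response is in transit, and the client never receives it

     The client follows the timeout handling path. The server's kernel stack performs no further reads or writes, so it does not learn that the TCP connection is broken.

   - Because of the disconnection, the server receives only a fragment of the request, or the client receives only a fragment of the response

     The fragment is treated as invalid data.

In summary, an RPC client has to deal with the following abnormal conditions:

- The server is alive but cannot handle requests

- No response is received within the timeout

- The server is shut down normally

- The TCP connection breaks because the server machine loses power or the network is disconnected

The next section describes how brpc handles these four conditions while sending RPC requests.

## How brpc handles each kind of send exception
1. The server is alive but cannot handle requests

   If a software-level restriction prevents the server from handling a request, it sends the client a response that carries an error code: ELIMIT when the server's receive queue is too long, ELOGOFF when the server is shutting down, and EHOSTDOWN when the server has stopped serving. When the client receives one of these codes and is connected to several servers through a connection pool, it resends the request to a different server; this is one retry. If the response carrying the error code is lost, for example because the TCP connection breaks, the request can only be handled as a timeout.

   A retried request may also hit a server that cannot handle it, which triggers another retry. Retries cannot go on forever, so the maximum number of retries per RPC must be bounded. A retry is allowed only when all three of the following conditions hold:

   - the RPC timeout has not been reached;

   - there are retries left;

   - the error code returned by the server permits a retry, or the TCP connection has failed (which also permits a retry).

2. No response is received within the timeout

   In brpc every RPC has a timeout. The timeout is set on the Channel object, so all RPCs going through that Channel share the same timeout, which is copied into the Controller object of each RPC. Several retries may be sent within the timeout, but once the timeout passes without the expected response the RPC ends. A response produced by a retry is ignored even if it arrives immediately after the timeout.

3. The server is shut down normally

   Once a read on an fd returns 0, the fd is no longer usable and the client closes it. The client should send subsequent requests over other TCP connections in the connection pool. Any bthreads suspended on this fd while waiting for the server's Response must be woken so that they can run the RPC retry logic.

4. The TCP connection breaks because the server machine loses power or the network is disconnected

   When a TCP connection fails, epoll raises EPOLLERR on the corresponding fd and brpc immediately marks the Socket object that owns the fd as unavailable. Any bthreads that have already sent data through this Socket and are suspended waiting for the server's Response must be woken so that they can run the RPC retry logic.

--------------------------------------------------------------------------------
/docs/client_rpc_normal.md:
--------------------------------------------------------------------------------
[Main data structures involved in sending data](#main-data-structures-involved-in-sending-data)

[Memory layout and thread execution state while sending data](#memory-layout-and-thread-execution-state-while-sending-data)

## Main data structures involved in sending data
1. Channel object: the communication channel between the client and one server or a group of servers.

2. Controller object: stores the context and state of one complete RPC request.

3. Butex object: implements a mutex at bthread granularity and manages the queue of bthreads suspended on the lock.

4. Id object: synchronizes the bthreads of one RPC (sending data, handling the server's response, and handling the timeout are each done by a different bthread).

## Memory layout and thread execution state while sending data
This section uses the sample program example/multi_threaded_echo_c++/client.cpp that ships with brpc. It follows the changes in the client's memory layout and the execution of its threads to describe how the client goes from sending a request to handling the response when nothing goes wrong (every request is answered in time, with no timeout and no retry caused by a server failure).

When the program runs, it establishes one long-lived TCP connection to a single server and creates thread_num bthreads (thread_num = 3 is assumed below) that send and receive data on that connection. No connection pool or load balancing is involved. The RPC is synchronous, where synchronous refers to synchronization between bthreads: after the sending bthread finishes sending, it does not exit but suspends until the bthread that receives the server's response wakes it, and then resumes. While it is suspended it gives up the CPU, so the pthread can keep running other bthreads from its task queue.

The execution proceeds as follows:

1. A Channel object is created on the stack of main and initialized with the protocol type, connection type, RPC timeout, retry count and other parameters. These parameters are later applied to every RPC request that goes through this Channel.

2. main calls bthread_start_background to create 3 bthreads. The TaskControl and TaskGroup objects do not exist yet, so they are created on the heap at this point (lazy initialization: objects are created when they are first needed, not all at program start):

   - one TaskControl singleton;

   - N TaskGroup objects (N = 4 is assumed below). Each TaskGroup corresponds to one system thread (pthread) and is private to it. Each pthread starts with its TaskGroup's run_main_task as its main function, which loops forever: take a bthread id from the TaskGroup's task queue, look up the bthread object by id, and run the bthread's task function.

3. Three TaskMeta objects are created in the TaskMeta object pool (each TaskMeta is equivalent to one bthread). The fn function pointer of each TaskMeta points to the static function sender defined in client.cpp, which is the bthread's task function. After each TaskMeta is created, a TaskGroup is chosen by hashing and the tid (the unique id of the bthread) is pushed into that TaskGroup's _remote_rq queue. (The pthread that owns a TaskGroup is called a worker thread. The tid of a bthread created by a worker thread is pushed into the _rq queue of that worker's own TaskGroup. The thread running main in this example is not a worker thread, so the tids of the bthreads it creates are pushed into the _remote_rq queue of the chosen TaskGroup.)

4. At this point main cannot simply return (the Channel object would be destroyed immediately and no RPC could proceed). It must wait until all 3 bthreads have finished running sender.

   - The thread running main is suspended as follows: its information is stored in a ButexPthreadWaiter, which is appended to the waiters queue of the Butex object that the TaskMeta's version_butex pointer refers to. When the TaskMeta's task function fn finishes, the pthread that was suspended waiting for fn to finish is found in waiters and woken. For the details of Butex see [this article](butex.md).

   - The system thread running main is suspended as soon as it joins bthread 1, waiting in wait_pthread. When bthread 1 finishes sender, it wakes the main thread, which goes on to join bthread 2. If bthread 2 is still running, a new ButexPthreadWaiter holding the main thread's information is appended to the waiters queue of the Butex referenced by version_butex in bthread 2's TaskMeta, and the main thread is woken when bthread 2 finishes sender. If bthread 2 has already finished when main joins it, the join returns immediately and main goes on to join bthread 3.

   - main returns only after all three joins have returned.

5. The thread state inside the client process is now:

   - bthreads: bthreads 1, 2 and 3 have been created and their ids have been pushed into the _remote_rq task queues of different TaskGroup objects (call them TaskGroup 1, 2 and 3);

   - pthreads: the process has 5 pthreads. Three are about to take a bthread id from the _remote_rq of their own TaskGroup and run the task function of the corresponding TaskMeta; one is still blocked in run_main_task waiting to be notified of new tasks; the thread running main is suspended, waiting for bthread 1 to finish.

6. The memory layout inside the client process at this point is shown in the figure below. bthreads 1, 2 and 3 have not started running and have allocated no local variables, so their private stacks are still empty:
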

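
7. The 3 pthreads of TaskGroup 1, 2 and 3 start running the task function of the bthread each has picked up, that is, the static function sender in client.cpp. Each bthread has its own private stack, so the local request, response and Controller objects in sender are allocated on the bthread's private stack.

8. Following the standard protobuf programming model, each of the 3 bthreads running sender calls the Channel's CallMethod, which does the following:

   - CallMethod takes each bthread's private request, response and Controller as arguments. Inside, it fills in members of the Controller, including the RPC start timestamp, the maximum retry count, the RPC timeout, the Backup Request timeout, and correlation_id, the unique id that identifies the RPC. The Controller can be regarded as holding all the context of one RPC.

   - There is no race between threads inside CallMethod; CallMethod itself is thread-safe. The Channel object lives on the stack of main, and the thread running main stays suspended until all 3 bthreads have finished, so the Channel outlives the whole RPC process.

   - CallMethod creates the Id object associated with the Controller. The Id synchronizes the bthreads of one RPC: sending the request, receiving the response and handling the timeout are done by different bthreads, which may run on different pthreads. The Controller of the RPC can therefore be accessed by these bthreads at the same time, which amounts to concurrent access by different pthreads and creates a race. A pthread must not simply wait on a thread lock here, because that would suspend the pthread and block every other bthread in its TaskGroup's task queue. So while one bthread is accessing the Controller, a bthread on another pthread that wants access must add its own information to a wait queue and yield the CPU, letting its pthread run the next bthread in the task queue. When the bthread holding the Controller gives up access, it finds the suspended bthread in the wait queue and pushes its bthread id back into the task queue of some TaskGroup, so the bthread that was waiting for the Controller resumes from its yield point. This is the basic principle of suspend and wake-up at bthread level, and it also keeps every pthread wait-free.

   - CallMethod marks the Id as locked by setting the value of the Butex that the Id's butex pointer refers to to locked_ver, meaning the sending bthread is currently accessing the Controller. This article assumes the response arrives normally after sending, with no retry or RPC timeout, so it does not go into the Id object. For the details of Id see [this article](client_bthread_sync.md).

9. The pthread's execution then enters the Controller's IssueRPC function, which:

   - packs the call_id of the first request of the RPC, the RPC method name and the payload into a message in the format of the chosen protocol;

   - calls Socket::Write to actually send the data. A Socket object represents the connection between the client and a single server. For the details of writing to the fd see [this article](io_write.md);

   - before any data is sent, establishes the long-lived TCP connection to the server and lazily initializes event_dispatcher_num EventDispatcher objects (event_dispatcher_num = 2 is assumed). This creates 2 new bthreads, 4 and 5, whose task function is the static function EventDispatcher::RunThis. When a pthread runs bthread 4 or 5, it calls epoll_wait to check for I/O events. brpc has no dedicated I/O pthreads;

   - after Socket::Write returns, calls bthread_id_unlock to release exclusive access to the Controller.

10. Because the RPC is synchronous, the bthread calls bthread_id_join after sending to suspend itself and give up the CPU, waiting to be woken by the bthread that receives the server's response. The thread state inside the client process is now: bthreads 1, 2 and 3 are all suspended, and pthreads 1, 2 and 3 have left the task functions of bthreads 1, 2 and 3 and returned to TaskGroup::run_main_task to wait for new bthread tasks. Writing to the fd usually creates a KeepWrite bthread (bthread 6). Assume its id is pushed into the task queue of TaskGroup 4 and it runs on pthread 4. pthreads 1, 2 and 3 then have no bthread to run and are suspended as system threads.

11. The memory layout inside the client process at this point is shown in the figure below. Note that objects of different types are allocated in different memory regions: Butex and Id objects are on the heap, while Controller and ButexBthreadWaiter objects are on the private stacks of bthreads:
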


12. Once the KeepWrite bthread has finished, all 3 requests have been sent. Assume the server returns 3 responses normally. All 3 responses arrive on the same TCP connection, so only one of bthreads 4 and 5 sees the fd become readable through epoll_wait(). It creates bthread 7 to read the data from the fd's kernel receive buffer into the application. During unpacking, each time a Response is parsed a new bthread is created to handle it, so that reading and handling of responses run with maximum concurrency. Response 1 is therefore handled in bthread 8, Response 2 in bthread 9, and Response 3 in bthread 7 (the last Response does not need a new bthread and is handled in bthread 7 itself). bthreads 8, 9 and 7 copy Responses 1, 2 and 3 into the response of the corresponding Controller, at which point the application can see the response data. They also wake the suspended bthreads 1, 2 and 3, which resume, can operate on the response in the Controller, and start sending the next RPC request.

--------------------------------------------------------------------------------
/docs/futex.md:
--------------------------------------------------------------------------------
[Shortcomings of spinlocks and kernel synchronization](#shortcomings-of-spinlocks-and-kernel-synchronization)

[Futex design principles](#futex-design-principles)

[The Futex implementation in brpc](#the-futex-implementation-in-brpc)

## Shortcomings of spinlocks and kernel synchronization
Before Futex existed, there were two ways to make a pthread system thread wait for a lock:

1. Use a spinlock, which is simple to implement:

```c++
#include <atomic>

std::atomic<int> flag{0};

void lock() {
    int expect = 0;
    while (!flag.compare_exchange_strong(expect, 1)) {
        // On failure, expect is overwritten with the observed value; reset it.
        expect = 0;
    }
}

void unlock() {
    flag.store(0);
}
```

   A spinlock is an application-level synchronization mechanism. It runs entirely in user space and involves no switch between user mode and kernel mode. Its problem is that it only suits short critical sections. There, a thread that takes the lock releases it quickly and the waiting threads need only a few iterations of the while loop to get it. If the critical section is long, the thread holding the lock spends a considerable time in it, and until it releases the lock the other threads can only busy-loop in while(), burning CPU. The application has no way of its own to suspend a pthread system thread.

2. Use pthread_mutex_lock provided by Linux, or the mutex API of the programming language (such as C++ std::unique_lock). These are essentially the same: both rely on the synchronization mechanism provided by the Linux kernel, which can suspend a pthread system thread and give up the CPU. The drawback of using the kernel mechanism directly is that every lock and unlock is a system call and requires a switch between user mode and kernel mode, which has a performance cost. Yet lock and unlock often happen without any contention between threads, and without contention the mode switch is unnecessary.

## Futex design principles
Futex can be seen as a combination of a spinlock and a kernel-level pthread lock. Its design intent is:

1. When a thread locks, it uses an atomic operation in user space to perform "try to lock: if the lock variable is 0, set it to 1 and return success; if it is 1, some thread already holds the lock, so return failure". On success the thread goes straight into the critical section. On failure it suspends itself with a system call similar to pthread_mutex_lock. Several threads may be suspended at once, so the information of each suspended thread must be stored in a wait queue associated with the lock.

2. When a thread unlocks, it also uses an atomic operation to set the lock variable back to 0. If the wait queue of the lock is not empty, the releasing thread must use a system call provided by the kernel to wake the threads suspended on the lock. Whether it wakes one thread or all of them depends on the use case.

3. As a result, without contention Futex locks and unlocks the critical section entirely in user space. Kernel system calls are used to suspend and wake threads only when there really is contention.

4. One detail matters when implementing Futex. Suppose the code is written like this:

```c++
void lock() {
    while (!trylock()) {
        wait();  // Suspend the current thread with a kernel system call.
    }
}
```

   This implementation has a problem: there is a time window between trylock() and wait() in which the lock variable may change. For example, thread A calls trylock() and fails; before it calls wait(), thread B, which held the lock, releases it. When thread A then calls wait() it is suspended forever and never woken. wait() must therefore check again whether the lock variable still has the old value seen in trylock(). If not, wait() should return immediately so that trylock() is executed again.

## The Futex implementation in brpc
brpc implements the Futex mechanism mainly in src/bthread/sys_futex.cpp. The SimuFutex class holds per-lock bookkeeping such as the number of waiting threads, and two functions handle wait and wake:

- The SimuFutex class:

```c++
class SimuFutex {
public:
    SimuFutex() : counts(0)
                , ref(0) {
        pthread_mutex_init(&lock, NULL);
        pthread_cond_init(&cond, NULL);
    }
    ~SimuFutex() {
        pthread_mutex_destroy(&lock);
        pthread_cond_destroy(&cond);
    }

public:
    pthread_mutex_t lock;
    pthread_cond_t cond;
    // Number of threads currently suspended while waiting for the lock.
    int32_t counts;
    int32_t ref;
};
```

- The futex_wait_private function:

```c++
// addr1 is the address of the lock variable; expected is the value of the lock
// variable seen by the caller's spinlock attempt.
int futex_wait_private(void* addr1, int expected, const timespec* timeout) {
    // InitFutexMap initializes the global s_futex_map of type
    // std::unordered_map<void*, SimuFutex>*. It runs only once.
    if (pthread_once(&init_futex_map_once, InitFutexMap) != 0) {
        LOG(FATAL) << "Fail to pthread_once";
        exit(1);
    }
    std::unique_lock<pthread_mutex_t> mu(s_futex_map_mutex);
    SimuFutex& simu_futex = (*s_futex_map)[addr1];
    ++simu_futex.ref;
    mu.unlock();

    int rc = 0;
    {
        std::unique_lock<pthread_mutex_t> mu1(simu_futex.lock);
        // Check whether the latest value of *addr1 still equals expected.
        if (static_cast<butil::atomic<int>*>(addr1)->load() == expected) {
            // It does, so the current thread may be suspended.
            // One more thread is about to wait on *addr1: increment counts.
            ++simu_futex.counts;
            // pthread_cond_wait suspends the thread and releases simu_futex.lock.
            if (timeout) {
                timespec timeout_abs = butil::timespec_from_now(*timeout);
                if ((rc = pthread_cond_timedwait(&simu_futex.cond, &simu_futex.lock, &timeout_abs)) != 0) {
                    // pthread_cond_timedwait re-locks simu_futex.lock on return.
                    errno = rc;
                    rc = -1;
                }
            } else {
                if ((rc = pthread_cond_wait(&simu_futex.cond, &simu_futex.lock)) != 0) {
                    // pthread_cond_wait re-locks simu_futex.lock on return.
                    errno = rc;
                    rc = -1;
                }
            }
            // The thread has been woken and holds simu_futex.lock again:
            // decrement counts.
            --simu_futex.counts;
        } else {
            // *addr1 no longer equals expected: the caller must retry its
            // spinlock attempt.
            errno = EAGAIN;
            rc = -1;
        }
    }

    std::unique_lock<pthread_mutex_t> mu1(s_futex_map_mutex);
    if (--simu_futex.ref == 0) {
        s_futex_map->erase(addr1);
    }
    mu1.unlock();
    return rc;
}
```

- The futex_wake_private function:

```c++
int futex_wake_private(void* addr1, int nwake) {
    if (pthread_once(&init_futex_map_once, InitFutexMap) != 0) {
        LOG(FATAL) << "Fail to pthread_once";
        exit(1);
    }
    std::unique_lock<pthread_mutex_t> mu(s_futex_map_mutex);
    auto it = s_futex_map->find(addr1);
    if (it == s_futex_map->end()) {
        mu.unlock();
        return 0;
    }
    SimuFutex& simu_futex = it->second;
    ++simu_futex.ref;
    mu.unlock();

    int nwakedup = 0;
    int rc = 0;
    {
        std::unique_lock<pthread_mutex_t> mu1(simu_futex.lock);
        nwake = (nwake < simu_futex.counts) ? nwake : simu_futex.counts;
        for (int i = 0; i < nwake; ++i) {
            // Wake the requested number of threads suspended on *addr1.
            if ((rc = pthread_cond_signal(&simu_futex.cond)) != 0) {
                errno = rc;
                break;
            } else {
                ++nwakedup;
            }
        }
    }

    std::unique_lock<pthread_mutex_t> mu2(s_futex_map_mutex);
    if (--simu_futex.ref == 0) {
        s_futex_map->erase(addr1);
    }
    mu2.unlock();
    return nwakedup;
}
```

--------------------------------------------------------------------------------
/docs/io_protobuf.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ronaldo8210/brpc_source_code_analysis/d4d52689a786767b892a547cb03781a546d00b77/docs/io_protobuf.md
--------------------------------------------------------------------------------
/docs/io_read.md:
--------------------------------------------------------------------------------
[How traditional RPC frameworks read data from an fd](#how-traditional-rpc-frameworks-read-data-from-an-fd)

[How brpc reads data](#how-brpc-reads-data)

## How traditional RPC frameworks read data from an fd
Traditional RPC frameworks usually separate I/O threads from worker threads. All fds handled on a machine are hashed across several I/O threads, and each I/O thread runs a loop like the following on its private EventLoop object:

```C++
while (!stop) {
    int n = epoll_wait();
    for (int i = 0; i < n; ++i) {
        if (event is EPOLLIN) {
            // read data from fd
            // push pointer or reference of data to task queue
        }
    }
}
```

After reading data from an fd, the I/O thread wraps a pointer or reference to the data in a Task object and pushes the Task pointer into a global task queue. Worker threads take messages from the task queue and run the business logic.

This approach has several problems:

1. An I/O thread can read from only one fd at a time, and copying data from the fd's kernel buffer to an application buffer is relatively slow. The later an fd comes in the read order, the more likely its data is read and handled late. For example, suppose 10 clients send requests to one server over 10 TCP connections at the same moment and the server runs a single I/O thread. epoll reports all 10 fds as readable at once, but the I/O thread has to finish reading the first fd before reading the second, and so on. By the time it reaches the 9th and 10th fds the clients may already have timed out. The 10 clients sent their requests at the same time, yet the server reads them in sequence, which is unfair to the clients.

2. 
多个I/O线程都会将Task压入一个全局的任务队列，会产生锁竞争；多个worker线程从全局任务队列中获取任务，也会产生锁竞争，这都会降低性能。并且多个worker线程从长久看很难获取到均匀的任务数量，例如有4个worker线程同时去任务队列拿任务，worker 1竞争锁成功，拿到任务后去处理，然后worker 2拿到任务，接着worker 3拿到任务，再然后worker 1可能已经处理任务完毕，又来和worker 4竞争全局队列的锁，可能worker 1又再次竞争成功，worker 4还是饿着，这就造成了任务没有在各个worker线程间均匀分配； 27 | 28 | 3. I/O线程从fd的inode内核缓存读数据到应用层缓冲区，worker线程需要处理这块内存上的数据，内存数据同步到执行worker线程的cpu的cacheline上需要耗费时间。 29 | 30 | ## brpc实现的读数据方式 31 | brpc没有专门的I/O线程，只有worker线程，epoll_wait()也是在bthread中被执行。当一个fd可读时： 32 | 33 | 1. 读取动作并不是在epoll_wait()所在的bthread 1上执行，而是会通过TaskGroup::start_foreground()新建一个bthread 2，bthread 2负责将fd的inode内核输入缓存中的数据读到应用层缓存区，pthread执行流会立即进入bthread 2，bthread 1会被加入任务队列的尾部，可能会被steal到其他pthread上执行； 34 | 35 | 2. bthread 2进行拆包时，每解析出一个完整的应用层报文，就会为每个报文的处理再专门创建一个bthread，所以bthread 2可能会创建bthread 3、4、5...这样的设计意图是尽量让一个fd上读出的各个报文也得到最大化的并发处理。 36 | 37 | （细节TODO） 38 | -------------------------------------------------------------------------------- /docs/io_write.md: -------------------------------------------------------------------------------- 1 | [多线程向同一个TCP连接写数据的设计原理](#多线程向同一个TCP连接写数据的设计原理) 2 | 3 | [brpc中的代码实现](#brpc中的代码实现) 4 | 5 | [一个实际场景下的示例](#一个实际场景下的示例) 6 | 7 | ## 多线程向同一个TCP连接写数据的设计原理 8 | 考虑brpc自带的示例程序example/multi_threaded_echo_c++/client.cpp，use_bthread为true的情况下，多个bthread通过一条TCP长连接向服务端发送数据，而多个bthread通常又是运行在多个系统线程pthread上的，所以多个pthread如何高效且线程安全地向一个TCP连接写数据，是系统设计需要重点考虑的。brpc针对这个问题的设计思路如下： 9 | 10 | 1. 为每个可被多线程写入数据的fd维护一个单项链表，每个试图向fd写数据的线程首先判断自己是不是当前第一个向fd写数据的线程，如果是，则持有写数据的权限，可以执行向fd写数据的操作；如果不是，则将待写数据加入链表就即刻返回（bthread执行结束，挂起，等待响应数据）。 11 | 12 | 2. 掌握写权限的线程，在向fd写数据的时候，不仅可以写本线程持有的待写数据，而且可以观察到fd的链表上是否还加入了其他线程的待写数据，写入的时候可以尽量写入足够多的数据，但只执行一次写操作，如果因为fd的内核inode输出缓冲区已满而未能全部写完，则启动一个新的bthread去执行后续的写操作，当前bthread立即返回（被挂起，等待响应response的bthread唤醒）。 13 | 14 | 3. 新启动的执行写操作的bthread，负责将fd的链表上的所有待写入数据写入fd（后续可能会有线程不断将待写数据加入待写链表），直到将链表清空。如果fd的内核inode缓冲区已满而不能写入，则该bthread将被挂起，让出cpu。等到epoll通知fd可写时，该thread再被唤起，继续写入。 15 | 16 | 4. 
KeepWrite bthread直到通过一个原子操作判断出_write_hread已为NULL时，才会执行完成，如果同时刻有一个线程通过原子操作判断出_write_hread为NULL，则重复上述过程1，所以不可能同时有两个KeepWrite bthread存在。 17 | 18 | 5. 按照如上规则，所有bthread都不会有任何的等待操作，这就做到了wait-free，当然也是lock-free的（判断自己是不是第一个向fd写数据的线程的操作实际上是个原子交换操作）。 19 | 20 | 下面结合brpc的源码来阐述这套逻辑的实现。 21 | 22 | ## brpc中的代码实现 23 | brpc中的Socket类对象代表Client端与Server端的一条TCP连接，其中主要函数有： 24 | 25 | - StartWrite函数：每个bthread向TCP连接写数据的入口，在实际环境下通常会被多个pthread执行，必须要做到线程安全： 26 | 27 | ```c++ 28 | int Socket::StartWrite(WriteRequest* req, const WriteOptions& opt) { 29 | // Release fence makes sure the thread getting request sees *req 30 | // 与当前_write_head做原子交换，_write_head初始值是NULL， 31 | // 如果是第一个写fd的线程，则exchange返回NULL，并将_write_head指向第一个线程的待写数据， 32 | // 如果不是第一个写fd的线程，exchange返回值是非NULL，且将_write_head指向最新到来的待写数据。 33 | WriteRequest* const prev_head = 34 | _write_head.exchange(req, butil::memory_order_release); 35 | if (prev_head != NULL) { 36 | // Someone is writing to the fd. The KeepWrite thread may spin 37 | // until req->next to be non-UNCONNECTED. This process is not 38 | // lock-free, but the duration is so short(1~2 instructions, 39 | // depending on compiler) that the spin rarely occurs in practice 40 | // (I've not seen any spin in highly contended tests). 41 | // 如果不是第一个写fd的bthread，将待写数据加入链表后，就返回。 42 | req->next = prev_head; 43 | return 0; 44 | } 45 | 46 | int saved_errno = 0; 47 | bthread_t th; 48 | SocketUniquePtr ptr_for_keep_write; 49 | ssize_t nw = 0; 50 | 51 | // We've got the right to write. 52 | // req指向的是第一个待写数据，肯定是以_write_head为头部的链表的尾结点，next一定是NULL。 53 | req->next = NULL; 54 | 55 | // Connect to remote_side() if not. 56 | // 如果TCP连接未建立，则在ConnectIfNot内部执行非阻塞的connect，并将自身挂起， 57 | // 等待epoll通知连接已建立后再被唤醒执行。 58 | int ret = ConnectIfNot(opt.abstime, req); 59 | if (ret < 0) { 60 | saved_errno = errno; 61 | SetFailed(errno, "Fail to connect %s directly: %m", description().c_str()); 62 | goto FAIL_TO_WRITE; 63 | } else if (ret == 1) { 64 | // We are doing connection. 
Callback `KeepWriteIfConnected' 65 | // will be called with `req' at any moment after 66 | // TCP连接建立中，bthread返回、挂起，等待唤醒。 67 | return 0; 68 | } 69 | 70 | // NOTE: Setup() MUST be called after Connect which may call app_connect, 71 | // which is assumed to run before any SocketMessage.AppendAndDestroySelf() 72 | // in some protocols(namely RTMP). 73 | req->Setup(this); 74 | 75 | if (ssl_state() != SSL_OFF) { 76 | // Writing into SSL may block the current bthread, always write 77 | // in the background. 78 | goto KEEPWRITE_IN_BACKGROUND; 79 | } 80 | 81 | // Write once in the calling thread. If the write is not complete, 82 | // continue it in KeepWrite thread. 83 | // 向fd写入数据，这里只关心req指向的数据，不关心其他bthread加入_write_head链表的数据。 84 | // 不一定能一次写完，可能req指向的数据只写入了一部分。 85 | if (_conn) { 86 | butil::IOBuf* data_arr[1] = { &req->data }; 87 | nw = _conn->CutMessageIntoFileDescriptor(fd(), data_arr, 1); 88 | } else { 89 | nw = req->data.cut_into_file_descriptor(fd()); 90 | } 91 | if (nw < 0) { 92 | // RTMP may return EOVERCROWDED 93 | if (errno != EAGAIN && errno != EOVERCROWDED) { 94 | saved_errno = errno; 95 | // EPIPE is common in pooled connections + backup requests. 
96 | PLOG_IF(WARNING, errno != EPIPE) << "Fail to write into " << *this; 97 | SetFailed(saved_errno, "Fail to write into %s: %s", 98 | description().c_str(), berror(saved_errno)); 99 | goto FAIL_TO_WRITE; 100 | } 101 | } else { 102 | AddOutputBytes(nw); 103 | } 104 | // 判断req指向的数据是否已写完。 105 | // 在IsWriteComplete内部会判断，如果req指向的数据已全部写完，且当前时刻req是唯一待写入的数据， 106 | // 则IsWriteComplete返回true。 107 | if (IsWriteComplete(req, true, NULL)) { 108 | // 回收req指向的heap内存到对象池，bthread完成任务，返回。 109 | ReturnSuccessfulWriteRequest(req); 110 | return 0; 111 | } 112 | 113 | KEEPWRITE_IN_BACKGROUND: 114 | ReAddress(&ptr_for_keep_write); 115 | req->socket = ptr_for_keep_write.release(); 116 | // req指向的数据未全部写完，为了使pthread wait-free，启动KeepWrite bthread后，当前bthread就返回。 117 | // 在KeepWrite bthread内部，不仅需要处理当前req未写完的数据，还可能要处理其他bthread加入链表的数据。 118 | // KeepWrite bthread并不具有最高的优先级，所以使用bthread_start_background，将KeepWrite bthread的 119 | // tid加到执行队列尾部。 120 | if (bthread_start_background(&th, &BTHREAD_ATTR_NORMAL, 121 | KeepWrite, req) != 0) { 122 | LOG(FATAL) << "Fail to start KeepWrite"; 123 | KeepWrite(req); 124 | } 125 | return 0; 126 | 127 | FAIL_TO_WRITE: 128 | // `SetFailed' before `ReturnFailedWriteRequest' (which will calls 129 | // `on_reset' callback inside the id object) so that we immediately 130 | // know this socket has failed inside the `on_reset' callback 131 | ReleaseAllFailedWriteRequests(req); 132 | errno = saved_errno; 133 | return -1; 134 | } 135 | ``` 136 | 137 | - KeepWrite函数：作为一个独立存在的bthread的任务函数，负责不停地写入所有线程加入到_write_hread链表的数据，直到链表为空： 138 | 139 | ```c++ 140 | void* Socket::KeepWrite(void* void_arg) { 141 | g_vars->nkeepwrite << 1; 142 | WriteRequest* req = static_cast(void_arg); 143 | SocketUniquePtr s(req->socket); 144 | 145 | // When error occurs, spin until there's no more requests instead of 146 | // returning directly otherwise _write_head is permantly non-NULL which 147 | // makes later Write() abnormal. 
148 | WriteRequest* cur_tail = NULL; 149 | do { 150 | // req was written, skip it. 151 | // 如果req的next指针不为NULL，则已经调用过IsWriteComplete实现了单向链表的翻转， 152 | // 待写数据的顺序已按到达序排列。 153 | // 所以如果req的next指针不为NULL且req的数据已写完，可以即刻回收req指向的内存， 154 | // 并将req重新赋值为下一个待写数据的指针。 155 | if (req->next != NULL && req->data.empty()) { 156 | // 执行到这里，就是因为虽然req指向的WriteRequest中的数据已写完， 157 | // 但_write_head链表中又被其他bthread加入了待写数据。 158 | WriteRequest* const saved_req = req; 159 | req = req->next; 160 | s->ReturnSuccessfulWriteRequest(saved_req); 161 | } 162 | // 向fd写入一次数据，DoWrite内部的实现为尽可能的多谢，可以连带req后面的待写数据一起写。 163 | const ssize_t nw = s->DoWrite(req); 164 | if (nw < 0) { 165 | if (errno != EAGAIN && errno != EOVERCROWDED) { 166 | // 如果不是因为内核inode输出缓存已满导致的write操作结果小于0， 167 | // 则标记Socket对象状态异常（TCP连接异常）。 168 | const int saved_errno = errno; 169 | PLOG(WARNING) << "Fail to keep-write into " << *s; 170 | s->SetFailed(saved_errno, "Fail to keep-write into %s: %s", 171 | s->description().c_str(), berror(saved_errno)); 172 | break; 173 | } 174 | } else { 175 | s->AddOutputBytes(nw); 176 | } 177 | // Release WriteRequest until non-empty data or last request. 178 | // 可能一次写入了链表中多个节点中的待写数据，数据已写完的节点回收内存。 179 | // while操作结束后req指向的是已翻转的链表中的第一个数据未写完的节点。 180 | while (req->next != NULL && req->data.empty()) { 181 | WriteRequest* const saved_req = req; 182 | req = req->next; 183 | s->ReturnSuccessfulWriteRequest(saved_req); 184 | } 185 | // TODO(gejun): wait for epollout when we actually have written 186 | // all the data. This weird heuristic reduces 30us delay... 187 | // Update(12/22/2015): seem not working. better switch to correct code. 188 | // Update(1/8/2016, r31823): Still working. 189 | // Update(8/15/2017): Not working, performance downgraded. 
        //if (nw <= 0 || req->data.empty()/*note*/) {
        if (nw <= 0) {
            // The only errors that reach this point are EAGAIN and
            // EOVERCROWDED, so nw <= 0 here typically means the kernel send
            // buffer of the fd is full.
            // In that case the bthread running KeepWrite must be suspended and
            // yield the CPU, so that its pthread can take the next bthread
            // from the task queue and run it.
            // When epoll reports that the send buffer has room again, the
            // KeepWrite bthread is woken up and continues writing to the fd.
            g_vars->nwaitepollout << 1;
            bool pollin = (s->_on_edge_triggered_events != NULL);
            // NOTE: Waiting epollout within timeout is a must to force
            // KeepWrite to check and setup pending WriteRequests periodically,
            // which may turn on _overcrowded to stop pending requests from
            // growing infinitely.
            const timespec duetime =
                butil::milliseconds_from_now(WAIT_EPOLLOUT_TIMEOUT_MS);
            // WaitEpollOut calls butex_wait internally, which suspends the
            // current bthread. When the bthread runs again, it resumes at the
            // return point of butex_wait.
            const int rc = s->WaitEpollOut(s->fd(), pollin, &duetime);
            if (rc < 0 && errno != ETIMEDOUT) {
                const int saved_errno = errno;
                PLOG(WARNING) << "Fail to wait epollout of " << *s;
                s->SetFailed(saved_errno, "Fail to wait epollout of %s: %s",
                             s->description().c_str(), berror(saved_errno));
                break;
            }
        }
        // Make cur_tail point to the tail node of the reversed list.
        if (NULL == cur_tail) {
            for (cur_tail = req; cur_tail->next != NULL;
                 cur_tail = cur_tail->next);
        }
        // At this point cur_tail points to the tail node of the list that has
        // been reversed so far.
        // Return when there's no more WriteRequests and req is completely
        // written.
        if (s->IsWriteComplete(cur_tail, (req == cur_tail), &cur_tail)) {
            // If IsWriteComplete returns true, req must equal cur_tail and
            // _write_head must now be NULL.
            CHECK_EQ(cur_tail, req);
            // Once the memory is reclaimed, the KeepWrite bthread ends. If a
            // thread writes to the fd later, the same logic starts over.
            // So at most one KeepWrite bthread exists at any moment.
            s->ReturnSuccessfulWriteRequest(req);
            return NULL;
        }
    } while (1);

    // Error occurred, release all requests until no new requests.
    s->ReleaseAllFailedWriteRequests(req);
    return NULL;
}
```

- IsWriteComplete: this function is called in two situations. 1. The bthread that holds write permission writes the pending data of its own WriteRequest to the fd, and after one write checks whether that WriteRequest has been fully written. 2. The KeepWrite bthread calls it to check whether every WriteRequest in the list reversed in the previous round has been fully written. In both cases IsWriteComplete also detects whether other bthreads have added new pending data to the _write_head list.

```c++
bool Socket::IsWriteComplete(Socket::WriteRequest* old_head,
                             bool singular_node,
                             Socket::WriteRequest** new_tail) {
    // old_head can only be one of two things: 1. the WriteRequest carried by
    // the bthread that holds write permission, or 2. the tail node of the
    // list reversed in the previous round.
    // In either case, old_head->next must be NULL.
    CHECK(NULL == old_head->next);
    // Try to set _write_head to NULL to mark that the write is done.
    WriteRequest* new_head = old_head;
    WriteRequest* desired = NULL;
    bool return_when_no_more = true;
    if (!old_head->data.empty() || !singular_node) {
        desired = old_head;
        // Write is obviously not complete if old_head is not fully written.
        return_when_no_more = false;
    }
    // 1. If the previously reversed list is fully written and nothing new was
    //    added, _write_head is set to NULL and the current KeepWrite bthread
    //    is about to end.
    // 2. If the previously reversed list is not fully written and no other
    //    bthread has added pending data, _write_head keeps its old value.
    // 3. If other bthreads have already added pending data to _write_head,
    //    the exchange fails and new_head receives the latest _write_head
    //    value, in preparation for the list reversal below.
    if (_write_head.compare_exchange_strong(
            new_head, desired, butil::memory_order_acquire)) {
        // No one added new requests.
        if (new_tail) {
            *new_tail = old_head;
        }
        return return_when_no_more;
    }
    // Reaching here means some other bthread has added pending data to the
    // _write_head list. After compare_exchange_strong, new_head points to the
    // WriteRequest that _write_head currently points to, which cannot be
    // old_head.
    CHECK_NE(new_head, old_head);
    // Above acquire fence pairs release fence of exchange in Write() to make
    // sure that we see all fields of requests set.

    // Someone added new requests.
    // Reverse the list until old_head.
    // Reverse the singly linked list whose head is new_head and whose tail is
    // old_head, so that the pending data is ordered by arrival time.
    // New bthreads may add pending data to the _write_head list at any time,
    // but that newly arrived data is ignored for now.
    WriteRequest* tail = NULL;
    WriteRequest* p = new_head;
    do {
        while (p->next == WriteRequest::UNCONNECTED) {
            // TODO(gejun): elaborate this
            sched_yield();
        }
        WriteRequest* const saved_next = p->next;
        p->next = tail;
        tail = p;
        p = saved_next;
        CHECK(p != NULL);
    } while (p != old_head);

    // Link old list with new list.
    old_head->next = tail;
    // Call Setup() from oldest to newest, notice that the calling sequence
    // matters for protocols using pipelined_count, this is why we don't
    // call Setup in above loop which is from newest to oldest.
    for (WriteRequest* q = tail; q; q = q->next) {
        q->Setup(this);
    }
    // Point *new_tail to the tail node of the newly reversed list.
    if (new_tail) {
        *new_tail = new_head;
    }
    return false;
}
```

## An example from a real scenario
The following scenario illustrates how the threads execute and how memory changes:

1. Suppose that at time T0 three bthreads, each run by a different pthread, write to the same fd at the same time. All three enter StartWrite and execute the atomic _write_head.exchange. The initial value of _write_head is NULL. Suppose bthread 0 is the first to exchange its own req pointer with _write_head: bthread 0 then acquires the right to write to the fd, while bthread 1 and bthread 2 add their data to the _write_head list and immediately return 0 (after returning, bthread 1 and bthread 2 are suspended and yield the CPU). The memory layout at this point is:


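2. From time T1 on (unless stated otherwise, assume that for now no new bthread adds pending data to the _write_head list), the bthread that acquired write permission in step 1 writes the data of its own WriteRequest 1 to the fd (assume the long-lived TCP connection is already established, so ConnectIfNot does not issue a non-blocking connect). After one write it enters IsWriteComplete to check whether the write is complete. The write counts as incomplete if WriteRequest 1 still has unwritten data, or if WriteRequest 1 is fully written but other bthreads have added pending data to the list. In this example IsWriteComplete therefore returns false.

3. The pthread running that bthread enters IsWriteComplete (assume WriteRequest 1 has not been fully written). IsWriteComplete finds that WriteRequest 1 still has unwritten data and that _write_head no longer points to WriteRequest 1 but to the newly arrived WriteRequest 3. To make sure data reaches the fd in arrival order, it reverses the singly linked list shown in Figure 1. (The _write_head.compare_exchange_strong in the code operates on the latest _write_head. After this atomic operation, other bthreads can still add pending data to _write_head, so _write_head can change again. Any WriteRequest that arrives after the reversal is ignored for now and handled in a later round. Here we assume no bthread adds new pending data.) The memory layout at this point is:
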

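4. Because IsWriteComplete returned false, there is still pending data to write, but the bthread that holds write permission must not carry on with it. There is no guarantee that the remaining data can be written in one go, and that bthread cannot wait for the kernel send buffer of the fd to have room: doing so would stall the whole pthread it runs on, the other bthreads in the run queue of that pthread's private TaskGroup would not be executed in time, and the design would no longer be wait-free. So the bthread creates a new KeepWrite bthread dedicated to sending the remaining data and returns immediately (its job is done at this point; it is suspended and yields the CPU).

5. From time T3 on, the KeepWrite bthread is scheduled, is run by some pthread, and starts writing the remaining data. Suppose that after one write to the fd, the data of WriteRequest 1 and WriteRequest 2 is fully written (their memory is reclaimed right away), WriteRequest 3 is partly written, and meanwhile two other bthreads have added new pending data, WriteRequest 4 and WriteRequest 5, to the _write_head list. The memory layout at this point is:
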

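6. The KeepWrite bthread calls IsWriteComplete to check whether the previously reversed list has been fully written. Inside IsWriteComplete, the atomic _write_head.compare_exchange_strong detects the newly added WriteRequest 4 and WriteRequest 5, and the list WriteRequest 5->4->3 is reversed. Suppose that right after _write_head.compare_exchange_strong executes, other bthreads add WriteRequest 6 and WriteRequest 7 to the _write_head list. Once _write_head.compare_exchange_strong has been called, WriteRequest 6 and 7 are ignored for the time being. They are discovered in the next call to IsWriteComplete, when _write_head.compare_exchange_strong finds that the latest _write_head is not equal to the tail node of the list reversed so far. The memory layout at this point is:
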

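7. After IsWriteComplete returns to the KeepWrite bthread, cur_tail points to the tail node of the latest reversed list, that is, WriteRequest 5. The KeepWrite bthread then writes as much data as possible to the fd once more. After that write it calls IsWriteComplete again to check whether the reversed list has been fully written and whether new pending data has been added to the _write_head list. It finds WriteRequest 6 and WriteRequest 7, appends them to the reversed list, and the cycle repeats. The KeepWrite bthread ends only when IsWriteComplete determines that all pending data in the reversed list has been written to the fd and the latest _write_head points to the tail node of the reversed list. If other bthreads try to write to the fd right after that, the process starts over from step 1. So a TCP connection never has more than one KeepWrite bthread working on it at the same time.

8. When the KeepWrite bthread writes to the fd, the fd is normally non-blocking. If the kernel send buffer of the fd is full, calling ::write (or ::writev) on the fd returns -1 with errno set to EAGAIN. The KeepWrite bthread must not actively wait for the fd to become writable. It has to be suspended and yield the CPU so that its pthread can go on to schedule the next bthread in the task queue of its private TaskGroup. Otherwise the pthread would stall and delay the other bthreads in that task queue, which violates the wait-free design. When the kernel send buffer has room again, epoll reports a writable event for the fd, the bthread running epoll wakes up the KeepWrite bthread, and the KeepWrite bthread's id is put back into the task queue of some TaskGroup, so it gets scheduled again and continues writing to the fd.

--------------------------------------------------------------------------------
/docs/linkedlist.md:
--------------------------------------------------------------------------------
[Principle and implementation of intrusive linked lists](#principle-and-implementation-of-intrusive-linked-lists)

[Applications of intrusive linked lists in brpc](#侵入式链表在brpc中的应用)

## Principle and implementation of intrusive linked lists
An intrusive linked list differs from an ordinary linked list in that the next and prev pointers, which maintain the links between nodes, are factored out into a base class. The base class implements operations such as inserting and removing nodes. A business data structure only has to inherit from this base class to become a list node, and does not need to implement insertion or removal itself.

The intrusive linked list code of brpc lives in src/butil/containers/linked_list.h. The next and prev pointers and the node insertion and removal operations are all defined in the LinkNode class template, and the code is simple:

```c++
template <typename T>
class LinkNode {
public:
    // LinkNode are self-referential as default.
    LinkNode() : previous_(this), next_(this) {}

    LinkNode(LinkNode<T>* previous, LinkNode<T>* next)
        : previous_(previous), next_(next) {}

    // Insert |this| into the linked list, before |e|.
    void InsertBefore(LinkNode<T>* e) {
        // ......
    }

    // Insert |this| as a circular linked list into the linked list, before |e|.
    void InsertBeforeAsList(LinkNode<T>* e) {
        // ......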
    }

    // Insert |this| into the linked list, after |e|.
    void InsertAfter(LinkNode<T>* e) {
        // ......
    }

    // Insert |this| as a circular list into the linked list, after |e|.
    void InsertAfterAsList(LinkNode<T>* e) {
        // ......
    }

    // Remove |this| from the linked list.
    void RemoveFromList() {
        // ......
    }

    LinkNode<T>* previous() const {
        return previous_;
    }

    LinkNode<T>* next() const {
        return next_;
    }

private:
    LinkNode<T>* previous_;
    LinkNode<T>* next_;
};
```

In business code, to build an intrusive linked list of elements of type A, A only has to inherit from the LinkNode base class:

```c++
class A : public LinkNode<A> {
    // Member variables of A go here.
};
```

The memory layout of A's intrusive linked list is shown in the figure below:

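To make the inheritance pattern concrete, here is a minimal, self-contained sketch of an intrusive circular doubly linked list. It is not the brpc source: the bodies of InsertBefore and RemoveFromList are one plausible way to fill in the elided `// ......` parts, and the Task type, the value() helper, and the CollectIds function are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for a LinkNode-style base class (assumption: this is an
// illustrative sketch, not the code in src/butil/containers/linked_list.h).
template <typename T>
class LinkNode {
public:
    // A fresh node points at itself, i.e. it is a one-element circular list.
    LinkNode() : previous_(this), next_(this) {}

    // Insert |this| right before |e|.
    void InsertBefore(LinkNode<T>* e) {
        next_ = e;
        previous_ = e->previous_;
        e->previous_->next_ = this;
        e->previous_ = this;
    }

    // Unlink |this| from its neighbours and make it self-referential again.
    void RemoveFromList() {
        previous_->next_ = next_;
        next_->previous_ = previous_;
        previous_ = this;
        next_ = this;
    }

    LinkNode<T>* previous() const { return previous_; }
    LinkNode<T>* next() const { return next_; }

    // Recover the business object. Only valid for nodes that really are a T,
    // so it must not be called on a bare sentinel node.
    T* value() { return static_cast<T*>(this); }

private:
    LinkNode<T>* previous_;
    LinkNode<T>* next_;
};

// Hypothetical business type: the link pointers live inside the object itself,
// so the list never allocates separate node objects.
struct Task : public LinkNode<Task> {
    explicit Task(int id) : id(id) {}
    int id;
};

// Walk the circular list starting after the sentinel |head| and collect the
// ids in list order. The sentinel itself is skipped.
std::vector<int> CollectIds(LinkNode<Task>* head) {
    std::vector<int> ids;
    for (LinkNode<Task>* n = head->next(); n != head; n = n->next()) {
        ids.push_back(n->value()->id);
    }
    return ids;
}
```

In this sketch a bare LinkNode<Task> serves as the sentinel head: inserting each Task before the head appends it to the end of the list, and removing a Task touches only the pointers stored in its neighbours and in the Task itself.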