├── actor-systems
│   └── README.md
├── async-mutexes
│   ├── README.md
│   ├── fairness.svg
│   ├── laziness1.svg
│   ├── laziness2.svg
│   ├── laziness3.svg
│   └── thundering_herd.svg
└── message-queues
    ├── part1
    │   └── README.md
    ├── part2
    │   └── README.md
    └── part3
        ├── README.md
        ├── agentdb-arch.svg
        └── agentdb-client.svg

/actor-systems/README.md:
--------------------------------------------------------------------------------

# Building an async-compatible actor system

I [previously wrote](../async-mutexes/README.md) about the problems with implementing an async-aware `Mutex` type in Rust, and showed how treating a mutex as a task to be spawned can solve many of those problems.

This time, I want to talk about my experiences building an actor system out of that primitive mutex concept.

## What is an actor?

Real actor systems encompass a huge amount of functionality, but there are certain properties of an actor that are fundamental, and common to all actor systems.

An actor:

- Has state.
- Receives messages.
- May modify its state in response to a message.
- May send messages to other actors.
- Has exclusive access to its own state, ie. no other code can directly access that state.
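In code, that minimal core is just some private state plus a loop over a channel of messages. Here's a sketch (using the `futures` crate; the `Msg` enum and `counter_actor` function are invented for illustration):

```rust
use futures::channel::mpsc;
use futures::StreamExt;

enum Msg {
    Increment,
    Report,
}

// The actor: private state plus a loop that processes one message at a
// time. Nothing outside this function can touch `count` directly.
async fn counter_actor(mut rx: mpsc::UnboundedReceiver<Msg>) {
    let mut count = 0u32;
    while let Some(msg) = rx.next().await {
        match msg {
            Msg::Increment => count += 1,
            Msg::Report => println!("count = {}", count),
        }
    }
}
```

Everything that follows is about making this basic pattern ergonomic and composable.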
## Why are actors useful?

Again, I'm going to limit this to the absolute fundamentals of what an actor system can do: with real actor systems there are many additional advantages that may be specific to that system.

- As an alternative to a mutex, an actor can provide higher throughput.

  When a mutex is unlocked, the next thread must be woken up, a context switch must occur, and then the next thread must take the lock. During that time, no useful work can be done on the data protected by the mutex. As the amount of contention for the mutex increases, the overhead of this synchronization also increases, reducing throughput further.

  In contrast, with an actor, the synchronization cost does not eat into the actor's productive time: the actor has exclusive access to its own state at all times, and will process messages as fast as it can. With message passing, the synchronization cost is both reduced, and largely paid by the message sender instead of the receiver. Furthermore, the maximum throughput of the actor stays constant: it is not affected by contention in the way a mutex is.

- Since all access to an actor's state goes through the actor, it can be easier to reason about than code using mutexes. Even if multiple threads are sending messages to the actor, the actor itself processes messages one at a time.

  In this way, an actor can be easily modelled as a state machine. Alternatively, a protocol may be specified via state-machines, and this protocol specification can be almost mechanically transformed into the logic for one or more actors.

  When the code implementing a specification so closely matches the specification itself, it's much easier to verify that the implementation is correct.

  (See also, [CSP](https://en.wikipedia.org/wiki/Communicating_sequential_processes))

- When writing code with a lot of concurrency, using actors can dramatically simplify the code.

  Actors are a tool that a programmer can use to turn a huge concurrent problem into a large number of single-threaded problems.

  They enable individual parts of the system to be changed without having to re-evaluate the big picture.

  In contrast, attempting to solve complex concurrency problems using only primitive tools can result in monolithic programs where minor changes can have far-reaching repercussions.

## What's wrong with existing actor systems for Rust?

Now that you're convinced of why actor systems are a good idea, it's probably worth taking a look at the existing options.

### [Actix](https://crates.io/crates/actix)

Actix is the most well-known actor system for Rust, and one of the more mature. It is very fast, and a lot of effort has been expended in optimizing it. The area where I find it most lacking is in ergonomics.

For me at least, the most compelling reason to use an actor system is to make it easier to solve problems. Yet there are several design choices made by Actix which, in my opinion, work against this goal:

1. Actix has its own `ActorFuture` trait which is not compatible with Rust's built-in `Future` trait.

   The difference between these two traits is in the `poll` method:

   `Future`:

   ```rust
   fn poll(
       self: Pin<&mut Self>,
       context: &mut Context<'_>
   ) -> Poll<Self::Output>
   ```

   `ActorFuture`:

   ```rust
   fn poll(
       self: Pin<&mut Self>,
       state: &mut Self::Actor,
       actor_ctx: &mut <Self::Actor as Actor>::Context,
       context: &mut Context<'_>
   ) -> Poll<Self::Output>
   ```

   This extra `state` parameter is how an actor is able to modify its own state when responding to a message. Likewise, the `actor_ctx` parameter provides things like the actor's own address, for use when sending out messages to other actors.

   You can adapt a `Future` for an API expecting an `ActorFuture`, but not vice-versa.

2. As a result of having an incompatible `Future` trait, actix is also incompatible with Rust's async/await syntax. (Unless you use an interop crate like [`actix-interop`](https://crates.io/crates/actix-interop) </shameless-plug>)

3. Actix allows multiple asynchronous message handlers to run concurrently. Because `state` is only borrowed for the poll method, actix is able to interleave calls to `poll` to multiple futures executing on the actor.

   I can see uses for this, but making it the default can lead to surprising results, as it breaks the assumption that a given message handler has exclusive access to the actor's state for the duration of its execution.

   Luckily actix _does_ have a way to run a handler with exclusive access, but it must be opted into, and comes with its own ergonomic cost. Even then, you cannot borrow the actor's state across yield points, so methods with the following signatures cannot be called on the actor's state:

   ```rust
   // `.await`ing these methods would result in `self` being borrowed
   // across a yield point, which is not allowed.
   async fn ref_receiver(&self, ...);
   async fn mut_receiver(&mut self, ...);
   ```

   As you can imagine, this is extremely limiting, and leads to some very ugly work-arounds.

4. So much boilerplate!

   - You must define a type for each message.
   - You must implement the `Message` trait for each message.
   - You must define a "response" type for each message.
   - You must implement the `Handler` trait for each message an actor can handle.
   - You must implement the `Actor` trait for each actor.

   Each piece makes sense, but in total there's a huge barrier to even getting off the ground, and it detracts from thinking about the problem you're actually trying to solve.
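To make that cost concrete, here's roughly what a single fire-and-forget message involves (a hedged sketch against the actix API of that era; the `FixProblem` and `SupportActor` names are invented):

```rust
use actix::prelude::*;

// A type for the message, plus a `Message` impl declaring its response type.
struct FixProblem;
impl Message for FixProblem {
    type Result = ();
}

// An `Actor` impl for the actor type.
struct SupportActor {
    problems_fixed: u32,
}
impl Actor for SupportActor {
    type Context = Context<Self>;
}

// A `Handler` impl for every message the actor can handle.
impl Handler<FixProblem> for SupportActor {
    type Result = ();
    fn handle(&mut self, _msg: FixProblem, _ctx: &mut Context<Self>) {
        self.problems_fixed += 1;
    }
}
```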
### [Riker](https://crates.io/crates/riker)

Riker is a little more opinionated than `actix`, which allows it to avoid some of the boilerplate (for example, the `Handler` and `Actor` traits are unified). It also seems to be quite mature, with support for many more advanced actor system concepts.

However, it does not appear to be fully compatible with futures: whilst it is possible to spawn futures onto the system, message handlers are required to be synchronous themselves. This may be a good trade-off for many applications, but it rules it out for me as a truly async-compatible actor framework.

### [Acteur](https://crates.io/crates/acteur)

Of the actor systems I've reviewed, `acteur` is probably the closest to meeting all my requirements for an actor system. It is fully async/await compatible, and I only have a few criticisms:

- It doesn't seem to provide a way to run futures "in the background" on an actor: this is useful for tasks that do not require access to the actor's state, but should stop when the actor is stopped.

  One example of this would be sending a message to the actor after a certain amount of time has passed. You would still want the actor to be able to handle other messages during that time interval.

- It's locked into the `async-std` runtime.

- There's still a decent amount of boilerplate necessary to get an actor up and running, and certain concepts like activation and deactivation are baked in, which many applications will not care about.

### Others

There are many more actor systems in Rust, forks of actor systems I've already mentioned, etc. but I haven't had time to review all of them in detail.

## What _could_ an ergonomic Rust actor system look like?

I want:

1. It to be fully async/await compatible. Mashing together future combinators is so last year!

2. To be able to call async methods that borrow the actor state.

3. Each message handler to run to completion before moving onto the next message (at least by default).

4. To be able to run "background" futures on an actor that don't need access to its state.

5. To have a zero-boilerplate way to define an actor. I don't want a separate trait implementation for every single message I can handle.

6. To support static and dynamic polymorphism using normal traits.

7. To make no — or at least minimal — use of unsafe code.

8. To achieve all of the above with the least amount of "magic" possible. Actor code should still be normal Rust as far as the user is concerned.
Putting all that together, we end up with something like this:

```rust
#[derive(Default)]
struct AutomaticSupportActor {
    // The actor state
    computer: Computer,
    problems_fixed: u32,
}

impl Actor for AutomaticSupportActor {}

#[async_trait]
trait SupportTechnician: Actor {
    async fn fix_problem(&mut self);
}

#[async_trait]
impl SupportTechnician for AutomaticSupportActor {
    async fn fix_problem(&mut self) {
        self.computer.turn_off().await;
        self.computer.turn_on().await;
        self.problems_fixed += 1;
    }
}

fn example() {
    let x: Addr<AutomaticSupportActor> = spawn(AutomaticSupportActor::default());

    // This must be a macro, because we want the method call to
    // happen within the execution context of the actor.
    send!(x.fix_problem());
}
```

We lean heavily on standard tools such as the [async-trait](https://crates.io/crates/async-trait) crate, which should be familiar to anyone who's done much async programming in Rust already.

## How do we actually implement that?

This is the hard part! Spoiler: to get this to run on stable Rust, we will need to make a few minor compromises.

Looking at the code above, we need at least 4 things:

1. A way to spawn an actor (`spawn`).
2. A type to store the address of an actor or trait object (`Addr`).
3. A macro to call a method on an actor.
4. A way to convert the address of a concrete actor type to a trait object (upcast it).

In addition, although not necessary in the above code, we'll want:

1. A way for an actor to get its own address.
2. A way for the caller to wait on a response from an actor.
3. A way for an actor to spawn background futures.
4. A flexible way to manage the lifetime of an actor.

## Implementing `send!`

We need an ergonomic way to send functions to be executed.

The first hurdle is that the type signature of async methods which borrow one of their arguments is complex:

```rust
for<'a, T> FnOnce(&'a mut T) -> impl Future<'a, T>
```

To make things worse, it is not currently possible to write a closure with this exact signature due to language limitations.

I tried a number of ways to work around this, but in the end the only option which did not come with too many downsides was to use a macro, and to box things whenever the types got too out of hand.

The macro expands to something like this:

```rust
addr.send_mut(Box::new(move |x| {
    (async move {
        x.method(arg1, arg2).await
    }).boxed()
}));
```

By boxing the return type of the closure, we transform the signature to:

```rust
for<'a, T> FnOnce(&'a mut T) -> BoxFuture<'a, T>
```

This signature _can_ be achieved with the closure above.

In total we require two allocations: one to erase the type of the closure to be executed on the actor, and a second to erase the type of the future it returns.

## Spawning an actor

We want to implement a task that pulls an "action" from a channel, and then polls it to completion before pulling the next "action".
We'll define an "action" as a function that operates on the actor state and returns a future. Keep in mind: the future it returns will still be borrowing the actor state, so we need to be careful around lifetimes.

The future will output a `bool` indicating whether the actor should stop.

```rust
type MutItem<T> = Box<dyn for<'a> FnOnce(&'a mut T) -> BoxFuture<'a, bool> + Send>;

async fn actor_task<T>(
    mut value: T,
    mut mut_channel: mpsc::UnboundedReceiver<MutItem<T>>,
) {
    loop {
        // Obtain an item
        let current_item = if let Some(item) = mut_channel.next().await {
            item
        } else {
            return
        };

        // Call the function, and then poll the future it returns to completion
        let current_future = current_item(&mut value);
        if current_future.await {
            return;
        }
    }
}
```

To spawn an actor, we just spawn this async task, passing in the actor state.

However, we'll also want the actor to be able to execute background futures which don't require access to the actor state. We can achieve this using the `select_biased!` macro and the `FuturesUnordered` type:

```rust
type MutItem<T> = Box<dyn for<'a> FnOnce(&'a mut T) -> BoxFuture<'a, bool> + Send>;
type FutItem = BoxFuture<'static, ()>;

async fn actor_task<T>(
    mut value: T,
    mut mut_channel: mpsc::UnboundedReceiver<MutItem<T>>,
    mut fut_channel: mpsc::UnboundedReceiver<FutItem>,
) {
    let mut futs = FuturesUnordered::new();
    loop {
        // Obtain an item
        let current_item = loop {
            if select_biased! {
                _ = futs.select_next_some() => false,
                item = mut_channel.next() => if let Some(item) = item {
                    break item
                } else {
                    true
                },
                item = fut_channel.select_next_some() => {
                    futs.push(item);
                    false
                },
                complete => true,
            } {
                return;
            }
        };

        // Wait for the current item to run
        let mut current_future = current_item(&mut value).fuse();
        loop {
            select_biased! {
                done = current_future => if done {
                    return;
                } else {
                    break
                },
                _ = futs.select_next_some() => {},
                item = fut_channel.select_next_some() => futs.push(item),
            }
        }
    }
}
```

We use `select_biased!` rather than `select!` as we don't need to be fair: in fact, if anything, we'd prefer to give priority to the `current_future` which mutably borrows the actor state, and that's exactly what we do.

## Storing the address of an actor

The actor address just needs to store the two channels used for sending items to the actor for execution.

```rust
struct AddrInner<T> {
    mut_channel: mpsc::UnboundedSender<MutItem<T>>,
    fut_channel: mpsc::UnboundedSender<FutItem>,
}
```

However, I wanted to allow actors to automatically stop when there are no more references, and for that we need to use reference counting.

This is simple enough: we just wrap our `AddrInner` in an `Arc` or `Weak` depending on if we want to keep the actor alive or not.

```rust
struct Addr<T> {
    inner: Arc<AddrInner<T>>,
}
```

## Supporting trait objects

This is perhaps the hardest requirement: `CoerceUnsized` and friends are all unstable so we can't make this conversion be completely implicit, and still support stable Rust.
I found the easiest way was to define a macro, `upcast!(...)` to do this conversion. It expands to a simple method call `.upcast(|x| x as _)` and relies on type inference to work correctly.

The second part is even harder: how do we make an `Addr<AutomaticSupportActor>` also work as an `Addr<dyn SupportTechnician>`? We can't use specialization or anything like that, and the Rust type system starts to feel very limited...

To solve this, I essentially ended up building a manual vtable:

```rust
pub struct Addr<T: ?Sized> {
    inner: Option<Arc<dyn Any + Send + Sync>>,
    send_mut: &'static (dyn Fn(&Arc<dyn Any + Send + Sync>, MutItem<T>) + Send + Sync),
    send_fut: &'static (dyn Fn(&Arc<dyn Any + Send + Sync>, FutItem) + Send + Sync),
}
```

We erase the concrete type of the actor on construction via the `Arc<dyn Any + Send + Sync>`, and we store the address of some static methods to send items and futures to the actor.

When we `upcast!` an actor, we leave the `inner` value unchanged, but re-assign `send_mut` to a different "glue" function that performs the required conversion of items sent to the actor.

A couple of things to note: this is all possible to implement using zero unsafe code, and only a single additional allocation is required, which happens when sending a message to an upcast `Addr` in order for the "glue code" to be injected.

In practice, a single line of `unsafe` code _is_ used for performance reasons, but it will be possible to remove this if [my PR to Rust](https://github.com/rust-lang/rust/pull/77688) is merged.

## Actor trait

We want to allow an actor to access its own address, and also to implement some standardized error handling. To that end, we add two methods to the `Actor` trait:

```rust
#[async_trait]
pub trait Actor: Send + 'static {
    /// Called automatically when an actor is started. Actors can use this
    /// to store their own address for future use.
    async fn started(&mut self, _addr: Addr<Self>) -> ActorResult<()>
    where
        Self: Sized,
    {
        Ok(())
    }

    /// Called when any actor method returns an error. If this method
    /// returns `true`, the actor will stop.
    /// The default implementation logs the error using the `log` crate
    /// and then stops the actor.
    async fn error(&mut self, error: ActorError) -> bool {
        error!("{}", error);
        true
    }
}
```

We avoid boilerplate by providing reasonable default implementations.

## Returning values

We can allow callers to access the return value of actor methods by adding a `call!` macro to complement our `send!` macro. It will be implemented the same way, except that it will create a `oneshot::channel` to send back the return value from the actor method.

The caller can `await` this channel to get their result.
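For illustration, here's roughly how `call!(addr.method(arg))` might expand. This is a hedged sketch mirroring the `send!` expansion above, not the exact output of the real macro:

```rust
let (tx, rx) = oneshot::channel();
addr.send_mut(Box::new(move |x| {
    (async move {
        // Send the result back to the caller; if the caller has gone
        // away, we just drop the value.
        let _ = tx.send(x.method(arg).await);
        false // `false` here means "don't stop the actor"
    }).boxed()
}));
// The caller is handed `rx`, and can `.await` it for the result.
```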
## All the trimmings

What we've got so far is completely executor agnostic, but is pretty much the bare minimum to be considered an actor system.

To make life easier for users I have added a couple of standard integrations behind feature flags, for the two main async runtimes:

- `tokio`
- `async-std`

These modules provide runtime-specific `spawn_actor` functions, and abstract across timer implementations in the two runtimes.

I also built a generic `timer` module which uses these abstractions to allow actors to easily set and clear timers. Timers in an actor system come with their own small set of challenges: if a timer works by firing messages at the actor, it is not possible to reliably "cancel" a timer, as the message may have already been sent.

The `timer` module solves this by allowing spurious timer messages, and requiring the user to check that the timer has actually elapsed before responding to a `tick` event.
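One way to implement that check (a sketch of the general idea, not act-zero's actual `timer` API) is to give each armed timer an instance number, so stale `tick` messages can be recognized and ignored:

```rust
use std::time::{Duration, Instant};

// Invented for illustration: each call to `set` or `clear` bumps the
// instance number, invalidating any `tick` messages already in flight.
struct Timer {
    instance: u64,
    deadline: Option<Instant>,
}

impl Timer {
    // Arm the timer, returning the instance number to embed in the
    // `tick` message that will eventually be delivered to the actor.
    fn set(&mut self, after: Duration) -> u64 {
        self.instance += 1;
        self.deadline = Some(Instant::now() + after);
        self.instance
    }

    // "Cancelling" can't recall a message that was already sent, but it
    // can ensure that message will be ignored on arrival.
    fn clear(&mut self) {
        self.instance += 1;
        self.deadline = None;
    }

    // Called by the actor when a `tick(instance)` message arrives.
    fn tick_is_current(&self, instance: u64) -> bool {
        self.instance == instance
            && self.deadline.map_or(false, |d| Instant::now() >= d)
    }
}
```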
## That's it!

This covers the core of how `act-zero 0.3` is implemented.

This is the second major iteration of the design, and I'm really happy with how it came out. You can see the real code and some example uses of the crate in the [github repo](https://github.com/Diggsey/act-zero).

In order to test the design of `act-zero` I've been building an implementation of the `raft` distributed consensus algorithm, `raft-zero`. This has helped me stay focused on building an actor system that is ergonomic, and designed to solve problems rather than create them!

--------------------------------------------------------------------------------
/async-mutexes/README.md:
--------------------------------------------------------------------------------

# Async Mutexes

As Rust's `async` ecosystem grows, there's an increasing need for a set of concurrency primitives, similar to those found in `std::sync`, but suitable for use in async code.

In this post I'm going to be looking at a single concurrency primitive: [the mutex]. This deceptively simple primitive turns out to be a nightmare to translate to the async world, and it reveals some of the darker corners of async programming.

[the mutex]: https://en.wikipedia.org/wiki/Mutual_exclusion

## Why not just use `std::sync::Mutex`?

The standard `Mutex` is definitely usable from async code, but it comes with an important constraint: you must not hold the mutex across a yield point.

If you yield from a piece of async code whilst holding a lock, the executor will poll another task on that same thread. If this new task also tries to lock the mutex, then neither task will be able to progress: the whole thread is blocked, so the original task will have no opportunity to continue and release the lock.

## Avoiding complexity

For these code examples, I'm going to simplify things as much as possible:

- The mutex will not protect any data. This avoids some generics which aren't relevant for the locking mechanism itself.

- There will be no `MutexGuard` for RAII-style access, just `lock` and `unlock` methods, and I will not try to enforce that they are used correctly.

- No low-level optimisations have been applied. I am only considering performance of the algorithm itself, not paying attention to implementation details.

- I will ignore mutex poisoning.

All of the example code listed here compiles in the [Rust playground]!

[Rust playground]: https://play.rust-lang.org/

## The care-free implementation

Given those caveats, let's have our first attempt at writing an async mutex.

We'll store a simple boolean value for whether the mutex is locked, and we'll keep a list of tasks waiting on the mutex so we know who to wake up when the mutex is unlocked.

We'll protect this state using a standard mutex, but this is OK because we never yield whilst it is locked.

```rust
use std::collections::HashMap;
use std::task::{Poll, Waker, Context};
use std::sync::{Arc, Mutex};
use std::future::Future;
use std::pin::Pin;

// This is our async mutex type. We use a standard mutex to protect
// our inner state.
#[derive(Default)]
pub struct AsyncMutex {
    inner: Arc<Mutex<Inner>>
}

#[derive(Default)]
struct Inner {
    // Basic mutex state
    locked: bool,
    waiting_tasks: HashMap<LockId, Waker>,

    // This is just a convenient way to generate unique IDs to
    // identify waiting tasks.
    next_id: LockId,
}

type LockId = usize;

impl AsyncMutex {
    // This begins the lock operation: polling the returned future
    // will attempt to take the lock.
    pub fn lock(&self) -> LockFuture {
        let mut inner = self.inner.lock().unwrap();

        // Generate a unique ID for this lock operation
        // We won't worry about integer overflow for this example.
        let id = inner.next_id;
        inner.next_id += 1;

        LockFuture { inner: self.inner.clone(), id, done: false }
    }

    // Unlocking can be done synchronously.
    pub fn unlock(&self) {
        let mut inner = self.inner.lock().unwrap();

        assert!(inner.locked);
        inner.locked = false;

        // Wake up all the tasks waiting on this mutex
        for (_, waker) in inner.waiting_tasks.drain() {
            waker.wake();
        }
    }
}

// This future completes when we have taken the lock.
pub struct LockFuture {
    id: LockId,
    inner: Arc<Mutex<Inner>>,
    done: bool,
}

impl Future for LockFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, ctx: &mut Context) -> Poll<()> {
        let this = self.get_mut();

        assert!(!this.done, "Future polled after completion");

        let mut inner = this.inner.lock().unwrap();

        if !inner.locked {
            // Nobody has the lock, so take it for ourselves
            inner.locked = true;
            inner.waiting_tasks.remove(&this.id);
            this.done = true;

            Poll::Ready(())
        } else {
            // Someone else has the lock, so queue ourselves to be woken up.
            // We use a fixed ID, so if we already queued ourselves we'll
            // overwrite that entry with the new waker.
            inner.waiting_tasks.insert(this.id, ctx.waker().clone());
            Poll::Pending
        }
    }
}
```

## The "[thundering herd]" problem

The first issue with this mutex is that when it's unlocked, we wake up *all* of the tasks waiting on the mutex. If there is a lot of contention for the mutex, then this is very inefficient, because only one of the tasks we wake up will actually be able to take the lock.
![](thundering_herd.svg)

We could try to fix it by changing our `unlock()` implementation to wake a single task:

```rust
pub fn unlock(&self) {
    let mut inner = self.inner.lock().unwrap();

    assert!(inner.locked);
    inner.locked = false;

    // Wake up a single task waiting on this mutex
    if let Some(&id) = inner.waiting_tasks.keys().next() {
        let waker = inner.waiting_tasks.remove(&id).unwrap();
        waker.wake();
    }
}
```

However, this leads to a new problem: what if the task we woke up doesn't exist anymore? It will never try to take the lock, and the other tasks will be stuck waiting for a mutex indefinitely.

We can try to remedy this too:

```rust
impl Drop for LockFuture {
    fn drop(&mut self) {
        if !self.done {
            let mut inner = self.inner.lock().unwrap();

            // Wake up a single task waiting on this mutex
            if let Some(&id) = inner.waiting_tasks.keys().next() {
                let waker = inner.waiting_tasks.remove(&id).unwrap();
                waker.wake();
            }
        }
    }
}
```

This way, if a task was cancelled whilst trying to take a lock, we'll ensure another task is woken up. Now everything should be fine... right? Not quite, but let's look at some other problems first!

[thundering herd]: https://en.wikipedia.org/wiki/Thundering_herd_problem

## The fairness problem

When building a mutex, we want it to be fair: in other words, even if there's a lot of contention, every task should get its chance to take the mutex. We don't want any tasks to be starved.

Unfairness is not always a deal-breaker. However, our mutex is *very* unfair. Firstly, the order of items in a `HashMap` is not random: if a lock operation is unlucky enough to get an ID which maps to a slot near the end of the `HashMap`, the task performing that operation will have to wait a long time to get woken up.

We can try to fix this by switching from a `HashMap` to a queue:

```diff
- use std::collections::HashMap;
+ use std::collections::BTreeMap;
...
- waiting_tasks: HashMap<LockId, Waker>,
+ waiting_tasks: BTreeMap<LockId, Waker>,
```

A useful side effect of our sequential ID generation is that a `BTreeMap` will behave like a queue, so we don't have to change anything else.

---

At this point, we've reached something functionally equivalent to the async mutex provided by the `futures` crate: [`futures::lock::Mutex`], albeit nowhere near as performant.

---

However, there is a second source of unfairness: consider a task which polls several futures before polling the lock future:

```rust
future1.poll(ctx);
future2.poll(ctx);
...
futureN.poll(ctx);
lock_future.poll(ctx);
```

Every time it wakes up, it polls these other futures first. During that period another thread can sneak in and take the mutex. This is because `lock()` always succeeds immediately if the mutex is free: we don't check whether another task has been waiting for ages already.
![](fairness.svg)

Solving this problem is a little more involved: when we release a mutex, we should immediately lock it on behalf of the next waiting task so that nobody else can steal it in the meantime.

```diff
- locked: bool,
+ locked_by: Option<LockId>,
```

```rust
impl Future for LockFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, ctx: &mut Context) -> Poll<()> {
        let this = self.get_mut();

        assert!(!this.done, "Future polled after completion");
        let mut inner = this.inner.lock().unwrap();

        match inner.locked_by {
            Some(id) if id != this.id => {
                // Someone else has the lock, so queue ourselves to be woken up
                inner.waiting_tasks.insert(this.id, ctx.waker().clone());
                Poll::Pending
            },
            _ => {
                // Either nobody has the lock, or we were given the lock by
                // another task, so return success.
                inner.locked_by = Some(this.id);
                inner.waiting_tasks.remove(&this.id);
                this.done = true;

                Poll::Ready(())
            }
        }
    }
}

impl Drop for LockFuture {
    fn drop(&mut self) {
        if !self.done {
            let mut inner = self.inner.lock().unwrap();

            // If we were given the lock by another task,
            // make sure to unlock it.
            if inner.locked_by == Some(self.id) {
                inner.unlock_inner();
            }
        }
    }
}

impl AsyncMutex {
    pub fn unlock(&self) {
        let mut inner = self.inner.lock().unwrap();
        inner.unlock_inner();
    }
}

impl Inner {
    fn unlock_inner(&mut self) {
        assert!(self.locked_by.is_some());

        // Wake up a single task waiting on this mutex
        if let Some(&id) = self.waiting_tasks.keys().next() {
            self.locked_by = Some(id);
            let waker = self.waiting_tasks.remove(&id).unwrap();
            waker.wake();
        } else {
            self.locked_by = None;
        }
    }
}
```

---

At this point, we've reached something functionally equivalent to the async mutex provided by the `tokio` crate: [`tokio::sync::Mutex`], albeit nowhere near as performant.

---

[`futures::lock::Mutex`]: https://docs.rs/futures/0.3.4/futures/lock/struct.Mutex.html
[`tokio::sync::Mutex`]: https://docs.rs/tokio/0.2.20/tokio/sync/struct.Mutex.html

## The laziness problem

If we were writing our mutex for use with Javascript's async model, everything would be working fine! However, we're not, and everything is not fine... It's time to look at the problem I hinted at earlier on!

A Javascript promise does not need to be "spawned" - it exists by itself and will run to completion without any prompting. However, a Rust future must be explicitly driven to completion, by composing it into a task and then spawning that task onto an executor.

So far, we've been assuming that a future will, after its task is woken up, either be polled or dropped promptly. However, the async model does not guarantee this. It is perfectly valid for a task to poll a future, stop polling it for a period of time, and then start polling it again.

This causes two obvious problems:

1) A task may stop polling the future which took the lock before it has a chance to release it. This affects all versions of our mutex we've built so far.
![](laziness1.svg)

2) A task may stop polling the future whilst it is in the process of taking the lock. This affects all the fair versions of our mutex.

![](laziness2.svg)

In both cases, the result is that nobody else will be able to access the mutex until this future is polled or dropped.

There is also a third problem which is both more subtle and much nastier:

3) A task may own two futures which both try to lock the mutex. If the order in which these futures are polled can change, and the task returns as soon as the first future returns `Pending`, then the task can deadlock itself.

![](laziness3.svg)

## Understanding laziness and how it affects futures

Why do these problems afflict `LockFuture`, but not other kinds of future? If we are writing a new kind of future, how do we know that it won't have the same issues?

I've found it very difficult to explain precisely what is wrong: every rule seems to have some exceptions. Intuitively the problem seems to be caused by side-effects: polling (or rather, failing to poll) a `LockFuture` interferes with the execution of other futures.

However, there are lots of futures that have side effects: for example, reading from a file will advance the file pointer. There are even examples of futures blocking the completion of others: failing to send on a `channel` will of course block the other side of that channel from progressing, and it's entirely possible to create deadlocks as a result, where the sending side is blocked on a receive from the same channel. You can end up with a similar problem to the mutex if both the "send" and "receive" futures are composed onto the same task.

Ultimately, I think it comes down to expectations: we expect there to be a dependency between two ends of a channel. We don't expect (and can't easily predict) a dependency between two otherwise unrelated pieces of code that happen to access shared state via a mutex. If a set of futures use a mutex internally, we would like to be able to compose those futures without needing to worry about those implementation details.

## Solutions to the laziness problem

### Change the async guarantees

The idea is that we simply require all combinators and tasks to poll or drop futures promptly after they are woken up.

#### Advantages

- Does not require any changes to our mutex.

#### Disadvantages

- It's very difficult to define what "poll promptly" actually means in practice, and impossible to verify it automatically.
- It's a rule that must be followed by all async code, it can't just be encapsulated in a library implementation.
- Does not play well with CPU-bound tasks - it's not clear if CPU-bound tasks would even be compatible with this rule.

### Force code accessing a mutex to be spawned as a task

The problems we've encountered were caused because the futures on their own are lazy, and because we have no control over how they will be composed into a task to be executed. We can solve these problems by forcing code accessing a mutex to be directly spawned as a task (ie. prevent it from being composed with other futures first).
We can simplify things further by observing that only one of these tasks will ever be executing at once, so we could spawn a single task when the mutex is created, and have that task poll each future which needs to access the mutex in sequence.

#### Advantages

- Works nicely with CPU-bound tasks. The mutex task can be scheduled to an executor that specializes in having low latency, regardless of the executor used by the calling code. Executors running CPU-bound tasks will generally have high latency in comparison.
- No rules for end-users to follow.
- Very simple to implement.

#### Disadvantages

- Access to the mutex will require its own scope, not just a RAII guard.
- The mutex needs to be able to spawn tasks onto the executor.

Example implementation:

```rust
use std::future::Future;
use std::pin::Pin;

use futures::channel::{oneshot, mpsc};
use futures::{StreamExt, TryFutureExt};

// An arbitrary future to run whilst the mutex is held
type ItemType = Pin<Box<dyn Future<Output = ()> + Send + 'static>>;

pub struct AsyncMutex {
    tx: mpsc::UnboundedSender<ItemType>,
}

impl AsyncMutex {
    pub fn new() -> Self {
        let (tx, rx) = mpsc::unbounded();
        tokio::spawn(rx.for_each(|x| x));

        Self { tx }
    }
    pub fn lock<F>(&self, f: impl FnOnce() -> F + Send + 'static)
        -> impl Future<Output = F::Output>
    where
        F: Future + Send + 'static,
        F::Output: Send
    {
        // First create a channel to receive the result
        let (tx, rx) = oneshot::channel();

        // Send the function to the mutex task to be executed
        self.tx.unbounded_send(Box::pin(async move {
            // Await the function and send back the result
            let _ = tx.send(f().await);
        })).unwrap();

        // Panic if the other side of the channel was closed
        rx.unwrap_or_else(|_| panic!())
    }
}
```

In practice, the `FnOnce` passed into `lock()` would take a single argument, which would be a mutable reference to the data protected by the mutex:

```rust
use std::future::Future;
use std::pin::Pin;

use futures::channel::{oneshot, mpsc};
use futures::{StreamExt, TryFutureExt};

// An arbitrary mutation to apply to the mutex state
type ItemType<T> = Box<dyn FnOnce(T) -> Pin<Box<dyn Future<Output = T> + Send>> + Send>;

pub struct AsyncMutex<T> {
    tx: mpsc::UnboundedSender<ItemType<T>>,
}

impl<T: Send + 'static> AsyncMutex<T> {
    pub fn new(value: T) -> Self {
        let (tx, rx) = mpsc::unbounded();
        tokio::spawn(rx.fold(value, |value, f: ItemType<T>| f(value)));

        Self { tx }
    }
    pub fn lock<F>(&self, f: impl FnOnce(&mut T) -> F + Send + 'static)
        -> impl Future<Output = F::Output> + '_
    where
        F: Future + Send + 'static,
        F::Output: Send
    {
        // First create a channel to receive the result
        let (tx, rx) = oneshot::channel();

        // Send the function to the mutex task to be executed
        self.tx.unbounded_send(Box::new(|mut value: T| Box::pin(async move {
            // Await the function and send back the result
            let _ = tx.send(f(&mut value).await);
            value
        }))).unwrap();

        // Panic if the other side of the channel was closed
        rx.unwrap_or_else(|_| panic!())
    }
}
```
The solution we've ended up with has a lot of parallels with the actor model: instead of multiple tasks vying for access to a shared resource, we have a single task which owns its own state, and fairly and efficiently applies a sequence of mutations to that state in response to messages from other tasks.

## Conclusion

This is as far as I've got investigating possible mutex solutions. Perhaps you can think of some other ways to address the problems I laid out that come with different trade-offs!

--------------------------------------------------------------------------------
/async-mutexes/fairness.svg:
--------------------------------------------------------------------------------

[SVG figure. Two timelines, "Thread 1" and "Thread 2", with states "Ready", "Holds lock" and "CPU-bound". Annotations: "Task A releases the lock.", "Thread 2 begins executing task B's poll method.", "Task A has already acquired the mutex again by the time Task B tries to access it."]

--------------------------------------------------------------------------------
/async-mutexes/laziness1.svg:
--------------------------------------------------------------------------------

[SVG figure. A task polls futures A and B, each of which goes through "Take Lock", "Do Work", "Release Lock". The future holding the lock stops being polled before it can release the lock.]

--------------------------------------------------------------------------------
/async-mutexes/laziness2.svg:
--------------------------------------------------------------------------------

[SVG figure. As laziness1, but the task stops polling a future whilst it is in the process of taking the lock.]

--------------------------------------------------------------------------------
/async-mutexes/laziness3.svg:
--------------------------------------------------------------------------------

[SVG figure. A single task owns two futures, A and B, which both go through "Take Lock", "Do Work", "Release Lock"; when the polling order changes, the task deadlocks itself.]

--------------------------------------------------------------------------------
/async-mutexes/thundering_herd.svg:
--------------------------------------------------------------------------------

[SVG figure. Four tasks, A to D, with states "Ready" and "Holds lock". Annotations: "Task A releases the lock.", "Task B is polled and acquires the lock.", "Tasks C and D are polled and fail to acquire the lock."]

--------------------------------------------------------------------------------
/message-queues/part1/README.md:
--------------------------------------------------------------------------------

# Part 1. Why *simple* message queues suck

## The lifecycle of a web application

Web applications typically start their life as simple CRUD applications. They receive a request, run some queries against a database and finally return a response. When there are no requests, the database state remains unchanged.

Our deployment might look something like this:

1. SQL database
   - Tables map 1:1 with business concepts.
2. Stateless web server running our code.

This is a great architecture:

- It's simple and well understood.
- Database transactions solve most consistency/atomicity requirements.
- It can scale to large numbers of clients, either through the use of a distributed database, or effective use of caching in read-heavy workloads.
- All of our state is contained in the SQL database, so backup and restore is very easy.

The problems begin when we inevitably need to do something that doesn't fit into this model. We might need to:

- Make a request to a slow external service.
- Run some background work after a request finished.
- Perform a task periodically.
- Run some code reactively (eg. when this row in the database changes, do this other thing).

At this point, the obvious tool to reach for is some kind of message queue. This is where things go downhill fast...
## Adding a message queue

*I'm going to pick on RabbitMQ here, but only because that's what I have most experience with. There are many products in this space that suffer from largely the same issues.*

We add RabbitMQ to our deployment:

1. SQL database
   - Tables map 1:1 with business concepts.
2. Stateless web server running our server code.
3. **RabbitMQ broker.**
4. **RabbitMQ worker running our background tasks.**

We configure our workers to send ACKs on task completion, so that tasks are automatically retried if they fail.

## Message queue persistence

Everything is running smoothly, until one day we get a spike in load. Shortly thereafter, we start getting various bug reports from customers which all seem to relate to background tasks not running.

Upon checking the logs we realize what happened: with the spike in load, the RabbitMQ broker built up a backlog of messages, and ended up being OOM killed. It was automatically restarted by its supervisor, but we lost all the messages it had in memory at the time.

We spend many hours of developer time figuring out how to rerun the lost tasks, but in the end this was our fault for not giving RabbitMQ somewhere to persist messages reliably:

1. SQL database
   - Tables map 1:1 with business concepts.
2. Stateless web server running our server code.
3. RabbitMQ broker.
4. RabbitMQ worker running our background tasks.
5. **Persistent volume for RabbitMQ broker.**

## Backup and restore

We're on our day off, when suddenly we get paged. Our application is returning errors on all requests. It looks like there was a hardware failure, and the database state has become corrupted!

No problem; we have backups for a reason, and we spin up a fresh database instance and restore into it. We even have point-in-time restore, so we can restore to right before the hardware failure.

However, even after restoring, we see some strange errors coming in. Background tasks trying to access records in the database that don't exist. Background tasks that seem to have been lost altogether. Of course! We restored the database state, but we didn't touch the state of the message queue, and now they're out of sync.

We spend many more hours of developer time reconciling this inconsistent state, and begin considering how we might address this issue. Hardware failures are rare, but there are lots of reasons we might need to rely on our backups, and this reconciliation process is unpleasant and error prone...

## Non-transactional transactions

Business is going well, and we're seeing a lot more load as a result. For some reason, our task queue seems to be getting less and less reliable. When the load goes up by a fixed amount, the rate of duplicate or lost tasks seems to double! What used to *never* happen is now a daily occurrence, and a significant amount of developer time is spent trying to reconcile inconsistencies as a result.

After yet more investigation, we find that the duplicate tasks occur when a transaction is rolled back due to a conflict. In that case, we retry, and this results in duplicate tasks being scheduled.

The missing tasks seem to happen when we schedule tasks *after* committing the transaction. It looks like there's a brief window where a network blip or other intermittent error can result in the transaction being committed and the task never being scheduled.

Surely this problem has been solved before?
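Both failure modes fall out of the two possible orderings. A self-contained sketch (the `Db`, `Txn` and `Queue` types here are stand-ins invented for illustration, not a real driver API):

```rust
struct Db;
struct Txn;
struct Queue;

impl Db {
    fn begin(&self) -> Txn {
        Txn
    }
}
impl Txn {
    fn update_order(&self, _id: u64) {}
    fn commit(self) -> Result<(), ()> {
        Ok(())
    }
}
impl Queue {
    fn enqueue(&self, _task: &str) -> Result<(), ()> {
        Ok(())
    }
}

// Ordering 1: enqueue before committing. If the commit conflicts and the
// whole request is retried, the task has already been queued once, so it
// ends up scheduled twice.
fn schedule_before_commit(db: &Db, queue: &Queue) -> Result<(), ()> {
    let txn = db.begin();
    txn.update_order(42);
    queue.enqueue("send_receipt:42")?;
    txn.commit()?;
    Ok(())
}

// Ordering 2: enqueue after committing. If the process dies or the network
// blips between the commit and the enqueue, the task is lost even though
// the transaction committed.
fn schedule_after_commit(db: &Db, queue: &Queue) -> Result<(), ()> {
    let txn = db.begin();
    txn.update_order(42);
    txn.commit()?;
    queue.enqueue("send_receipt:42")?;
    Ok(())
}
```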
## The outbox pattern

Sure enough, our research yields results. It looks like there's a simple pattern to solve this: we simply write any tasks we want to run to a database table, and then have a separate process that takes things from this "outbox" and sends them to the task queue.

Once a task completes, it removes itself from the table, ensuring that it won't be retried.

1. SQL database
   - Tables map 1:1 with business concepts.
   - **Outbox table.**
2. Stateless web server running our server code.
3. RabbitMQ broker.
4. RabbitMQ worker running our background tasks.
5. Persistent volume for RabbitMQ broker.
6. **Outbox watcher.**

This new "outbox watcher" takes rows from the "outbox" table and schedules them in RabbitMQ. If we're lucky, our database has some means to do this efficiently, but more than likely it does not. In that case, we might have to resort to polling the table.
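A minimal sketch of such a watcher (again with invented stand-in types; a real implementation would batch rows, handle publish errors and back off):

```rust
use std::time::Duration;

struct OutboxRow {
    id: u64,
    payload: String,
}
struct Db;
struct Queue;

impl Db {
    // Fetch rows that have not yet been handed to the broker.
    fn fetch_outbox(&self, _limit: usize) -> Vec<OutboxRow> {
        Vec::new()
    }
}
impl Queue {
    fn publish(&self, _payload: &str) -> Result<(), ()> {
        Ok(())
    }
}

fn outbox_watcher(db: &Db, queue: &Queue) {
    loop {
        let rows = db.fetch_outbox(100);
        for row in &rows {
            // The task itself deletes its row on completion, so a crash
            // here only causes a redelivery: tasks must tolerate being
            // delivered more than once.
            if queue.publish(&row.payload).is_err() {
                break; // broker unavailable; try again next iteration
            }
        }
        if rows.is_empty() {
            // Without an efficient change-notification mechanism,
            // we fall back to polling.
            std::thread::sleep(Duration::from_millis(500));
        }
    }
}
```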
## The retry-pocalypse

Another day, another developer deploys some buggy code. But this time, instead of causing a few tasks to error, the entire system implodes! Messages are building up in the task queue at an absurd rate, and the workers are maxed out.

We roll back the change, but it still results in a significant outage. What's going on? We have exponential backoff on the tasks, so a few failing tasks shouldn't cause this much of an issue.

The problem is that we now have two retry mechanisms at play: there's RabbitMQ's retry mechanism, and there's also the retries triggered by the "outbox" watcher. We need the latter, so we decide to turn off the former by sending an ACK as soon as a message is received by a worker.

On the plus side we can also get rid of our persistent volume for RabbitMQ!

1. SQL database
   - Tables map 1:1 with business concepts.
   - Outbox table.
2. Stateless web server running our server code.
3. RabbitMQ broker.
4. RabbitMQ worker running our background tasks.
5. **~~Persistent volume for RabbitMQ broker.~~**
6. Outbox watcher.

## Performance problems

The business continues to perform well, but the same can't be said of our web application. Customers are starting to complain about tasks running slowly, and on top of that, our "outbox" table is becoming a bottleneck with our database regularly hitting CPU and I/O limits.

*If we were using MySQL, we might find that we have to keep several hundred "dummy" rows in the "outbox" table, or else the gap locks used by MySQL cause all our transactions to conflict with each other!*

We move our database to a beefier machine, and look nervously at the projected future loads... It seems to have at least temporarily solved the database load issue. However, customers are still complaining about slow tasks.

We add some more metrics, and find that a lot of time is spent between a message being scheduled and it being picked up by a worker. What's going on? We just scaled up our workers and they're mostly idle. Why are they not picking up new tasks? We try disabling "pre-fetch" in the workers, but it doesn't seem to help.

Many *many* hours of investigation later, the problem becomes clear. RabbitMQ uses the AMQP protocol to deliver messages. This is a "push" based protocol, meaning that the broker pushes messages to workers, rather than workers "pull"ing them from the broker when they're ready.

How does the broker decide when to send a new message to a worker? Well, assuming pre-fetch is disabled, it sends a new message when the worker ACKs the previous one. When we updated our workers to send ACKs early to avoid the retry-pocalypse, we inadvertently made it so that new tasks could get stuck behind other slow tasks!

Unfortunately, there doesn't appear to be a way to solve this, as it's intrinsic to the underlying protocol. We try to work around it by splitting tasks into those we expect to run quickly, and those which might take a little more time, scheduling the different types onto different queues. We split our workers up similarly, so that some workers are dedicated specifically to each latency group.

We've complicated our deployment yet again, and it hasn't completely eliminated the problem, but at least customers seem to be less vocal about it...

## Transaction conflicts

We've noticed the number of transaction rollbacks caused by conflicts has steadily been increasing. We're scheduling background tasks in response to certain updates in the database, and this is causing bursts of writes to the same set of rows in the database. The background tasks have a much higher than expected chance of conflicting with concurrent requests and other tasks.

We try to mitigate this by explicitly locking rows we're going to update at the beginning of the transaction, so that other potentially conflicting transactions will wait rather than do redundant work that must later be reverted.

This gets rid of the transaction conflicts, but as a result, and combined with our segregation by latencies, we're making very inefficient use of our workers, with lots of them idle, either waiting on database locks, or simply belonging to a latency group that is currently idle.

## The grass is always greener

At this point, we're getting pretty fed up. Those promises of simplicity and performance when we first went looking for a task queue seem laughable now.

Perhaps it was a mistake to shy away from the more advanced systems like Kafka or Apache Pulsar? Maybe the designers of those systems have spent more than 5 seconds thinking about how their product might actually integrate into a larger application?

Continued in [part 2](../part2)...

--------------------------------------------------------------------------------
/message-queues/part2/README.md:
--------------------------------------------------------------------------------
2 | 
3 | ## Switching message queue
4 | 
5 | After much effort, we port our application to use a "better" message
6 | queue, something like Kafka or Apache Pulsar. We have more control
7 | over how messages are distributed to workers, which helps us eliminate
8 | the high latencies which customers were seeing, and we might be able
9 | to partition tasks in such a way that potentially conflicting tasks
10 | are processed sequentially rather than in parallel, which helps
11 | with the database contention.
12 | 
13 | *If we're using Kafka, we might be able to get rid of our "outbox"
14 | entirely by storing queue offsets directly in our database, but
15 | this brings its own challenges.*
16 | 
17 | We modify the retry logic in our "outbox watcher": now we remove
18 | tasks from the outbox as soon as our message queue accepts them, and
19 | rely on the retry logic within the message queue to handle failures
20 | while running the task.
21 | 
22 | This allows us to reduce some of the load on our "outbox" table,
23 | since it will remain relatively small at all times. Our database
24 | can likely keep it entirely in memory. The downside to this change
25 | is that we lose the ability to cleanly restore our application
26 | state from a backup, and we need our persistent volume back...
27 | 
28 | Deployment of our application is now significantly more complicated
29 | than it once was.
30 | 
31 | ## Complexity
32 | 
33 | The new message queue is not a panacea. There are lots of little gotchas
34 | here and there, but at least they seem solvable. Somebody somewhere will
35 | have run into them before, and over time we get closer to a reliable and
36 | stable system.
37 | 
38 | The problem is the additional complexity this has introduced into our
39 | codebase and deployment. We end up with lots of rules for what should
40 | and should not be done to avoid running into problems. A lot of code is
41 | dedicated to handling various failure modes, and our business logic has
42 | become harder to isolate and test independently.
43 | 
44 | Everywhere we trigger a background job, we have to add extra
45 | columns to record the ID of that job, so that when it completes, we
46 | can match its result back up with the correct entity; all of this
47 | needs to be mocked out under test.
48 | 
49 | Finally, the routing and partitioning logic in our message queue is
50 | intricately tied to our database access patterns, and updates to either
51 | one can easily result in unexpected performance regressions.
52 | 
53 | We have something that *works*. Eventually it might even *work well*. But
54 | we're a long way from our nice simple CRUD architecture.
55 | 
56 | ## Why is this so hard?
57 | 
58 | People have been building these kinds of applications for a very long
59 | time. Why are the tools still so hard to use? Why do they fit together
60 | so poorly? I've thought about this for a long time, and I believe the
61 | fundamental reason for this complexity is the lack of a shared transactor.
62 | 
63 | A transactor is a process that ensures transactions are either fully
64 | committed or fully rolled back. It is a fundamental building block for
65 | writing correct applications, but it typically operates at a very low
66 | level.
67 | 
68 | A SQL database will have its own transactor. A message queue will have
69 | its own transactor (although it may not call it that). Most systems
70 | which store persistent state will have some kind of transactor to
71 | ensure correctness, but no such tool is available to the application
72 | itself.
73 | 
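The bug from part 1 was exactly this kind of boundary failure. Here's a
minimal sketch of its shape, assuming Postgres via the `sqlx` crate;
`enqueue_task` is a hypothetical stand-in for whichever queue client is
in use:

```rust
use sqlx::PgPool;

// Hypothetical stand-in for the message queue client.
async fn enqueue_task(payload: &str) -> anyhow::Result<()> {
    unimplemented!("publish `payload` to the task queue")
}

// Two transactors, one logical operation: the database commit and the
// queue publish cannot be made atomic from application code.
async fn place_order(pool: &PgPool, order: &str) -> anyhow::Result<()> {
    let mut tx = pool.begin().await?;
    sqlx::query("INSERT INTO orders (data) VALUES ($1)")
        .bind(order)
        .execute(&mut *tx)
        .await?;
    tx.commit().await?; // transactor #1 (the database) commits here...

    // ...so a crash or network blip *here* means transactor #2 (the
    // queue) never hears about the task: the order exists, but its
    // follow-up work is silently lost.
    enqueue_task(order).await?;
    Ok(())
}
```

Swapping the two calls merely inverts the failure: a task gets scheduled
for an order that was never committed.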
74 | Every transactor has a domain in which it operates, and the boundary
75 | between two transactors is a place where failures can and will occur.
76 | Any time there is communication between domains owned by different
77 | transactors, extra complexity is needed:
78 | 
79 | - Automatic retries with backoff.
80 | - Handling of failures and timeouts.
81 | 
82 | These things cannot be handled "generically". A true failure here (i.e.
83 | one that cannot be retried) is something that eventually needs to be
84 | surfaced to a user somewhere. What that actually means is intrinsically
85 | tied to the business logic of the application.
86 | 
87 | Furthermore, the persistent store beneath a particular transactor
88 | might not support the functionality we need to implement this extra
89 | error handling efficiently. Take a SQL database, for example:
90 | 
91 | - Pros:
92 |   - Can easily store the "outbox" in a table.
93 | - Cons:
94 |   - Frequent updates to the "outbox" table are a bottleneck.
95 |   - "Watching" the "outbox" table for new rows may require polling.
96 | 
97 | On the other hand, a message queue has this retry logic built in, but
98 | if we want to, say, find out how many tasks are "stuck", or get any
99 | insight into the state of the message queue that requires a
100 | non-trivial query, then the message queue likely won't support that.
101 | 
102 | Wouldn't things be a lot easier if we could just have all of our
103 | persistent state live under the same transactor? We'd still have
104 | to worry about failure modes when talking to external systems, but
105 | at least *for our own code, and our own business logic* we can just
106 | assume reliable communication everywhere.
107 | 
108 | ## FoundationDB
109 | 
110 | Unsurprisingly, I'm not the first person to think this way. FoundationDB
111 | is a distributed key/value store which is intended to be used as the
112 | *foundation* for other tools which need to store persistent state.
113 | 
114 | In FoundationDB, an application will typically talk to higher-level
115 | "layers" which run on top of FoundationDB. These layers could include
116 | a SQL database, a message queue, etc. All the layers use the same
117 | transactor, so you can easily update a row and send a message, all in
118 | one atomic transaction. Furthermore, you can back up and restore the
119 | entire FoundationDB state in one go.
120 | 
121 | The premise here is that useful layers can be built upon a
122 | shared key/value store, and that sharing a transactor is useful.
123 | 
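To make that concrete, here's a minimal sketch of "update a row and send
a message" when a single transactor owns both. `Tx` is a hypothetical
stand-in for a FoundationDB transaction handle, not the real API of any
particular binding; all FDB bindings expose key/value writes of roughly
this shape:

```rust
// Hypothetical transaction handle. Real bindings differ in detail, but
// all of them buffer writes and commit them atomically.
struct Tx;

impl Tx {
    fn set(&self, _key: &[u8], _value: &[u8]) {
        // Buffered write; visible only if the whole transaction commits.
    }
}

// One atomic transaction spanning our "table" and our "queue": both are
// just regions of the same key space, under the same transactor.
fn complete_order(tx: &Tx, order_id: u64, task_payload: &[u8]) {
    // The "row" update: keys under /orders/ act as our table.
    tx.set(format!("/orders/{order_id}/status").as_bytes(), b"complete");

    // The "message" send: keys under /queue/ act as our task queue. (A
    // real layer would include a versionstamp in the key so that
    // messages sort in commit order.)
    tx.set(format!("/queue/orders/{order_id}").as_bytes(), task_payload);

    // Both writes commit together or not at all: no outbox, no watcher,
    // no dual-write window.
}
```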
124 | ## What can we do with a tool like FoundationDB?
125 | 
126 | Obviously we could mimic our old architecture, with a SQL layer and
127 | a message queue layer, but FoundationDB opens up a lot of other
128 | options too.
129 | 
130 | I've been pursuing one of these ideas with a project I'm calling
131 | AgentDB. This project borrows a lot of ideas from actor systems,
132 | but with a strong focus on eliminating code which is not required
133 | for the business logic itself.
134 | 
135 | This will be the focus of [part 3](../part3)!
136 | 
--------------------------------------------------------------------------------
/message-queues/part3/README.md:
--------------------------------------------------------------------------------
1 | # Part 3. AgentDB
2 | 
3 | ## Goals
4 | 
5 | - Eliminate failure modes *within* the system.
6 | - Allow business logic to be expressed directly and succinctly.
7 | - Ensure business logic can be tested easily.
8 | - Unlimited horizontal scaling.
9 | - Stay within a constant factor of the performance we might attain with other solutions.
10 | - General purpose.
11 | - Enable backup and restore.
12 | 
13 | ## Non-goals
14 | 
15 | - Achieve the best possible performance.
16 | - Surpass specialized tools.
17 | 
18 | ## Implementation
19 | 
20 | AgentDB is a FoundationDB *layer*, and is used as a client library
21 | by stateless services which can be independently scaled:
22 | 
23 | ![](agentdb-arch.svg)
24 | 
25 | The concepts in AgentDB are:
26 | 
27 | - Agents
28 | 
29 |   These have their own state. The state changes over time in response to
30 |   messages received by the agent, according to the agent's *state function*.
31 | 
32 | - Messages
33 | 
34 |   These are partitioned according to the receiving agent, so for a given agent,
35 |   all of its messages will belong to the same partition. Messages are guaranteed
36 |   to be delivered *exactly once*.
37 | 
38 |   Each message also belongs to an operation.
39 | 
40 | - Operations
41 | 
42 |   These are used for tracing and to prevent unintentional feedback loops within
43 |   the system. Each operation has a certain budget, which slowly increases over time.
44 |   Sending a message eats into that budget.
45 | 
46 | - State function
47 | 
48 |   This is a pure function which takes the following inputs:
49 | 
50 |   - The current state of the agent.
51 |   - The list of messages in the agent's inbox.
52 | 
53 |   And produces the following outputs:
54 | 
55 |   - The new state of the agent.
56 |   - The list of messages to send to other agents.
57 |   - A side-effectful post-commit hook.
58 | 
59 | - Clients
60 | 
61 |   These are processes which connect to a given AgentDB root and provide the
62 |   implementation of the *state function* for agents within that root. AgentDB
63 |   automatically assigns partitions to active clients, such that at any one time,
64 |   every partition is owned by exactly one client.
65 | 
66 |   Each client waits for messages to appear in any of the partitions it owns.
67 |   The client first groups messages according to the receiving agent. It then
68 |   loads the agent's current state, along with all the messages for that agent.
69 |   It runs the *state function* to produce the new state and a set of messages
70 |   to send. These are then saved back into the database.
71 | 
72 |   Finally, if everything committed successfully, the post-commit hook is run.
73 | 
74 |   ![](agentdb-client.svg)
75 | 
76 | - Root
77 | 
78 |   This is an "instance" of AgentDB. All agents within the same root evolve
79 |   according to the same *state function*, and are processed by the same set of
80 |   clients.
81 | 
82 |   That said, since we have a shared transactor, there is no problem sending
83 |   messages from agents in one root to agents in another: none of the guarantees
84 |   are weakened in that case.
85 | 
86 | ## Achievement of goals
87 | 
88 | - Eliminate failure modes within the system.
89 | 
90 |   AgentDB guarantees exactly-once delivery of all messages with no additional
91 |   effort from the application. This is thanks to FoundationDB's transactor.
92 | 
93 | - Allow business logic to be expressed directly and succinctly.
94 | 
95 |   Glue code, such as retry logic for communication with external systems, can
96 |   be completely encapsulated by reusable agents. The focus of the developer
97 |   can be entirely on how the business logic itself is implemented.
98 | 
99 | - Ensure business logic can be tested easily.
100 | 
101 |   Since the *state function* is a pure function, it's trivial to extensively
102 |   unit test every single agent without even needing a real database. There's
103 |   no need to mock anything, since the inputs and outputs (with the exception
104 |   of the post-commit hook) are pure data. (A sketch of such a function and its test appears at the end of this document.)
105 | 
106 |   Post-commit hooks are more difficult to test, but luckily these are only
107 |   needed when communicating with an external system, and can be implemented
108 |   once and reused for all such communication.
109 | 
110 | - Unlimited horizontal scaling.
111 | 
112 |   FoundationDB is inherently distributed. AgentDB itself will automatically
113 |   re-assign partitions as clients come and go, so scaling is trivial.
114 | 
115 |   If it's necessary to scale beyond the current number of partitions in a
116 |   root, it is possible to re-partition the database on the fly, with only
117 |   a brief pause to message processing.
118 | 
119 | - Stay within a constant factor of the performance we might attain with other solutions.
120 | 
121 |   The unit of concurrency in AgentDB is the agent: a given agent will process
122 |   all messages sequentially, but any number of agents can execute in parallel.
123 |   The asynchronous nature of this approach means that there is no lock contention,
124 |   and no transaction conflicts during normal operation.
125 | 
126 |   As throughput increases, the efficiency of the workers also increases,
127 |   since multiple messages can be processed after loading an agent's state only
128 |   once.
129 | 
130 |   Together, these should mean that our "big O" complexity is pretty optimal;
131 |   we just might have some bigger constant factors than other solutions.
132 | 
133 | - General purpose.
134 | 
135 |   Many problems can be expressed as a set of communicating agents. When this
136 |   is not an appropriate model, it is possible to mix in usage of other
137 |   FoundationDB layers.
138 | 
139 |   Some of these layers can themselves be exposed through an agent-based
140 |   abstraction. For example, AgentDB itself implements an "index" agent, which
141 |   allows efficient lookups of other agents via an arbitrary key. The data
142 |   backing such an index would be too large to store directly within the agent
143 |   state, so this agent uses FoundationDB directly, but queries and updates
144 |   to the index are performed via normal AgentDB messages.
145 | 
146 | - Enable backup and restore.
147 | 
148 |   Since all persistent state is kept within FoundationDB, backing up and
149 |   restoring everything works just fine.
150 | 
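To give a feel for the programming model, here's a minimal sketch of a
state function for a toy "counter" agent, along with the kind of unit
test its purity makes possible. All types and names here are invented
for illustration; this is not the real AgentDB API, and the post-commit
hook is omitted:

```rust
// Invented stand-in for AgentDB's agent identifier.
type AgentId = u64;

#[derive(Debug, PartialEq)]
enum Msg {
    Increment(u64),
    Query { reply_to: AgentId },
    Total(u64),
}

// The state function: pure data in (current state + inbox), pure data
// out (new state + messages to send).
fn counter(mut total: u64, inbox: Vec<Msg>) -> (u64, Vec<(AgentId, Msg)>) {
    let mut outbound = Vec::new();
    for msg in inbox {
        match msg {
            Msg::Increment(n) => total += n,
            Msg::Query { reply_to } => outbound.push((reply_to, Msg::Total(total))),
            Msg::Total(_) => {} // replies are not expected by this agent; ignore
        }
    }
    (total, outbound)
}

// Testing needs no database, no mocks, and no async runtime.
#[test]
fn query_sees_preceding_increments() {
    let (total, out) = counter(5, vec![Msg::Increment(3), Msg::Query { reply_to: 42 }]);
    assert_eq!(total, 8);
    assert_eq!(out, vec![(42, Msg::Total(8))]);
}
```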
--------------------------------------------------------------------------------
/message-queues/part3/agentdb-client.svg:
--------------------------------------------------------------------------------
1 | [SVG diagram of the client loop. For each partition: wait for the partition to contain messages, then group messages by receiving agent. For each receiving agent: load agent state and messages, run the state function, save agent state and send messages, and run the post-commit hook.]
--------------------------------------------------------------------------------