├── .gitignore
├── README.md
├── docs
│   ├── domain-specific-debuggers.md
│   ├── erlang-is-not-about.md
│   ├── modular-state-machines.md
│   ├── practice-in-programming.md
│   └── specification-language.md
└── talks
    └── zurihac-2023
        ├── Makefile
        ├── abstract.txt
        ├── slides_stevan.md
        └── slides_stevan.pdf


/.gitignore:
--------------------------------------------------------------------------------
 1 | # haskell
 2 | dist-newstyle/
 3 | cabal.project.local*
 4 | 
 5 | # emacs
 6 | .\#*
 7 | *~
 8 | \#*\#
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Armstrong distributed systems
 2 | 
 3 | How do we build reliable, scalable and maintainable computer systems?
 4 | 
 5 | This repository contains notes on how, I think, we can improve on the state of
 6 | development, documentation, testing, deployment, observability, debugging, and
 7 | upgrading of distributed systems. Most of the ideas are stolen from others, many
 8 | from Erlang and Joe Armstrong. Over time I hope to turn this into a more
 9 | coherent text; for now, think of it as a crude blog or some basic scaffolding
10 | for me to hang my thoughts on.
11 | 
12 | If any of this interests you as well, please do get in touch -- one of the
13 | reasons I'm writing this down is to find people to collaborate with.
14 | 
15 | ## Development
16 | 
17 | * [Erlang's not about lightweight processes and message
18 |   passing...](docs/erlang-is-not-about.md)
19 | 
20 | * [Implementing
21 |   behaviours](https://github.com/stevana/armstrong-distributed-systems/blob/implementing-behaviours/docs/implementing-behaviours.md)
22 | 
23 | * [Modular state machines](docs/modular-state-machines.md)
24 | 
25 | * Implementing `gen_event` using the LMAX disruptor
26 |   - https://github.com/stevana/pipelined-state-machines
27 |   - Shard on: "people, stuff or deals",
28 |     [says](https://youtu.be/1KRYH75wgy4?t=2781) Martin Thompson. 
29 | 
30 | * [State machines with async I/O](https://github.com/stevana/coroutine-state-machines)
31 | 
32 | * Persisted/`mmap`ped lock-free concurrent data structures
33 |   - [Working with binary data](https://github.com/stevana/bits-and-bobs)
34 |   - bytebuffer
35 |   - journal
36 |   - hashmap
37 |   - arena allocator
38 |   - buffered actions
39 |   - [Efficient Tree-Traversals: Reconciling Parallelism and Dense
40 |     Data Representations](https://arxiv.org/pdf/2107.00522.pdf)
41 | 
42 | * [On the role of practice in programming](docs/practice-in-programming.md)
43 | 
44 | ## Documentation
45 | 
46 | * [Specification language](docs/specification-language.md)
47 |   - Interfaces
48 |   - Messages
49 |   - Compression
50 |   - Protocols
51 |   - Usage model or operational profile
52 | 
53 | * Joe's idea of ["the bigger picture"](https://youtu.be/h8nmzPh5Npg?t=1220), in
54 |   particular the "research" part: not just the end result (code) but how you got
55 |   to the result.
56 | 
57 | ## Testing
58 | 
59 | * [Simulation testing using state
60 |   machines](https://github.com/stevana/property-based-testing-stateful-systems-tutorial)
61 | 
62 | * Usage model or operational profile
63 | 
64 | * Load/soak testing
65 | 
66 | ## Deployment
67 | 
68 | * [Deploying and restarting state machines using supervisor
69 |   trees](https://github.com/stevana/supervised-state-machines)
70 | 
71 | * [Elastically scalable thread
72 |   pools](https://github.com/stevana/elastically-scalable-thread-pools)
73 | 
74 | ## Observability
75 | 
76 | * [Visualising data structures over time using
77 |   SVG](https://github.com/stevana/svg-viewer-in-svg)
78 | 
79 | ## Debugging
80 | 
81 | * [Domain-specific debuggers](docs/domain-specific-debuggers.md)
82 | 
83 | ## Upgrading
84 | 
85 | * [Hot-code swapping à la Erlang with `Arrow`-based state
86 |   machines](https://github.com/stevana/hot-swapping-state-machines)
87 | 
88 | * [`smarrow-lang`](https://github.com/stevana/smarrow-lang), an experimental
89 |   programming language where programs are state machines 
expressed in arrow
90 |   notation to allow easy hot-code swapping
91 | 
92 | * Version everything
--------------------------------------------------------------------------------
/docs/domain-specific-debuggers.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | status: research (please don't share)
 3 | ---
 4 | 
 5 | # Domain-specific debuggers
 6 | 
 7 | ## Motivation
 8 | 
 9 | * General purpose debuggers such as [`gdb`](https://www.sourceware.org/gdb/) are underused
10 | 
11 | * Theory: people who use general purpose debuggers also care about memory layout, e.g.:
12 | 
13 |   - John Carmack on using
14 |     [debuggers](https://www.youtube.com/watch?v=tzr7hRXcwkw) and his old
15 |     [.plan](https://github.com/oliverbenns/john-carmack-plan/blob/53c00aedaeeb23dee06b28ba985c3b3b4d61107c/archive/1998-10-14.md)
16 |     (1998);
17 | 
18 |   - Martin Thompson also [says](https://youtu.be/1KRYH75wgy4?t=411) "step
19 |     through your code using a debugger".
20 | 
21 | * Possible fix: domain-specific debuggers, which focus on displaying your
22 |   application state at the level of abstraction you think in, rather than
23 |   how the programming language you are using happens to lay it out in
24 |   memory
25 | 
26 | ## Idea
27 | 
28 | * Assumptions: determinism, state machine
29 | * Record inputs (and states) in a circular buffer, dump to disk/SQLite db on error
30 | 
31 | ## Examples
32 | 
33 | ### Distributed systems
34 | 
35 | * TigerBeetle's demo https://youtu.be/w3WYdYyjek4?t=3175
36 | * https://spritely.institute/news/introducing-a-distributed-debugger-for-goblins-with-time-travel.html
37 | 
38 | ### Games
39 | 
40 | * [Tomorrow Corporation Tech Demo](https://www.youtube.com/watch?v=72y2EC5fkcE)
41 | 
42 | ## Optimisations
43 | 
44 | * Avoid storing the state in the trace (if deterministic)
45 | * Avoid infinite traces via state snapshots
46 | * Turn off logging (if deterministic)
47 | * Audit trails
48 | 
49 | ## Contributing
50 | 
51 | * 
General purpose formats for states, inputs and outputs against which generic
52 |   debuggers can be written? Structured JSON? Binary?
53 | 
54 | ## See also
55 | 
56 | * The history of [time traveling
57 |   debuggers](http://jakob.engbloms.se/archives/1564)
58 | * https://werat.dev/blog/what-a-good-debugger-can-do/
59 | 
60 | * Mozilla's [`rr` debugger](https://github.com/rr-debugger/rr)
61 | 
62 | * [Visualising application state](https://youtu.be/-HhI3BEIqWw?t=1199)
63 | 
64 | * Command sourcing
65 | 
66 | * Jamie Brandon's post [*Local state is
67 |   harmful*](https://www.scattered-thoughts.net/writing/local-state-is-harmful/)
68 |   (2014)
--------------------------------------------------------------------------------
/docs/erlang-is-not-about.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | date: 2023-01-18
 3 | ---
 4 | 
 5 | # Erlang's not about lightweight processes and message passing...
 6 | 
 7 | I used to think that the big idea of Erlang was its lightweight processes and
 8 | message passing. Over the last couple of years I've realised that there's a
 9 | bigger insight to be had, and in this post I'd like to share it with you.
10 | 
11 | ## Background
12 | 
13 | Erlang has an interesting history. If I understand things correctly, it started
14 | off as a Prolog library for building reliable distributed systems, morphed into
15 | a Prolog dialect, and finally became a language in its own right.
16 | 
17 | The goal seems always to have been to solve the problem of building reliable
18 | distributed systems. Erlang was developed at Ericsson and used to program their
19 | telephone switches. This was sometime in the 80s and 90s, before internet use
20 | became widespread. I suppose they were already dealing with "internet scale"
21 | traffic, i.e. hundreds of millions of users, with stricter SLAs than most
22 | internet services provide today. So in a sense they were ahead of their time. 
23 | 
24 | In 1998 Ericsson decided to ban all use of Erlang[^0]. The people responsible
25 | for developing it argued that if it was going to be banned, then it might as
26 | well be open sourced. Ericsson agreed, and shortly after most of the team that
27 | created Erlang quit and started their own company.
28 | 
29 | One of these people was Joe Armstrong, who was also one of the main people
30 | behind the design and implementation of Erlang. The company was called Bluetail
31 | and it got bought up a couple of times, but in the end Joe got fired in 2002.
32 | 
33 | Shortly after, still in 2002, Joe started writing his PhD thesis at the Swedish
34 | Institute of Computer Science (SICS). Joe was born in 1950, so he was probably
35 | 52 years old at this point. The topic of the thesis is *Making reliable
36 | distributed systems in the presence of software errors* and it was finished the
37 | following year, in 2003.
38 | 
39 | It's quite an unusual thesis in many ways. For starters, most theses are written
40 | by people in their twenties with zero experience of practical applications,
41 | whereas Joe had been working professionally on this topic since the 80s, i.e.
42 | for about twenty years. The thesis contains no math or theory; it's
43 | simply a presentation of the ideas that underpin Erlang and how they were used
44 | to achieve the original goal of building reliable distributed systems.
45 | 
46 | I highly recommend reading his
47 | [thesis](http://kth.diva-portal.org/smash/record.jsf?pid=diva2%3A9492&dswid=-1166)
48 | and forming your own opinion, but to me it's clear that the big idea there isn't
49 | lightweight processes[^1] and message passing, but rather the generic components
50 | which in Erlang are called *behaviours*.
51 | 
52 | ## Behaviours
53 | 
54 | I'll first explain in more detail what behaviours are, and then I'll come back
55 | to the point that they are more important than the idea of lightweight processes. 
56 | 
57 | Erlang behaviours are like interfaces in, say, Java or Go: a collection of
58 | type signatures which can have multiple implementations, and once the programmer
59 | provides such an implementation they get access to functions written against
60 | that interface. To make this more concrete, here's a contrived example in Go:
61 | 
62 | ```go
63 | // The interface.
64 | type HasName interface {
65 | 	Name() string
66 | }
67 | 
68 | // A generic function written against the interface.
69 | func Greet(n HasName) {
70 | 	fmt.Printf("Hello %s!\n", n.Name())
71 | }
72 | 
73 | // First implementation of the interface.
74 | type Joe struct{}
75 | 
76 | func (_ *Joe) Name() string {
77 | 	return "Joe"
78 | }
79 | 
80 | // Second implementation of the interface.
81 | type Mike struct{}
82 | 
83 | func (_ *Mike) Name() string {
84 | 	return "Mike"
85 | }
86 | 
87 | func main() {
88 | 	joe := &Joe{}
89 | 	mike := &Mike{}
90 | 	Greet(mike)
91 | 	Greet(joe)
92 | }
93 | ```
94 | 
95 | Running the above program will display:
96 | 
97 | ```
98 | Hello Mike!
99 | Hello Joe!
100 | ```
101 | 
102 | This hopefully illustrates how `Greet` is generic in, or parametrised by, the
103 | interface `HasName`.
104 | 
105 | ### Generic server behaviour
106 | 
107 | Next let's have a look at a more complicated example in Erlang, taken from Joe's
108 | thesis (p. 136). It's a key-value store where we can `store` a key-value pair or
109 | `lookup` the value of a key; the `handle_call` part is the most interesting:
110 | 
111 | ```erlang
112 | -module(kv).
113 | -behaviour(gen_server).
114 | 
115 | -export([start/0, stop/0, lookup/1, store/2]).
116 | 
117 | -export([init/1, handle_call/3, handle_cast/2, terminate/2]).
118 | 
119 | start() ->
120 |     gen_server:start_link({local,kv},kv,arg1,[]).
121 | 
122 | stop() -> gen_server:cast(kv, stop).
123 | 
124 | init(arg1) ->
125 |     io:format("Key-Value server starting~n"),
126 |     {ok, dict:new()}. 
127 | 
128 | store(Key, Val) ->
129 |     gen_server:call(kv, {store, Key, Val}).
130 | 
131 | lookup(Key) -> gen_server:call(kv, {lookup, Key}).
132 | 
133 | handle_call({store, Key, Val}, From, Dict) ->
134 |     Dict1 = dict:store(Key, Val, Dict),
135 |     {reply, ack, Dict1};
136 | handle_call({lookup, crash}, From, Dict) ->
137 |     1/0; %% <- deliberate error :-)
138 | handle_call({lookup, Key}, From, Dict) ->
139 |     {reply, dict:find(Key, Dict), Dict}.
140 | 
141 | handle_cast(stop, Dict) -> {stop, normal, Dict}.
142 | 
143 | terminate(Reason, Dict) ->
144 |     io:format("K-V server terminating~n").
145 | ```
146 | 
147 | This is an implementation of the `gen_server` behaviour/interface. Notice how
148 | `handle_call` updates the state (`Dict`) in the case of a `store`, and `lookup`s
149 | the key in the state. Once `gen_server` is given this implementation it provides
150 | a server which can handle concurrent `store` and `lookup` requests, just as
151 | `Greet` provided the greeting functionality.
152 | 
153 | At this point you might be thinking "OK, so what? Lots of programming languages
154 | have interfaces...". That's true, but notice how `handle_call` is completely
155 | sequential, i.e. all concurrency is hidden away in the generic `gen_server`
156 | component. "Yeah, but that's just good engineering practice which can be done in
157 | any language", you say. That's true as well, but the thesis pushes this idea
158 | quite far. It identifies six behaviours: `gen_server`, `gen_event`, `gen_fsm`,
159 | `supervisor`, `application`, and `release`, and then says that these are enough
160 | to build reliable distributed systems. As a case study Joe uses one of
161 | Ericsson's telephone switches (p. 157):
162 | 
163 | > When we look at the AXD301 project in chapter 8, we will see that there were
164 | > 122 instances of gen_server, 36 instances of gen_event and 10 instances of
165 | > gen_fsm. There were 20 supervisors and 6 applications. 
All this is packaged
166 | > into one release.
167 | 
168 | Joe gives several arguments for why behaviours should be used (pp. 157-158):
169 | 
170 | 1. The application programmer only has to provide the part of the code which
171 |    defines the *semantics* (or "business logic") of their problem, while the
172 |    *infrastructure* code is provided automatically by the behaviour;
173 | 
174 | 2. The application programmer writes sequential code, all concurrency is hidden
175 |    away in the behaviour;
176 | 
177 | 3. Behaviours are written by experts, based on years of experience, and
178 |    represent "best practices";
179 | 
180 | 4. It's easier for new team members to get started: the business logic is
181 |    sequential and has a structure they might have seen before elsewhere;
182 | 
183 | 5. If whole systems are implemented by reusing a small set of behaviours, then
184 |    as behaviour implementations improve the whole systems will improve without
185 |    requiring any code changes;
186 | 
187 | 6. Sticking to only using behaviours enforces structure, which in turn makes
188 |    testing and formal verification much easier.
189 | 
190 | We'll come back to this last point about testing later.
191 | 
192 | ### Event manager behaviour
193 | 
194 | Let's first come back to the other behaviours we listed above. We looked at
195 | `gen_server`, but what are the others for? There's `gen_event`, a generic event
196 | manager, which lets you register event handlers that are then run when the
197 | event manager receives messages associated with them. Joe says this is useful
198 | for, e.g., error logging and gives the following example of a simple
199 | logger (p. 142):
200 | 
201 | ```erlang
202 | -module(simple_logger).
203 | -behaviour(gen_event).
204 | 
205 | -export([start/0, stop/0, log/1, report/0]).
206 | 
207 | -export([init/1, terminate/2,
208 |          handle_event/2, handle_call/2]).
209 | 
210 | -define(NAME, my_simple_event_logger). 
211 | 
212 | start() ->
213 |     case gen_event:start_link({local, ?NAME}) of
214 |         Ret = {ok, Pid} ->
215 |             gen_event:add_handler(?NAME,?MODULE,arg1),
216 |             Ret;
217 |         Other ->
218 |             Other
219 |     end.
220 | 
221 | stop() -> gen_event:stop(?NAME).
222 | 
223 | log(E) -> gen_event:notify(?NAME, {log, E}).
224 | 
225 | report() ->
226 |     gen_event:call(?NAME, ?MODULE, report).
227 | 
228 | init(arg1) ->
229 |     io:format("Logger starting~n"),
230 |     {ok, []}.
231 | 
232 | handle_event({log, E}, S) -> {ok, trim([E|S])}.
233 | 
234 | handle_call(report, S) -> {ok, S, S}.
235 | 
236 | terminate(stop, _) -> true.
237 | 
238 | trim([X1,X2,X3,X4,X5|_]) -> [X1,X2,X3,X4,X5];
239 | trim(L) -> L.
240 | ```
241 | 
242 | The interesting parts are `handle_event`, `trim` and `report`. Together they let
243 | the user log, keep track of, and display the last five error messages.
244 | 
245 | ### State machine behaviour
246 | 
247 | The `gen_fsm` behaviour has been renamed to `gen_statem` (for state machine)
248 | since the thesis was written. It's very similar to `gen_server`, but more geared
249 | towards implementing protocols, which often are specified as state machines. I
250 | believe any `gen_server` can be implemented as a `gen_statem` and vice versa, so
251 | we won't go into the details of `gen_statem`.
252 | 
253 | ### Supervisor behaviour
254 | 
255 | The next interesting behaviour is `supervisor`. Supervisors are processes whose
256 | sole job is to make sure that other processes are healthy and doing their job.
257 | If a supervised process fails then the supervisor can restart it according
258 | to some predefined strategy. Here's an example due to Joe (p. 148):
259 | 
260 | ```erlang
261 | -module(simple_sup).
262 | -behaviour(supervisor).
263 | 
264 | -export([start/0, init/1]).
265 | 
266 | start() ->
267 |     supervisor:start_link({local, simple_supervisor},
268 |                           ?MODULE, nil). 
269 | 
270 | init(_) ->
271 |     {ok,
272 |      {{one_for_one, 5, 1000},
273 |       [
274 |        {packet,
275 |         {packet_assembler, start, []},
276 |         permanent, 500, worker, [packet_assembler]},
277 |        {server,
278 |         {kv, start, []},
279 |         permanent, 500, worker, [kv]},
280 |        {logger,
281 |         {simple_logger, start, []},
282 |         permanent, 500, worker, [simple_logger]}]}}.
283 | ```
284 | 
285 | The `{one_for_one, 5, 1000}` part is the restart strategy. It says that if one
286 | of the supervised processes (`packet_assembler`, `kv`, and `simple_logger`)
287 | fails, then only the failing process is restarted (`one_for_one`). If the
288 | supervisor needs to restart more than `5` times in `1000` seconds then the
289 | supervisor itself should fail.
290 | 
291 | The `permanent, 500, worker` part means that this is a worker process which
292 | should be permanently kept alive, and it's given 500 milliseconds to gracefully
293 | stop what it's doing in case the supervisor wants to restart it.
294 | 
295 | "Why would the supervisor want to restart it if it's not dead already?", one
296 | might wonder. Well, there are other restart strategies than `one_for_one`. With
297 | `one_for_all`, for example, if one process fails then the supervisor restarts
298 | all of its children.
299 | 
300 | If we also consider that supervisors can supervise supervisors, which are not
301 | necessarily running on the same computer, then I hope you get an idea of how
302 | powerful this behaviour can be. And, no, this isn't "just Kubernetes": it
303 | operates at the thread/lightweight-process level, not the Docker container level.
304 | 
305 | The idea for supervisors and their restart strategies comes from the observation
306 | that a restart often appears to fix the problem, as captured in the *Have You
307 | Tried Turning It Off And On Again?* sketches from the IT Crowd. 
308 | 
309 | Knowing that failing processes will get restarted, coupled with Jim Gray's idea
310 | of failing fast -- that is, either produce the output according to the
311 | specification, or signal failure and stop operating -- leads to Joe's slogan:
312 | "Let it crash!" (p. 107). Another way to think of it is that a program should
313 | only express its "happy path"; should anything go wrong along the way it should
314 | crash, rather than trying to be clever and fix the problem (potentially making
315 | it worse), and another program higher up the supervisor tree will handle it.
316 | 
317 | Supervisors and the "let it crash" philosophy appear to produce reliable
318 | systems. Joe uses the Ericsson AXD301 telephone switch example again (p. 191):
319 | 
320 | > Evidence for the long-term operational stability of the system had also not
321 | > been collected in any systematic way. For the Ericsson AXD301 the only
322 | > information on the long-term stability of the system came from a power-point
323 | > presentation showing some figures claiming that a major customer had run an 11
324 | > node system with a 99.9999999% reliability, though how these figure had been
325 | > obtained was not documented.
326 | 
327 | To put this in perspective, five nines (99.999%) reliability is considered good
328 | (5.26 minutes of downtime per year). "59% of Fortune 500 companies experience a
329 | minimum of 1.6 hours of downtime per week", according to some
330 | [report](https://courseware.cutm.ac.in/wp-content/uploads/2020/06/Assessing-the-Financial-Impact-of-Downtime-UK.pdf)
331 | from a biased company. Notice per *year* vs per *week*, but as we don't know how
332 | either reliability number was obtained, it's probably safe to assume that the
333 | truth is somewhere in the middle -- still a big difference, but not 31.56
334 | milliseconds (nine nines) of downtime per year vs 1.6 hours of downtime per
335 | week. 
336 | 
337 | ### Application and release behaviours
338 | 
339 | I'm not sure if `application` and `release` technically are behaviours, i.e.
340 | interfaces. They are part of the same chapter as the other behaviours in the
341 | thesis, though, and they do provide a clear structure, which is a trait of the
342 | other behaviours, so we'll include them in the discussion.
343 | 
344 | So far we've presented behaviours from the bottom up. We started with "worker"
345 | behaviours `gen_server`, `gen_statem` and `gen_event`, which together capture
346 | the semantics of our problem. We then saw how we can define `supervisor` trees,
347 | whose children are other supervisor trees or workers, to deal with failures and
348 | restarts.
349 | 
350 | The next level up is an `application`, which consists of a supervisor tree
351 | together with everything else we need to deliver a particular application.
352 | 
353 | A system can consist of several `application`s, and that's where the final
354 | "behaviour" comes in: a `release` packages up one or more applications.
355 | Releases also contain code to handle upgrades. If an upgrade fails, it must be
356 | possible to roll back to the previous stable state.
357 | 
358 | ## How behaviours can be implemented
359 | 
360 | I hope that by now I've managed to convince you that it's not actually the
361 | lightweight processes and message passing by themselves that make Erlang great
362 | for building reliable systems.
363 | 
364 | At best one might claim that lightweight processes and supervisors are the key
365 | mechanisms at play[^2], but I think it would be more honest to recognise the
366 | structure that behaviours provide and how that ultimately leads to reliable
367 | software.
368 | 
369 | I've not come across any other language, library, or framework which provides
370 | such relatively simple building blocks that compose into big systems like the
371 | AXD301 ("over a million lines of Erlang code", p. 167). 
372 | 
373 | This raises the question: why aren't language and library designers stealing the
374 | structure behind Erlang's behaviours, rather than copying the ideas of
375 | lightweight processes and message passing?
376 | 
377 | Let's take a step back. We said earlier that behaviours are interfaces and many
378 | programming languages have interfaces. How would we go about starting to
379 | implement behaviours in other languages?
380 | 
381 | Let's start with `gen_server`. I like to think of its interface signature as:
382 | 
383 | ```haskell
384 | Input -> State -> (State, Output)
385 | ```
386 | 
387 | That is, it takes some input and its current state, and produces a pair of the
388 | updated state and an output.
389 | 
390 | How do we turn this sequential signature into something that can handle
391 | concurrent requests? One way would be to fire up an HTTP server which transforms
392 | requests into `Input`s and puts them on a queue, have an event loop which pops
393 | inputs from the queue and feeds them to the sequential implementation, and then
394 | writes the output back in the client response. It wouldn't be difficult to
395 | generalise this to handle multiple `gen_server`s at the same time, by giving
396 | each a name and letting the request include the name in addition to the input.
397 | 
398 | `gen_event` could be implemented by allowing registration of callbacks for
399 | certain types of event on the queue.
400 | 
401 | `supervisor`s are more interesting. One simple way to think of it is: when we
402 | feed the `gen_server` function the next input from the queue, we wrap that call
403 | in an exception handler, and should it throw we notify its supervisor. It gets a
404 | bit more complicated if the supervisor is not running on the same computer as
405 | the `gen_server`. 
406 | 
407 | I haven't thought much about `application`s and `release`s yet, but given that
408 | configuration, deployment and upgrades are difficult problems they seem
409 | important.
410 | 
411 | ## Correctness of behaviours
412 | 
413 | Writing a post solely about stealing from Erlang doesn't seem fair, even though
414 | it's the right thing to do, so I'd like to finish off with how we can build upon
415 | the insights of Joe and the Erlang community.
416 | 
417 | I've been interested in testing for a while now. Most recently I've been
418 | looking into [simulation
419 | testing](https://github.com/stevana/property-based-testing-stateful-systems-tutorial)
420 | distributed systems à la
421 | [FoundationDB](https://www.youtube.com/watch?v=4fFDFbi3toc).
422 | 
423 | Simulation testing in a nutshell is running your system in a simulated world,
424 | where the simulation has full control over which messages get sent when over the
425 | network.
426 | 
427 | FoundationDB built their own programming language, a dialect of C++ with
428 | actors, in order to do this simulation testing. Our team seemed to be able to
429 | get quite far merely using state machines of type:
430 | 
431 | ```haskell
432 | Input -> State -> (State, [Output])
433 | ```
434 | 
435 | where `[Output]` is a sequence of outputs.
436 | 
437 | The idea is that the simulator keeps a priority queue of messages sorted by
438 | their arrival time: it pops a message, advances the clock to the arrival time
439 | of that message, feeds the message to the receiving state machine, generates
440 | new arrival times for all output messages and puts them back into the priority
441 | queue, rinse and repeat. As long as everything is deterministic and the arrival
442 | times are generated using a seed, we can explore many different
443 | interleavings and get reproducible failures. 
It's also much faster than Jepsen,
444 | because messaging is done in-memory and we advance the clock to each arrival
445 | time, thereby triggering any timeouts without having to wait for them.
446 | 
447 | We used to say that programs of this state machine type were written in
448 | "network normal form", and conjectured that every program which can receive and
449 | send stuff over the network can be refactored into this shape[^3]. Even if we
450 | had a proof, "network normal form" always felt a bit arbitrary. But then I read
451 | Joe's thesis and realised that `gen_server` and `gen_statem` basically have the
452 | same type, so I stopped being concerned about it. I think that if a structure
453 | is found to be useful by different people, then it's usually a sign that it
454 | isn't arbitrary.
455 | 
456 | Anyway, in at least one of Joe's [talks](https://youtu.be/cNICGEwmXLU?t=1439)
457 | he mentions how difficult it is to correctly implement distributed leader
458 | election.
459 | 
460 | I believe this is a problem that would be greatly simplified by having access to
461 | a simulator, a bit like how having access to a wind tunnel makes building an
462 | airplane easier. Both let you test your system under extreme conditions, such
463 | as unreliable networking or power loss, before they happen in "production".
464 | Furthermore, this simulator can be generic in, or parametrised by, behaviours,
465 | which means that the developer gets it for free while the complexity of the
466 | simulator is hidden away, just like the concurrent code of `gen_server`! 
467 | 
468 | FoundationDB is a good example of simulation testing working, as witnessed by
469 | this [tweet](https://twitter.com/aphyr/status/405017101804396546) where somebody
470 | asked Kyle "aphyr" Kingsbury to Jepsen test FoundationDB:
471 | 
472 | > “haven’t tested foundation[db] in part because their testing appears to be
473 | > waaaay more rigorous than mine.”
474 | 
475 | Formal verification is also made easier if the program is written as a state
476 | machine. Basically all of Lamport's model checking
477 | [work](https://www.microsoft.com/en-us/research/publication/computation-state-machines/)
478 | with TLA+ assumes that the specification is a state machine. More recently,
479 | Kleppmann has
480 | [shown](https://lawrencecpaulson.github.io/2022/10/12/verifying-distributed-systems-isabelle.html)
481 | how to exploit the state machine structure to do proof by (structural) induction
482 | to solve the state explosion problem.
483 | 
484 | So there you have it, we've gone full circle. We started by taking inspiration
485 | from Joe and Erlang's behaviours, and ended up using the structure of the
486 | `gen_server` behaviour to make it easier to solve a problem that Joe used to
487 | have.
488 | 
489 | ## Contributing
490 | 
491 | There are a bunch of related ideas that I have started working on:
492 | 
493 | * Stealing ideas from Martin Thompson's work on the LMAX Disruptor and
494 |   [aeron](https://github.com/real-logic/aeron) to
495 |   [make](https://github.com/stevana/pipelined-state-machines) a fast event
496 |   loop, on top of which the behaviours run;
497 | * Enriching the state machine type with [async
498 |   I/O](https://github.com/stevana/coroutine-state-machines);
499 | * How to implement
500 |   [supervisors](https://github.com/stevana/supervised-state-machines) in more
501 |   detail;
502 | * Hot code [swapping](https://github.com/stevana/hot-swapping-state-machines)
503 |   of state machines. 
504 | 
505 | Feel free to get in touch if you find any of this interesting and would like to
506 | get involved, or if you have comments, suggestions or questions.
507 | 
508 | ## See also
509 | 
510 | * Chapter 6.1 on behaviours in Joe Armstrong's
511 |   [thesis](http://kth.diva-portal.org/smash/record.jsf?pid=diva2%3A9492&dswid=-1166),
512 |   p. 129;
513 | * [OTP design principles](https://www.erlang.org/doc/design_principles/des_princ.html);
514 | * The documentation for behaviours:
515 |   - [`gen_server`](https://www.erlang.org/doc/man/gen_server.html);
516 |   - [`gen_event`](https://www.erlang.org/doc/man/gen_event.html);
517 |   - [`gen_statem`](https://www.erlang.org/doc/man/gen_statem.html);
518 |   - [`supervisor`](https://www.erlang.org/doc/man/supervisor.html);
519 |   - [`application`](https://www.erlang.org/doc/man/application.html);
520 |   - [`release`](https://www.erlang.org/doc/design_principles/release_structure.html).
521 | * [Hewitt, Meijer and Szyperski: The Actor Model (everything you wanted to know,
522 |   but were afraid to ask)](https://youtube.com/watch?v=7erJ1DV_Tlo) (2012);
523 | * Erlang the [movie](https://www.youtube.com/watch?v=xrIjfIjssLE) (1990);
524 | * [Systems that run forever self-heal and
525 |   scale](https://www.youtube.com/watch?v=cNICGEwmXLU) by Joe Armstrong (Strange
526 |   Loop, 2013);
527 | * [The Do's and Don'ts of Error
528 |   Handling](https://www.youtube.com/watch?v=TTM_b7EJg5E) by Joe Armstrong (GOTO,
529 |   2018);
530 | * [The Zen of Erlang](https://ferd.ca/the-zen-of-erlang.html) by Fred Hebert
531 |   (2016);
532 | * [The Hitchhiker's Guide to the
533 |   Unexpected](https://ferd.ca/the-hitchhiker-s-guide-to-the-unexpected.html) by
534 |   Fred Hebert (2018);
535 | * [Why Do Computers Stop and What Can Be Done About
536 |   It?](https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf) by Jim Gray
537 |   (1985);
538 | * The supervision trees chapter of [*Adopting
539 | 
Erlang*](https://adoptingerlang.org/docs/development/supervision_trees/)
541 | (2019);
542 | * "If there's one thing I'd say to the Erlang folks, it's you got the stuff
543 | right from a high-level, but you need to invest in your messaging
544 | infrastructure so it's super fast, super efficient and obeys all the right
545 | properties to let this stuff work really well."
546 | [quote](https://youtu.be/OqsAGFExFgQ?t=2532) by Martin Thompson (Functional
547 | Conf, 2017).
548 |
549 | ## Discussion
550 |
551 | * [Hacker News](https://news.ycombinator.com/item?id=34545061)
552 | * [lobste.rs](https://lobste.rs/s/7dguth/erlang_s_not_about_lightweight_processes)
553 | * [r/programming](https://old.reddit.com/r/programming/comments/10mt6hz/erlangs_not_about_lightweight_processes_and/)
554 | * [r/haskell](https://old.reddit.com/r/haskell/comments/10mgd0a/erlangs_not_about_lightweight_processes_and/)
555 | * [r/erlang](https://old.reddit.com/r/erlang/comments/10g0zbg/erlangs_not_about_lightweight_processes_and/)
556 | * [Elixir Forum](https://elixirforum.com/t/erlangs-not-about-lightweight-processes-and-message-passing/53484/7)
557 |
558 |
559 | [^0]: From Joe Armstrong's thesis (p. 6):
560 |
561 | > In February 1998 Erlang was banned for new product development within
562 | > Ericsson—the main reason for the ban was that Ericsson wanted to be a consumer
563 | > of software technologies rather than a producer.
564 |
565 | From Bjarne Däcker's thesis (2000, p. 37):
566 |
567 | > In February 1998, Erlang was banned within Ericsson Radio AB (ERA) for new
568 | > product projects aimed for external customers because:
569 | >
570 | > “The selection of an implementation language implies a more long-term
571 | > commitment than selection of processors and OS, due to the longer life cycle
572 | > of implemented products. Use of a proprietary language, implies a continued
573 | > effort to maintain and further develop the support and the development
574 | > environment.
It further implies that we cannot easily benefit from, and find
575 | > synergy with, the evolution following the large scale deployment of globally
576 | > used languages.”
577 |
578 | Joe also says, in this [talk](https://vimeo.com/97329186) (34:30), that there
579 | were two reasons for Erlang getting banned: 1) that it wasn't Java, and 2) that
580 | it wasn't C++.
581 |
582 | [^1]: It's a common misconception that Erlang is about actors.
583 |
584 | The actor model was first presented in [*A Universal Modular Actor Formalism for
585 | Artificial
586 | Intelligence*](https://www.ijcai.org/Proceedings/73/Papers/027B.pdf) by Carl
587 | Hewitt, Peter Bishop and Richard Steiger (1973) and refined by others over time,
588 | e.g. see Irene Greif's [thesis](https://dspace.mit.edu/handle/1721.1/57710)
589 | (1975) or Gul Agha's [thesis](https://dspace.mit.edu/handle/1721.1/6952)
590 | (1985).
591 |
592 | Erlang first appeared later, in 1986, but the Erlang developers were [not
593 | aware](https://erlang.org/pipermail/erlang-questions/2014-June/079794.html) of
594 | the actor model. In fact, Robert Virding, one of the original Erlang designers,
595 | [claims](https://erlang.org/pipermail/erlang-questions/2014-June/079865.html)
596 | that knowing about the actor model might even have slowed them down.
597 |
598 | Carl Hewitt has written a paper called [*Actor Model of Computation: Scalable
599 | Robust Information Systems*](https://arxiv.org/abs/1008.1459) (2015) which
600 | documents the differences between Erlang's processes and the actor model.
601 |
602 | [^2]: Scala's Akka seems to be of this opinion. It has something it calls
603 | "actors", not to be confused with the actor model as per footnote 1, as well as
604 | the obligatory supervisor trees. It doesn't appear to have any analogues of the
605 | other Erlang behaviours though.
606 |
607 | Confusingly, Akka has a concept called
608 | ["behavior"](https://doc.akka.io/docs/akka/current/general/actors.html#behavior),
609 | but it has nothing to do with Erlang behaviours.
610 |
611 | [^3]: The intuition being that since every program using the state monad can be
612 | rewritten to a normal form consisting of a single `read`/`get` followed by a
613 | single `write`/`put`, it seems reasonable to assume that something similar would
614 | work for `recv` and `send` over the network. I forget the reference for the
615 | state monad normal form, either Plotkin and Power or Uustalu?
616 | --------------------------------------------------------------------------------
/docs/modular-state-machines.md:
--------------------------------------------------------------------------------
1 | ---
2 | status: Still in research (please don't share)
3 | ---
4 |
5 | # Modular state machines
6 |
7 | ## Motivation
8 |
9 | * One big state machine can become difficult to understand
10 |
11 | * How can we break a state machine into parts such that when combined they form
12 | the whole?
13 |
14 | ## Products of states
15 |
16 | * The Cartesian product of states can be used when we need to keep track of two
17 | states running in parallel.
18 |
19 | * There are two different ways we can advance two state machines that
20 | run in parallel, either by stepping one of them and leaving the other alone or
21 | stepping both of them in lockstep.
22 |
23 | * Angelic product, allows us to step one of the state machines at a time:
24 |
25 | ```
26 | type SM s i o = i -> s -> (s, o)
27 |
28 | angelic : SM s i o -> SM t j p -> SM (s, t) (i + j) (o + p)
29 | angelic f g ij (s, t) = case ij of
30 |   Left i  -> let (s', o) = f i s in ((s', t), Left o)
31 |   Right j -> let (t', p) = g j t in ((s, t'), Right p)
32 |
33 | video : SM {playing, stopped} {stop, play} ()
34 | audio : SM {muted, unmuted} {mute, unmute} ()
35 |
36 | player = angelic video audio
37 | ```
38 |
39 | * Tensor product, allows us to step both state machines in lockstep:
40 |
41 | ```
42 | tensor : SM s i o -> SM t j p -> SM (s, t) (i, j) (o, p)
43 | tensor f g (i, j) (s, t) =
44 |   let
45 |     (s', o) = f i s
46 |     (t', p) = g j t
47 |   in
48 |     ((s', t'), (o, p))
49 | ```
50 |
51 | See also:
52 |
53 | * ["Concurrent state
54 | machines"](http://gameprogrammingpatterns.com/state.html#concurrent-state-machines)
55 | in Nystrom;
56 | * [Orthogonal
57 | regions](https://en.wikipedia.org/wiki/UML_state_machine#Orthogonal_regions)
58 | in UML state machines;
59 | * [Programming interfaces and basic topology](https://arxiv.org/abs/0905.4063)
60 | by Peter Hancock and Pierre Hyvernat (2009).
61 |
62 | ## The state pattern
63 |
64 | * http://gameprogrammingpatterns.com/state.html#the-state-pattern
65 |
66 | ## Hierarchical states
67 |
68 | * http://gameprogrammingpatterns.com/state.html#hierarchical-state-machines
69 |
70 | * [Hierarchically nested
71 | states](https://en.wikipedia.org/wiki/UML_state_machine#Hierarchically_nested_states)
72 | in UML state machines
73 |
74 | ## Stack of states / pushdown automaton
75 |
76 | * http://gameprogrammingpatterns.com/state.html#pushdown-automata
77 |
78 | ## See also
79 |
80 | * [Game Programming Patterns](http://gameprogrammingpatterns.com/state.html) by
81 | Robert Nystrom (2014, chapter 7);
82 | * Development and Deployment of Multiplayer Online Games, Vol.
II by Sergey
83 | Ignatchenko (2020, chapter 5);
84 | * [*Statecharts: A visual formalism for complex
85 | systems*](http://www.wisdom.weizmann.ac.il/~dharel/SCANNED.PAPERS/Statecharts.pdf)
86 | by David Harel (1987);
87 | * [UML state machines](https://en.wikipedia.org/wiki/UML_state_machine).
88 | --------------------------------------------------------------------------------
/docs/practice-in-programming.md:
--------------------------------------------------------------------------------
1 | ---
2 | status: Still in research (please don't share)
3 | ---
4 |
5 | # On the role of practice in programming
6 |
7 | ## Motivation
8 |
9 | Practice makes perfect, the saying goes. Yet very dedicated people appear to
10 | spend a lifetime programming without producing a masterpiece. Why is that?
11 | Perhaps the most common explanation is that the field is still young and we
12 | haven't figured out how to engineer things with the same degree of accuracy and
13 | predictability as other more established fields. A less common explanation,
14 | which I'd like to explore in this post, is that just because we go to work and
15 | program the whole day long, it doesn't mean that we are in fact practicing.
16 |
17 | ## Defining practice
18 |
19 | Mike Acton gave an [interview](https://youtu.be/qWJpI2adCcs?t=3506) where he
20 | said that practice starts *from scratch every time*, unlike a project at work or
21 | as a hobby which builds upon previous work.
22 |
23 | I suppose the key thing is that you redo some *specific* thing many times until
24 | you become really good at it. Most projects involve many parts, so it's unlikely
25 | that you are repeatedly doing some specific thing over and over again. This
26 | could explain why working on projects isn't practice.
27 |
28 | One exception could be if your project is relatively small and you are doing it
29 | over and over again, then that might qualify as practice.
30 |
31 | Mike gives the example of setting aside half an hour per day to practice,
32 | trying to implement an
33 | [Asteroids](https://en.wikipedia.org/wiki/Asteroids_(video_game)) clone. In the
34 | beginning you'll probably not get very far, throw it away, start from scratch;
35 | by day 300 you might be able to finish implementing the whole game in the
36 | allocated time. Perhaps Asteroids isn't the best thing to practice, but you get
37 | the idea.
38 |
39 | While on the topic of games and getting good at programming, it's interesting to
40 | note that John Carmack and the rest of id Software [developed 13 games in a
41 | year](https://youtu.be/IzqdZAYcwfY?t=540). Early id Software is an extreme
42 | example; most of us probably need to practice on something much smaller.
43 |
44 | ## Putting practice into practice
45 |
46 | Having established what we mean by practice, let's turn our attention to how to
47 | put practice into practice.
48 |
49 | Joe Armstrong is a good example. He
50 | [explains](https://vimeo.com/1344065#t=8m30s) that he often wrote a piece of
51 | code and the next day he threw it away and rewrote it from scratch. In the early
52 | days of Erlang it was possible to do a total rewrite of the whole language in
53 | less than a week. New language features were added in one work session; if you
54 | couldn't get the idea out of your brain and code it up in that time then you
55 | didn't do it, Joe
56 | [explained](https://dl.acm.org/action/downloadSupplement?doi=10.1145%2F1238844.1238850&file=m6-armstrong-h.mov)
57 | (17:10). In a later talk he elaborated,
58 | [saying](https://youtu.be/rQIE22e0cW8?t=3492):
59 |
60 | > "We need to break systems down into small understandable components with
61 | > message passing between them and with contracts describing what's going on
62 | > between them so we can understand them, otherwise we just won't be able to
63 | > make software that works.
I think the limit of human understandability is
64 | > something like 128KB of code in any language. So we really need to box things
65 | > down into small units of computation and formally verify them and the
66 | > protocols in particular."
67 |
68 | Chuck Moore
69 | [said](https://www.red-gate.com/simple-talk/opinion/geek-of-the-week/chuck-moore-geek-of-the-week/)
70 | something similar:
71 |
72 | > "Instead of being rewritten, software has features added. And becomes more
73 | > complex. So complex that no one dares change it, or improve it, for fear of
74 | > unintended consequences. But adding to it seems relatively safe. We need
75 | > dedicated programmers who commit their careers to single applications.
76 | > Rewriting them over and over until they're perfect." (2009)
77 |
78 | > "... Such people will never exist. The world is too full of more interesting
79 | > things to do. The only hope is to abandon complex software. Embrace simple.
80 | > Forget backward compatibility."
81 |
82 | * Both Joe and Chuck ask for simple systems, so that they can easily be
83 | rewritten (i.e. practiced on)
84 |
85 | - Chuck Moore reimplemented the same Forth many times, in fact Forth was
86 | designed to be easily reimplementable on new hardware (this was back when new
87 | CPUs had new instruction sets); he also iterated on the Forth itself (...,
88 | colorForth, what were the earlier iterations?)
89 |
90 | * OKAD, [VLSI](https://en.wikipedia.org/wiki/Very_Large_Scale_Integration)
91 | design tools, "I’ve spent more time with it than any other; have re-written it
92 | multiple times; and carried it to a satisfying level of maturity."
93 |
94 | * John McCarthy's Lisp with its meta-circular implementation?
95 |
96 | ## Practice and software engineering
97 |
98 | * We've defined practice, we've seen examples of people who appear to use it on
99 | a personal level, what about scaling it up to teams?
100 |
101 | - Can whole projects be simple?
Here are two Turing award winners who think so:
102 |
103 | - "At last, there breezed into my office the most senior manager of all, a
104 | general manager of our parent company, Andrew St. Johnston. I was surprised
105 | that he had even heard of me. "You know what went wrong?" he shouted -- he
106 | always shouted -- "You let your programmers do things which you yourself do
107 | not understand." I stared in astonishment. He was obviously out of touch with
108 | present day realities. How could one person ever understand the whole of a
109 | modern software product like the Elliott 503 Mark II software system? I
110 | realized later that he was absolutely right; he had diagnosed the true cause
111 | of the problem and he had planted the seed of its later solution." --
112 | The Emperor's Old Clothes, Tony Hoare (1980)
113 | https://dl.acm.org/doi/10.1145/1283920.1283936
114 |
115 | - "The belief that complex systems require armies of designers and programmers
116 | is wrong. A system that is not understood in its entirety, or at least to a
117 | significant degree of detail by a single individual, should probably not be
118 | built." -- [A Plea for Lean
119 | Software](https://people.inf.ethz.ch/wirth/Articles/LeanSoftware.pdf) by
120 | Niklaus Wirth (1995)
121 |
122 | - Joe's story about a new manager asking for somebody who can explain the whole
123 | system to him: "Does anybody understand the entire system? If so, please come
124 | and talk to me." Nobody put their hand up. "In the Erlang group, if somebody
125 | would have asked that question, several hands would have gone up."
126 | https://youtu.be/-I_jE0l7sYQ?t=1389
127 |
128 | ## Processes and tools that encourage practice?
130 |
131 | - Let's assume that Mike, Joe, Chuck, Tony and Niklaus are on to something and
132 | that it's in fact possible to *design for practice* even in team-sized
133 | projects (aka software development as opposed to programming)
134 |
135 | - What would processes and tools that encourage such development look like?
136 |
137 | ### The status quo
138 |
139 | - TDD: write a test that fails, write the simplest possible implementation that
140 | makes the test pass, then refactor *incrementally* until satisfactory while
141 | keeping the tests green.
142 | + start from scratch rather than refactor?
143 | + time limit (e.g. Joe's one working session)
144 | + size limit (e.g. Joe's 128KB?)
145 | + test-driven design: an easily testable system is a well-designed system
146 | + practice-driven design: an easily rewritable system is a well-designed system?
147 | * the fact that a system is easily and fully testable surely helps when
148 | rewriting from scratch, but it feels like there's more to it?
149 | * good spec / documentation / literate programming? Joe's "the research"
150 | * ability to "zoom" in and out on spec / docs? Refinement.
151 |
152 | - Refactor vs rewrite from scratch debate?
153 |
154 | - DIY vs 3rd party: http://ithare.com/overused-code-reuse/
155 | + https://lobste.rs/s/yubtob/build_vs_buy
156 | + https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-invented-here-syndrome/
157 | + https://eli.thegreenplace.net/2017/benefits-of-dependencies-in-software-projects-as-a-function-of-effort/
158 | + See chapter 4 of *Development and Deployment of Multiplayer Online Games,
159 | Vol. II* by Sergey Ignatchenko (2020).
160 |
161 | ### Possible tricks to steal
162 |
163 | - Parallel and independent development, c.f. Dave Snowden and [wisdom of the
164 | crowd](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd)
165 |
166 | - Encourage new team members to rewrite?
167 |
168 | - What would programming languages look like if we applied these principles?
Forth is a good example, are there others?
170 |
171 | - Forth, bounded by blocks: "Disk memory is divided into units called “blocks.”
172 | Each block holds 1,024 characters of source text or binary data, traditionally
173 | organized as 16 lines of 64 characters."
174 | https://www.forth.com/starting-forth/3-forth-editor-blocks-buffer/
175 |
176 | > There is a great similarity between colorForth and classic Forth: 1024-byte
177 | > blocks. Factoring source code into blocks is equivalent to creating paragraphs
178 | > in English. It breaks a wall of text into pieces that highlight related ideas.
179 | > Many Forth implementations abandoned this advantage and organized source into
180 | > files of arbitrary length. Bad idea. Hugely debated. In addition, the text in
181 | > a block can be displayed on a monitor without scrolling. A quantized unit of
182 | > source code all visible at once.
183 |
184 | - What about libraries? Can we design building blocks that allow us to build
185 | reliable, scalable and maintainable systems, in a way such that the building
186 | blocks can be understood and implemented by a single programmer in a day?
187 | Similar to how features were added to early Erlang, as mentioned above?
188 |
189 | - One of the goals of this repository is to try to identify those building
190 | blocks and understand them well enough, documenting what Joe
191 | [calls](https://youtu.be/h8nmzPh5Npg?t=1302) "the research", perhaps through
192 | several reimplementations from scratch, so that they can be implemented by
193 | others in a day.
194 |
195 | - More important to provide what Joe called "the research" than to provide a
196 | library?
197 |
198 | - "To gain experience, there's no substitute for one's own programming effort.
199 | Organizing a team into managers, designers, programmers, analysts and users is
200 | detrimental. All should participate (with differing degrees of emphasis) in
201 | all aspects of development.
In particular, everybody -- including managers --
202 | should also be product users for a time. This last measure is the best
203 | guarantee to correct mistakes and perhaps also eliminate redundancies." --
204 | Niklaus Wirth
205 |
206 | * "What I cannot create, I do not understand" -- Richard Feynman
207 |
208 | ## Contributing
209 |
210 | * Other examples of processes or tools that encourage practice?
211 |
212 | * Any references in the same general (or completely opposite) direction would be
213 | appreciated!
214 |
215 | - About Rich Hickey (creator of Clojure), some interview probably with somebody
216 | else from Cognitect (half?) jokingly: "Rich doesn't write programs longer than
217 | 1000 lines" (I cannot find the reference, I think it was a Clojure meetup in
218 | London with a discussion after the talk, perhaps on Simulant or REPL-driven
219 | development?)
220 |
221 | ## See also
222 |
223 | * [Coding dojos](https://codingdojo.org/practices/WhatIsCodingDojo/) are spaces
224 | specifically designed for *practicing*;
225 | * [Are We Really
226 | Engineers?](https://www.hillelwayne.com/post/are-we-really-engineers/);
227 | * Mike Acton's talk on [Data-Oriented Design](https://youtube.com/watch?v=rX0ItVEVjHc).
228 | --------------------------------------------------------------------------------
/docs/specification-language.md:
--------------------------------------------------------------------------------
1 | ---
2 | status: research (please don't share)
3 | ---
4 |
5 | # Specification language
6 |
7 | ## Motivation
8 |
9 | * Joe on languages vs specification languages
10 | - https://youtu.be/ed7A7r6DBsM?t=632
11 |
12 | * API vs protocol
13 | - Martin Thompson: https://youtu.be/bzDAYlpSbrM?t=837
14 |
15 | * Scaffolding
16 | - testing
17 | - docs
18 | + glossary: https://youtu.be/fTtnx1AAJ-c?t=518
19 | - code generation
20 | - program inference?
21 | 22 | ## Joe's Universal Binary Format (UBF) 23 | 24 | * https://erlang.org/workshop/2002/Armstrong.pdf 25 | 26 | ## Cleanroom software engineering 27 | 28 | * blackbox vs statebox vs clearbox specification 29 | - https://trace.tennessee.edu/utk_harlan/16/ 30 | - https://trace.tennessee.edu/utk_harlan/12/ 31 | * trace specifications 32 | * usage model/ operational profile 33 | - users/actors are part of the specification 34 | 35 | ## Dave Snowden 36 | 37 | * connections are more important than components in complex systems 38 | * "people are objects too" 39 | * break down a concept until there's no disagreement about its meaning 40 | 41 | ## Some of my own ideas 42 | 43 | * Mostly stealing from all above and gluing things together 44 | * Informal escape hatches, but still structured and checked 45 | - useful for documentation and program inference? 46 | 47 | ## Contributing 48 | 49 | * Thoughts? 50 | 51 | ## See also 52 | 53 | * https://github.com/stevana/bits-and-bobs 54 | * https://github.com/stevana/spec-lang 55 | -------------------------------------------------------------------------------- /talks/zurihac-2023/Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: all 2 | 3 | all: 4 | pandoc --from markdown --to beamer -o slides_stevan.pdf slides_stevan.md 5 | -------------------------------------------------------------------------------- /talks/zurihac-2023/abstract.txt: -------------------------------------------------------------------------------- 1 | How can we better build reliable, scalable and maintainable computer systems? 2 | 3 | In this talk I'd like to argue that one of the main building blocks ought to be 4 | the humble state machine of type: 5 | 6 | SM state input output = input -> state -> (state, output) 7 | 8 | I'll back up my argument with examples of state-of-the-art techniques from 9 | industry that involve verification, availability and fault-tolerance using state 10 | machines. 
11 |
12 | I'll then sketch the other building blocks that I think we need in order to
13 | "glue" our state machines together, as well as means to inspect, debug, upgrade
14 | and scale running systems.
15 |
16 | My hope is that by the end of the talk I'll have managed to make you think
17 | differently about some of the many aspects involved in developing, deploying and
18 | maintaining distributed systems.
19 | --------------------------------------------------------------------------------
/talks/zurihac-2023/slides_stevan.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Distributed systems and state machines
3 | date: 11th of June, 2023
4 | author: Stevan Andjelkovic
5 | institute: ZuriHac lightning talk
6 | header-includes:
7 | - \definecolor{links}{HTML}{2A1B81}
8 | - \hypersetup{colorlinks,linkcolor=,urlcolor=links}
9 | ---
10 |
11 | # Motivation
12 |
13 | 1. There are many implicit state machines in distributed systems;
14 | 2. By making them explicit a bunch of things become easier!
15 |
16 | # Implicit state machines
17 |
18 | \note{
19 | * Distributed systems are hard!
20 | * One of the reasons: many failure cases of networking
21 | * Good test-suite helps a lot
22 | * All successful means of verification use state machines!
23 | }
24 |
25 | ## Verification
26 |
27 | * Property-based testing using state machines
28 | * Kyle "aphyr" Kingsbury's Jepsen testing tool
29 | * FoundationDB's simulation testing
30 | * Leslie Lamport's model checking of TLA+
31 | * Martin Kleppmann's formal verification work in Isabelle
32 |
33 | ## Development
34 |
35 | * Joe Armstrong on Erlang's
36 | [behaviours](https://github.com/stevana/armstrong-distributed-systems/blob/main/docs/erlang-is-not-about.md)
37 | (`gen_server` and `gen_statem`)
38 | * Fault-tolerance via replicated state machines
39 | * Martin Thompson's work on the LMAX disruptor and Aeron
40 |
41 | ## Theory
42 |
43 | * Yuri Gurevich's generalisation of the Church-Turing thesis
44 |
45 | # Explicit state machines
46 |
47 | * Some ideas of what we could do if our state machines were explicit
48 | * Rapid overview; slides and links to longer explanations are available
49 |
50 | \note{
51 | * Simulation testing
52 | * Time travelling debugger
53 | * Arrow syntax for state machines and hot-swappable code
54 | * Pipelines of state machines
55 | * Modular state machines
56 | * Protocols between communicating state machines
57 | * Supervisor trees
58 | }
59 |
60 | # Simulation testing
61 |
62 | \note{
63 | * System consisting of several state machines communicating with each other over the network
64 | * Note that networking is abstracted out from the state machine definition
65 | * Implement networking twice: once using real networking primitives and once using a fake in-memory network
66 | }
67 |
68 | * `SM state input output = input -> state -> (state, output)`
69 | * Networking interface
70 | * Property-based testing, fault injection, discrete-event simulation
71 |
72 | # Time-travelling debugger
73 |
74 | * Record all inputs
75 | * Assuming the state machine is deterministic we can recompute the state (and logs!)
from the inputs
76 | * This gives us a way of replaying the execution of a state machine and seeing
77 | how its state changes over time
78 |
79 |
80 | # Arrow syntax and hot-swappable code
81 |
82 | * `instance Arrow (SM state)`
83 | * Conal Elliott's compiling to categories (`Arrow` modulo `arr`)
84 | * CCCs are first-order, i.e. easily serialisable
85 | * So we can send them over the network and upgrade running state machines without downtime!
86 |
87 | # Pipelining of state machines
88 |
89 | * Serving a request typically involves several stages, e.g.:
90 | 1. Read bytes from socket
91 | 2. Parse bytes into structured data
92 | 3. Validate data
93 | 4. Process the data using our business logic and produce some response
94 | 5. Serialise response into bytes
95 | 6. Write response bytes back to socket
96 | * What if each such stage was run on a separate CPU/core? Pipelining!
97 | * Monitor queue lengths of each stage and shard if stages are slow
98 | * What's the relation to dataflow and FRP?
99 |
100 | # Modular state machines
101 |
102 | * Pipelining is *horizontal* composition: the outputs of one are fed into another
103 | * What about *vertical* composition?
104 | * State machines inside state machines?
105 | * Hierarchical states?
106 | * Stack of states / pushdown automaton?
107 |
108 | # Protocols between communicating state machines
109 |
110 | * If we think of state machines as black boxes which provide some API via their inputs
111 | * Then protocols between two black boxes are sequences of input-output pairs
112 | * These too can be described using state machines!
113 |
114 | # Supervisor trees and deployment
115 |
116 | * Tree with supervisors in the nodes and state machines in the leaves
117 | * Supervisors' job is to do error handling if one of their children fails
118 | * Jim Gray's idea of failing fast
119 | * Supervisors contain restart strategies, i.e.
in which order to restart the 120 | children if one fails 121 | * Use restart strategy as a means of deployment (start up order) 122 | 123 | # Contributing 124 | 125 | * For more, also see: 126 | + \url{https://github.com/stevana/armstrong-distributed-systems} 127 | + \url{https://github.com/stevana/property-based-testing-stateful-systems-tutorial} 128 | + \url{https://github.com/stevana/hot-swapping-state-machines} 129 | + \url{https://github.com/stevana/smarrow-lang} 130 | + \url{https://github.com/stevana/pipelined-state-machines} 131 | + \url{https://github.com/stevana/elastically-scalable-thread-pools} 132 | + \url{https://github.com/stevana/supervised-state-machines} 133 | * If you have any questions or comments, feel free to get in touch! 134 | * Thanks for listening! 135 | -------------------------------------------------------------------------------- /talks/zurihac-2023/slides_stevan.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stevana/armstrong-distributed-systems/434a2999edd36ba1cedafb381aa4dec246a757b1/talks/zurihac-2023/slides_stevan.pdf --------------------------------------------------------------------------------