├── .gitignore
├── README.md
├── docs
│   ├── domain-specific-debuggers.md
│   ├── erlang-is-not-about.md
│   ├── modular-state-machines.md
│   ├── practice-in-programming.md
│   └── specification-language.md
└── talks
    └── zurihac-2023
        ├── Makefile
        ├── abstract.txt
        ├── slides_stevan.md
        └── slides_stevan.pdf


/.gitignore:
--------------------------------------------------------------------------------
 1 | # haskell
 2 | dist-newstyle/
 3 | cabal.project.local*
 4 | 
 5 | # emacs
 6 | .\#*
 7 | *~
 8 | \#*\#
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Armstrong distributed systems
 2 | 
 3 | How do we build reliable, scalable and maintainable computer systems?
 4 | 
 5 | This repository contains notes on how, I think, we can improve on the state of
 6 | development, documentation, testing, deployment, observability, debugging, and
 7 | upgrading of distributed systems. Most of the ideas are stolen from others, many
 8 | from Erlang and Joe Armstrong. Over time I hope to turn this into a more
 9 | coherent text; for now, think of it as a crude blog or some basic scaffolding
10 | for me to hang my thoughts on.
11 | 
12 | If any of this interests you as well, please do get in touch -- one of the
13 | reasons I'm writing this down is to find people to collaborate with.
14 | 
15 | ## Development
16 | 
17 | * [Erlang's not about lightweight processes and message
18 |   passing...](docs/erlang-is-not-about.md)
19 | 
20 | * [Implementing
21 |   behaviours](https://github.com/stevana/armstrong-distributed-systems/blob/implementing-behaviours/docs/implementing-behaviours.md)
22 | 
23 | * [Modular state machines](docs/modular-state-machines.md)
24 | 
25 | * Implementing `gen_event` using the LMAX disruptor
26 |   - https://github.com/stevana/pipelined-state-machines
27 |   - Shard on: "people, stuff or deals",
28 |     [says](https://youtu.be/1KRYH75wgy4?t=2781) Martin Thompson. 
29 | 
30 | * [State machines with async I/O](https://github.com/stevana/coroutine-state-machines)
31 | 
32 | * Persisted/`mmap`ped lock-free concurrent data structures
33 |   - [Working with binary data](https://github.com/stevana/bits-and-bobs)
34 |   - bytebuffer
35 |   - journal
36 |   - hashmap
37 |   - arena allocator
38 |   - buffered actions
39 |   - [Efficient Tree-Traversals: Reconciling Parallelism and Dense
40 |     Data Representations](https://arxiv.org/pdf/2107.00522.pdf)
41 | 
42 | * [On the role of practice in programming](docs/practice-in-programming.md)
43 | 
44 | ## Documentation
45 | 
46 | * [Specification language](docs/specification-language.md)
47 |   - Interfaces
48 |   - Messages
49 |   - Compression
50 |   - Protocols
51 |   - Usage model or operational profile
52 | 
53 | * Joe's idea of ["the bigger picture"](https://youtu.be/h8nmzPh5Npg?t=1220), in
54 |   particular the "research" part: not just the end result (code) but how you got
55 |   to the result.
56 | 
57 | ## Testing
58 | 
59 | * [Simulation testing using state
60 |   machines](https://github.com/stevana/property-based-testing-stateful-systems-tutorial)
61 | 
62 | * Usage model or operational profile
63 | 
64 | * Load/soak testing
65 | 
66 | ## Deployment
67 | 
68 | * [Deploying and restarting state machines using supervisor
69 |   trees](https://github.com/stevana/supervised-state-machines)
70 | 
71 | * [Elastically scalable thread
72 |   pools](https://github.com/stevana/elastically-scalable-thread-pools)
73 | 
74 | ## Observability
75 | 
76 | * [Visualising data structures over time using
77 |   SVG](https://github.com/stevana/svg-viewer-in-svg)
78 | 
79 | ## Debugging
80 | 
81 | * [Domain-specific debuggers](docs/domain-specific-debuggers.md)
82 | 
83 | ## Upgrading
84 | 
85 | * [Hot-code swapping à la Erlang with `Arrow`-based state
86 |   machines](https://github.com/stevana/hot-swapping-state-machines)
87 | 
88 | * [`smarrow-lang`](https://github.com/stevana/smarrow-lang), an experimental
89 |   programming language where programs are state machines 
expressed in arrow
90 |   notation to allow easy hot-code swapping
91 | 
92 | * Version everything
--------------------------------------------------------------------------------
/docs/domain-specific-debuggers.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | status: research (please don't share)
 3 | ---
 4 | 
 5 | # Domain-specific debuggers
 6 | 
 7 | ## Motivation
 8 | 
 9 | * General purpose debuggers such as [`gdb`](https://www.sourceware.org/gdb/) are underused
10 | 
11 | * Theory: people who use general purpose debuggers also care about memory layout, e.g.:
12 | 
13 |   - John Carmack on using
14 |     [debuggers](https://www.youtube.com/watch?v=tzr7hRXcwkw) and his old
15 |     [.plan](https://github.com/oliverbenns/john-carmack-plan/blob/53c00aedaeeb23dee06b28ba985c3b3b4d61107c/archive/1998-10-14.md)
16 |     (1998);
17 | 
18 |   - Martin Thompson also [says](https://youtu.be/1KRYH75wgy4?t=411) "step
19 |     through your code using a debugger".
20 | 
21 | * Possible fix: domain-specific debuggers, which focus on displaying your
22 |   application state at the level of abstraction you think in, rather than
23 |   how the programming language you are using happens to lay it out in
24 |   memory
25 | 
26 | ## Idea
27 | 
28 | * Assumptions: determinism, state machine
29 | * Record inputs (and states) in a circular buffer, dump to disk/SQLite db on error
30 | 
31 | ## Examples
32 | 
33 | ### Distributed systems
34 | 
35 | * TigerBeetle's demo https://youtu.be/w3WYdYyjek4?t=3175
36 | * https://spritely.institute/news/introducing-a-distributed-debugger-for-goblins-with-time-travel.html
37 | 
38 | ### Games
39 | 
40 | * [Tomorrow Corporation Tech Demo](https://www.youtube.com/watch?v=72y2EC5fkcE)
41 | 
42 | ## Optimisations
43 | 
44 | * Avoid storing the state in the trace (if deterministic)
45 | * Avoid infinite traces via state snapshots
46 | * Turn off logging (if deterministic)
47 | * Audit trails
48 | 
49 | ## Contributing
50 | 
51 | * 
General purpose formats for states, inputs and outputs against which generic
52 |   debuggers can be written? Structured JSON? Binary?
53 | 
54 | ## See also
55 | 
56 | * The history of [time traveling
57 |   debuggers](http://jakob.engbloms.se/archives/1564)
58 | * https://werat.dev/blog/what-a-good-debugger-can-do/
59 | 
60 | * Mozilla's [`rr` debugger](https://github.com/rr-debugger/rr)
61 | 
62 | * [Visualising application state](https://youtu.be/-HhI3BEIqWw?t=1199)
63 | 
64 | * Command sourcing
65 | 
66 | * Jamie Brandon's post [*Local state is
67 |   harmful*](https://www.scattered-thoughts.net/writing/local-state-is-harmful/)
68 |   (2014)
--------------------------------------------------------------------------------
/docs/erlang-is-not-about.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | date: 2023-01-18
 3 | ---
 4 | 
 5 | # Erlang's not about lightweight processes and message passing...
 6 | 
 7 | I used to think that the big idea of Erlang was its lightweight processes and
 8 | message passing. Over the last couple of years I've realised that there's a
 9 | bigger insight to be had, and in this post I'd like to share it with you.
10 | 
11 | ## Background
12 | 
13 | Erlang has an interesting history. If I understand things correctly, it started
14 | off as a Prolog library for building reliable distributed systems, morphed into
15 | a Prolog dialect, and finally became a language in its own right.
16 | 
17 | The goal seems always to have been to solve the problem of building reliable
18 | distributed systems. Erlang was developed at Ericsson and used to program their
19 | telephone switches. This was sometime in the 80s and 90s, before internet use
20 | became widespread. I suppose they were already dealing with "internet scale"
21 | traffic, i.e. hundreds of millions of users, with stricter SLAs than most
22 | internet services provide today. So in a sense they were ahead of their time. 
23 | 
24 | In 1998 Ericsson decided to ban all use of Erlang[^0]. The people responsible
25 | for developing it argued that if it was going to be banned, then it might as
26 | well be open sourced. Ericsson agreed, and shortly after most of the team that
27 | created Erlang quit and started their own company.
28 | 
29 | One of these people was Joe Armstrong, who was also one of the main people
30 | behind the design and implementation of Erlang. The company was called Bluetail
31 | and it got bought up a couple of times, but in the end Joe got fired in 2002.
32 | 
33 | Shortly after, still in 2002, Joe started writing his PhD thesis at the Swedish
34 | Institute of Computer Science (SICS). Joe was born in 1950, so he was probably
35 | 52 years old at this point. The topic of the thesis is *Making reliable
36 | distributed systems in the presence of software errors* and it was finished the
37 | following year, in 2003.
38 | 
39 | It's quite an unusual thesis in many ways. For starters, most theses are written
40 | by people in their twenties with zero experience of practical applications,
41 | whereas Joe had been working professionally on this topic since the 80s, i.e.
42 | for about twenty years. The thesis contains no math or theory; it's
43 | simply a presentation of the ideas that underpin Erlang and how they were used
44 | to achieve the original goal of building reliable distributed systems.
45 | 
46 | I highly recommend reading his
47 | [thesis](http://kth.diva-portal.org/smash/record.jsf?pid=diva2%3A9492&dswid=-1166)
48 | and forming your own opinion, but to me it's clear that the big idea there isn't
49 | lightweight processes[^1] and message passing, but rather the generic components
50 | which in Erlang are called *behaviours*.
51 | 
52 | ## Behaviours
53 | 
54 | I'll first explain in more detail what behaviours are, and then I'll come back
55 | to the point that they are more important than the idea of lightweight processes. 
56 | 
57 | Erlang behaviours are like interfaces in, say, Java or Go: a collection of
58 | type signatures which can have multiple implementations, and once the programmer
59 | provides such an implementation they get access to functions written against
60 | that interface. To make this more concrete, here's a contrived example in Go:
61 | 
62 | ```go
63 | // The interface.
64 | type HasName interface {
65 | 	Name() string
66 | }
67 | 
68 | // A generic function written against the interface.
69 | func Greet(n HasName) {
70 | 	fmt.Printf("Hello %s!\n", n.Name())
71 | }
72 | 
73 | // First implementation of the interface.
74 | type Joe struct{}
75 | 
76 | func (_ *Joe) Name() string {
77 | 	return "Joe"
78 | }
79 | 
80 | // Second implementation of the interface.
81 | type Mike struct{}
82 | 
83 | func (_ *Mike) Name() string {
84 | 	return "Mike"
85 | }
86 | 
87 | func main() {
88 | 	joe := &Joe{}
89 | 	mike := &Mike{}
90 | 	Greet(mike)
91 | 	Greet(joe)
92 | }
93 | ```
94 | 
95 | Running the above program will display:
96 | 
97 | ```
98 | Hello Mike!
99 | Hello Joe!
100 | ```
101 | 
102 | This hopefully illustrates how `Greet` is generic in, or parametrised by, the
103 | interface `HasName`.
104 | 
105 | ### Generic server behaviour
106 | 
107 | Next let's have a look at a more complicated example in Erlang, taken from Joe's
108 | thesis (p. 136). It's a key-value store where we can `store` a key-value pair or
109 | `lookup` the value of a key; the `handle_call` part is the most interesting:
110 | 
111 | ```erlang
112 | -module(kv).
113 | -behaviour(gen_server).
114 | 
115 | -export([start/0, stop/0, lookup/1, store/2]).
116 | 
117 | -export([init/1, handle_call/3, handle_cast/2, terminate/2]).
118 | 
119 | start() ->
120 |     gen_server:start_link({local,kv},kv,arg1,[]).
121 | 
122 | stop() -> gen_server:cast(kv, stop).
123 | 
124 | init(arg1) ->
125 |     io:format("Key-Value server starting~n"),
126 |     {ok, dict:new()}. 
127 | 
128 | store(Key, Val) ->
129 |     gen_server:call(kv, {store, Key, Val}).
130 | 
131 | lookup(Key) -> gen_server:call(kv, {lookup, Key}).
132 | 
133 | handle_call({store, Key, Val}, From, Dict) ->
134 |     Dict1 = dict:store(Key, Val, Dict),
135 |     {reply, ack, Dict1};
136 | handle_call({lookup, crash}, From, Dict) ->
137 |     1/0; %% <- deliberate error :-)
138 | handle_call({lookup, Key}, From, Dict) ->
139 |     {reply, dict:find(Key, Dict), Dict}.
140 | 
141 | handle_cast(stop, Dict) -> {stop, normal, Dict}.
142 | 
143 | terminate(Reason, Dict) ->
144 |     io:format("K-V server terminating~n").
145 | ```
146 | 
147 | This is an implementation of the `gen_server` behaviour/interface. Notice how
148 | `handle_call` updates the state (`Dict`) in the case of a `store`, and `lookup`s
149 | the key in the state. Once `gen_server` is given this implementation it provides
150 | a server which can handle concurrent `store` and `lookup` requests, just as
151 | `Greet` provided the greeting functionality.
152 | 
153 | At this point you might be thinking "OK, so what? Lots of programming languages
154 | have interfaces...". That's true, but notice how `handle_call` is completely
155 | sequential, i.e. all concurrency is hidden away in the generic `gen_server`
156 | component. "Yeah, but that's just good engineering practice which can be done in
157 | any language", you say. That's true as well, but the thesis pushes this idea
158 | quite far. It identifies six behaviours: `gen_server`, `gen_event`, `gen_fsm`,
159 | `supervisor`, `application`, and `release`, and then says that these are enough
160 | to build reliable distributed systems. As a case study Joe uses one of
161 | Ericsson's telephone switches (p. 157):
162 | 
163 | > When we look at the AXD301 project in chapter 8, we will see that there were
164 | > 122 instances of gen_server, 36 instances of gen_event and 10 instances of
165 | > gen_fsm. There were 20 supervisors and 6 applications. 
All this is packaged
166 | > into one release.
167 | 
168 | Joe gives several arguments for why behaviours should be used (pp. 157-158):
169 | 
170 | 1. The application programmer only has to provide the part of the code which
171 |    defines the *semantics* (or "business logic") of their problem, while the
172 |    *infrastructure* code is provided automatically by the behaviour;
173 | 
174 | 2. The application programmer writes sequential code, all concurrency is hidden
175 |    away in the behaviour;
176 | 
177 | 3. Behaviours are written by experts, based on years of experience, and
178 |    represent "best practices";
179 | 
180 | 4. It's easier for new team members to get started: the business logic is
181 |    sequential and has a structure they might have seen before elsewhere;
182 | 
183 | 5. If whole systems are implemented by reusing a small set of behaviours, then
184 |    as behaviour implementations improve the whole systems will improve without
185 |    requiring any code changes;
186 | 
187 | 6. Sticking to only using behaviours enforces structure, which in turn makes
188 |    testing and formal verification much easier.
189 | 
190 | We'll come back to this last point about testing later.
191 | 
192 | ### Event manager behaviour
193 | 
194 | Let's first come back to the other behaviours we listed above. We looked at
195 | `gen_server`, but what are the others for? There's `gen_event`, a generic event
196 | manager, which lets you register event handlers that are then run when the
197 | event manager receives messages associated with them. Joe says this is useful
198 | for, e.g., error logging and gives the following example of a simple
199 | logger (p. 142):
200 | 
201 | ```erlang
202 | -module(simple_logger).
203 | -behaviour(gen_event).
204 | 
205 | -export([start/0, stop/0, log/1, report/0]).
206 | 
207 | -export([init/1, terminate/2,
208 |          handle_event/2, handle_call/2]).
209 | 
210 | -define(NAME, my_simple_event_logger). 
211 | 
212 | start() ->
213 |     case gen_event:start_link({local, ?NAME}) of
214 |         Ret = {ok, Pid} ->
215 |             gen_event:add_handler(?NAME,?MODULE,arg1),
216 |             Ret;
217 |         Other ->
218 |             Other
219 |     end.
220 | 
221 | stop() -> gen_event:stop(?NAME).
222 | 
223 | log(E) -> gen_event:notify(?NAME, {log, E}).
224 | 
225 | report() ->
226 |     gen_event:call(?NAME, ?MODULE, report).
227 | 
228 | init(arg1) ->
229 |     io:format("Logger starting~n"),
230 |     {ok, []}.
231 | 
232 | handle_event({log, E}, S) -> {ok, trim([E|S])}.
233 | 
234 | handle_call(report, S) -> {ok, S, S}.
235 | 
236 | terminate(stop, _) -> true.
237 | 
238 | trim([X1,X2,X3,X4,X5|_]) -> [X1,X2,X3,X4,X5];
239 | trim(L) -> L.
240 | ```
241 | 
242 | The interesting parts are `handle_event`, `trim` and `report`. Together they let
243 | the user log, keep track of, and display the last five error messages.
244 | 
245 | ### State machine behaviour
246 | 
247 | The `gen_fsm` behaviour has been renamed to `gen_statem` (for state machine)
248 | since the thesis was written. It's very similar to `gen_server`, but more geared
249 | towards implementing protocols, which often are specified as state machines. I
250 | believe any `gen_server` can be implemented as a `gen_statem` and vice versa, so
251 | we won't go into the details of `gen_statem`.
252 | 
253 | ### Supervisor behaviour
254 | 
255 | The next interesting behaviour is `supervisor`. Supervisors are processes whose
256 | sole job is to make sure that other processes are healthy and doing their job.
257 | If a supervised process fails then the supervisor can restart it according
258 | to some predefined strategy. Here's an example due to Joe (p. 148):
259 | 
260 | ```erlang
261 | -module(simple_sup).
262 | -behaviour(supervisor).
263 | 
264 | -export([start/0, init/1]).
265 | 
266 | start() ->
267 |     supervisor:start_link({local, simple_supervisor},
268 |                           ?MODULE, nil). 
269 | 
270 | init(_) ->
271 |     {ok,
272 |      {{one_for_one, 5, 1000},
273 |       [
274 |        {packet,
275 |         {packet_assembler, start, []},
276 |         permanent, 500, worker, [packet_assembler]},
277 |        {server,
278 |         {kv, start, []},
279 |         permanent, 500, worker, [kv]},
280 |        {logger,
281 |         {simple_logger, start, []},
282 |         permanent, 500, worker, [simple_logger]}]}}.
283 | ```
284 | 
285 | The `{one_for_one, 5, 1000}` part is the restart strategy. It says that if one
286 | of the supervised processes (`packet_assembler`, `kv`, and `simple_logger`)
287 | fails, then only the failing process is restarted (`one_for_one`). If the
288 | supervisor needs to restart more than `5` times in `1000` seconds then the
289 | supervisor itself should fail.
290 | 
291 | The `permanent, 500, worker` part means that this is a worker process which
292 | should be permanently kept alive, and it's given 500 milliseconds to gracefully
293 | stop what it's doing in case the supervisor wants to restart it.
294 | 
295 | "Why would the supervisor want to restart it if it's not dead already?", one
296 | might wonder. Well, there are other restart strategies than `one_for_one`. With
297 | `one_for_all`, for example, if one process fails then the supervisor restarts
298 | all of its children.
299 | 
300 | If we also consider that supervisors can supervise supervisors, which are not
301 | necessarily running on the same computer, then I hope you get an idea of how
302 | powerful this behaviour can be. And, no, this isn't "just Kubernetes": it
303 | operates at the thread/lightweight-process level, not the Docker container level.
304 | 
305 | The idea for supervisors and their restart strategies comes from the observation
306 | that a restart often appears to fix the problem, as captured in the *Have You
307 | Tried Turning It Off And On Again?* sketches from the IT Crowd. 
308 | 
309 | Knowing that failing processes will get restarted, coupled with Jim Gray's idea
310 | of failing fast -- that is, either produce the output according to the
311 | specification, or signal failure and stop operating -- leads to Joe's slogan:
312 | "Let it crash!" (p. 107). Another way to think of it is that a program should
313 | only express its "happy path"; should anything go wrong along the way it should
314 | crash, rather than trying to be clever and fix the problem (potentially making
315 | it worse), and another program higher up the supervisor tree will handle it.
316 | 
317 | Supervisors and the "let it crash" philosophy appear to produce reliable
318 | systems. Joe uses the Ericsson AXD301 telephone switch example again (p. 191):
319 | 
320 | > Evidence for the long-term operational stability of the system had also not
321 | > been collected in any systematic way. For the Ericsson AXD301 the only
322 | > information on the long-term stability of the system came from a power-point
323 | > presentation showing some figures claiming that a major customer had run an 11
324 | > node system with a 99.9999999% reliability, though how these figure had been
325 | > obtained was not documented.
326 | 
327 | To put this in perspective, five nines (99.999%) reliability is considered good
328 | (5.26 minutes of downtime per year). "59% of Fortune 500 companies experience a
329 | minimum of 1.6 hours of downtime per week", according to some
330 | [report](https://courseware.cutm.ac.in/wp-content/uploads/2020/06/Assessing-the-Financial-Impact-of-Downtime-UK.pdf)
331 | from a biased company. Notice per *year* vs per *week*, but as we don't know how
332 | either reliability number was obtained, it's probably safe to assume that the
333 | truth is somewhere in the middle -- still a big difference, but not 31.56
334 | milliseconds (nine nines) of downtime per year vs 1.6 hours of downtime per
335 | week. 
336 | 
337 | ### Application and release behaviours
338 | 
339 | I'm not sure if `application` and `release` technically are behaviours, i.e.
340 | interfaces. They are part of the same chapter as the other behaviours in the
341 | thesis, though, and they do provide a clear structure, which is a trait of the
342 | other behaviours, so we'll include them in the discussion.
343 | 
344 | So far we've presented behaviours from the bottom up. We started with "worker"
345 | behaviours `gen_server`, `gen_statem` and `gen_event`, which together capture
346 | the semantics of our problem. We then saw how we can define `supervisor` trees,
347 | whose children are other supervisor trees or workers, to deal with failures and
348 | restarts.
349 | 
350 | The next level up is an `application`, which consists of a supervisor tree
351 | together with everything else we need to deliver a particular application.
352 | 
353 | A system can consist of several `application`s, and that's where the final
354 | "behaviour" comes in: a `release` packages up one or more applications.
355 | Releases also contain code to handle upgrades. If an upgrade fails, it must be
356 | possible to roll back to the previous stable state.
357 | 
358 | ## How behaviours can be implemented
359 | 
360 | I hope that by now I've managed to convince you that it's not actually the
361 | lightweight processes and message passing by themselves that make Erlang great
362 | for building reliable systems.
363 | 
364 | At best one might claim that lightweight processes and supervisors are the key
365 | mechanisms at play[^2], but I think it would be more honest to recognise the
366 | structure that behaviours provide and how that ultimately leads to reliable
367 | software.
368 | 
369 | I've not come across any other language, library, or framework which provides
370 | such relatively simple building blocks that compose into big systems like the
371 | AXD301 ("over a million lines of Erlang code", p. 167). 
372 | 
373 | This raises the question: why aren't language and library designers stealing the
374 | structure behind Erlang's behaviours, rather than copying the ideas of
375 | lightweight processes and message passing?
376 | 
377 | Let's take a step back. We said earlier that behaviours are interfaces and many
378 | programming languages have interfaces. How would we go about starting to
379 | implement behaviours in other languages?
380 | 
381 | Let's start with `gen_server`. I like to think of its interface signature as:
382 | 
383 | ```haskell
384 | Input -> State -> (State, Output)
385 | ```
386 | 
387 | That is, it takes some input and its current state, and produces a pair of the
388 | updated state and an output.
389 | 
390 | How do we turn this sequential signature into something that can handle
391 | concurrent requests? One way would be to fire up an HTTP server which transforms
392 | requests into `Input`s and puts them on a queue, have an event loop which pops
393 | inputs from the queue and feeds them to the sequential implementation, and then
394 | writes the output back in the client response. It wouldn't be difficult to
395 | generalise this to handle multiple `gen_server`s at the same time, by giving
396 | each a name and letting the request include the name in addition to the input.
397 | 
398 | `gen_event` could be implemented by allowing registration of callbacks for
399 | certain types of event on the queue.
400 | 
401 | `supervisor`s are more interesting. One simple way to think of it is: when we
402 | feed the `gen_server` function the next input from the queue, we wrap that call
403 | in an exception handler, and should it throw we notify its supervisor. It gets a
404 | bit more complicated if the supervisor is not running on the same computer as
405 | the `gen_server`. 
406 | 
407 | I haven't thought much about `application`s and `release`s yet, but given that
408 | configuration, deployment and upgrades are difficult problems they seem
409 | important.
410 | 
411 | ## Correctness of behaviours
412 | 
413 | Writing a post solely about stealing from Erlang doesn't seem fair, even though
414 | it's the right thing to do, so I'd like to finish off with how we can build upon
415 | the insights of Joe and the Erlang community.
416 | 
417 | I've been interested in testing for a while now. Most recently I've been
418 | looking into [simulation
419 | testing](https://github.com/stevana/property-based-testing-stateful-systems-tutorial)
420 | distributed systems à la
421 | [FoundationDB](https://www.youtube.com/watch?v=4fFDFbi3toc).
422 | 
423 | Simulation testing in a nutshell is running your system in a simulated world,
424 | where the simulation has full control over which messages get sent when over the
425 | network.
426 | 
427 | FoundationDB built their own programming language, a dialect of C++ with
428 | actors, in order to do this simulation testing. Our team seemed to be able to
429 | get quite far merely using state machines of type:
430 | 
431 | ```haskell
432 | Input -> State -> (State, [Output])
433 | ```
434 | 
435 | where `[Output]` is a sequence of outputs.
436 | 
437 | The idea is that the simulator keeps a priority queue of messages sorted by
438 | their arrival time: it pops a message, advances the clock to the arrival time
439 | of that message, feeds the message to the receiving state machine, generates
440 | new arrival times for all output messages and puts them back into the priority
441 | queue, rinse and repeat. As long as everything is deterministic and the arrival
442 | times are generated using a seed, we can explore many different
443 | interleavings and get reproducible failures. 
It's also much faster than Jepsen,
444 | because messaging is done in-memory and we advance the clock to each arrival
445 | time, thereby triggering any timeouts without having to wait for them.
446 | 
447 | We used to say that programs of this state machine type were written in
448 | "network normal form", and conjectured that every program which can receive and
449 | send stuff over the network can be refactored into this shape[^3]. Even if we
450 | had a proof, "network normal form" always felt a bit arbitrary. But then I read
451 | Joe's thesis and realised that `gen_server` and `gen_statem` basically have the
452 | same type, so I stopped being concerned about it. I think that if a structure
453 | is found to be useful by different people, then it's usually a sign that it
454 | isn't arbitrary.
455 | 
456 | Anyway, in at least one of Joe's [talks](https://youtu.be/cNICGEwmXLU?t=1439)
457 | he mentions how difficult it is to correctly implement distributed leader
458 | election.
459 | 
460 | I believe this is a problem that would be greatly simplified by having access to
461 | a simulator, a bit like how having access to a wind tunnel makes building an
462 | airplane easier. Both let you test your system under extreme conditions, such
463 | as unreliable networking or power loss, before they happen in "production".
464 | Furthermore, this simulator can be generic in, or parametrised by, behaviours,
465 | which means that the developer gets it for free while the complexity of the
466 | simulator is hidden away, just like the concurrent code of `gen_server`! 
467 | 
468 | FoundationDB is a good example of simulation testing working, as witnessed by
469 | this [tweet](https://twitter.com/aphyr/status/405017101804396546) where somebody
470 | asked Kyle "aphyr" Kingsbury to Jepsen test FoundationDB:
471 | 
472 | > “haven’t tested foundation[db] in part because their testing appears to be
473 | > waaaay more rigorous than mine.”
474 | 
475 | Formal verification is also made easier if the program is written as a state
476 | machine. Basically all of Lamport's model checking
477 | [work](https://www.microsoft.com/en-us/research/publication/computation-state-machines/)
478 | with TLA+ assumes that the specification is a state machine. More recently,
479 | Kleppmann has
480 | [shown](https://lawrencecpaulson.github.io/2022/10/12/verifying-distributed-systems-isabelle.html)
481 | how to exploit the state machine structure to do proof by (structural) induction
482 | to solve the state explosion problem.
483 | 
484 | So there you have it, we've gone full circle. We started by taking inspiration
485 | from Joe and Erlang's behaviours, and ended up using the structure of the
486 | `gen_server` behaviour to make it easier to solve a problem that Joe used to
487 | have.
488 | 
489 | ## Contributing
490 | 
491 | There are a bunch of related ideas that I have started working on:
492 | 
493 | * Stealing ideas from Martin Thompson's work on the LMAX Disruptor and
494 |   [aeron](https://github.com/real-logic/aeron) to
495 |   [make](https://github.com/stevana/pipelined-state-machines) a fast event
496 |   loop, on top of which the behaviours run;
497 | * Enriching the state machine type with [async
498 |   I/O](https://github.com/stevana/coroutine-state-machines);
499 | * How to implement
500 |   [supervisors](https://github.com/stevana/supervised-state-machines) in more
501 |   detail;
502 | * Hot code [swapping](https://github.com/stevana/hot-swapping-state-machines)
503 |   of state machines. 
504 | 
505 | Feel free to get in touch if you find any of this interesting and would like to
506 | get involved, or if you have comments, suggestions or questions.
507 | 
508 | ## See also
509 | 
510 | * Chapter 6.1 on behaviours in Joe Armstrong's
511 |   [thesis](http://kth.diva-portal.org/smash/record.jsf?pid=diva2%3A9492&dswid=-1166),
512 |   p. 129;
513 | * [OTP design principles](https://www.erlang.org/doc/design_principles/des_princ.html);
514 | * The documentation for behaviours:
515 |   - [`gen_server`](https://www.erlang.org/doc/man/gen_server.html);
516 |   - [`gen_event`](https://www.erlang.org/doc/man/gen_event.html);
517 |   - [`gen_statem`](https://www.erlang.org/doc/man/gen_statem.html);
518 |   - [`supervisor`](https://www.erlang.org/doc/man/supervisor.html);
519 |   - [`application`](https://www.erlang.org/doc/man/application.html);
520 |   - [`release`](https://www.erlang.org/doc/design_principles/release_structure.html).
521 | * [Hewitt, Meijer and Szyperski: The Actor Model (everything you wanted to know,
522 |   but were afraid to ask)](https://youtube.com/watch?v=7erJ1DV_Tlo) (2012);
523 | * Erlang the [movie](https://www.youtube.com/watch?v=xrIjfIjssLE) (1990);
524 | * [Systems that run forever self-heal and
525 |   scale](https://www.youtube.com/watch?v=cNICGEwmXLU) by Joe Armstrong (Strange
526 |   Loop, 2013);
527 | * [The Do's and Don'ts of Error
528 |   Handling](https://www.youtube.com/watch?v=TTM_b7EJg5E) by Joe Armstrong (GOTO,
529 |   2018);
530 | * [The Zen of Erlang](https://ferd.ca/the-zen-of-erlang.html) by Fred Hebert
531 |   (2016);
532 | * [The Hitchhiker's Guide to the
533 |   Unexpected](https://ferd.ca/the-hitchhiker-s-guide-to-the-unexpected.html) by
534 |   Fred Hebert (2018);
535 | * [Why Do Computers Stop and What Can Be Done About
536 |   It?](https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf) by Jim Gray
537 |   (1985);
538 | * The supervision trees chapter of [*Adopting
539 | 
Erlang*](https://adoptingerlang.org/docs/development/supervision_trees/)
541 | (2019);
542 | * "If there's one thing I'd say to the Erlang folks, it's you got the stuff
543 | right from a high-level, but you need to invest in your messaging
544 | infrastructure so it's super fast, super efficient and obeys all the right
545 | properties to let this stuff work really well."
546 | [quote](https://youtu.be/OqsAGFExFgQ?t=2532) by Martin Thompson (Functional
547 | Conf, 2017).
548 |
549 | ## Discussion
550 |
551 | * [Hacker News](https://news.ycombinator.com/item?id=34545061)
552 | * [lobste.rs](https://lobste.rs/s/7dguth/erlang_s_not_about_lightweight_processes)
553 | * [r/programming](https://old.reddit.com/r/programming/comments/10mt6hz/erlangs_not_about_lightweight_processes_and/)
554 | * [r/haskell](https://old.reddit.com/r/haskell/comments/10mgd0a/erlangs_not_about_lightweight_processes_and/)
555 | * [r/erlang](https://old.reddit.com/r/erlang/comments/10g0zbg/erlangs_not_about_lightweight_processes_and/)
556 | * [Elixir Forum](https://elixirforum.com/t/erlangs-not-about-lightweight-processes-and-message-passing/53484/7)
557 |
558 |
559 | [^0]: From Joe Armstrong's thesis (p. 6):
560 |
561 | > In February 1998 Erlang was banned for new product development within
562 | > Ericsson—the main reason for the ban was that Ericsson wanted to be a consumer
563 | > of software technologies rather than a producer.
564 |
565 | From Bjarne Däcker's thesis (2000, p. 37):
566 |
567 | > In February 1998, Erlang was banned within Ericsson Radio AB (ERA) for new
568 | > product projects aimed for external customers because:
569 | >
570 | > “The selection of an implementation language implies a more long-term
571 | > commitment than selection of processors and OS, due to the longer life cycle
572 | > of implemented products. Use of a proprietary language, implies a continued
573 | > effort to maintain and further develop the support and the development
574 | > environment.
It further implies that we cannot easily benefit from, and find
575 | > synergy with, the evolution following the large scale deployment of globally
576 | > used languages.”
577 |
578 | Joe also says, in this [talk](https://vimeo.com/97329186) (34:30), that there
579 | were two reasons for Erlang getting banned: 1) that it wasn't Java, and 2) that
580 | it wasn't C++.
581 |
582 | [^1]: It's a common misconception that Erlang is about actors.
583 |
584 | The actor model was first presented in [*A Universal Modular Actor Formalism for
585 | Artificial
586 | Intelligence*](https://www.ijcai.org/Proceedings/73/Papers/027B.pdf) by Carl
587 | Hewitt, Peter Bishop and Richard Steiger (1973) and refined by others over time,
588 | e.g. see Irene Greif's [thesis](https://dspace.mit.edu/handle/1721.1/57710)
589 | (1975) or Gul Agha's [thesis](https://dspace.mit.edu/handle/1721.1/6952)
590 | (1985).
591 |
592 | Erlang first appeared later, in 1986, but the Erlang developers were [not
593 | aware](https://erlang.org/pipermail/erlang-questions/2014-June/079794.html) of
594 | the actor model. In fact, Robert Virding, one of the original Erlang designers,
595 | [claims](https://erlang.org/pipermail/erlang-questions/2014-June/079865.html)
596 | that knowing about the actor model might even have slowed them down.
597 |
598 | Carl Hewitt has written a paper called [*Actor Model of Computation: Scalable
599 | Robust Information Systems*](https://arxiv.org/abs/1008.1459) (2015) which
600 | documents the differences between Erlang's processes and the actor model.
601 |
602 | [^2]: Scala's Akka seems to be of this opinion. It has something it calls
603 | "actors", not to be confused with the actor model as per footnote 1, as well as
604 | the obligatory supervisor trees. It doesn't appear to have any analogues of the
605 | other Erlang behaviours though.
606 |
607 | Confusingly, Akka has a concept called
608 | ["behavior"](https://doc.akka.io/docs/akka/current/general/actors.html#behavior),
609 | but it has nothing to do with Erlang behaviours.
610 |
611 | [^3]: The intuition being that since every program using the state monad can be
612 | rewritten to a normal form consisting of a single `read`/`get` followed by a
613 | single `write`/`put`, it seems reasonable to assume that something similar would
614 | work for `recv` and `send` over the network. I forget the reference for the
615 | state monad normal form, either Plotkin and Power or Uustalu?
616 | --------------------------------------------------------------------------------
/docs/modular-state-machines.md:
--------------------------------------------------------------------------------
1 | ---
2 | status: Still in research (please don't share)
3 | ---
4 |
5 | # Modular state machines
6 |
7 | ## Motivation
8 |
9 | * One big state machine can become difficult to understand
10 |
11 | * How can we break a state machine into parts such that when combined they form
12 | the whole?
13 |
14 | ## Products of states
15 |
16 | * The Cartesian product of states can be used when we need to keep track of two
17 | states running in parallel.
18 |
19 | * There are two different ways we can advance two state machines that
20 | run in parallel, either by stepping one of them and leaving the other alone or
21 | stepping both of them in lockstep.
22 |
23 | * Angelic product, allows us to step one of the state machines at a time:
24 |
25 | ```
26 | type SM s i o = i -> s -> (s, o)
27 |
28 | angelic : SM s i o -> SM t j p -> SM (s, t) (i + j) (o + p)
29 | angelic f g ij (s, t) = case ij of
30 |   Left i  -> let (s', o) = f i s in ((s', t), Left o)
31 |   Right j -> let (t', p) = g j t in ((s, t'), Right p)
32 |
33 | video : SM {playing, stopped} {stop, play} ()
34 | audio : SM {muted, unmuted} {mute, unmute} ()
35 |
36 | player = angelic video audio
37 | ```
38 |
39 | * Tensor product, allows us to step both state machines in lockstep:
40 |
41 | ```
42 | tensor : SM s i o -> SM t j p -> SM (s, t) (i, j) (o, p)
43 | tensor f g (i, j) (s, t) =
44 |   let
45 |     (s', o) = f i s
46 |     (t', p) = g j t
47 |   in
48 |     ((s', t'), (o, p))
49 | ```
50 |
51 | See also:
52 |
53 | * ["Concurrent state
54 | machines"](http://gameprogrammingpatterns.com/state.html#concurrent-state-machines)
55 | in Nystrom;
56 | * [Orthogonal
57 | regions](https://en.wikipedia.org/wiki/UML_state_machine#Orthogonal_regions)
58 | in UML state machines;
59 | * [Programming interfaces and basic topology](https://arxiv.org/abs/0905.4063)
60 | by Peter Hancock and Pierre Hyvernat (2009).
61 |
62 | ## The state pattern
63 |
64 | * http://gameprogrammingpatterns.com/state.html#the-state-pattern
65 |
66 | ## Hierarchical states
67 |
68 | * http://gameprogrammingpatterns.com/state.html#hierarchical-state-machines
69 |
70 | * [Hierarchically nested
71 | states](https://en.wikipedia.org/wiki/UML_state_machine#Hierarchically_nested_states)
72 | in UML state machines
73 |
74 | ## Stack of states / pushdown automaton
75 |
76 | * http://gameprogrammingpatterns.com/state.html#pushdown-automata
77 |
78 | ## See also
79 |
80 | * [Game Programming Patterns](http://gameprogrammingpatterns.com/state.html) by
81 | Robert Nystrom (2014, chapter 7);
82 | * Development and Deployment of Multiplayer Online Games, Vol.
II by Sergey
83 | Ignatchenko (2020, chapter 5);
84 | * [*Statecharts: A visual formalism for complex
85 | systems*](http://www.wisdom.weizmann.ac.il/~dharel/SCANNED.PAPERS/Statecharts.pdf)
86 | by David Harel (1987);
87 | * [UML state machines](https://en.wikipedia.org/wiki/UML_state_machine).
88 | --------------------------------------------------------------------------------
/docs/practice-in-programming.md:
--------------------------------------------------------------------------------
1 | ---
2 | status: Still in research (please don't share)
3 | ---
4 |
5 | # On the role of practice in programming
6 |
7 | ## Motivation
8 |
9 | Practice makes perfect, the saying goes. Yet very dedicated people appear to
10 | spend a lifetime programming without producing a masterpiece. Why is that?
11 | Perhaps the most common explanation is that the field is still young and we
12 | haven't figured out how to engineer things with the same degree of accuracy and
13 | predictability as other more established fields. A less common explanation,
14 | which I'd like to explore in this post, is that just because we go to work and
15 | program the whole day long, it doesn't mean that we are in fact practicing.
16 |
17 | ## Defining practice
18 |
19 | Mike Acton gave an [interview](https://youtu.be/qWJpI2adCcs?t=3506) where he
20 | said that practice starts *from scratch every time*, unlike a project at work or
21 | as a hobby which builds upon previous work.
22 |
23 | I suppose the key thing is that you redo some *specific* thing many times until
24 | you become really good at it. Most projects involve many parts, so it's unlikely
25 | that you are repeatedly doing some specific thing over and over again. This
26 | could explain why working on projects isn't practice.
27 |
28 | One exception could be if your project is relatively small and you are doing it
29 | over and over again, then that might qualify as practice.
30 |
31 | Mike gives the example of setting aside half an hour per day to practice,
32 | trying to implement an
33 | [Asteroids](https://en.wikipedia.org/wiki/Asteroids_(video_game)) clone. In the
34 | beginning you'll probably not get very far, throw it away, start from scratch;
35 | by day 300 you might be able to finish implementing the whole game in the
36 | allocated time. Perhaps Asteroids isn't the best thing to practice, but you get
37 | the idea.
38 |
39 | While on the topic of games and getting good at programming, it's interesting to
40 | note that John Carmack and the rest of id Software [developed 13 games in a
41 | year](https://youtu.be/IzqdZAYcwfY?t=540). Early id Software is an extreme
42 | example; most of us probably need to practice on something much smaller.
43 |
44 | ## Putting practice into practice
45 |
46 | Having established what we mean by practice, let's turn our attention to how to
47 | put practice into practice.
48 |
49 | Joe Armstrong is a good example. He
50 | [explains](https://vimeo.com/1344065#t=8m30s) that he often wrote a piece of
51 | code and the next day he threw it away and rewrote it from scratch. In the early
52 | days of Erlang it was possible to do a total rewrite of the whole language in
53 | less than a week. New language features were added in one work session; if you
54 | couldn't get the idea out of your brain and code it up in that time then you
55 | didn't do it, Joe
56 | [explained](https://dl.acm.org/action/downloadSupplement?doi=10.1145%2F1238844.1238850&file=m6-armstrong-h.mov)
57 | (17:10). In a later talk he elaborated,
58 | [saying](https://youtu.be/rQIE22e0cW8?t=3492):
59 |
60 | > "We need to break systems down into small understandable components with
61 | > message passing between them and with contracts describing what's going on
62 | > between them so we can understand them, otherwise we just won't be able to
63 | > make software that works.
I think the limit of human understandability is
64 | > something like 128KB of code in any language. So we really need to box things
65 | > down into small units of computation and formally verify them and the
66 | > protocols in particular."
67 |
68 | Chuck Moore
69 | [said](https://www.red-gate.com/simple-talk/opinion/geek-of-the-week/chuck-moore-geek-of-the-week/)
70 | something similar:
71 |
72 | > "Instead of being rewritten, software has features added. And becomes more
73 | > complex. So complex that no one dares change it, or improve it, for fear of
74 | > unintended consequences. But adding to it seems relatively safe. We need
75 | > dedicated programmers who commit their careers to single applications.
76 | > Rewriting them over and over until they're perfect." (2009)
77 |
78 | > "... Such people will never exist. The world is too full of more interesting
79 | > things to do. The only hope is to abandon complex software. Embrace simple.
80 | > Forget backward compatibility."
81 |
82 | * Both Joe and Chuck ask for simple systems, so that they can easily be
83 | rewritten (i.e. practiced on)
84 |
85 | - Chuck Moore reimplemented the same Forth many times, in fact Forth was
86 | designed to be easily reimplementable on new hardware (this was back when new
87 | CPUs had new instruction sets); he also iterated on the Forth itself (...,
88 | colorForth, what were the earlier iterations?)
89 |
90 | * OKAD, [VLSI](https://en.wikipedia.org/wiki/Very_Large_Scale_Integration)
91 | design tools, "I’ve spent more time with it than any other; have re-written it
92 | multiple times; and carried it to a satisfying level of maturity."
93 |
94 | * John McCarthy's Lisp with its meta-circular implementation?
95 |
96 | ## Practice and software engineering
97 |
98 | * We've defined practice, we've seen examples of people who appear to use it on
99 | a personal level, what about scaling it up to teams?
100 |
101 | - Can whole projects be simple?
Here are two Turing award winners who think so:
102 |
103 | - "At last, there breezed into my office the most senior manager of all, a
104 | general manager of our parent company, Andrew St. Johnston. I was surprised
105 | that he had even heard of me. "You know what went wrong?" he shouted -- he
106 | always shouted -- "You let your programmers do things which you yourself do
107 | not understand." I stared in astonishment. He was obviously out of touch with
108 | present day realities. How could one person ever understand the whole of a
109 | modern software product like the Elliott 503 Mark II software system? I
110 | realized later that he was absolutely right; he had diagnosed the true cause
111 | of the problem and he had planted the seed of its later solution." --
112 | The Emperor's Old Clothes, Tony Hoare (1980)
113 | https://dl.acm.org/doi/10.1145/1283920.1283936
114 |
115 | - "The belief that complex systems require armies of designers and programmers
116 | is wrong. A system that is not understood in its entirety, or at least to a
117 | significant degree of detail by a single individual, should probably not be
118 | built." -- [A Plea for Lean
119 | Software](https://people.inf.ethz.ch/wirth/Articles/LeanSoftware.pdf) by
120 | Niklaus Wirth (1995)
121 |
122 | - Joe's story about a new manager asking for somebody who can explain the whole
123 | system to him: "Does anybody understand the entire system? If so, please come
124 | and talk to me." Nobody put their hand up. "In the Erlang group, if somebody
125 | would have asked that question, several hands would have gone up."
126 | https://youtu.be/-I_jE0l7sYQ?t=1389
127 |
128 | ## Processes and tools that encourage practice?
130 |
131 | - Let's assume that Mike, Joe, Chuck, Tony and Niklaus are on to something and
132 | that it's in fact possible to *design for practice* even in team-sized
133 | projects (aka software development as opposed to programming)
134 |
135 | - What would processes and tools that encourage such development look like?
136 |
137 | ### The status quo
138 |
139 | - TDD: write a test that fails, write the simplest possible implementation that
140 | makes the test pass, then refactor *incrementally* until satisfactory while
141 | keeping the tests green.
142 | + start from scratch rather than refactor?
143 | + time limit (e.g. Joe's one working session)
144 | + size limit (e.g. Joe's 128KB?)
145 | + test-driven design: an easily testable system is a well-designed system
146 | + practice-driven design: an easily rewritable system is a well-designed system?
147 | * the fact that a system is easily and fully testable surely helps when
148 | rewriting from scratch, but it feels like there's more to it?
149 | * good spec / documentation / literate programming? Joe's "the research"
150 | * ability to "zoom" in and out on spec / docs? Refinement.
151 |
152 | - Refactor vs rewrite from scratch debate?
153 |
154 | - DIY vs 3rd party: http://ithare.com/overused-code-reuse/
155 | + https://lobste.rs/s/yubtob/build_vs_buy
156 | + https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-invented-here-syndrome/
157 | + https://eli.thegreenplace.net/2017/benefits-of-dependencies-in-software-projects-as-a-function-of-effort/
158 | + See chapter 4 of *Development and Deployment of Multiplayer Online Games,
159 | Vol. II* by Sergey Ignatchenko (2020).
160 |
161 | ### Possible tricks to steal
162 |
163 | - Parallel and independent development, c.f. Dave Snowden and [wisdom of the
164 | crowd](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd)
165 |
166 | - Encourage new team members to rewrite?
167 |
168 | - What would programming languages look like if we applied these principles?
Forth is a good example, are there others?
170 |
171 | - Forth, bounded by blocks: "Disk memory is divided into units called “blocks.”
172 | Each block holds 1,024 characters of source text or binary data, traditionally
173 | organized as 16 lines of 64 characters."
174 | https://www.forth.com/starting-forth/3-forth-editor-blocks-buffer/
175 |
176 | > There is a great similarity between colorForth and classic Forth: 1024-byte
177 | > blocks. Factoring source code into blocks is equivalent to creating paragraphs
178 | > in English. It breaks a wall of text into pieces that highlight related ideas.
179 | > Many Forth implementations abandoned this advantage and organized source into
180 | > files of arbitrary length. Bad idea. Hugely debated. In addition, the text in
181 | > a block can be displayed on a monitor without scrolling. A quantized unit of
182 | > source code all visible at once.
183 |
184 | - What about libraries? Can we design building blocks that allow us to build
185 | reliable, scalable and maintainable systems, in a way such that the building
186 | blocks can be understood and implemented by a single programmer in a day?
187 | Similar to how features were added to early Erlang, as mentioned above?
188 |
189 | - One of the goals of this repository is to try to identify those building
190 | blocks and understand them well enough, documenting what Joe
191 | [calls](https://youtu.be/h8nmzPh5Npg?t=1302) "the research", perhaps through
192 | several reimplementations from scratch, so that they can be implemented by
193 | others in a day.
194 |
195 | - More important to provide what Joe called "the research" than to provide a
196 | library?
197 |
198 | - "To gain experience, there's no substitute for one's own programming effort.
199 | Organizing a team into managers, designers, programmers, analysts and users is
200 | detrimental. All should participate (with differing degrees of emphasis) in
201 | all aspects of development.
In particular, everybody -- including managers --
202 | should also be product users for a time. This last measure is the best
203 | guarantee to correct mistakes and perhaps also eliminate redundancies." --
204 | Niklaus Wirth
205 |
206 | * "What I cannot create, I do not understand" -- Richard Feynman
207 |
208 | ## Contributing
209 |
210 | * Other examples of processes or tools that encourage practice?
211 |
212 | * Any references in the same general (or completely opposite) direction would be
213 | appreciated!
214 |
215 | - About Rich Hickey (creator of Clojure), some interview probably with somebody
216 | else from Cognitect (half?) jokingly: "Rich doesn't write programs longer than
217 | 1000 lines" (I cannot find the reference, I think it was a Clojure meetup in
218 | London with a discussion after the talk, perhaps on Simulant or REPL-driven
219 | development?)
220 |
221 | ## See also
222 |
223 | * [Coding dojos](https://codingdojo.org/practices/WhatIsCodingDojo/) are spaces
224 | specifically designed for *practicing*;
225 | * [Are We Really
226 | Engineers?](https://www.hillelwayne.com/post/are-we-really-engineers/);
227 | * Mike Acton's talk on [Data-Oriented Design](https://youtube.com/watch?v=rX0ItVEVjHc).
228 | --------------------------------------------------------------------------------
/docs/specification-language.md:
--------------------------------------------------------------------------------
1 | ---
2 | status: research (please don't share)
3 | ---
4 |
5 | # Specification language
6 |
7 | ## Motivation
8 |
9 | * Joe on languages vs specification languages
10 | - https://youtu.be/ed7A7r6DBsM?t=632
11 |
12 | * API vs protocol
13 | - Martin Thompson: https://youtu.be/bzDAYlpSbrM?t=837
14 |
15 | * Scaffolding
16 | - testing
17 | - docs
18 | + glossary: https://youtu.be/fTtnx1AAJ-c?t=518
19 | - code generation
20 | - program inference?
21 | 22 | ## Joe's Universal Binary Format (UBF) 23 | 24 | * https://erlang.org/workshop/2002/Armstrong.pdf 25 | 26 | ## Cleanroom software engineering 27 | 28 | * blackbox vs statebox vs clearbox specification 29 | - https://trace.tennessee.edu/utk_harlan/16/ 30 | - https://trace.tennessee.edu/utk_harlan/12/ 31 | * trace specifications 32 | * usage model/ operational profile 33 | - users/actors are part of the specification 34 | 35 | ## Dave Snowden 36 | 37 | * connections are more important than components in complex systems 38 | * "people are objects too" 39 | * break down a concept until there's no disagreement about its meaning 40 | 41 | ## Some of my own ideas 42 | 43 | * Mostly stealing from all above and gluing things together 44 | * Informal escape hatches, but still structured and checked 45 | - useful for documentation and program inference? 46 | 47 | ## Contributing 48 | 49 | * Thoughts? 50 | 51 | ## See also 52 | 53 | * https://github.com/stevana/bits-and-bobs 54 | * https://github.com/stevana/spec-lang 55 | -------------------------------------------------------------------------------- /talks/zurihac-2023/Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: all 2 | 3 | all: 4 | pandoc --from markdown --to beamer -o slides_stevan.pdf slides_stevan.md 5 | -------------------------------------------------------------------------------- /talks/zurihac-2023/abstract.txt: -------------------------------------------------------------------------------- 1 | How can we better build reliable, scalable and maintainable computer systems? 2 | 3 | In this talk I'd like to argue that one of the main building blocks ought to be 4 | the humble state machine of type: 5 | 6 | SM state input output = input -> state -> (state, output) 7 | 8 | I'll back up my argument with examples of state-of-the-art techniques from 9 | industry that involve verification, availability and fault-tolerance using state 10 | machines. 
11 |
12 | I'll then sketch the other building blocks that I think we need in order to
13 | "glue" our state machines together, as well as means to inspect, debug, upgrade
14 | and scale running systems.
15 |
16 | My hope is that by the end of the talk I'll have managed to make you think
17 | differently about some of the many aspects involved in developing, deploying and
18 | maintaining distributed systems.
19 | --------------------------------------------------------------------------------
/talks/zurihac-2023/slides_stevan.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Distributed systems and state machines
3 | date: 11th of June, 2023
4 | author: Stevan Andjelkovic
5 | institute: ZuriHac lightning talk
6 | header-includes:
7 | - \definecolor{links}{HTML}{2A1B81}
8 | - \hypersetup{colorlinks,linkcolor=,urlcolor=links}
9 | ---
10 |
11 | # Motivation
12 |
13 | 1. There are many implicit state machines in distributed systems;
14 | 2. By making them explicit a bunch of things become easier!
15 |
16 | # Implicit state machines
17 |
18 | \note{
19 | * Distributed systems are hard!
20 | * One of the reasons: many failure cases of networking
21 | * Good test-suite helps a lot
22 | * All successful means of verification use state machines!
23 | }
24 |
25 | ## Verification
26 |
27 | * Property-based testing using state machines
28 | * Kyle "aphyr" Kingsbury's Jepsen testing tool
29 | * FoundationDB's simulation testing
30 | * Leslie Lamport's model checking of TLA+
31 | * Martin Kleppmann's formal verification work in Isabelle
32 |
33 | ## Development
34 |
35 | * Joe Armstrong on Erlang's
36 | [behaviours](https://github.com/stevana/armstrong-distributed-systems/blob/main/docs/erlang-is-not-about.md)
37 | (`gen_server` and `gen_statem`)
38 | * Fault-tolerance via replicated state machines
39 | * Martin Thompson's work on the LMAX disruptor and Aeron
40 |
41 | ## Theory
42 |
43 | * Yuri Gurevich's generalisation of the Church-Turing thesis
44 |
45 | # Explicit state machines
46 |
47 | * Some ideas of what we could do if our state machines were explicit
48 | * Rapid overview; slides and links to longer explanations are available
49 |
50 | \note{
51 | * Simulation testing
52 | * Time travelling debugger
53 | * Arrow syntax for state machines and hot-swappable code
54 | * Pipelines of state machines
55 | * Modular state machines
56 | * Protocols between communicating state machines
57 | * Supervisor trees
58 | }
59 |
60 | # Simulation testing
61 |
62 | \note{
63 | * System consisting of several state machines communicating with each other over the network
64 | * Note that networking is abstracted out from the state machine definition
65 | * Implement networking twice: once using real networking primitives and once using a fake in-memory network
66 | }
67 |
68 | * `SM state input output = input -> state -> (state, output)`
69 | * Networking interface
70 | * Property-based testing, fault injection, discrete-event simulation
71 |
72 | # Time-travelling debugger
73 |
74 | * Record all inputs
75 | * Assuming the state machine is deterministic we can recompute the state (and logs!)
from the inputs
76 | * This gives us a way of replaying the execution of a state machine and seeing
77 | how its state changes over time
78 |
79 |
80 | # Arrow syntax and hot-swappable code
81 |
82 | * `instance Arrow (SM state)`
83 | * Conal Elliott's compiling to categories (`Arrow` modulo `arr`)
84 | * CCCs are first-order, i.e. easily serialisable
85 | * So we can send them over the network and upgrade running state machines without downtime!
86 |
87 | # Pipelining of state machines
88 |
89 | * Serving a request typically involves several stages, e.g.:
90 | 1. Read bytes from socket
91 | 2. Parse bytes into structured data
92 | 3. Validate data
93 | 4. Process the data using our business logic and produce some response
94 | 5. Serialise response into bytes
95 | 6. Write response bytes back to socket
96 | * What if each such stage was run on a separate CPU/core? Pipelining!
97 | * Monitor queue lengths of each stage and shard if stages are slow
98 | * What's the relation to dataflow and FRP?
99 |
100 | # Modular state machines
101 |
102 | * Pipelining is *horizontal* composition: the outputs of one are fed into another
103 | * What about *vertical* composition?
104 | * State machines inside state machines?
105 | * Hierarchical states?
106 | * Stack of states / pushdown automaton?
107 |
108 | # Protocols between communicating state machines
109 |
110 | * If we think of state machines as black boxes which provide some API via their inputs
111 | * Then protocols between two black boxes are sequences of input-output pairs
112 | * These too can be described using state machines!
113 |
114 | # Supervisor trees and deployment
115 |
116 | * Tree with supervisors in the nodes and state machines in the leaves
117 | * Supervisors' job is to do error handling if one of their children fails
118 | * Jim Gray's idea of failing fast
119 | * Supervisors contain restart strategies, i.e.
in which order to restart the 120 | children if one fails 121 | * Use restart strategy as a means of deployment (start up order) 122 | 123 | # Contributing 124 | 125 | * For more, also see: 126 | + \url{https://github.com/stevana/armstrong-distributed-systems} 127 | + \url{https://github.com/stevana/property-based-testing-stateful-systems-tutorial} 128 | + \url{https://github.com/stevana/hot-swapping-state-machines} 129 | + \url{https://github.com/stevana/smarrow-lang} 130 | + \url{https://github.com/stevana/pipelined-state-machines} 131 | + \url{https://github.com/stevana/elastically-scalable-thread-pools} 132 | + \url{https://github.com/stevana/supervised-state-machines} 133 | * If you have any questions or comments, feel free to get in touch! 134 | * Thanks for listening! 135 | -------------------------------------------------------------------------------- /talks/zurihac-2023/slides_stevan.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stevana/armstrong-distributed-systems/434a2999edd36ba1cedafb381aa4dec246a757b1/talks/zurihac-2023/slides_stevan.pdf --------------------------------------------------------------------------------