├── Complexity.md
├── Readme.md
├── graph
│   └── Readme.md
├── list
│   ├── Readme.md
│   ├── binary-search.js
│   ├── shuffle.js
│   └── sort
│       ├── merge-sort.js
│       └── quicksort.js
├── misc
│   ├── cryptography.md
│   ├── engineering.md
│   ├── memory.md
│   ├── network.md
│   ├── network
│   │   ├── physical.md
│   │   └── protocol.md
│   ├── reliability.md
│   ├── statistics.md
│   ├── synchronization.md
│   └── time.md
└── tree
    ├── Readme.md
    ├── binary-search.js
    ├── heap.js
    └── red-black.js

/Complexity.md:
--------------------------------------------------------------------------------
1 | ```
2 | f(n) = O(g(n)) ⇔ |f(n)| ≤ |g(n)|·k "grows less than"
3 | f(n) = Ω(g(n)) ⇔ f(n) ≥ g(n)·k "grows more than"
4 | f(n) = Θ(g(n)) ⇔ g(n)·k1 ≤ f(n) ≤ g(n)·k2 "bounded by"
5 | ```
6 | 
7 | (Insert "∃k>0: ∃n0: ∀n>n0" where needed.)
8 | 
9 | # Master theorem
10 | 
11 | An algorithm has complexity `T(n) = a T(n/b) + f(n)`.
12 | 
13 | 1. `f(n) = O(n^(logb(a)-ε))` → `T(n) = Θ(n^(logb(a)))`
14 | 2. `f(n) = Θ(n^(logb(a)))` → `T(n) = Θ(n^(logb(a)) log(n))`
15 | 3. `f(n) = Ω(n^(logb(a)+ε))` (plus a regularity condition) → `T(n) = Θ(f(n))`
16 | 
17 | # Typical complexities
18 | 
19 | | Complexity | Description |
20 | |------------|-------------|
21 | | O(1) | Getting an element from an array |
22 | | O(m α(m,n))| Best minimum spanning tree (α = inverse Ackermann) |
23 | | O(log n) | Binary search |
24 | | O(n) | Maximum of an unsorted array |
25 | | O(n log n) | Best comparison sort |
26 | | O(n^2) | Naive vector cross product |
27 | | O(n^3) | Naive matrix multiplication |
28 | | 2^O(log n) | P; Karmarkar (Linear programming); AKS (Primes) |
29 | | 2^o(n) | Integer factorization |
30 | | 2^O(n) | E; TSP, 3-SAT, Graph-coloring |
31 | 
32 | - SAT: can a given expression with variables, AND, OR, NOT and nesting be true?
33 | - TSP: minimum distance for a traveling saleswoman who must go through a set
34 |   of cities.
35 | - Graph coloring: give each vertex a different color than its neighbors, with a
36 |   fixed number of colors (2 is O(n), 3 is O(1.3289^n), k is O(2.445^n)).
37 | - Knapsack: keep a maximal value out of a set of elements with a value and a
38 |   mass, given a limit to how much mass you can keep.
39 |   It has a known O(n Mass) dynamic programming solution, but is NP-complete.
40 | - Exact cover: given a bunch of subsets (tetris piece locations) of a set
41 |   (board), select the subsets that cover the whole set with no overlap.
42 |   Pentomino tilings, Sudoku and N queens are of that form.
43 |   Donald Knuth implements it using "Dancing Links".
44 | 
45 | # Complexity classes
46 | 
47 | - P: solved on a deterministic Turing machine in polynomial time.
48 | - NP: solved on a non-deterministic Turing machine in polynomial time (has
49 |   overlapping parallel universes). A solution is verified by a deterministic
50 |   Turing machine in polynomial time. (The non-deterministic solution generates
51 |   all candidates and checks them in parallel.)
52 | - NP-hard: if that problem were solved in O(1), all problems in NP would be in P.
53 | - NP-complete: NP-hard and NP.
54 | 
55 | ```
56 | ← easy … hard →
57 | ├────────NP─────────┤
58 | ├─P─┤  ├─NP-complete─┤
59 |        ├────NP-hard────┤
60 | ```
61 | 
62 | P could be equal to NP, but for all practical purposes, it is not (which may be
63 | proved in the future).
64 | 
--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
1 | # Succinct Cybernetics
2 | 
3 | All decidable problems can be solved with algorithms. For all we know, humans
4 | are Turing machines.
5 | 
6 | Most problems can be structured into a **graph**. [Graphs](/graph/Readme.md) have vertices (entities
7 | holding data) and edges (from one vertex to another, sometimes with a value).
8 | 
9 | Many problems can be structured into a **tree**. [Trees](/tree/Readme.md) are connected graphs
10 | without cycles. Most trees we use are rooted: one vertex is the entry point (the
11 | root); it is the only vertex with no edge pointing to it, all others have exactly
12 | one. Many trees are ordered: vertices order their children like a list.
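A rooted, ordered tree can be sketched as nested vertices (an illustrative sketch, not a file from this repository):

```js
// A rooted, ordered tree: each vertex holds data and an ordered list of
// children; only the root has no parent.
const tree = {
  data: 'root',
  children: [
    { data: 'first child', children: [] },
    { data: 'second child', children: [] },
  ],
};

// Walk the tree from the root, counting every vertex.
function countVertices(vertex) {
  return 1 + vertex.children.reduce((sum, c) => sum + countVertices(c), 0);
}

console.log(countVertices(tree)); // 3
```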
13 | 
14 | Some problems can be structured into a **list**. [Lists](/list/Readme.md) are rooted trees where a
15 | maximum of one child is allowed.
16 | 
17 | A few problems can be structured into a **map**. Maps are directed graphs where
18 | every vertex has either a single edge coming from it (keys) or at least one
19 | edge coming from a key (values).
20 | 
21 | (An uncommon variation of maps is the multimap, where keys can have more than a
22 | single edge coming from them.)
23 | 
24 | Another structure is a **set**. Sets are graphs with no edges.
25 | 
26 | ## Index
27 | 
28 | 1. [Complexity](/Complexity.md)
29 | 2. [Graphs](/graph/Readme.md)
30 | 3. [Trees](/tree/Readme.md)
31 | 4. [Lists](/list/Readme.md)
32 | 5. Misc:
33 |    - [Memory](/misc/memory.md)
34 |    - [Time](/misc/time.md)
35 |    - [Network](/misc/network.md)
36 |    - [Synchronization](/misc/synchronization.md)
37 |    - [Reliability](/misc/reliability.md)
38 |    - [Statistics](/misc/statistics.md)
39 |    - [Cryptography](/misc/cryptography.md)
40 |    - [Engineering](/misc/engineering.md)
41 | 
42 | ## Going further
43 | 
44 | - [Introduction to Algorithms](https://mitpress.mit.edu/books/introduction-algorithms)
45 | 
--------------------------------------------------------------------------------
/graph/Readme.md:
--------------------------------------------------------------------------------
1 | Graphs have vertices (entities holding data) and edges (from one vertex to
2 | another, sometimes with a value).
3 | 
4 | # Implementation
5 | 
6 | - **Adjacency matrix**: n by n matrix. Each slot is 0 (no edge between i and j),
7 |   1 (edge from i to j), potentially more for colored edges.
8 | - **Adjacency list**: list of n items, each with their vertex data and a pointer
9 |   to a list of indices of vertices it points to.
10 | - **Pointers**: vertices with a list of pointers to vertices. Useful for trees
11 |   or with a fixed maximum of adjacent vertices.
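As a sketch, an adjacency list for a small directed graph (illustrative values; vertices are identified by their index):

```js
// Adjacency list: vertices[i] holds the vertex data and the indices of the
// vertices that i points to.
const vertices = [
  { data: 'a', adjacent: [1, 2] }, // a → b, a → c
  { data: 'b', adjacent: [2] },    // b → c
  { data: 'c', adjacent: [] },
];

// Adding an edge is O(1) (a push); adjacency testing is O(v) in the worst
// case, since it scans one vertex's edge list.
function isAdjacent(vertices, i, j) {
  return vertices[i].adjacent.includes(j);
}

console.log(isAdjacent(vertices, 0, 2)); // true
console.log(isAdjacent(vertices, 2, 0)); // false
```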
12 | 13 | | | Adjacency matrix | Adjacency list | 14 | |-------------|------------------|----------------| 15 | | Storage | O(v^2) | O(v + e) | 16 | | Add vertex | O(v^2) | O(1) | 17 | | Add edge | O(1) | O(1) | 18 | | Rm vertex | O(v^2) | O(e) | 19 | | Rm edge | O(1) | O(e) | 20 | | Is adjacent | O(1) | O(v) | 21 | 22 | # Graph traversal 23 | 24 | ## Depth-first 25 | 26 | ``` 27 | Search(G, v): 28 | explored(v) 29 | ∀e∈edges(G, v): 30 | w = vertex(G, v, e) 31 | if unexplored(w): 32 | check w 33 | Search(G, w) 34 | ``` 35 | 36 | O(m) time, O(n) space (worst case). 37 | 38 | ## Breadth-first 39 | 40 | ``` 41 | Search(G, v): 42 | queue Q, set S 43 | enqueue(Q, v) 44 | add(S, v) 45 | while Q not empty: 46 | w = dequeue(Q) 47 | check w 48 | ∀e∈edges(G, w): 49 | u = vertex(G, w, e) 50 | if u not in S: 51 | add(S, u) 52 | enqueue(Q, u) 53 | ``` 54 | 55 | O(m) time, O(n) space (worst case). 56 | -------------------------------------------------------------------------------- /list/Readme.md: -------------------------------------------------------------------------------- 1 | # Sort 2 | 3 | Sorted lists are easier to search through / extract data, but maintaining that 4 | property requires either care or full-list sorting (as below). 5 | 6 | | | Average | Worst | Stable | Memory | Note | 7 | |-----------|---------------|---------------|--------|----------------|--------| 8 | |Merge sort | O(n log n) | O(n log n) | yes | O(n) | | 9 | |" in-place | O(n log(n)^2) | O(n log(n)^2) | yes | O(1) | | 10 | |Quicksort | O(n log n) | O(n^2) | no | O(log n) / O(n)|The pivot technique is used elsewhere| 11 | |Heapsort | O(n log n) | O(n log n) | no | O(1) |In-place| 12 | |Insertion | O(n^2) | O(n^2) | yes | O(1) |Booklike| 13 | 14 | ## Radix sort 15 | 16 | Famous for being "linear". O(wn) worst-case, with n the size of the list, and w 17 | the size of the items (eg, for 64 bit integers, 64). If the list has no 18 | duplicates, w will be ≥log(n), so it is not really linear. 
Use this only if you 19 | have mostly duplicates (this sort is stable). 20 | 21 | # Shuffle 22 | 23 | Fisher–Yates shuffle is O(n) and can be in-place (O(1) memory). 24 | 25 | # Search 26 | 27 | ## Binary search 28 | 29 | O(log n) search for a key with no index in a sorted list. 30 | 31 | You know the index is between a lower bound and an upper bound (initially 32 | including all of the list), and you reduce their span by checking if the item is 33 | on the left or on the right half of the span. 34 | 35 | If you don't know how big the list is, you can do **exponential search**: first 36 | exponentially try to find an index whose item is bigger than the searched item, 37 | then do binary search. 38 | 39 | ## Interpolation search 40 | 41 | Better average complexity. Instead of halving the span, cut the span in 42 | proportion to where the item should be. For instance, in the dictionary, the 43 | word "zebra" would be around the end. It only works if items are uniformly 44 | distributed (O(log log n)). 45 | -------------------------------------------------------------------------------- /list/binary-search.js: -------------------------------------------------------------------------------- 1 | // A list and an item that may be in that list. 2 | // Returns -1 if it is not there, or the index of that item if it is. 3 | function search(list, item) { 4 | return binarySearch(list, item, 0, list.length - 1); 5 | } 6 | 7 | function binarySearch(list, item, imin, imax) { 8 | while (imin < imax) { 9 | // Idea: average of imin and imax, (imin + imax) / 2. 10 | // If imin and imax are too large, their sum could trigger an integer 11 | // overflow (go past the largest integer representable). 
12 | // (imin + imin - imin + imax) / 2 = (imax - imin) / 2 + imin 13 | var imid = Math.floor((imax - imin) / 2) + imin; 14 | 15 | // 0 <= imin <= imid < imax 16 | 17 | if (list[imid] < item) { 18 | // |----|--x-| 19 | imin = imid + 1; 20 | } else { 21 | // |--x-|----| 22 | imax = imid; 23 | } 24 | } 25 | 26 | // Now, imin >= imax. 27 | // imin > imax if the list is empty. 28 | if ((imax === imin) && (list[imin] === item)) { 29 | return imin; 30 | } else { 31 | return -1; 32 | } 33 | } 34 | 35 | // 3 36 | console.log(search([2, 34, 321, 834, 854, 856], 834)); 37 | -------------------------------------------------------------------------------- /list/shuffle.js: -------------------------------------------------------------------------------- 1 | function shuffle(list) { 2 | for (var i = list.length - 1; i > 0; i--) { 3 | // i goes left: 4 | // j i 5 | // --x--x--- 6 | // unshuffled shuffled 7 | var j = Math.floor(Math.random() * (i + 1)); 8 | // i+1 because it should be able to stay in place. 9 | [list[i], list[j]] = [list[j], list[i]]; 10 | } 11 | return list; 12 | } 13 | 14 | console.log(shuffle([1, 2, 3, 4, 5])); 15 | -------------------------------------------------------------------------------- /list/sort/merge-sort.js: -------------------------------------------------------------------------------- 1 | // Classic divide-and-conquer algorithm. 2 | // This implementation is not in-place. 3 | 4 | function sort(list) { 5 | // Cut the list in a left piece (which we sort) and a right piece (which we 6 | // sort). 7 | var n = list.length; 8 | var target = new Array(n); 9 | // The smallest case is sublists of 1 item (which are then sorted). 10 | // Then we use sublists that double in size every time. 11 | for (var width = 1; width < n; width *= 2) { 12 | // Go from sublist to sublist. 13 | for (var i = 0; i < n; i += (2 * width)) { 14 | // Merge the sorted sublists from the last run. 
15 | merge(list, i, Math.min(i + width, n), Math.min(i + 2*width, n), target); 16 | } 17 | // The target contains the better data, we'll use list as the new buffer. 18 | var tmp = target; 19 | target = list; 20 | list = tmp; 21 | } 22 | return list; 23 | } 24 | 25 | // Items from ileft to iright-1 are sorted on their own, 26 | // items from iright to iend are sorted on their own. 27 | function merge(list, ileft, iright, iend, target) { 28 | // |------|------| 29 | // ileft iright iend 30 | var imiddle = iright; 31 | 32 | // We will cover each item eventually, by increasing ileft and iright. 33 | for (var j = ileft; j < iend; j++) { 34 | if (ileft < imiddle // We still have a left item. 35 | // We don't have a right item or it is larger. 36 | && ((iright >= iend) || (list[ileft] <= list[iright]))) { 37 | // Put the left item. 38 | target[j] = list[ileft]; 39 | ileft += 1; 40 | } else { 41 | // Put the right item. 42 | target[j] = list[iright]; 43 | iright += 1; 44 | } 45 | } 46 | } 47 | 48 | console.log(sort([2, 5, 4, 1, 3])); 49 | -------------------------------------------------------------------------------- /list/sort/quicksort.js: -------------------------------------------------------------------------------- 1 | // Classic recursive algorithm. 2 | 3 | function sort(list) { 4 | return quicksort(list, 0, list.length - 1); 5 | } 6 | 7 | function quicksort(list, lo, hi) { 8 | if (lo < hi) { 9 | // Find a pivot in the middle (can be random, here, it's at the end). 10 | // Things smaller than the pivot will all accumulate on the left. 11 | var pivot = list[hi]; 12 | var i = lo; 13 | for (var j = lo; j < hi; j++) { 14 | // Put all the items smaller than the pivot on the left. 15 | // |--------|------|---p 16 | // lo <=p i >p j hi 17 | if (list[j] <= pivot) { 18 | // Swap i and j. 19 | var tmp = list[j]; 20 | list[j] = list[i]; 21 | list[i] = tmp; 22 | i += 1; 23 | } 24 | } 25 | // i has now all items <=p on the left, and >p on the right. 
26 | // Put p (which is still on hi) between them. 27 | var tmp = list[hi]; 28 | list[hi] = list[i]; 29 | list[i] = tmp; 30 | 31 | // Sort the left side and the right side of the pivot. 32 | quicksort(list, lo, i - 1); 33 | quicksort(list, i + 1, hi); 34 | } 35 | return list; 36 | } 37 | 38 | console.log(sort([2, 5, 4, 1, 3])); 39 | -------------------------------------------------------------------------------- /misc/cryptography.md: -------------------------------------------------------------------------------- 1 | # Cryptography 2 | 3 | ## One-way function 4 | 5 | **One-way functions** are such that: 6 | 7 | - `one-way(input)` is computed in [polynomial-time](../Complexity.md) 8 | - All randomized polynomial-time functions `inverse(output)` such that 9 | `inverse(one-way(input))` have on average a near-zero probability to 10 | return `input` as the result of the computation of `inverse`. 11 | 12 | In practice, cryptanalysis can discover new ways to find the input from the 13 | output which changes the estimated probability, or machines can become more 14 | powerful than planned. As a result, it is necessary to stay up-to-date to 15 | correctly estimate risk. 16 | 17 | ### Hash 18 | 19 | A **Hash** is a one-way function returning a fixed-sized (typically small) 20 | output such that the following functions are computationally too hard: 21 | - `collision() = (m1, m2)` such that `hash(m1) = hash(m2)` 22 | - `preimage` such that `preimage(hash(m)) = m` 23 | - `second_preimage(m1) = m2` such that `hash(m1) = hash(m2)` 24 | 25 | It is useful to uniquely identify a large message in a small amount of memory 26 | (typically 64 bits (weak), 128 bits, or 256 bits) so that checking identity is 27 | fast. 28 | 29 | A regular NIST competition is performed to select a good hash function: the 30 | Secure Hash Algorithm (SHA). 
31 | 
32 | Ron Rivest’s MD5 is broken; SHA-0 and SHA-1 are considered broken; some SHA-2
33 | constructions have dangerous properties (vulnerability to *length-extension
34 | attacks*) that require the use of the HMAC algorithm for message authentication
35 | (but SHA-512/256 (ie. SHA-512 truncated to 256 bits) does not), and Joan
36 | Daemen’s SHA-3 is the latest as of 2018 (and does not have the SHA-2 issues).
37 | 
38 | Famous non-SHA cryptographic hash functions include BLAKE2 (derived from SHA
39 | finalist BLAKE, itself derived from djb’s ChaCha20), KangarooTwelve (derived
40 | from SHA-3).
41 | 
42 | Famous non-cryptographic hash functions include Zobrist (eg. to detect unique
43 | states in a game), FNV, CityHash, MurmurHash, SipHash (for hash tables).
44 | 
45 | **Universal hash functions** are a family of hash functions where a key
46 | determines which function of the family is picked (usually, a random number
47 | picked when the hash table is created in memory). They only target
48 | **collision-resistance** against an adversary *that doesn’t know the key*.
49 | *SipHash* (by JP Aumasson and djb) is secure under those assumptions, and fast
50 | enough to be used to avoid malicious collisions in hash tables causing
51 | performance degradation and unavailability.
52 | 
53 | A MAC (see below) can be used as a universal hash function by using a random
54 | key.
55 | 
56 | **Rolling hash functions** TODO
57 | 
58 | ### Message Authentication Code
59 | 
60 | A **Message Authentication Code** (MAC) can assert the following properties if
61 | you share a secret key with a given entity:
62 | - **authentication**: the message was validated by a keyholder,
63 | - **integrity**: the message was not modified by a non-keyholder.
64 | 
65 | It can be done with a hash; it is then called a **keyed hash function**.
66 | Modern hash functions such as SHA-3 or BLAKE2 offer this functionality this way: 67 | - `mac = authentify(message, key) = hash(key + message)` (`+` is string 68 | concatenation), 69 | - `verify(message, mac, key) = authentify(message, key) == mac`. 70 | 71 | Older hashes suffer from *length-extension attacks* with this approach. However, 72 | they can be used by relying on the **HMAC** algorithm: 73 | - `authentify(message, key) = hash((rehash(key)^outerPad) + 74 | hash((rehash(key)^innerPad) + message))` where `^` is XOR, the pads are fixed 75 | and the size of the hash block, and `rehash` depends on the hash size. 76 | 77 | Another common MAC is djb’s *Poly1305*. 78 | 79 | HOTP, TOTP: TODO 80 | 81 | ### Key derivation function 82 | 83 | One use of cryptographic hash functions is to store password, but only via a 84 | metafunction called a **key derivation function** (KDF) that performs **key 85 | stretching** (passing a low-[entropy][] password through the hash function in a 86 | loop a large number of times) and salting (putting a known random prefix to 87 | protect against *rainbow attacks*). 88 | 89 | Famous KDFs include PBKDF2, bcrypt, scrypt and Argon2, ordered by date of 90 | creation and increased confidence in security. They typically have a stretching 91 | parameter to increase their complexity so that the same algorithm can be used 92 | when computers get better at brute-forcing passwords. 93 | 94 | [entropy]: ./information.md 95 | 96 | (Note that key stretching is only about low entropy: large-enough purely-random 97 | passwords hashed through a cryptographic hash cannot be brute-forced. For 98 | instance, a 256-bit BLAKE2 of a 128-bit CSPRNG output has 256/2 = 128 bits of 99 | security. Brute-forcing it would require `2^127` attempts on average. Computers 100 | take at best 1 ns to perform an elementary operation, and `2^127` ns is 4 times 101 | the age of the known universe. 
Parallelizing would cost 20 septillion € of
102 | machines to brute-force a typical hash in 100 years, if Earth had enough
103 | material to build the computers. It goes down to 1 million € with 64 bits of
104 | security, which is why 128 bits are used when security matters, eg. with
105 | UUIDv4.)
106 | 
107 | ## Randomness
108 | 
109 | Humans are terrible at estimating randomness, and machines (and cryptographers)
110 | are pretty good at exploiting weaknesses in randomness.
111 | 
112 | A good random source obeys certain *statistical* characteristics to ensure that
113 | the probability of someone predicting its output is near zero.
114 | (See for instance the [NIST randomness recommendation][].)
115 | 
116 | [NIST randomness recommendation]: https://csrc.nist.gov/projects/random-bit-generation
117 | 
118 | One option is to gather and privately store data from physical events that are
119 | very hard for someone else to control, such as electric or atmospheric noise,
120 | or time noise in the occurrence of events in a booting operating system.
121 | 
122 | Another is to make a **pseudo-random** number generator (PRNG), such that
123 | `random = prng(seed)` is a function that yields a bit (or a fixed-sized list of
124 | bits, as a number) every time it is called, such that:
125 | 
126 | - it yields the same sequence of bits given the same seed,
127 | - the sequence of bits obeys the statistical characteristics we talked about.
128 | 
129 | Cryptographically-Secure PRNGs (CSPRNG) are designed more carefully but are
130 | typically slower. They are usually instances of a **pseudorandom function
131 | family** (PRF).
132 | 
133 | Examples: arc4random (based on a leaked version of the RC4 cipher), AES-CTR,
134 | ChaCha20 (eg. in Linux’ /dev/urandom).
135 | 
136 | Examples of non-cryptographically-secure: LCG, XorShift, Mersenne Twister, PCG
137 | (in order of quality against predictability).
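A minimal LCG shows the two PRNG properties above — determinism given a seed, and fixed-size outputs. (A sketch with the Numerical Recipes constants; emphatically *not* cryptographically secure: a few outputs reveal the whole state.)

```js
// Linear congruential generator: state = (a·state + c) mod 2^32.
// NOT a CSPRNG — for illustration only.
function lcg(seed) {
  let state = seed >>> 0; // coerce to an unsigned 32-bit integer
  return function next() {
    // Math.imul keeps the multiplication in 32-bit integer space.
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state;
  };
}

const a = lcg(42), b = lcg(42);
console.log(a() === b()); // true: same seed, same sequence
```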
138 | 139 | Typically, on Unix systems, `/dev/urandom` is a CSPRNG fed with a pool of 140 | entropy from boot-time randomness extracted from the operating system. 141 | 142 | ## Symmetric-key ciphers 143 | 144 | Cipher that defines two functions `encrypt(msg, key)` and `decrypt(msg, key)` such that: 145 | 146 | - `decrypt(encrypt(msg, key), key) = msg` 147 | - `encrypt` is a one-way function (and often `decrypt` too). 148 | 149 | **Reciprocal ciphers** are such that `decrypt = encrypt`, eg. the Enigma machine. 150 | 151 | Two common designs: stream and block ciphers. 152 | 153 | ### Stream ciphers 154 | 155 | They use a construct where every bit of information is encrypted 156 | one at a time; you give it the next bit of message, you instantly get the next 157 | bit of ciphertext out. 158 | 159 | The *Vigenère cipher* survived hundreds of years of cryptanalysis, earning it 160 | the name of “chiffre indéchiffrable” (indecipherable). While Babbage broke it by 161 | noticing repeated sequences in the plaintext exhibited repeated sequences in the 162 | ciphertext, it inspired the creation of the only provably unbreakable cipher, 163 | the *one-time pad*. 164 | 165 | The **one-time pad** simply performs modular addition on each symbol (in the 166 | case of bits, this corresponds to a XOR with the secret key). It requires a 167 | perfectly random secret key of a size equal to the plaintext that is never 168 | reused. *Claude Shannon* proved [information-theoretically](./information.md) 169 | that it is unbreakable (the only such cipher known to date), as for all 170 | plaintexts, there is a key that yields a given ciphertext. Practical risks in 171 | managing the key caused it to fall into disuse. 172 | 173 | Ron Rivest designed **RC4** (Rivest Cipher 4) as a proprietary algorithm for 174 | the RSA Security company. Following an anonymous online description, it was 175 | reverse-engineered. 
To avoid trademark conflicts, many systems adopted it as 176 | *ARC4*, and a derived CSPRNG was called *arc4random*. It was a common cipher in 177 | SSL/TLS and WEP/WPA, until a 2015 flaw was discovered. 178 | 179 | Daniel J. Bernstein (djb) designed *Salsa20* for the eSTREAM competition (a 180 | follow-up to the NESSIE competition where all stream ciphers submitted were 181 | broken). Many servers switched from RC4 to a derived cipher, **ChaCha20**, along 182 | with djb’s Poly1305 MAC, to have authenticated encryption ([RFC 7905][]). 183 | 184 | [RFC 7905]: https://tools.ietf.org/html/rfc7905 185 | 186 | ### Block ciphers 187 | 188 | A block cipher, by contrast with a stream cipher, can only encrypt a fixed 189 | number of bits (its *block size*). 190 | 191 | Most block ciphers are **product ciphers**: the generation of an encrypted block 192 | relies on repeating an operation (typically performing substitutions 193 | (**s-boxes**) and permutations (**p-boxes**)) multiple times by linking the 194 | output of one to the input of the next in a sophisticated *network* which 195 | increases security every time. (They achieve that by distributing the impact of 196 | each input bit to output bits, producing statistically more random output.) 197 | 198 | The number of times the network is repeated is called the number of **rounds**. 199 | Typical cryptanalysis first tries to break a cipher with a lower number of 200 | rounds. If they find a better algorithm than brute-force on all rounds, the 201 | cipher is considered *theoretically broken*, but the algorithm typically 202 | requires impractical amounts of time and memory. If it achieves a scale close to 203 | human lives and memory close to that of a country, it is considered *practically 204 | broken*. 205 | 206 | The US government’s NBS (ancestor to NIST) requested proposals for a cipher. 
IBM 207 | proposed **DES** (Data Encryption Standard), a 64-bit block cipher (based on a 208 | Feistel network), whose s-boxes were then tweaked by NSA and key size reduced to 209 | 56 bits before publication. 210 | 211 | When DES’s key size became dangerously close to brute-force-worthy, 3DES was 212 | produced, but it was very slow. NIST organized a more open competition, **AES** 213 | (Advanced Encryption Standard). The finalist, Vincent Rijmen and Joan Daemen’s 214 | Rijndael, is a 128-bit block cipher based on a SP-network, with three variants: 215 | 128-, 192-, 256-bit keys (with 10, 12, or 14 rounds). It was rebaptized AES when 216 | it won. 217 | 218 | ### Block modes 219 | 220 | Block modes convert a block cipher into a stream cipher by breaking the 221 | plaintext into blocks and encrypting each block with the cipher and a parameter 222 | that depends on the processing of previous blocks. 223 | 224 | The lack of use of that parameter, eg. by encrypting each block individually 225 | (*ECB* mode), falls to shifted plaintext analysis: identical plaintext blocks 226 | will have identical ciphertext blocks. 227 | 228 | **CBC** (Cipher Block Chaining) for instance XORs each block of plaintext with 229 | the ciphertext of the previous block. Diffie and Hellman also designed **CTR** 230 | (Counter) mode, which XORs the plaintext with an encrypted **nonce** (a unique 231 | input) that is incremented for every block. 232 | 233 | Parameters: key, nonce, plaintext. 234 | ciphertext block 1 = encrypt(nonce + 0, key) XOR (plaintext block 1) 235 | ciphertext block 2 = encrypt(nonce + 1, key) XOR (plaintext block 2) 236 | etc. 237 | 238 | When the parameter for each block is obtained from the previous block, the first 239 | block needs an initial parameter. It must be unique (ie. a nonce, “number used 240 | once”), so that encrypting twice the same message does not yield the same 241 | ciphertext, which would leak information. 
Often, it needs to be random in a 242 | cryptographically-secure way. That first parameter is called an **initialization 243 | vector**. It must be sent along with the ciphertext so it can be deciphered. 244 | 245 | Those ciphers only ensure **confidentiality** (the message can only be read by 246 | keyholders), but they lack: 247 | - **authentication**: the message was validated by a keyholder, 248 | - **integrity**: the message was not modified by a non-keyholder. 249 | 250 | The lack of those guarantees can allow a non-keyholder to tamper with the 251 | encrypted content unnoticed. The decrypted plaintext would then contain 252 | planted or substituted information. 253 | 254 | **Authenticated Encryption** (AE) offer authentication and integrity by adding a 255 | MAC. For instance, **GCM** (Galois Counter mode) converts a block cipher to an 256 | authenticated stream cipher which encrypts in counter mode and also produces a 257 | fixed-sized tag (a MAC) for the whole message. 258 | 259 | There are three variants of AE: **Encrypt-then-MAC** (EtM), which hashes 260 | encrypted data, **Encrypt-and-MAC** (E&M), which hashes plaintext data, and 261 | **MAC-then-Encrypt** (MtE), which encrypts hashed plaintext data. 262 | 263 | EtM is considered the most secure. MtE, for instance, has caused vulnerabilities 264 | such as Lucky13 in the way it interacts with padding. 265 | 266 | Usually, you can also insert non-encrypted metadata along with the ciphertext, 267 | which you wish to include for integrity in the AE MAC. That design is called 268 | **Authenticated Encryption with Associated Data** (AEAD). For instance, GCM mode 269 | supports that. 
270 | 271 | ## Asymmetrical cryptography 272 | 273 | Cipher that defines three functions `public, private = keys(random)`, 274 | `encryptPublic(msg, public)`, `encryptPrivate(msg, private)`, such that: 275 | 276 | - `encryptPrivate(encryptPublic(msg, public), private) = msg` 277 | - `encryptPublic(encryptPrivate(msg, private), public) = msg` 278 | - `encryptPublic` and `encryptPrivate` are one-way functions. 279 | 280 | ``` 281 | ┌───────────┐ ─ encryptPublic → ┌───────────┐ 282 | │ message 1 │ │ message 2 │ 283 | └───────────┘ ← encryptPrivate ─ └───────────┘ 284 | ``` 285 | 286 | Most ciphers rely on one of two common mathematically difficult problems to 287 | enforce the one-way constraint: 288 | - factoring primes (**RSA**), 289 | - elliptic curves (**EC**, eg. NIST P-256 (aka secp256r1), or Curve25519). The 290 | keys are typically smaller (eg. 256-bit, compare to 4096 bits for RSA). 291 | 292 | **RSA** (Rivest, Shamir, Adleman) was the first public-key cryptosystem, and 293 | shows how to encrypt data in its original formulation. However, it is usually 294 | used as **RSAES-OAEP** ([RFC 2437][]) for use in encryption, detailing the 295 | proper use of padding by relying on a hash function. 296 | 297 | [RFC 2437]: https://tools.ietf.org/html/rfc2437 298 | 299 | Elliptic curves don’t by themselves have an encryption algorithm, but **ECIES** 300 | (Elliptic Curve Integrated Encryption Scheme) combines an EC, a *KDF*, a *MAC*, 301 | and a *symmetric encryption scheme* to encrypt data just with a public key. 302 | 303 | ### Key exchange 304 | 305 | Encryption is much more computationally expensive than symmetric schemes for 306 | large (> 400 bytes) messages. Since the goal of asymmetric encryption is to 307 | allow secure communication over a public channel without needing a shared secret 308 | (the issue with symmetric ciphers), this is limiting. 
309 | 
310 | A **key exchange** is a protocol where two entities communicate in public,
311 | resulting in them generating a secret key that only they know.
312 | 
313 | **Diffie-Hellman** is a key exchange that lets two parties A and B obtain a
314 | shared secret over a public channel. That secret can then be used as the key of
315 | a symmetric cipher.
316 | 
317 | 1. They each generate public and private keys with the same parameters (the
318 |    modulo portion for RSA, the domain parameters for elliptic curves (ECDH)).
319 | 2. They agree on a base message `m`.
320 | 3. A sends `encryptPrivate(m, privateA)`, and B sends `encryptPrivate(m,
321 |    privateB)`.
322 | 4. A computes `secret = encryptPrivate(encryptPrivate(m, privateB), privateA)`
323 |    and B `secret = encryptPrivate(encryptPrivate(m, privateA), privateB)`, whose
324 |    equality results from commutativity in the underlying math.
325 | 5. `secret` is now shared exclusively between A and B.
326 | 
327 | Advice for the common values to choose is detailed in
328 | [RFC 5114](https://tools.ietf.org/html/rfc5114).
329 | 
330 | Daniel J. Bernstein’s **X25519** is a famous ECDH using Curve25519 (picked by
331 | djb for that use).
332 | 
333 | Systems which generate a new random secret key for every new session are said to
334 | have **forward secrecy**.
335 | 
336 | ### Digital signature
337 | 
338 | **Digital signatures** associated with a message give the following guarantees:
339 | - **authentication**: the message was validated by a keyholder,
340 | - **non-repudiation**: the message cannot be un-validated by the keyholder,
341 | - **integrity**: the message was not modified by a non-keyholder.
342 | 
343 | Unlike a MAC, it does not require a shared secret, just shared public keys.
344 | 345 | It relies on two functions, `sign()` and `verify()`, and a public/private key 346 | pair, such that: 347 | - `verify(message, sign(message, private), public) = true` and all other 348 | parameter combinations yield `false` with high probability. 349 | 350 | For RSA, this is achieved as follows: 351 | - `sign(message, private) = encryptPrivate(hash(message), private)` 352 | - `verify(message, signature, public) = encryptPublic(signature, public) == 353 | hash(message)` 354 | 355 | In practice, **RSASSA-PKCS1-v1\_5** defines an RSA signature scheme for a given 356 | hash. 357 | 358 | **RSASSA-PSS** defines another RSA signature scheme for a given hash, mask 359 | generation formula, and a randomly generated salt of a given size. Both of those 360 | schemes are defined in [RFC 3447][]. 361 | 362 | [RFC 3447]: https://tools.ietf.org/html/rfc3447 363 | 364 | **ECDSA** (Elliptic Curve Digital Signature Algorithm) achieves that scheme, 365 | given any EC and a hash. 366 | 367 | Daniel J. Bernstein’s **EdDSA** is another digital signature scheme relying on 368 | Edwards curves (such as Curve25519), and tends to be faster than ECDSA. The 369 | primary example is ed25519, which is included in OpenSSH. 370 | 371 | ## Quantum Cryptography 372 | 373 | **Shor’s algorithm** solves integer factorization in polynomial time on a 374 | quantum computer, which breaks the complexity assumption of RSA (and a variant 375 | would also break elliptic curve cryptography). The part of the algorithm running 376 | on a classical computer is randomized: pick a number F < N, discard it in some 377 | edge-cases. The quantum part finds the period of f(x) = F^x mod(N) by entangling 378 | photons in a circuit such that interference will cause the observation of the 379 | photons to collapse to one of several states which follow an equation of the 380 | period. Multiple equations allow solving for the period.
Then the classical 381 | computer checks that it can use F and the period to find factors of N. 382 | 383 | However, quantum computers struggle with: 384 | - **Coherence time** (how much time the qubits stay uncorrupted). Keeping an 385 | algorithm running > 10 s is a challenge. 386 | - Number of qubits. The largest quantum computer has just 2000 qubits. Shor’s 387 | algorithm needs twice the size of the RSA key, eg. 4096 for 2048-bit RSA, and 388 | likely ten times that to correct errors. 389 | 390 | The largest number factorized by a quantum computer is 19-bit. There is no case 391 | yet of a quantum computation going faster than the equivalent classical 392 | computation (aka. **quantum supremacy**). 393 | 394 | It should eventually happen, which is why other techniques for asymmetric 395 | cryptography are researched, such as **lattice cryptography**. 396 | 397 | ## Going further 398 | 399 | - [Serious Cryptography](https://seriouscrypto.com) 400 | -------------------------------------------------------------------------------- /misc/engineering.md: -------------------------------------------------------------------------------- 1 | # Engineering 2 | 3 | 1. Define the desired features of a solution for your problem. 4 | 2. Break the problem into tractable subproblems. 5 | 3. Make the simplest working solution to each subproblem. 6 | 4. Compute all limits of your solution (performance, storage… even if the 7 | results are astronomically high). 8 | 5. Research the mathematical and physical laws causing those limits. 9 | 6. Improve the implementation. 10 | 7. When that is no longer enough, improve the design. 11 | 12 | ## Example problem: image store 13 | 14 | 1. We add images with `POST /images` which returns a string ID, and 15 | `GET /images/ID` returns the image. 16 | 2. We must generate unique IDs, store the image, associate an ID to a 17 | location in the store, get the image from its ID. 18 | 3. 
Generate the ID by incrementing a global variable, keep a vector of pointers 19 | to images in the heap, fetch an image by looking it up in the vector at the 20 | ID's index. 21 | 4. Limits: 22 | - Storage costs are about 6 €/GB and the maximum amount of images is about 23 | 600 GB (about 0.6 million photos) since we must have a single server. 24 | - A 1 MB image takes about 10 ms gigabit ethernet throughput + 50 μs RAM 25 | throughput (unnoticeable) = 10 ms to load. 26 | (cf. [memory limits](./memory.md).) 27 | - The probability of data loss in a given year on a server with 28 | 99.999% SLA for operating system uptime with a 3-min reboot (ie, averaging 29 | 1.8 reboots a year) can be computed from a [Poisson](./statistics.md) 30 | distribution as 1-exp(-1.8) (about 0.8). 31 | 5. The price limit is about the diminishing returns of economies of scale of 32 | DRAM production and the infrastructure of its construction and distribution. 33 | The amount limit is about efficiently packing DRAM on a single board for 34 | cloud providers. The probability limit is about electrical volatility of DRAM 35 | state. 36 | 37 | The latter is the primary issue to address. We can store images on disk instead, 38 | on a drive mounted on /img, with the file name equal to the hex representation 39 | of the 64-bit ID, eg. /img/000000000000000c for ID 12. 40 | 41 | (Depending on how the file system deals with directories with a large number of 42 | files, you may need to segment the key space: /img/a0/00000000000000a0 for ID 43 | 160, /img/ef/00000000deadbeef for ID 3735928559.) 44 | 45 | - Storage reaches 40 €/TB, the maximum (still single-server) is about 100 TB 46 | (about 100 million photos). 47 | - A 1 MB image takes 10 ms gigabit ethernet throughput + 10 ms disk throughput + 48 | 10 ms disk latency = 30 ms. 49 | - The probability of loss in a given year (aka. annualized failure rate) for a 50 | disk averaging 0.02 failures a year is `1-exp(-0.02)` (about 0.02). 
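The failure probabilities above all come from the same Poisson model: a component averaging λ failures per year has probability 1 − e^−λ of failing at least once in a year. A small sketch (function name is ours):

```javascript
// Annualized failure rate from a Poisson model: for a component that
// averages `failuresPerYear` failures, P(≥1 failure in a year) = 1 - e^-λ.
const afr = (failuresPerYear) => 1 - Math.exp(-failuresPerYear);

console.log(afr(0.02).toFixed(3)); // "0.020" — the disk figure above
console.log(afr(1.8).toFixed(2));  // "0.83" — the reboot example above
```

For small λ, 1 − e^−λ ≈ λ, which is why the disk's 0.02 failures/year and its ≈0.02 AFR look identical.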
51 | 52 | Going with an SSD instead yields: 53 | - 250 €/TB, maxing out at about 100 TB. 54 | - A 1 MB image takes about 10 ms gigabit ethernet throughput + 1 ms drive 55 | throughput + 30 μs drive latency = 10 ms. 56 | - The annualized failure rate is about 0.007. 57 | 58 | From then on, you can scale by synchronizing multiple servers. A master 59 | server holds a persisted map from ID to the storage server where the image is 60 | stored, loads the data from there to RAM and streams it through. 61 | 62 | When storing an image, the master server sends the image to the storage server 63 | (10 ms) and they simultaneously write to drive: the image for the storage server 64 | (30 μs latency + 1 ms throughput) and the mapping from ID to the storage server 65 | for the master server. 66 | 67 | Both when posting and getting the image, the image can be buffered instead of 68 | being fully loaded by the metadata server and then transmitted, which reduces 69 | image loading latency (time-to-first-byte) to just 500 μs round trip between 70 | servers within a datacenter + 50 μs SSD latency = 550 μs. The full image will 71 | still be loaded after 10 ms. 72 | 73 | Buffering also reduces the amount of RAM necessary to the number of concurrent 74 | requests multiplied by the size of the buffer, plus the mapping between ID and 75 | storage server. 76 | 77 | To reduce the RAM cost, the mapping can be put on disk with a **key-value 78 | store** (eg. [RocksDB](http://rocksdb.org/)) without affecting latency (a write 79 | is about 60 μs, a read 8 μs). 80 | 81 | The new bottleneck is the storage of the mapping from ID to storage server. 82 | - We can reach about 100 TB drive ÷ (64 bits ID + 32 bits index of storage 83 | server) = 8 trillion images (ignoring the overhead of storing into SST files). 84 | - The death of the drive causes complete loss of the data, and it is still at an 85 | AFR of 0.7%. 86 | 87 | We can mitigate the latter through backups.
If we do them every B hours, get 88 | P posts per hour and recover in R hours, we will lose 0.7 × (B + R) × P ÷ (P × 89 | 8760) = 0.7 × (B + R) ÷ 8760 percent of posts a year — the posting rate P cancels out (eg. a 99.9999% SLA for hourly backups with 100 90 | posts per second and 2-minute recovery). 91 | **Streaming replication** can get our SLA very high. 92 | 93 | Individual storage servers can also fail, so we can replicate: we save images 94 | to two servers in parallel, and we fetch them from the first server that is not 95 | dead. The probability that any two replicas die within the time it takes to 96 | rereplicate (say, 1h) is very low, 1-(1-8×10^-7)^I (where I is the number of 97 | images), but at a million images, a loss is as likely as a coin toss. 98 | **Trireplication** brings it down to 1-(1-6×10^-13)^I; you need a trillion 99 | images for a loss to again be a coin toss. 100 | 101 | (Another issue is that of [bitrot](./memory.md). A solution used in GFS (Google 102 | File System) is to check for corruption by computing a checksum, and fetching 103 | from one of the other two replicas if it doesn't match the stored checksum. Its 104 | successor Colossus doesn't trireplicate, but stores a replica off-site and a 105 | Reed-Solomon erasure code on a separate server in the datacenter. If a 106 | corruption is found, the data is replaced by its remote replica.) 107 | 108 | The new bottleneck is the throughput of image posts, since they all go through a 109 | single server. Write frequency reaches 100 kHz [in some benchmarks][badger]. 110 | 111 | [badger]: https://blog.dgraph.io/post/badger/ 112 | 113 | To solve that, we rely on a **distributed key-value store**, such as etcd or 114 | FoundationDB, which distributes writes through automatic segmentation of the key 115 | space into chunks each assigned to a server, and through coordination of that 116 | partitioning and of writes by using Paxos or [Raft][] (two similar 117 | [consensus algorithms](./synchronization.md)).
118 | 119 | [Raft]: https://raft.github.io/ 120 | 121 | Since there is no longer a single writer, we can no longer generate IDs by 122 | incrementing a counter. We can solve this by generating 64-bit IDs that are the 123 | bitwise concatenation of the 32-bit ID of the master server and a 32-bit counter 124 | incremented on that server. 125 | 126 | The write latency goes up a bit because the consensus algorithm typically 127 | requires two round trips between the servers (1 ms). 128 | 129 | From then on, the next bottleneck is the gateway server your client is connected 130 | to: it has a limit in the number of concurrent requests it supports (the C10k 131 | problem). It can probably handle 10k concurrent requests, and each request takes 132 | 10 ms, which computes to a request frequency of 1 MHz. 133 | 134 | We can change the requirements to have the gateway redirect you to a random 135 | master server, which ultimately will be bottlenecked by the fact that 136 | coordination of the distributed key-value store requires all servers to keep 137 | some information about all other servers. That coordination also has hidden 138 | costs beyond thousands of master servers relating to gossip protocols, 139 | clock skew, and datacenter management. 140 | 141 | At that point, we can store on the order of 1k servers × 100 TB ÷ (64-bit ID + 142 | 3 × 32-bit storage server ID) = 5 quadrillion images. 143 | The storage servers can still hold them with a 32-bit ID: their limit is at 2^32 144 | servers × 100 TB ÷ 1 MB = 430 quadrillion images. 145 | 146 | The next step is then to use a **Distributed Hash Table** (DHT), wherein each 147 | server (and client) only needs to know a subset of all servers, and they get the 148 | value of a key typically in log(N) hops, where N is the number of servers. The 149 | latency is then at 10 + log(N) × L ms (with L the average latency between two 150 | servers: 1 ms within a city, but it rises fast when going to other cities).
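The conflict-free 64-bit ID scheme described earlier (a 32-bit master-server ID concatenated with that server's 32-bit counter) can be sketched as follows (a sketch; the helper names are ours):

```javascript
// 64-bit IDs that never collide across writers: high 32 bits identify the
// master server, low 32 bits are that server's own incrementing counter.
function makeId(serverId, counter) {
  return (BigInt(serverId) << 32n) | BigInt(counter);
}
function serverOf(id)  { return Number(id >> 32n); }
function counterOf(id) { return Number(id & 0xFFFFFFFFn); }

const id = makeId(7, 12);
console.log(id.toString(16).padStart(16, "0")); // "000000070000000c"
console.log(serverOf(id), counterOf(id));       // 7 12
```

The hex form doubles as the on-disk file name used earlier (eg. /img/000000070000000c).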
151 | 152 | At that point, since servers are no longer assigned an incrementing ID, image 153 | IDs are randomly generated, typically 128 bits (**UUID**), or obtained by 154 | [hashing][] the image (**content-addressing**). 155 | 156 | [hashing]: ./cryptography.md 157 | 158 | This is the design used by most object stores, such as S3, Ceph, GlusterFS, 159 | Bittorrent, or IPFS, and related databases, such as Cassandra and Dynamo. 160 | 161 | Then we reach the limit of Earth's surface area. The problem then involves the 162 | severe latency of interplanetary communication (13 minutes on average for light 163 | from Earth to reach Mars). 164 | -------------------------------------------------------------------------------- /misc/memory.md: -------------------------------------------------------------------------------- 1 | # Memory 2 | 3 | Memory is either **volatile** (requires constant power) or persistent, random-access (**RAM**) or sequential, read-only (**ROM**) or writable, and has varying costs, durability and transfer speeds. 4 | 5 | As far as speed is concerned, it is important to distinguish **latency** (duration between the request and the start of the response) and **throughput** (amount of bits per second). 6 | For instance, the fastest way to transmit 20 TB across the globe is on SD cards by cargo plane: even on 1 Gbps Ethernet, it takes 44 hours, while a plane takes 22 hours. 7 | Planes have tremendous throughput, but their latency is 22 hours, while Ethernet's will be around 100 ms. 8 | 9 | ## CPU 10 | 11 | The CPU has **registers** to store memory for computation. They are as fast as the processor (about 1 cycle per read or write), but there are about 16 64-bit registers (and 16 128-bit registers for floating point computation). 12 | (There are a handful of other registers for SIMD, SSE, …) 13 | 14 | The CPU also has caches with varying latencies and amounts: **L1** (0.5 ns), **L2** (7 ns), sometimes **L3**, altogether worth about 3 MB on laptops.
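The latency/throughput distinction above can be made concrete: transfer time ≈ latency + size ÷ throughput. A sketch with the figures from the text (treating the plane's effective throughput as practically unbounded, which is our assumption):

```javascript
// Transfer time = latency + size / throughput.
function transferSeconds(bits, latencySeconds, bitsPerSecond) {
  return latencySeconds + bits / bitsPerSecond;
}

const twentyTB = 20e12 * 8; // in bits

// 1 Gbps Ethernet with ~100 ms latency: throughput-bound.
const ethernet = transferSeconds(twentyTB, 0.1, 1e9);
// Cargo plane: 22 h latency, throughput assumed effectively unbounded.
const plane = transferSeconds(twentyTB, 22 * 3600, 1e15);

console.log((ethernet / 3600).toFixed(0)); // "44" hours
console.log((plane / 3600).toFixed(0));    // "22" hours
```

The plane's time is almost pure latency; the Ethernet's is almost pure throughput.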
15 | 16 | ## Main Memory 17 | 18 | The main location of volatile storage; 100 ns latency, 20 GB/s. When a process is run, its code and data are located in the RAM. The memory given to a process is separated into five **segments**: 19 | 20 | - The **stack** stores local variables of all function calls leading to and including the currently executed function. All local variables of a function are destroyed when the function ends. 21 | - The **heap** stores data dynamically allocated and deallocated by the program, pointed to by a **pointer** on the stack or on the heap. It is useful to manage a common data structure from several functions, but the indirection is slower than accessing data on the stack. 22 | - In glibc, allocation here is performed by `malloc()` and deallocation by `free()`. 23 | - In *reference counting*, each object on the heap stores an integer that keeps track of the number of pointers on the stack and on the heap pointing to it. When that count reaches zero, the object decrements the reference counts of the objects it has pointers to, and gets deallocated. When done automatically (without explicitly incrementing and decrementing the reference count), since reference cycles would prevent reference counts from ever reaching zero (eg. stack pointer → A(1) → B(1), then set a pointer in B to A: stack pointer → A(2) → B(1) → A(2), then pop the stack: A(1) → B(1) → A(1)), it is common to rely on explicit weak references (that don't contribute to the reference count; here, we would set the B → A reference as a weak reference). Another approach is to set them to be garbage-collected through mark-and-sweep. 24 | - In tracing GC (*garbage collection*), a heuristic determines when deallocation is needed, at which point an algorithm is run to flag all unreachable heap objects, and then deallocates them. 25 | - The **data** contains statically allocated data (ie. 
created when the program starts, destroyed when it ends): global variables, string constants… The ones that are not initialized at startup are in **BSS** (Block Started by Symbol). 26 | - The **text** stores the code as executable machine instructions. 27 | 28 | ``` 29 | ┌──────┬──────┬─────┬──────────────────────┬─────────┬──────────────────────┐ 30 | │ text │ data │ bss │ heap (grows right) → │ (empty) │ ← stack (grows left) │ 31 | └──────┴──────┴─────┴──────────────────────┴─────────┴──────────────────────┘ 32 | ``` 33 | 34 | Each program has its own read, write, and execute access to parts of memory, enforced by the operating system, which can terminate the program with a **segmentation fault** if it does an unauthorized operation. 35 | 36 | The process' memory relies on **virtual memory**: it normally lives on volatile storage. However, when main memory becomes scarce, some *memory pages* (fixed blocks of virtual memory) are *swapped*: they are transferred to auxiliary memory (ie, storage drives) to make room in main memory. 37 | 38 | **Primary storage** refers to registers, CPU caches, and main memory. 39 | 40 | ## Secondary Storage 41 | 42 | **Auxiliary memory** persists data for a few years without being powered. 43 | 44 | - **Hard Disk Drive** (HDD for short): rotating disk coated with magnetic material, with a magnetic head moving from the edge to the center of the disk to go to a particular memory position. 45 | - **Solid State Drive** (SSD, Flash Storage): integrated circuit, where bits are stored in transistor cells as trapped electrons on an insulator. 46 | 47 | The trade-off: transfer speeds are much lower than RAM's. 48 | 49 | - HDD: 100 MB/s, 10 ms latency, but it pays that latency every time it needs to move to a completely different location on disk. 50 | - SSD: 1 GB/s, 30 μs latency. 51 | 52 | Persistence is not a certainty. First, environmental events like water damage, shock, melting or electromagnetic fields can destroy stored information. 
53 | 54 | Second, drives age probabilistically. 55 | Manufacturers use two metrics: Mean Time Between Failures (MTBF, expected lifetime) and Annualized Failure Rates (AFR, probability that a drive dies within a year). The two are linked by a [Poisson](./statistics.md) distribution: `AFR = 1-exp(-8760/MTBF)` (with MTBF in hours). 56 | 57 | - HDD: AFR of [2%][Backblaze AFR]. Note that aging affects AFR (it is about 5% for 1.5 years, then 1.5% for 1.5 years, then 12%, [according to Backblaze][Backblaze age analysis]). 58 | - SSD: AFR of [0.7%][Microsoft SSD Failures]. 59 | 60 | [Backblaze AFR]: https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/ 61 | [Backblaze age analysis]: https://www.backblaze.com/blog/how-long-do-disk-drives-last/ 62 | [Microsoft SSD Failures]: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf 63 | 64 | Third, **bitrot** (ie. a random change of a bit from 0 to 1 or vice-versa) happens. HDDs are exposed to a huge amount of cosmic radiation that can impact their magnetic material; it happens maybe once a year. In addition to that, SSDs are made of 1- or 2-bit floating-gate transistors that have a very predictable wear. Eventually, they return the wrong value. 65 | 66 | ### File system 67 | 68 | Persistent storage is typically organized hierarchically, as a tree of files: leaves contain blobs of bytes, while their ancestors in the tree, directories, associate each child with a name. 69 | 70 | #### Disk Layout 71 | 72 | Typically, on Linux, macOS and similar Unix operating systems, each file has an **inode** stored on disk, which includes the following information: 73 | 74 | - Which device is it on? 75 | - What user owns it? What group owns it? 76 | - What are its permissions? (Can the user read it? write to it? execute it? How about the group's users? How about other users?) 77 | - What type of file is it? 
(regular, directory, link, socket, device, FIFO… Along with the permissions, they are stored as six octal digits in `st_mode`) 78 | - When was its metadata last changed (`ctime`)? When was the content last modified (`mtime`)? When was it last accessed (`atime`)? 79 | - It also contains pointers to fixed-sized **blocks** on disk holding the file's content. Reading the file gets the bytes from those blocks one after the other. 80 | 81 | Traditional file systems (eg. Linux' ext3, ufs) typically have inodes include a dozen pointers to blocks holding content (direct blocks). If the file is too big to fit in those blocks, it uses a couple of pointers from the inode to blocks holding pointers to blocks holding content (indirect blocks). If the file is still too big, it uses another pointer from the inode to blocks holding pointers to blocks holding pointers to blocks holding content (double indirect blocks). 82 | 83 | More modern designs (Linux' [ext4], macOS' HFS) rely on **extents** for content storage: contiguous blocks. Instead of the inode holding a pointer to the start of a block, it has a pointer to the start of the extent, and the number of blocks that the extent covers. It also has a field to determine whether the extent contains the file's content, or pointers to other extents. Having contiguous blocks reduces the amount of indexing data (eg, only one pointer and the number "4" to say we have a 4-block extent, instead of four pointers to four blocks), and it avoids making hard drives seek to new locations on-disk after each block is read (which can cost 10 ms every time). 84 | 85 | Because newly created extents may not fit between existing extents, the file system needs to deal with that *external fragmentation*. Among the tricks used to fight this, *delayed allocation* (aka. 
allocate-on-flush) is a technique that aggregates file writes in memory for a few seconds, and writes (*flushes*) them all at once to the disk in a way that avoids leaving small unused spaces between extents. 86 | 87 | An alternative technique to journaling, *copy-on-write* (COW), used by Linux' btrfs and macOS' APFS, never directly edits existing extents: instead, it writes to a brand-new extent, and then makes the inode point to the new extent. 88 | 89 | [ext4]: https://ext4.wiki.kernel.org/index.php/Ext4_Design 90 | 91 | #### File Operation 92 | 93 | When a process needs to access a file, it opens it with its path (eg. `/home/user/file`) and with flags determining how it can be manipulated (one of `O_RDONLY` (read-only access), `O_RDWR` (read and write access), `O_WRONLY` (write-only access), and optionally `O_CREAT` (create the file if it doesn't exist), `O_APPEND` (only write at the end of the file), etc.). 94 | 95 | The system then gives an integer to the process: the **file descriptor**. The list of file descriptors a process has can be obtained with `ls /proc/<pid>/fd`. The process uses the file descriptor to interact with the file. 96 | 97 | The file descriptor has a cursor determining where it starts reading from, which is initially at the start (or the end, for `O_APPEND`). When you read a number of bytes from the file, the cursor moves forward by that amount. You can change the position of the cursor with `lseek()`. When you write bytes to a file, they overwrite the bytes at the cursor position (they are not inserted). 98 | 99 | Once the file is dealt with, it must be closed, to allow the operating system to release resources. 100 | 101 | ### Disk arrays 102 | 103 | Storing data in a single location is dangerous, as the drive may fail. Regularly copying the data to another drive (making a **backup**) is common: when the drive fails, it can be replaced and the data copied from the backup. 
That can be done by physically mounting the backup drive and using `cp` or `rsync`, or by using `rsync` over the network to a remote backup server (rsync avoids resending data that is already backed up), or by using `btrfs send` over SSH (which also avoids resending data in more subtle ways). 104 | 105 | However, when the drive fails, the data stops being available in the meantime. More importantly, corrupted data goes undetected, as the drive won't complain, and the corruption will blindly be copied to the backup. 106 | 107 | Instead of an external or remote drive that is only connected for backups, your computer can have two permanently connected drives. You can use the second drive exactly as if it were a backup. It contains a perfect copy of the content. That way, if one drive fails, your data is still safe, and is still available while the failed drive is replaced. This is called **RAID1**. 108 | 109 | To automatically correct bitrot, you can use a setup of multiple drives where consecutive blocks are distributed across drives (**data striping**, which reduces the time needed to read extents), and you also distribute parity information across drives. That parity information lets you detect and correct corruption in the data. This is called **RAID5**, and it also has all the pros of RAID1, except that it requires at least three drives. 
116 | - Distribution: having all of the data be replicated in all locations burns egress, bandwidth, and disk usage, all of which are costly. Therefore, each piece of data is only located on a subset of the nodes. Determining which nodes are used is either manual or relies on a distributed hash table (**DHT**). 117 | - Synchronization: if multiple nodes allow updating the data, writes between those nodes can conflict, in which case a conflict resolution system or some kind of CAP compromise is required. The **CAP theorem** states that when a partition (P) in the network occurs (ie, some nodes can't communicate with others), the system must choose between maintaining consistency (C, the guarantee that reads return the most recent write) by making requests wait until the network recovers, or maintaining availability (A) by responding to requests immediately. 118 | - Parallelism: a piece of data can be **striped** over multiple nodes, so that reading it will be faster. 119 | - **Deduplication**: to use less disk space, the network can detect pieces of data that hold the same content, and only store it once (at the replication factor). 120 | - **Byzantine fault-tolerance**: while having multiple copies protects against a data center being destroyed or cut off, some Internet storage systems need to assume that some nodes may be malicious. The system then needs to protect the data and updates of non-malicious nodes as long as fewer than half of the nodes are malicious. 121 | 122 | Common solutions include [GlusterFS] (pieces of data are at the file level), [CephFS] (at the block level), Hadoop HDFS, Amazon S3. 123 | 124 | [GlusterFS]: https://www.gluster.org/ 125 | [CephFS]: http://ceph.com/ 126 | 127 | Within the same datacenter, you can expect 500 μs of latency; across the planet, about 100 ms. 
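The parity idea behind RAID5 (and erasure coding generally) can be sketched with XOR: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. A sketch on two-byte blocks (the names are ours):

```javascript
// XOR parity across blocks, as in RAID5: parity = block1 ^ block2 ^ …,
// so any one missing block equals the XOR of all the remaining ones.
const xorBlocks = (a, b) => a.map((byte, i) => byte ^ b[i]);

const drive1 = [0x12, 0x34];
const drive2 = [0xab, 0xcd];
const parity = xorBlocks(drive1, drive2); // stored on a third drive

// drive1 fails: rebuild its block from the survivors.
const rebuilt = xorBlocks(drive2, parity);
console.log(rebuilt.every((byte, i) => byte === drive1[i])); // true
```

Real RAID5 rotates which drive holds the parity for each stripe, so no single drive becomes a parity bottleneck.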
128 | -------------------------------------------------------------------------------- /misc/network.md: -------------------------------------------------------------------------------- 1 | # Networks 2 | 3 | Computers [communicate](./information.md) to reach a goal. For instance, you 4 | contact Youtube to see cat videos, Youtube responds to gain advertising revenue. 5 | 6 | A **network** can be represented with a graph where vertices are processing 7 | machines and edges are transmission links. Examples of networks include the 8 | Internet, telephones, and walkie-talkies. 9 | 10 | ## Protocols 11 | 12 | Certain documents (typically, standards and Requests For Comments (RFCs)) set 13 | the way in which information is transmitted through the network, first as bits, 14 | then as higher-level concepts. They can depend on the existence of lower-level 15 | protocols, forming a *protocol stack*. The **OSI model** theorizes the layers 16 | that a protocol stack is made of: 17 | 18 | - Physical: transmission of bits through a medium (eg: Ethernet PHY chip for 19 | 100BASE-TX), 20 | - Data link: transmission of frames mostly between adjacent nodes, to determine 21 | the start and end of messages (eg: MAC, PPP), 22 | - Network: transmission of packets for routing across the graph (eg: IP), 23 | - Transport: transmission of segments, so applications on both endpoints can 24 | exchange messages with given reliability guarantees (eg: TCP, UDP, ICMP 25 | (ping)), 26 | - Session: setup and recognition of endpoints across messages, 27 | - Presentation: encoding of data (charset, compression, encryption) (eg: TLS, 28 | HTTP with MIME to some extent), 29 | - Application: serialization of data structures (eg: HTTP (documents), NTP 30 | (time), SMTP (email), FTP (file)). 31 | 32 | Let's focus on a typical stack. 
33 | 34 | ### HTTP 35 | 36 | **HyperText Transfer Protocol** ([HTTP][]) is an application-layer and 37 | presentation-layer protocol designed for client-server document transmission. 38 | For instance, to request the main page of an HTTP server on your computer: 39 | 40 | GET / HTTP/1.1 41 | Host: localhost:1234 42 | Accept: text/html 43 | 44 | [HTTP]: https://tools.ietf.org/html/rfc2616 45 | 46 | (Each newline is made of two bytes: 0x0D and 0x0A, aka. CR-LF; it ends with two 47 | newlines). The server may respond: 48 | 49 | HTTP/1.1 200 OK 50 | Content-Type: text/html 51 | Date: Sat, 31 Dec 2016 15:31:45 GMT 52 | Connection: keep-alive 53 | Transfer-Encoding: chunked 54 | 55 | 7E 56 | <!doctype html> 57 | <html> 58 | <head></head> 59 | <body> 60 | This is HTML 61 | </body> 62 | </html> 63 | 64 | 65 | 0 66 | 67 | This response includes an [HTML][] file that the HTTP client (for instance, a 68 | browser, like Firefox or Google Chrome) will read as instructions on how to lay 69 | out a page, which determines the pixels to display, the animations to show, the 70 | interactions to execute when the user moves or clicks the mouse, the sounds to 71 | play, etc. 72 | 73 | [HTML]: https://html.spec.whatwg.org/multipage/ 74 | 75 | All requests have a first line with a method (`GET`), a path (`/`), and a 76 | protocol (`HTTP/1.1`), optionally followed by headers mapping header names 77 | (`Accept`) to their value (`text/html`). Requests may also carry data. 78 | 79 | Responses have a first line with a protocol (`HTTP/1.1`), a code (`200 OK`; 80 | codes starting with 1 are informational, 2 for success, 3 for redirection, 4 for 81 | client errors, 5 for server errors). Responses usually carry data (here, the 82 | HTML file), and also have headers explaining what the data is, how it is encoded 83 | (charset, compression), what time it is, whether to use caching, how to store 84 | session information (through cookies) and so on. 
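The request layout above (method line, headers, CR-LF line endings, blank-line terminator) can be sketched as a string builder (the function name is ours):

```javascript
// Build a raw HTTP/1.1 request as in the example above: a method line,
// one line per header, and a terminating blank line — all CR-LF separated.
function buildRequest(method, path, headers) {
  const lines = [`${method} ${path} HTTP/1.1`];
  for (const [name, value] of Object.entries(headers)) {
    lines.push(`${name}: ${value}`);
  }
  return lines.join("\r\n") + "\r\n\r\n";
}

const req = buildRequest("GET", "/", {
  Host: "localhost:1234",
  Accept: "text/html",
});
console.log(JSON.stringify(req));
// "GET / HTTP/1.1\r\nHost: localhost:1234\r\nAccept: text/html\r\n\r\n"
```

Writing that exact string to a TCP socket on port 1234 is all it takes to speak HTTP/1.1 to the server above.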
85 | 86 | As mentioned, HTTP includes presentation-layer "protocols" in headers, such as 87 | **Multipurpose Internet Mail Extensions** ([MIME][]) in `Content-Type`, to 88 | specify the type of the file `/` (eg. `text/plain`), or whether it 89 | recursively contains subfiles with [`multipart/form-data`][form-data], with each 90 | subfile specifying their own headers: 91 | 92 | POST /upload HTTP/1.1 93 | Host: localhost:1234 94 | Content-Length: 882 95 | Content-Type: multipart/form-data; boundary=random0ACxeUx4Nxqy3roVtMxrAw 96 | 97 | --random0ACxeUx4Nxqy3roVtMxrAw 98 | Content-Disposition: form-data; name="name-of-first-part" 99 | Content-Type: text/plain 100 | 101 | This first file contains normal plain text. 102 | --random0ACxeUx4Nxqy3roVtMxrAw 103 | Content-Disposition: form-data; name="multiple-images"; filename="image.svg" 104 | Content-Type: image/svg+xml; charset=UTF-8 105 | 106 | <svg xmlns="http://www.w3.org/2000/svg"> 107 | <text y="15"> 108 | This is an image 109 | </text></svg> 110 | --random0ACxeUx4Nxqy3roVtMxrAw 111 | Content-Disposition: form-data; name="multiple-images"; filename="image.png" 112 | Content-Type: image/png 113 | Content-Transfer-Encoding: base64 114 | 115 | iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAIAAACQd1PeAAAAAXNSR0IArs4c6QAAAARnQU1BAACx 116 | jwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAAMSURBVBhXY2BgYAAAAAQAAVzN/2kAAAAASUVO 117 | RK5CYII= 118 | --random0ACxeUx4Nxqy3roVtMxrAw-- 119 | 120 | (Note that our use of base64 in image.png is deprecated; in real life, it would 121 | be replaced by the binary data directly.) 122 | 123 | [MIME]: https://tools.ietf.org/html/rfc2045 124 | [form-data]: https://tools.ietf.org/html/rfc7578 125 | 126 | HTTPS is HTTP transmitted over a TLS connection: all the HTTP data, including 127 | headers, is encrypted to prevent intermediate nodes on the network from reading 128 | or modifying the content, which is necessary when transmitting identification or 129 | banking information, and to avoid being fooled into performing dangerous acts. 
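The multipart layout above — each part with its own headers, parts separated by `--boundary`, the body closed by `--boundary--` — can be sketched as a body builder (a sketch with text-only parts; names are ours):

```javascript
// Assemble a multipart/form-data body: "--boundary", part headers, a blank
// line, the part's data; finally "--boundary--" closes the whole body.
function buildMultipart(boundary, parts) {
  let body = "";
  for (const part of parts) {
    body += `--${boundary}\r\n`;
    body += `Content-Disposition: form-data; name="${part.name}"\r\n`;
    body += `Content-Type: ${part.type}\r\n\r\n`;
    body += part.data + "\r\n";
  }
  return body + `--${boundary}--\r\n`;
}

const body = buildMultipart("random0ACxeUx4Nxqy3roVtMxrAw", [
  { name: "name-of-first-part", type: "text/plain",
    data: "This first file contains normal plain text." },
]);
console.log(body.split("\r\n")[0]); // "--random0ACxeUx4Nxqy3roVtMxrAw"
```

The boundary must be a string that never appears in any part's data, which is why real clients generate it randomly.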
130 | 131 | ### TCP 132 | 133 | **Transmission Control Protocol** ([TCP][]) is a transport-layer protocol to 134 | ensure that all sent segments are received uncorrupted in the same order. That 135 | is achieved by reordering received segments and resending lost or corrupted ones. 136 | When using IP, it cuts its segments into pieces that fit in a packet. 137 | 138 | 1. The server starts to listen to a port. 139 | 2. The client starts to connect with a SYN. 140 | 3. The server informs the client that it received it with a SYN+ACK. 141 | 4. The client sends an ACK. 142 | 5. The server and the client can now send a series of packets to each other 143 | full-duplex, and they ACK each reception if all previously received packets 144 | have been received in order. 145 | 6. The client sends a FIN. 146 | 7. The server sends a FIN+ACK (or an ACK followed by a FIN). 147 | 8. The client sends an ACK. (The connection stays open until it times out.) 148 | 149 | A TCP header includes: 150 | 151 | - source port in 2 bytes, 152 | - destination port in 2 bytes, 153 | - sequence number in 4 bytes: 154 | - in a SYN, this is the client Initial Sequence Number (ISN), usually picked 155 | randomly, 156 | - otherwise it is (the sender's own ISN) + 1 + number of bytes previously sent, ensuring 157 | that packets can be reordered to obtain the original segment. 158 | - acknowledgement number in 4 bytes: 159 | - in a SYN-ACK, this is (client ISN) + 1, and the server sequence number is 160 | picked. 161 | - in an ACK, this is (server ISN) + number of bytes received + 1, which is the 162 | expected next sequence number to be received from the server.
163 | - data offset in 4 bits, the size of the TCP header in 32-bit words (defaults to 164 | 5), 165 | - 000 (reserved), 166 | - flags in 9 bits: NS, CWR, ECE, URG (read urgent pointer), ACK (acknowledge 167 | reception of data or SYN), PSH (push buffered data received to the 168 | application), RST (reset connection), SYN (synchronize sequence number, only 169 | used in the initial handshake), FIN (end of data, only used in the final 170 | handshake), 171 | - window size in 2 bytes, allowing flow and congestion control, 172 | - checksum in 2 bytes to check header and data corruption, 173 | - urgent pointer in 2 bytes pointing to a sequence number, 174 | - options (if the data offset is > 5, zero-padded) eg. maximum segment size, or 175 | window scale, 176 | - payload. 177 | 178 | [TCP]: https://tools.ietf.org/html/rfc793 179 | 180 | ### IP 181 | 182 | **Internet Protocol** ([IP][]) is a network-layer protocol that ensures that 183 | packets go to their destination despite having to transit through several 184 | machines on the way. 185 | It also cuts the packets into fragments that fit in the link-layer frame. 186 | There are two major versions of IP in use: IPv4 is the most used, and is slowly 187 | replaced by IPv6. 
188 | 189 | IPv4 headers have the following fields: 190 | 191 | - version in 4 bits, 192 | - Internet Header Length (IHL) in 4 bits, as a number of 32-bit words, 193 | - Quality of Service (QoS) in 1 byte: ranks packet priority; it is typically cut 194 | into 6 bits of Differentiated Services Code Point (DSCP) and 2 bits of 195 | Explicit Congestion Notification (ECN), 196 | - length of the packet in bytes, in 2 bytes, 197 | - identification tag in 2 bytes, to reconstruct the packet from multiple 198 | fragments, 199 | - 0 in 1 bit, 200 | - Don't Fragment (DF) in 1 bit, set if the packet must not be fragmented, 201 | - More Fragments (MF) in 1 bit, set if the rest of the packet is in subsequent 202 | fragments, 203 | - fragment offset in 13 bits, identifying the position of the fragment in the 204 | packet, 205 | - Time To Live (TTL) in 1 byte: the number of remaining nodes in the network 206 | graph that the packet is allowed to go through; each node decrements that 207 | number and drops the packet if it reaches 0, avoiding infinite loops, 208 | - protocol of the payload in 8 bits (TCP, UDP, ICMP, etc.), 209 | - header checksum in 16 bits to detect corruption, 210 | - source IP address in 32 bits, 211 | - destination IP address in 32 bits, 212 | - payload (eg, TCP content). 213 | 214 | Fragmenting the packet and reordering it at the other end was designed for cases 215 | where the packet must be transmitted over a link which cannot hold the whole 216 | packet (typically, the maximum Ethernet frame size, or when the destination 217 | doesn't have enough memory to hold the packet). 218 | 219 | However, TCP can cut its segments into arbitrarily sized packets to fit in an 220 | Ethernet frame, and fragmentation makes packet analysis harder as the tail 221 | fragments don't hold the TCP segment headers. Besides, Path MTU (Maximum 222 | Transmission Unit) Discovery (PMTUD) allows determining the size of the 223 | physical-layer frame in a path through the network.
As a result, IPv6 disallows 224 | fragmenting packets within the path, requiring the sender to either form its 225 | packets at the right size (for TCP), or to form its fragments at the right size 226 | (for UDP and ICMP, which cannot cut their data into multiple packets). 227 | 228 | Note that packets can be lost, duplicated, received out of order, or corrupted 229 | without the IP layer noticing. It is up to TCP to prevent that from happening. 230 | 231 | IP addresses segment the network into increasingly smaller subnetworks, with 232 | **routers** processing packets in and across networks. They can be obtained from 233 | the **Dynamic Host Configuration Protocol** (DHCP), auto-assigned, or manually 234 | set. 235 | 236 | IPv4 addresses fit in 4 bytes, commonly written in dot-separated decimal, eg. 237 | `172.16.254.1`. To denote a subnetwork (which has adjacent numbers), we use 238 | Classless Inter-Domain Routing (CIDR) notation: `<network address>/<bitmask>`. For instance, 240 | 192.168.2.0/24 includes addresses from 192.168.2.0 to 192.168.2.255, although 241 | you cannot use the address ending in .0 (network address), used to identify the 242 | network, nor that ending in .255 (broadcast address), used to broadcast to all 243 | devices on the network.
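The CIDR arithmetic above can be checked with Python's standard `ipaddress` module:

```python
import ipaddress

net = ipaddress.ip_network("192.168.2.0/24")
net.num_addresses            # 256: 192.168.2.0 through 192.168.2.255
net.network_address          # 192.168.2.0, identifies the network
net.broadcast_address        # 192.168.2.255, reaches every device on it
# Usable host addresses exclude the network and broadcast addresses:
hosts = list(net.hosts())    # 192.168.2.1 through 192.168.2.254
len(hosts)                   # 254

# Private ranges such as 10.0.0.0/8 are flagged by the module:
ipaddress.ip_address("10.1.2.3").is_private  # True
```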
244 | 245 | There are special ranges of addresses: 246 | 247 | - 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 for private networks (ie, 248 | not globally routable; they are typically behind a Network Address Translator 249 | (NAT)), 250 | - 0.0.0.0/8 for "no address", used as source address when getting an IP address, 251 | - 100.64.0.0/10 "shared address space", similar to private networks, but for 252 | Carrier-Grade NAT (CGN), 253 | - 127.0.0.0/8 for loopback (sending network data within a single node), most 254 | notably 127.0.0.1 (which the localhost hostname usually resolves to), 255 | - 169.254.0.0/16 for IP assignment between link-local, autoconf/zeroconf 256 | addresses, 257 | - 192.0.0.0/24 for IANA, 258 | - 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24 are reserved for testing and 259 | examples in documentation, 260 | - 192.88.99.0/24 for IPv6-to-IPv4 anycast routers for backwards compatibility, 261 | - 198.18.0.0/15 for network performance testing, 262 | - 224.0.0.0/4 for IP multicast, 263 | - 240.0.0.0/4 blocked for historical reasons, 264 | - 255.255.255.255 for broadcast. 265 | 266 | IPv6 addresses fit in 16 bytes, with pairs of bytes represented as 267 | colon-separated hexadecimal numbers, with adjacent zeros replaced by `::` once 268 | in the address. 269 | 270 | - unicast has a ≥ 48-bit routing prefix, a ≤ 16-bit subnet id defined by the 271 | network administrator, and a 64-bit interface identifier obtained either by 272 | DHCPv6, the MAC address, random, or manually. 273 | - :: for unspecified address. 274 | - ::1 for localhost, 275 | - fe80::/64 for link-local communication; cannot be routed; all other addresses 276 | in fe80::/10 are disabled, 277 | - fc00::/7 for Unique Local Addresses (ULAs), similar to private networks: 278 | - fc00::/8 for arbitrary allocation, 279 | - fd00::/8 for random allocation (with a 40-bit pseudorandom number). 
280 | - ff00::/8 for multicast, with 4 flag bits (reserved, rendezvous, prefix, 281 | transient) and 4 scope bits: 282 | - general multicast has a 112-bit group ID, including: 283 | - ff01::1 to all interface-local nodes, 284 | - ff02::1 to all link-local nodes, 285 | - ff01::2 to all interface-local routers, 286 | - ff02::2 to all link-local routers, 287 | - ff05::2 to all site-local routers, 288 | - ff0X::101 to all NTP servers, 289 | - ff05::1:3 to all DHCP servers. 290 | - ff02::1:ff00:0/104 solicited-node multicast has a link-local scope and the 291 | low 24 bits of the unicast address, 292 | - unicast-prefix-based multicast has a 64-bit network prefix (= routing prefix 293 | + subnet id) and a 32-bit group ID. 294 | - ::ffff:0:0/96 (IPv4-mapped IPv6 addresses), ::ffff:0:0:0/96 (IPv4-translated 295 | addresses in the Stateless IP/ICMP Translation (SIIT) protocol), 64:ff9b::/96 296 | (automatic IPv4/IPv6 translation), 2002::/16 (6to4) for transitioning from 297 | IPv4, 298 | - 2001::/29 through 2001:01f8::/29 for IANA special purposes (tunneling, 299 | benchmarking, ORCHIDv2), 300 | - 2001:db8::/32 for examples in documentation, 301 | - 0100::/64 to discard traffic. 302 | 303 | Some IP addresses can be mapped to a name (eg, `en.wikipedia.org` → 304 | 91.198.174.192) by using the **Domain Name System** (DNS), a naming system for 305 | Internet entities. Companies that can allocate a new domain name are called 306 | **registrars**. They publish their information as zone files, and allow 307 | authenticated editing of those files by the domain name owners as part of a 308 | business arrangement. 309 | 310 | ; Example zone file. 311 | $ORIGIN example.com. 312 | $TTL 1h 313 | ; Indicates that the owner is admin@example.com. 314 | example.com. IN SOA ns.example.com. admin.example.com. (2017011201 1d 2h 4w 1h) 315 | example.com. IN NS ns ; Indicates that ns.example.com is our nameserver. 316 | example.com. IN MX 10 mail.example.com. 317 | example.com.
IN A 91.198.174.192 ; IPv4 address 318 | AAAA 2001:470:1:18::118 ; IPv6 address 319 | ns IN A 91.198.174.1 ; ns.example.com 320 | www IN CNAME example.com. ; www.example.com = example.com 321 | 322 | 323 | [IP]: https://tools.ietf.org/html/rfc791 324 | 325 | *While HTTP requires TCP which requires IP, lower layer protocols are usually 326 | interchangeable.* 327 | 328 | ### Ethernet 329 | 330 | At the link layer, communication mostly happens directly between two adjacent 331 | nodes. 332 | 333 | Among link-layer protocols, **Ethernet** (aka. LAN, IEEE 802.3) is a link-layer 334 | protocol for transiting frames through a wire between two machines. A frame 335 | includes: 336 | 337 | - preamble: 7 bytes to ensure we know this is a frame, not a lower-level header, 338 | and to synchronize clocks (it contains alternating 0s and 1s), 339 | - Start of Frame Delimiter (SFD): 1 byte to break the pattern of the preamble 340 | and mark the start of the frame metadata, 341 | - destination **Media Access Control** (MAC) address of the target machine: each 342 | machine knows the MAC address of all machines it is directly connected to. 343 | Among its 6 bytes, it contains two special bits: 344 | - the Universal vs. Local (U/L) bit is 0 if the MAC is separated in 3 bytes 345 | identifying the network card's constructor (Organisationally Unique 346 | Identifier, OUI), and 3 bytes arbitrarily but uniquely assigned by the 347 | constructor for each card (Network Interface Controller, NIC). 348 | - the Unicast vs. Multicast bit is 0 if the frame must only be processed by a 349 | single linked machine. 
350 | - source MAC address, 351 | - EtherType indicates what protocol is used in the payload (eg, 0x86DD for 352 | IPv6); if the value is < 1536, it represents the payload size in bytes, 353 | - payload: up to 1500 bytes of data from the layer above, typically IP, 354 | - Frame Check Sequence (FCS, implemented using a **Cyclic-Redundancy Check** 355 | (CRC)): 4 bytes that verify that the frame is not corrupted; if it is, it is 356 | dropped and upper layers may have to re-send it. 357 | - Interpacket gap: not really part of the frame, those 12 bytes of idle line 358 | transmission are padding to avoid having frames right next to each other. 359 | 360 | Ethernet relies on **repeaters** to transmit data over long distances, as the 361 | physical layer usually relies on cables that have a maximum length. Multiple 362 | machines are connected to the same repeater, creating a star topology. 363 | **Bridges** are smarter machines that remember source MAC addresses, and use 364 | that to avoid sending frames to machines that are not the recipient according to 365 | the frame. **Switches** are smarter, programmable machines that detect and block 366 | corrupted packets. 367 | 368 | An alternative to Ethernet is **WiFi** (aka. WLAN, IEEE 802.11), a common 369 | wireless protocol. 370 | 371 | ### 100BASE-TX 372 | 373 | **100BASE-TX** (part of IEEE 802.3u, aka. Fast Ethernet) is a physical-layer 374 | protocol. It defines using RJ45, which uses an 8P8C (8 position 8 contact) 375 | connector with TIA/EIA-568B, ie. having eight copper wires with pin 1 through 8: 376 | white-orange, orange, white-green, blue, white-blue, green, white-brown, brown. 377 | x / white-x wires form pairs 1 through 4: blue, orange, green, brown, each 378 | twisted together at different rates in the cable. Orange pins 1 (TX+) and 2 379 | (TX-) transmit bits; green pins 3 (RX+) and 6 (RX-) receive bits, which makes 380 | this full-duplex.
381 | 382 | From left to right on the female Ethernet connector: 383 | 384 | pin 1 2 3 4 5 6 7 8 385 | ┌────────────┬──────┬───────────┬────┬──────────┬─────┬───────────┬─────┐ 386 | │white-orange│orange│white-green│blue│white-blue│green│white-brown│brown│ 387 | └────────────┴──────┴───────────┴────┴──────────┴─────┴───────────┴─────┘ 388 | TX+ TX- RX+ RX- 389 | 390 | Bits are first encoded with 4B5B: each 4 bits are encoded as 5 bits according to 391 | a predetermined mapping that prevents having too many consecutive zeros, which 392 | would make locating individual bits harder, as clocks are not perfectly 393 | synchronized. 4B5B also has five extra 5-bit codes: one to indicate that no data 394 | is sent (Idle = 11111, which in NRZI means systematically alternating the 395 | current), two to indicate that we will start sending data (Start of Stream Data 396 | = SSD), two to indicate that we stop sending data (End of Stream Data = ESD). 397 | 398 | Bits: 0100 0111 (ASCII G) 399 | 4B5B: 0101001111 400 | 401 | Bit transmission relies on Non-Return-to-Zero/Inverted (NRZI): a 1 is 402 | represented by a change from 0 volts to 1 volt or back for TX+, from 0 volts to 403 | -1 volts or back for TX-. A 0 is no change in voltage. The receiver subtracts 404 | TX- from TX+: `(TX+ + noise) - (TX- - noise) = 0V or 2V` which together with the 405 | previous voltage, determines bits. On top of that, Multilevel Threshold-3 406 | (MLT-3) is used: it halves the transfer frequency by alternating positive and 407 | negative voltages. 408 | 409 | 4B5B: 0 1 0 1 0 0 1 1 1 1 410 | MLT-3 TX+: 0 0 1 1 0 0 0 -1 0 1 0 (in volts) 411 | MLT-3 TX-: 0 0 -1 -1 0 0 0 1 0 -1 0 (in volts) 412 | 413 | 414 | The wires go up to 100 metres. They are twisted with Cat5 Unshielded Twisted 415 | Pair (UTP; no electromagnetic protection, but the twisting protects information 416 | from noise sources). 
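The 4B5B and MLT-3 steps of the worked example above can be sketched in Python. The 4B5B table here holds only the two nibbles from the ASCII "G" example, not the full mapping:

```python
# Partial 4B5B table: just the two nibbles from the "G" example above.
FOUR_B_FIVE_B = {"0100": "01010", "0111": "01111"}

def encode_4b5b(bits: str) -> str:
    """Map each 4-bit group to its 5-bit code."""
    return "".join(FOUR_B_FIVE_B[bits[i:i+4]] for i in range(0, len(bits), 4))

def mlt3(bits: str, levels=(0, 1, 0, -1)) -> list:
    """MLT-3: a 1 advances to the next level in the 0,+1,0,-1 cycle; a 0 holds."""
    out, state = [], 0
    for b in bits:
        if b == "1":
            state = (state + 1) % 4
        out.append(levels[state])
    return out

encode_4b5b("01000111")  # "0101001111"
mlt3("0101001111")       # [0, 1, 1, 0, 0, 0, -1, 0, 1, 0]
```

The `mlt3` output reproduces the TX+ volt levels of the table above (after the initial idle 0).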
In 100BASE-TX, 100 means data goes at 100 Mbit/s, T means 417 | twisted pair, X means that bits are encoded with 4B5B. 418 | 419 | The resulting overhead looks like this: 420 | 421 | ┌─────┬─────┬────────────┬───────────┬────────────┬─────────────┬──────┬─────┬─────┐ 422 | │ SSD │ SFD │ MAC header │ IP header │ TCP header │ HTTP header │ data │ FCS │ ESD │ 423 | └─────┴─────┴────────────┴───────────┴────────────┴─────────────┴──────┴─────┴─────┘ 424 | │ │ │ │ │ application │ │ │ 425 | │ │ │ │ └────────────────────┤ │ │ 426 | │ │ │ │ transport │ │ │ 427 | │ │ │ └─────────────────────────────────┤ │ │ 428 | │ │ │ network │ │ │ 429 | │ │ └─────────────────────────────────────────────┘ │ │ 430 | │ │ link │ │ 431 | │ └──────────────────────────────────────────────────────────────────────┘ │ 432 | │ physical │ 433 | └──────────────────────────────────────────────────────────────────────────────────┘ 434 | 435 | ## Layouts 436 | 437 | **Distributed** systems are products that rely on having multiple computing 438 | units communicating. 439 | 440 | **Client-server** (aka. Star) architectures have a special computing unit, the 441 | server, which receives requests from any number of computing units (clients), 442 | processes the request, and sends a response to each request. 443 | Examples include HTTP and display servers. 444 | 445 | **Three-tier** architectures separate nodes into three types: 446 | - Presentation: reads user input and displays the User Interface (UI). 447 | Typically, laptops or phones. 448 | - Application: executes user queries and moves data. 449 | - Data: manages data, typically through Create-Read-Update-Delete (CRUD) 450 | Application Programming Interfaces (APIs), typically with 451 | Atomicity-Consistency-Isolation-Durability (ACID) guarantees. 452 | 453 | **N-tier** is what happens when the three-tier application layer gets sublayers. 454 | 455 | **Decentralized** architectures can sustain the loss of any node. 
456 | Examples include Distributed Hash Tables (DHT). 457 | 458 | **Peer-to-peer** (P2P) architectures can sustain the loss of any number of 459 | nodes, as long as there is still at least one node. 460 | Examples include Bittorrent, Bitcoin, Infinit file system. 461 | -------------------------------------------------------------------------------- /misc/network/physical.md: -------------------------------------------------------------------------------- 1 | # Physical Networks 2 | 3 | The physical connection between your device and the Internet looks like this: 4 | 5 | ┌────────┐ WiFi ┌────────┐ FTTH ┌─────────────────┐ Fiber ┌──────────┐ 6 | │ Device ├──────┤ Router ├──────┤ ISP OLT/Routers ├───────┤ Repeater ├──┐ 7 | └────────┘ 4G └────────┘ └─────────────────┘ └──────────┘ │ 8 | │ 9 | ┌────────┐ CAT6a ┌───────────────────┐ Fiber ┌────────┐ Submarine Fiber │ 10 | │ Server ├───────┤ Datacenter Switch ├───────┤ Router ├─────────────────┘ 11 | └────────┘ └───────────────────┘ └────────┘ 12 | 13 | ## Media 14 | 15 | EM radiation. 16 | 17 | antenna dipole. 18 | 19 | AM/FM. 20 | 21 | ### Wireless 22 | 23 | #### WiFi 24 | 25 | An alternative to Ethernet is **WiFi** (aka. WLAN, IEEE 802.11), a common 26 | wireless protocol. 27 | 28 | #### 4G 29 | 30 | ### Wired 31 | 32 | Standard Ethernet cable names are of the form 33 | `<speed><signaling>-<hardware>[<encoding>]`, eg. 1000BASE-T. 34 | 35 | 1. **Speed** is in Megabits per second, or in Gbps if it ends in G 36 | (eg. 10GBASE-SR). 37 | 2. **Signaling** is how information is sent: 38 | - BASE is **baseband** (line coding): on a clock, 39 | we send bits by switching between two values (voltage, photon burst). 40 | - BROAD is **broadband**: multiple frequency bands are used. 41 | 3. **Hardware**: 42 | - 2: Coaxial cable that can reach ~200 meters. 43 | - 5: Coaxial cable that can reach ~500 meters. 44 | - T: Twisted Pairs. 45 | - F: Fiber; E: Extended fiber. 46 | 4.
**Encoding** 47 | 48 | #### Fiber 49 | 50 | 51 | 52 | Most widely-used for long-distance, because: 53 | 54 | - Light travels at the fastest known speed, 55 | - Signals propagate for very long distances with little degradation. 56 | 57 | #### Copper 58 | 59 | But moving electrons cause magnetic fields, 60 | which can induce currents in nearby conductive wires, 61 | and alter their signal: that is called “**crosstalk**”. 62 | 63 | ##### Twisted Pairs 64 | 65 | T568A vs. T568B / Straight-through vs. Crossover 66 | 67 | ##### Coaxial 68 | 69 | #### Submarine cables 70 | 71 | > **History:** The first transatlantic cable was 7 copper wires 72 | > coated with gutta-percha for electrical isolation, 73 | > wound in a helix with tarred hemp and an iron-strands sheath for strength. 74 | > It was laid in 1858, a long shipping expedition which involved 75 | > having to grapple the cable on the sea bed when it unexpectedly broke. 76 | > Unsurprisingly, it degraded after a month. 77 | 78 | ## Connectors 79 | 80 | ### RJ45 / 8P8C 81 | 82 | ### USB 83 | 84 | ## Links 85 | 86 | - [Fiber Optics in the LAN and Data Center][FOLDC] 87 | 88 | [FOLDC]: https://www.youtube.com/watch?v=fRKT6Z9rgUw 89 | -------------------------------------------------------------------------------- /misc/network/protocol.md: -------------------------------------------------------------------------------- 1 | # Network Protocols 2 | 3 | Computers [communicate](./information.md) to reach a goal. For instance, you 4 | contact Youtube to see cat videos, Youtube responds to gain advertising revenue. 5 | 6 | A **network** can be represented with a graph where vertices are processing 7 | devices and edges are transmission links. Examples of networks include the 8 | Internet, telephones, and walkie-talkies.
9 | 10 | ## Protocols 11 | 12 | Certain documents (typically, standards and Requests For Comments (RFCs)) 13 | set the way in which information is transmitted through the network, 14 | first as bits, then as higher-level concepts. 15 | They can depend on the existence of lower-level protocols, 16 | forming a *protocol stack*. 17 | The typical layers that a protocol stack is made of are: 18 | 19 | - Physical: transmission of bits through a medium (eg: Ethernet), 20 | - Data link: transmission of frames mostly between adjacent nodes, to determine 21 | the start and end of messages (eg: MAC, PPP), 22 | - Network: transmission of packets for routing across the graph (eg: IP), 23 | - Transport: transmission of segments, so applications on both endpoints can 24 | exchange messages with a chosen reliability guarantee: 25 | are segments each sent at least once? in the same order? uncorrupted? 26 | (eg: TCP, UDP, ICMP (ping)), 27 | - Application: serialization of data structures 28 | (eg: HTTP (documents), NTP (time), SMTP (email), FTP (file)). 29 | 30 | Let's focus on a typical stack. 31 | 32 | ### HTTP 33 | 34 | **HyperText Transfer Protocol** ([HTTP][]) is an application-layer and 35 | presentation-layer protocol designed for client-server document transmission. 36 | For instance, to request the main page of an HTTP server on your computer: 37 | 38 | GET / HTTP/1.1 39 | Host: localhost:1234 40 | Accept: text/html 41 | 42 | [HTTP]: https://tools.ietf.org/html/rfc2616 43 | 44 | (Each newline is made of two bytes: 0x0D and 0x0A, aka. CR-LF; it ends with two 45 | newlines). 
The server may respond: 46 | 47 | HTTP/1.1 200 OK 48 | Content-Type: text/html 49 | Date: Sat, 31 Dec 2016 15:31:45 GMT 50 | Connection: keep-alive 51 | Transfer-Encoding: chunked 52 | 53 | 7E 54 | 55 | 56 | 57 | 58 | This is HTML 59 | 60 | 61 | 62 | 63 | 0 64 | 65 | This response includes an [HTML][] file that the HTTP client (for instance, a 66 | browser, like Firefox or Google Chrome) will read as instructions on how to lay 67 | out a page, which determines the pixels to display, the animations to show, the 68 | interactions to execute when the user moves or clicks the mouse, the sounds to 69 | play, etc. 70 | 71 | [HTML]: https://html.spec.whatwg.org/multipage/ 72 | 73 | All requests have a first line with a method (`GET`), a path (`/`), and a 74 | protocol (`HTTP/1.1`), optionally followed by headers mapping header names 75 | (`Accept`) to their value (`text/html`). Requests may also carry data. 76 | 77 | Responses have a first line with a protocol (`HTTP/1.1`), a code (`200 OK`; 78 | codes starting with 1 are informational, 2 for success, 3 for redirection, 4 for 79 | client errors, 5 for server errors). Responses usually carry data (here, the 80 | HTML file), and also have headers explaining what the data is, how it is encoded 81 | (charset, compression), what time it is, whether to use caching, how to store 82 | session information (through cookies) and so on. 83 | 84 | As mentioned, HTTP includes presentation-layer "protocols" in headers, such as 85 | **Multipurpose Internet Mail Extensions** ([MIME][]) in `Content-Type`, to 86 | specify the file `<type>/<subtype>` (eg.
`text/plain`), or whether it 87 | recursively contains subfiles with [`multipart/form-data`][form-data], with each 88 | subfile specifying their own headers: 89 | 90 | POST /upload HTTP/1.1 91 | Host: localhost:1234 92 | Content-Length: 882 93 | Content-Type: multipart/form-data; boundary=random0ACxeUx4Nxqy3roVtMxrAw 94 | 95 | --random0ACxeUx4Nxqy3roVtMxrAw 96 | Content-Disposition: form-data; name="name-of-first-part" 97 | Content-Type: text/plain 98 | 99 | This first file contains normal plain text. 100 | --random0ACxeUx4Nxqy3roVtMxrAw 101 | Content-Disposition: form-data; name="multiple-images"; filename="image.svg" 102 | Content-Type: image/svg+xml; charset=UTF-8 103 | 104 | 105 | 106 | This is an image 107 | 108 | --random0ACxeUx4Nxqy3roVtMxrAw 109 | Content-Disposition: form-data; name="multiple-images"; filename="image.png" 110 | Content-Type: image/png 111 | Content-Transfer-Encoding: base64 112 | 113 | iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAIAAACQd1PeAAAAAXNSR0IArs4c6QAAAARnQU1BAACx 114 | jwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAAMSURBVBhXY2BgYAAAAAQAAVzN/2kAAAAASUVO 115 | RK5CYII= 116 | --random0ACxeUx4Nxqy3roVtMxrAw-- 117 | 118 | (Note that our use of base64 in image.png is deprecated; in real life, it would 119 | be replaced by the binary data directly.) 120 | 121 | [MIME]: https://tools.ietf.org/html/rfc2045 122 | [form-data]: https://tools.ietf.org/html/rfc7578 123 | 124 | HTTPS is HTTP transmitted over a TLS connection: all the HTTP data, including 125 | headers, is encrypted to prevent intermediate nodes on the network from reading 126 | or modifying the content, which is necessary when transmitting identification or 127 | banking information, and to avoid being fooled into performing dangerous acts. 128 | 129 | ### TCP 130 | 131 | **Transmission Control Protocol** ([TCP][]) is a transport-layer protocol to 132 | ensure that all sent segments are received uncorrupted in the same order. 
132 | 133 | That is achieved by reordering received segments and resending lost or corrupted ones. 134 | When using IP, it cuts its segments into pieces that fit in a packet. 135 | 136 | 1. The server starts to listen to a port. 137 | 2. The client starts to connect with a SYN. 138 | 3. The server informs the client that it received it with a SYN+ACK. 139 | 4. The client sends an ACK. 140 | 5. The server and the client can now send a series of packets to each other 141 | full-duplex, and they ACK each reception if all previously received packets 142 | have been received in order. 143 | 6. The client sends a FIN. 144 | 7. The server sends a FIN+ACK (or an ACK followed by a FIN). 145 | 8. The client sends an ACK. (The connection stays open until it times out.) 146 | 147 | *(When a previous connection was opened, this handshake can be sped up 148 | through [TCP Fast Open][TFO] or socket reuse.)* 149 | 150 | A TCP header includes: 151 | 152 | - source port in 2 bytes, 153 | - destination port in 2 bytes, 154 | - sequence number in 4 bytes: 155 | - in a SYN, this is the client Initial Sequence Number (ISN), usually picked 156 | randomly, 157 | - otherwise it is (the sender's own ISN) + 1 + number of bytes previously sent, ensuring 158 | that packets can be reordered to obtain the original segment. 159 | - acknowledgement number in 4 bytes: 160 | - in a SYN-ACK, this is (client ISN) + 1, and the server sequence number is 161 | picked. 162 | - in an ACK, this is (server ISN) + number of bytes received + 1, which is the 163 | expected next sequence number to be received from the server.
164 | - data offset in 4 bits, the size of the TCP header in 32-bit words (defaults to 165 | 5), 166 | - 000 (reserved), 167 | - flags in 9 bits: NS, CWR, ECE, URG (read urgent pointer), ACK (acknowledge 168 | reception of data or SYN), PSH (push buffered data received to the 169 | application), RST (reset connection), SYN (synchronize sequence number, only 170 | used in the initial handshake), FIN (end of data, only used in the final 171 | handshake), 172 | - window size in 2 bytes, allowing flow and congestion control, 173 | - checksum in 2 bytes to check header and data corruption, 174 | - urgent pointer in 2 bytes pointing to a sequence number, 175 | - options (if the data offset is > 5, zero-padded) eg. maximum segment size, or 176 | window scale, 177 | - payload. 178 | 179 | TODO NAT 180 | 181 | [TCP]: https://tools.ietf.org/html/rfc793 182 | [TFO]: https://tools.ietf.org/html/rfc7413 183 | 184 | ### IP 185 | 186 | **Internet Protocol** ([IP][]) is a network-layer protocol that ensures that 187 | packets go to their destination despite having to transit through several 188 | devices on the way. 189 | It also cuts the packets into fragments that fit in the link-layer frame. 190 | There are two major versions of IP in use: IPv4 is the most used, and is slowly 191 | replaced by IPv6. 
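Before moving on to IP: the TCP header layout listed in the previous section can be packed with Python's `struct`. The ports, sequence number and window are made-up example values, and the checksum is left at zero (real TCP computes it over a pseudo-header):

```python
import struct

def tcp_header(src_port, dst_port, seq, ack, flags, window):
    """Pack a minimal TCP header (no options, so the data offset is 5 words)."""
    # Top 4 bits: data offset; the remaining 12 bits hold the reserved
    # bits and the flags (NS, CWR, ECE, URG, ACK, PSH, RST, SYN, FIN).
    offset_flags = (5 << 12) | flags
    checksum = 0   # left at zero here; real TCP sums a pseudo-header too
    urgent = 0
    return struct.pack("!HHIIHHHH", src_port, dst_port, seq, ack,
                       offset_flags, window, checksum, urgent)

SYN = 0x02
header = tcp_header(54321, 80, seq=1000, ack=0, flags=SYN, window=65535)
len(header)  # 20 bytes: five 32-bit words, matching a data offset of 5
```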
192 | 193 | #### IPv4 194 | 195 | IPv4 headers have the following fields: 196 | 197 | - *Version* (4 bits): 0100, 198 | - *Internet Header Length* (IHL) (4 bits), as a number of 32-bit words, 199 | - *Quality of Service* (QoS) (8 bits): ranks packet priority; 200 | - *Differentiated Services* Code Point (DSCP) (6 bits), 201 | - *Explicit Congestion Notification* (ECN) (2 bits): 202 | 10 or 01 means “ECN-capable transport” (ECT(0) / ECT(1)), 203 | 11 means “Congestion Encountered” (CE); 204 | - *Length* of the packet in bytes (2 bytes), 205 | - *identification tag* (2 bytes), 206 | to reconstruct the packet from multiple fragments, 207 | - 0 (1 bit), 208 | - *Don't Fragment* (DF) (1 bit), set if the packet must not be fragmented, 209 | - *More Fragments* (MF) (1 bit), set if the rest of the packet is in subsequent 210 | fragments, 211 | - *Fragment offset* (13 bits), identifying the position of the fragment in the 212 | packet, 213 | - Time To Live (TTL) (1 byte): the number of remaining nodes in the network 214 | graph that the packet is allowed to go through; each node decrements that 215 | number and drops the packet if it reaches 0, avoiding infinite loops, 216 | - *Protocol of the payload* (8 bits): 6 for TCP, 17 for UDP, 1 for ICMP, etc., 217 | - *Header checksum* (16 bits) to detect corruption, 218 | - *Source IP address* (32 bits), 219 | - *Destination IP address* (32 bits), 220 | - *Payload* (eg, TCP content). 221 | 222 | The packet is broken into fragments when one device on the path 223 | cannot transmit it whole across a link 224 | (eg. because the Ethernet frame size is 1500 bytes). 225 | That can happen at the source, but also along the path. 226 | Fragments can also go through different paths once split. 227 | It is up to the receiving host to reassemble the fragments. 228 | 229 | Note that packets can be lost, duplicated, received out of order, or corrupted 230 | without the IP layer noticing. It is up to TCP to prevent that from happening.
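The header checksum above is the ones' complement of the ones'-complement sum of the header's 16-bit words, computed with the checksum field zeroed. A sketch, using a classic 20-byte sample header:

```python
def ipv4_checksum(header: bytes) -> int:
    """Ones'-complement sum of 16-bit words (checksum field zeroed beforehand)."""
    total = 0
    for i in range(0, len(header), 2):
        total += int.from_bytes(header[i:i+2], "big")
    while total > 0xFFFF:                  # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# A 20-byte sample header, with its checksum field (bytes 10-11) zeroed:
header = bytes.fromhex("4500003c1c4640004006" "0000" "ac100a63ac100a0c")
ipv4_checksum(header)  # 0xB1E6
```

Running the function over a header that has its checksum filled in returns 0, which is how receivers validate it.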
231 | 232 | IP addresses segment the network into increasingly smaller subnetworks, with 233 | **routers** processing packets in and across networks. They can be obtained from 234 | the **Dynamic Host Configuration Protocol** (DHCP), auto-assigned, or manually 235 | set. 236 | 237 | IPv4 addresses fit in 4 bytes, commonly written in dot-separated decimal, eg. 238 | `172.16.254.1`. To denote a subnetwork (which has adjacent numbers), we use 239 | Classless Inter-Domain Routing (CIDR) notation: `<network address>/<bitmask>` 240 | (the bitmask is a number of leading bits that stay the same for all addresses). 241 | For instance, 242 | 192.168.2.0/24 includes addresses from 192.168.2.0 to 192.168.2.255, 243 | although you cannot use the address ending in .0 (network address), 244 | used to identify the network, nor that ending in .255 (broadcast address), 245 | used to broadcast to all devices on the network. 246 | 247 | There are special ranges of addresses: 248 | 249 | - 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 for private networks 250 | (ie, not globally routable; 251 | they are typically behind a Network Address Translator (NAT)), 252 | - 0.0.0.0/8 for "no address", used as source address when getting an IP address, 253 | - 100.64.0.0/10 "shared address space", similar to private networks, but for 254 | Carrier-Grade NAT (CGN), 255 | - 127.0.0.0/8 for loopback (sending network data within a single node), most 256 | notably 127.0.0.1 (which the localhost hostname usually resolves to), 257 | - 169.254.0.0/16 for link-local (autoconf/zeroconf) addresses, 258 | self-assigned when no other address is available, 259 | - 192.0.0.0/24 for IANA, 260 | - 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24 are reserved for testing and 261 | examples in documentation, 262 | - 192.88.99.0/24 for IPv6-to-IPv4 anycast routers for backwards compatibility, 263 | - 198.18.0.0/15 for network performance testing, 264 | - 224.0.0.0/4 for IP multicast, 265 | - 240.0.0.0/4 blocked for historical reasons, 266 | - 255.255.255.255 for broadcast.
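Python's `ipaddress` module knows most of these ranges, which makes the list above easy to check:

```python
import ipaddress

addr = ipaddress.ip_address

addr("192.168.1.7").is_private      # True: 192.168.0.0/16
addr("127.0.0.1").is_loopback       # True: 127.0.0.0/8
addr("169.254.3.4").is_link_local   # True: 169.254.0.0/16
addr("198.51.100.9").is_global      # False: documentation range
addr("224.0.0.1").is_multicast      # True: 224.0.0.0/4
addr("8.8.8.8").is_global           # True: publicly routable
```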
267 | 
268 | #### IPv6
269 | 
270 | IPv6 packets:
271 | 
272 | - *Version* (4 bits): 0110.
273 | - *Traffic Class* (8 bits): DSCP and ECN just like IPv4.
274 | - *Flow Label* (20 bits): ID for a flow (a group of packets).
275 | - *Payload Length* (16 bits): number of bytes including extension headers.
276 | - *Next Header* (8 bits): the payload protocol or first extension header type.
277 | - *Hop Limit* (8 bits): similar to IPv4 TTL.
278 | - *Source Address* (128 bits).
279 | - *Destination Address* (128 bits).
280 | 
281 | Fragmentation cannot be done by intermediate routers, unlike IPv4.
282 | Only the source can fragment (using an extension header).
283 | The source machine is meant to do Path MTU (Maximum Transmission Unit)
284 | Discovery (PMTUD) to pick the packet or fragment size that fits all links
285 | through the network.
286 | It will change the packet size for TCP, and it will fragment for transports
287 | such as UDP or ICMP which cannot cut their data into multiple packets.
288 | 
289 | The reason for this change: lower performance impact on routers,
290 | and ensuring that security devices can read the TCP header they need
291 | when analyzing the packet.
292 | 
293 | IPv6 addresses fit in 16 bytes, with pairs of bytes represented as
294 | colon-separated hexadecimal numbers, and the longest run of zero groups
295 | replaced by `::` (at most once in the address).
296 | 
297 | - unicast has a ≥ 48-bit routing prefix, a ≤ 16-bit subnet id defined by the
298 |   network administrator, and a 64-bit interface identifier obtained either by
299 |   DHCPv6, the MAC address, random, or manually.
300 | - :: for the unspecified address, equivalent to IPv4's 0.0.0.0,
301 | - ::1 for localhost,
302 | - fe80::/64 for link-local communication; cannot be routed;
303 |   all other addresses in fe80::/10 are disabled,
304 | - fc00::/7 for Unique Local Addresses (ULAs), similar to private networks:
305 |   - fc00::/8 for arbitrary allocation,
306 |   - fd00::/8 for random allocation (with a 40-bit pseudorandom number).
307 | - ff00::/8 for multicast, with 4 flag bits (reserved, rendezvous, prefix,
308 |   transient) and 4 scope bits:
309 |   - general multicast has a 112-bit group ID, including:
310 |     - ff01::1 to all interface-local nodes,
311 |     - ff02::1 to all link-local nodes,
312 |     - ff01::2 to all interface-local routers,
313 |     - ff02::2 to all link-local routers,
314 |     - ff05::2 to all site-local routers,
315 |     - ff0X::101 to all NTP servers,
316 |     - ff05::1:3 to all DHCP servers.
317 |   - ff02::1:ff00:0/104 solicited-node multicast has a link-local scope and a
318 |     24-bit unicast address suffix,
319 |   - unicast-prefix-based multicast has a 64-bit network prefix (= routing prefix
320 |     + subnet id) and a 32-bit group ID.
321 | - 2001::/29 through 2001:01f8::/29 for IANA special purposes (tunneling,
322 |   benchmarking, ORCHIDv2),
323 | - 2001:db8::/32 for examples in documentation,
324 | - 0100::/64 to discard traffic.
325 | 
326 | #### DNS
327 | 
328 | Some IP addresses can have a name mapped to them (eg, `en.wikipedia.org` →
329 | 91.198.174.192) by using the **Domain Name System** (DNS), a naming system for
330 | Internet entities. Companies that can allocate a new domain name are called
331 | **registrars**. They publish their information as zone files, and allow
332 | authenticated editing of those files by the domain name owners as part of a
333 | business arrangement.
334 | 
335 | ; Example zone file.
336 | $ORIGIN example.com.
337 | $TTL 1h
338 | ; Indicates that the owner is admin@example.com.
339 | example.com. IN SOA ns.example.com. admin.example.com.
(2017011201 1d 2h 4w 1h)
340 | example.com. IN NS ns ; Indicates that ns.example.com is our nameserver.
341 | example.com. IN MX 10 mail.example.com.
342 | example.com. IN A 91.198.174.192 ; IPv4 address
343 | AAAA 2001:470:1:18::118 ; IPv6 address
344 | ns IN A 91.198.174.1 ; ns.example.com
345 | www IN CNAME example.com. ; www.example.com = example.com
346 | 
347 | [IP]: https://tools.ietf.org/html/rfc791
348 | 
349 | *While HTTP requires TCP which requires IP, lower layer protocols are usually
350 | interchangeable.*
351 | 
352 | #### BGP
353 | 
354 | TODO Routers
355 | 
356 | TODO BGP
357 | 
358 | TODO DHCP
359 | 
360 | ### Ethernet
361 | 
362 | At the link layer, communication mostly happens directly between two adjacent
363 | nodes.
364 | 
365 | Among link-layer protocols, **Ethernet** (aka. IEEE 802.3) transmits frames
366 | through a wire between two nodes. A frame
367 | includes:
368 | 
369 | - preamble: 7 bytes to ensure we know this is a frame, not a lower-level header,
370 |   and to synchronize clocks (it contains alternating 0s and 1s),
371 | - Start of Frame Delimiter (SFD): 1 byte to break the pattern of the preamble
372 |   and mark the start of the frame metadata,
373 | - destination **Media Access Control** (MAC) address of the target device:
374 |   each device is typically assigned a MAC when manufactured, and
375 |   each device knows the MAC address of all devices it is directly connected to.
376 |   Among its 6 bytes, it contains two special bits:
377 |   - the Universal vs. Local (U/L) bit is 0 if the MAC is separated in 3 bytes
378 |     identifying the network card's manufacturer (Organizationally Unique
379 |     Identifier, OUI), and 3 bytes arbitrarily but uniquely assigned by the
380 |     manufacturer for each card (Network Interface Controller, NIC).
381 |   - the Unicast vs. Multicast bit is 0 if the frame must only be processed by a
382 |     single linked device.
383 | - source MAC address,
384 | - EtherType indicates what protocol is used in the payload (eg, 0x86DD for
385 |   IPv6); if the value is < 1536, it represents the payload size in bytes,
386 | - payload: up to 1500 bytes of data from the layer above, typically IP
387 |   (some devices support larger sizes, eg. Jumbo frames: ~9000 bytes),
388 | - Frame Check Sequence (FCS, implemented using a **Cyclic-Redundancy Check**
389 |   (CRC)): 4 bytes that verify that the frame is not corrupted; if it is, it is
390 |   dropped and upper layers may have to re-send it.
391 | - Interpacket gap: not really part of the frame, those 12 bytes of idle line
392 |   transmission are padding to avoid having frames right next to each other.
393 | 
394 | To connect Ethernet devices together, they are typically cabled to a switch.
395 | **Switches** are devices that multiple network nodes are connected to,
396 | and that forward Ethernet frames
397 | to the destination MAC address listed in the frame.
398 | They remember source MAC addresses in content-addressable memory (CAM)
399 | as a MAC table:
400 | this is how they can dynamically learn the MACs of their connected devices,
401 | when MAC addresses are not statically hardcoded in the table.
402 | When the destination MAC is unknown, they flood all devices:
403 | devices ignore frames for which they are not the recipient,
404 | and the device whose MAC is the destination MAC answers,
405 | filling the switch’s table.
406 | 
407 | They can also connect nodes wired with different cable technologies
408 | (eg. fiber and twisted pairs).
409 | 
410 | **Bridges** connect LANs together. They behave like a switch,
411 | but the MAC table is a Forwarding Information Base (FIB):
412 | each interface, being a LAN, corresponds to multiple MACs.
413 | When receiving a frame, the bridge decides whether to forward it
414 | and to which interface based on the known MACs in the FIB.
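The learning behaviour described above can be sketched in a few lines (the API is ours, not a real switch's):

```javascript
// Sketch of a learning switch: maps MAC addresses to ports,
// floods when the destination is unknown.
function Switch(portCount) {
  this.portCount = portCount;
  this.macTable = {};   // MAC address → port number
}

// Returns the list of ports the frame goes out of.
Switch.prototype.receive = function(inPort, frame) {
  // Learn which port the source lives behind.
  this.macTable[frame.source] = inPort;
  var outPort = this.macTable[frame.destination];
  if (outPort !== undefined) { return [outPort]; }
  // Unknown destination: flood every port except the one it came from.
  var ports = [];
  for (var p = 0; p < this.portCount; p++) {
    if (p !== inPort) { ports.push(p); }
  }
  return ports;
};

var sw = new Switch(4);
console.log(sw.receive(0, { source: "aa", destination: "bb" }));  // flood: [1, 2, 3]
console.log(sw.receive(1, { source: "bb", destination: "aa" }));  // learned: [0]
```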
415 | 
416 | The use-cases of switches and bridges overlap with those of routers,
417 | which is the consequence of a historical lack of synchronization of efforts
418 | between IEEE (Ethernet) and IETF (Internet).
419 | 
420 | TODO ARP
421 | 
422 | ### 100BASE-TX
423 | 
424 | **100BASE-TX** (part of IEEE 802.3u, aka. Fast Ethernet) is a physical-layer
425 | protocol. It defines using RJ45, which uses an 8P8C (8 position 8 contact)
426 | connector with TIA/EIA-568B, ie. having eight copper wires with pin 1 through 8:
427 | white-orange, orange, white-green, blue, white-blue, green, white-brown, brown.
428 | x / white-x wires form pairs 1 through 4: blue, orange, green, brown, each
429 | twisted together at different rates in the cable to reduce interference.
430 | Orange pins 1 (TX+) and 2 (TX-) transmit bits;
431 | green pins 3 (RX+) and 6 (RX-) receive bits, which makes this full-duplex.
432 | 
433 | From left to right on the female Ethernet connector:
434 | 
435 | pin 1            2      3           4    5          6     7           8
436 | ┌────────────┬──────┬───────────┬────┬──────────┬─────┬───────────┬─────┐
437 | │white-orange│orange│white-green│blue│white-blue│green│white-brown│brown│
438 | └────────────┴──────┴───────────┴────┴──────────┴─────┴───────────┴─────┘
439 |      TX+       TX-      RX+                       RX-
440 | 
441 | Bits are first encoded with **4B5B**:
442 | each group of 4 bits is encoded as 5 bits according to a predetermined mapping
443 | that prevents having too many consecutive zeros,
444 | which would make locating individual bits harder,
445 | as clocks are not perfectly synchronized.
446 | 4B5B also has five extra 5-bit control codes:
447 | one to indicate that no data is sent (Idle = 11111,
448 | which in NRZI means systematically alternating the current),
449 | two to indicate that we will start sending data (Start of Stream Delimiter = SSD),
450 | two to indicate that we stop sending data (End of Stream Delimiter = ESD).
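For illustration, a partial encoder over the Fast Ethernet 4B5B nibble table (4 data bits → 5 code bits; only the nibbles needed below are listed, and the function name is ours):

```javascript
// Partial 4B5B table: data nibble (4 bits) → 5-bit code.
// The full table has 16 data entries plus control codes.
var FOUR_B_FIVE_B = {
  "0100": "01010",  // 0x4
  "0101": "01011",  // 0x5
  "0111": "01111",  // 0x7
};

// Encode one byte: high nibble first, then low nibble.
function encode4b5b(byte) {
  var bits = byte.toString(2);
  while (bits.length < 8) { bits = "0" + bits; }
  return FOUR_B_FIVE_B[bits.slice(0, 4)] + FOUR_B_FIVE_B[bits.slice(4, 8)];
}

console.log(encode4b5b(0x47));  // "G" → "0101001111"
```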
451 | 
452 | Bits: 01000111 01000101 01010100 (ASCII GET)
453 | 4B5B: 0101001111 (first byte)
454 | 
455 | Bit transmission relies on Non-Return-to-Zero/Inverted (**NRZI**):
456 | a 1 is represented by a change from 0 volts to 1 volt or back for TX+,
457 | from 0 volts to -1 volts or back for TX-. A 0 is no change in voltage.
458 | The receiver subtracts TX- from TX+: `(TX+ + noise) - (TX- + noise) = 0V or 2V`
459 | which, together with the previous voltage, determines bits.
460 | On top of that, Multilevel Threshold-3 (**MLT-3**) is used:
461 | it halves the transfer frequency by alternating positive and negative voltages.
462 | 
463 | 4B5B:      0 1 0 1 0 0 1  1 1 1
464 | MLT-3 TX+: 0 0 1 1 0 0 0 -1 0 1 0 (in volts)
465 | MLT-3 TX-: 0 0 -1 -1 0 0 0 1 0 -1 0 (in volts)
466 | 
467 | 
468 | The wires go up to 100 metres. They are twisted with **Cat5e**
469 | Unshielded Twisted Pair (**UTP** = there is no metallic foil around the pair.
470 | The twisting protects information from noise sources a bit,
471 | but higher signal frequencies (Cat7, 8, …)
472 | require **STP**: Shielded Twisted Pairs).
473 | In 100BASE-TX, 100 means data goes at 100 Mbit/s, T means
474 | twisted pair, X means that bits are encoded with 4B5B.
475 | 
476 | Since wires have a limited length,
477 | **repeaters** are put in place to transmit data over longer distances.
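The MLT-3 rows above can be reproduced mechanically: keep a position in the cycle 0, +1, 0, -1 and advance it only on a 1 bit (a sketch; names are ours):

```javascript
// MLT-3: advance through the voltage cycle 0, +1, 0, -1 on each 1 bit;
// stay put on each 0 bit. Returns the initial state plus one state per bit.
function mlt3(bits) {
  var cycle = [0, 1, 0, -1];
  var index = 0;
  var states = [cycle[index]];
  for (var i = 0; i < bits.length; i++) {
    if (bits[i] === 1) { index = (index + 1) % 4; }
    states.push(cycle[index]);
  }
  return states;
}

// The coded bits from the example above, on the TX+ wire.
console.log(mlt3([0, 1, 0, 1, 0, 0, 1, 1, 1, 1]));
// → [0, 0, 1, 1, 0, 0, 0, -1, 0, 1, 0]
```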
478 | 479 | The resulting overhead looks like this: 480 | 481 | ┌─────┬─────┬────────────┬───────────┬────────────┬─────────────┬──────┬─────┬─────┐ 482 | │ SSD │ SFD │ MAC header │ IP header │ TCP header │ HTTP header │ data │ FCS │ ESD │ 483 | └─────┴─────┴────────────┴───────────┴────────────┴─────────────┴──────┴─────┴─────┘ 484 | │ │ │ │ │ application │ │ │ 485 | │ │ │ │ └────────────────────┤ │ │ 486 | │ │ │ │ transport │ │ │ 487 | │ │ │ └─────────────────────────────────┤ │ │ 488 | │ │ │ network │ │ │ 489 | │ │ └─────────────────────────────────────────────┘ │ │ 490 | │ │ link │ │ 491 | │ └──────────────────────────────────────────────────────────────────────┘ │ 492 | │ physical │ 493 | └──────────────────────────────────────────────────────────────────────────────────┘ 494 | 495 | ## Links 496 | 497 | - [The world in which IPv6 was a good design][apenwarr17] 498 | 499 | [apenwarr17]: https://apenwarr.ca/log/20170810 500 | -------------------------------------------------------------------------------- /misc/reliability.md: -------------------------------------------------------------------------------- 1 | # Reliability 2 | 3 | TODO 4 | 5 | ## Going further 6 | 7 | - [Google SRE book](https://landing.google.com/sre/book/). 
8 | 
--------------------------------------------------------------------------------
/misc/statistics.md:
--------------------------------------------------------------------------------
1 | # Statistics
2 | 
3 | ## Probabilities
4 | 
5 | Kolmogorov axioms:
6 | 
7 | - Ω: set of elementary events (exclusive outcomes)
8 | - F: set of events; σ-algebra of Ω (subsets closed under complement and countable union)
9 | - prob: function from F to [0,1]
10 | - prob(Ω) = 1
11 | - prob(∪Ai) = Σ prob(Ai) if Ai disjoint (= exclusive) *(partition of a disk)*
12 | 
13 | - prob(∅) = 0
14 | - prob(Ω-A) = 1 - prob(A)
15 | - prob(A) = |A| ÷ |Ω| if Ω countable and ∀e∈Ω, prob({e}) = 1÷|Ω|
16 | - prob(A∪B) = prob(A) + prob(B) - prob(A∩B) *(think overlapping disks)*
17 | - prob(∪Ai) ≤ Σ prob(Ai)
18 | - prob(∪Ai) = Σ{r=1…n} (-1)^(r+1) Σ{i1<…<ir} prob(Ai1∩…∩Air) *(inclusion-exclusion)*
19 | - Ai disjoint ⇒ prob(∩Ai) = 0 *(non-overlapping disks)*
20 | - A1⊂A2⊂… ⇒ prob(An) → prob(∪Ai), prob(Ai) ≤ prob(Ai+1)
21 | - A1⊃A2⊃… ⇒ prob(An) → prob(∩Ai), prob(Ai) ≥ prob(Ai+1)
22 | - prob(A|B) ≝ prob(A∩B) ÷ prob(B) *(think of |B as assuming Ω = B)*
23 | - A, B independent ⇔ prob(A|B) = prob(A)
24 | - prob(∩Ai) = Π prob(Ai) if Ai independent
25 | - prob(∩Ai) = Π{i=2…} prob(Ai|A1∩…∩A{i-1}) prob(A1)
26 | - A ⊂ ∪Bi, Bi disjoint ⇒ prob(A) = Σ prob(A∩Bi)
27 | - prob(A|B) = prob(B|A) prob(A) ÷ prob(B) *(Bayes' theorem)*
28 | - Ai partition of Ω ⇒ prob(Ai|B) = prob(B|Ai) prob(Ai) ÷ (Σj prob(B|Aj) prob(Aj))
29 | 
30 | ## Distributions
31 | 
32 | A **random variable** (RV) is a function from Ω to ℝ. *(eg. winnings)*
33 | 
34 | The **indicator function** is a RV such that I(A) = 1 if A, otherwise 0.
35 | 
36 | A **mass function** for a RV X is fX: x → prob(X=x).
37 | 
38 | - fX(x) ≥ 0
39 | - Σx fX(x) = 1 (or ∫ℝ fX(x) dx = 1 for a density)
40 | 
41 | The **expected value** E[X] = Σ x fX(x).
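A quick numeric check of that definition (our own sketch, using a fair die):

```javascript
// Expected value and variance of a discrete mass function,
// given as an array of [value, probability] pairs.
function expectedValue(mass) {
  return mass.reduce(function(sum, p) { return sum + p[0] * p[1]; }, 0);
}

function variance(mass) {
  var mean = expectedValue(mass);
  return mass.reduce(function(sum, p) {
    return sum + (p[0] - mean) * (p[0] - mean) * p[1];
  }, 0);
}

// A fair six-sided die.
var die = [1, 2, 3, 4, 5, 6].map(function(x) { return [x, 1 / 6]; });
console.log(expectedValue(die));  // ≈ 3.5
console.log(variance(die));       // ≈ 35/12 ≈ 2.9167
```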
42 | 
43 | - E[g(X)] = Σ{x∈X} g(x) fX(x)
44 | - E[aX+b] = a E[X] + b
45 | - (E[X])^2 ≤ E[X^2] *(Cauchy-Schwarz inequality)*
46 | - prob(X=a) = 1 ⇒ E[X] = a
47 | - prob(a < X ≤ b) = 1 ⇒ a < E[X] ≤ b
48 | - if X∈ℕ, r≥2, E[X] < ∞:
49 |   - E[X] = Σ prob(X≥x)
50 |   - E[X(X-1)…(X-r+1)] = r Σ{x=r…∞} (x-1)…(x-r+1) prob(X≥x)
51 | 
52 | The **variance** var(X) = E[(X-E[X])^2].
53 | 
54 | - var(aX+b) = a^2 var(X)
55 | - var(X) = 0 ⇒ X constant
56 | 
57 | The **covariance** cov(X,Y) = E[XY] - E[X] E[Y].
58 | 
59 | - cov(X,Y) = E[(X-E[X])(Y-E[Y])]
60 | - cov(X,Y) = cov(Y,X)
61 | - cov(constant,X) = 0
62 | - cov(a+bX+cY,Z) = b cov(X,Z) + c cov(Y,Z)
63 | - cov(X,Y)^2 ≤ var(X) var(Y)
64 | 
65 | A few families of distribution.
66 | 
67 | - Bernoulli: RV from Ω to {0,1}. Take p = prob(X=1).
68 |   *Number of 1s from a 1/p dice throw (a dice with 1/p faces).*
69 |   - E[X] = p
70 |   - var(X) = p (1-p)
71 | - Binomial "X ~ B(n,p)": RV such that fX(x) = (n choose x) p^x (1-p)^(n-x).
72 |   *Number of 1s from n throws of a 1/p dice.*
73 |   - E[X] = n p
74 |   - var(X) = n p (1-p)
75 | - Geometric "X ~ Geom(p)": RV such that fX(x) = p (1-p)^(x-1)
76 |   *Number of throws before a 1/p dice yields a 1.*
77 |   - E[X] = 1/p
78 |   - var(X) = (1-p)/p^2
79 |   - prob(X > n+m | X > m) = prob(X > n) *(memory loss)*
80 | - Negative binomial "X ~ NegBin(n,p)": RV such that fX(x) = (x-1 choose n-1) p^n (1-p)^(x-n).
81 |   *Number of throws before a 1/p dice yields n 1s.*
82 |   - E[X] = n / p
83 |   - var(X) = n (1-p) / p^2
84 | - Hypergeometric: RV such that fX(r) = (R choose r) (N-R choose n-r) / (N choose n)
85 |   *Number of red socks got from n blind picks without replacement from a drawer
86 |   with N socks of which R are red.*
87 |   - E[X] = R n/N
88 |   - var(X) = R n/N (N-R)/N (N-n)/(N-1)
89 | - Poisson: RV such that fX(x) = λ^x/x! exp(-λ), with x∈{0,1,2,…}.
90 | *Number of ticks per second when averaging λ ticks per second, if ticks are independent.* 91 | - E[X] = λ 92 | - var(X) = λ 93 | - Uniform: RV such that fX(x) = constant with x∈[a,b]. 94 | *Result of dice throw.* 95 | - E[X] = (a+b)/2 96 | - var(X) = (b-a)^2/12, or ((b-a+1)^2-1)/12 if discrete 97 | - Normal (Gaussian) "X ~ N(μ,σ^2)": RV such that fX(x) = exp(-(x-μ)^2/(2σ^2)) / (σ sqrt(2π)) 98 | *Infinite random walk starting at μ with step variance σ^2.* 99 | - E[X] = μ 100 | - var(X) = σ^2 101 | -------------------------------------------------------------------------------- /misc/synchronization.md: -------------------------------------------------------------------------------- 1 | # Synchronization 2 | 3 | [Concurrency](./concurrency.md) is about imparting and gathering work from 4 | multiple actors (typically on a single machine). 5 | 6 | This is about ensuring the correct operation of actors working together. 7 | In particular, to work with each other, actors must maintain a common 8 | understanding of the world. 9 | 10 | See [this](https://github.com/aphyr/distsys-class) for now; I'll try to make 11 | something denser. 12 | 13 | **Work in progress**. 
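As a first concrete piece, the Lamport clock mentioned in the Clock section below can be sketched as (the API is ours):

```javascript
// Lamport clock: a counter that ticks on local events, is attached to
// outgoing messages, and jumps to max(local, received) + 1 on receipt.
function LamportClock() { this.time = 0; }
LamportClock.prototype.tick = function() { return ++this.time; };
LamportClock.prototype.send = function() { return this.tick(); };
LamportClock.prototype.receive = function(timestamp) {
  this.time = Math.max(this.time, timestamp) + 1;
  return this.time;
};

var a = new LamportClock(), b = new LamportClock();
a.tick();           // a is now at 1
var ts = a.send();  // a is now at 2; the message carries timestamp 2
b.receive(ts);      // b jumps to max(0, 2) + 1 = 3
```

This gives a total order on events that respects causality: an event that causally precedes another always has a smaller timestamp.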
14 | 
15 | ## Network
16 | 
17 | Graph of actors communicating
18 | 
19 | CSP, Actor model
20 | 
21 | ## Actors
22 | 
23 | What can go wrong: crash, recovery, corruption, byzantine, heterogeneous
24 | 
25 | ## Communication
26 | 
27 | What can go wrong: slow (latency / bandwidth), lost, corrupted, sent multiple
28 | times
29 | 
30 | TCP, UDP ([network](./network.md))
31 | 
32 | ## Clock
33 | 
34 | [time](./time.md), POSIX time, GPS, atomic clock, NTP
35 | 
36 | Lamport clock
37 | 
38 | vector clock
39 | 
40 | ## Consistency, Availability, Partition
41 | 
42 | CAP theorem:
43 | Network partitions will happen; you must choose between consistency (read your
44 | writes) and availability (answer without waiting)
45 | 
46 | levels of consistency
47 | 
48 | CRUD [data](./data.md)
49 | 
50 | ACID
51 | 
52 | ## Data transmission
53 | 
54 | RPC calls.
55 | SOAP (XML) / REST (JSON): trees of data.
56 | Protocol buffers. Cap'n Proto.
57 | 
58 | GraphQL and the problem of transmitting an object graph
59 | 
60 | old-school CP (relational) databases (eg. MySQL, PostgreSQL, etc.):
61 | single-server writes, distributed reads through WAL streaming replication (hot standby server, vs. warm standby server), failover
62 | 
63 | eventual consistency
64 | 
65 | total operation order
66 | 
67 | operational transformation
68 | 
69 | CRDT
70 | 
71 | Consensus: Paxos
72 | 
73 | DHT
74 | 
75 | Merkle tree (git, bitcoin)
76 | 
77 | Proof of work (byzantine failure, bitcoin)
78 | 
79 | ## Architectural building blocks
80 | 
81 | [data](./data.md)
82 | 
83 | Key-Value store
84 | (LSM tree on each node, DHT for replication / distribution)
85 | Store small (100K) blobs. High read and write volumes.
86 | 
87 | block, object store, distributed file system (eg. Ceph, S3, GlusterFS, GTFS)
88 | Store large (M, G, etc.) blobs. Low read volume.
89 | 
90 | Cache (redis, memcached), typically in-memory.
91 | Increases read speeds.
92 | 
93 | SQL database: relational data.
94 | Typically low number of machines (one writer machine, many readers).
95 | 
96 | Big data: when a single machine can't handle the write volume or data size.
97 | Typically requires switching to an AP system, sometimes NoSQL (column (Cassandra), key-value (Riak, Dynamo), graph (Neo4J), document (MongoDB))
98 | Note that it is an extreme step; often, simply performing indexing on the right SQL
99 | column is enough.
100 | Also, new-generation SQL systems like Spanner and CockroachDB support CP with
101 | larger numbers of writer machines.
102 | 
103 | Message queue (AMQP eg. RabbitMQ, Kafka)
104 | AMQP: protocol on top of TCP to distribute messages:
105 | - Direct exchange: send message to all queues listening to that key, and they'll
106 |   deliver it to one consumer
107 | - Fanout exchange: send to all queues bound to it
108 | - Topic exchange: send to all queues set to receive a given key
109 | PubSub
110 | 
111 | Log/Search (ElasticSearch): pull data from all machines and index it.
112 | No rewrites, large amount of data, high read volume.
113 | 
114 | Log/Aggregate: log on each machine, merge data upon reading
115 | Very high write volume, very low read volume.
116 | 
117 | Immutable core, mutable shell (eg. Fossil, the Plan 9 file system)
118 | 
119 | ## Advice
120 | 
121 | Allow failure (chaos monkey), backup, redundancy, failover, monitoring, logging
122 | 
123 | Protocols: version, upgrade
124 | 
125 | SLA
126 | 
127 | ## Going further
128 | 
129 | - [Distributed systems for fun and profit](http://book.mixu.net/distsys/index.html)
130 | 
--------------------------------------------------------------------------------
/misc/time.md:
--------------------------------------------------------------------------------
1 | # Time
2 | 
3 | Any recurring event can let you track time. For instance, if your cat leaves the house regularly, just count each event.
However, time tracking is best served with those three Rules of Timekeeping™:
4 | 
5 | - It must be **easy to track**,
6 | - have a **constant frequency** so that we can assign fixed durations to certain actions, say baking a cake,
7 | - be **small**, so that we can track extremely short events: counting multiples is easier than divisions.
8 | 
9 | ## Day
10 | 
11 | The easiest event to track is the presence of the sun in the sky. The **day** was obviously the first time unit in use. For shorter events, it was subdivided into 24 hours, each with 60 minutes that have 60 seconds. (That strange system is probably the oldest legacy code in existence.)
12 | 
13 | Initially, hours were scheduled so that there were 12 hours between the start and end of the night, but this design broke rule 2 everywhere but at the equator, and so it evolved into uniform subdivisions of the whole day.
14 | 
15 | Nowadays, **UT1** best tracks that definition of days. Through complex measurements involving the tracking of galaxies in the sky, the angle between the centre of the sun and longitude 0 (passing through Greenwich, UK) is mapped to a precise time. However, various events on Earth like tsunamis accelerate or slow down the Earth's rotation, making this measure break rule 2 as well. Worse yet, because of the Moon's attraction, the Earth's rotation is perpetually going to slow down, increasing the length of days increasingly fast.
16 | 
17 | ## Year
18 | 
19 | With agriculture, it became necessary to track a different recurring cosmic event: the number of turns of the Earth around the Sun. We know it as a **year**. Of course, there is not an integer number of days in a year; there are around 365.24219 Earth rotations in an Earth revolution.
20 | 
21 | Yet again, this measure breaks rule 2, even though it is not as bad. A large number of gravitational effects from the other planets modify the length of a sidereal year in ways that seem random to the untrained eye.
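That fractional day is what leap years paper over: the Gregorian rule (every 4 years, except centuries, except every 400 years) averages out to 365.2425 days, close to 365.24219. A sketch:

```javascript
// Gregorian leap year rule: divisible by 4, except centuries,
// except centuries divisible by 400.
function isLeapYear(year) {
  return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
}

// The calendar repeats every 400 years, so averaging over one
// cycle gives the mean civil year length.
var leapDays = 0;
for (var y = 0; y < 400; y++) {
  if (isLeapYear(y)) { leapDays++; }
}
console.log(leapDays);              // 97 leap days per 400 years
console.log(365 + leapDays / 400);  // 365.2425
```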
22 | 
23 | ## Second
24 | 
25 | Initially, the **second** was backed by the day, as we mentioned. Then, it was backed by the moon. Then the metre (for use in mechanical clocks). Then the year. Finally, technology allowed us to realize that microwaving cesium makes its electrons oscillate at a very nearly constant frequency. It is the best timekeeper we have, and fortunately we can still use it if we change solar system.
26 | 
27 | So we switched to defining the second as a multiple of those oscillations. When the switch was made, the second was exactly equal to a second as defined previously (a portion of a year). Now, the most precise way to measure time is to count the number of seconds, which is convenient, since the second is the SI unit of time.
28 | 
29 | As a result, **TAI** (for International Atomic Time) was designed to count all time units in terms of atomic clocks at sea level. A TAI day is 24×60×60 seconds, a year is 365 days except on leap years, where it is 366 days. Leap days are used to correct the 0.24219 extra days in a year (imperfectly; they correct 0.2425 on average). They add a day at the end of February. They occur on years divisible by 4 and not divisible by 100, except if divisible by 400.
30 | 
31 | ![TAI](http://www.bipm.org/utils/common/img/tai/timelinks-2013.jpg)
32 | *Location of laboratories contributing to computing TAI.*
33 | 
34 | But of course, as days lengthen because of the Moon's gravitational pull, days no longer last 24×60×60 seconds — they are a minuscule bit longer than that. Today, TAI is roughly 37 seconds ahead of UT1.
35 | 
36 | To allow computing civil time accurately forever while benefiting from TAI's conformance to the Rules of Timekeeping™, a mix was made, called **UTC**. UTC is just like TAI, except that it is a fixed number of seconds behind it.
Every once in a while, the [IERS](https://www.iers.org/IERS/EN/Home/home_node.html) (International Earth Rotation and Reference Systems Service) proclaims that there will be a leap second added at the end of a certain day, causing the time to go from 23:59:59 to 23:59:60 (← leap second!) and then 00:00:00. The IERS does this every time it sees that there is a risk for UTC to deviate from UT1 by more than one second. 37 | 38 | That said, realistically, if in 30 000 years midnight has slowly become the rough time when the sun rises, the Earth population will have had time to change the meaning of the word. As a result, there is a possibility (that computer scientists warmly welcome) that UTC stop adding leap seconds after a certain year. 39 | 40 | There is another widely used second-based code representing instants in time: **[Unix time][]**. It is the number of seconds since the start of year 1970, *not including leap seconds*. Unix time allows sub-second precision (using a real number instead of integers). Since leap seconds are discarded, they need to be subtracted. As a result, when a leap second occurs, Unix time is determined by the [POSIX][] standard to increase linearly by one second for that second, and then jump back by one second, and again increase linearly by one second, making that second happen twice. 41 | 42 | That design ensures that we can easily compute the number of UTC days between two Unix time stamps. However, it does not give the correct number of SI seconds. To quote [them][Unix time rationale]: 43 | 44 | > [M]ost systems are probably not synchronized to any standard time reference. Therefore, it is inappropriate to require that a time represented as seconds since the Epoch precisely represent the number of seconds between the referenced time and the Epoch. 
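To illustrate, the sketch below adds back the leap seconds that occurred between two Unix timestamps; it hardcodes a single table entry (the leap second added at the end of 2016), where a real implementation needs the full table:

```javascript
// Elapsed SI seconds between two Unix timestamps: the naive difference,
// plus the leap seconds inserted in between.
// Only one table entry here; the real table has dozens of entries.
var leapSeconds = [Date.UTC(2017, 0, 1) / 1000];  // end of 2016-12-31

function elapsedSeconds(unixFrom, unixTo) {
  var leaps = leapSeconds.filter(function(t) {
    return unixFrom < t && t <= unixTo;
  }).length;
  return (unixTo - unixFrom) + leaps;
}

var from = Date.UTC(2016, 11, 31, 23, 59, 59) / 1000;
var to = Date.UTC(2017, 0, 1, 0, 0, 0) / 1000;
console.log(to - from);                 // 1: what Unix time says
console.log(elapsedSeconds(from, to));  // 2: what a wall clock measured
```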
45 | 46 | [Unix time]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_15 47 | [POSIX]: http://standards.ieee.org/develop/wg/POSIX.html 48 | [Unix time rationale]: http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap04.html#tag_21_04_15 49 | 50 | As a result, in order to compute the number of seconds between two times, if they are TAI times: convert them to seconds and subtract them. If they are UTC or Unix time: do the same, then account for all leap seconds that occurred between them. 51 | 52 | ## Time Zones 53 | 54 | Here is a more difficult question to answer: “What time is it here?” Civil time is one of those things that governments impose by law. Therefore, answering that question involves language, politics and alarms. 55 | 56 | To give the same meaning of the words noon and midnight across the world, each country proclaims that it uses a certain deviation from UTC roughly proportional to their longitude. Each deviation from UTC is called a time zone, and represented as a positive or negative number of hours, minutes, and (real-numbered) seconds. 57 | 58 | ![Standard time](https://upload.wikimedia.org/wikipedia/commons/4/4b/Solar_time_vs_standard_time.png) 59 | *Time zones with offset from solar time.* 60 | 61 | As extra fun, many countries change time zones twice a year, but not at the same time. They reckon that people are disturbed by the sun in summer before their clock wakes them up. Having a curtain seemed too hard, so those countries decided to change time itself. This is called **Daylight Savings Time**. 62 | 63 | To be perfectly accurate, one needs to keep track of all wars and laws in the world to determine the civil time at a geolocated point on the planet. Fortunately, two individuals took it upon themselves to do exactly that. They built what is now known as the [IANA time zone database][tzdata] (aka. tzdata). 
It determines both what offset each time zone code it defines has to UTC through time, and what time zone code is used for a set of famous cities spread throughout the planet. Most operating systems have a copy, which is often used to ask the user what their nearest large city is when installing it.
64 | 
65 | [tzdata]: http://www.iana.org/time-zones
66 | 
67 | So, using a geolocation coordinate, this database, and time zone borders and border changes, it is feasible to compute the civil time at all points on Earth from 1986 onwards. (You need way more data for dates prior to 1986, the date when Nepal became the last country to switch to a UTC time zone.)
68 | 
69 | ## Stamps
70 | 
71 | The most common representation of times and dates in computing is [ISO 8601][]. It defines a textual representation for UTC (with or without time zones), calendar dates, weeks, yearless calendar dates, and durations. Decimal fractions of a second allow sub-second precision. For instance, `2006-08-14T02:34:56.238-06:00`.
72 | 
73 | [ISO 8601]: https://en.wikipedia.org/wiki/ISO_8601
74 | 
75 | A variant of this is [RFC 3339][], which is more lax. For instance, `2006-08-14 02:34:56.238-0600`.
76 | 
77 | We can also find [RFC 2822][], which is used for email. For instance: `Mon, 14 Aug 2006 02:34:56 -0600`.
78 | 
79 | [RFC 3339]: https://www.ietf.org/rfc/rfc3339.txt
80 | [RFC 2822]: https://www.ietf.org/rfc/rfc2822.txt
81 | 
82 | The second most common representation in computing is [Unix time][], in seconds, either in 32-bit integers, 32-bit unsigned integers, 64-bit unsigned integers, or in textual base-10 form (with decimal digits for more precision).
83 | 
84 | Unfortunately, there are no widespread TAI-based time stamps.
--------------------------------------------------------------------------------
/tree/Readme.md:
--------------------------------------------------------------------------------
1 | Trees are connected graphs without cycles.
2 | 
3 | Most trees we use are **rooted**: one vertex is the entry point (the root);
4 | it is the only vertex with no edge pointing to it, all others have exactly one.
5 | 
6 | Many trees are **ordered**: vertices order their children like a list.
7 | 
8 | The most common implementation of trees is as either null pointers or pointers
9 | to a structure with a value and a list of children trees.
10 | 
11 | # Binary tree
12 | 
13 | Ordered trees with up to two children. Since they are ordered, we call them
14 | "left" and "right".
15 | 
16 | ## Tree traversal
17 | 
18 | We want to read each vertex exactly once.
19 | 
20 | ### Depth-first
21 | 
22 | - **Pre-order**: read a vertex, then pre-order left, then pre-order right.
23 | - **In-order**: in-order left, then read the vertex, then in-order right.
24 | - **Post-order**: post-order left, then post-order right, then read the vertex.
25 | 
26 | In-order is probably called this because it is the correct order for binary
27 | search trees.
28 | 
29 | ### Breadth-first
30 | 
31 | 1. Start with a list of `children` containing just the root.
32 | 2. Read a vertex from the start of the list and remove it.
33 | 3. Put its children at the end of `children`.
34 | 4. Go to 2 unless `children` is empty.
35 | 
36 | It requires O(n) worst-case space (the list can hold about n/2 vertices).
37 | 
38 | ## Binary search tree
39 | 
40 | Great for maps that you can traverse in the key order, priority queues where you
41 | care about the least prioritized element, and search when a hash table won't do.
42 | 
43 | - Each value stored in a vertex has total ordering.
44 | - Left is smaller
45 | - Right is bigger
46 | 
47 | Search, insertion and deletion are O(log n) average, O(n) worst-case (it can
48 | reduce to a list).
49 | 
50 | If the tree is balanced (= minimal height), we get O(log n) worst-case.
51 | 
52 | ### AVL tree
53 | 
54 | The first self-balanced binary search tree, and the one with the least costly
55 | search. Use this if you only insert during initialization.
56 | 
57 | Height ≤ `logφ(√5·(n+2))-2`.
58 | 
59 | ### Red-Black tree
60 | 
61 | Insertions are less costly than in an AVL tree.
62 | 
63 | Height ≤ `2·log2(n+1)`.
64 | 
65 | ### Splay tree
66 | 
67 | Rarer, it ensures that recently requested searches are faster to search for.
68 | 
69 | ## Binary heap
70 | 
71 | Great for priority queues. Efficient to implement as an array.
72 | 
73 | - Complete binary tree: all levels of the tree must be filled but the bottom.
74 | - Each value stored in a vertex is bigger than or equal to its children.
75 | 
76 | A nice thing to know: it can sort an array in-place, just by inserting all the
77 | elements into the heap, and extracting the maximum n times.
78 | 
79 | # B-tree
80 | 
81 | Achieves O(log(n)) worst-case. Great for IO with large blocks of data:
82 | databases, filesystems.
83 | 
84 | In the simplest variant (a 2-3 tree), each vertex can have 2 or 3 children,
85 | and 1 or 2 keys (pieces of ordered data).
86 | 
87 | - The left branch has keys smaller than the left key,
88 | - The right branch has keys higher than the right key,
89 | - The middle branch has keys in between.
--------------------------------------------------------------------------------
/tree/binary-search.js:
--------------------------------------------------------------------------------
1 | // A binary search tree is a binary tree (a rooted tree with vertices with up to
2 | // 2 children) where every vertex on the left branch holds keys lower than that
3 | // of the current vertex, and every vertex on the right branch holds keys
4 | // higher.
5 | 
6 | function BinarySearchTree() {}
7 | 
8 | BinarySearchTree.prototype = {
9 |   key: null,
10 |   // Not having a value can make this work like a set
11 |   // with a findMin over the keys.
12 |   value: null,
13 |   left: null,
14 |   right: null,
15 | 
16 |   // Ensure that search(key) returns the value for that key.
17 |   // O(log n) with random input, O(n) worst-case.
18 | insert: function(key, value) { 19 | if (this.key == null) { 20 | // This is below a leaf or the tree is empty. 21 | this.key = key; 22 | this.value = value; 23 | } else if (key < this.key) { 24 | this.leftInsert(key, value); 25 | } else if (key > this.key) { 26 | this.rightInsert(key, value); 27 | } else { 28 | this.value = value; // this.key == key: the key was already there; overwrite its value. 29 | } 30 | }, 31 | 32 | leftInsert: function(key, value) { 33 | if (this.left == null) { 34 | this.left = new BinarySearchTree(); 35 | } 36 | this.left.insert(key, value); 37 | }, 38 | 39 | rightInsert: function(key, value) { 40 | if (this.right == null) { 41 | this.right = new BinarySearchTree(); 42 | } 43 | this.right.insert(key, value); 44 | }, 45 | 46 | // Return the value that was inserted. 47 | // O(log n) with random input, O(n) worst-case. 48 | search: function(key) { 49 | if (this.key == null) { 50 | // We reached a leaf without success; that key was never inserted. 51 | return null; 52 | } else if (key < this.key) { 53 | if (this.left != null) { 54 | return this.left.search(key); 55 | } else { return null; } 56 | } else if (key > this.key) { 57 | if (this.right != null) { 58 | return this.right.search(key); 59 | } else { return null; } 60 | } else { 61 | return this.value; 62 | } 63 | }, 64 | 65 | // Ensure that search(key) returns null. 66 | // O(log n) with random input, O(n) worst-case. 67 | delete: function(key) { 68 | if (this.key == null) { 69 | return; 70 | } else if (key < this.key) { 71 | if (this.left != null) { 72 | this.left.delete(key); 73 | } 74 | } else if (key > this.key) { 75 | if (this.right != null) { 76 | this.right.delete(key); 77 | } 78 | } else { 79 | // We found the key to delete. 80 | 81 | if (this.left == null && this.right == null) { 82 | // We have no children, we just disappear. 83 | this.key = this.value = null; 84 | } else if (this.left == null) { 85 | // We have one child: replace ourselves with it.
86 | this.replaceWith(this.right); 87 | } else if (this.right == null) { 88 | this.replaceWith(this.left); 89 | } else { 90 | // We have two children. Replace with the biggest vertex on the left, 91 | // and delete that biggest vertex downward. 92 | var max = this.left.findMax(); 93 | // It cannot be null here. 94 | this.key = max.key; 95 | this.value = max.value; 96 | this.left.delete(max.key); 97 | } 98 | } 99 | }, 100 | 101 | replaceWith: function(tree) { 102 | this.key = tree.key; 103 | this.value = tree.value; 104 | this.left = tree.left; 105 | this.right = tree.right; 106 | }, 107 | 108 | // Return the biggest vertex. 109 | findMax: function() { 110 | if (this.right == null) { 111 | return this; 112 | } else { 113 | return this.right.findMax(); 114 | } 115 | }, 116 | 117 | // In-order walk. O(n). 118 | walk: function(f) { 119 | if (this.left != null) { 120 | this.left.walk(f); 121 | } 122 | if (this.key != null) { 123 | f(this.key, this.value); 124 | } 125 | if (this.right != null) { 126 | this.right.walk(f); 127 | } 128 | }, 129 | }; 130 | 131 | // Usage. 132 | 133 | var tree = new BinarySearchTree(); 134 | tree.insert("orange", "A citrus fruit with a slightly sour flavour."); 135 | tree.insert("banana", "An elongated curved tropical fruit with a creamy flesh."); 136 | tree.insert("strawberry", "A sweet fruit of a plant of the genus Fragaria."); 137 | console.log("An orange is " + tree.search("orange")); 138 | tree.delete("orange"); 139 | console.log("Once deleted, an orange is " + tree.search("orange") + "."); 140 | tree.walk(function(key, value) { console.log("- " + key + ": " + value); }); 141 | -------------------------------------------------------------------------------- /tree/heap.js: -------------------------------------------------------------------------------- 1 | // Binary Heap as a Priority Queue. 2 | // 3 | // 1. Each vertex is bigger than any descendant. 4 | // 2. All levels of the tree must be full except possibly the bottom one.
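Before the implementation, the index arithmetic behind the array encoding can be sketched on its own (a standalone sketch; the helper names are hypothetical — heap.js below inlines the same formulas):

```javascript
// A complete binary tree stored breadth-first in an array.
// For the vertex at 0-based index i:
function parentIndex(i) { return Math.floor((i - 1) / 2); }
function leftChildIndex(i) { return 2 * i + 1; }
function rightChildIndex(i) { return 2 * i + 2; }

// The max-heap      9
//                 5   8
//                1 2
// is stored level by level as:
var heapArray = [9, 5, 8, 1, 2];

console.log(heapArray[leftChildIndex(0)]);   // 5
console.log(heapArray[rightChildIndex(0)]);  // 8
console.log(heapArray[parentIndex(4)]);      // 5 (the parent of 2)
```

Because the tree is complete, this encoding wastes no slots and needs no child pointers.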
5 | 6 | function BinaryHeap() { 7 | this.size = 0; 8 | this.array = []; 9 | } 10 | 11 | BinaryHeap.prototype = { 12 | // In this array, for each vertex at position n: 13 | // - the left child is at 2*n + 1, 14 | // - the right child is at 2*n + 2. 15 | array: [], // Shadowed by a per-instance array in the constructor. 16 | 17 | // First, some obvious functions specific to the use of an array. 18 | swap: function(i, j) { 19 | var tmp = this.array[i]; 20 | this.array[i] = this.array[j]; 21 | this.array[j] = tmp; 22 | }, 23 | push: function(item) { this.size++; this.array.push(item); }, 24 | pop: function() { this.size--; return this.array.pop(); }, 25 | 26 | // Now, the real magic. 27 | // O(log n) worst-case. 28 | insert: function(priority, value) { 29 | this.push({key: priority, value: value}); 30 | this.shiftUp(this.size - 1); 31 | }, 32 | 33 | max: function() { 34 | return this.array[0]; 35 | }, 36 | 37 | removeMax: function() { if (this.size === 0) { return null; } 38 | // Swap the max with the last element, pop it off, and push the new top down. 39 | this.swap(0, this.size - 1); 40 | var max = this.pop(); 41 | this.shiftDown(0); 42 | return max; 43 | }, 44 | 45 | // Primitives required for balancing the tree. 46 | 47 | shiftUp: function(j) { 48 | for (;;) { 49 | // Is j's parent (i) bigger? 50 | // If yes, we have kept rule #1, we can exit. 51 | var i = Math.floor((j - 1) / 2); 52 | if (i < 0) { i = 0; } 53 | if (this.array[i].key >= this.array[j].key) { 54 | break; 55 | } 56 | this.swap(i, j); 57 | j = i; 58 | } 59 | }, 60 | 61 | shiftDown: function(i) { 62 | var j; 63 | for (;;) { 64 | // j1 is the left child of i. 65 | // j2 is the right child of i. 66 | var j1 = 2*i + 1; 67 | var j2 = 2*i + 2; 68 | 69 | // We want to switch i with the biggest of its children, 70 | // to maintain rule #1. 71 | if (j1 >= this.size) { break; } 72 | if ((j2 < this.size) && (this.array[j1].key <= this.array[j2].key)) { 73 | j = j2; 74 | } else { 75 | j = j1; 76 | } 77 | 78 | // If we already follow rule #1, we're good to go.
79 | if (this.array[i].key >= this.array[j].key) { break; } 80 | 81 | // We don't follow rule #1. Swap and continue. 82 | this.swap(i, j); 83 | i = j; 84 | } 85 | }, 86 | }; 87 | 88 | var heap = new BinaryHeap(); 89 | heap.insert(2, 'two'); 90 | heap.insert(5, 'five'); 91 | heap.insert(3, 'three'); 92 | heap.insert(4, 'four'); 93 | heap.insert(1, 'one'); 94 | console.log('The top of the heap is ' + heap.max().key + ' (five).'); 95 | console.log('Reverse sorted items:'); 96 | var item; 97 | while (item = heap.removeMax()) { 98 | console.log(item.key + '. ' + item.value); 99 | } 100 | -------------------------------------------------------------------------------- /tree/red-black.js: -------------------------------------------------------------------------------- 1 | // A Red-Black tree has four properties: 2 | // 3 | // 1. Each vertex is either red or black. 4 | // 2. The root is black. 5 | // 3. A red vertex cannot have a red child. 6 | // 4. All paths from root to leaf must have the same number of black vertices. 7 | 8 | function RedBlackTree() {} 9 | 10 | var color = {red: 0, black: 1}; 11 | RedBlackTree.prototype = { 12 | key: null, 13 | value: null, 14 | left: null, 15 | right: null, 16 | parent: null, 17 | color: color.black, 18 | 19 | // O(log n) worst-case. 20 | // We choose to implement it procedurally instead of recursively to separate 21 | // all steps. 22 | insert: function(key, value) { 23 | var v = this; 24 | 25 | // First, find the insertion position as with a normal binary search tree. 26 | for (;;) { 27 | if (v.key == null) { 28 | // This is below a leaf or the tree is empty. 29 | v.key = key; 30 | v.value = value; 31 | v.color = color.red; // ← Every inserted vertex starts out red. 32 | break; 33 | } else if (key < v.key) { 34 | v = this.produceLeft(v); 35 | } else if (key > v.key) { 36 | v = this.produceRight(v); 37 | } else { 38 | v.value = value; return; // v.key == key: the key was already there; update it and stop. 39 | } 40 | } 41 | 42 | // Second, compare it to its parent.
43 | for (;;) { 44 | if (v.parent == null) { 45 | // Case 1. We are inserting the root: paint it black for rule #2. 46 | v.color = color.black; 47 | return; 48 | } 49 | if (v.parent.color === color.black) { 50 | // Case 2. Black parent, Red child, we are not breaking any rule. 51 | return; 52 | } 53 | // Red parent, Red child, we are breaking rule #3. 54 | // We know we have a black grandparent, since the root cannot be red, 55 | // but our parent is. 56 | var grandparent = v.parent.parent; 57 | var uncle = v.uncle(); 58 | if (uncle != null && uncle.color === color.red) { 59 | // Case 3. 60 | // (G) [G] 61 | // [P] [U] → (P) (U) 62 | // [V] [V] 63 | // 64 | // G = grandparent, P = parent, U = uncle; [Red], (Black). 65 | // 66 | // This keeps the number of blacks equal whatever the path. 67 | v.parent.color = color.black; 68 | uncle.color = color.black; 69 | grandparent.color = color.red; // Red is the new black. ☺ 70 | // What about the grandparent? 71 | // Did we make it break rule #2 (black root) or #3 (two red vertices)? 72 | v = grandparent; 73 | // Back to Case 1. 74 | // Case 3 is the only looping case, which is why insert() is O(log n). 75 | continue; 76 | } 77 | if (v.parent.right === v && v.parent === grandparent.left) { 78 | // Case 4. 79 | // (G) (G) 80 | // [P] (U) → [V] (U) 81 | // [V] [P] 82 | this.leftRotation(v.parent); 83 | // We still have two red vertices breaking rule #3. 84 | // We will fix that with case 5. 85 | v = v.left; 86 | } else if (v.parent.left === v && v.parent === grandparent.right) { 87 | // Case 4 cont. 88 | // (G) (G) 89 | // (U) [P] → (U) [V] 90 | // [V] [P] 91 | this.rightRotation(v.parent); 92 | // We still have two red vertices breaking rule #3. 93 | // We will fix that with case 5. 94 | v = v.right; 95 | } 96 | if (v.parent.left === v && v.parent === grandparent.left) { 97 | // Case 5. 
98 | // (G) (P) 99 | // [P] (U) → [V] [G] 100 | // [V] (U) 101 | this.rightRotation(grandparent); 102 | v.parent.color = color.black; 103 | v.parent.right.color = color.red; // The vertex now holding the grandparent's data (a root rotation moves data, not vertices). 104 | // All is safe now. 105 | return; 106 | } else if (v.parent.right === v && v.parent === grandparent.right) { 107 | // Case 5 cont. 108 | // (G) (P) 109 | // (U) [P] → [G] [V] 110 | // [V] (U) 111 | this.leftRotation(grandparent); 112 | v.parent.color = color.black; 113 | v.parent.left.color = color.red; // The vertex now holding the grandparent's data. 114 | // All is safe now. 115 | return; 116 | } 117 | } 118 | }, 119 | 120 | // G 121 | // P U ← uncle 122 | // V ← vertex (this) 123 | uncle: function() { 124 | if (this.parent == null || this.parent.parent == null) { 125 | return null; 126 | } 127 | var grandparent = this.parent.parent; 128 | if (grandparent.right === this.parent) { 129 | return grandparent.left; 130 | } else { 131 | return grandparent.right; 132 | } 133 | }, 134 | 135 | // Tree rotations preserve the properties of binary search trees. 136 | 137 | // P V 138 | // V C → A P 139 | // A B B C 140 | rightRotation: function(parent) { 141 | var v = parent.left; 142 | var b = v.right; 143 | var grandparent = parent.parent; 144 | if (grandparent != null) { 145 | var parentIsLeft = (grandparent.left === parent); 146 | v.right = parent; parent.parent = v; 147 | parent.left = b; if (b != null) { b.parent = parent; } 148 | v.parent = grandparent; 149 | if (parentIsLeft) { 150 | grandparent.left = v; 151 | } else { 152 | grandparent.right = v; 153 | } 154 | } else { 155 | // parent (which is root) becomes v. 156 | // P P 157 | // V C → A V 158 | // A B B C 159 | v.switchData(parent); 160 | var a = v.left; 161 | v.left = b; 162 | v.right = parent.right; if (v.right != null) { v.right.parent = v; } 163 | parent.left = a; if (a != null) { a.parent = parent; } 164 | parent.right = v; 165 | parent.parent = null; 166 | v.parent = parent; 167 | } 168 | }, 169 | 170 | // P V 171 | // A V → P C 172 | // B C A B 173 | leftRotation: function(parent) { 174 | var v = parent.right; 175 | var b = v.left; 176 | var grandparent = parent.parent; 177 | if (grandparent != null) { 178 | var parentIsLeft = (grandparent.left === parent); 179 | v.left = parent; parent.parent = v; 180 | parent.right = b; if (b != null) { b.parent = parent; } 181 | v.parent = grandparent; 182 | if (parentIsLeft) { 183 | grandparent.left = v; 184 | } else { 185 | grandparent.right = v; 186 | } 187 | } else { 188 | // parent (which is root) becomes v. 189 | // P P 190 | // A V → V C 191 | // B C A B 192 | v.switchData(parent); 193 | var c = v.right; 194 | v.right = b; 195 | v.left = parent.left; if (v.left != null) { v.left.parent = v; } 196 | parent.right = c; if (c != null) { c.parent = parent; } 197 | parent.left = v; 198 | parent.parent = null; 199 | v.parent = parent; 200 | } 201 | }, 202 | 203 | switchData: function(vertex) { 204 | var key = this.key; 205 | var value = this.value; 206 | var color = this.color; 207 | 208 | this.key = vertex.key; 209 | this.value = vertex.value; 210 | this.color = vertex.color; 211 | 212 | vertex.key = key; 213 | vertex.value = value; 214 | vertex.color = color; 215 | }, 216 | 217 | produceLeft: function(vertex) { 218 | var left = vertex.left; 219 | if (left == null) { 220 | left = vertex.left = new RedBlackTree(); 221 | left.parent = vertex; 222 | } 223 | return left; 224 | }, 225 | 226 | produceRight: function(vertex) { 227 | var right = vertex.right; 228 | if (right == null) { 229 | right = vertex.right = new RedBlackTree(); 230 | right.parent = vertex; 231 | } 232 | return right; 233 | }, 234 | 235 | // Return the value that was inserted. 236 | // O(log n) worst-case. 237 | // Same implementation as the plain binary search tree.
238 | search: function(key) { 239 | if (this.key == null) { 240 | // We reached a leaf without success; that key was never inserted. 241 | return null; 242 | } else if (key < this.key) { 243 | if (this.left != null) { 244 | return this.left.search(key); 245 | } else { return null; } 246 | } else if (key > this.key) { 247 | if (this.right != null) { 248 | return this.right.search(key); 249 | } else { return null; } 250 | } else { 251 | return this.value; 252 | } 253 | }, 254 | 255 | // In-order walk. O(n). 256 | // Same implementation as the plain binary search tree. 257 | walk: function(f) { 258 | if (this.left != null) { 259 | this.left.walk(f); 260 | } 261 | if (this.key != null) { 262 | f(this.key, this.value); 263 | } 264 | if (this.right != null) { 265 | this.right.walk(f); 266 | } 267 | }, 268 | }; 269 | 270 | // Usage. 271 | var tree = new RedBlackTree(); 272 | // Note that those insertions would create a linked list with a naive 273 | // binary search tree. 274 | tree.insert("banana", "An elongated curved tropical fruit with a creamy flesh."); 275 | tree.insert("orange", "A citrus fruit with a slightly sour flavour."); 276 | tree.insert("strawberry", "A sweet fruit of a plant of the genus Fragaria."); 277 | console.log("An orange is " + tree.search("orange")); 278 | tree.walk(function(key, value) { console.log("- " + key + ": " + value); }); 279 | --------------------------------------------------------------------------------
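The remark in tree/Readme.md that a binary heap can sort in place can be made concrete. This is a minimal standalone heapsort sketch: it builds a max-heap directly inside the array (bottom-up heapify), then repeatedly swaps the maximum to the end and sifts the new top down. It works on plain numbers and does not reuse the BinaryHeap from heap.js.

```javascript
// In-place heapsort: O(n) heapify, then n extractions of O(log n) each.
function heapsort(a) {
  // Push a[i] down until both children are smaller, within a[0..size).
  function siftDown(i, size) {
    for (;;) {
      var largest = i;
      var l = 2 * i + 1;
      var r = 2 * i + 2;
      if (l < size && a[l] > a[largest]) { largest = l; }
      if (r < size && a[r] > a[largest]) { largest = r; }
      if (largest === i) { return; }
      var tmp = a[i]; a[i] = a[largest]; a[largest] = tmp;
      i = largest;
    }
  }
  // Heapify: sift down every internal vertex, bottom-up.
  for (var i = Math.floor(a.length / 2) - 1; i >= 0; i--) {
    siftDown(i, a.length);
  }
  // Swap the max to the end, shrink the heap, restore rule #1.
  for (var end = a.length - 1; end > 0; end--) {
    var tmp = a[0]; a[0] = a[end]; a[end] = tmp;
    siftDown(0, end);
  }
  return a;
}

console.log(heapsort([3, 1, 4, 1, 5, 9, 2, 6]));  // [1, 1, 2, 3, 4, 5, 6, 9]
```

Heapify runs bottom-up because each sift-down then operates on subtrees that are already heaps, which is what makes the build phase O(n) rather than O(n log n).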