├── Complexity.md
├── Readme.md
├── graph
│   └── Readme.md
├── list
│   ├── Readme.md
│   ├── binary-search.js
│   ├── shuffle.js
│   └── sort
│       ├── merge-sort.js
│       └── quicksort.js
├── misc
│   ├── cryptography.md
│   ├── engineering.md
│   ├── memory.md
│   ├── network.md
│   ├── network
│   │   ├── physical.md
│   │   └── protocol.md
│   ├── reliability.md
│   ├── statistics.md
│   ├── synchronization.md
│   └── time.md
└── tree
    ├── Readme.md
    ├── binary-search.js
    ├── heap.js
    └── red-black.js

/Complexity.md:
--------------------------------------------------------------------------------
1 | ```
2 | f(n) = O(g(n)) ⇔ |f(n)| ≤ |g(n)|·k "grows less than"
3 | f(n) = Ω(g(n)) ⇔ f(n) ≥ g(n)·k "grows more than"
4 | f(n) = Θ(g(n)) ⇔ g(n)·k1 ≤ f(n) ≤ g(n)·k2 "bounded by"
5 | ```
6 | 
7 | (Insert "∃k>0: ∃n0: ∀n>n0" where needed.)
8 | 
9 | # Master theorem
10 | 
11 | An algorithm has complexity `T(n) = a T(n/b) + f(n)`.
12 | 
13 | 1. `f(n) = O(n^(logb(a)-ε))` → `T(n) = Θ(n^(logb(a)))`
14 | 2. `f(n) = Θ(n^(logb(a)))` → `T(n) = Θ(n^(logb(a)) log(n))`
15 | 3. `f(n) = Ω(n^(logb(a)+ε))` (plus a regularity condition) → `T(n) = Θ(f(n))`
16 | 
17 | # Typical complexities
18 | 
19 | | Complexity | Description |
20 | |------------|-------------|
21 | | O(1) | Getting an element from an array |
22 | | O(m α(m,n))| Best minimum spanning tree (α = inverse Ackermann) |
23 | | O(log n) | Binary search |
24 | | O(n) | Maximum of an unsorted array |
25 | | O(n log n) | Best comparison sort |
26 | | O(n^2) | Naive vector cross product |
27 | | O(n^3) | Naive matrix multiplication |
28 | | 2^O(log n) | P; Karmarkar (Linear programming); AKS (Primes) |
29 | | 2^o(n) | Integer factorization |
30 | | 2^O(n) | E; TSP, 3-SAT, Graph-coloring |
31 | 
32 | - SAT: can a given expression with variables, AND, OR, NOT and nesting be true?
33 | - TSP: minimum distance for a traveling saleswoman who must go through a set
34 |   of cities.
35 | - Graph coloring: give each vertex a different color than its neighbors, with a
36 |   fixed number of colors (2 is O(n), 3 is O(1.3289^n), k is O(2.445^n)).
37 | - Knapsack: keep a maximal value out of a set of elements with a value and a
38 |   mass, given a limit to how much mass you can keep.
39 |   It has a known O(n Mass) dynamic programming solution, but is NP-complete.
40 | - Exact cover: given a bunch of subsets (tetris piece locations) of a set
41 |   (board), select the subsets that cover the whole set with no overlap.
42 |   Pentomino tilings, Sudoku and N queens are of that form.
43 |   Donald Knuth implements it using "Dancing Links".
44 | 
45 | # Complexity classes
46 | 
47 | - P: solved on a deterministic Turing machine in polynomial time.
48 | - NP: solved on a non-deterministic Turing machine in polynomial time (has
49 |   overlapping parallel universes). A solution is verified by a deterministic
50 |   Turing machine in polynomial time. (The non-deterministic solution generates
51 |   all candidates and checks them in parallel.)
52 | - NP-hard: if that problem were solved in O(1), all problems in NP would be in P.
53 | - NP-complete: NP-hard and NP.
54 | 
55 | ```
56 | ← easy … hard →
57 | ├────────NP─────────┤
58 | ├─P─┤  ├─NP-complete─┤
59 |        ├────NP-hard────┤
60 | ```
61 | 
62 | P could be equal to NP, but for all practical purposes, it is not (which may be
63 | proved in the future).
64 | 
--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
1 | # Succinct Cybernetics
2 | 
3 | All decidable problems can be solved with algorithms. For all we know, humans
4 | are Turing machines.
5 | 
6 | Most problems can be structured into a **graph**. [Graphs](/graph/Readme.md) have vertices (entities
7 | holding data) and edges (from one vertex to another, sometimes with a value).
8 | 
9 | Many problems can be structured into a **tree**. [Trees](/tree/Readme.md) are connected graphs
10 | without cycles. Most trees we use are rooted: one vertex is the entry point (the
11 | root); it is the only vertex with no edge pointing to it, all others have exactly
12 | one. Many trees are ordered: vertices order their children like a list.
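A rooted, ordered tree can be sketched as nested vertices (an illustrative sketch, not a file from this repository):

```js
// A rooted, ordered tree: each vertex holds data and an ordered list of
// children; only the root has no parent.
const tree = {
  data: 'root',
  children: [
    { data: 'first child', children: [] },
    { data: 'second child', children: [] },
  ],
};

// Walk the tree from the root, counting every vertex.
function countVertices(vertex) {
  return 1 + vertex.children.reduce((sum, c) => sum + countVertices(c), 0);
}

console.log(countVertices(tree)); // 3
```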
13 | 
14 | Some problems can be structured into a **list**. [Lists](/list/Readme.md) are rooted trees where a
15 | maximum of one child is allowed.
16 | 
17 | A few problems can be structured into a **map**. Maps are directed graphs where
18 | every vertex has either a single edge coming from it (keys) or at least one
19 | edge coming from a key (values).
20 | 
21 | (An uncommon variation of maps is the multimap, where keys can have more than a
22 | single edge coming from them.)
23 | 
24 | Another structure is a **set**. Sets are graphs with no edges.
25 | 
26 | ## Index
27 | 
28 | 1. [Complexity](/Complexity.md)
29 | 2. [Graphs](/graph/Readme.md)
30 | 3. [Trees](/tree/Readme.md)
31 | 4. [Lists](/list/Readme.md)
32 | 5. Misc:
33 |    - [Memory](/misc/memory.md)
34 |    - [Time](/misc/time.md)
35 |    - [Network](/misc/network.md)
36 |    - [Synchronization](/misc/synchronization.md)
37 |    - [Reliability](/misc/reliability.md)
38 |    - [Statistics](/misc/statistics.md)
39 |    - [Cryptography](/misc/cryptography.md)
40 |    - [Engineering](/misc/engineering.md)
41 | 
42 | ## Going further
43 | 
44 | - [Introduction to Algorithms](https://mitpress.mit.edu/books/introduction-algorithms)
45 | 
--------------------------------------------------------------------------------
/graph/Readme.md:
--------------------------------------------------------------------------------
1 | Graphs have vertices (entities holding data) and edges (from one vertex to
2 | another, sometimes with a value).
3 | 
4 | # Implementation
5 | 
6 | - **Adjacency matrix**: n by n matrix. Each slot is 0 (no edge between i and j),
7 |   1 (edge from i to j), potentially more for colored edges.
8 | - **Adjacency list**: list of n items, each with their vertex data and a pointer
9 |   to a list of indices of vertices it points to.
10 | - **Pointers**: vertices with a list of pointers to vertices. Useful for trees
11 |   or with a fixed maximum of adjacent vertices.
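As a sketch, an adjacency list for a small directed graph (illustrative values; vertices are identified by their index):

```js
// Adjacency list: vertices[i] holds the vertex data and the indices of the
// vertices that i points to.
const vertices = [
  { data: 'a', adjacent: [1, 2] }, // a → b, a → c
  { data: 'b', adjacent: [2] },    // b → c
  { data: 'c', adjacent: [] },
];

// Adding an edge is O(1) (a push); adjacency testing is O(v) in the worst
// case, since it scans one vertex's edge list.
function isAdjacent(vertices, i, j) {
  return vertices[i].adjacent.includes(j);
}

console.log(isAdjacent(vertices, 0, 2)); // true
console.log(isAdjacent(vertices, 2, 0)); // false
```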
12 | 13 | | | Adjacency matrix | Adjacency list | 14 | |-------------|------------------|----------------| 15 | | Storage | O(v^2) | O(v + e) | 16 | | Add vertex | O(v^2) | O(1) | 17 | | Add edge | O(1) | O(1) | 18 | | Rm vertex | O(v^2) | O(e) | 19 | | Rm edge | O(1) | O(e) | 20 | | Is adjacent | O(1) | O(v) | 21 | 22 | # Graph traversal 23 | 24 | ## Depth-first 25 | 26 | ``` 27 | Search(G, v): 28 | explored(v) 29 | ∀e∈edges(G, v): 30 | w = vertex(G, v, e) 31 | if unexplored(w): 32 | check w 33 | Search(G, w) 34 | ``` 35 | 36 | O(m) time, O(n) space (worst case). 37 | 38 | ## Breadth-first 39 | 40 | ``` 41 | Search(G, v): 42 | queue Q, set S 43 | enqueue(Q, v) 44 | add(S, v) 45 | while Q not empty: 46 | w = dequeue(Q) 47 | check w 48 | ∀e∈edges(G, w): 49 | u = vertex(G, w, e) 50 | if u not in S: 51 | add(S, u) 52 | enqueue(Q, u) 53 | ``` 54 | 55 | O(m) time, O(n) space (worst case). 56 | -------------------------------------------------------------------------------- /list/Readme.md: -------------------------------------------------------------------------------- 1 | # Sort 2 | 3 | Sorted lists are easier to search through / extract data, but maintaining that 4 | property requires either care or full-list sorting (as below). 5 | 6 | | | Average | Worst | Stable | Memory | Note | 7 | |-----------|---------------|---------------|--------|----------------|--------| 8 | |Merge sort | O(n log n) | O(n log n) | yes | O(n) | | 9 | |" in-place | O(n log(n)^2) | O(n log(n)^2) | yes | O(1) | | 10 | |Quicksort | O(n log n) | O(n^2) | no | O(log n) / O(n)|The pivot technique is used elsewhere| 11 | |Heapsort | O(n log n) | O(n log n) | no | O(1) |In-place| 12 | |Insertion | O(n^2) | O(n^2) | yes | O(1) |Booklike| 13 | 14 | ## Radix sort 15 | 16 | Famous for being "linear". O(wn) worst-case, with n the size of the list, and w 17 | the size of the items (eg, for 64 bit integers, 64). If the list has no 18 | duplicates, w will be ≥log(n), so it is not really linear. 
Use this only if you 19 | have mostly duplicates (this sort is stable). 20 | 21 | # Shuffle 22 | 23 | Fisher–Yates shuffle is O(n) and can be in-place (O(1) memory). 24 | 25 | # Search 26 | 27 | ## Binary search 28 | 29 | O(log n) search for a key with no index in a sorted list. 30 | 31 | You know the index is between a lower bound and an upper bound (initially 32 | including all of the list), and you reduce their span by checking if the item is 33 | on the left or on the right half of the span. 34 | 35 | If you don't know how big the list is, you can do **exponential search**: first 36 | exponentially try to find an index whose item is bigger than the searched item, 37 | then do binary search. 38 | 39 | ## Interpolation search 40 | 41 | Better average complexity. Instead of halving the span, cut the span in 42 | proportion to where the item should be. For instance, in the dictionary, the 43 | word "zebra" would be around the end. It only works if items are uniformly 44 | distributed (O(log log n)). 45 | -------------------------------------------------------------------------------- /list/binary-search.js: -------------------------------------------------------------------------------- 1 | // A list and an item that may be in that list. 2 | // Returns -1 if it is not there, or the index of that item if it is. 3 | function search(list, item) { 4 | return binarySearch(list, item, 0, list.length - 1); 5 | } 6 | 7 | function binarySearch(list, item, imin, imax) { 8 | while (imin < imax) { 9 | // Idea: average of imin and imax, (imin + imax) / 2. 10 | // If imin and imax are too large, their sum could trigger an integer 11 | // overflow (go past the largest integer representable). 
12 | // (imin + imin - imin + imax) / 2 = (imax - imin) / 2 + imin 13 | var imid = Math.floor((imax - imin) / 2) + imin; 14 | 15 | // 0 <= imin <= imid < imax 16 | 17 | if (list[imid] < item) { 18 | // |----|--x-| 19 | imin = imid + 1; 20 | } else { 21 | // |--x-|----| 22 | imax = imid; 23 | } 24 | } 25 | 26 | // Now, imin >= imax. 27 | // imin > imax if the list is empty. 28 | if ((imax === imin) && (list[imin] === item)) { 29 | return imin; 30 | } else { 31 | return -1; 32 | } 33 | } 34 | 35 | // 3 36 | console.log(search([2, 34, 321, 834, 854, 856], 834)); 37 | -------------------------------------------------------------------------------- /list/shuffle.js: -------------------------------------------------------------------------------- 1 | function shuffle(list) { 2 | for (var i = list.length - 1; i > 0; i--) { 3 | // i goes left: 4 | // j i 5 | // --x--x--- 6 | // unshuffled shuffled 7 | var j = Math.floor(Math.random() * (i + 1)); 8 | // i+1 because it should be able to stay in place. 9 | [list[i], list[j]] = [list[j], list[i]]; 10 | } 11 | return list; 12 | } 13 | 14 | console.log(shuffle([1, 2, 3, 4, 5])); 15 | -------------------------------------------------------------------------------- /list/sort/merge-sort.js: -------------------------------------------------------------------------------- 1 | // Classic divide-and-conquer algorithm. 2 | // This implementation is not in-place. 3 | 4 | function sort(list) { 5 | // Cut the list in a left piece (which we sort) and a right piece (which we 6 | // sort). 7 | var n = list.length; 8 | var target = new Array(n); 9 | // The smallest case is sublists of 1 item (which are then sorted). 10 | // Then we use sublists that double in size every time. 11 | for (var width = 1; width < n; width *= 2) { 12 | // Go from sublist to sublist. 13 | for (var i = 0; i < n; i += (2 * width)) { 14 | // Merge the sorted sublists from the last run. 
15 | merge(list, i, Math.min(i + width, n), Math.min(i + 2*width, n), target); 16 | } 17 | // The target contains the better data, we'll use list as the new buffer. 18 | var tmp = target; 19 | target = list; 20 | list = tmp; 21 | } 22 | return list; 23 | } 24 | 25 | // Items from ileft to iright-1 are sorted on their own, 26 | // items from iright to iend are sorted on their own. 27 | function merge(list, ileft, iright, iend, target) { 28 | // |------|------| 29 | // ileft iright iend 30 | var imiddle = iright; 31 | 32 | // We will cover each item eventually, by increasing ileft and iright. 33 | for (var j = ileft; j < iend; j++) { 34 | if (ileft < imiddle // We still have a left item. 35 | // We don't have a right item or it is larger. 36 | && ((iright >= iend) || (list[ileft] <= list[iright]))) { 37 | // Put the left item. 38 | target[j] = list[ileft]; 39 | ileft += 1; 40 | } else { 41 | // Put the right item. 42 | target[j] = list[iright]; 43 | iright += 1; 44 | } 45 | } 46 | } 47 | 48 | console.log(sort([2, 5, 4, 1, 3])); 49 | -------------------------------------------------------------------------------- /list/sort/quicksort.js: -------------------------------------------------------------------------------- 1 | // Classic recursive algorithm. 2 | 3 | function sort(list) { 4 | return quicksort(list, 0, list.length - 1); 5 | } 6 | 7 | function quicksort(list, lo, hi) { 8 | if (lo < hi) { 9 | // Find a pivot in the middle (can be random, here, it's at the end). 10 | // Things smaller than the pivot will all accumulate on the left. 11 | var pivot = list[hi]; 12 | var i = lo; 13 | for (var j = lo; j < hi; j++) { 14 | // Put all the items smaller than the pivot on the left. 15 | // |--------|------|---p 16 | // lo <=p i >p j hi 17 | if (list[j] <= pivot) { 18 | // Swap i and j. 19 | var tmp = list[j]; 20 | list[j] = list[i]; 21 | list[i] = tmp; 22 | i += 1; 23 | } 24 | } 25 | // i has now all items <=p on the left, and >p on the right. 
26 | // Put p (which is still on hi) between them. 27 | var tmp = list[hi]; 28 | list[hi] = list[i]; 29 | list[i] = tmp; 30 | 31 | // Sort the left side and the right side of the pivot. 32 | quicksort(list, lo, i - 1); 33 | quicksort(list, i + 1, hi); 34 | } 35 | return list; 36 | } 37 | 38 | console.log(sort([2, 5, 4, 1, 3])); 39 | -------------------------------------------------------------------------------- /misc/cryptography.md: -------------------------------------------------------------------------------- 1 | # Cryptography 2 | 3 | ## One-way function 4 | 5 | **One-way functions** are such that: 6 | 7 | - `one-way(input)` is computed in [polynomial-time](../Complexity.md) 8 | - All randomized polynomial-time functions `inverse(output)` such that 9 | `inverse(one-way(input))` have on average a near-zero probability to 10 | return `input` as the result of the computation of `inverse`. 11 | 12 | In practice, cryptanalysis can discover new ways to find the input from the 13 | output which changes the estimated probability, or machines can become more 14 | powerful than planned. As a result, it is necessary to stay up-to-date to 15 | correctly estimate risk. 16 | 17 | ### Hash 18 | 19 | A **Hash** is a one-way function returning a fixed-sized (typically small) 20 | output such that the following functions are computationally too hard: 21 | - `collision() = (m1, m2)` such that `hash(m1) = hash(m2)` 22 | - `preimage` such that `preimage(hash(m)) = m` 23 | - `second_preimage(m1) = m2` such that `hash(m1) = hash(m2)` 24 | 25 | It is useful to uniquely identify a large message in a small amount of memory 26 | (typically 64 bits (weak), 128 bits, or 256 bits) so that checking identity is 27 | fast. 28 | 29 | A regular NIST competition is performed to select a good hash function: the 30 | Secure Hash Algorithm (SHA). 
31 | 
32 | Ron Rivest’s MD5 is broken; SHA-0 and SHA-1 are considered broken; some SHA-2
33 | constructions have dangerous properties (vulnerability to *length-extension
34 | attacks*) that require the use of the HMAC algorithm for message authentication
35 | (but SHA-512/256 (ie. SHA-512 truncated to 256 bits) does not), and Joan
36 | Daemen’s SHA-3 is the latest as of 2018 (and does not have the SHA-2 issues).
37 | 
38 | Famous non-SHA cryptographic hash functions include BLAKE2 (derived from SHA
39 | finalist BLAKE, itself derived from djb’s ChaCha20), KangarooTwelve (derived
40 | from SHA-3).
41 | 
42 | Famous non-cryptographic hash functions include Zobrist (eg. to detect unique
43 | states in a game), FNV, CityHash, MurmurHash, SipHash (for hash tables).
44 | 
45 | **Universal hash functions** are a family of hash functions where a key
46 | determines which function of the family is picked (usually, a random number
47 | picked when the hash table is created in memory). They only target
48 | **collision-resistance** against an adversary *that doesn’t know the key*.
49 | *SipHash* (by JP Aumasson and djb) is secure under those assumptions, and fast
50 | enough to be used to avoid malicious collisions in hash tables causing
51 | performance degradation and unavailability.
52 | 
53 | A MAC (see below) can be used as a universal hash function by using a random
54 | key.
55 | 
56 | **Rolling hash functions** TODO
57 | 
58 | ### Message Authentication Code
59 | 
60 | A **Message Authentication Code** (MAC) can assert the following properties if
61 | you share a secret key with a given entity:
62 | - **authentication**: the message was validated by a keyholder,
63 | - **integrity**: the message was not modified by a non-keyholder.
64 | 
65 | It can be done with a hash; it is then called a **keyed hash function**.
66 | Modern hash functions such as SHA-3 or BLAKE2 offer this functionality this way: 67 | - `mac = authentify(message, key) = hash(key + message)` (`+` is string 68 | concatenation), 69 | - `verify(message, mac, key) = authentify(message, key) == mac`. 70 | 71 | Older hashes suffer from *length-extension attacks* with this approach. However, 72 | they can be used by relying on the **HMAC** algorithm: 73 | - `authentify(message, key) = hash((rehash(key)^outerPad) + 74 | hash((rehash(key)^innerPad) + message))` where `^` is XOR, the pads are fixed 75 | and the size of the hash block, and `rehash` depends on the hash size. 76 | 77 | Another common MAC is djb’s *Poly1305*. 78 | 79 | HOTP, TOTP: TODO 80 | 81 | ### Key derivation function 82 | 83 | One use of cryptographic hash functions is to store password, but only via a 84 | metafunction called a **key derivation function** (KDF) that performs **key 85 | stretching** (passing a low-[entropy][] password through the hash function in a 86 | loop a large number of times) and salting (putting a known random prefix to 87 | protect against *rainbow attacks*). 88 | 89 | Famous KDFs include PBKDF2, bcrypt, scrypt and Argon2, ordered by date of 90 | creation and increased confidence in security. They typically have a stretching 91 | parameter to increase their complexity so that the same algorithm can be used 92 | when computers get better at brute-forcing passwords. 93 | 94 | [entropy]: ./information.md 95 | 96 | (Note that key stretching is only about low entropy: large-enough purely-random 97 | passwords hashed through a cryptographic hash cannot be brute-forced. For 98 | instance, a 256-bit BLAKE2 of a 128-bit CSPRNG output has 256/2 = 128 bits of 99 | security. Brute-forcing it would require `2^127` attempts on average. Computers 100 | take at best 1 ns to perform an elementary operation, and `2^127` ns is 4 times 101 | the age of the known universe. 
Parallelizing would cost 20 septillion € of
102 | machines to brute-force a typical hash in 100 years, if Earth had enough
103 | material to build the computers. It goes down to 1 million € with 64 bits of
104 | security, which is why 128 bits are used when security matters, eg. with
105 | UUIDv4.)
106 | 
107 | ## Randomness
108 | 
109 | Humans are terrible at estimating randomness, and machines (and cryptographers)
110 | are pretty good at exploiting weaknesses in randomness.
111 | 
112 | A good random source obeys certain *statistical* characteristics to ensure that
113 | the probability of someone predicting its output is near zero.
114 | (See for instance the [NIST randomness recommendation][].)
115 | 
116 | [NIST randomness recommendation]: https://csrc.nist.gov/projects/random-bit-generation
117 | 
118 | One option is to gather and privately store data from physical events that are
119 | very hard for someone else to control, such as electric or atmospheric noise,
120 | or time noise in the occurrence of events in a booting operating system.
121 | 
122 | Another is to make a **pseudo-random** number generator (PRNG), such that
123 | `random = prng(seed)` is a function that yields a bit (or a fixed-sized list of
124 | bits, as a number) every time it is called, such that:
125 | 
126 | - it yields the same sequence of bits given the same seed,
127 | - the sequence of bits obeys the statistical characteristics we talked about.
128 | 
129 | Cryptographically-Secure PRNGs (CSPRNG) are designed more carefully but are
130 | typically slower. They are usually instances of a **pseudorandom function
131 | family** (PRF).
132 | 
133 | Examples: arc4random (based on a leaked version of the RC4 cipher), AES-CTR,
134 | ChaCha20 (eg. in Linux’ /dev/urandom).
135 | 
136 | Examples of non-cryptographically-secure: LCG, XorShift, Mersenne Twister, PCG
137 | (in order of quality against predictability).
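A minimal LCG shows the two PRNG properties above — determinism given a seed, and fixed-size outputs. (A sketch with the Numerical Recipes constants; emphatically *not* cryptographically secure: a few outputs reveal the whole state.)

```js
// Linear congruential generator: state = (a·state + c) mod 2^32.
// NOT a CSPRNG — for illustration only.
function lcg(seed) {
  let state = seed >>> 0; // coerce to an unsigned 32-bit integer
  return function next() {
    // Math.imul keeps the multiplication in 32-bit integer space.
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state;
  };
}

const a = lcg(42), b = lcg(42);
console.log(a() === b()); // true: same seed, same sequence
```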
138 | 139 | Typically, on Unix systems, `/dev/urandom` is a CSPRNG fed with a pool of 140 | entropy from boot-time randomness extracted from the operating system. 141 | 142 | ## Symmetric-key ciphers 143 | 144 | Cipher that defines two functions `encrypt(msg, key)` and `decrypt(msg, key)` such that: 145 | 146 | - `decrypt(encrypt(msg, key), key) = msg` 147 | - `encrypt` is a one-way function (and often `decrypt` too). 148 | 149 | **Reciprocal ciphers** are such that `decrypt = encrypt`, eg. the Enigma machine. 150 | 151 | Two common designs: stream and block ciphers. 152 | 153 | ### Stream ciphers 154 | 155 | They use a construct where every bit of information is encrypted 156 | one at a time; you give it the next bit of message, you instantly get the next 157 | bit of ciphertext out. 158 | 159 | The *Vigenère cipher* survived hundreds of years of cryptanalysis, earning it 160 | the name of “chiffre indéchiffrable” (indecipherable). While Babbage broke it by 161 | noticing repeated sequences in the plaintext exhibited repeated sequences in the 162 | ciphertext, it inspired the creation of the only provably unbreakable cipher, 163 | the *one-time pad*. 164 | 165 | The **one-time pad** simply performs modular addition on each symbol (in the 166 | case of bits, this corresponds to a XOR with the secret key). It requires a 167 | perfectly random secret key of a size equal to the plaintext that is never 168 | reused. *Claude Shannon* proved [information-theoretically](./information.md) 169 | that it is unbreakable (the only such cipher known to date), as for all 170 | plaintexts, there is a key that yields a given ciphertext. Practical risks in 171 | managing the key caused it to fall into disuse. 172 | 173 | Ron Rivest designed **RC4** (Rivest Cipher 4) as a proprietary algorithm for 174 | the RSA Security company. Following an anonymous online description, it was 175 | reverse-engineered. 
To avoid trademark conflicts, many systems adopted it as 176 | *ARC4*, and a derived CSPRNG was called *arc4random*. It was a common cipher in 177 | SSL/TLS and WEP/WPA, until a 2015 flaw was discovered. 178 | 179 | Daniel J. Bernstein (djb) designed *Salsa20* for the eSTREAM competition (a 180 | follow-up to the NESSIE competition where all stream ciphers submitted were 181 | broken). Many servers switched from RC4 to a derived cipher, **ChaCha20**, along 182 | with djb’s Poly1305 MAC, to have authenticated encryption ([RFC 7905][]). 183 | 184 | [RFC 7905]: https://tools.ietf.org/html/rfc7905 185 | 186 | ### Block ciphers 187 | 188 | A block cipher, by contrast with a stream cipher, can only encrypt a fixed 189 | number of bits (its *block size*). 190 | 191 | Most block ciphers are **product ciphers**: the generation of an encrypted block 192 | relies on repeating an operation (typically performing substitutions 193 | (**s-boxes**) and permutations (**p-boxes**)) multiple times by linking the 194 | output of one to the input of the next in a sophisticated *network* which 195 | increases security every time. (They achieve that by distributing the impact of 196 | each input bit to output bits, producing statistically more random output.) 197 | 198 | The number of times the network is repeated is called the number of **rounds**. 199 | Typical cryptanalysis first tries to break a cipher with a lower number of 200 | rounds. If they find a better algorithm than brute-force on all rounds, the 201 | cipher is considered *theoretically broken*, but the algorithm typically 202 | requires impractical amounts of time and memory. If it achieves a scale close to 203 | human lives and memory close to that of a country, it is considered *practically 204 | broken*. 205 | 206 | The US government’s NBS (ancestor to NIST) requested proposals for a cipher. 
IBM 207 | proposed **DES** (Data Encryption Standard), a 64-bit block cipher (based on a 208 | Feistel network), whose s-boxes were then tweaked by NSA and key size reduced to 209 | 56 bits before publication. 210 | 211 | When DES’s key size became dangerously close to brute-force-worthy, 3DES was 212 | produced, but it was very slow. NIST organized a more open competition, **AES** 213 | (Advanced Encryption Standard). The finalist, Vincent Rijmen and Joan Daemen’s 214 | Rijndael, is a 128-bit block cipher based on a SP-network, with three variants: 215 | 128-, 192-, 256-bit keys (with 10, 12, or 14 rounds). It was rebaptized AES when 216 | it won. 217 | 218 | ### Block modes 219 | 220 | Block modes convert a block cipher into a stream cipher by breaking the 221 | plaintext into blocks and encrypting each block with the cipher and a parameter 222 | that depends on the processing of previous blocks. 223 | 224 | The lack of use of that parameter, eg. by encrypting each block individually 225 | (*ECB* mode), falls to shifted plaintext analysis: identical plaintext blocks 226 | will have identical ciphertext blocks. 227 | 228 | **CBC** (Cipher Block Chaining) for instance XORs each block of plaintext with 229 | the ciphertext of the previous block. Diffie and Hellman also designed **CTR** 230 | (Counter) mode, which XORs the plaintext with an encrypted **nonce** (a unique 231 | input) that is incremented for every block. 232 | 233 | Parameters: key, nonce, plaintext. 234 | ciphertext block 1 = encrypt(nonce + 0, key) XOR (plaintext block 1) 235 | ciphertext block 2 = encrypt(nonce + 1, key) XOR (plaintext block 2) 236 | etc. 237 | 238 | When the parameter for each block is obtained from the previous block, the first 239 | block needs an initial parameter. It must be unique (ie. a nonce, “number used 240 | once”), so that encrypting twice the same message does not yield the same 241 | ciphertext, which would leak information. 
Often, it needs to be random in a 242 | cryptographically-secure way. That first parameter is called an **initialization 243 | vector**. It must be sent along with the ciphertext so it can be deciphered. 244 | 245 | Those ciphers only ensure **confidentiality** (the message can only be read by 246 | keyholders), but they lack: 247 | - **authentication**: the message was validated by a keyholder, 248 | - **integrity**: the message was not modified by a non-keyholder. 249 | 250 | The lack of those guarantees can allow a non-keyholder to tamper with the 251 | encrypted content unnoticed. The decrypted plaintext would then contain 252 | planted or substituted information. 253 | 254 | **Authenticated Encryption** (AE) offer authentication and integrity by adding a 255 | MAC. For instance, **GCM** (Galois Counter mode) converts a block cipher to an 256 | authenticated stream cipher which encrypts in counter mode and also produces a 257 | fixed-sized tag (a MAC) for the whole message. 258 | 259 | There are three variants of AE: **Encrypt-then-MAC** (EtM), which hashes 260 | encrypted data, **Encrypt-and-MAC** (E&M), which hashes plaintext data, and 261 | **MAC-then-Encrypt** (MtE), which encrypts hashed plaintext data. 262 | 263 | EtM is considered the most secure. MtE, for instance, has caused vulnerabilities 264 | such as Lucky13 in the way it interacts with padding. 265 | 266 | Usually, you can also insert non-encrypted metadata along with the ciphertext, 267 | which you wish to include for integrity in the AE MAC. That design is called 268 | **Authenticated Encryption with Associated Data** (AEAD). For instance, GCM mode 269 | supports that. 
270 | 271 | ## Asymmetrical cryptography 272 | 273 | Cipher that defines three functions `public, private = keys(random)`, 274 | `encryptPublic(msg, public)`, `encryptPrivate(msg, private)`, such that: 275 | 276 | - `encryptPrivate(encryptPublic(msg, public), private) = msg` 277 | - `encryptPublic(encryptPrivate(msg, private), public) = msg` 278 | - `encryptPublic` and `encryptPrivate` are one-way functions. 279 | 280 | ``` 281 | ┌───────────┐ ─ encryptPublic → ┌───────────┐ 282 | │ message 1 │ │ message 2 │ 283 | └───────────┘ ← encryptPrivate ─ └───────────┘ 284 | ``` 285 | 286 | Most ciphers rely on one of two common mathematically difficult problems to 287 | enforce the one-way constraint: 288 | - factoring primes (**RSA**), 289 | - elliptic curves (**EC**, eg. NIST P-256 (aka secp256r1), or Curve25519). The 290 | keys are typically smaller (eg. 256-bit, compare to 4096 bits for RSA). 291 | 292 | **RSA** (Rivest, Shamir, Adleman) was the first public-key cryptosystem, and 293 | shows how to encrypt data in its original formulation. However, it is usually 294 | used as **RSAES-OAEP** ([RFC 2437][]) for use in encryption, detailing the 295 | proper use of padding by relying on a hash function. 296 | 297 | [RFC 2437]: https://tools.ietf.org/html/rfc2437 298 | 299 | Elliptic curves don’t by themselves have an encryption algorithm, but **ECIES** 300 | (Elliptic Curve Integrated Encryption Scheme) combines an EC, a *KDF*, a *MAC*, 301 | and a *symmetric encryption scheme* to encrypt data just with a public key. 302 | 303 | ### Key exchange 304 | 305 | Encryption is much more computationally expensive than symmetric schemes for 306 | large (> 400 bytes) messages. Since the goal of asymmetric encryption is to 307 | allow secure communication over a public channel without needing a shared secret 308 | (the issue with symmetric ciphers), this is limiting. 
309 | 
310 | A **key exchange** is a protocol where two entities communicate in public,
311 | resulting in them generating a secret key that only they know.
312 | 
313 | **Diffie-Hellman** is a key exchange that lets two parties A and B obtain a
314 | shared secret over a public channel. That secret can then be used as the key of
315 | a symmetric cipher.
316 | 
317 | 1. They each generate public and private keys with the same parameters (the
318 |    modulo portion for RSA, the domain parameters for elliptic curves (ECDH)).
319 | 2. They agree on a base message `m`.
320 | 3. A sends `encryptPrivate(m, privateA)`, and B sends `encryptPrivate(m,
321 |    privateB)`.
322 | 4. A computes `secret = encryptPrivate(encryptPrivate(m, privateB), privateA)`
323 |    and B `secret = encryptPrivate(encryptPrivate(m, privateA), privateB)`, whose
324 |    equality results from commutativity in the underlying math.
325 | 5. `secret` is now shared exclusively between A and B.
326 | 
327 | Advice for the common values to choose is detailed in
328 | [RFC 5114](https://tools.ietf.org/html/rfc5114).
329 | 
330 | Daniel J. Bernstein’s **X25519** is a famous ECDH using Curve25519 (picked by
331 | djb for that use).
332 | 
333 | Systems which generate a new random secret key for every new session are said to
334 | have **forward secrecy**.
335 | 
336 | ### Digital signature
337 | 
338 | **Digital signatures** associated with a message give the following guarantees:
339 | - **authentication**: the message was validated by a keyholder,
340 | - **non-repudiation**: the message cannot be un-validated by the keyholder,
341 | - **integrity**: the message was not modified by a non-keyholder.
342 | 
343 | Unlike a MAC, it does not require a shared secret, just shared public keys.
344 | 345 | It relies on two functions, `sign()` and `verify()`, and a public/private key 346 | pair, such that: 347 | - `verify(message, sign(message, private), public) = true` and all other 348 | parameter combinations yield `false` with high probability. 349 | 350 | For RSA, this is achieved as follows: 351 | - `sign(message, private) = encryptPrivate(hash(message), private)` 352 | - `verify(message, signature, public) = encryptPublic(signature, public) == 353 | hash(message)` 354 | 355 | In practice, **RSASSA-PKCS1-v1\_5** defines an RSA signature scheme for a given 356 | hash. 357 | 358 | **RSASSA-PSS** defines another RSA signature scheme for a given hash, mask 359 | generation formula, and a randomly generated salt of a given size. Both of those 360 | schemes are defined in [RFC 3447][]. 361 | 362 | [RFC 3447]: https://tools.ietf.org/html/rfc3447 363 | 364 | **ECDSA** (Elliptic Curve Digital Signature Algorithm) achieves that scheme, 365 | given any EC and a hash. 366 | 367 | Daniel J. Bernstein’s **EdDSA** is another digital signature scheme relying on 368 | Edwards curves (such as Curve25519), and tends to be faster than ECDSA. The 369 | primary example is ed25519, which is included in OpenSSH. 370 | 371 | ## Quantum Cryptography 372 | 373 | **Shor’s algorithm** solves integer factorization in polynomial time on a 374 | quantum computer, which breaks the complexity assumption of RSA (and a variant 375 | would also break elliptic curve cryptography). The part of the algorithm running 376 | on a classical computer is randomized: pick a number F < N, discard it in some 377 | edge-cases. The quantum part finds the period of f(x) = F^x mod(N) by entangling 378 | photons in a circuit such that interference will cause the observation of the 379 | photons to collapse to one of several states which follow an equation of the 380 | period. Multiple equations allow solving for the period.
Then the classical 381 | computer checks that it can use F and the period to find factors of N. 382 | 383 | However, quantum computers struggle with: 384 | - **Coherence time** (how much time the qubits stay uncorrupted). Keeping an 385 | algorithm running > 10 s is a challenge. 386 | - Number of qubits. The largest quantum computer has just 2000 qubits. Shor’s 387 | algorithm needs twice the size of the RSA key, eg. 4096 for 2048-bit RSA, and 388 | likely ten times that to correct errors. 389 | 390 | The largest number factorized by a quantum computer is 19-bit. There is no case 391 | yet of a quantum computation going faster than the equivalent classical 392 | computation (aka. **quantum supremacy**). 393 | 394 | It should eventually happen, which is why other techniques for asymmetric 395 | cryptography are researched, such as **lattice cryptography**. 396 | 397 | ## Going further 398 | 399 | - [Serious Cryptography](https://seriouscrypto.com) 400 | -------------------------------------------------------------------------------- /misc/engineering.md: -------------------------------------------------------------------------------- 1 | # Engineering 2 | 3 | 1. Define the desired features of a solution for your problem. 4 | 2. Break the problem into tractable subproblems. 5 | 3. Make the simplest working solution to each subproblem. 6 | 4. Compute all limits of your solution (performance, storage… even if the 7 | results are astronomically high). 8 | 5. Research the mathematical and physical laws causing those limits. 9 | 6. Improve the implementation. 10 | 7. When that is no longer enough, improve the design. 11 | 12 | ## Example problem: image store 13 | 14 | 1. We add images with `POST /images` which returns a string ID, and 15 | `GET /images/ID` returns the image. 16 | 2. We must generate unique IDs, store the image, associate an ID to a 17 | location in the store, get the image from its ID. 18 | 3. 
Generate the ID by incrementing a global variable, keep a vector of pointers 19 | to images in the heap, fetch an image by looking it up in the vector at the 20 | ID's index. 21 | 4. Limits: 22 | - Storage costs are about 6 €/GB and the maximum amount of images is about 23 | 600 GB (about 0.6 million photos) since we must have a single server. 24 | - A 1 MB image takes about 10 ms gigabit ethernet throughput + 50 μs RAM 25 | throughput (unnoticeable) = 10 ms to load. 26 | (cf. [memory limits](./memory.md).) 27 | - The probability of data loss in a given year on a server with 28 | 99.999% SLA for operating system uptime with a 3-min reboot (ie, averaging 29 | 1.8 reboots a year) can be computed from a [Poisson](./statistics.md) 30 | distribution as 1-exp(-1.8) (about 0.8). 31 | 5. The price limit is about the diminishing returns of economies of scale of 32 | DRAM production and the infrastructure of its construction and distribution. 33 | The amount limit is about efficiently packing DRAM on a single board for 34 | cloud providers. The probability limit is about electrical volatility of DRAM 35 | state. 36 | 37 | The latter is the primary issue to address. We can store images on disk instead, 38 | on a drive mounted on /img, with the file name equal to the hex representation 39 | of the 64-bit ID, eg. /img/000000000000000c for ID 12. 40 | 41 | (Depending on how the file system deals with directories with a large number of 42 | files, you may need to segment the key space: /img/a0/00000000000000a0 for ID 43 | 160, /img/ef/00000000deadbeef for ID 3735928559.) 44 | 45 | - Storage reaches 40 €/TB, the maximum (still single-server) is about 100 TB 46 | (about 100 million photos). 47 | - A 1 MB image takes 10 ms gigabit ethernet throughput + 10 ms disk throughput + 48 | 10 ms disk latency = 30 ms. 49 | - The probability of loss in a given year (aka. annualized failure rate) for a 50 | disk averaging 0.02 failures a year is `1-exp(-0.02)` (about 0.02). 
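The failure probabilities above all come from the same Poisson model: a component averaging λ failures per year has probability 1 − e^−λ of failing at least once in a year. A small sketch (function name is ours):

```javascript
// Annualized failure rate from a Poisson model: for a component that
// averages `failuresPerYear` failures, P(≥1 failure in a year) = 1 - e^-λ.
const afr = (failuresPerYear) => 1 - Math.exp(-failuresPerYear);

console.log(afr(0.02).toFixed(3)); // "0.020" — the disk figure above
console.log(afr(1.8).toFixed(2));  // "0.83" — the reboot example above
```

For small λ, 1 − e^−λ ≈ λ, which is why the disk's 0.02 failures/year and its ≈0.02 AFR look identical.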
51 | 52 | Going with an SSD instead yields: 53 | - 250 €/TB, maxing out at about 100 TB. 54 | - A 1 MB image takes about 10 ms gigabit ethernet throughput + 1 ms drive 55 | throughput + 30 μs drive latency = 10 ms. 56 | - The annualized failure rate is about 0.007. 57 | 58 | From then on, you can scale by synchronizing multiple servers. A master 59 | server holds a persisted map from ID to the storage server where the image is 60 | stored, loads the data from there to RAM and streams it through. 61 | 62 | When storing an image, the master server sends the image to the storage server 63 | (10 ms) and they simultaneously write to drive: the image for the storage server 64 | (30 μs latency + 1 ms throughput) and the mapping from ID to the storage server 65 | for the master server. 66 | 67 | Both when posting and getting the image, the image can be buffered instead of 68 | being fully loaded by the metadata server and then transmitted, which reduces 69 | image loading latency (time-to-first-byte) to just 500 μs round trip between 70 | servers within a datacenter + 50 μs SSD latency = 550 μs. The full image will 71 | still be loaded after 10 ms. 72 | 73 | Buffering also reduces the amount of RAM necessary to the number of concurrent 74 | requests multiplied by the size of the buffer, plus the mapping between ID and 75 | storage server. 76 | 77 | To reduce the RAM cost, the mapping can be put on disk with a **key-value 78 | store** (eg. [RocksDB](http://rocksdb.org/)) without affecting latency (a write 79 | is about 60 μs, a read 8 μs). 80 | 81 | The new bottleneck is the storage of the mapping from ID to storage server. 82 | - We can reach about 100 TB drive ÷ (64 bits ID + 32 bits index of storage 83 | server) = 8 trillion images (ignoring the overhead of storing into SST files). 84 | - The death of the drive causes complete loss of the data, and it is still at an 85 | AFR of 0.7%. 86 | 87 | We can mitigate the latter through backups.
If we do them every B hours, get 88 | P posts per hour and recover in R hours, we will lose 0.7 × (B + R) × P ÷ (P × 89 | 8760) = 0.7 × (B + R) ÷ 8760 percent of posts a year — the posting rate P cancels out (eg. a 99.9999% SLA for hourly backups with 100 90 | posts per second and 2-minute recovery). 91 | **Streaming replication** can get our SLA very high. 92 | 93 | Individual storage servers can also fail, so we can replicate: we save images 94 | to two servers in parallel, and we fetch them from the first server that is not 95 | dead. The probability that any two replicas die within the time it takes to 96 | rereplicate (say, 1h) is very low, 1-(1-8×10^-7)^I (where I is the number of 97 | images), but at a million images, a loss is as likely as a coin toss. 98 | **Trireplication** brings it down to 1-(1-6×10^-13)^I; you need a trillion 99 | images for a loss to again be a coin toss. 100 | 101 | (Another issue is that of [bitrot](./memory.md). A solution used in GFS (Google 102 | File System) is to check for corruption by computing a checksum, and fetching 103 | from one of the other two replicas if it doesn't match the stored checksum. Its 104 | successor Colossus doesn't trireplicate, but stores a replica off-site and a 105 | Reed-Solomon erasure code on a separate server in the datacenter. If a 106 | corruption is found, the data is replaced by its remote replica.) 107 | 108 | The new bottleneck is the throughput of image posts, since they all go through a 109 | single server. Write frequency reaches 100 kHz [in some benchmarks][badger]. 110 | 111 | [badger]: https://blog.dgraph.io/post/badger/ 112 | 113 | To solve that, we rely on a **distributed key-value store**, such as etcd or 114 | FoundationDB, which distributes writes through automatic segmentation of the key 115 | space into chunks each assigned to a server, and through coordination of that 116 | partitioning and of writes by using Paxos or [Raft][] (two similar 117 | [consensus algorithms](./synchronization.md)).
118 | 119 | [Raft]: https://raft.github.io/ 120 | 121 | Since there is no longer a single writer, we can no longer generate IDs by 122 | incrementing a counter. We can solve this by generating 64-bit IDs that are the 123 | bitwise concatenation of the 32-bit ID of the master server and a 32-bit counter 124 | incremented on that server. 125 | 126 | The write latency goes up a bit because the consensus algorithm typically 127 | requires two round trips between the servers (1 ms). 128 | 129 | From then on, the next bottleneck is the gateway server your client is connected 130 | to: it has a limit in the number of concurrent requests it supports (the C10k 131 | problem). It can probably handle 10k concurrent requests, and each request takes 132 | 10 ms, which computes to a request frequency of 1 MHz. 133 | 134 | We can change the requirements to have the gateway redirect you to a random 135 | master server, which ultimately will be bottlenecked by the fact that 136 | coordination of the distributed key-value store requires all servers to keep 137 | some information about all other servers. That coordination also has hidden 138 | costs beyond thousands of master servers relating to gossip protocols, 139 | clock skew, and datacenter management. 140 | 141 | At that point, we can store on the order of 1k servers × 100 TB ÷ (64-bit ID + 142 | 3 × 32-bit storage server ID) = 5 quadrillion images. 143 | The storage servers can still hold them with a 32-bit ID: their limit is at 2^32 144 | servers × 100 TB ÷ 1 MB = 430 quadrillion images. 145 | 146 | The next step is then to use a **Distributed Hash Table** (DHT), wherein each 147 | server (and client) only needs to know a subset of all servers, and they get the 148 | value of a key typically in log(N) hops, where N is the number of servers. The 149 | latency is then at 10 + log(N) × L ms (with L the average latency between two 150 | servers: 1 ms within a city, but it rises fast when going to other cities).
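The conflict-free 64-bit ID scheme described earlier (a 32-bit master-server ID concatenated with that server's 32-bit counter) can be sketched as follows (a sketch; the helper names are ours):

```javascript
// 64-bit IDs that never collide across writers: high 32 bits identify the
// master server, low 32 bits are that server's own incrementing counter.
function makeId(serverId, counter) {
  return (BigInt(serverId) << 32n) | BigInt(counter);
}
function serverOf(id)  { return Number(id >> 32n); }
function counterOf(id) { return Number(id & 0xFFFFFFFFn); }

const id = makeId(7, 12);
console.log(id.toString(16).padStart(16, "0")); // "000000070000000c"
console.log(serverOf(id), counterOf(id));       // 7 12
```

The hex form doubles as the on-disk file name used earlier (eg. /img/000000070000000c).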
151 | 152 | At that point, since servers are no longer assigned an incrementing ID, image 153 | IDs are randomly generated, typically 128 bits (**UUID**), or obtained by 154 | [hashing][] the image (**content-addressing**). 155 | 156 | [hashing]: ./cryptography.md 157 | 158 | This is the design used by most object stores, such as S3, Ceph, GlusterFS, 159 | Bittorrent, or IPFS, and related databases, such as Cassandra and Dynamo. 160 | 161 | Then we reach the limit of Earth's surface area. The problem then involves the 162 | severe latency of interplanetary communication (13 minutes on average for light 163 | from Earth to reach Mars). 164 | -------------------------------------------------------------------------------- /misc/memory.md: -------------------------------------------------------------------------------- 1 | # Memory 2 | 3 | Memory is either **volatile** (requires constant power) or persistent, random-access (**RAM**) or sequential, read-only (**ROM**) or writable, and has varying costs, durability and transfer speeds. 4 | 5 | As far as speed is concerned, it is important to distinguish **latency** (duration between the request and the start of the response) and **throughput** (amount of bits per second). 6 | For instance, the fastest way to transmit 20 TB across the globe is on SD cards by cargo plane: even on 1 Gbps Ethernet, it takes 44 hours, while a plane takes 22 hours. 7 | Planes have tremendous throughput, but their latency is 22 hours, while Ethernet's will be around 100 ms. 8 | 9 | ## CPU 10 | 11 | The CPU has **registers** to store memory for computation. They are as fast as the processor (about 1 cycle per read or write), but there are about 16 64-bit registers (and 16 128-bit registers for floating point computation). 12 | (There are a handful of other registers for SIMD, SSE, …) 13 | 14 | The CPU also has caches with varying latencies and amounts: **L1** (0.5 ns), **L2** (7 ns), sometimes **L3**, altogether worth about 3 MB on laptops.
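The latency/throughput distinction above can be made concrete: transfer time ≈ latency + size ÷ throughput. A sketch with the figures from the text (treating the plane's effective throughput as practically unbounded, which is our assumption):

```javascript
// Transfer time = latency + size / throughput.
function transferSeconds(bits, latencySeconds, bitsPerSecond) {
  return latencySeconds + bits / bitsPerSecond;
}

const twentyTB = 20e12 * 8; // in bits

// 1 Gbps Ethernet with ~100 ms latency: throughput-bound.
const ethernet = transferSeconds(twentyTB, 0.1, 1e9);
// Cargo plane: 22 h latency, throughput assumed effectively unbounded.
const plane = transferSeconds(twentyTB, 22 * 3600, 1e15);

console.log((ethernet / 3600).toFixed(0)); // "44" hours
console.log((plane / 3600).toFixed(0));    // "22" hours
```

The plane's time is almost pure latency; the Ethernet's is almost pure throughput.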
15 | 16 | ## Main Memory 17 | 18 | The main location of volatile storage; 100 ns latency, 20 GB/s. When a process is run, its code and data are located in the RAM. The memory given to a process is separated into five **segments**: 19 | 20 | - The **stack** stores local variables of all function calls leading to and including the currently executed function. All local variables of a function are destroyed when the function ends. 21 | - The **heap** stores data dynamically allocated and deallocated by the program, pointed to by a **pointer** on the stack or on the heap. It is useful to manage a common data structure from several functions, but the indirection is slower than accessing data on the stack. 22 | - In glibc, allocation here is performed by `malloc()` and deallocation by `free()`. 23 | - In *reference counting*, each object on the heap stores an integer that keeps track of the number of pointers on the stack and on the heap pointing to it. When that count reaches zero, the object decrements the reference counts of the objects it has pointers to, and gets deallocated. When done automatically (without explicitly incrementing and decrementing the reference count), since reference cycles would prevent reference counts from ever reaching zero (eg. stack pointer → A(1) → B(1), then set a pointer in B to A: stack pointer → A(2) → B(1) → A(2), then pop the stack: A(1) → B(1) → A(1)), it is common to rely on explicit weak references (that don't contribute to the reference count; here, we would set the B → A reference as a weak reference). Another approach is to set them to be garbage-collected through mark-and-sweep. 24 | - In tracing GC (*garbage collection*), a heuristic determines when deallocation is needed, at which point an algorithm is run to flag all unreachable heap objects, and then deallocates them. 25 | - The **data** contains statically allocated data (ie. 
created when the program starts, destroyed when it ends): global variables, string constants… The ones that are not initialized at startup are in **BSS** (Block Started by Symbol). 26 | - The **text** stores the code as executable machine instructions. 27 | 28 | ``` 29 | ┌──────┬──────┬─────┬──────────────────────┬─────────┬──────────────────────┐ 30 | │ text │ data │ bss │ heap (grows right) → │ (empty) │ ← stack (grows left) │ 31 | └──────┴──────┴─────┴──────────────────────┴─────────┴──────────────────────┘ 32 | ``` 33 | 34 | Each program has its own read, write, and execute access to parts of memory, enforced by the operating system, which can terminate the program with a **segmentation fault** if it does an unauthorized operation. 35 | 36 | The process' memory relies on **virtual memory**: it normally lives on volatile storage. However, when main memory becomes scarce, some *memory pages* (fixed blocks of virtual memory) are *swapped*: they are transferred to auxiliary memory (ie, storage drives) to make room in main memory. 37 | 38 | **Primary storage** refers to registers, CPU caches, and main memory. 39 | 40 | ## Secondary Storage 41 | 42 | **Auxiliary memory** persists data for a few years without being powered. 43 | 44 | - **Hard Disk Drive** (HDD for short): rotating disk coated with magnetic material, with a magnetic head moving from the edge to the center of the disk to go to a particular memory position. 45 | - **Solid State Drive** (SSD, Flash Storage): integrated circuit, where bits are stored in transistor cells as trapped electrons on an insulator. 46 | 47 | The trade-off: transfer speeds are much lower than RAM's. 48 | 49 | - HDD: 100 MB/s, 10 ms latency, but it pays that latency every time it needs to move to a completely different location on disk. 50 | - SSD: 1 GB/s, 30 μs latency. 51 | 52 | Persistence is not a certainty. First, environmental events like water damage, shock, melting or electromagnetic fields can destroy stored information. 
53 | 54 | Second, drives age probabilistically. 55 | Manufacturers use two metrics: Mean Time Between Failures (MTBF, expected lifetime) and Annualized Failure Rates (AFR, probability that a drive dies within a year). The two are linked by a [Poisson](./statistics.md) distribution: `AFR = 1-exp(-8760/MTBF)` (with MTBF in hours). 56 | 57 | - HDD: AFR of [2%][Backblaze AFR]. Note that aging affects AFR (it is about 5% for 1.5 years, then 1.5% for 1.5 years, then 12%, [according to Backblaze][Backblaze age analysis]). 58 | - SSD: AFR of [0.7%][Microsoft SSD Failures]. 59 | 60 | [Backblaze AFR]: https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/ 61 | [Backblaze age analysis]: https://www.backblaze.com/blog/how-long-do-disk-drives-last/ 62 | [Microsoft SSD Failures]: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf 63 | 64 | Third, **bitrot** (ie. a random change of a bit from 0 to 1 or vice-versa) happens. HDDs are exposed to a huge amount of cosmic radiation that can impact their magnetic material; it happens maybe once a year. In addition to that, SSDs are made of 1- or 2-bit floating-gate transistors that have a very predictable wear. Eventually, they return the wrong value. 65 | 66 | ### File system 67 | 68 | Persistent storage is typically organized hierarchically, as a tree of files: leaves contain blobs of bytes, while their ancestors in the tree, directories, associate each child with a name. 69 | 70 | #### Disk Layout 71 | 72 | Typically, on Linux, macOS and similar Unix operating systems, each file has an **inode** stored on disk, which includes the following information: 73 | 74 | - Which device is it on? 75 | - What user owns it? What group owns it? 76 | - What are its permissions? (Can the user read it? write to it? execute it? How about the group's users? How about other users?) 77 | - What type of file is it? 
(regular, directory, link, socket, device, FIFO… Along with the permissions, they are stored as six octal digits in `st_mode`) 78 | - When was its metadata last changed (`ctime`)? When was the content last modified (`mtime`)? When was it last accessed (`atime`)? 79 | - It also contains pointers to fixed-sized **blocks** on disk holding the file's content. Reading the file gets the bytes from those blocks one after the other. 80 | 81 | Traditional file systems (eg. Linux' ext3, ufs) typically have inodes include a dozen pointers to blocks holding content (direct blocks). If the file is too big to fit in those blocks, it uses a couple of pointers from the inode to blocks holding pointers to blocks holding content (indirect blocks). If the file is still too big, it uses another pointer from the inode to blocks holding pointers to blocks holding pointers to blocks holding content (double indirect blocks). 82 | 83 | More modern designs (Linux' [ext4], macOS' HFS) rely on **extents** for content storage: contiguous blocks. Instead of the inode holding a pointer to the start of a block, it has a pointer to the start of the extent, and the number of blocks that the extent covers. It also has a field to determine whether the extent contains the file's content, or pointers to other extents. Having contiguous blocks reduces the amount of indexing data (eg, only one pointer and the number "4" to say we have a 4-block extent, instead of four pointers to four blocks), and it avoids making hard drives seek to new locations on-disk after each block is read (which can cost 10 ms every time). 84 | 85 | Because newly created extents may not fit between existing extents, the file system needs to deal with that *external fragmentation*. Among the tricks used to fight this, *delayed allocation* (aka. 
allocate-on-flush) is a technique that aggregates file writes in memory for a few seconds, and writes (*flushes*) them all at once to the disk in a way that avoids leaving small unused spaces between extents. 86 | 87 | An alternative technique to journaling, *copy-on-write* (COW), used by Linux' btrfs and macOS' APFS, never directly edits existing extents: instead, it writes to a brand-new extent, and then makes the inode point to the new extent. 88 | 89 | [ext4]: https://ext4.wiki.kernel.org/index.php/Ext4_Design 90 | 91 | #### File Operation 92 | 93 | When a process needs to access a file, it opens it with its path (eg. `/home/user/file`) and with flags determining how it can be manipulated (one of `O_RDONLY` (read-only access), `O_RDWR` (read and write access), `O_WRONLY` (write-only access), and optionally `O_CREAT` (create the file if it doesn't exist), `O_APPEND` (only write at the end of the file), etc.). 94 | 95 | The system then gives an integer to the process: the **file descriptor**. The list of file descriptors a process has can be obtained with `ls /proc/<pid>/fd`. The process uses the file descriptor to interact with the file. 96 | 97 | The file descriptor has a cursor determining where it starts reading from, which is initially at the start (or the end, for `O_APPEND`). When you read a number of bytes from the file, the cursor moves forward by that amount. You can change the position of the cursor with `lseek()`. When you write bytes to a file, they overwrite the bytes at the cursor position (they are not inserted). 98 | 99 | Once the file is dealt with, it must be closed, to allow the operating system to release resources. 100 | 101 | ### Disk arrays 102 | 103 | Storing data in a single location is dangerous, as the drive may fail. Regularly copying the data to another drive (making a **backup**) is common: when the drive fails, it can be replaced and the data copied from the backup. 
That can be done by physically mounting the backup drive and using `cp` or `rsync`, or by using `rsync` over the network to a remote backup server (rsync avoids resending data that is already backed up), or by using `btrfs send` over SSH (which also avoids resending data in more subtle ways). 104 | 105 | However, when the drive fails, the data stops being available in the meantime. More importantly, corrupted data goes undetected, as the drive won't complain, and the corruption will blindly be copied to the backup. 106 | 107 | Instead of an external or remote drive that is only connected for backups, your computer can have two permanently connected drives. You can use the second drive exactly as if it were a backup. It contains a perfect copy of the content. That way, if one drive fails, your data is still safe, and is still available while the failed drive is replaced. This is called **RAID1**. 108 | 109 | To automatically correct bitrot, you can use a setup of multiple drives where consecutive blocks are distributed across drives (**data striping**, which reduces the time needed to read extents), and you also distribute parity information across drives. That parity information lets you detect and correct corruption in the data. This is called **RAID5**, and it also has all the pros of RAID1, except that it requires at least three drives. 
116 | - Distribution: having all of the data be replicated in all locations burns egress, bandwidth, and disk usage, all of which are costly. Therefore, each piece of data is only located on a subset of the nodes. Determining which nodes are used is either manual or relies on a distributed hash table (**DHT**). 117 | - Synchronization: if multiple nodes allow updating the data, writes between those nodes can conflict, in which case a conflict resolution system or some kind of CAP compromise is required. The **CAP theorem** states that when a partition (P) in the network occurs (ie, some nodes can't communicate with others), the system must choose between maintaining consistency (C, the guarantee that reads return the most recent write) by making requests wait until the network recovers, or maintaining availability (A) by responding to requests immediately. 118 | - Parallelism: a piece of data can be **striped** over multiple nodes, so that reading it will be faster. 119 | - **Deduplication**: to use less disk space, the network can detect pieces of data that hold the same content, and only store it once (at the replication factor). 120 | - **Byzantine fault-tolerance**: while having multiple copies protects against a data center being destroyed or cut off, some Internet storage systems need to assume that some nodes may be malicious. The system then needs to protect the data and updates of non-malicious nodes as long as fewer than half of the nodes are malicious. 121 | 122 | Common solutions include [GlusterFS] (pieces of data are at the file level), [CephFS] (at the block level), Hadoop HDFS, Amazon S3. 123 | 124 | [GlusterFS]: https://www.gluster.org/ 125 | [CephFS]: http://ceph.com/ 126 | 127 | Within the same datacenter, you can expect 500 μs of latency; across the planet, about 100 ms. 
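The parity idea behind RAID5 (and erasure coding generally) can be sketched with XOR: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. A sketch on two-byte blocks (the names are ours):

```javascript
// XOR parity across blocks, as in RAID5: parity = block1 ^ block2 ^ …,
// so any one missing block equals the XOR of all the remaining ones.
const xorBlocks = (a, b) => a.map((byte, i) => byte ^ b[i]);

const drive1 = [0x12, 0x34];
const drive2 = [0xab, 0xcd];
const parity = xorBlocks(drive1, drive2); // stored on a third drive

// drive1 fails: rebuild its block from the survivors.
const rebuilt = xorBlocks(drive2, parity);
console.log(rebuilt.every((byte, i) => byte === drive1[i])); // true
```

Real RAID5 rotates which drive holds the parity for each stripe, so no single drive becomes a parity bottleneck.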
128 | -------------------------------------------------------------------------------- /misc/network.md: -------------------------------------------------------------------------------- 1 | # Networks 2 | 3 | Computers [communicate](./information.md) to reach a goal. For instance, you 4 | contact Youtube to see cat videos, Youtube responds to gain advertising revenue. 5 | 6 | A **network** can be represented with a graph where vertices are processing 7 | machines and edges are transmission links. Examples of networks include the 8 | Internet, telephones, and walkie-talkies. 9 | 10 | ## Protocols 11 | 12 | Certain documents (typically, standards and Requests For Comments (RFCs)) set 13 | the way in which information is transmitted through the network, first as bits, 14 | then as higher-level concepts. They can depend on the existence of lower-level 15 | protocols, forming a *protocol stack*. The **OSI model** theorizes the layers 16 | that a protocol stack is made of: 17 | 18 | - Physical: transmission of bits through a medium (eg: Ethernet PHY chip for 19 | 100BASE-TX), 20 | - Data link: transmission of frames mostly between adjacent nodes, to determine 21 | the start and end of messages (eg: MAC, PPP), 22 | - Network: transmission of packets for routing across the graph (eg: IP), 23 | - Transport: transmission of segments, so applications on both endpoints can 24 | exchange messages with given reliability guarantees (eg: TCP, UDP, ICMP 25 | (ping)), 26 | - Session: setup and recognition of endpoints across messages, 27 | - Presentation: encoding of data (charset, compression, encryption) (eg: TLS, 28 | HTTP with MIME to some extent), 29 | - Application: serialization of data structures (eg: HTTP (documents), NTP 30 | (time), SMTP (email), FTP (file)). 31 | 32 | Let's focus on a typical stack. 
33 | 34 | ### HTTP 35 | 36 | **HyperText Transfer Protocol** ([HTTP][]) is an application-layer and 37 | presentation-layer protocol designed for client-server document transmission. 38 | For instance, to request the main page of an HTTP server on your computer: 39 | 40 | GET / HTTP/1.1 41 | Host: localhost:1234 42 | Accept: text/html 43 | 44 | [HTTP]: https://tools.ietf.org/html/rfc2616 45 | 46 | (Each newline is made of two bytes: 0x0D and 0x0A, aka. CR-LF; it ends with two 47 | newlines). The server may respond: 48 | 49 | HTTP/1.1 200 OK 50 | Content-Type: text/html 51 | Date: Sat, 31 Dec 2016 15:31:45 GMT 52 | Connection: keep-alive 53 | Transfer-Encoding: chunked 54 | 55 | 7E 56 | <!doctype html> 57 | <html> 58 | <head></head> 59 | <body> 60 | This is HTML 61 | </body> 62 | </html> 63 | 64 | 65 | 0 66 | 67 | This response includes an [HTML][] file that the HTTP client (for instance, a 68 | browser, like Firefox or Google Chrome) will read as instructions on how to lay 69 | out a page, which determines the pixels to display, the animations to show, the 70 | interactions to execute when the user moves or clicks the mouse, the sounds to 71 | play, etc. 72 | 73 | [HTML]: https://html.spec.whatwg.org/multipage/ 74 | 75 | All requests have a first line with a method (`GET`), a path (`/`), and a 76 | protocol (`HTTP/1.1`), optionally followed by headers mapping header names 77 | (`Accept`) to their value (`text/html`). Requests may also carry data. 78 | 79 | Responses have a first line with a protocol (`HTTP/1.1`), a code (`200 OK`; 80 | codes starting with 1 are informational, 2 for success, 3 for redirection, 4 for 81 | client errors, 5 for server errors). Responses usually carry data (here, the 82 | HTML file), and also have headers explaining what the data is, how it is encoded 83 | (charset, compression), what time it is, whether to use caching, how to store 84 | session information (through cookies) and so on. 
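The request layout above (method line, headers, CR-LF line endings, blank-line terminator) can be sketched as a string builder (the function name is ours):

```javascript
// Build a raw HTTP/1.1 request as in the example above: a method line,
// one line per header, and a terminating blank line — all CR-LF separated.
function buildRequest(method, path, headers) {
  const lines = [`${method} ${path} HTTP/1.1`];
  for (const [name, value] of Object.entries(headers)) {
    lines.push(`${name}: ${value}`);
  }
  return lines.join("\r\n") + "\r\n\r\n";
}

const req = buildRequest("GET", "/", {
  Host: "localhost:1234",
  Accept: "text/html",
});
console.log(JSON.stringify(req));
// "GET / HTTP/1.1\r\nHost: localhost:1234\r\nAccept: text/html\r\n\r\n"
```

Writing that exact string to a TCP socket on port 1234 is all it takes to speak HTTP/1.1 to the server above.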
85 | 86 | As mentioned, HTTP includes presentation-layer "protocols" in headers, such as 87 | **Multipurpose Internet Mail Extensions** ([MIME][]) in `Content-Type`, to 88 | specify the type of the file `/` (eg. `text/plain`), or whether it 89 | recursively contains subfiles with [`multipart/form-data`][form-data], with each 90 | subfile specifying their own headers: 91 | 92 | POST /upload HTTP/1.1 93 | Host: localhost:1234 94 | Content-Length: 882 95 | Content-Type: multipart/form-data; boundary=random0ACxeUx4Nxqy3roVtMxrAw 96 | 97 | --random0ACxeUx4Nxqy3roVtMxrAw 98 | Content-Disposition: form-data; name="name-of-first-part" 99 | Content-Type: text/plain 100 | 101 | This first file contains normal plain text. 102 | --random0ACxeUx4Nxqy3roVtMxrAw 103 | Content-Disposition: form-data; name="multiple-images"; filename="image.svg" 104 | Content-Type: image/svg+xml; charset=UTF-8 105 | 106 | <svg xmlns="http://www.w3.org/2000/svg"> 107 | <text y="15"> 108 | This is an image 109 | </text></svg> 110 | --random0ACxeUx4Nxqy3roVtMxrAw 111 | Content-Disposition: form-data; name="multiple-images"; filename="image.png" 112 | Content-Type: image/png 113 | Content-Transfer-Encoding: base64 114 | 115 | iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAIAAACQd1PeAAAAAXNSR0IArs4c6QAAAARnQU1BAACx 116 | jwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAAMSURBVBhXY2BgYAAAAAQAAVzN/2kAAAAASUVO 117 | RK5CYII= 118 | --random0ACxeUx4Nxqy3roVtMxrAw-- 119 | 120 | (Note that our use of base64 in image.png is deprecated; in real life, it would 121 | be replaced by the binary data directly.) 122 | 123 | [MIME]: https://tools.ietf.org/html/rfc2045 124 | [form-data]: https://tools.ietf.org/html/rfc7578 125 | 126 | HTTPS is HTTP transmitted over a TLS connection: all the HTTP data, including 127 | headers, is encrypted to prevent intermediate nodes on the network from reading 128 | or modifying the content, which is necessary when transmitting identification or 129 | banking information, and to avoid being fooled into performing dangerous acts. 
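The multipart layout above — each part with its own headers, parts separated by `--boundary`, the body closed by `--boundary--` — can be sketched as a body builder (a sketch with text-only parts; names are ours):

```javascript
// Assemble a multipart/form-data body: "--boundary", part headers, a blank
// line, the part's data; finally "--boundary--" closes the whole body.
function buildMultipart(boundary, parts) {
  let body = "";
  for (const part of parts) {
    body += `--${boundary}\r\n`;
    body += `Content-Disposition: form-data; name="${part.name}"\r\n`;
    body += `Content-Type: ${part.type}\r\n\r\n`;
    body += part.data + "\r\n";
  }
  return body + `--${boundary}--\r\n`;
}

const body = buildMultipart("random0ACxeUx4Nxqy3roVtMxrAw", [
  { name: "name-of-first-part", type: "text/plain",
    data: "This first file contains normal plain text." },
]);
console.log(body.split("\r\n")[0]); // "--random0ACxeUx4Nxqy3roVtMxrAw"
```

The boundary must be a string that never appears in any part's data, which is why real clients generate it randomly.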
130 | 131 | ### TCP 132 | 133 | **Transmission Control Protocol** ([TCP][]) is a transport-layer protocol to 134 | ensure that all sent segments are received uncorrupted in the same order. That 135 | is achieved by reordering received segments and resending lost or corrupted ones. 136 | When using IP, it cuts its segments into pieces that fit in a packet. 137 | 138 | 1. The server starts to listen to a port. 139 | 2. The client starts to connect with a SYN. 140 | 3. The server informs the client that it received it with a SYN+ACK. 141 | 4. The client sends an ACK. 142 | 5. The server and the client can now send a series of packets to each other 143 | full-duplex, and they ACK each reception if all previously received packets 144 | have been received in order. 145 | 6. The client sends a FIN. 146 | 7. The server sends a FIN+ACK (or an ACK followed by a FIN). 147 | 8. The client sends an ACK. (The connection stays open until it times out.) 148 | 149 | A TCP header includes: 150 | 151 | - source port in 2 bytes, 152 | - destination port in 2 bytes, 153 | - sequence number in 4 bytes: 154 | - in a SYN, this is the client Initial Sequence Number (ISN), usually picked 155 | randomly, 156 | - otherwise it is (the sender's own ISN) + 1 + number of bytes previously sent, ensuring 157 | that packets can be reordered to obtain the original segment. 158 | - acknowledgement number in 4 bytes: 159 | - in a SYN-ACK, this is (client ISN) + 1, and the server sequence number is 160 | picked. 161 | - in an ACK, this is (server ISN) + number of bytes received + 1, which is the 162 | expected next sequence number to be received from the server.
163 | - data offset in 4 bits, the size of the TCP header in 32-bit words (defaults to 164 | 5), 165 | - 000 (reserved), 166 | - flags in 9 bits: NS, CWR, ECE, URG (read urgent pointer), ACK (acknowledge 167 | reception of data or SYN), PSH (push buffered data received to the 168 | application), RST (reset connection), SYN (synchronize sequence number, only 169 | used in the initial handshake), FIN (end of data, only used in the final 170 | handshake), 171 | - window size in 2 bytes, allowing flow and congestion control, 172 | - checksum in 2 bytes to check header and data corruption, 173 | - urgent pointer in 2 bytes pointing to a sequence number, 174 | - options (if the data offset is > 5, zero-padded) eg. maximum segment size, or 175 | window scale, 176 | - payload. 177 | 178 | [TCP]: https://tools.ietf.org/html/rfc793 179 | 180 | ### IP 181 | 182 | **Internet Protocol** ([IP][]) is a network-layer protocol that ensures that 183 | packets go to their destination despite having to transit through several 184 | machines on the way. 185 | It also cuts the packets into fragments that fit in the link-layer frame. 186 | There are two major versions of IP in use: IPv4 is the most used, and is slowly 187 | replaced by IPv6. 
188 | 189 | IPv4 headers have the following fields: 190 | 191 | - version in 4 bits, 192 | - Internet Header Length (IHL) in 4 bits, as a number of 32-bit words, 193 | - Quality of Service (QoS) in 1 byte: ranks packet priority; it is typically cut 194 | into 6 bits of Differentiated Services Code Point (DSCP) and 2 bits of 195 | Explicit Congestion Notification (ECN), 196 | - length of the packet in bytes, in 2 bytes, 197 | - identification tag in 2 bytes, to reconstruct the packet from multiple 198 | fragments, 199 | - 0 in 1 bit, 200 | - Don't Fragment (DF) in 1 bit, set if the packet must not be fragmented, 201 | - More Fragments (MF) in 1 bit, set if the rest of the packet is in subsequent 202 | fragments, 203 | - fragment offset in 13 bits, identifying the position of the fragment in the 204 | packet, 205 | - Time To Live (TTL) in 1 byte: the number of remaining nodes in the network 206 | graph that the packet is allowed to go through; each node decrements that 207 | number and drops the packet if it reaches 0, avoiding infinite loops, 208 | - protocol of the payload in 8 bits (TCP, UDP, ICMP, etc.), 209 | - header checksum in 16 bits to detect corruption, 210 | - source IP address in 32 bits, 211 | - destination IP address in 32 bits, 212 | - payload (eg, TCP content). 213 | 214 | Fragmenting the packet and reordering it at the other end was designed for cases 215 | where the packet must be transmitted over a link which cannot hold the whole 216 | packet (typically, the maximum Ethernet frame size, or when the destination 217 | doesn't have enough memory to hold the packet). 218 | 219 | However, TCP can cut its segments into arbitrarily sized packets to fit in an 220 | Ethernet frame, and fragmentation makes packet analysis harder as the tail 221 | fragments don't hold the TCP segment headers. Besides, Path MTU (Maximum 222 | Transmission Unit) Discovery (PMTUD) allows determining the size of the 223 | physical-layer frame in a path through the network.
As a result, IPv6 disallows 224 | fragmenting packets within the path, requiring the sender to either form its 225 | packets at the right size (for TCP), or to form its fragments at the right size 226 | (for UDP and ICMP, which cannot cut their data into multiple packets). 227 | 228 | Note that packets can be lost, duplicated, received out of order, or corrupted 229 | without the IP layer noticing. It is up to TCP to prevent that from happening. 230 | 231 | IP addresses segment the network into increasingly smaller subnetworks, with 232 | **routers** processing packets in and across networks. They can be obtained from 233 | the **Dynamic Host Configuration Protocol** (DHCP), auto-assigned, or manually 234 | set. 235 | 236 | IPv4 addresses fit in 4 bytes, commonly written in dot-separated decimal, eg. 237 | `172.16.254.1`. To denote a subnetwork (which has adjacent numbers), we use 238 | Classless Inter-Domain Routing (CIDR) notation: `<network address>/<bitmask>`. For instance, 240 | 192.168.2.0/24 includes addresses from 192.168.2.0 to 192.168.2.255, although 241 | you cannot use the address ending in .0 (network address), used to identify the 242 | network, nor that ending in .255 (broadcast address), used to broadcast to all 243 | devices on the network.
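The CIDR arithmetic above can be checked with Python's standard `ipaddress` module:

```python
import ipaddress

net = ipaddress.ip_network("192.168.2.0/24")
net.num_addresses            # 256: 192.168.2.0 through 192.168.2.255
net.network_address          # 192.168.2.0, identifies the network
net.broadcast_address        # 192.168.2.255, reaches every device on it
# Usable host addresses exclude the network and broadcast addresses:
hosts = list(net.hosts())    # 192.168.2.1 through 192.168.2.254
len(hosts)                   # 254

# Private ranges such as 10.0.0.0/8 are flagged by the module:
ipaddress.ip_address("10.1.2.3").is_private  # True
```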
244 | 245 | There are special ranges of addresses: 246 | 247 | - 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 for private networks (ie, 248 | not globally routable; they are typically behind a Network Address Translator 249 | (NAT)), 250 | - 0.0.0.0/8 for "no address", used as source address when getting an IP address, 251 | - 100.64.0.0/10 "shared address space", similar to private networks, but for 252 | Carrier-Grade NAT (CGN), 253 | - 127.0.0.0/8 for loopback (sending network data within a single node), most 254 | notably 127.0.0.1 (which the localhost hostname usually resolves to), 255 | - 169.254.0.0/16 for IP assignment between link-local, autoconf/zeroconf 256 | addresses, 257 | - 192.0.0.0/24 for IANA, 258 | - 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24 are reserved for testing and 259 | examples in documentation, 260 | - 192.88.99.0/24 for IPv6-to-IPv4 anycast routers for backwards compatibility, 261 | - 198.18.0.0/15 for network performance testing, 262 | - 224.0.0.0/4 for IP multicast, 263 | - 240.0.0.0/4 blocked for historical reasons, 264 | - 255.255.255.255 for broadcast. 265 | 266 | IPv6 addresses fit in 16 bytes, with pairs of bytes represented as 267 | colon-separated hexadecimal numbers, with adjacent zeros replaced by `::` once 268 | in the address. 269 | 270 | - unicast has a ≥ 48-bit routing prefix, a ≤ 16-bit subnet id defined by the 271 | network administrator, and a 64-bit interface identifier obtained either by 272 | DHCPv6, the MAC address, random, or manually. 273 | - :: for unspecified address. 274 | - ::1 for localhost, 275 | - fe80::/64 for link-local communication; cannot be routed; all other addresses 276 | in fe80::/10 are disabled, 277 | - fc00::/7 for Unique Local Addresses (ULAs), similar to private networks: 278 | - fc00::/8 for arbitrary allocation, 279 | - fd00::/8 for random allocation (with a 40-bit pseudorandom number). 
280 | - ff00::/8 for multicast, with 4 flag bits (reserved, rendezvous, prefix, 281 | transient) and 4 scope bits: 282 | - general multicast has a 112-bit group ID, including: 283 | - ff01::1 to all interface-local nodes, 284 | - ff02::1 to all link-local nodes, 285 | - ff01::2 to all interface-local routers, 286 | - ff02::2 to all link-local routers, 287 | - ff05::2 to all site-local routers, 288 | - ff0X::101 to all NTP servers, 289 | - ff05::1:3 to all DHCP servers. 290 | - ff02::1:ff00:0/104 solicited-node multicast has a link-local scope and the 291 | low 24 bits of the unicast address, 292 | - unicast-prefix-based multicast has a 64-bit network prefix (= routing prefix 293 | + subnet id) and a 32-bit group ID. 294 | - ::ffff:0:0/96 (IPv4-mapped IPv6 addresses), ::ffff:0:0:0/96 (IPv4-translated 295 | addresses in the Stateless IP/ICMP Translation (SIIT) protocol), 64:ff9b::/96 296 | (automatic IPv4/IPv6 translation), 2002::/16 (6to4) for transitioning from 297 | IPv4, 298 | - 2001::/29 through 2001:01f8::/29 for IANA special purposes (tunneling, 299 | benchmarking, ORCHIDv2), 300 | - 2001:db8::/32 for examples in documentation, 301 | - 0100::/64 to discard traffic. 302 | 303 | Some IP addresses can be mapped to a name (eg, `en.wikipedia.org` → 304 | 91.198.174.192) by using the **Domain Name System** (DNS), a naming system for 305 | Internet entities. Companies that can allocate a new domain name are called 306 | **registrars**. They publish their information as zone files, and allow 307 | authenticated editing of those files by the domain name owners as part of a 308 | business arrangement. 309 | 310 | ; Example zone file. 311 | $ORIGIN example.com. 312 | $TTL 1h 313 | ; Indicates that the owner is admin@example.com. 314 | example.com. IN SOA ns.example.com. admin.example.com. (2017011201 1d 2h 4w 1h) 315 | example.com. IN NS ns ; Indicates that ns.example.com is our nameserver. 316 | example.com. IN MX 10 mail.example.com. 317 | example.com.
IN A 91.198.174.192 ; IPv4 address 318 | AAAA 2001:470:1:18::118 ; IPv6 address 319 | ns IN A 91.198.174.1 ; ns.example.com 320 | www IN CNAME example.com. ; www.example.com = example.com 321 | 322 | 323 | [IP]: https://tools.ietf.org/html/rfc791 324 | 325 | *While HTTP requires TCP which requires IP, lower layer protocols are usually 326 | interchangeable.* 327 | 328 | ### Ethernet 329 | 330 | At the link layer, communication mostly happens directly between two adjacent 331 | nodes. 332 | 333 | Among link-layer protocols, **Ethernet** (aka. LAN, IEEE 802.3) is a link-layer 334 | protocol for transiting frames through a wire between two machines. A frame 335 | includes: 336 | 337 | - preamble: 7 bytes to ensure we know this is a frame, not a lower-level header, 338 | and to synchronize clocks (it contains alternating 0s and 1s), 339 | - Start of Frame Delimiter (SFD): 1 byte to break the pattern of the preamble 340 | and mark the start of the frame metadata, 341 | - destination **Media Access Control** (MAC) address of the target machine: each 342 | machine knows the MAC address of all machines it is directly connected to. 343 | Among its 6 bytes, it contains two special bits: 344 | - the Universal vs. Local (U/L) bit is 0 if the MAC is separated in 3 bytes 345 | identifying the network card's constructor (Organisationally Unique 346 | Identifier, OUI), and 3 bytes arbitrarily but uniquely assigned by the 347 | constructor for each card (Network Interface Controller, NIC). 348 | - the Unicast vs. Multicast bit is 0 if the frame must only be processed by a 349 | single linked machine. 
350 | - source MAC address, 351 | - EtherType indicates what protocol is used in the payload (eg, 0x86DD for 352 | IPv6); if the value is < 1536, it represents the payload size in bytes, 353 | - payload: up to 1500 bytes of data from the layer above, typically IP, 354 | - Frame Check Sequence (FCS, implemented using a **Cyclic-Redundancy Check** 355 | (CRC)): 4 bytes that verify that the frame is not corrupted; if it is, it is 356 | dropped and upper layers may have to re-send it. 357 | - Interpacket gap: not really part of the frame, those 12 bytes of idle line 358 | transmission are padding to avoid having frames right next to each other. 359 | 360 | Ethernet relies on **repeaters** to transmit data over long distances, as the 361 | physical layer usually relies on cables that have a maximum length. Multiple 362 | machines are connected to the same repeater, creating a star topology. 363 | **Bridges** are smarter machines that remember source MAC addresses, and use 364 | that to avoid sending frames to machines that are not the recipient according to 365 | the frame. **Switches** are smarter, programmable machines that detect and block 366 | corrupted packets. 367 | 368 | An alternative to Ethernet is **WiFi** (aka. WLAN, IEEE 802.11), a common 369 | wireless protocol. 370 | 371 | ### 100BASE-TX 372 | 373 | **100BASE-TX** (part of IEEE 802.3u, aka. Fast Ethernet) is a physical-layer 374 | protocol. It defines using RJ45, which uses an 8P8C (8 position 8 contact) 375 | connector with TIA/EIA-568B, ie. having eight copper wires with pin 1 through 8: 376 | white-orange, orange, white-green, blue, white-blue, green, white-brown, brown. 377 | x / white-x wires form pairs 1 through 4: blue, orange, green, brown, each 378 | twisted together at different rates in the cable. Orange pins 1 (TX+) and 2 379 | (TX-) transmit bits; green pins 3 (RX+) and 6 (RX-) receive bits, which makes 380 | this full-duplex.
381 | 382 | From left to right on the female Ethernet connector: 383 | 384 | pin 1 2 3 4 5 6 7 8 385 | ┌────────────┬──────┬───────────┬────┬──────────┬─────┬───────────┬─────┐ 386 | │white-orange│orange│white-green│blue│white-blue│green│white-brown│brown│ 387 | └────────────┴──────┴───────────┴────┴──────────┴─────┴───────────┴─────┘ 388 | TX+ TX- RX+ RX- 389 | 390 | Bits are first encoded with 4B5B: each 4 bits are encoded as 5 bits according to 391 | a predetermined mapping that prevents having too many consecutive zeros, which 392 | would make locating individual bits harder, as clocks are not perfectly 393 | synchronized. 4B5B also has five extra 5-bit codes: one to indicate that no data 394 | is sent (Idle = 11111, which in NRZI means systematically alternating the 395 | current), two to indicate that we will start sending data (Start of Stream Data 396 | = SSD), two to indicate that we stop sending data (End of Stream Data = ESD). 397 | 398 | Bits: 0100 0111 (ASCII G) 399 | 4B5B: 0101001111 400 | 401 | Bit transmission relies on Non-Return-to-Zero/Inverted (NRZI): a 1 is 402 | represented by a change from 0 volts to 1 volt or back for TX+, from 0 volts to 403 | -1 volts or back for TX-. A 0 is no change in voltage. The receiver subtracts 404 | TX- from TX+: `(TX+ + noise) - (TX- - noise) = 0V or 2V` which together with the 405 | previous voltage, determines bits. On top of that, Multilevel Threshold-3 406 | (MLT-3) is used: it halves the transfer frequency by alternating positive and 407 | negative voltages. 408 | 409 | 4B5B: 0 1 0 1 0 0 1 1 1 1 410 | MLT-3 TX+: 0 0 1 1 0 0 0 -1 0 1 0 (in volts) 411 | MLT-3 TX-: 0 0 -1 -1 0 0 0 1 0 -1 0 (in volts) 412 | 413 | 414 | The wires go up to 100 metres. They are twisted with Cat5 Unshielded Twisted 415 | Pair (UTP; no electromagnetic protection, but the twisting protects information 416 | from noise sources). 
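The 4B5B and MLT-3 steps of the worked example above can be sketched in Python. The 4B5B table here holds only the two nibbles from the ASCII "G" example, not the full mapping:

```python
# Partial 4B5B table: just the two nibbles from the "G" example above.
FOUR_B_FIVE_B = {"0100": "01010", "0111": "01111"}

def encode_4b5b(bits: str) -> str:
    """Map each 4-bit group to its 5-bit code."""
    return "".join(FOUR_B_FIVE_B[bits[i:i+4]] for i in range(0, len(bits), 4))

def mlt3(bits: str, levels=(0, 1, 0, -1)) -> list:
    """MLT-3: a 1 advances to the next level in the 0,+1,0,-1 cycle; a 0 holds."""
    out, state = [], 0
    for b in bits:
        if b == "1":
            state = (state + 1) % 4
        out.append(levels[state])
    return out

encode_4b5b("01000111")  # "0101001111"
mlt3("0101001111")       # [0, 1, 1, 0, 0, 0, -1, 0, 1, 0]
```

The `mlt3` output reproduces the TX+ volt levels of the table above (after the initial idle 0).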
In 100BASE-TX, 100 means data goes at 100 Mbit/s, T means 417 | twisted pair, X means that bits are encoded with 4B5B. 418 | 419 | The resulting overhead looks like this: 420 | 421 | ┌─────┬─────┬────────────┬───────────┬────────────┬─────────────┬──────┬─────┬─────┐ 422 | │ SSD │ SFD │ MAC header │ IP header │ TCP header │ HTTP header │ data │ FCS │ ESD │ 423 | └─────┴─────┴────────────┴───────────┴────────────┴─────────────┴──────┴─────┴─────┘ 424 | │ │ │ │ │ application │ │ │ 425 | │ │ │ │ └────────────────────┤ │ │ 426 | │ │ │ │ transport │ │ │ 427 | │ │ │ └─────────────────────────────────┤ │ │ 428 | │ │ │ network │ │ │ 429 | │ │ └─────────────────────────────────────────────┘ │ │ 430 | │ │ link │ │ 431 | │ └──────────────────────────────────────────────────────────────────────┘ │ 432 | │ physical │ 433 | └──────────────────────────────────────────────────────────────────────────────────┘ 434 | 435 | ## Layouts 436 | 437 | **Distributed** systems are products that rely on having multiple computing 438 | units communicating. 439 | 440 | **Client-server** (aka. Star) architectures have a special computing unit, the 441 | server, which receives requests from any number of computing units (clients), 442 | processes the request, and sends a response to each request. 443 | Examples include HTTP and display servers. 444 | 445 | **Three-tier** architectures separate nodes into three types: 446 | - Presentation: reads user input and displays the User Interface (UI). 447 | Typically, laptops or phones. 448 | - Application: executes user queries and moves data. 449 | - Data: manages data, typically through Create-Read-Update-Delete (CRUD) 450 | Application Programming Interfaces (APIs), typically with 451 | Atomicity-Consistency-Isolation-Durability (ACID) guarantees. 452 | 453 | **N-tier** is what happens when the three-tier application layer gets sublayers. 454 | 455 | **Decentralized** architectures can sustain the loss of any node. 
456 | Examples include Distributed Hash Tables (DHT). 457 | 458 | **Peer-to-peer** (P2P) architectures can sustain the loss of any number of 459 | nodes, as long as there is still at least one node. 460 | Examples include Bittorrent, Bitcoin, Infinit file system. 461 | -------------------------------------------------------------------------------- /misc/network/physical.md: -------------------------------------------------------------------------------- 1 | # Physical Networks 2 | 3 | The physical connection between your device and the Internet looks like this: 4 | 5 | ┌────────┐ WiFi ┌────────┐ FTTH ┌─────────────────┐ Fiber ┌──────────┐ 6 | │ Device ├──────┤ Router ├──────┤ ISP OLT/Routers ├───────┤ Repeater ├──┐ 7 | └────────┘ 4G └────────┘ └─────────────────┘ └──────────┘ │ 8 | │ 9 | ┌────────┐ CAT6a ┌───────────────────┐ Fiber ┌────────┐ Submarine Fiber │ 10 | │ Server ├───────┤ Datacenter Switch ├───────┤ Router ├─────────────────┘ 11 | └────────┘ └───────────────────┘ └────────┘ 12 | 13 | ## Media 14 | 15 | EM radiation. 16 | 17 | antenna dipole. 18 | 19 | AM/FM. 20 | 21 | ### Wireless 22 | 23 | #### WiFi 24 | 25 | An alternative to Ethernet is **WiFi** (aka. WLAN, IEEE 802.11), a common 26 | wireless protocol. 27 | 28 | #### 4G 29 | 30 | ### Wired 31 | 32 | Standard Ethernet cable names are of the form 33 | `<speed><signaling>-<hardware>[<encoding>]`, eg. 1000BASE-T. 34 | 35 | 1. **Speed** is in Megabits per second, or in Gbps if it ends in G 36 | (eg. 10GBASE-SR). 37 | 2. **Signaling** is how information is sent: 38 | - BASE is **baseband** (line coding): on a clock, 39 | we send bits by switching between two values (voltage, photon burst). 40 | - BROAD is **broadband**: multiple frequency bands are used. 41 | 3. **Hardware**: 42 | - 2: Coaxial cable that can reach ~200 meters. 43 | - 5: Coaxial cable that can reach ~500 meters. 44 | - T: Twisted Pairs. 45 | - F: Fiber; E: Extended fiber. 46 | 4.
**Encoding** 47 | 48 | #### Fiber 49 | 50 | 51 | 52 | Most widely-used for long-distance, because: 53 | 54 | - Light travels at the fastest known speed, 55 | - Signals propagate for very long distances with little degradation. 56 | 57 | #### Copper 58 | 59 | But moving electrons cause magnetic fields, 60 | which can induce currents in nearby conductive wires, 61 | and alter their signal: that is called “**crosstalk**”. 62 | 63 | ##### Twisted Pairs 64 | 65 | T568A vs. T568B / Straight-through vs. Crossover 66 | 67 | ##### Coaxial 68 | 69 | #### Submarine cables 70 | 71 | > **History:** The first transatlantic cable was 7 copper wires 72 | > coated with gutta-percha for electrical isolation, 73 | > wound in a helix with tarred hemp and an iron-strands sheath for strength. 74 | > It was laid in 1858, a long shipping expedition which involved 75 | > having to grapple the cable on the sea bed when it unexpectedly broke. 76 | > Unsurprisingly, it degraded after a month. 77 | 78 | ## Connectors 79 | 80 | ### RJ45 / 8P8C 81 | 82 | ### USB 83 | 84 | ## Links 85 | 86 | - [Fiber Optics in the LAN and Data Center][FOLDC] 87 | 88 | [FOLDC]: https://www.youtube.com/watch?v=fRKT6Z9rgUw 89 | -------------------------------------------------------------------------------- /misc/network/protocol.md: -------------------------------------------------------------------------------- 1 | # Network Protocols 2 | 3 | Computers [communicate](./information.md) to reach a goal. For instance, you 4 | contact Youtube to see cat videos, Youtube responds to gain advertising revenue. 5 | 6 | A **network** can be represented with a graph where vertices are processing 7 | devices and edges are transmission links. Examples of networks include the 8 | Internet, telephones, and walkie-talkies.
9 | 10 | ## Protocols 11 | 12 | Certain documents (typically, standards and Requests For Comments (RFCs)) 13 | set the way in which information is transmitted through the network, 14 | first as bits, then as higher-level concepts. 15 | They can depend on the existence of lower-level protocols, 16 | forming a *protocol stack*. 17 | The typical layers that a protocol stack is made of are: 18 | 19 | - Physical: transmission of bits through a medium (eg: Ethernet), 20 | - Data link: transmission of frames mostly between adjacent nodes, to determine 21 | the start and end of messages (eg: MAC, PPP), 22 | - Network: transmission of packets for routing across the graph (eg: IP), 23 | - Transport: transmission of segments, so applications on both endpoints can 24 | exchange messages with a chosen reliability guarantee: 25 | are segments each sent at least once? in the same order? uncorrupted? 26 | (eg: TCP, UDP, ICMP (ping)), 27 | - Application: serialization of data structures 28 | (eg: HTTP (documents), NTP (time), SMTP (email), FTP (file)). 29 | 30 | Let's focus on a typical stack. 31 | 32 | ### HTTP 33 | 34 | **HyperText Transfer Protocol** ([HTTP][]) is an application-layer and 35 | presentation-layer protocol designed for client-server document transmission. 36 | For instance, to request the main page of an HTTP server on your computer: 37 | 38 | GET / HTTP/1.1 39 | Host: localhost:1234 40 | Accept: text/html 41 | 42 | [HTTP]: https://tools.ietf.org/html/rfc2616 43 | 44 | (Each newline is made of two bytes: 0x0D and 0x0A, aka. CR-LF; it ends with two 45 | newlines). 
The server may respond: 46 | 47 | HTTP/1.1 200 OK 48 | Content-Type: text/html 49 | Date: Sat, 31 Dec 2016 15:31:45 GMT 50 | Connection: keep-alive 51 | Transfer-Encoding: chunked 52 | 53 | 7E 54 | 55 | 56 | 57 | 58 | This is HTML 59 | 60 | 61 | 62 | 63 | 0 64 | 65 | This response includes an [HTML][] file that the HTTP client (for instance, a 66 | browser, like Firefox or Google Chrome) will read as instructions on how to lay 67 | out a page, which determines the pixels to display, the animations to show, the 68 | interactions to execute when the user moves or clicks the mouse, the sounds to 69 | play, etc. 70 | 71 | [HTML]: https://html.spec.whatwg.org/multipage/ 72 | 73 | All requests have a first line with a method (`GET`), a path (`/`), and a 74 | protocol (`HTTP/1.1`), optionally followed by headers mapping header names 75 | (`Accept`) to their value (`text/html`). Requests may also carry data. 76 | 77 | Responses have a first line with a protocol (`HTTP/1.1`), a code (`200 OK`; 78 | codes starting with 1 are informational, 2 for success, 3 for redirection, 4 for 79 | client errors, 5 for server errors). Responses usually carry data (here, the 80 | HTML file), and also have headers explaining what the data is, how it is encoded 81 | (charset, compression), what time it is, whether to use caching, how to store 82 | session information (through cookies) and so on. 83 | 84 | As mentioned, HTTP includes presentation-layer "protocols" in headers, such as 85 | **Multipurpose Internet Mail Extensions** ([MIME][]) in `Content-Type`, to 86 | specify the file `<type>/<subtype>` (eg.
`text/plain`), or whether it 87 | recursively contains subfiles with [`multipart/form-data`][form-data], with each 88 | subfile specifying their own headers: 89 | 90 | POST /upload HTTP/1.1 91 | Host: localhost:1234 92 | Content-Length: 882 93 | Content-Type: multipart/form-data; boundary=random0ACxeUx4Nxqy3roVtMxrAw 94 | 95 | --random0ACxeUx4Nxqy3roVtMxrAw 96 | Content-Disposition: form-data; name="name-of-first-part" 97 | Content-Type: text/plain 98 | 99 | This first file contains normal plain text. 100 | --random0ACxeUx4Nxqy3roVtMxrAw 101 | Content-Disposition: form-data; name="multiple-images"; filename="image.svg" 102 | Content-Type: image/svg+xml; charset=UTF-8 103 | 104 | 105 | 106 | This is an image 107 | 108 | --random0ACxeUx4Nxqy3roVtMxrAw 109 | Content-Disposition: form-data; name="multiple-images"; filename="image.png" 110 | Content-Type: image/png 111 | Content-Transfer-Encoding: base64 112 | 113 | iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAIAAACQd1PeAAAAAXNSR0IArs4c6QAAAARnQU1BAACx 114 | jwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAAMSURBVBhXY2BgYAAAAAQAAVzN/2kAAAAASUVO 115 | RK5CYII= 116 | --random0ACxeUx4Nxqy3roVtMxrAw-- 117 | 118 | (Note that our use of base64 in image.png is deprecated; in real life, it would 119 | be replaced by the binary data directly.) 120 | 121 | [MIME]: https://tools.ietf.org/html/rfc2045 122 | [form-data]: https://tools.ietf.org/html/rfc7578 123 | 124 | HTTPS is HTTP transmitted over a TLS connection: all the HTTP data, including 125 | headers, is encrypted to prevent intermediate nodes on the network from reading 126 | or modifying the content, which is necessary when transmitting identification or 127 | banking information, and to avoid being fooled into performing dangerous acts. 128 | 129 | ### TCP 130 | 131 | **Transmission Control Protocol** ([TCP][]) is a transport-layer protocol to 132 | ensure that all sent segments are received uncorrupted in the same order. 
132 | 133 | That is achieved by reordering received segments and resending lost or corrupted ones. 134 | When using IP, it cuts its segments into pieces that fit in a packet. 135 | 136 | 1. The server starts to listen to a port. 137 | 2. The client starts to connect with a SYN. 138 | 3. The server informs the client that it received it with a SYN+ACK. 139 | 4. The client sends an ACK. 140 | 5. The server and the client can now send a series of packets to each other 141 | full-duplex, and they ACK each reception if all previously received packets 142 | have been received in order. 143 | 6. The client sends a FIN. 144 | 7. The server sends a FIN+ACK (or an ACK followed by a FIN). 145 | 8. The client sends an ACK. (The connection stays open until it times out.) 146 | 147 | *(When a previous connection was opened, this handshake can be sped up 148 | through [TCP Fast Open][TFO] or socket reuse.)* 149 | 150 | A TCP header includes: 151 | 152 | - source port in 2 bytes, 153 | - destination port in 2 bytes, 154 | - sequence number in 4 bytes: 155 | - in a SYN, this is the client Initial Sequence Number (ISN), usually picked 156 | randomly, 157 | - otherwise it is (the sender's own ISN) + 1 + number of bytes previously sent, ensuring 158 | that packets can be reordered to obtain the original segment. 159 | - acknowledgement number in 4 bytes: 160 | - in a SYN-ACK, this is (client ISN) + 1, and the server sequence number is 161 | picked. 162 | - in an ACK, this is (server ISN) + number of bytes received + 1, which is the 163 | expected next sequence number to be received from the server.
164 | - data offset in 4 bits, the size of the TCP header in 32-bit words (defaults to 165 | 5), 166 | - 000 (reserved), 167 | - flags in 9 bits: NS, CWR, ECE, URG (read urgent pointer), ACK (acknowledge 168 | reception of data or SYN), PSH (push buffered data received to the 169 | application), RST (reset connection), SYN (synchronize sequence number, only 170 | used in the initial handshake), FIN (end of data, only used in the final 171 | handshake), 172 | - window size in 2 bytes, allowing flow and congestion control, 173 | - checksum in 2 bytes to check header and data corruption, 174 | - urgent pointer in 2 bytes pointing to a sequence number, 175 | - options (if the data offset is > 5, zero-padded) eg. maximum segment size, or 176 | window scale, 177 | - payload. 178 | 179 | TODO NAT 180 | 181 | [TCP]: https://tools.ietf.org/html/rfc793 182 | [TFO]: https://tools.ietf.org/html/rfc7413 183 | 184 | ### IP 185 | 186 | **Internet Protocol** ([IP][]) is a network-layer protocol that ensures that 187 | packets go to their destination despite having to transit through several 188 | devices on the way. 189 | It also cuts the packets into fragments that fit in the link-layer frame. 190 | There are two major versions of IP in use: IPv4 is the most used, and is slowly 191 | replaced by IPv6. 
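Before moving on to IP: the TCP header layout listed in the previous section can be packed with Python's `struct`. The ports, sequence number and window are made-up example values, and the checksum is left at zero (real TCP computes it over a pseudo-header):

```python
import struct

def tcp_header(src_port, dst_port, seq, ack, flags, window):
    """Pack a minimal TCP header (no options, so the data offset is 5 words)."""
    # Top 4 bits: data offset; the remaining 12 bits hold the reserved
    # bits and the flags (NS, CWR, ECE, URG, ACK, PSH, RST, SYN, FIN).
    offset_flags = (5 << 12) | flags
    checksum = 0   # left at zero here; real TCP sums a pseudo-header too
    urgent = 0
    return struct.pack("!HHIIHHHH", src_port, dst_port, seq, ack,
                       offset_flags, window, checksum, urgent)

SYN = 0x02
header = tcp_header(54321, 80, seq=1000, ack=0, flags=SYN, window=65535)
len(header)  # 20 bytes: five 32-bit words, matching a data offset of 5
```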
192 | 193 | #### IPv4 194 | 195 | IPv4 headers have the following fields: 196 | 197 | - *Version* (4 bits): 0100, 198 | - *Internet Header Length* (IHL) (4 bits), as a number of 32-bit words, 199 | - *Quality of Service* (QoS) (8 bits): ranks packet priority; 200 | - *Differentiated Services* Code Point (DSCP) (6 bits), 201 | - *Explicit Congestion Notification* (ECN) (2 bits): 202 | 10 or 01 means “ECN-capable transport” (ECT(0) / ECT(1)), 203 | 11 means “Congestion Encountered” (CE); 204 | - *Length* of the packet in bytes (2 bytes), 205 | - *identification tag* (2 bytes), 206 | to reconstruct the packet from multiple fragments, 207 | - 0 (1 bit), 208 | - *Don't Fragment* (DF) (1 bit), set if the packet must not be fragmented, 209 | - *More Fragments* (MF) (1 bit), set if the rest of the packet is in subsequent 210 | fragments, 211 | - *Fragment offset* (13 bits), identifying the position of the fragment in the 212 | packet, 213 | - Time To Live (TTL) (1 byte): the number of remaining nodes in the network 214 | graph that the packet is allowed to go through; each node decrements that 215 | number and drops the packet if it reaches 0, avoiding infinite loops, 216 | - *Protocol of the payload* (8 bits): 6 for TCP, 17 for UDP, 1 for ICMP, etc., 217 | - *Header checksum* (16 bits) to detect corruption, 218 | - *Source IP address* (32 bits), 219 | - *Destination IP address* (32 bits), 220 | - *Payload* (eg, TCP content). 221 | 222 | The packet is broken into fragments when one device on the path 223 | cannot transmit it whole across a link 224 | (eg. because the Ethernet frame size is 1500 bytes). 225 | That can happen at the source, but also along the path. 226 | Fragments can also go through different paths once split. 227 | It is up to the receiving host to reassemble the fragments. 228 | 229 | Note that packets can be lost, duplicated, received out of order, or corrupted 230 | without the IP layer noticing. It is up to TCP to prevent that from happening.
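The header checksum above is the ones' complement of the ones'-complement sum of the header's 16-bit words, computed with the checksum field zeroed. A sketch, using a classic 20-byte sample header:

```python
def ipv4_checksum(header: bytes) -> int:
    """Ones'-complement sum of 16-bit words (checksum field zeroed beforehand)."""
    total = 0
    for i in range(0, len(header), 2):
        total += int.from_bytes(header[i:i+2], "big")
    while total > 0xFFFF:                  # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# A 20-byte sample header, with its checksum field (bytes 10-11) zeroed:
header = bytes.fromhex("4500003c1c4640004006" "0000" "ac100a63ac100a0c")
ipv4_checksum(header)  # 0xB1E6
```

Running the function over a header that has its checksum filled in returns 0, which is how receivers validate it.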
231 | 232 | IP addresses segment the network into increasingly smaller subnetworks, with 233 | **routers** processing packets in and across networks. They can be obtained from 234 | the **Dynamic Host Configuration Protocol** (DHCP), auto-assigned, or manually 235 | set. 236 | 237 | IPv4 addresses fit in 4 bytes, commonly written in dot-separated decimal, eg. 238 | `172.16.254.1`. To denote a subnetwork (which has adjacent numbers), we use 239 | Classless Inter-Domain Routing (CIDR) notation: `<network address>/<bitmask>` 240 | (the bitmask is a number of leading bits that stay the same for all addresses). 241 | For instance, 242 | 192.168.2.0/24 includes addresses from 192.168.2.0 to 192.168.2.255, 243 | although you cannot use the address ending in .0 (network address), 244 | used to identify the network, nor that ending in .255 (broadcast address), 245 | used to broadcast to all devices on the network. 246 | 247 | There are special ranges of addresses: 248 | 249 | - 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 for private networks 250 | (ie, not globally routable; 251 | they are typically behind a Network Address Translator (NAT)), 252 | - 0.0.0.0/8 for "no address", used as source address when getting an IP address, 253 | - 100.64.0.0/10 "shared address space", similar to private networks, but for 254 | Carrier-Grade NAT (CGN), 255 | - 127.0.0.0/8 for loopback (sending network data within a single node), most 256 | notably 127.0.0.1 (which the localhost hostname usually resolves to), 257 | - 169.254.0.0/16 for link-local (autoconf/zeroconf) addresses, 258 | self-assigned when no other address is available, 259 | - 192.0.0.0/24 for IANA, 260 | - 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24 are reserved for testing and 261 | examples in documentation, 262 | - 192.88.99.0/24 for IPv6-to-IPv4 anycast routers for backwards compatibility, 263 | - 198.18.0.0/15 for network performance testing, 264 | - 224.0.0.0/4 for IP multicast, 265 | - 240.0.0.0/4 blocked for historical reasons, 266 | - 255.255.255.255 for broadcast.
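Python's `ipaddress` module knows most of these ranges, which makes the list above easy to check:

```python
import ipaddress

addr = ipaddress.ip_address

addr("192.168.1.7").is_private      # True: 192.168.0.0/16
addr("127.0.0.1").is_loopback       # True: 127.0.0.0/8
addr("169.254.3.4").is_link_local   # True: 169.254.0.0/16
addr("198.51.100.9").is_global      # False: documentation range
addr("224.0.0.1").is_multicast      # True: 224.0.0.0/4
addr("8.8.8.8").is_global           # True: publicly routable
```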
267 | 
268 | #### IPv6
269 | 
270 | IPv6 packets:
271 | 
272 | - *Version* (4 bits): 0110.
273 | - *Traffic Class* (8 bits): DSCP and ECN just like IPv4.
274 | - *Flow Label* (20 bits): ID for a flow (a group of packets).
275 | - *Payload Length* (16 bits): number of bytes including extension headers.
276 | - *Next Header* (8 bits): the payload protocol or first extension header type.
277 | - *Hop Limit* (8 bits): similar to IPv4 TTL.
278 | - *Source Address* (128 bits).
279 | - *Destination Address* (128 bits).
280 | 
281 | Fragmentation cannot be done by intermediate routers, unlike IPv4.
282 | Only the source can fragment (using an extension header).
283 | The source machine is meant to do Path MTU (Maximum Transmission Unit)
284 | Discovery (PMTUD) to pick the packet or fragment size that fits all links
285 | through the network.
286 | It will change the packet size for TCP, and it will fragment for transports
287 | such as UDP or ICMP which cannot cut their data into multiple packets.
288 | 
289 | The reason for this change: lower performance impact on routers,
290 | and ensuring that security devices can read the TCP header they need
291 | when analyzing the packet.
292 | 
293 | IPv6 addresses fit in 16 bytes, with pairs of bytes represented as
294 | colon-separated hexadecimal numbers, and the longest run of zero groups
295 | replaced by `::` (at most once in the address).
296 | 
297 | - unicast has a ≥ 48-bit routing prefix, a ≤ 16-bit subnet id defined by the
298 |   network administrator, and a 64-bit interface identifier obtained either by
299 |   DHCPv6, the MAC address, random, or manually.
300 | - :: for the unspecified address, equivalent to IPv4's 0.0.0.0,
301 | - ::1 for localhost,
302 | - fe80::/64 for link-local communication; cannot be routed;
303 |   all other addresses in fe80::/10 are disabled,
304 | - fc00::/7 for Unique Local Addresses (ULAs), similar to private networks:
305 |   - fc00::/8 for arbitrary allocation,
306 |   - fd00::/8 for random allocation (with a 40-bit pseudorandom number).
307 | - ff00::/8 for multicast, with 4 flag bits (reserved, rendezvous, prefix,
308 |   transient) and 4 scope bits:
309 |   - general multicast has a 112-bit group ID, including:
310 |     - ff01::1 to all interface-local nodes,
311 |     - ff02::1 to all link-local nodes,
312 |     - ff01::2 to all interface-local routers,
313 |     - ff02::2 to all link-local routers,
314 |     - ff05::2 to all site-local routers,
315 |     - ff0X::101 to all NTP servers,
316 |     - ff05::1:3 to all DHCP servers.
317 |   - ff02::1:ff00:0/104 solicited-node multicast has a link-local scope and a
318 |     24-bit unicast address suffix,
319 |   - unicast-prefix-based multicast has a 64-bit network prefix (= routing prefix
320 |     + subnet id) and a 32-bit group ID.
321 | - 2001::/29 through 2001:01f8::/29 for IANA special purposes (tunneling,
322 |   benchmarking, ORCHIDv2),
323 | - 2001:db8::/32 for examples in documentation,
324 | - 0100::/64 to discard traffic.
325 | 
326 | #### DNS
327 | 
328 | Some IP addresses can have a name mapped to them (eg, `en.wikipedia.org` →
329 | 91.198.174.192) by using the **Domain Name System** (DNS), a naming system for
330 | Internet entities. Companies that can allocate a new domain name are called
331 | **registrars**. They publish their information as zone files, and allow
332 | authenticated editing of those files by the domain name owners as part of a
333 | business arrangement.
334 | 
335 | ; Example zone file.
336 | $ORIGIN example.com.
337 | $TTL 1h
338 | ; Indicates that the owner is admin@example.com.
339 | example.com. IN SOA ns.example.com. admin.example.com.
(2017011201 1d 2h 4w 1h)
340 | example.com. IN NS ns ; Indicates that ns.example.com is our nameserver.
341 | example.com. IN MX 10 mail.example.com.
342 | example.com. IN A 91.198.174.192 ; IPv4 address
343 | AAAA 2001:470:1:18::118 ; IPv6 address
344 | ns IN A 91.198.174.1 ; ns.example.com
345 | www IN CNAME example.com. ; www.example.com = example.com
346 | 
347 | [IP]: https://tools.ietf.org/html/rfc791
348 | 
349 | *While HTTP requires TCP which requires IP, lower layer protocols are usually
350 | interchangeable.*
351 | 
352 | #### BGP
353 | 
354 | TODO Routers
355 | 
356 | TODO BGP
357 | 
358 | TODO DHCP
359 | 
360 | ### Ethernet
361 | 
362 | At the link layer, communication mostly happens directly between two adjacent
363 | nodes.
364 | 
365 | Among link-layer protocols, **Ethernet** (aka. IEEE 802.3) transmits frames
366 | through a wire between two nodes. A frame
367 | includes:
368 | 
369 | - preamble: 7 bytes to ensure we know this is a frame, not a lower-level header,
370 |   and to synchronize clocks (it contains alternating 0s and 1s),
371 | - Start of Frame Delimiter (SFD): 1 byte to break the pattern of the preamble
372 |   and mark the start of the frame metadata,
373 | - destination **Media Access Control** (MAC) address of the target device:
374 |   each device is typically assigned a MAC when manufactured, and
375 |   each device knows the MAC address of all devices it is directly connected to.
376 |   Among its 6 bytes, it contains two special bits:
377 |   - the Universal vs. Local (U/L) bit is 0 if the MAC is separated in 3 bytes
378 |     identifying the network card's manufacturer (Organizationally Unique
379 |     Identifier, OUI), and 3 bytes arbitrarily but uniquely assigned by the
380 |     manufacturer for each card (Network Interface Controller, NIC).
381 |   - the Unicast vs. Multicast bit is 0 if the frame must only be processed by a
382 |     single linked device.
383 | - source MAC address,
384 | - EtherType indicates what protocol is used in the payload (eg, 0x86DD for
385 |   IPv6); if the value is < 1536, it represents the payload size in bytes,
386 | - payload: up to 1500 bytes of data from the layer above, typically IP
387 |   (some devices support larger sizes, eg. Jumbo frames: ~9000 bytes),
388 | - Frame Check Sequence (FCS, implemented using a **Cyclic-Redundancy Check**
389 |   (CRC)): 4 bytes that verify that the frame is not corrupted; if it is, it is
390 |   dropped and upper layers may have to re-send it.
391 | - Interpacket gap: not really part of the frame, those 12 bytes of idle line
392 |   transmission are padding to avoid having frames right next to each other.
393 | 
394 | To connect Ethernet devices together, they are typically cabled to a switch.
395 | **Switches** are devices that multiple network nodes are connected to,
396 | and that forward Ethernet frames
397 | to the destination MAC address listed in the frame.
398 | They remember source MAC addresses in content-addressable memory (CAM)
399 | as a MAC table:
400 | this is how they can dynamically learn the MACs of their connected devices,
401 | when MAC addresses are not statically hardcoded in the table.
402 | When the destination MAC is unknown, they flood all devices:
403 | devices ignore frames for which they are not the recipient,
404 | and the device whose MAC is the destination MAC answers,
405 | filling the switch’s table.
406 | 
407 | They can also connect nodes wired with different cable technologies
408 | (eg. fiber and twisted pairs).
409 | 
410 | **Bridges** connect LANs together. They behave like a switch,
411 | but the MAC table is a Forwarding Information Base (FIB):
412 | each interface, being a LAN, corresponds to multiple MACs.
413 | When receiving a frame, the bridge decides whether to forward it
414 | and to which interface based on the known MACs in the FIB.
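The learning behaviour described above can be sketched in a few lines (the API is ours, not a real switch's):

```javascript
// Sketch of a learning switch: maps MAC addresses to ports,
// floods when the destination is unknown.
function Switch(portCount) {
  this.portCount = portCount;
  this.macTable = {};   // MAC address → port number
}

// Returns the list of ports the frame goes out of.
Switch.prototype.receive = function(inPort, frame) {
  // Learn which port the source lives behind.
  this.macTable[frame.source] = inPort;
  var outPort = this.macTable[frame.destination];
  if (outPort !== undefined) { return [outPort]; }
  // Unknown destination: flood every port except the one it came from.
  var ports = [];
  for (var p = 0; p < this.portCount; p++) {
    if (p !== inPort) { ports.push(p); }
  }
  return ports;
};

var sw = new Switch(4);
console.log(sw.receive(0, { source: "aa", destination: "bb" }));  // flood: [1, 2, 3]
console.log(sw.receive(1, { source: "bb", destination: "aa" }));  // learned: [0]
```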
415 | 
416 | The use-cases of switches and bridges overlap with those of routers,
417 | which is the consequence of a historical lack of synchronization of efforts
418 | between IEEE (Ethernet) and IETF (Internet).
419 | 
420 | TODO ARP
421 | 
422 | ### 100BASE-TX
423 | 
424 | **100BASE-TX** (part of IEEE 802.3u, aka. Fast Ethernet) is a physical-layer
425 | protocol. It defines using RJ45, which uses an 8P8C (8 position 8 contact)
426 | connector with TIA/EIA-568B, ie. having eight copper wires with pin 1 through 8:
427 | white-orange, orange, white-green, blue, white-blue, green, white-brown, brown.
428 | x / white-x wires form pairs 1 through 4: blue, orange, green, brown, each
429 | twisted together at different rates in the cable to reduce interference.
430 | Orange pins 1 (TX+) and 2 (TX-) transmit bits;
431 | green pins 3 (RX+) and 6 (RX-) receive bits, which makes this full-duplex.
432 | 
433 | From left to right on the female Ethernet connector:
434 | 
435 | pin 1            2      3           4    5          6     7           8
436 | ┌────────────┬──────┬───────────┬────┬──────────┬─────┬───────────┬─────┐
437 | │white-orange│orange│white-green│blue│white-blue│green│white-brown│brown│
438 | └────────────┴──────┴───────────┴────┴──────────┴─────┴───────────┴─────┘
439 |      TX+       TX-      RX+                       RX-
440 | 
441 | Bits are first encoded with **4B5B**:
442 | each group of 4 bits is encoded as 5 bits according to a predetermined mapping
443 | that prevents having too many consecutive zeros,
444 | which would make locating individual bits harder,
445 | as clocks are not perfectly synchronized.
446 | 4B5B also has five extra 5-bit control codes:
447 | one to indicate that no data is sent (Idle = 11111,
448 | which in NRZI means systematically alternating the current),
449 | two to indicate that we will start sending data (Start of Stream Delimiter = SSD),
450 | two to indicate that we stop sending data (End of Stream Delimiter = ESD).
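For illustration, a partial encoder over the Fast Ethernet 4B5B nibble table (4 data bits → 5 code bits; only the nibbles needed below are listed, and the function name is ours):

```javascript
// Partial 4B5B table: data nibble (4 bits) → 5-bit code.
// The full table has 16 data entries plus control codes.
var FOUR_B_FIVE_B = {
  "0100": "01010",  // 0x4
  "0101": "01011",  // 0x5
  "0111": "01111",  // 0x7
};

// Encode one byte: high nibble first, then low nibble.
function encode4b5b(byte) {
  var bits = byte.toString(2);
  while (bits.length < 8) { bits = "0" + bits; }
  return FOUR_B_FIVE_B[bits.slice(0, 4)] + FOUR_B_FIVE_B[bits.slice(4, 8)];
}

console.log(encode4b5b(0x47));  // "G" → "0101001111"
```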
451 | 
452 | Bits: 01000111 01000101 01010100 (ASCII GET)
453 | 4B5B: 0101001111 (first byte)
454 | 
455 | Bit transmission relies on Non-Return-to-Zero/Inverted (**NRZI**):
456 | a 1 is represented by a change from 0 volts to 1 volt or back for TX+,
457 | from 0 volts to -1 volts or back for TX-. A 0 is no change in voltage.
458 | The receiver subtracts TX- from TX+: `(TX+ + noise) - (TX- + noise) = 0V or 2V`
459 | which, together with the previous voltage, determines bits.
460 | On top of that, Multilevel Threshold-3 (**MLT-3**) is used:
461 | it halves the transfer frequency by alternating positive and negative voltages.
462 | 
463 | 4B5B:      0 1 0 1 0 0 1  1 1 1
464 | MLT-3 TX+: 0 0 1 1 0 0 0 -1 0 1 0 (in volts)
465 | MLT-3 TX-: 0 0 -1 -1 0 0 0 1 0 -1 0 (in volts)
466 | 
467 | 
468 | The wires go up to 100 metres. They are twisted with **Cat5e**
469 | Unshielded Twisted Pair (**UTP** = there is no metallic foil around the pair.
470 | The twisting protects information from noise sources a bit,
471 | but higher signal frequencies (Cat7, 8, …)
472 | require **STP**: Shielded Twisted Pairs).
473 | In 100BASE-TX, 100 means data goes at 100 Mbit/s, T means
474 | twisted pair, X means that bits are encoded with 4B5B.
475 | 
476 | Since wires have a limited length,
477 | **repeaters** are put in place to transmit data over longer distances.
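The MLT-3 rows above can be reproduced mechanically: keep a position in the cycle 0, +1, 0, -1 and advance it only on a 1 bit (a sketch; names are ours):

```javascript
// MLT-3: advance through the voltage cycle 0, +1, 0, -1 on each 1 bit;
// stay put on each 0 bit. Returns the initial state plus one state per bit.
function mlt3(bits) {
  var cycle = [0, 1, 0, -1];
  var index = 0;
  var states = [cycle[index]];
  for (var i = 0; i < bits.length; i++) {
    if (bits[i] === 1) { index = (index + 1) % 4; }
    states.push(cycle[index]);
  }
  return states;
}

// The coded bits from the example above, on the TX+ wire.
console.log(mlt3([0, 1, 0, 1, 0, 0, 1, 1, 1, 1]));
// → [0, 0, 1, 1, 0, 0, 0, -1, 0, 1, 0]
```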
478 | 479 | The resulting overhead looks like this: 480 | 481 | ┌─────┬─────┬────────────┬───────────┬────────────┬─────────────┬──────┬─────┬─────┐ 482 | │ SSD │ SFD │ MAC header │ IP header │ TCP header │ HTTP header │ data │ FCS │ ESD │ 483 | └─────┴─────┴────────────┴───────────┴────────────┴─────────────┴──────┴─────┴─────┘ 484 | │ │ │ │ │ application │ │ │ 485 | │ │ │ │ └────────────────────┤ │ │ 486 | │ │ │ │ transport │ │ │ 487 | │ │ │ └─────────────────────────────────┤ │ │ 488 | │ │ │ network │ │ │ 489 | │ │ └─────────────────────────────────────────────┘ │ │ 490 | │ │ link │ │ 491 | │ └──────────────────────────────────────────────────────────────────────┘ │ 492 | │ physical │ 493 | └──────────────────────────────────────────────────────────────────────────────────┘ 494 | 495 | ## Links 496 | 497 | - [The world in which IPv6 was a good design][apenwarr17] 498 | 499 | [apenwarr17]: https://apenwarr.ca/log/20170810 500 | -------------------------------------------------------------------------------- /misc/reliability.md: -------------------------------------------------------------------------------- 1 | # Reliability 2 | 3 | TODO 4 | 5 | ## Going further 6 | 7 | - [Google SRE book](https://landing.google.com/sre/book/). 
8 | 
--------------------------------------------------------------------------------
/misc/statistics.md:
--------------------------------------------------------------------------------
1 | # Statistics
2 | 
3 | ## Probabilities
4 | 
5 | Kolmogorov axioms:
6 | 
7 | - Ω: set of elementary events (exclusive outcomes)
8 | - F: set of events; σ-algebra of Ω (subsets closed under complement and countable union)
9 | - prob: function from F to [0,1]
10 | - prob(Ω) = 1
11 | - prob(∪Ai) = Σ prob(Ai) if Ai disjoint (= exclusive) *(partition of a disk)*
12 | 
13 | - prob(∅) = 0
14 | - prob(Ω-A) = 1 - prob(A)
15 | - prob(A) = |A| ÷ |Ω| if Ω countable and ∀e∈Ω, prob({e}) = 1÷|Ω|
16 | - prob(A∪B) = prob(A) + prob(B) - prob(A∩B) *(think overlapping disks)*
17 | - prob(∪Ai) ≤ Σ prob(Ai)
18 | - prob(∪Ai) = Σ{r=1…n} (-1)^(r+1) Σ{i1<…<ir} prob(Ai1∩…∩Air) *(inclusion-exclusion)*
19 | - Ai disjoint ⇒ prob(∩Ai) = 0 *(non-overlapping disks)*
20 | - A1⊂A2⊂… ⇒ prob(An) → prob(∪Ai), prob(Ai) ≤ prob(Ai+1)
21 | - A1⊃A2⊃… ⇒ prob(An) → prob(∩Ai), prob(Ai) ≥ prob(Ai+1)
22 | - prob(A|B) ≝ prob(A∩B) ÷ prob(B) *(think of |B as assuming Ω = B)*
23 | - A, B independent ⇔ prob(A|B) = prob(A)
24 | - prob(∩Ai) = Π prob(Ai) if Ai independent
25 | - prob(∩Ai) = Π{i=2…} prob(Ai|A1∩…∩A{i-1}) prob(A1)
26 | - A ⊂ ∪Bi, Bi disjoint ⇒ prob(A) = Σ prob(A∩Bi)
27 | - prob(A|B) = prob(B|A) prob(A) ÷ prob(B) *(Bayes' theorem)*
28 | - Ai partition of Ω ⇒ prob(Ai|B) = prob(B|Ai) prob(Ai) ÷ (Σj prob(B|Aj) prob(Aj))
29 | 
30 | ## Distributions
31 | 
32 | A **random variable** (RV) is a function from Ω to ℝ. *(eg. winnings)*
33 | 
34 | The **indicator function** is a RV such that I(A) = 1 if A, otherwise 0.
35 | 
36 | A **mass function** for a RV X is fX: x → prob(X=x).
37 | 
38 | - fX(x) ≥ 0
39 | - Σx fX(x) = 1 (or ∫ℝ fX(x) dx = 1 for a density)
40 | 
41 | The **expected value** E[X] = Σ x fX(x).
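A quick numeric check of that definition (our own sketch, using a fair die):

```javascript
// Expected value and variance of a discrete mass function,
// given as an array of [value, probability] pairs.
function expectedValue(mass) {
  return mass.reduce(function(sum, p) { return sum + p[0] * p[1]; }, 0);
}

function variance(mass) {
  var mean = expectedValue(mass);
  return mass.reduce(function(sum, p) {
    return sum + (p[0] - mean) * (p[0] - mean) * p[1];
  }, 0);
}

// A fair six-sided die.
var die = [1, 2, 3, 4, 5, 6].map(function(x) { return [x, 1 / 6]; });
console.log(expectedValue(die));  // ≈ 3.5
console.log(variance(die));       // ≈ 35/12 ≈ 2.9167
```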
42 | 
43 | - E[g(X)] = Σ{x∈X} g(x) fX(x)
44 | - E[aX+b] = a E[X] + b
45 | - (E[X])^2 ≤ E[X^2] *(Cauchy-Schwarz inequality)*
46 | - prob(X=a) = 1 ⇒ E[X] = a
47 | - prob(a < X ≤ b) = 1 ⇒ a < E[X] ≤ b
48 | - if X∈ℕ, r≥2, E[X] < ∞:
49 |   - E[X] = Σ prob(X≥x)
50 |   - E[X(X-1)…(X-r+1)] = r Σ{x=r…∞} (x-1)…(x-r+1) prob(X≥x)
51 | 
52 | The **variance** var(X) = E[(X-E[X])^2].
53 | 
54 | - var(aX+b) = a^2 var(X)
55 | - var(X) = 0 ⇒ X constant
56 | 
57 | The **covariance** cov(X,Y) = E[XY] - E[X] E[Y].
58 | 
59 | - cov(X,Y) = E[(X-E[X])(Y-E[Y])]
60 | - cov(X,Y) = cov(Y,X)
61 | - cov(constant,X) = 0
62 | - cov(a+bX+cY,Z) = b cov(X,Z) + c cov(Y,Z)
63 | - cov(X,Y)^2 ≤ var(X) var(Y)
64 | 
65 | A few families of distribution.
66 | 
67 | - Bernoulli: RV from Ω to {0,1}. Take p = prob(X=1).
68 |   *Number of 1s from a 1/p dice throw (a dice with 1/p faces).*
69 |   - E[X] = p
70 |   - var(X) = p (1-p)
71 | - Binomial "X ~ B(n,p)": RV such that fX(x) = (n choose x) p^x (1-p)^(n-x).
72 |   *Number of 1s from n throws of a 1/p dice.*
73 |   - E[X] = n p
74 |   - var(X) = n p (1-p)
75 | - Geometric "X ~ Geom(p)": RV such that fX(x) = p (1-p)^(x-1)
76 |   *Number of throws before a 1/p dice yields a 1.*
77 |   - E[X] = 1/p
78 |   - var(X) = (1-p)/p^2
79 |   - prob(X > n+m | X > m) = prob(X > n) *(memory loss)*
80 | - Negative binomial "X ~ NegBin(n,p)": RV such that fX(x) = (x-1 choose n-1) p^n (1-p)^(x-n).
81 |   *Number of throws before a 1/p dice yields n 1s.*
82 |   - E[X] = n / p
83 |   - var(X) = n (1-p) / p^2
84 | - Hypergeometric: RV such that fX(r) = (R choose r) (N-R choose n-r) / (N choose n)
85 |   *Number of red socks got from n blind picks without replacement from a drawer
86 |   with N socks of which R are red.*
87 |   - E[X] = R n/N
88 |   - var(X) = R n/N (N-R)/N (N-n)/(N-1)
89 | - Poisson: RV such that fX(x) = λ^x/x! exp(-λ), with x∈{0,1,2,…}.
90 | *Number of ticks per second when averaging λ ticks per second, if ticks are independent.* 91 | - E[X] = λ 92 | - var(X) = λ 93 | - Uniform: RV such that fX(x) = constant with x∈[a,b]. 94 | *Result of dice throw.* 95 | - E[X] = (a+b)/2 96 | - var(X) = (b-a)^2/12, or ((b-a+1)^2-1)/12 if discrete 97 | - Normal (Gaussian) "X ~ N(μ,σ^2)": RV such that fX(x) = exp(-(x-μ)^2/(2σ^2)) / (σ sqrt(2π)) 98 | *Infinite random walk starting at μ with step variance σ^2.* 99 | - E[X] = μ 100 | - var(X) = σ^2 101 | -------------------------------------------------------------------------------- /misc/synchronization.md: -------------------------------------------------------------------------------- 1 | # Synchronization 2 | 3 | [Concurrency](./concurrency.md) is about imparting and gathering work from 4 | multiple actors (typically on a single machine). 5 | 6 | This is about ensuring the correct operation of actors working together. 7 | In particular, to work with each other, actors must maintain a common 8 | understanding of the world. 9 | 10 | See [this](https://github.com/aphyr/distsys-class) for now; I'll try to make 11 | something denser. 12 | 13 | **Work in progress**. 
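As a first concrete piece, the Lamport clock mentioned in the Clock section below can be sketched as (the API is ours):

```javascript
// Lamport clock: a counter that ticks on local events, is attached to
// outgoing messages, and jumps to max(local, received) + 1 on receipt.
function LamportClock() { this.time = 0; }
LamportClock.prototype.tick = function() { return ++this.time; };
LamportClock.prototype.send = function() { return this.tick(); };
LamportClock.prototype.receive = function(timestamp) {
  this.time = Math.max(this.time, timestamp) + 1;
  return this.time;
};

var a = new LamportClock(), b = new LamportClock();
a.tick();           // a is now at 1
var ts = a.send();  // a is now at 2; the message carries timestamp 2
b.receive(ts);      // b jumps to max(0, 2) + 1 = 3
```

This gives a total order on events that respects causality: an event that causally precedes another always has a smaller timestamp.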
14 | 
15 | ## Network
16 | 
17 | Graph of actors communicating
18 | 
19 | CSP, Actor model
20 | 
21 | ## Actors
22 | 
23 | What can go wrong: crash, recovery, corruption, byzantine, heterogeneous
24 | 
25 | ## Communication
26 | 
27 | What can go wrong: slow (latency / bandwidth), lost, corrupted, sent multiple
28 | times
29 | 
30 | TCP, UDP ([network](./network.md))
31 | 
32 | ## Clock
33 | 
34 | [time](./time.md), POSIX time, GPS, atomic clock, NTP
35 | 
36 | Lamport clock
37 | 
38 | vector clock
39 | 
40 | ## Consistency, Availability, Partition
41 | 
42 | CAP theorem:
43 | Network partitions will happen; you must choose between consistency (read your
44 | writes) and availability (answer without waiting)
45 | 
46 | levels of consistency
47 | 
48 | CRUD [data](./data.md)
49 | 
50 | ACID
51 | 
52 | ## Data transmission
53 | 
54 | RPC calls.
55 | SOAP (XML) / REST (JSON): trees of data.
56 | Protocol buffers. Cap'n Proto.
57 | 
58 | GraphQL and the problem of transmitting an object graph
59 | 
60 | old-school CP (relational) databases (eg. MySQL, PostgreSQL, etc.):
61 | single-server writes, distributed reads through WAL streaming replication (hot standby server, vs. warm standby server), failover
62 | 
63 | eventual consistency
64 | 
65 | total operation order
66 | 
67 | operational transformation
68 | 
69 | CRDT
70 | 
71 | Consensus: Paxos
72 | 
73 | DHT
74 | 
75 | Merkle tree (git, bitcoin)
76 | 
77 | Proof of work (byzantine failure, bitcoin)
78 | 
79 | ## Architectural building blocks
80 | 
81 | [data](./data.md)
82 | 
83 | Key-Value store
84 | (LSM tree on each node, DHT for replication / distribution)
85 | Store small (100K) blobs. High read and write volumes.
86 | 
87 | block, object store, distributed file system (eg. Ceph, S3, GlusterFS, GTFS)
88 | Store large (M, G, etc.) blobs. Low read volume.
89 | 
90 | Cache (redis, memcached), typically in-memory.
91 | Increases read speeds.
92 | 
93 | SQL database: relational data.
94 | Typically low number of machines (one writer machine, many readers).
95 | 
96 | Big data: when a single machine can't handle the write volume or data size.
97 | Typically requires switching to an AP system, sometimes NoSQL (column (Cassandra), key-value (Riak, Dynamo), graph (Neo4J), document (MongoDB))
98 | Note that it is an extreme step; often, simply performing indexing on the right SQL
99 | column is enough.
100 | Also, new-generation SQL systems like Spanner and CockroachDB support CP with
101 | larger numbers of writer machines.
102 | 
103 | Message queue (AMQP eg. RabbitMQ, Kafka)
104 | AMQP: protocol on top of TCP to distribute messages:
105 | - Direct exchange: send message to all queues listening to that key, and they'll
106 |   deliver it to one consumer
107 | - Fanout exchange: send to all queues bound to it
108 | - Topic exchange: send to all queues set to receive a given key
109 | PubSub
110 | 
111 | Log/Search (ElasticSearch): pull data from all machines and index it.
112 | No rewrites, large amount of data, high read volume.
113 | 
114 | Log/Aggregate: log on each machine, merge data upon reading
115 | Very high write volume, very low read volume.
116 | 
117 | Immutable core, mutable shell (eg. Fossil, the Plan 9 file system)
118 | 
119 | ## Advice
120 | 
121 | Allow failure (chaos monkey), backup, redundancy, failover, monitoring, logging
122 | 
123 | Protocols: version, upgrade
124 | 
125 | SLA
126 | 
127 | ## Going further
128 | 
129 | - [Distributed systems for fun and profit](http://book.mixu.net/distsys/index.html)
130 | 
--------------------------------------------------------------------------------
/misc/time.md:
--------------------------------------------------------------------------------
1 | # Time
2 | 
3 | Any recurring event can let you track time. For instance, if your cat leaves the house regularly, just count each event.
However, time tracking is best served with those three Rules of Timekeeping™:
4 | 
5 | - It must be **easy to track**,
6 | - have a **constant frequency** so that we can assign fixed durations to certain actions, say baking a cake,
7 | - be **small**, so that we can track extremely short events: counting multiples is easier than divisions.
8 | 
9 | ## Day
10 | 
11 | The easiest event to track is the presence of the sun in the sky. The **day** was obviously the first time unit in use. For shorter events, it was subdivided into 24 hours, each with 60 minutes that have 60 seconds. (That strange system is probably the oldest legacy code in existence.)
12 | 
13 | Initially, hours were scheduled so that there were 12 hours between the start and end of the night, but this design broke rule 2 everywhere but at the equator, and so it evolved into uniform subdivisions of the whole day.
14 | 
15 | Nowadays, **UT1** best tracks that definition of days. Through complex measurements involving the tracking of galaxies in the sky, the angle between the centre of the sun and longitude 0 (passing through Greenwich, UK) is mapped to a precise time. However, various events on Earth like tsunamis accelerate or slow down the Earth's rotation, making this measure break rule 2 as well. Worse yet, because of the Moon's attraction, the Earth's rotation is perpetually going to slow down, increasing the length of days increasingly fast.
16 | 
17 | ## Year
18 | 
19 | With agriculture, it became necessary to track a different recurring cosmic event: the number of turns of the Earth around the Sun. We know it as a **year**. Of course, there is not an integer number of days in a year; there are around 365.24219 Earth rotations in an Earth revolution.
20 | 
21 | Yet again, this measure breaks rule 2, even though it is not as bad. A large number of gravitational effects from the other planets modify the length of a sidereal year in ways that seem random to the untrained eye.
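That fractional day is what leap years paper over: the Gregorian rule (every 4 years, except centuries, except every 400 years) averages out to 365.2425 days, close to 365.24219. A sketch:

```javascript
// Gregorian leap year rule: divisible by 4, except centuries,
// except centuries divisible by 400.
function isLeapYear(year) {
  return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
}

// The calendar repeats every 400 years, so averaging over one
// cycle gives the mean civil year length.
var leapDays = 0;
for (var y = 0; y < 400; y++) {
  if (isLeapYear(y)) { leapDays++; }
}
console.log(leapDays);              // 97 leap days per 400 years
console.log(365 + leapDays / 400);  // 365.2425
```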
22 | 
23 | ## Second
24 | 
25 | Initially, the **second** was backed by the day, as we mentioned. Then, it was backed by the moon. Then the metre (for use in mechanical clocks). Then the year. Finally, technology allowed us to realize that microwaving cesium makes its electrons oscillate at a very nearly constant frequency. It is the best timekeeper we have, and fortunately we can still use it if we change solar system.
26 | 
27 | So we switched to defining the second as a multiple of those oscillations. When the switch was made, the second was exactly equal to a second as defined previously (a portion of a year). Now, the most precise way to measure time is to count the number of seconds, which is convenient, since the second is the SI unit of time.
28 | 
29 | As a result, **TAI** (for International Atomic Time) was designed to count all time units in terms of atomic clocks at sea level. A TAI day is 24×60×60 seconds, a year is 365 days except on leap years, where it is 366 days. Leap days are used to correct the 0.24219 extra days in a year (imperfectly; they correct 0.2425 on average). They add a day at the end of February. They occur on years divisible by 4 and not divisible by 100, except if divisible by 400.
30 | 
31 | ![TAI](http://www.bipm.org/utils/common/img/tai/timelinks-2013.jpg)
32 | *Location of laboratories contributing to computing TAI.*
33 | 
34 | But of course, as days lengthen because of the Moon's gravitational pull, days no longer last 24×60×60 seconds — they are a minuscule bit longer than that. Today, TAI is roughly 37 seconds ahead of UT1.
35 | 
36 | To allow computing civil time accurately forever while benefiting from TAI's conformance to the Rules of Timekeeping™, a mix was made, called **UTC**. UTC is just like TAI, except that it is a fixed number of seconds behind it.
Every once in a while, the [IERS](https://www.iers.org/IERS/EN/Home/home_node.html) (International Earth Rotation and Reference Systems Service) proclaims that there will be a leap second added at the end of a certain day, causing the time to go from 23:59:59 to 23:59:60 (← leap second!) and then 00:00:00. The IERS does this every time it sees that there is a risk for UTC to deviate from UT1 by more than one second. 37 | 38 | That said, realistically, if in 30 000 years midnight has slowly become the rough time when the sun rises, the Earth population will have had time to change the meaning of the word. As a result, there is a possibility (that computer scientists warmly welcome) that UTC stop adding leap seconds after a certain year. 39 | 40 | There is another widely used second-based code representing instants in time: **[Unix time][]**. It is the number of seconds since the start of year 1970, *not including leap seconds*. Unix time allows sub-second precision (using a real number instead of integers). Since leap seconds are discarded, they need to be subtracted. As a result, when a leap second occurs, Unix time is determined by the [POSIX][] standard to increase linearly by one second for that second, and then jump back by one second, and again increase linearly by one second, making that second happen twice. 41 | 42 | That design ensures that we can easily compute the number of UTC days between two Unix time stamps. However, it does not give the correct number of SI seconds. To quote [them][Unix time rationale]: 43 | 44 | > [M]ost systems are probably not synchronized to any standard time reference. Therefore, it is inappropriate to require that a time represented as seconds since the Epoch precisely represent the number of seconds between the referenced time and the Epoch. 
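To illustrate, the sketch below adds back the leap seconds that occurred between two Unix timestamps; it hardcodes a single table entry (the leap second added at the end of 2016), where a real implementation needs the full table:

```javascript
// Elapsed SI seconds between two Unix timestamps: the naive difference,
// plus the leap seconds inserted in between.
// Only one table entry here; the real table has dozens of entries.
var leapSeconds = [Date.UTC(2017, 0, 1) / 1000];  // end of 2016-12-31

function elapsedSeconds(unixFrom, unixTo) {
  var leaps = leapSeconds.filter(function(t) {
    return unixFrom < t && t <= unixTo;
  }).length;
  return (unixTo - unixFrom) + leaps;
}

var from = Date.UTC(2016, 11, 31, 23, 59, 59) / 1000;
var to = Date.UTC(2017, 0, 1, 0, 0, 0) / 1000;
console.log(to - from);                 // 1: what Unix time says
console.log(elapsedSeconds(from, to));  // 2: what a wall clock measured
```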
45 | 46 | [Unix time]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_15 47 | [POSIX]: http://standards.ieee.org/develop/wg/POSIX.html 48 | [Unix time rationale]: http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap04.html#tag_21_04_15 49 | 50 | As a result, in order to compute the number of seconds between two times, if they are TAI times: convert them to seconds and subtract them. If they are UTC or Unix time: do the same, then account for all leap seconds that occurred between them. 51 | 52 | ## Time Zones 53 | 54 | Here is a more difficult question to answer: “What time is it here?” Civil time is one of those things that governments impose by law. Therefore, answering that question involves language, politics and alarms. 55 | 56 | To give the same meaning of the words noon and midnight across the world, each country proclaims that it uses a certain deviation from UTC roughly proportional to their longitude. Each deviation from UTC is called a time zone, and represented as a positive or negative number of hours, minutes, and (real-numbered) seconds. 57 | 58 | ![Standard time](https://upload.wikimedia.org/wikipedia/commons/4/4b/Solar_time_vs_standard_time.png) 59 | *Time zones with offset from solar time.* 60 | 61 | As extra fun, many countries change time zones twice a year, but not at the same time. They reckon that people are disturbed by the sun in summer before their clock wakes them up. Having a curtain seemed too hard, so those countries decided to change time itself. This is called **Daylight Savings Time**. 62 | 63 | To be perfectly accurate, one needs to keep track of all wars and laws in the world to determine the civil time at a geolocated point on the planet. Fortunately, two individuals took it upon themselves to do exactly that. They built what is now known as the [IANA time zone database][tzdata] (aka. tzdata). 
It determines both what offset each time zone code it defines has to UTC through time, and what time zone code is used for a set of famous cities spread throughout the planet. Most operating systems have a copy, which is often used to ask the user what their nearest large city is when installing it.
64 | 
65 | [tzdata]: http://www.iana.org/time-zones
66 | 
67 | So, using a geolocation coordinate, this database, and time zone borders and border changes, it is feasible to compute the civil time at all points on Earth from 1986 onwards. (You need way more data for dates prior to 1986, the date when Nepal became the last country to switch to a UTC time zone.)
68 | 
69 | ## Stamps
70 | 
71 | The most common representation of times and dates in computing is [ISO 8601][]. It defines a textual representation for UTC (with or without time zones), calendar dates, weeks, yearless calendar dates, and durations. Decimal fractions of a second allow sub-second precision. For instance, `2006-08-14T02:34:56.238-06:00`.
72 | 
73 | [ISO 8601]: https://en.wikipedia.org/wiki/ISO_8601
74 | 
75 | A variant of this is [RFC 3339][], which is more lax. For instance, `2006-08-14 02:34:56.238-0600`.
76 | 
77 | We can also find [RFC 2822][], which is used for email. For instance: `Mon, 14 Aug 2006 02:34:56 -0600`.
78 | 
79 | [RFC 3339]: https://www.ietf.org/rfc/rfc3339.txt
80 | [RFC 2822]: https://www.ietf.org/rfc/rfc2822.txt
81 | 
82 | The second most common representation in computing is [Unix time][], in seconds, either in 32-bit integers, 32-bit unsigned integers, 64-bit unsigned integers, or in textual base-10 form (with decimal digits for more precision).
83 | 
84 | Unfortunately, there are no widespread TAI-based time stamps.
--------------------------------------------------------------------------------
/tree/Readme.md:
--------------------------------------------------------------------------------
1 | Trees are connected graphs without cycles.
2 | 
3 | Most trees we use are **rooted**: one vertex is the entry point (the root);
4 | it is the only vertex with no edge pointing to it, all others have exactly one.
5 | 
6 | Many trees are **ordered**: vertices order their children like a list.
7 | 
8 | The most common implementation of trees is as either null pointers or pointers
9 | to a structure with a value and a list of children trees.
10 | 
11 | # Binary tree
12 | 
13 | Ordered trees with up to two children. Since they are ordered, we call them
14 | "left" and "right".
15 | 
16 | ## Tree traversal
17 | 
18 | We want to read each vertex exactly once.
19 | 
20 | ### Depth-first
21 | 
22 | - **Pre-order**: read a vertex, then pre-order left, then pre-order right.
23 | - **In-order**: in-order left, then read the vertex, then in-order right.
24 | - **Post-order**: post-order left, then post-order right, then read the vertex.
25 | 
26 | In-order is probably called this because it is the correct order for binary
27 | search trees.
28 | 
29 | ### Breadth-first
30 | 
31 | 1. Start with a list of `children` containing just the root.
32 | 2. Read a vertex from the start of the list and remove it.
33 | 3. Put its children at the end of `children`.
34 | 4. Go to 2 unless `children` is empty.
35 | 
36 | It requires O(n) worst-case space (the list can hold about n/2 vertices).
37 | 
38 | ## Binary search tree
39 | 
40 | Great for maps that you can traverse in the key order, priority queues where you
41 | care about the least prioritized element, and search when a hash table won't do.
42 | 
43 | - Each value stored in a vertex has total ordering.
44 | - Left is smaller
45 | - Right is bigger
46 | 
47 | Search, insertion and deletion are O(log n) average, O(n) worst-case (it can
48 | reduce to a list).
49 | 
50 | If the tree is balanced (= minimal height), we get O(log n) worst-case.
51 | 
52 | ### AVL tree
53 | 
54 | The first self-balanced binary search tree, and the one with the least costly
55 | search. Use this if you only insert during initialization.
56 | 
57 | Height ≤ `logφ(√5·(n+2))-2`.
58 | 
59 | ### Red-Black tree
60 | 
61 | Insertions are less costly than in an AVL tree.
62 | 
63 | Height ≤ `2·log2(n+1)`.
64 | 
65 | ### Splay tree
66 | 
67 | Rarer, it ensures that recently requested searches are faster to search for.
68 | 
69 | ## Binary heap
70 | 
71 | Great for priority queues. Efficient to implement as an array.
72 | 
73 | - Complete binary tree: all levels of the tree must be filled but the bottom.
74 | - Each value stored in a vertex is bigger than or equal to its children.
75 | 
76 | A nice thing to know: it can sort an array in-place, just by inserting all the
77 | elements into the heap, and extracting the maximum n times.
78 | 
79 | # B-tree
80 | 
81 | Achieves O(log(n)) worst-case. Great for IO with large blocks of data:
82 | databases, filesystems.
83 | 
84 | In the simplest variant (a 2-3 tree), each vertex can have 2 or 3 children,
85 | and 1 or 2 keys (pieces of ordered data).
86 | 
87 | - The left branch has keys smaller than the left key,
88 | - The right branch has keys higher than the right key,
89 | - The middle branch has keys in between.
--------------------------------------------------------------------------------
/tree/binary-search.js:
--------------------------------------------------------------------------------
1 | // A binary search tree is a binary tree (a rooted tree with vertices with up to
2 | // 2 children) where every vertex on the left branch holds keys lower than that
3 | // of the current vertex, and every vertex on the right branch holds keys
4 | // higher.
5 | 
6 | function BinarySearchTree() {}
7 | 
8 | BinarySearchTree.prototype = {
9 |   key: null,
10 |   // Not having a value can make this work like a set
11 |   // with a findMin over the keys.
12 |   value: null,
13 |   left: null,
14 |   right: null,
15 | 
16 |   // Ensure that search(key) returns the value for that key.
17 |   // O(log n) with random input, O(n) worst-case.
18 | insert: function(key, value) { 19 | if (this.key == null) { 20 | // This is below a leaf or the tree is empty. 21 | this.key = key; 22 | this.value = value; 23 | } else if (key < this.key) { 24 | this.leftInsert(key, value); 25 | } else if (key > this.key) { 26 | this.rightInsert(key, value); 27 | } else { 28 | this.value = value; // this.key == key: the key was already there; overwrite its value. 29 | } 30 | }, 31 | 32 | leftInsert: function(key, value) { 33 | if (this.left == null) { 34 | this.left = new BinarySearchTree(); 35 | } 36 | this.left.insert(key, value); 37 | }, 38 | 39 | rightInsert: function(key, value) { 40 | if (this.right == null) { 41 | this.right = new BinarySearchTree(); 42 | } 43 | this.right.insert(key, value); 44 | }, 45 | 46 | // Return the value that was inserted. 47 | // O(log n) with random input, O(n) worst-case. 48 | search: function(key) { 49 | if (this.key == null) { 50 | // We reached a leaf without success; that key was never inserted. 51 | return null; 52 | } else if (key < this.key) { 53 | if (this.left != null) { 54 | return this.left.search(key); 55 | } else { return null; } 56 | } else if (key > this.key) { 57 | if (this.right != null) { 58 | return this.right.search(key); 59 | } else { return null; } 60 | } else { 61 | return this.value; 62 | } 63 | }, 64 | 65 | // Ensure that search(key) returns null. 66 | // O(log n) with random input, O(n) worst-case. 67 | delete: function(key) { 68 | if (this.key == null) { 69 | return; 70 | } else if (key < this.key) { 71 | if (this.left != null) { 72 | this.left.delete(key); 73 | } 74 | } else if (key > this.key) { 75 | if (this.right != null) { 76 | this.right.delete(key); 77 | } 78 | } else { 79 | // We found the key to delete. 80 | 81 | if (this.left == null && this.right == null) { 82 | // We have no children, we just disappear. 83 | this.key = this.value = null; 84 | } else if (this.left == null) { 85 | // We have one child: replace ourselves with it.
86 | this.replaceWith(this.right); 87 | } else if (this.right == null) { 88 | this.replaceWith(this.left); 89 | } else { 90 | // We have two children. Replace with the biggest vertex on the left, 91 | // and delete that biggest vertex downward. 92 | var max = this.left.findMax(); 93 | // It cannot be null here. 94 | this.key = max.key; 95 | this.value = max.value; 96 | this.left.delete(max.key); 97 | } 98 | } 99 | }, 100 | 101 | replaceWith: function(tree) { 102 | this.key = tree.key; 103 | this.value = tree.value; 104 | this.left = tree.left; 105 | this.right = tree.right; 106 | }, 107 | 108 | // Return the biggest vertex. 109 | findMax: function() { 110 | if (this.right == null) { 111 | return this; 112 | } else { 113 | return this.right.findMax(); 114 | } 115 | }, 116 | 117 | // In-order walk. O(n). 118 | walk: function(f) { 119 | if (this.left != null) { 120 | this.left.walk(f); 121 | } 122 | if (this.key != null) { 123 | f(this.key, this.value); 124 | } 125 | if (this.right != null) { 126 | this.right.walk(f); 127 | } 128 | }, 129 | }; 130 | 131 | // Usage. 132 | 133 | var tree = new BinarySearchTree(); 134 | tree.insert("orange", "A citrus fruit with a slightly sour flavour."); 135 | tree.insert("banana", "An elongated curved tropical fruit with a creamy flesh."); 136 | tree.insert("strawberry", "A sweet fruit of a plant of the genus Fragaria."); 137 | console.log("An orange is " + tree.search("orange")); 138 | tree.delete("orange"); 139 | console.log("Once deleted, an orange is " + tree.search("orange") + "."); 140 | tree.walk(function(key, value) { console.log("- " + key + ": " + value); }); 141 | -------------------------------------------------------------------------------- /tree/heap.js: -------------------------------------------------------------------------------- 1 | // Binary Heap as a Priority Queue. 2 | // 3 | // 1. Each vertex is bigger than any descendant. 4 | // 2. All levels of the tree must be full except possibly the bottom one.
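Before the implementation, the index arithmetic behind the array encoding can be sketched on its own (a standalone sketch; the helper names are hypothetical — heap.js below inlines the same formulas):

```javascript
// A complete binary tree stored breadth-first in an array.
// For the vertex at 0-based index i:
function parentIndex(i) { return Math.floor((i - 1) / 2); }
function leftChildIndex(i) { return 2 * i + 1; }
function rightChildIndex(i) { return 2 * i + 2; }

// The max-heap      9
//                 5   8
//                1 2
// is stored level by level as:
var heapArray = [9, 5, 8, 1, 2];

console.log(heapArray[leftChildIndex(0)]);   // 5
console.log(heapArray[rightChildIndex(0)]);  // 8
console.log(heapArray[parentIndex(4)]);      // 5 (the parent of 2)
```

Because the tree is complete, this encoding wastes no slots and needs no child pointers.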
5 | 6 | function BinaryHeap() { 7 | this.size = 0; 8 | this.array = []; 9 | } 10 | 11 | BinaryHeap.prototype = { 12 | // In this array, for each vertex at position n: 13 | // - the left child is at 2*n + 1, 14 | // - the right child is at 2*n + 2. 15 | array: [], // Shadowed by a per-instance array in the constructor. 16 | 17 | // First, some obvious functions specific to the use of an array. 18 | swap: function(i, j) { 19 | var tmp = this.array[i]; 20 | this.array[i] = this.array[j]; 21 | this.array[j] = tmp; 22 | }, 23 | push: function(item) { this.size++; this.array.push(item); }, 24 | pop: function() { this.size--; return this.array.pop(); }, 25 | 26 | // Now, the real magic. 27 | // O(log n) worst-case. 28 | insert: function(priority, value) { 29 | this.push({key: priority, value: value}); 30 | this.shiftUp(this.size - 1); 31 | }, 32 | 33 | max: function() { 34 | return this.array[0]; 35 | }, 36 | 37 | removeMax: function() { if (this.size === 0) { return null; } 38 | // Swap the max with the last element, pop it off, and push the new top down. 39 | this.swap(0, this.size - 1); 40 | var max = this.pop(); 41 | this.shiftDown(0); 42 | return max; 43 | }, 44 | 45 | // Primitives required for balancing the tree. 46 | 47 | shiftUp: function(j) { 48 | for (;;) { 49 | // Is j's parent (i) bigger? 50 | // If yes, we have kept rule #1, we can exit. 51 | var i = Math.floor((j - 1) / 2); 52 | if (i < 0) { i = 0; } 53 | if (this.array[i].key >= this.array[j].key) { 54 | break; 55 | } 56 | this.swap(i, j); 57 | j = i; 58 | } 59 | }, 60 | 61 | shiftDown: function(i) { 62 | var j; 63 | for (;;) { 64 | // j1 is the left child of i. 65 | // j2 is the right child of i. 66 | var j1 = 2*i + 1; 67 | var j2 = 2*i + 2; 68 | 69 | // We want to switch i with the biggest of its children, 70 | // to maintain rule #1. 71 | if (j1 >= this.size) { break; } 72 | if ((j2 < this.size) && (this.array[j1].key <= this.array[j2].key)) { 73 | j = j2; 74 | } else { 75 | j = j1; 76 | } 77 | 78 | // If we already follow rule #1, we're good to go.
79 | if (this.array[i].key >= this.array[j].key) { break; } 80 | 81 | // We don't follow rule #1. Swap and continue. 82 | this.swap(i, j); 83 | i = j; 84 | } 85 | }, 86 | }; 87 | 88 | var heap = new BinaryHeap(); 89 | heap.insert(2, 'two'); 90 | heap.insert(5, 'five'); 91 | heap.insert(3, 'three'); 92 | heap.insert(4, 'four'); 93 | heap.insert(1, 'one'); 94 | console.log('The top of the heap is ' + heap.max().key + ' (five).'); 95 | console.log('Reverse sorted items:'); 96 | var item; 97 | while (item = heap.removeMax()) { 98 | console.log(item.key + '. ' + item.value); 99 | } 100 | -------------------------------------------------------------------------------- /tree/red-black.js: -------------------------------------------------------------------------------- 1 | // A Red-Black tree has four properties: 2 | // 3 | // 1. Each vertex is either red or black. 4 | // 2. The root is black. 5 | // 3. A red vertex cannot have a red child. 6 | // 4. All paths from root to leaf must have the same number of black vertices. 7 | 8 | function RedBlackTree() {} 9 | 10 | var color = {red: 0, black: 1}; 11 | RedBlackTree.prototype = { 12 | key: null, 13 | value: null, 14 | left: null, 15 | right: null, 16 | parent: null, 17 | color: color.black, 18 | 19 | // O(log n) worst-case. 20 | // We choose to implement it procedurally instead of recursively to separate 21 | // all steps. 22 | insert: function(key, value) { 23 | var v = this; 24 | 25 | // First, find the insertion position as with a normal binary search tree. 26 | for (;;) { 27 | if (v.key == null) { 28 | // This is below a leaf or the tree is empty. 29 | v.key = key; 30 | v.value = value; 31 | v.color = color.red; // ← Every inserted vertex starts out red. 32 | break; 33 | } else if (key < v.key) { 34 | v = this.produceLeft(v); 35 | } else if (key > v.key) { 36 | v = this.produceRight(v); 37 | } else { 38 | v.value = value; return; // v.key == key: the key was already there; update it and stop. 39 | } 40 | } 41 | 42 | // Second, compare it to its parent.
43 | for (;;) { 44 | if (v.parent == null) { 45 | // Case 1. We are inserting the root: paint it black for rule #2. 46 | v.color = color.black; 47 | return; 48 | } 49 | if (v.parent.color === color.black) { 50 | // Case 2. Black parent, Red child, we are not breaking any rule. 51 | return; 52 | } 53 | // Red parent, Red child, we are breaking rule #3. 54 | // We know we have a black grandparent, since the root cannot be red, 55 | // but our parent is. 56 | var grandparent = v.parent.parent; 57 | var uncle = v.uncle(); 58 | if (uncle != null && uncle.color === color.red) { 59 | // Case 3. 60 | // (G) [G] 61 | // [P] [U] → (P) (U) 62 | // [V] [V] 63 | // 64 | // G = grandparent, P = parent, U = uncle; [Red], (Black). 65 | // 66 | // This keeps the number of blacks equal whatever the path. 67 | v.parent.color = color.black; 68 | uncle.color = color.black; 69 | grandparent.color = color.red; // Red is the new black. ☺ 70 | // What about the grandparent? 71 | // Did we make it break rule #2 (black root) or #3 (two red vertices)? 72 | v = grandparent; 73 | // Back to Case 1. 74 | // Case 3 is the only looping case, which is why insert() is O(log n). 75 | continue; 76 | } 77 | if (v.parent.right === v && v.parent === grandparent.left) { 78 | // Case 4. 79 | // (G) (G) 80 | // [P] (U) → [V] (U) 81 | // [V] [P] 82 | this.leftRotation(v.parent); 83 | // We still have two red vertices breaking rule #3. 84 | // We will fix that with case 5. 85 | v = v.left; 86 | } else if (v.parent.left === v && v.parent === grandparent.right) { 87 | // Case 4 cont. 88 | // (G) (G) 89 | // (U) [P] → (U) [V] 90 | // [V] [P] 91 | this.rightRotation(v.parent); 92 | // We still have two red vertices breaking rule #3. 93 | // We will fix that with case 5. 94 | v = v.right; 95 | } 96 | if (v.parent.left === v && v.parent === grandparent.left) { 97 | // Case 5. 
98 | // (G) (P) 99 | // [P] (U) → [V] [G] 100 | // [V] (U) 101 | this.rightRotation(grandparent); 102 | v.parent.color = color.black; 103 | v.parent.right.color = color.red; // The vertex now holding the grandparent's data (a root rotation moves data, not vertices). 104 | // All is safe now. 105 | return; 106 | } else if (v.parent.right === v && v.parent === grandparent.right) { 107 | // Case 5 cont. 108 | // (G) (P) 109 | // (U) [P] → [G] [V] 110 | // [V] (U) 111 | this.leftRotation(grandparent); 112 | v.parent.color = color.black; 113 | v.parent.left.color = color.red; // The vertex now holding the grandparent's data. 114 | // All is safe now. 115 | return; 116 | } 117 | } 118 | }, 119 | 120 | // G 121 | // P U ← uncle 122 | // V ← vertex (this) 123 | uncle: function() { 124 | if (this.parent == null || this.parent.parent == null) { 125 | return null; 126 | } 127 | var grandparent = this.parent.parent; 128 | if (grandparent.right === this.parent) { 129 | return grandparent.left; 130 | } else { 131 | return grandparent.right; 132 | } 133 | }, 134 | 135 | // Tree rotations preserve the properties of binary search trees. 136 | 137 | // P V 138 | // V C → A P 139 | // A B B C 140 | rightRotation: function(parent) { 141 | var v = parent.left; 142 | var b = v.right; 143 | var grandparent = parent.parent; 144 | if (grandparent != null) { 145 | var parentIsLeft = (grandparent.left === parent); 146 | v.right = parent; parent.parent = v; 147 | parent.left = b; if (b != null) { b.parent = parent; } 148 | v.parent = grandparent; 149 | if (parentIsLeft) { 150 | grandparent.left = v; 151 | } else { 152 | grandparent.right = v; 153 | } 154 | } else { 155 | // parent (which is root) becomes v. 156 | // P P 157 | // V C → A V 158 | // A B B C 159 | v.switchData(parent); 160 | var a = v.left; 161 | v.left = b; 162 | v.right = parent.right; if (v.right != null) { v.right.parent = v; } 163 | parent.left = a; if (a != null) { a.parent = parent; } 164 | parent.right = v; 165 | parent.parent = null; 166 | v.parent = parent; 167 | } 168 | }, 169 | 170 | // P V 171 | // A V → P C 172 | // B C A B 173 | leftRotation: function(parent) { 174 | var v = parent.right; 175 | var b = v.left; 176 | var grandparent = parent.parent; 177 | if (grandparent != null) { 178 | var parentIsLeft = (grandparent.left === parent); 179 | v.left = parent; parent.parent = v; 180 | parent.right = b; if (b != null) { b.parent = parent; } 181 | v.parent = grandparent; 182 | if (parentIsLeft) { 183 | grandparent.left = v; 184 | } else { 185 | grandparent.right = v; 186 | } 187 | } else { 188 | // parent (which is root) becomes v. 189 | // P P 190 | // A V → V C 191 | // B C A B 192 | v.switchData(parent); 193 | var c = v.right; 194 | v.right = b; 195 | v.left = parent.left; if (v.left != null) { v.left.parent = v; } 196 | parent.right = c; if (c != null) { c.parent = parent; } 197 | parent.left = v; 198 | parent.parent = null; 199 | v.parent = parent; 200 | } 201 | }, 202 | 203 | switchData: function(vertex) { 204 | var key = this.key; 205 | var value = this.value; 206 | var color = this.color; 207 | 208 | this.key = vertex.key; 209 | this.value = vertex.value; 210 | this.color = vertex.color; 211 | 212 | vertex.key = key; 213 | vertex.value = value; 214 | vertex.color = color; 215 | }, 216 | 217 | produceLeft: function(vertex) { 218 | var left = vertex.left; 219 | if (left == null) { 220 | left = vertex.left = new RedBlackTree(); 221 | left.parent = vertex; 222 | } 223 | return left; 224 | }, 225 | 226 | produceRight: function(vertex) { 227 | var right = vertex.right; 228 | if (right == null) { 229 | right = vertex.right = new RedBlackTree(); 230 | right.parent = vertex; 231 | } 232 | return right; 233 | }, 234 | 235 | // Return the value that was inserted. 236 | // O(log n) worst-case. 237 | // Same implementation as the plain binary search tree.
238 | search: function(key) { 239 | if (this.key == null) { 240 | // We reached a leaf without success; that key was never inserted. 241 | return null; 242 | } else if (key < this.key) { 243 | if (this.left != null) { 244 | return this.left.search(key); 245 | } else { return null; } 246 | } else if (key > this.key) { 247 | if (this.right != null) { 248 | return this.right.search(key); 249 | } else { return null; } 250 | } else { 251 | return this.value; 252 | } 253 | }, 254 | 255 | // In-order walk. O(n). 256 | // Same implementation as the plain binary search tree. 257 | walk: function(f) { 258 | if (this.left != null) { 259 | this.left.walk(f); 260 | } 261 | if (this.key != null) { 262 | f(this.key, this.value); 263 | } 264 | if (this.right != null) { 265 | this.right.walk(f); 266 | } 267 | }, 268 | }; 269 | 270 | // Usage. 271 | var tree = new RedBlackTree(); 272 | // Note that those insertions would create a linked list with a naive 273 | // binary search tree. 274 | tree.insert("banana", "An elongated curved tropical fruit with a creamy flesh."); 275 | tree.insert("orange", "A citrus fruit with a slightly sour flavour."); 276 | tree.insert("strawberry", "A sweet fruit of a plant of the genus Fragaria."); 277 | console.log("An orange is " + tree.search("orange")); 278 | tree.walk(function(key, value) { console.log("- " + key + ": " + value); }); 279 | --------------------------------------------------------------------------------
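The remark in tree/Readme.md that a binary heap can sort in place can be made concrete. This is a minimal standalone heapsort sketch: it builds a max-heap directly inside the array (bottom-up heapify), then repeatedly swaps the maximum to the end and sifts the new top down. It works on plain numbers and does not reuse the BinaryHeap from heap.js.

```javascript
// In-place heapsort: O(n) heapify, then n extractions of O(log n) each.
function heapsort(a) {
  // Push a[i] down until both children are smaller, within a[0..size).
  function siftDown(i, size) {
    for (;;) {
      var largest = i;
      var l = 2 * i + 1;
      var r = 2 * i + 2;
      if (l < size && a[l] > a[largest]) { largest = l; }
      if (r < size && a[r] > a[largest]) { largest = r; }
      if (largest === i) { return; }
      var tmp = a[i]; a[i] = a[largest]; a[largest] = tmp;
      i = largest;
    }
  }
  // Heapify: sift down every internal vertex, bottom-up.
  for (var i = Math.floor(a.length / 2) - 1; i >= 0; i--) {
    siftDown(i, a.length);
  }
  // Swap the max to the end, shrink the heap, restore rule #1.
  for (var end = a.length - 1; end > 0; end--) {
    var tmp = a[0]; a[0] = a[end]; a[end] = tmp;
    siftDown(0, end);
  }
  return a;
}

console.log(heapsort([3, 1, 4, 1, 5, 9, 2, 6]));  // [1, 1, 2, 3, 4, 5, 6, 9]
```

Heapify runs bottom-up because each sift-down then operates on subtrees that are already heaps, which is what makes the build phase O(n) rather than O(n log n).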