├── .github
│   └── ISSUE_TEMPLATE
│       ├── reading-list.md
│       └── research.md
├── README.md
└── notes
    ├── 2019-12-28-linkedin-log.md
    ├── 2020-01-06-time-order.md
    ├── 2020-03-23-impossibility-of-distributed-consensus-with-one-faulty-process.md
    ├── 2020-05-04-scribe.md
    ├── 2020-05-05-byzantine-generals.md
    ├── 2020-05-05-effective-manycast-messaging-for-kad.md
    ├── 2020-05-12-pastry.md
    ├── 2020-05-26-quic.md
    ├── 2020-05-26-reliable-systems-series-model-based-testing.md
    └── 2020-06-16-lsmt.md

/.github/ISSUE_TEMPLATE/reading-list.md:
--------------------------------------------------------------------------------
---
name: reading-list
about: Reading list
title: ''
labels: reading list
assignees: ''

---

- [ ] []()
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/research.md:
--------------------------------------------------------------------------------
---
name: research
about: Research issue template
title: ''
labels: research
assignees: ''

---

# Questions
-

# Literature
- [ ] **Item** - [post]() | [paper]() | [video]()
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Research

- [notes](/notes) - A set of notes and summaries from papers, blogs, etc. that I have read.

For more long-form posts, see [my blog](https://dean.eigenmann.me/blog).
--------------------------------------------------------------------------------
/notes/2019-12-28-linkedin-log.md:
--------------------------------------------------------------------------------
# [The Log: What every software engineer should know about real-time data's unifying abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)

Logs are a **data structure** where **records** are appended to the end. Logs are read from top to bottom (left to right).

This post is not about application logging, but about logs used by software itself: logs that are written to before database changes are made, for purposes of atomicity, replication and restoration.

**Logs solve two of the main issues in distributed systems: ordering changes and distributing data.**

> If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.

As long as two nodes process the log in the same way, they will reach the same end state. It doesn't really matter what is put in the log, as long as it is **enough** for nodes to remain consistent.

In databases we have **physical logging**, which logs the contents of each row that was changed. We also have **logical logging**, which logs not what was changed but which command was executed to change the row.

Similarly, other distributed systems use logical or physical logging: either all nodes process the full request, or one node does and then distributes the results to the other nodes.

Logs can serve as a sort of backup of the entire state. Version control is essentially a log. Additionally, logs provide logical clocks in systems where multiple nodes synchronize a single log.
This means we can say, for example: don't read from any process that has not synchronized up to logical clock `x`.

Logs act as a buffer, making data production asynchronous from data consumption.

The author uses the term log because logs are more specific about semantics: they are a messaging system with durability guarantees and strong ordering semantics, unlike more generic messaging or pub/sub systems. In distributed systems this is often also called [atomic broadcast](https://en.wikipedia.org/wiki/Atomic_broadcast).

The post pretty much ends with an explanation of Kafka.

--------------------------------------------------------------------------------
/notes/2020-01-06-time-order.md:
--------------------------------------------------------------------------------
# [Time, Clocks, and the Ordering of Events in a Distributed System](https://lamport.azurewebsites.net/pubs/time-clocks.pdf)

Time is generally how we order events. In a distributed system it is sometimes hard to state which event happened before another; the "happened before" relation is only a partial ordering of the system.

We can say that `a` happened before `b` if `a` causally affected `b`, and we can say this without needing a notion of time within the system. Events are concurrent if neither causally affected the other.

With this ordering of events, we can add a `clock` to the system. The `clock` is a way of assigning a number to an event, representing the point in time at which the event occurred.

> Clock **Ci** exists for every process **Pi** and assigns a number **Ci(a)** to every event **a** in that process.
>
> **C** represents the entire system of clocks, where **C(b)** assigns a number to event **b** with **C(b)** == **Cj(b)** if **b** is an event of process **j**.

The clocks described here are **logical clocks**, meaning they may be implemented with counters rather than actual physical time.
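A minimal sketch of such a logical clock (an illustration, not code from the paper), following the usual Lamport rules: increment the counter between local events, and on receiving a message set the counter past the timestamp the message carries.

```python
class LamportClock:
    """Minimal logical clock: a counter, not physical time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event: advance the clock.
        self.time += 1
        return self.time

    def send(self):
        # Stamp an outgoing message with the current clock value.
        return self.tick()

    def receive(self, sender_time):
        # Set our clock past the timestamp carried by the message.
        self.time = max(self.time, sender_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()   # a's clock becomes 1
b.receive(t)   # b's clock becomes max(0, 1) + 1 = 2
```

This gives the property described in the next paragraph: if `a` happened before `b`, the clock value of `a` is smaller than that of `b`, but not the converse.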
If an event `a` happened before `b`, then the clock value of `a` should be smaller than that of `b`. The converse does not hold, as it would mean that all concurrent events must happen at the same clock value. If two events occur on one node, they can both be concurrent with the same event on another node without occurring at the same clock value.

Changing the value of `Ci` takes place between events, meaning that the change itself does not constitute an event.

When a node receives a message from another node, the message carries a clock value indicating its time on the sender node; the receiving node must then set its own clock value to be greater than the one received.

We can use these clocks to totally order the set of all events that occurred in our system by sorting them by their clock values. To break ties we use an arbitrary ordering of the processes.

## References
- [Research Issue](https://github.com/decanus/research/issues/10)
--------------------------------------------------------------------------------
/notes/2020-03-23-impossibility-of-distributed-consensus-with-one-faulty-process.md:
--------------------------------------------------------------------------------
# Notes: Impossibility of Distributed Consensus with One Faulty Process

This paper won the [Dijkstra Prize](http://eatcs.org/index.php/dijkstra-prize).

Reaching agreement is one of the fundamental problems in distributed systems and is at the core of many algorithms. Reaching agreement is simple if all participating nodes, as well as the network, are reliable. However, real-world systems can have many kinds of faults:
- partitions
- node crashes
- lost, distorted or duplicated messages

Additionally, Byzantine failures can occur, where nodes engage in completely adversarial behaviour. Ideally, we would have a protocol that is reliable in the face of such faults.
The paper shows that no completely asynchronous consensus algorithm can tolerate even a single unannounced process death. Byzantine failures are not considered, and the message system is assumed to be reliable: all messages are delivered correctly and exactly once.

**This means the problem has no robust solution unless we make further assumptions about the environment or place greater restrictions on the kinds of failures to be tolerated.**

Assumptions:
- all processing is completely asynchronous.
- there is no access to synchronized clocks.
- it is impossible to tell whether a process has died or is simply slow.

We exclude algorithms that trivially always decide `NO`, as these would be completely useless.

An algorithm that satisfies the following properties would solve the consensus problem:

- **Agreement**: all correct nodes output the same value `o`.
- **Validity**: if all nodes have the same input `b`, then `o = b`.
- **Termination**: all correct nodes decide on an output and terminate.

The paper essentially shows that if such an algorithm existed, one could keep it undecided for an arbitrary amount of time by delaying message delivery, which is allowed in asynchronous environments.
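For intuition, the three conditions can be written as predicates over the nodes' inputs and decided outputs (my paraphrase for illustration, not notation from the paper; `None` stands for "never decided"):

```python
def agreement(outputs):
    # All correct nodes decide the same value.
    return len(set(outputs)) == 1

def validity(inputs, outputs):
    # If every node starts with the same input b, the decision must be b.
    if len(set(inputs)) == 1:
        return set(outputs) == set(inputs)
    return True

def termination(outputs):
    # Every correct node eventually decides (no None placeholders left).
    return all(o is not None for o in outputs)

# A run where everyone input 1 and everyone decided 1 satisfies all three.
ins, outs = [1, 1, 1], [1, 1, 1]
assert agreement(outs) and validity(ins, outs) and termination(outs)
```

FLP says no asynchronous algorithm can guarantee all three in every execution once a single process may crash.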
## References
- https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/
- https://www.the-paper-trail.org/post/2012-03-25-flp-and-cap-arent-the-same-thing/
- https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf
- http://resources.mpi-inf.mpg.de/departments/d1/teaching/ws14/ToDS/script/flp.pdf
- https://www.youtube.com/watch?v=Vmlj-67aymw
- https://hackernoon.com/connecting-the-dots-flp-bft-and-consensus-algorithms-m9r62bs1
- http://book.mixu.net/distsys/abstractions.html
--------------------------------------------------------------------------------
/notes/2020-05-04-scribe.md:
--------------------------------------------------------------------------------
# Notes: [Scribe](https://people.mpi-sws.org/~druschel/publications/Scribe-jsac.pdf)

> SCRIBE is a scalable application-level multicast infrastructure.

Network-level IP multicast was never widely deployed, for various reasons such as the difficulty of tracking groups. This is why application-level multicast has gained popularity. These systems still face certain challenges, such as scaling and tolerating the failure modes of the general internet while achieving low delay and effective use of resources. DHTs, however, offer scalable, self-organizing, fault-tolerant overlay networks for decentralized distributed applications. Scribe is built on top of the [Pastry DHT](https://en.wikipedia.org/wiki/Pastry_(DHT)).

Scribe requires all participants to share equal responsibilities. It builds a multicast tree from each group member to a group's rendezvous point. Messaging and group management rely on the robustness of Pastry itself.

**Things I found interesting:**
- What I find particularly interesting about Scribe is how local it is: no specific node is responsible for the entirety of the group; everything is based on local information.
## Pastry

Each Pastry node has a unique ID, and the set of node IDs is uniformly distributed. Pastry routes a message to the node whose ID is numerically closest to the key contained in the message. Pastry can route in less than `log2^b(n)` steps, assuming a network of size `n`; `b` is a configuration parameter, typically `4`. The routing tables required by nodes only have `(2^b - 1) * log2^b(n) + l` entries. Invariants in the tables can be restored with `O(log2^b(n))` messages.

Pastry routing provides two important properties for Scribe:

1. **Short routes** - Because a message is routed to the nearest node with a longer prefix match, simulations show that the average distance traveled by a message is between 1.59 and 2.2 times the distance between the source and destination in the underlying internet.
2. **Route convergence** - Simulations show that two messages sent to the same key converge, on average, at a point in their routes whose distance from the sources is roughly the distance between the two source nodes themselves.

**Node failure** - To handle node failures, nodes periodically exchange keep-alive messages with their closest nodes; if a node is unresponsive for a certain period, it is assumed failed.

## Scribe

Any Scribe node may create a group; other nodes can join a group or send messages to that group. Scribe provides best-effort delivery and specifies no order. Scribe groups can have multiple message sources and many members.

Scribe exposes a simple API:

- `create(credentials, groupID)` - creates a new group.
  *Credentials are later used for access control.*
- `join(credentials, groupID, messageHandler)` - causes the local node to join a group; all messages received for the group are then passed to the message handler.
- `leave(credentials, groupID)` - causes the local node to leave a group.
- `multicast(credentials, groupID, message)` - causes the message to be multicast within the group.

Scribe is fully decentralized: each node can act as a multicast source, a root of a multicast tree, a group member, a node within a tree, or any combination of the above.

### Implementation

`forward` and `deliver` are specified by the Scribe application, as required by Pastry. The `deliver` method is called when a message arrives at the node with the nodeID numerically closest to the message's key, or when a message was addressed to the local node using the Pastry send operation. The `forward` method is called whenever a Scribe message is routed through a node.

Each group has a unique groupID, and the node with the ID numerically closest to the groupID acts as the rendezvous point for the group. The groupID is a hash of the group's name concatenated with its creator's name.

GroupIDs are distributed uniformly, just like node IDs, ensuring an equal distribution of groups across nodes.

Scribe has the following message types:
- `JOIN` - used to join a group.
  If a node is asked by Pastry to forward this message and it is not yet a forwarder for the group, it adds the group to its set of groups to become a forwarder ([see "Disseminating messages"](#disseminating-messages)) and adds the source node as a child. It must then become a forwarder itself by sending a `JOIN` message towards the next node closest to the groupID.
- `CREATE` - used to create a new group, with the groupID as the message's key.
- `LEAVE` - used when a node wants to leave. The node first marks itself locally as having left; if it has no children, it sends this message to its parent in the tree. The message is passed up recursively until it reaches a node that still has children after the departing node was removed.
- `MULTICAST`

Proper credentials are needed to `JOIN`, `CREATE` or `MULTICAST`.
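To make the four API calls concrete, here is a toy, single-process stand-in (hypothetical code: the whole "network" is collapsed into one object, credentials are ignored, and delivery is a direct function call rather than tree-based multicast):

```python
class ToyScribe:
    """In-process sketch of the Scribe API described above (not real Scribe)."""

    def __init__(self):
        self.groups = {}  # groupID -> set of message handlers (the "members")

    def create(self, credentials, group_id):
        # Real Scribe would check credentials for access control here.
        self.groups.setdefault(group_id, set())

    def join(self, credentials, group_id, message_handler):
        self.groups.setdefault(group_id, set()).add(message_handler)

    def leave(self, credentials, group_id):
        # The local "node" here is the whole process, so leaving drops the group.
        self.groups.pop(group_id, None)

    def multicast(self, credentials, group_id, message):
        # Best-effort delivery to every current member, in no specified order.
        for handler in self.groups.get(group_id, ()):
            handler(message)
```

In real Scribe the same calls are resolved through Pastry routing and the multicast tree rather than a local dictionary.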
#### Disseminating messages

Scribe creates a multicast tree to disseminate multicast messages; the tree is created using a scheme similar to reverse-path forwarding. The tree is formed by joining the Pastry routes from each group member to the rendezvous point. Group joining is managed in a decentralized manner to support large and dynamic memberships. Scribe nodes that are part of the tree are called **forwarders**; they may or may not be members of the group. Each forwarder keeps a table containing each of its children in the tree.

Nodes cache the rendezvous point when multicasting: they route the `MULTICAST` message and ask the rendezvous point to return its IP address, which can then be used for subsequent messages to avoid routing. If the rendezvous point fails, Pastry is used to find a new one. Multicast messages are disseminated from the rendezvous point along the multicast tree. A single rendezvous point is used, allowing it to perform access control.

#### Repairing the multicast tree

Parents periodically send heartbeat messages to all their children; if a child fails to receive such a message, it can assume that the parent has failed. In that case, it simply routes a `JOIN` message again using Pastry, which delivers the message to a new parent, fixing the tree.

To handle failure of the rendezvous point, the state containing the group creator and an access control list is replicated across the `k` closest nodes to the root node (a typical value of `k` is 5). If the root fails, its children detect it and call `JOIN` through Pastry; the message is then routed to a new root.
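The repair rule (no heartbeat from the parent for too long, so route a fresh `JOIN` through Pastry) can be sketched as follows; the timings, names and callback are illustrative assumptions, not values from the paper:

```python
import time

HEARTBEAT_INTERVAL = 5.0                  # how often a parent sends heartbeats
FAILURE_TIMEOUT = 3 * HEARTBEAT_INTERVAL  # silence threshold before repair

class TreeChild:
    """One node's view of its parent in the multicast tree."""

    def __init__(self, group_id, route_join):
        self.group_id = group_id
        self.route_join = route_join      # routes a JOIN message via Pastry
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        # Called whenever a heartbeat arrives from the parent.
        self.last_heartbeat = time.monotonic()

    def check_parent(self):
        # If the parent has been silent for too long, assume it failed and
        # re-route a JOIN: Pastry delivers it to a new parent, fixing the tree.
        if time.monotonic() - self.last_heartbeat > FAILURE_TIMEOUT:
            self.route_join(self.group_id)
```

The nice property is that repair needs no global coordination: the child's `JOIN` is routed exactly like a fresh join.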
## Related Work
- [Overcast](https://www.researchgate.net/publication/220851762_Overcast_Reliable_Multicasting_with_an_Overlay_Network)
- [Narada](https://courses.cs.washington.edu/courses/csep561/08au/papers/chu-jsac02.pdf)
- [Bayeux](https://people.eecs.berkeley.edu/~adj/publications/paper-files/bayeux.pdf)
--------------------------------------------------------------------------------
/notes/2020-05-05-byzantine-generals.md:
--------------------------------------------------------------------------------
# Notes: The Byzantine Generals' Problem

Distributed systems need to be able to deal with failures. An often overlooked failure is when nodes send conflicting information to different parts of the system.

Assume an army camped outside an enemy fortress that must coordinate an attack. The generals can only communicate through messengers, and some of them may be traitors trying to prevent the loyal generals from reaching agreement.

We must have an algorithm where:

1. All loyal generals agree on the same plan. (Traitors may do anything they wish.)
2. The traitors cannot cause the loyal generals to adopt a bad plan.

It is hard to say what a bad plan is, so we say that a bad plan is one which the majority clearly did not agree to. In a majority vote, the traitors can only affect the outcome if the loyal generals were almost equally divided, in which case neither decision could be called bad.

The first condition is satisfied by ensuring all generals use the same method for combining received messages.

The easiest approach would be for every general to send his opinion by messenger to every other general. This does not work: to satisfy condition 1, every general should obtain the same values, yet a traitor may send different values to different generals.

For condition 1 to be satisfied, every loyal general must obtain the same information.
We cannot simply use the value obtained directly from a general, since a traitorous general may send different values to different generals; a loyal general might then use a value for general `i` that is different from the value general `i` actually sent. Additionally, if a general is loyal, then the value he sends must be used by every other loyal general.

So condition 1 can be rewritten as follows: any two loyal generals use the same value for a specific general, whether or not that specific general is loyal.

We can therefore restrict our consideration to the problem of how a single general sends his value to the others, and simplify the problem to the following:

A commanding general must send an order to his `n - 1` lieutenants such that:
- **IC1**: All loyal lieutenants obey the same order.
- **IC2**: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends.

IC1 follows from IC2 if the commander is loyal. These conditions are called the interactive consistency conditions.

The Byzantine Generals Problem may seem simple, but if the generals can only send oral messages then no solution works unless more than two-thirds of the generals are loyal; with 3 generals, no solution works in the presence of a single traitor. An oral message is one whose contents are completely under the control of the sender, so a traitor can send any possible message.

So let's show that for **oral messages** no solution can work for 3 generals with one traitor. We allow the two decisions `attack` and `retreat`.

First consider this: the commander is loyal and sends an `attack` order, but one lieutenant is a traitor and reports to the other that he received a `retreat` order. For **IC2** to be satisfied, the loyal lieutenant must obey the `attack` order.
Now consider a scenario where the commander is a traitor and sends an `attack` order to one lieutenant and a `retreat` order to the other. In both scenarios, the loyal lieutenant does not know who the traitor is, so to him the two scenarios are identical.

If the traitor continues to lie, the loyal lieutenant has no way to distinguish the two situations and must obey the `attack` order. So whenever a lieutenant receives an `attack` order from the commander, he must obey it. However, the same argument works in the other direction: if a lieutenant receives a `retreat` order from the commander, he must obey it even if the other lieutenant states that he received an `attack` order.

Therefore, in our second scenario, one lieutenant obeys the `attack` order while the other obeys the `retreat` order, violating IC1. This shows that there is no solution that works for 3 generals in the presence of a single traitor.

From this it follows that no solution can work for `3m` generals with `m` traitors.

It may seem like the Byzantine Generals Problem stems from the requirement of reaching exact agreement. However, reaching approximate agreement is just as hard.

Assume the generals need to agree on an approximate time of attack. We require the following conditions:
- **IC1'** - All loyal lieutenants attack within 10 minutes of one another.
- **IC2'** - If the commanding general is loyal, then every loyal lieutenant attacks within 10 minutes of the time given by the commander.

The time at which an order is received is irrelevant; only the attack time it contains matters. Orders are sent far enough in advance so agreement can be reached in time.

This problem is unsolvable, like the Byzantine Generals Problem, unless more than two-thirds of the generals are loyal.
We can prove this: if we had a three-general solution that coped with one traitor, we could use it to construct a three-general solution to the Byzantine Generals' Problem that also works in the presence of one traitor, which we know to be impossible.

The commander orders an attack by sending an attack time of `1:00`, and orders a retreat by sending an attack time of `2:00`. Each lieutenant uses the following procedure to obtain the order:

1. Upon receiving a message, a lieutenant does one of the following:
   1. If the time is 1:10 or earlier, then attack.
   2. If the time is 1:50 or later, then retreat.
   3. Otherwise, continue to step 2.
2. Ask the other lieutenant what decision he reached in step 1.
   1. If the other lieutenant reached a decision, make the same decision he did.
   2. Otherwise, retreat.

If the commander is loyal, then a loyal lieutenant will obtain the correct order in step 1, so IC2 is satisfied. If the commander is loyal, IC1 follows from IC2, so we only need to prove IC1 under the assumption that the commander is a traitor. Since there is only one traitor, both lieutenants are loyal. By IC1', if one lieutenant decides to attack in step 1, then the other cannot decide to retreat in step 1; he either also attacks or defers his decision until step 2. Either way both arrive at the same decision, so IC1 is satisfied. We have thus constructed a solution to the Byzantine Generals Problem that handles a single traitor, which is impossible. Hence there is no three-general algorithm that maintains IC1' and IC2' in the presence of a traitor.

This three-general impossibility can then be used, via simulation, to prove that no solution with fewer than `3m + 1` generals can cope with `m` traitors.
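The two-step procedure translates almost directly into code (a sketch, not the paper's notation; times are encoded as minutes, so `1:00` is `60`, and `None` means "no decision reached in step 1"):

```python
def step1(attack_minute):
    # Times in minutes: 1:00 -> 60, 1:10 -> 70, 1:50 -> 110, 2:00 -> 120.
    if attack_minute <= 70:
        return "attack"      # 1:10 or earlier
    if attack_minute >= 110:
        return "retreat"     # 1:50 or later
    return None              # otherwise defer to step 2

def decide(attack_minute, other_step1):
    # other_step1 is what the other lieutenant says he decided in step 1.
    mine = step1(attack_minute)
    if mine is not None:
        return mine
    # Step 2: copy the other lieutenant's step-1 decision, else retreat.
    return other_step1 if other_step1 is not None else "retreat"

# A loyal commander sending 1:00 gives both lieutenants "attack" in step 1.
assert decide(60, step1(60)) == "attack"
```

A traitorous commander can force the deferral branch (e.g. by sending `1:30`), which is exactly where the reduction to the original problem bites.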
## A solution with Oral Messages

It was shown that to deal with `m` traitors, `3m + 1` generals are needed. We now construct a solution for `3m + 1` generals.

An oral message is embodied in the following assumptions:
1. Every message that is sent is delivered correctly.
2. The receiver of a message knows who sent it.
3. The absence of a message can be detected.

Assumptions 1 and 2 mean a traitor cannot interfere in the communication between two other generals. Assumption 3 stops a traitor from preventing a decision by simply not sending messages.

The algorithm requires that generals can send messages directly to every other general.

A traitorous commander may decide not to send any order. Since the lieutenants must still obey the same order, we need a default; here we use `retreat`.

Let's define our algorithm `OM(m)` for all nonnegative integers `m`, where a commander sends an `attack` or `retreat` order to `n - 1` lieutenants.

The algorithm assumes a function called `majority`, which returns either the majority of the received orders or the default value, in our case `retreat`. The algorithm only requires this majority property.

`OM(0)` works as follows:
1. The commander sends his value to every lieutenant.
2. Each lieutenant uses the value he receives from the commander, or `retreat` if no value was received.

Now for `OM(m), m > 0`:
1. The commander sends his value to every lieutenant.
2. Each lieutenant stores the received value, or `retreat` if no value was received. The lieutenant then acts as the commander to send that value to the other lieutenants, by executing `OM(m - 1)`.
3. Each lieutenant stores the values received from the other lieutenants in step 2, or `retreat` where no value was received, and then uses the `majority` of all these stored values.
Consider the example with 4 generals and a single traitorous lieutenant. The commander sends his value `v` to all the lieutenants; each loyal lieutenant relays the value he received to the others, while the traitor sends some arbitrary value. A loyal lieutenant has then received 3 values and obtains the correct one by taking the `majority`: the value he received from both the other loyal lieutenant and the commander.

Next, let's see what happens if the commander is a traitor. The commander sends 3 arbitrary values to the lieutenants. Each lieutenant will receive all 3 of those values, as the lieutenants invoke the algorithm recursively and exchange the values received from one another. This means they all take the majority of the same 3 arbitrary values and therefore agree.

## A solution with signed messages

The Byzantine Generals' Problem is hard to solve because the traitors can lie; however, we can remove this ability. One way is to have the generals send unforgeable signed messages. This adds the following assumptions to those previously made about oral messages:
1. A loyal general's signature cannot be forged, and any alteration of the contents of his signed message can be detected.
2. Anyone can verify the authenticity of a general's signature.

We make no assumptions about the signatures of traitorous generals, which means traitors may collude. With signatures introduced, the previous argument that we need `3m + 1` generals no longer holds: we can now construct an algorithm that copes with `m` traitors for any number of generals.

In the new algorithm the commander sends a signed message to the lieutenants.
Each lieutenant then adds his signature to the order and sends it to the others, who in turn add their signatures and send it on, and so on. A general can now tell that another general is a traitor by checking whether several differently signed orders were received originating from that same general.

--------------------------------------------------------------------------------
/notes/2020-05-05-effective-manycast-messaging-for-kad.md:
--------------------------------------------------------------------------------
# Notes: [Effective Manycast Messaging for Kademlia Network](https://cs.baylor.edu/~donahoo/papers/MCD15.pdf)

Kademlia only allows contacting a single network peer. This paper describes an extension to Kademlia providing group communication.

Every group has an identifier, the same way every node has one. Nodes can contact members of a group in three ways:
- **Anycast** - a message delivered to a single member.
- **Multicast** - delivered to all members.
- **Manycast** - delivered to `N` group members, where `N` is defined by the sender.

> Interesting note: a benefit of Kademlia over Pastry is that Kademlia uses only one algorithm to find a specific node. Pastry instead uses one algorithm for half the distance and then switches to another: the first uses the routing table, the second the leaf set.
>
> The nodes found in the first stage may be geographically farther away, so Kademlia may have faster delivery times.
>
> Kademlia, however, requires far more lookup messages.

For this Kademlia group extension a tree architecture is used, similar to Scribe, with all messages sent to the root node. Kademlia does not use forwarding, but lookups: when a node receives a lookup for a group that it is a part of, it does not continue the routing algorithm but instead sends back a message saying that it is the recipient.
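That lookup short-circuit can be sketched like this (hypothetical structure and names; real Kademlia lookups are iterative and considerably more involved):

```python
class Peer:
    def __init__(self, node_id, groups):
        self.node_id = node_id
        self.groups = set(groups)   # group IDs this peer is a member of

def handle_group_lookup(peer, group_id, reply, continue_lookup):
    # A peer that already belongs to the target group stops the lookup and
    # answers itself: it becomes the join point / anycast receiver.
    if group_id in peer.groups:
        reply(peer.node_id)
    else:
        continue_lookup(group_id)
```

The same short-circuit serves joins, anycasts, and the first phase of multicast and manycast, as described below.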
When a peer wants to join a group, it sends a lookup message to find the group. If a group member is found during the lookup, the join message can be sent to that member, making it the join point.

An anycast message is sent the same way as a join message: the first group node to receive it is the anycast receiver.

A multicast message is received by all group members. The first phase is the same as sending an anycast message; then the message is distributed inside the group tree. Every group member forwards the message to its descendants and parent, ignoring the node from which the message came. This has a complexity of `O(log(n) + m)`, where `n` is the number of peers in the network and `m` is the number of tree members.

Sending a manycast message is divided into two phases, the first being the same as anycast. The second phase finds the desired number of manycast receivers using Kademlia lookups, and the message is then sent to them. This has a complexity of `O(log(n) + m)`, where `n` is the number of peers in the network and `m` is the desired number of receivers.

The tree must be fault-tolerant: every member periodically sends a control message to its parent. If the parent cannot be reached, the node finds another group member who is not its descendant. Upon receiving a join, a group member must verify the sender is not its ancestor.

## Test Results

Scribe has better delivery times for fewer manycast receivers, Kademlia for more. Kademlia puts a lower load on the network but requires more traffic, and the difference in traffic is far larger than the difference in delivery time. Kademlia is better for clients with bigger connection limits.
--------------------------------------------------------------------------------
/notes/2020-05-12-pastry.md:
--------------------------------------------------------------------------------
# Notes: [Pastry](http://rowstron.azurewebsites.net/PAST/pastry.pdf)

**Quick Facts**
- Like most DHTs, Pastry routes messages to the numerically closest node.
- Pastry requires `O(log N)` routing steps, where `N` is the number of nodes in the network.
- At each node along the route the application is notified and may perform various actions with the message.

Nodes are given a `128`-bit identifier. Node IDs indicate a node's position in a circular ID space from `0` to `(2^128) - 1`.

A node routes a message to a peer whose ID shares a prefix with the key that is at least one digit (`b` bits) longer than the prefix the key shares with the present node's ID. If no such node is known, the message is forwarded to a node that shares the same prefix length as the present node but is numerically closer to the key.

## Node state

Each node maintains the following:

- **Routing Table (`R`)** - The routing table is divided into rows; an entry at row `n` shares the first `n` digits of the present node's ID, but its `n + 1`th digit is one of the `2^b - 1` values other than the `n + 1`th digit of the present node's ID.
- **Neighbourhood Set (`M`)** - Contains the nodes closest according to the proximity metric. It is not usually used for routing but is useful for maintaining locality properties.
- **Leaf Set (`L`)** - The set of nodes with the numerically closest larger and smaller node IDs relative to the present node. It is used in routing.

The information stored for a specific node usually includes things such as its IP address.

## Routing

Whenever a message arrives at a node, the following procedure is executed:
1.
If the message's target falls within the range of the leaf set, the message is forwarded directly to the closest node in the leaf set.
27 | 2. If not, the routing table is used and the message is forwarded to a node whose id shares a common prefix with the key that is longer by at least one more digit.
28 | 3. If no such node can be found at step 2, the message is forwarded to a node that shares the same common prefix length as the present node but is numerically closer to the key.
29 |
30 | **Notes**
31 | - If a message is forwarded using the routing table, each step reduces the set of nodes whose ids have a longer prefix match with the key by a factor of `2^b`, meaning messages are delivered within `log_{2^b}(N)` steps.
32 | - If the key is within the range of the leaf set, the message is at most 1 step away.
33 |
34 | In the event of many simultaneous node failures, the number of routing steps required may be linear in `N`. Routing performance degrades gradually with the number of node failures.
35 |
36 | ## Pastry API
37 |
38 | - **`route(msg, key)`** - routes the message to the node with the id numerically closest to the key.
39 | - **`deliver(msg, key)`** - called by Pastry when the node's id is numerically closest to the key.
40 | - **`forward(msg, key, nextId)`** - called before a message is forwarded to the node with the id `nextId`. The application may change the message contents or `nextId`; setting it to `NULL` stops the message from being forwarded.
41 | - **`newLeafSet(leafSet)`** - called by Pastry whenever the leaf set is updated.
42 |
43 | ## Self Organization
44 |
45 | Now, let's look at what happens when nodes leave or join the network.
46 |
47 | - **Arrival** - to join a network, a node must first initialize its node state. It does this by sending a special `join` message to another node, with yet another node as the target.
Pastry routes the message to the node numerically closest to the target id. All nodes that encounter the message, from the first to the last, send their state tables to the joining node, which can request more information from a specific node if it needs to. The joining node then informs all required nodes of its arrival.
48 |
49 | @TODO DISCUSS FULL ARRIVAL ALGORITHM
50 |
51 | - **Departure** - a node may fail or depart without warning. A node is considered failed when its immediate neighbours in the neighbourhood set can no longer communicate with it. To replace a failed node in a leaf set, a node contacts the node with the largest index on the leaf-set side of the failed node and asks it for its leaf set. The received leaf set will partially overlap with the node's own, but will also contain nodes not currently in it; the node picks an appropriate one to add to its leaf set, verifying that the node is actually alive. To replace a failed node in the routing table, a node contacts another entry in the same row and asks for that node's entry at the same index; if none of the entries are appropriate, it contacts a node in the next row. Eventually it will find a replacement. The neighbourhood set is not used for routing but must still be kept up to date. To do so, a node periodically contacts every neighbour to check whether it is still alive; if one is not, it asks all the other neighbours for their neighbours, checks the distances of those, and updates the set accordingly.
52 |
53 | @TODO LOCALITY & ARBITRARY FAILURES
54 |
--------------------------------------------------------------------------------
/notes/2020-05-26-quic.md:
--------------------------------------------------------------------------------
1 | # Notes: QUIC
2 |
3 | QUIC (Quick UDP Internet Connections) is a transport protocol that aims to replace TCP and TLS.
4 |
5 | QUIC was originally developed at Google and then later adopted by the IETF.
The IETF version has long diverged from the original Google version and is considered a new protocol.
6 |
7 | **Notes**
8 | - QUIC connections fail to be negotiated around 8% of the time.
9 |
10 | ## Security
11 |
12 | One of the main goals of QUIC is to provide a secure-by-default transport protocol, with authentication and encryption handled by the transport protocol itself rather than at a higher layer like TLS.
13 |
14 | QUIC keeps the TLS handshake messages while replacing the record layer with its own framing format. The same three-way handshake you get with TCP is performed by QUIC.
15 |
16 | QUIC connections are automatically authenticated and encrypted, and only require a single round trip to establish.
17 |
18 | Additionally, QUIC encrypts connection metadata, meaning middleboxes cannot interfere with connections. Encrypting this metadata means nobody other than the endpoints in the connection can correlate activity.
19 |
20 | ## Multiplexing
21 |
22 | QUIC provides support for multiplexing, such that different HTTP streams can in turn be mapped to different QUIC transport streams.
23 |
24 | The same QUIC connection is shared, but streams are independent, meaning packet loss only affects one stream.
25 |
26 | ## NAT
27 |
28 | Typical NATs can precisely manage the lifetime of NAT bindings. With QUIC this is not yet possible, because NAT routers do not currently understand QUIC and so fall back on default UDP flows. These often use arbitrary timeouts, which could affect long-running connections.
29 |
30 | ## Connection Migration
31 |
32 | QUIC has a feature called connection migration, which allows QUIC endpoints to migrate connections to different IP addresses at will. This means a mobile client can migrate a connection from cellular data to Wi-Fi when a network becomes available.
33 |
34 | ## QPACK
35 |
36 | QPACK provides header compression. It cannot work the same way as HTTP/2's HPACK due to the way QUIC streams work: an additional stream is created to send QPACK table updates, and another to acknowledge those updates.
37 |
38 | ## Reflection
39 |
40 | UDP-based protocols are susceptible to [reflection attacks](https://blog.cloudflare.com/reflections-on-reflections/). To mitigate this, QUIC provides source-address verification, which means a server can be sure a client is not spoofing IP addresses. The downside is that this mitigation increases the initial handshake from one round trip to two.
41 |
42 | ## Criticism
43 |
44 | Juho Snellman argues that the encryption of layer-4 headers provides no meaningful privacy or security benefits; instead, the goal is to break middleboxes. The designers do this because they believe TCP cannot evolve due to these middleboxes: TCP protocol extensions have to jump through a lot of hurdles to reduce the damage that middleboxes can cause.
45 |
46 | Middleboxes serve a critical role, so crippling them isn't a great idea. But even if this were not the case, readable headers enable troubleshooting.
47 |
48 | ## References
49 | - https://blog.cloudflare.com/the-road-to-quic/
50 | - https://www.snellman.net/blog/archive/2016-12-01-quic-tou/
51 |
--------------------------------------------------------------------------------
/notes/2020-05-26-reliable-systems-series-model-based-testing.md:
--------------------------------------------------------------------------------
1 | # Notes: [Reliable Systems Series: Model-Based Testing](https://medium.com/@tylerneely/reliable-systems-series-model-based-property-testing-e89a433b360)
2 |
3 | **Models** - In this context, not anything formal: just simple code that behaves in a way that tracks our expectations. We run automated tests against the model and an implementation; if they diverge, we have found a bug.
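The model-versus-implementation idea can be made concrete with a small sketch. The "model" below is a plain dict, the "implementation" is a hypothetical toy store (both invented for illustration); we drive both with the same seeded random operations and assert they never diverge.

```python
import random

class ToyStore:
    """Stand-in for the 'real' implementation under test (illustrative)."""
    def __init__(self):
        self._data = {}
    def put(self, k, v):
        self._data[k] = v
    def get(self, k):
        return self._data.get(k)

def check_against_model(seed, steps=1000):
    rng = random.Random(seed)   # seeded, so any failure is reproducible
    model = {}                  # the model: trivially-correct code
    impl = ToyStore()
    for _ in range(steps):
        k = rng.randrange(8)
        if rng.random() < 0.5:
            v = rng.randrange(100)
            model[k] = v
            impl.put(k, v)
        # Invariant: model and implementation agree after every step.
        assert impl.get(k) == model.get(k), f"diverged on key {k} (seed={seed})"

for seed in range(20):
    check_against_model(seed)
```

A real property-testing library like QuickCheck would also generate and shrink the operation sequences for you; the point here is only the shape of the test: same inputs to both, divergence means a bug.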
4 |
5 | [QuickCheck](https://hackage.haskell.org/package/QuickCheck) is a property-testing library. Property testing checks whether a given property holds for a set of inputs.
6 |
7 | This post is meant to summarize how to apply some of the techniques found in these papers:
8 | - [Experiences with QuickCheck](https://www.cs.tufts.edu/~nr/cs257/archive/john-hughes/quviq-testing.pdf)
9 | - [Testing Monadic Code with QuickCheck](http://www.cse.chalmers.se/~rjmh/Papers/QuickCheckST.ps)
10 |
11 | The goal is to create a toy model of how something should work, and then test that the actual implementation behaves identically to the toy model. If they diverge, the model, the implementation, or both could be wrong.
12 |
13 | Models can be a cognitive aid for reasoning about what your system actually does. They can help new developers, who can look at a simplified model to grasp what the system does overall.
14 |
15 | Models are good to write before the actual implementation; it's essentially TDD. By letting the model guide the implementation, there is a single method to test your assumptions.
16 |
17 | Sometimes it is hard to build a model of everything; it may just end up looking like a second implementation, so it can be easier to track only the specific system behaviours you want to ensure.
18 |
19 | Additionally, you may only need to check a few invariants at times.
20 |
21 | > If it’s hard to model or check specific invariants of a system, it could be a symptom that the architecture is unsound, leading to unreliable implementations and high human costs for debugging over time. If you can’t describe it in a model or set of invariants, good luck getting humans to understand what should be happening.
22 |
23 | Build systems in a way that they can be single-stepped and then replayed; this restricts nondeterminism and increases the observability of causal relationships.
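The single-step-and-replay idea can be sketched as follows: all nondeterministic inputs flow through one recorded log, so a failing run can be replayed step by step with identical behaviour. The design and names here are hypothetical.

```python
import random

class SteppedCounter:
    """Toy state machine whose only input is a stream of deltas."""
    def __init__(self):
        self.value = 0
    def step(self, delta):
        self.value += delta
        return self.value

def live_run(seed, steps):
    """Run with randomness, recording every input as it happens."""
    rng = random.Random(seed)
    log, machine = [], SteppedCounter()
    for _ in range(steps):
        delta = rng.choice([-1, 1, 2])  # the only source of nondeterminism
        log.append(delta)               # record it in the event log
        machine.step(delta)
    return machine.value, log

def replay(log):
    """Single-step a fresh machine through the recorded log."""
    machine = SteppedCounter()
    for delta in log:
        machine.step(delta)
    return machine.value

final, log = live_run(seed=42, steps=100)
assert replay(log) == final  # the replay reproduces the run exactly
```

Because every input is in the log, you can also stop the replay at any step to inspect intermediate state, which is where the observability of causal relationships comes from.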
24 |
25 | Model-based tests can eliminate the need for hand-written unit tests, with far less code and far more coverage.
26 |
27 | Model-based testing exercises far more possibilities than normal tests and avoids the biases of the creators. Generating regression tests saves time when trying to reproduce subtle issues.
28 |
29 |
--------------------------------------------------------------------------------
/notes/2020-06-16-lsmt.md:
--------------------------------------------------------------------------------
1 | # Notes: Log Structured Merge Trees
2 |
3 | Files are stored on disk; in memory, you store the set of changes to a file. Once memory is full, these changes are moved to the filesystem. Files are not updated in place; instead, new files are created.
4 |
5 | Writes are always fast because they always go to memory. We keep indexes of the data in memory too, so reading stays easy.
6 |
7 | ## References
8 |
9 | - http://www.benstopford.com/2015/02/14/log-structured-merge-trees/
10 | - https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
11 | - http://paperhub.s3.amazonaws.com/18e91eb4db2114a06ea614f0384f2784.pdf
12 |
--------------------------------------------------------------------------------
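The LSM write/read path from the notes above can be sketched in miniature: writes go to an in-memory table; when it fills up, it is flushed as a new immutable "file" (a sorted list here); reads check memory first, then files from newest to oldest. This is a toy illustration with invented names, not any real LSM implementation.

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.segments = []   # immutable on-"disk" files, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: write a new sorted, immutable segment;
            # existing segments are never updated in place.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # freshest data lives in memory
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            for k, v in segment:
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)      # triggers a flush of {"a": 1, "b": 2}
db.put("a", 3)      # newer value for "a" lives in the memtable
assert db.get("a") == 3 and db.get("b") == 2
```

A real LSM tree would also merge (compact) segments in the background and keep sparse indexes over them, which is what the benstopford and igvita references cover.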