├── docs ├── 05_yottafs │ ├── sata.md │ ├── warp.md │ ├── 21_yottafs.md │ ├── blockFormat.md │ ├── trim.md │ ├── hashTable.md │ ├── yfs.md │ └── 11_introduction.md ├── 10_benchmarks │ └── benchmarks.md ├── 01_architecture │ ├── 21_principles.md │ ├── 41_consistency.md │ ├── 32_instructions.md │ ├── 11_problem.md │ └── 31_machine.md ├── 06_yottadb │ └── 11_introduction.md ├── 00_Introduction │ ├── 11_introduction.md │ ├── 31_related_work.md │ └── 21_business_case.md ├── 07_infrastructure │ ├── 13_sizing.md │ ├── 12_hybrid.md │ ├── 11_serverless.md │ └── 21_node_types.md ├── package.json ├── XX_pending │ ├── yottafs.md │ ├── README.md │ ├── fs.md │ ├── yottadb.md │ ├── rendezvous.md │ └── yottapack.md ├── README.md └── 03_partitioning │ ├── 11_introduction.md │ ├── 21_rebar.md │ ├── 31_gap.md │ └── 41_examples.md ├── infra ├── README.md └── package.json ├── pkgs ├── hash │ ├── README.md │ └── package.json └── gossip │ ├── README.md │ └── package.json ├── .yarnrc.yml ├── package.json ├── .editorconfig ├── yarn.lock ├── .gitignore ├── LICENSE └── README.md /docs/05_yottafs/sata.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/05_yottafs/warp.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /infra/README.md: -------------------------------------------------------------------------------- 1 | # infra 2 | -------------------------------------------------------------------------------- /docs/05_yottafs/21_yottafs.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/10_benchmarks/benchmarks.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pkgs/hash/README.md: -------------------------------------------------------------------------------- 1 | # hash 2 | -------------------------------------------------------------------------------- /docs/01_architecture/21_principles.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/01_architecture/41_consistency.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/06_yottadb/11_introduction.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pkgs/gossip/README.md: -------------------------------------------------------------------------------- 1 | # gossip 2 | -------------------------------------------------------------------------------- /docs/00_Introduction/11_introduction.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/01_architecture/32_instructions.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/07_infrastructure/13_sizing.md: 
-------------------------------------------------------------------------------- 1 | queueing theory 2 | -------------------------------------------------------------------------------- /.yarnrc.yml: -------------------------------------------------------------------------------- 1 | yarnPath: .yarn/releases/yarn-3.2.1.cjs 2 | -------------------------------------------------------------------------------- /docs/07_infrastructure/12_hybrid.md: -------------------------------------------------------------------------------- 1 | standby replicas 2 | -------------------------------------------------------------------------------- /docs/00_Introduction/31_related_work.md: -------------------------------------------------------------------------------- 1 | // TODO: put related work 2 | -------------------------------------------------------------------------------- /docs/07_infrastructure/11_serverless.md: -------------------------------------------------------------------------------- 1 | compute decoupled from storage 2 | -------------------------------------------------------------------------------- /docs/07_infrastructure/21_node_types.md: -------------------------------------------------------------------------------- 1 | random (reads) or linear (writes) 2 | -------------------------------------------------------------------------------- /docs/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "docs", 3 | "packageManager": "yarn@3.2.1" 4 | } 5 | -------------------------------------------------------------------------------- /infra/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "infra", 3 | "packageManager": "yarn@3.2.1" 4 | } 5 | -------------------------------------------------------------------------------- /pkgs/hash/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "hash", 3 | "packageManager": "yarn@3.2.1" 4 | } 5 | -------------------------------------------------------------------------------- /pkgs/gossip/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "gossip", 3 | "packageManager": "yarn@3.2.1" 4 | } 5 | -------------------------------------------------------------------------------- /docs/00_Introduction/21_business_case.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | - BigQuery + ML cost 2000 euros 4 | - DynamoDB, but with each column indexed 5 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "yottaStore", 3 | "workspaces": [ 4 | "docs", 5 | "infra", 6 | "pkgs/*" 7 | ], 8 | "packageManager": "yarn@3.2.1" 9 | } 10 | -------------------------------------------------------------------------------- /.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | end_of_line = lf 5 | insert_final_newline = true 6 | 7 | [*.{js,json,yml}] 8 | charset = utf-8 9 | indent_style = space 10 | indent_size = 2 11 | -------------------------------------------------------------------------------- /docs/05_yottafs/blockFormat.md: -------------------------------------------------------------------------------- 1 | # Block Format 2 | 3 | Serialization: 4 | 5 | - 
[Flatbuffers](https://en.wikipedia.org/wiki/FlatBuffers) 6 | - [Avro](https://en.wikipedia.org/wiki/Apache_Avro) 7 | 8 | 9 | Error Correction: 10 | 11 | CRC32-C 12 | 13 | -------------------------------------------------------------------------------- /docs/XX_pending/yottafs.md: -------------------------------------------------------------------------------- 1 | # Drivers 2 | 3 | - classic: buffered io, xfs 4 | - direct: direct io, xfs 5 | - uring-classic: classic with io_uring 6 | - uring-direct: direct with io_uring 7 | - uring-nvme: NVMe commands 8 | - uring-zns: Zoned namespace 9 | 10 | 11 | # Namespace type 12 | 13 | - Random 14 | - Linear -------------------------------------------------------------------------------- /yarn.lock: -------------------------------------------------------------------------------- 1 | # This file is generated by running "yarn install" inside your project. 2 | # Manual changes might be lost - proceed with caution! 3 | 4 | __metadata: 5 | version: 6 6 | 7 | "yottaStore@workspace:.": 8 | version: 0.0.0-use.local 9 | resolution: "yottaStore@workspace:." 10 | languageName: unknown 11 | linkType: soft 12 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .yarn/* 2 | !.yarn/patches 3 | !.yarn/plugins 4 | !.yarn/releases 5 | !.yarn/sdks 6 | !.yarn/versions 7 | 8 | # Swap the comments on the following lines if you don't wish to use zero-installs 9 | # Documentation here: https://yarnpkg.com/features/zero-installs 10 | !.yarn/cache 11 | #.pnp.* 12 | 13 | 14 | # IDE 15 | .idea 16 | .vscode 17 | -------------------------------------------------------------------------------- /docs/01_architecture/11_problem.md: -------------------------------------------------------------------------------- 1 | # Problem 2 | 3 | Here's the problem we're facing: 4 | 5 | Easy: 6 | - As an engineer, I don't want to care about scale 7 | - As an engineer, I want easy migrations 8 | - As an engineer, I want indexes 9 | 10 | 11 | Powerful: 12 | - As a customer, I want availability 13 | - As a customer, I want low latency 14 | - As a business leader, I want reliability 15 | - As a business leader, I want transactions 16 | 17 | Efficient: 18 | -------------------------------------------------------------------------------- /docs/XX_pending/README.md: -------------------------------------------------------------------------------- 1 | # Readme 2 | 3 | # Architecture 4 | 5 | - Yottastore: compute layer 6 | - Yottafs: storage layer 7 | - Client 8 | 9 | 10 | # Modules 11 | 12 | - Yottadb: module for yottastore 13 | - Self: awareness and optimizations, ML-based 14 | - Rendezvous: module for yottastore 15 | - Gossip: module for yottastore and yottadb 16 | - Yottapack: optimal serialization format 17 | 18 | 19 | ## Yottadb 20 | 21 | Adds advanced features to yottastore. 22 | Possible drivers are: 23 | 24 | - KeyValue: simplest one, default 25 | - Columnar: Column values 26 | - Document: For collections 27 | - PubSub: optimized for message queues 28 | - Indexed: supports queries, based on B-trees 29 | - Graph: optimized for trees and graphs 30 | -------------------------------------------------------------------------------- /docs/05_yottafs/trim.md: -------------------------------------------------------------------------------- 1 | # Efficient trim algorithm 2 | 3 | Concurrent, parallel, state machine.
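As a rough illustration of the per-zone state described below, here is a minimal sketch in Go; the names `zoneState`, `markDeleted` and `shouldCompact` are illustrative assumptions, not the actual implementation.

```go
package yottafs

import "math/bits"

// zoneState sketches the per-erase-zone state described below:
// one bit per sector, set to 1 once the sector has been deleted.
type zoneState struct {
	deleted []uint64 // 1 bit per sector, packed into 64-bit words
}

// markDeleted sets the bit for the given sector (the delete op).
func (z *zoneState) markDeleted(sector uint) {
	z.deleted[sector/64] |= 1 << (sector % 64)
}

// shouldCompact reports whether the share of deleted sectors has crossed
// the threshold that triggers the compact ops.
func (z *zoneState) shouldCompact(threshold float64) bool {
	ones := 0
	for _, w := range z.deleted {
		ones += bits.OnesCount64(w)
	}
	total := len(z.deleted) * 64
	return total > 0 && float64(ones)/float64(total) > threshold
}
```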
4 | 5 | Each erase zone has an array of 1 bit per sector/word (state). 6 | (a bloom filter with a perfect hash) 7 | 8 | 512 MB for every 16 TB of SSD. 9 | 10 | ## Delete ops 11 | 12 | YFS marks the bit for each sector involved as 1. 13 | 14 | If the sum of 1s is greater than a threshold, invoke compact ops. 15 | 16 | ## Compact ops 17 | 18 | - A new, empty zone is selected and frozen. 19 | - YFS freezes the state and sends it to the compact process. 20 | - Any further deletes are stored in a temp queue 21 | - The compact process reads the state, and issues a batch copy command to YFS for each valid sector 22 | - When done, YFS issues an erase zone command 23 | - YFS applies the pending delete ops 24 | 25 | 26 | ## Advantages 27 | 28 | - Very fast 29 | - Never blocking 30 | - Ultra-low latencies 31 | -------------------------------------------------------------------------------- /docs/XX_pending/fs.md: -------------------------------------------------------------------------------- 1 | # A pragmatic implementation of Wait Free Persistent Storage 2 | 3 | ## Abstract 4 | 5 | We present a pragmatic implementation of a wait-free persistent storage, designed 6 | around NVMe characteristics and asynchronous IO with `io_uring`. 7 | 8 | 9 | 10 | ## Introduction 11 | 12 | In the context of distributed datastores, latency and predictability are key metrics, 13 | with implications for the transactional throughput and the cost of the system. 14 | 15 | 16 | ## Constraints 17 | 18 | ## Related work 19 | 20 | ## The algorithm 21 | 22 | The fundamental data structure of the algorithm is the Record: 23 | 24 | ``` 25 | 26 | const ( 27 | Size = 4096 // Underlying block size 28 | ) 29 | type Block [Size]byte 30 | 31 | type Record struct { 32 | Body []Block 33 | Tails [][]Block 34 | Appends [] 35 | } 36 | 37 | ``` 38 | 39 | ## XFS implementation 40 | 41 | The fundamental structure is a 42 | 43 | 44 | ## NVMe implementation 45 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Alberto 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE.
22 | -------------------------------------------------------------------------------- /docs/XX_pending/yottadb.md: -------------------------------------------------------------------------------- 1 | # Yottadb 2 | 3 | 4 | ## Drivers 5 | 6 | ### Key-value 7 | 8 | Basic driver: given a record in the form: 9 | - `account@[driver:]tableName/recordName[/recordRow]` 10 | 11 | it allows finding which shard contains the binary data associated 12 | with the record. 13 | 14 | Possible operations are: 15 | 16 | - Read 17 | - Write 18 | - Append 19 | - Delete 20 | 21 | ### Columnar 22 | 23 | Like key-value, but optimized for storing very long columns. 24 | 25 | Possible operations are: 26 | 27 | - Read 28 | - Write 29 | - Append 30 | - Delete 31 | 32 | ### Document 33 | 34 | An improvement over the key-value driver, it allows dealing with 35 | documents instead of plain binary data. Possible operations are: 36 | 37 | - Read document 38 | - Write document 39 | - Update document 40 | - Delete document 41 | - Get collection 42 | - Create collection 43 | - Update collection 44 | - Delete collection 45 | 46 | ### PubSub 47 | 48 | An improvement over the columnar store, to handle queues. Possible 49 | operations are: 50 | 51 | - CRUD Topic 52 | - Publish 53 | - CRUD Subscription 54 | - Consume 55 | 56 | ### Indexed 57 | 58 | An improvement over the document driver, it uses the 59 | columnar driver to create indexes. 60 | 61 | ### Graph 62 | 63 | Uses the key-value store to represent a graph. 64 | -------------------------------------------------------------------------------- /docs/XX_pending/rendezvous.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | ## Tree Layers 4 | 5 | - Region: Picked at config time 6 | - Zone: DC, or group of closely related DCs 7 | - L1,...Ln: Skeleton construction 8 | - Node: Node, or group of closely related nodes 9 | - Shard: Namespace, or group of closely related namespaces 10 | - Blocks: List of blocks containing data 11 | 12 | 13 | ## Walk the tree 14 | 15 | Perform rendezvous at each layer of the tree, inspired by the 16 | skeleton construction. 17 | 18 | ## Collection step 19 | 20 | For each collection and node type (random, linear, compute): 21 | 22 | - `R`: Replication factor 23 | - `MAX_LOAD`: Maximum number of shards from the same 24 | collection in a node, power of 2 25 | - `S = N * MAX_LOAD`: Shard factor, greater than or equal to `MAX_LOAD` 26 | - `N`: Number of nodes, power of 2 27 | 28 | TODO: handle dynamic S per collection 29 | 30 | - Pick regions at config time for each collection 31 | - Pick `R` walks starting at the root of the tree 32 | - At the next level of the tree, pick `N` walks 33 | until you reach a node 34 | - (Optional): Add `log(N)` walks, to pick failover nodes 35 | - For each node pick `MAX_LOAD` shards 36 | 37 | TODO: handle weights and dynamic MAX_LOAD per node 38 | 39 | We end up with a pool of `R * S` shards, distributed among `N` nodes 40 | in `R` regions/zones. We also have a pool of `R * log(N)` nodes as 41 | backup.
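As a rough sketch of a single level of such a walk (the function names, the `/` separator and the direct use of SHA-512 here are illustrative assumptions, not the actual implementation), each walk can be seen as repeatedly picking the highest-scoring children at every layer of the tree:

```go
package rendezvous

import (
	"crypto/sha512"
	"encoding/binary"
	"sort"
)

// score ranks a child of the current tree layer for a given key, using the
// highest-random-weight (rendezvous) construction.
func score(key, child string) uint64 {
	sum := sha512.Sum512([]byte(key + "/" + child))
	return binary.BigEndian.Uint64(sum[:8])
}

// pickTopK returns the k children of one layer with the highest score for
// the key. Walking the tree means calling this at every layer (regions,
// zones, ..., nodes, shards), feeding the chosen child back into the key.
func pickTopK(key string, children []string, k int) []string {
	sorted := append([]string(nil), children...)
	sort.Slice(sorted, func(i, j int) bool {
		return score(key, sorted[i]) > score(key, sorted[j])
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}
```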
42 | 43 | ## Record step 44 | 45 | For each record in a collection, given a pool of `R*S` shards: 46 | 47 | - Start `R` walks 48 | - For each walk, pick one node 49 | - For each node, pick one shard 50 | 51 | You end up with `R` shards, which should all contain 52 | replicas of your record. 53 | 54 | ## 55 | -------------------------------------------------------------------------------- /docs/README.md: -------------------------------------------------------------------------------- 1 | # Documentation 2 | 3 | I hope that the reader who bears with reading this will then be able to fully understand the system, 4 | and to implement one themselves if they decide to do so. 5 | 6 | ### Chapter 1: Architecture 7 | 8 | - [Chapter 1.1](01_architecture/11_problem.md): The problem 9 | - [Chapter 1.2](01_architecture/21_principles.md): Design principles 10 | - [Chapter 1.3](01_architecture/31_machine.md): The distributed machine 11 | - [Chapter 1.3.1](01_architecture/32_instructions.md): The instruction set 12 | - [Chapter 1.4](01_architecture/41_consistency.md): Consistency Model 13 | 14 | 15 | ### Chapter 2: Leaderless partitioning and replication 16 | 17 | - [Chapter 2.1](03_partitioning/11_introduction.md): Introduction 18 | - [Chapter 2.2](03_partitioning/21_rebar.md): Rendezvous based routing 19 | - [Chapter 2.3](03_partitioning/31_gap.md): Gossip agreement protocol 20 | - [Chapter 2.4] Examples 21 | - [Chapter 2.4.1](03_partitioning/41_examples.md): Partitions and replicas 22 | - [Chapter 2.4.2](03_partitioning/41_examples.md): Key routing 23 | - [Chapter 2.4.3](03_partitioning/41_examples.md): Gossip rounds 24 | 25 | 26 | ### Chapter 3: YottaFS 27 | 28 | - [Chapter 3.1](05_yottafs/nvme.md): The best filesystem 29 | - [Chapter 3.2](05_yottafs/warp.md): Transactions and indexes 30 | 31 | ### Chapter 4: Infrastructure 32 | 33 | - [Chapter 4.1](07_infrastructure/11_serverless.md): Decoupling storage and compute 34 | - [Chapter 4.2](07_infrastructure/12_hybrid.md): Hybrid cloud and standby replicas 35 | - [Chapter 4.3](07_infrastructure/13_sizing.md): The best node size (queueing theory) 36 | 37 | ### Chapter 5: Benchmarking 38 | 39 | - [Chapter 5.1](10_benchmarks/benchmarks.md): Performance and cost estimates 40 | 41 | -------------------------------------------------------------------------------- /docs/XX_pending/yottapack.md: -------------------------------------------------------------------------------- 1 | # Yottapack 2 | 3 | Yottapack is a bandwidth- and CPU-efficient serialization format: 4 | 5 | - Size efficient 6 | - Efficient streaming 7 | - Query fields 8 | - Schema or schemaless 9 | - Row or columnar fields 10 | - Skip list for fast access 11 | - Extensible 12 | 13 | 14 | # Format 15 | 16 | ## Header 17 | - Intro byte 18 | - 1+ size byte (needed for streaming) 19 | - (optional): Heads encoding 20 | 21 | ### Fat heads encoding 22 | 23 | Not needed if using a schema or when streaming. Encodes strings 24 | representing the names of fields (and nested fields?) 25 | 26 | - 1 type byte (is it needed?)
- 1+ byte size 28 | - N+ chars for value 29 | 30 | ### Light heads encoding 31 | 32 | Used if a schema is present 33 | 34 | - 1+ byte size for each record 35 | - 0 for some, if order is guaranteed 36 | 37 | ## Data 38 | 39 | - 1 type byte (optional) 40 | - 1+ size byte 41 | - N+ value byte 42 | 43 | # Special bytes 44 | 45 | ## Intro byte 46 | 47 | 1 byte for versioning and flags 48 | 49 | ## Forbidden sequences 50 | 51 | - 0x00, 0x00, ETX 52 | - 0x00, ETX, EOT 53 | 54 | 55 | ## Type byte 56 | 57 | - 1 bit isArray 58 | - 1 bit isMap 59 | - 1 bit reserved 60 | - 5 bit type (32 types) 61 | 62 | ### List of types 63 | 64 | - Char 65 | - Binary 66 | - Octal 67 | - Hex 68 | - Bool 69 | - Nil 70 | - Integers ( 2 * 4 ) 71 | - BigInt 72 | - Floats (2) 73 | - BigDecimal 74 | - Time (2) 75 | - Geolocation 76 | - Dynamic 77 | - Lookup 78 | - Padding 79 | - Reserved (8) 80 | - Extended (2 bytes) 81 | 82 | Total: 32 83 | 84 | ## Length encoding 85 | 86 | 1+ bytes of u8, like UTF-8. 87 | 88 | The 1st byte signals whether you should also read the next byte. 89 | 90 | - 0 -> 128: 1 byte 91 | - 128+1 -> 2^14-1: 2 bytes (first bit is 1, 8th bit is 0) 92 | - 2^14+1 -> 2^21-1: 3 bytes (first bit is 1, 8th bit is 1, 93 | 16th bit is 1) 94 | - and so on.... 95 | 96 | ## Separators 97 | 98 | EOT or ETX are used as separators. They are escaped in strings. 99 | -------------------------------------------------------------------------------- /docs/03_partitioning/11_introduction.md: -------------------------------------------------------------------------------- 1 | # The problem 2 | 3 | To achieve our size and throughput goals we need to 4 | find a partitioning scheme with the following properties: 5 | 6 | - Deterministic: all the observers should agree on the same 7 | partitioning with minimal information exchange 8 | - [Consistent](https://en.wikipedia.org/wiki/Consistent_hashing): 9 | removal or addition of nodes should cause minimal reshuffling 10 | of keys 11 | - Hierarchical: the scheme is sensitive to locality 12 | 13 | To solve this we use [rendezvous hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing), 14 | which allows us to choose `k` nodes in a consistent way, with 15 | minimal information sharing across observers. 16 | 17 | This alone is not enough to solve all our problems: a 18 | naive implementation would require all nodes to know about each 19 | other's status, which is too expensive for large clusters. 20 | We need to introduce the concept of a **hierarchy tree**, 21 | which is a locality-sensitive way to partition the nodes. 22 | 23 | Moreover, we observe that while the addition of new nodes can 24 | be planned and executed in an optimal way, the failure of 25 | a node can happen at any time, and thus should be the case for 26 | which we optimize. 27 | 28 | # Hierarchy tree 29 | 30 | The [skeleton-based variant](https://en.wikipedia.org/wiki/Rendezvous_hashing#Skeleton-based_variant) 31 | of the rendezvous hashing algorithm gives us the idea of building 32 | a tree to speed up operations. 33 | But before doing that, let's analyze the failure modes of the cluster: 34 | 35 | - Under the assumption that our code is correct, we can assume 36 | that shards will not fail. In practice we can recover even from that. 37 | - The smallest failure mode over which we have no control is disk failure. 38 | - Then we have node failure, where all the disks attached to a node become unavailable. 39 | - Then we have several kinds of group failures: rack failure, room failure, 40 | datacenter failure, and so on.
41 | 42 | We expect the disk failure to be the most common mode, and the one 43 | for which we will optimize, and then the other modes in decreasing 44 | order of probability. This allows us to define a natural tree-like 45 | hierarchy based on failure modes, for which the levels are: 46 | 47 | - Regions 48 | - Zones 49 | - Racks 50 | - Nodes 51 | - Disks 52 | - Shards 53 | 54 | 55 | -------------------------------------------------------------------------------- /docs/03_partitioning/21_rebar.md: -------------------------------------------------------------------------------- 1 | # Rendezvous based routing algorithm 2 | 3 | The Rendezvous Based Routing (ReBaR) algorithm 4 | is used to navigate a hierarchy tree, to find 5 | the shard responsible for a record. It is named 6 | after the rendezvous hashing 7 | approach it uses to agree on a set of choices with 8 | minimal information exchange. 9 | 10 | Given a record in the form: 11 | 12 | `account@driver:collection/record/subrecord` 13 | 14 | and a hierarchy tree, how do we find the 15 | shards responsible? 16 | 17 | Before answering that, we need the user to tell us: 18 | - The region(s) involved 19 | - The type of replication (single-region or multi-region) 20 | - The sharding factor `s` 21 | - The replication factor `r` 22 | 23 | 24 | In total, we will need to find `s*r` shards, spread across 25 | one or multiple regions according to the replication type. 26 | To do that we split the algorithm into 3 steps: 27 | - Parsing the record 28 | - Finding the shard pool responsible for the collection 29 | - Finding the shards responsible for the record from the pool 30 | 31 | 32 | ## Parsing the record 33 | 34 | Given a record string, it is parsed into a struct like: 35 | 36 | ```go 37 | package main 38 | 39 | type Record struct { 40 | Account string 41 | Driver string 42 | Collection string 43 | Record string 44 | PoolPtr string 45 | ShardPtr string 46 | } 47 | ``` 48 | 49 | // TODO: put link to driver types 50 | The driver [type]() 51 | will determine which types of nodes are involved 52 | in the collection (`linear` or `random` nodes). 53 | 54 | The `PoolPtr` is in the form `account@driver:collection` and 55 | is the same for all the records in a collection. 56 | 57 | The `ShardPtr` is in the form `account@driver:collection/record` and 58 | is the same for all the subrecords of a root record. 59 | 60 | ## Hashing step 61 | 62 | Before describing the next parts, let's define the hashing step, 63 | which is a slight variation of the classic rendezvous hashing. 64 | Given a 65 | 66 | ## Finding the pool 67 | 68 | Each collection is assigned to a pool of shards, determined by 69 | the `PoolPtr` of the record. This has the advantage that it is 70 | possible to enumerate all the shards of a collection, making queries easier. 71 | 72 | We start by picking a region 73 | 74 | ## Finding the shard 75 | 76 | Once the pool is found, we need to find the shards responsible for the record. 77 | -------------------------------------------------------------------------------- /docs/03_partitioning/31_gap.md: -------------------------------------------------------------------------------- 1 | # Gossip agreement protocol 2 | 3 | The GAP is the protocol tasked with building the hierarchy tree. 4 | 5 | 6 | # Optimizations 7 | 8 | The traditional gossip protocol relies on picking a random node 9 | for each round of gossip. This means that, barring lucky choices, 10 | we need more than `o(log(n))` steps to converge.
Please see our 11 | simulations. 12 | 13 | Here we propose a couple of simple improvements, which we call rendezvous 14 | gossip, with the following properties: 15 | 16 | - Guaranteed `o(log(n))` convergence 17 | - Fast failure detection 18 | - Very large cluster size (10^9 nodes) 19 | 20 | Implementation [here](). 21 | 22 | ## Small cluster 23 | 24 | // TODO: put in rebar 25 | 26 | Let's say we have a cluster of 8 nodes and the fanout is 3. 27 | The cluster is already at equilibrium, so each node knows about all the other nodes. 28 | 29 | For the next round, each node performs a round of rendezvous hashing to determine 30 | which node to send the next message to. The hashed string is: 31 | 32 | `nodeName,iteration,peer1,peer2, ... peer8` 33 | 34 | which is built using the lexicographic order of the node hostnames. The 512-bit 35 | hash is split into 8 chunks of 64 bits each, and the next 3 nodes are selected. 36 | 37 | The message is sent over the UDP protocol, and contains a hash of the currently 38 | known list of healthy nodes: 39 | 40 | `peer1,peer2, ... peer8` 41 | 42 | Each peer can then immediately verify if all the senders agree on the list 43 | of peers. If not, a synchronization over TCP is performed. 44 | 45 | This approach guarantees that in `o(log(n))` steps not only will the network 46 | converge, but each node will also receive at least 47 | one confirmation, for extremely fast failure detection. 48 | 49 | ## Large clusters 50 | 51 | Up to 1024 nodes the above strategy is fine, but failures become 52 | increasingly frequent and more expensive, as they will 53 | require a lot of synchronization. 54 | 55 | To scale up we can start using hierarchies and Merkle trees: 56 | 57 | - Nodes are organized in groups 58 | - Groups are organized in Zones 59 | - Groups can have a size from 25 nodes to 250 nodes 60 | - Each node knows about the local nodes, and the hash of other groups. This controls the size of the message. 61 | - Only a few nodes communicate across groups 62 | 63 | Let's say we have a group of 25 nodes, and a fanout of 6. We know that there are 4 groups in 2 zones. 64 | For a node in group 1, our string thus becomes: 65 | 66 | `peer1,peer2, ...,peer25,group2` 67 | 68 | When any node selects group2 as a candidate, it will send the message to one node in group2. 69 | The message is generated using a Merkle tree. 70 | 71 | 72 | 73 | -------------------------------------------------------------------------------- /docs/01_architecture/31_machine.md: -------------------------------------------------------------------------------- 1 | # The distributed machine 2 | 3 | Our approach to solving the task is to build a distributed machine, 4 | with a 512-bit memory pointer and a large word size. 5 | 6 | To build that we need: 7 | 8 | - Memory pointers 9 | - Memory to store data 10 | - A virtual machine with an instruction set 11 | 12 | 13 | ## The memory pointer 14 | 15 | To access the memory we will use a [hash table](../05_yottafs/hashTable.md): given the key, it's possible to calculate 16 | the hash and thus find the position of the value in the memory. 17 | A key is of the form: 18 | 19 | `account/resource/resourceType/primaryKey/sortKey` 20 | 21 | - `account` to allow multi-tenancy on the same machine. 22 | - `resource` can be either the table name or the queue name 23 | - `resourceType` can be either `rnd` for random access keys, backed by NVMe, 24 | or `lnr` for queues, backed by HDDs. 25 | - `primaryKey` represents the key to which a value is associated.
- `sortKey` is optional and is not hashed, like in DynamoDB. 27 | Typically used for versioning, it permits storing different values of 28 | the same primary key together. 29 | 30 | The hash used is `SHA-512`, hence the 512-bit size of the pointer. To determine which 31 | node is responsible for the key, we use [rendezvous hashing](../partitioning/rendezvous.md). 32 | 33 | ## The memory 34 | 35 | To achieve our size goals, we will treat NVMe SSDs as our main memory. Although not as fast as RAM, 36 | NVMe provides excellent throughput and reliability. Having non-volatile random-access memory means that it is 37 | much easier to recover from failures, while still providing exceptional performance characteristics. 38 | 39 | More in [SSD vs RAM performance]() 40 | 41 | ## Word size 42 | 43 | The large word size is due to the nature of the backing memory. For random 44 | access it is the SSD sector size, which is 4 KiB: even if a value is smaller than 45 | that, the controller will still use a 4 KiB sector, so it's better to optimize for that. 46 | 47 | In the case of linear memory, which is typically an 48 | [SMR HDD](https://en.wikipedia.org/wiki/Shingled_magnetic_recording), 49 | the word size is much bigger, up to 32 MiB, to optimize for the 50 | high linear performance characteristics of hard disks. Queues 51 | can be temporarily stored in random memory before being aggregated 52 | on disk, like the metadata needed to read the queue. More [here]() 53 | 54 | 55 | ## The virtual machine 56 | 57 | For simplicity, we can take the WebAssembly virtual machine instruction set as a reference, 58 | with a few important additional instructions: 59 | 60 | - LOAD $POINTER: Load the memory stored at POINTER 61 | - COMPARE $POINTER $VALUE: 62 | - WRITE $POINTER $VALUE 63 | - COMPARE&SWAP $POINTER $VALUE1 $VALUE2: Atomic compare and swap in the global memory 64 | 65 | These instructions are used to build lock-free concurrent data structures, which are the key to the 66 | high availability of this system. For a more detailed discussion read [here]() 67 | 68 | -------------------------------------------------------------------------------- /docs/05_yottafs/hashTable.md: -------------------------------------------------------------------------------- 1 | # Hash Table 2 | 3 | Here we describe a hash table optimized for NVMe storage. For 4 | simplicity, we assume a namespace of 1 TB, and a block size of 4 KB, 5 | which means 2^28 blocks. 6 | 7 | If we were to implement a naive hash table, we would face the following 8 | challenges: 9 | 10 | - Poorly utilized space 11 | - Needs wear leveling 12 | - Values bigger than a block 13 | - Needs to handle collisions 14 | 15 | ## Pointers array 16 | 17 | When we hash the key, we obtain a value which represents the index of an array. 18 | The entry at that index will contain either the value of the key or a 19 | pointer to the value: 20 | 21 | - 512 bits to reference the key hash 22 | - 32 bits to reference the first block containing the value 23 | - 16 bits to represent how many blocks are used to store the value 24 | 25 | Even if we assume a total size of 1024 bits, this means a 4 KiB block 26 | can contain 32 pointers. Let's say we decide to store 8 pointers per block. 27 | 28 | We allocate a large number of blocks to store the buckets of the hash table, let's say 29 | 64 GiB or 2^24 blocks. This means we can address a total of 2^27 keys, or 512 GiB at 30 | 4 KiB per key. The remaining space on the device is used to store the values, 31 | which can be of arbitrary size.
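As a minimal sketch of the pointer entry described above (the field names and the exact padding are assumptions for illustration, not the final on-disk layout):

```go
package yottafs

// PointerEntry sketches one slot of the pointers array described above:
// the 512-bit key hash, the first block holding the value, and the number
// of blocks used, padded to a 1024-bit (128-byte) slot so that a whole
// number of entries fits in a 4 KiB block (32 per block).
type PointerEntry struct {
	KeyHash    [64]byte // SHA-512 of the key (512 bits)
	FirstBlock uint32   // index of the first block containing the value
	NumBlocks  uint16   // how many blocks are used to store the value
	_          [58]byte // reserved/padding up to 128 bytes
}
```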
32 | 33 | To further improve wear leveling, we decide to aggregate 3 blocks together, and rotate writing: 34 | Each block will contain up to 24 pointers, still less than the maximum of 32 pointers, 35 | and every time a record is created or updated, we will rotate the block written. 36 | 37 | A record key is created or updated less often than the value, so this approach should 38 | provide optimal performance and endurance. We now need a way to efficiently 39 | allocate memory pointers. 40 | 41 | ## Memory allocation 42 | 43 | The first 128 blocks are reserved for the disk metadata, and the following 64 GiB are 44 | reserved for the associative array; this means we have around 960 GiB of space to 45 | store the values. 46 | 47 | We need a mechanism to keep track of which blocks are free and which are used. 48 | To do that we can store a list containing the free memory pointers, and the number of blocks. 49 | For example at the beginning it will be: 50 | 51 | `[{0: 2^30}]` 52 | 53 | which means that starting at address 0, we have 2^30 blocks, which are free. 54 | 55 | Out of this space we decide to allocate 1024 blocks for keeping track of the free blocks, so 56 | our free space will look like: 57 | 58 | `[1024: 2^30-1024]` written at blocks 0 and 512. 59 | 60 | This means that the first free block is 1024, and we have 2^30-1024 free blocks. 61 | When we want to save a new key, we can read the first 1024 blocks, detect the most 62 | recent state, and write the new state with a compare and swap. The atomicity of 63 | the compare and swap gives us concurrency without locks, for very high performance. 64 | 65 | To avoid block thrashing, we can decide to allocate a little bit more memory than required, 66 | for example 30% more, rounded up. In case we need to write one block, we will allocate two, 67 | so that successive updates can rotate the blocks without needing another allocation. 68 | 69 | Smart allocation will be used to ensure that the list does not grow unbounded. 70 | -------------------------------------------------------------------------------- /docs/05_yottafs/yfs.md: -------------------------------------------------------------------------------- 1 | # Yottastore File System 2 | 3 | User-space filesystem. Shard-per-core, single-threaded, concurrent. Parallelism is achieved by sharding. 4 | 5 | Listens on a Unix domain socket for requests from the network daemon. 6 | 7 | Garbage collection and compactification are done concurrently on a separate thread (see 8 | [Trim](./trim.md)) 9 | 10 | Keep a big hashmap (hash trie?) in memory, pointing uints to sectors. (resize?) 11 | If the hashmap is 32-bit key + 32-bit value, then 2 GB for every TB of SSD. 12 | The compute node takes care of hashing and issuing consecutive commands. 13 | 14 | ## Loop 15 | 16 | In a for loop, wait for io_uring events. 17 | 18 | In case of a request (socket event), issue the command with a UUID (32-bit unique + 32-bit flag) 19 | and store it in the hashmap. 20 | 21 | In case of IO ready, use the UUID to find the relevant request and issue a send command (on the socket). 22 | 23 | # Operations supported by the filesystem 24 | 25 | - Read 26 | - Read Range 27 | - Read Follow 28 | - Write 29 | - Append 30 | - Delete 31 | - Compare 32 | - Compare and Swap 33 | - Flush 34 | - Verify 35 | 36 | 37 | ## Read 38 | 39 | Read takes a `uint`, and returns the sector which the hash trie points to. 40 | 41 | Read Range takes a list of ranges to read. 42 | 43 | Read Follow returns the sector, and any sector the skip list points to. 44 | 45 | 46 | ## Write 47 | 48 | Write attempt to WAL.
Write a data stream to disk, for the given key. Takes care of using multiple sectors if needed. 50 | If the key exists, it gets rewritten. The compute node is responsible for merging existing data. 51 | If successful, record to the WAL and return true. 52 | 53 | 54 | 55 | Options: 56 | - Compressed: signals that the compute node compressed the data 57 | - Compact: if the size is much less than a sector, signals that compactification with other records is desired 58 | - Revert: possible to mark the write as revertible 59 | - Compare: verify the existing state before writing, fail on mismatch 60 | - Flush: make the write strongly consistent 61 | 62 | ## Append 63 | 64 | Append a data stream to an existing key. Takes care of using multiple sectors if needed. 65 | Useful for columnar records. 66 | 67 | Options: 68 | - Compressed: signals that the compute node compressed the data 69 | - Compact: if the size is much less than a sector, signals that compactification with other records is desired 70 | - Revert: possible to mark the write as revertible 71 | - Compare: verify the existing state before writing, fail on mismatch 72 | - Flush: make the append strongly consistent 73 | 74 | ## Delete 75 | 76 | Given a key, mark it as deleted. 77 | 78 | If the zone is full, compute the share of deleted sectors per zone; if it is above the threshold, then issue a trim command. 79 | 80 | ## Compare 81 | 82 | Verify that a key's sector pointer in the hashmap has not changed. 83 | 84 | Returns true or false. 85 | 86 | ## Compare and swap 87 | 88 | Write temp data. 89 | Verify that a key's sector pointer in the hashmap has not changed. 90 | If it has not, perform a swap. 91 | 92 | Can be multi-word, to support transactions. 93 | 94 | Returns true or false. 95 | 96 | Optional: 97 | 98 | Count failures through a sliding-window bloom filter, to switch from CAS to locks under high contention. 99 | 100 | ## Flush 101 | 102 | Issue a Flush command to empty the write queue. 103 | 104 | ## Verify 105 | 106 | Given a key, verify that it matches the CRC. Done by the compute node. 107 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | Yotta Store is a [next generation]() storage system aiming to **scale out to the yottabyte range** 3 | and **scale up to millions of concurrent reads and writes per record**. 4 | The goal is to have **two orders of magnitude more throughput than DynamoDB**, 5 | dollar per dollar, while maintaining a **sub-ms latency**. 6 | Check our [benchmarks](docs/10_benchmarks/benchmarks.md). 7 | 8 | Yotta Store is built on top of a 512-bit distributed machine, with a 4 KiB word size. 9 | We try to design a system which can exploit the capabilities of 10 | modern hardware and software, like NVMe disks or `io_uring`. 11 | Read more in the [docs](docs/README.md). 12 | 13 | - [Golang implementation](https://github.com/yottaStore/golang) (WIP) backed by Cloud Storage 14 | - [Rust implementation](https://github.com/yottaStore/rust) (WIP) backed by the io_uring NVMe API 15 | - 16 | ## Main features 17 | 18 | - Linear scalability, up to 10^9 nodes and 10 yottabytes of addressable space. 19 | - Anti-fragility: the multi-tenant setup increases reliability and availability with load. 20 | - Self-configuring and self-optimizing, thanks to machine learning. 21 | - Strong consistency guarantees, aiming at sub-ms latency. 22 | - Cheap transactions and indexes, at around `o(n)`. 23 | - Storage decoupled from compute with a serverless architecture.
- Two orders of magnitude faster than DynamoDB, dollar per dollar. 25 | 26 | 27 | ## Techniques used 28 | 29 | Yotta Store does not introduce any new techniques, but uses existing ones in a novel combination: 30 | 31 | | Problem | Solution | Description | 32 | |----------------------------------|---------------------------------|--------------------------------------------------------------------------------------------------------------------------| 33 | | Partitioning and replication | ReBaR: Rendezvous Based Routing | A generalization of consistent hashing, for agreeing on k choices in a stable way and with minimal information exchange | 34 | | Highly available reads and writes | YottaFS | A wait-free userspace filesystem, designed around NVMe devices and CRDTs. | 35 | | Hot records | YottaSelf | Hot records can be sandboxed in dedicated partitions. | 36 | | Transactions | Modified Warp | Thanks to the modified Warp algorithm, no distributed consensus is needed. | 37 | | Indexes, queries across keys | Partitioned indexes | Indexes are sharded and lazily built on the same partition that sees the changes. | 38 | | Membership and failure detection | Gossip agreement protocol | Thanks to GAP and ReBaR, the expected cost is `o(log(n))`, deterministically. | 39 | 40 | 41 | ## Inspirations 42 | 43 | - [Designing Data-Intensive Applications, Martin Kleppmann](https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/) 44 | - [The Dynamo paper, many authors](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) 45 | - [A pragmatic implementation of non-blocking linked-lists, Timothy L. Harris](https://timharris.uk/papers/2001-disc.pdf) 46 | - [Warp: Lightweight Multi-Key Transactions for Key-Value Stores, many authors](https://arxiv.org/pdf/1509.07815.pdf) 47 | - [Thread per core](https://helda.helsinki.fi/bitstream/handle/10138/313642/tpc_ancs19.pdf) 48 | - [Kafka](https://docs.confluent.io/platform/current/kafka/design.html) 49 | 50 | 51 | -------------------------------------------------------------------------------- /docs/03_partitioning/41_examples.md: -------------------------------------------------------------------------------- 1 | # Failure example 2 | 3 | For each hierarchy level there is a maximum number of failures tolerated, 4 | called **quorum**. 5 | Let's say a group of 64 disks is attached to 8 nodes in 2 racks: 6 | if a disk fails, the node will rebalance the load across the remaining 7 | 7 disks without problems. 8 | If instead more than quorum disks fail in a node, the system will trigger 9 | a rebalance across the other nodes in the same rack. 10 | 11 | Failures are local in the sense that recovery is first attempted locally, 12 | and then, if that fails, the recovery is attempted involving only the minimum 13 | number of levels in the hierarchy tree. 14 | 15 | # Scaling example 16 | 17 | Generally using weights with the rendezvous hashing algorithm is not 18 | a good idea, because changing the weights will cause a rebalance across 19 | many keys. 20 | 21 | Instead, we will use weights to accommodate future expansion of the cluster, 22 | which can be planned in advance, or to represent smaller nodes. 23 | We can create fake zones or nodes and assign them a weight of 0. 24 | When we start adding new nodes, we can assign a weight of 1 only to new accounts, to allow for gradually scaling up. 25 | Eventually all the accounts will see a weight of 1, but the rebalancing 26 | will be much more gradual. 27 | 28 | See an example [here](examples.md). For an implementation please see [here]() 29 | 30 | 31 | # Key addressing 32 | 33 | Let's say we have 128 nodes spread across 2 regions and we want 34 | to know the position of a key: 35 | 36 | `account/tableName/resourceType/primaryKey` 37 | 38 | We know that the replication factor is 6, and the partitioning factor 39 | is 16. We decide to store the data in a single region. 40 | 41 | Our hierarchy could be: 42 | 43 | - 2 Regions 44 | - 2 Zones per region 45 | - Each zone has 32 nodes with 16 disks each.
- Each disk has 8 1TB namespaces 47 | - Each namespace has 128 queues 48 | 49 | This means that we have a total of 2^21 combinations. Using a naive rendezvous hashing 50 | would be too expensive. We start by understanding which nodes are responsible for the 51 | table. Given the root of the key, `account/tableName/resourceType`, we need to pick 52 | 16*6 = 96 nodes using rendezvous hashing on this tree: 53 | 54 | - 2 Regions, with weight [1, 0] 55 | - 8 Zones, with weight [1, 1, 0, 0, 0, 0, 0, 0] 56 | - Nodes, without weight 57 | - Disks, without weight 58 | - Queues, without weight 59 | 60 | The first level is the region; the weights force the algorithm to pick the first region, 61 | to express a user preference, and it doesn't need hashing. 62 | The second level is the zone, with 2 real zones and 6 fake zones to plan for future expansion. 63 | Given that we need 96 nodes, we know that both zones will be involved. 64 | The third level is the node, with 32 possibilities times 2 zones. For each zone we will perform 65 | a round of rendezvous hashing, and pick 16 nodes. 66 | The fourth level is the disk, with 8 possibilities times 16 nodes. 67 | 68 | We have thus selected 256 possibilities. We can pick the 96 disks with the highest value and thus 69 | have our pool. We can now build a tree which represents the `account/tableName/resourceType` 70 | hierarchy: 71 | 72 | - 1 region 73 | - 2 zones 74 | - 16 nodes 75 | - 16 disks per node 76 | - 8 namespaces per disk 77 | 78 | Now every time we want to know the position of a key for the table, instead of performing another round 79 | on the cluster tree, we can do a lookup on this smaller tree: 80 | 81 | - 2 zones 82 | - 16 nodes 83 | - 16 disks per node 84 | - 8 namespaces per disk 85 | - 128 queues per namespace 86 | 87 | For a total of 2^19 possibilities. We can build a virtual tree: 88 | 89 | - 32 virtual nodes 90 | - 32 virtual nodes 91 | - 32 virtual nodes 92 | - 16 queues 93 | 94 | To know the position of a key we need 4 steps, or 8 steps if we want to know all the 6 replicas. 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | -------------------------------------------------------------------------------- /docs/05_yottafs/11_introduction.md: -------------------------------------------------------------------------------- 1 | # Filesystem 2 | 3 | It's possible to experiment with several different filesystem configurations; 4 | among the most common options are: 5 | 6 | - **EXT4**: Old, but reliable. 7 | - **BTRFS, NILFS**: Better suited for our case. 8 | - **NOATIME**: To improve performance. 9 | - **RAID**: To control replication and availability. 10 | - **TRIM**: To improve latency. 11 | 12 | Instead I would recommend a different approach, more aligned with our hardware 13 | capabilities, which is to use the **NVMe** specification. 14 | 15 | # NVMe specification 16 | 17 | NVMe is a new specification for storage devices. It gives us a set of commands 18 | to access the SSD as a block device, without any filesystem. It exposes a set 19 | of basic operations like: 20 | 21 | - **READ**: read a number of sectors on the device 22 | - **WRITE**: write a number of sectors on the device 23 | - **FLUSH**: execute all the commands in the queue 24 | - **VERIFY**: verify the data on the device 25 | - **COMPARE**: compare the data on the device with the data in the memory 26 | 27 | On top of that we also have `fused operations`, which allow us to combine 28 | a `compare` and `write` operation in a single atomic command.
This is fundamental for our [instruction set](../architecture/instructionSet.md). 30 | 31 | Discarding a filesystem gives us a lot of performance but also 32 | some new responsibilities: 33 | 34 | - Detect and interface with the NVMe device 35 | - Define a disk format 36 | - Handle wear leveling 37 | - Handle trimming 38 | - Handle replication (RAID) 39 | - Extend this to HDDs 40 | 41 | Let's solve them one by one. 42 | 43 | ## NVMe interface 44 | 45 | Although powerful, the NVMe specification can feel daunting. Thankfully, 46 | there exists a very nice library tailored exactly to our needs: the 47 | [Storage Performance Development Kit](https://spdk.io/). 48 | 49 | Behind a set of simple C primitives, it provides a high level of abstraction 50 | over the NVMe device, while fully residing in user land. This allows us 51 | to exploit the high concurrency of the device, by setting up multiple queues 52 | with high depth. 53 | 54 | The `nvme` CLI can be used as a fallback, without queues or fused commands. 55 | 56 | ## Disk format 57 | 58 | NVMe devices appear to Linux as block devices, which can be partitioned 59 | into several **namespaces**, backed by a number of underlying blocks. For 60 | simplicity of discussion, we will assume a namespace size of 1 TB, 61 | which, at 4 KB per sector, is equivalent to 2^28 blocks. 62 | 63 | ### Disk Metadata 64 | 65 | The disk format is a set of metadata that describes the disk layout, and 66 | allows us to parse the contents of the disk. It's especially important 67 | that this data is not corrupted, as that would make the disk unusable. 68 | 69 | We start by reserving the first 128 blocks, and store 70 | the initial metadata at positions 0 and 64, as copies. It's easy to 71 | imagine that a block of 4 KB is more than enough to 72 | store all the metadata, but eventually we can use 2 blocks instead. 73 | 74 | If for any reason the metadata needs to be updated, we will store the 75 | data in the next blocks, 1 and 65, again as copies. Once a few copies 76 | have been written, let's say 9, we will write a number of zeros to the 77 | oldest block, to mark it as available. 78 | 79 | Whenever a process wants to read the metadata, it can simply ask for 80 | the first 128 blocks of the namespace, and then parse the most recent 81 | block. The double copy is there to provide corruption resilience. 82 | 83 | ### Block format 84 | 85 | Since we are dealing with a low-level device, it's useful to define 86 | a binary format for the blocks. We can take for example `Avro` as a 87 | serialization format, maybe with a couple of additions: 88 | 89 | - We need a pointer to the next block, in case one block is not enough 90 | to store the data, and the next chunk is not in the same block. 91 | - We need a checksum to verify the integrity of the data. 92 | 93 | Read more [here](blockFormat.md). 94 | 95 | ### Wear leveling 96 | 97 | The metadata chapter gives us an idea of how wear leveling can be implemented. The 98 | application can decide to aggregate several blocks, and then rotate the writes 99 | to provide wear leveling. Ideally we would like to have a mechanism 100 | to determine in advance which the possible blocks are, so we can submit the read to multiple 101 | blocks as a single command. 102 | 103 | **Example**: A is in block 0, B is in block 1. The system might decide to write 104 | (A, B) to block 0, and leave block 1 free. When an update comes, we write 105 | it to block 1, and vice versa. Reading either A or B requires scanning both 106 | blocks to find the most recent data.
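To make the rotation above concrete, here is a minimal sketch under stated assumptions: the sequence-number header and the names `blockGroup`, `write` and `readLatest` are illustrative, not the actual implementation. Writes alternate between the blocks of a group, and reads scan the whole group and keep the newest copy.

```go
package yottafs

// versionedBlock pairs a payload with a write sequence number, so that the
// newest copy in a rotation group can be identified on read.
type versionedBlock struct {
	Seq  uint64 // write sequence number, higher means newer
	Data []byte // user payload (A, B, ... packed by the caller)
}

// blockGroup sketches a small set of physical blocks that share a rotation,
// as in the A/B example above.
type blockGroup struct {
	slots []versionedBlock // in-memory stand-in for the physical blocks
	next  int              // index of the slot that receives the next write
	seq   uint64           // last sequence number handed out
}

// write stores the payload in the next slot of the rotation.
func (g *blockGroup) write(data []byte) {
	g.seq++
	g.slots[g.next] = versionedBlock{Seq: g.seq, Data: data}
	g.next = (g.next + 1) % len(g.slots)
}

// readLatest scans every slot and returns the most recent payload,
// mirroring the "read both blocks, pick the newest" step above.
func (g *blockGroup) readLatest() []byte {
	best := -1
	for i, s := range g.slots {
		if s.Seq != 0 && (best == -1 || s.Seq > g.slots[best].Seq) {
			best = i
		}
	}
	if best == -1 {
		return nil
	}
	return g.slots[best].Data
}
```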
### Trimming 109 | 110 | Like with wear leveling, we can also implement a mechanism to trim the disk, inspired 111 | by the metadata chapter. It seems we need a mechanism with the following properties: 112 | 113 | - Keep track of the free blocks 114 | - Trim the unused blocks in an asynchronous way 115 | - Be thread safe 116 | 117 | It seems like we need a non-blocking garbage collector, which will also be useful 118 | for other purposes, like improving our wear leveling mechanism. More information [here](hashTable.md). 119 | 120 | ## Replication 121 | 122 | Normally replication is handled both by RAID and by the application, which is a waste. 123 | We can instead implement replication at the application level, with several advantages: 124 | 125 | - Unlike RAID, replicas can be read and written concurrently, increasing the throughput 126 | - We can guarantee that copies will be spread across machines, increasing reliability 127 | - Replicas are used as voting machines 128 | 129 | Rendezvous hashing allows us to choose in a deterministic way the nodes 130 | that will hold the replicas for our key, and thus to route accordingly. 131 | 132 | ## Hard disks 133 | 134 | We can now extend the disk model to hard disks, which are optimized for linear 135 | performance. If we want to publish to a queue, we can first determine which 136 | nodes are the owners of the queue, and then publish to those nodes. 137 | 138 | The problem is that a message is usually small in size, and we would prefer to write 139 | 32 MB chunks at a time. To achieve that we can use an intermediate cache on the 140 | SSDs: let's say we allocate a chunk of 512 MB. Every time a message arrives 141 | we write a 4 KB chunk to the disk in the first free block. When enough blocks 142 | are written they can be sent to the hard disk. The large chunk allocation allows 143 | for wear leveling and non-blocking concurrent writes. 144 | 145 | The only thing we need is to store the metadata of the disk, with the list of chunks 146 | allocated and which queue each belongs to. For that we can use an SSD, like we did for the original metadata. 147 | More information [here](sata.md). 148 | --------------------------------------------------------------------------------