├── docs ├── 05_yottafs │ ├── sata.md │ ├── warp.md │ ├── 21_yottafs.md │ ├── blockFormat.md │ ├── trim.md │ ├── hashTable.md │ ├── yfs.md │ └── 11_introduction.md ├── 10_benchmarks │ └── benchmarks.md ├── 01_architecture │ ├── 21_principles.md │ ├── 41_consistency.md │ ├── 32_instructions.md │ ├── 11_problem.md │ └── 31_machine.md ├── 06_yottadb │ └── 11_introduction.md ├── 00_Introduction │ ├── 11_introduction.md │ ├── 31_related_work.md │ └── 21_business_case.md ├── 07_infrastructure │ ├── 13_sizing.md │ ├── 12_hybrid.md │ ├── 11_serverless.md │ └── 21_node_types.md ├── package.json ├── XX_pending │ ├── yottafs.md │ ├── README.md │ ├── fs.md │ ├── yottadb.md │ ├── rendezvous.md │ └── yottapack.md ├── README.md └── 03_partitioning │ ├── 11_introduction.md │ ├── 21_rebar.md │ ├── 31_gap.md │ └── 41_examples.md ├── infra ├── README.md └── package.json ├── pkgs ├── hash │ ├── README.md │ └── package.json └── gossip │ ├── README.md │ └── package.json ├── .yarnrc.yml ├── package.json ├── .editorconfig ├── yarn.lock ├── .gitignore ├── LICENSE └── README.md /docs/05_yottafs/sata.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/05_yottafs/warp.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /infra/README.md: -------------------------------------------------------------------------------- 1 | # infra 2 | -------------------------------------------------------------------------------- /docs/05_yottafs/21_yottafs.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/10_benchmarks/benchmarks.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pkgs/hash/README.md: -------------------------------------------------------------------------------- 1 | # hash 2 | -------------------------------------------------------------------------------- /docs/01_architecture/21_principles.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/01_architecture/41_consistency.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/06_yottadb/11_introduction.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pkgs/gossip/README.md: -------------------------------------------------------------------------------- 1 | # gossip 2 | -------------------------------------------------------------------------------- /docs/00_Introduction/11_introduction.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/01_architecture/32_instructions.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/07_infrastructure/13_sizing.md: 
-------------------------------------------------------------------------------- 1 | queueing theory 2 | -------------------------------------------------------------------------------- /.yarnrc.yml: -------------------------------------------------------------------------------- 1 | yarnPath: .yarn/releases/yarn-3.2.1.cjs 2 | -------------------------------------------------------------------------------- /docs/07_infrastructure/12_hybrid.md: -------------------------------------------------------------------------------- 1 | standby replicas 2 | -------------------------------------------------------------------------------- /docs/00_Introduction/31_related_work.md: -------------------------------------------------------------------------------- 1 | // TODO: put related work 2 | -------------------------------------------------------------------------------- /docs/07_infrastructure/11_serverless.md: -------------------------------------------------------------------------------- 1 | compute decoupled from storage 2 | -------------------------------------------------------------------------------- /docs/07_infrastructure/21_node_types.md: -------------------------------------------------------------------------------- 1 | random (reads) or linear (writes) 2 | -------------------------------------------------------------------------------- /docs/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "docs", 3 | "packageManager": "yarn@3.2.1" 4 | } 5 | -------------------------------------------------------------------------------- /infra/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "infra", 3 | "packageManager": "yarn@3.2.1" 4 | } 5 | -------------------------------------------------------------------------------- /pkgs/hash/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "hash", 3 | "packageManager": "yarn@3.2.1" 4 | } 5 | -------------------------------------------------------------------------------- /pkgs/gossip/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "gossip", 3 | "packageManager": "yarn@3.2.1" 4 | } 5 | -------------------------------------------------------------------------------- /docs/00_Introduction/21_business_case.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | - BigQuery + ML cost 2000 euros 4 | - DynamoDB, but with each column indexed 5 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "yottaStore", 3 | "workspaces": [ 4 | "docs", 5 | "infra", 6 | "pkgs/*" 7 | ], 8 | "packageManager": "yarn@3.2.1" 9 | } 10 | -------------------------------------------------------------------------------- /.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | end_of_line = lf 5 | insert_final_newline = true 6 | 7 | [*.{js,json,yml}] 8 | charset = utf-8 9 | indent_style = space 10 | indent_size = 2 11 | -------------------------------------------------------------------------------- /docs/05_yottafs/blockFormat.md: -------------------------------------------------------------------------------- 1 | # Block Format 2 | 3 | Serialization: 4 | 5 | - 
[Flatbuffers](https://en.wikipedia.org/wiki/FlatBuffers) 6 | - [Avro](https://en.wikipedia.org/wiki/Apache_Avro) 7 | 8 | 9 | Error Correction: 10 | 11 | CRC32-C 12 | 13 | -------------------------------------------------------------------------------- /docs/XX_pending/yottafs.md: -------------------------------------------------------------------------------- 1 | # Drivers 2 | 3 | - classic: buffered io, xfs 4 | - direct: direct io, xfs 5 | - uring-classic: classic with io_uring 6 | - uring-direct: direct with io_uring 7 | - uring-nvme: NVMe commands 8 | - uring-zns: Zoned namespace 9 | 10 | 11 | # Namespace type 12 | 13 | - Random 14 | - Linear -------------------------------------------------------------------------------- /yarn.lock: -------------------------------------------------------------------------------- 1 | # This file is generated by running "yarn install" inside your project. 2 | # Manual changes might be lost - proceed with caution! 3 | 4 | __metadata: 5 | version: 6 6 | 7 | "yottaStore@workspace:.": 8 | version: 0.0.0-use.local 9 | resolution: "yottaStore@workspace:." 10 | languageName: unknown 11 | linkType: soft 12 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .yarn/* 2 | !.yarn/patches 3 | !.yarn/plugins 4 | !.yarn/releases 5 | !.yarn/sdks 6 | !.yarn/versions 7 | 8 | # Swap the comments on the following lines if you don't wish to use zero-installs 9 | # Documentation here: https://yarnpkg.com/features/zero-installs 10 | !.yarn/cache 11 | #.pnp.* 12 | 13 | 14 | # IDE 15 | .idea 16 | .vscode 17 | -------------------------------------------------------------------------------- /docs/01_architecture/11_problem.md: -------------------------------------------------------------------------------- 1 | # Problem 2 | 3 | Here's the problem we're facing: 4 | 5 | Easy: 6 | - As an engineer, I don't want to care about scale 7 | - As an engineer, I want easy migrations 8 | - As an engineer, I want indexes 9 | 10 | 11 | Powerful: 12 | - As a customer, I want availability 13 | - As a customer, I want low latency 14 | - As a business leader, I want reliability 15 | - As a business leader, I want transactions 16 | 17 | Efficient: 18 | -------------------------------------------------------------------------------- /docs/XX_pending/README.md: -------------------------------------------------------------------------------- 1 | # Readme 2 | 3 | # Architecture 4 | 5 | - Yottastore: compute layer 6 | - Yottafs: storage layer 7 | - Client 8 | 9 | 10 | # Modules 11 | 12 | - Yottadb: module for yottastore 13 | - Self: awareness and optimizations, ML-based 14 | - Rendezvous: module for yottastore 15 | - Gossip: module for yottastore and yottadb 16 | - Yottapack: optimal serialization format 17 | 18 | 19 | ## Yottadb 20 | 21 | Adds advanced features to yottastore. 22 | Possible drivers are: 23 | 24 | - KeyValue: simplest one, default 25 | - Columnar: Column values 26 | - Document: For collections 27 | - PubSub: optimized for message queues 28 | - Indexed: supports queries, based on B-trees 29 | - Graph: optimized for trees and graphs 30 | -------------------------------------------------------------------------------- /docs/05_yottafs/trim.md: -------------------------------------------------------------------------------- 1 | # Efficient trim algorithm 2 | 3 | Concurrent, parallel, state machine.
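As a rough illustration of the per-zone state described below, here is a minimal sketch in Go; the names `zoneState`, `markDeleted` and `shouldCompact` are illustrative assumptions, not the actual implementation.

```go
package yottafs

import "math/bits"

// zoneState sketches the per-erase-zone state described below:
// one bit per sector, set to 1 once the sector has been deleted.
type zoneState struct {
	deleted []uint64 // 1 bit per sector, packed into 64-bit words
}

// markDeleted sets the bit for the given sector (the delete op).
func (z *zoneState) markDeleted(sector uint) {
	z.deleted[sector/64] |= 1 << (sector % 64)
}

// shouldCompact reports whether the share of deleted sectors has crossed
// the threshold that triggers the compact ops.
func (z *zoneState) shouldCompact(threshold float64) bool {
	ones := 0
	for _, w := range z.deleted {
		ones += bits.OnesCount64(w)
	}
	total := len(z.deleted) * 64
	return total > 0 && float64(ones)/float64(total) > threshold
}
```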
4 | 5 | Each erase zone has an array of 1 bit per sector/word (state). 6 | (a bloom filter with a perfect hash) 7 | 8 | 512 MB for every 16 TB of SSD. 9 | 10 | ## Delete ops 11 | 12 | YFS marks the bit for each sector involved as 1. 13 | 14 | If the sum of 1s is greater than a threshold, invoke compact ops. 15 | 16 | ## Compact ops 17 | 18 | - A new, empty zone is selected and frozen. 19 | - YFS freezes the state and sends it to the compact process. 20 | - Any further deletes are stored in a temp queue 21 | - The compact process reads the state, and issues a batch copy command to YFS for each valid sector 22 | - When done, YFS issues an erase zone command 23 | - YFS applies the pending delete ops 24 | 25 | 26 | ## Advantages 27 | 28 | - Very fast 29 | - Never blocking 30 | - Ultra-low latencies 31 | -------------------------------------------------------------------------------- /docs/XX_pending/fs.md: -------------------------------------------------------------------------------- 1 | # A pragmatic implementation of Wait Free Persistent Storage 2 | 3 | ## Abstract 4 | 5 | We present a pragmatic implementation of a wait-free persistent storage, designed 6 | around NVMe characteristics and asynchronous IO with `io_uring`. 7 | 8 | 9 | 10 | ## Introduction 11 | 12 | In the context of distributed datastores, latency and predictability are key metrics, 13 | with implications for the transactional throughput and the cost of the system. 14 | 15 | 16 | ## Constraints 17 | 18 | ## Related work 19 | 20 | ## The algorithm 21 | 22 | The fundamental data structure of the algorithm is the Record: 23 | 24 | ``` 25 | 26 | const ( 27 | Size = 4096 // Underlying block size 28 | ) 29 | type Block [Size]byte 30 | 31 | type Record struct { 32 | Body []Block 33 | Tails [][]Block 34 | Appends [] 35 | } 36 | 37 | ``` 38 | 39 | ## XFS implementation 40 | 41 | The fundamental structure is a 42 | 43 | 44 | ## NVMe implementation 45 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Alberto 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE.
22 | -------------------------------------------------------------------------------- /docs/XX_pending/yottadb.md: -------------------------------------------------------------------------------- 1 | # Yottadb 2 | 3 | 4 | ## Drivers 5 | 6 | ### Key-value 7 | 8 | Basic driver: given a record in the form: 9 | - `account@[driver:]tableName/recordName[/recordRow]` 10 | 11 | it allows finding which shard contains the binary data associated 12 | with the record. 13 | 14 | Possible operations are: 15 | 16 | - Read 17 | - Write 18 | - Append 19 | - Delete 20 | 21 | ### Columnar 22 | 23 | Like key-value, but optimized for storing very long columns. 24 | 25 | Possible operations are: 26 | 27 | - Read 28 | - Write 29 | - Append 30 | - Delete 31 | 32 | ### Document 33 | 34 | An improvement over the key-value driver, it allows dealing with 35 | documents instead of plain binary data. Possible operations are: 36 | 37 | - Read document 38 | - Write document 39 | - Update document 40 | - Delete document 41 | - Get collection 42 | - Create collection 43 | - Update collection 44 | - Delete collection 45 | 46 | ### PubSub 47 | 48 | An improvement over the columnar store, to handle queues. Possible 49 | operations are: 50 | 51 | - CRUD Topic 52 | - Publish 53 | - CRUD Subscription 54 | - Consume 55 | 56 | ### Indexed 57 | 58 | An improvement over the document driver, it uses the 59 | columnar driver to create indexes. 60 | 61 | ### Graph 62 | 63 | Uses the key-value store to represent a graph. 64 | -------------------------------------------------------------------------------- /docs/XX_pending/rendezvous.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | ## Tree Layers 4 | 5 | - Region: Picked at config time 6 | - Zone: DC, or group of closely related DCs 7 | - L1,...Ln: Skeleton construction 8 | - Node: Node, or group of closely related nodes 9 | - Shard: Namespace, or group of closely related namespaces 10 | - Blocks: List of blocks containing data 11 | 12 | 13 | ## Walk the tree 14 | 15 | Perform rendezvous at each layer of the tree, inspired by the 16 | skeleton construction. 17 | 18 | ## Collection step 19 | 20 | For each collection and node type (random, linear, compute): 21 | 22 | - `R`: Replication factor 23 | - `MAX_LOAD`: Maximum number of shards from the same 24 | collection in a node, power of 2 25 | - `S = N * MAX_LOAD`: Shard factor, greater than or equal to `MAX_LOAD` 26 | - `N`: Number of nodes, power of 2 27 | 28 | TODO: handle dynamic S per collection 29 | 30 | - Pick regions at config time for each collection 31 | - Pick `R` walks starting at the root of the tree 32 | - At the next level of the tree, pick `N` walks 33 | until you reach a node 34 | - (Optional): Add `log(N)` walks, to pick failover nodes 35 | - For each node pick `MAX_LOAD` shards 36 | 37 | TODO: handle weights and dynamic MAX_LOAD per node 38 | 39 | We end up with a pool of `R * S` shards, distributed among `N` nodes 40 | in `R` regions/zones. We also have a pool of `R * log(N)` nodes as 41 | backup.
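As a rough sketch of a single level of such a walk (the function names, the `/` separator and the direct use of SHA-512 here are illustrative assumptions, not the actual implementation), each walk can be seen as repeatedly picking the highest-scoring children at every layer of the tree:

```go
package rendezvous

import (
	"crypto/sha512"
	"encoding/binary"
	"sort"
)

// score ranks a child of the current tree layer for a given key, using the
// highest-random-weight (rendezvous) construction.
func score(key, child string) uint64 {
	sum := sha512.Sum512([]byte(key + "/" + child))
	return binary.BigEndian.Uint64(sum[:8])
}

// pickTopK returns the k children of one layer with the highest score for
// the key. Walking the tree means calling this at every layer (regions,
// zones, ..., nodes, shards), feeding the chosen child back into the key.
func pickTopK(key string, children []string, k int) []string {
	sorted := append([]string(nil), children...)
	sort.Slice(sorted, func(i, j int) bool {
		return score(key, sorted[i]) > score(key, sorted[j])
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}
```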
42 | 43 | ## Record step 44 | 45 | For each record in a collection, given a pool of `R*S` shards: 46 | 47 | - Start `R` walks 48 | - For each walk, pick one node 49 | - For each node, pick one shard 50 | 51 | You end up with `R` shards, which should all contain 52 | replicas of your record. 53 | 54 | ## 55 | -------------------------------------------------------------------------------- /docs/README.md: -------------------------------------------------------------------------------- 1 | # Documentation 2 | 3 | I hope that the reader who bears with reading this will then be able to fully understand the system, 4 | and to implement one themselves if they decide to do so. 5 | 6 | ### Chapter 1: Architecture 7 | 8 | - [Chapter 1.1](01_architecture/11_problem.md): The problem 9 | - [Chapter 1.2](01_architecture/21_principles.md): Design principles 10 | - [Chapter 1.3](01_architecture/31_machine.md): The distributed machine 11 | - [Chapter 1.3.1](01_architecture/32_instructions.md): The instruction set 12 | - [Chapter 1.4](01_architecture/41_consistency.md): Consistency Model 13 | 14 | 15 | ### Chapter 2: Leaderless partitioning and replication 16 | 17 | - [Chapter 2.1](03_partitioning/11_introduction.md): Introduction 18 | - [Chapter 2.2](03_partitioning/21_rebar.md): Rendezvous based routing 19 | - [Chapter 2.3](03_partitioning/31_gap.md): Gossip agreement protocol 20 | - [Chapter 2.4] Examples 21 | - [Chapter 2.4.1](03_partitioning/41_examples.md): Partitions and replicas 22 | - [Chapter 2.4.2](03_partitioning/41_examples.md): Key routing 23 | - [Chapter 2.4.3](03_partitioning/41_examples.md): Gossip rounds 24 | 25 | 26 | ### Chapter 3: YottaFS 27 | 28 | - [Chapter 3.1](05_yottafs/nvme.md): The best filesystem 29 | - [Chapter 3.2](05_yottafs/warp.md): Transactions and indexes 30 | 31 | ### Chapter 4: Infrastructure 32 | 33 | - [Chapter 4.1](07_infrastructure/11_serverless.md): Decoupling storage and compute 34 | - [Chapter 4.2](07_infrastructure/12_hybrid.md): Hybrid cloud and standby replicas 35 | - [Chapter 4.3](07_infrastructure/13_sizing.md): The best node size (queueing theory) 36 | 37 | ### Chapter 5: Benchmarking 38 | 39 | - [Chapter 5.1](10_benchmarks/benchmarks.md): Performance and cost estimates 40 | 41 | -------------------------------------------------------------------------------- /docs/XX_pending/yottapack.md: -------------------------------------------------------------------------------- 1 | # Yottapack 2 | 3 | Yottapack is a bandwidth- and CPU-efficient serialization format: 4 | 5 | - Size efficient 6 | - Efficient streaming 7 | - Query fields 8 | - Schema or schemaless 9 | - Row or columnar fields 10 | - Skip list for fast access 11 | - Extensible 12 | 13 | 14 | # Format 15 | 16 | ## Header 17 | - Intro byte 18 | - 1+ size byte (needed for streaming) 19 | - (optional): Heads encoding 20 | 21 | ### Fat heads encoding 22 | 23 | Not needed if using a schema or when streaming. Encodes strings 24 | representing the names of fields (and nested fields?) 25 | 26 | - 1 type byte (is it needed?)
- 1+ byte size 28 | - N+ chars for value 29 | 30 | ### Light heads encoding 31 | 32 | Used if a schema is present 33 | 34 | - 1+ byte size for each record 35 | - 0 for some, if order is guaranteed 36 | 37 | ## Data 38 | 39 | - 1 type byte (optional) 40 | - 1+ size byte 41 | - N+ value byte 42 | 43 | # Special bytes 44 | 45 | ## Intro byte 46 | 47 | 1 byte for versioning and flags 48 | 49 | ## Forbidden sequences 50 | 51 | - 0x00, 0x00, ETX 52 | - 0x00, ETX, EOT 53 | 54 | 55 | ## Type byte 56 | 57 | - 1 bit isArray 58 | - 1 bit isMap 59 | - 1 bit reserved 60 | - 5 bit type (32 types) 61 | 62 | ### List of types 63 | 64 | - Char 65 | - Binary 66 | - Octal 67 | - Hex 68 | - Bool 69 | - Nil 70 | - Integers ( 2 * 4 ) 71 | - BigInt 72 | - Floats (2) 73 | - BigDecimal 74 | - Time (2) 75 | - Geolocation 76 | - Dynamic 77 | - Lookup 78 | - Padding 79 | - Reserved (8) 80 | - Extended (2 bytes) 81 | 82 | Total: 32 83 | 84 | ## Length encoding 85 | 86 | 1+ bytes of u8, like UTF-8. 87 | 88 | The 1st byte signals whether you should also read the next byte. 89 | 90 | - 0 -> 128: 1 byte 91 | - 128+1 -> 2^14-1: 2 bytes (first bit is 1, 8th bit is 0) 92 | - 2^14+1 -> 2^21-1: 3 bytes (first bit is 1, 8th bit is 1, 93 | 16th bit is 1) 94 | - and so on.... 95 | 96 | ## Separators 97 | 98 | EOT or ETX are used as separators. They are escaped in strings. 99 | -------------------------------------------------------------------------------- /docs/03_partitioning/11_introduction.md: -------------------------------------------------------------------------------- 1 | # The problem 2 | 3 | To achieve our size and throughput goals we need to 4 | find a partitioning scheme with the following properties: 5 | 6 | - Deterministic: all the observers should agree on the same 7 | partitioning with minimal information exchange 8 | - [Consistent](https://en.wikipedia.org/wiki/Consistent_hashing): 9 | removal or addition of nodes should cause minimal reshuffling 10 | of keys 11 | - Hierarchical: the scheme is sensitive to locality 12 | 13 | To solve this we use [rendezvous hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing), 14 | which allows us to choose `k` nodes in a consistent way, with 15 | minimal information sharing across observers. 16 | 17 | This alone is not enough to solve all our problems: a 18 | naive implementation would require all nodes to know about each 19 | other's status, which is too expensive for large clusters. 20 | We need to introduce the concept of a **hierarchy tree**, 21 | which is a locality-sensitive way to partition the nodes. 22 | 23 | Moreover, we observe that while the addition of new nodes can 24 | be planned and executed in an optimal way, the failure of 25 | a node can happen at any time, and thus should be the case for 26 | which we optimize. 27 | 28 | # Hierarchy tree 29 | 30 | The [skeleton-based variant](https://en.wikipedia.org/wiki/Rendezvous_hashing#Skeleton-based_variant) 31 | of the rendezvous hashing algorithm gives us the idea of building 32 | a tree to speed up operations. 33 | But before doing that, let's analyze the failure modes of the cluster: 34 | 35 | - Under the assumption that our code is correct, we can assume 36 | that shards will not fail. In practice we can recover even from that. 37 | - The smallest failure mode over which we have no control is disk failure. 38 | - Then we have node failure, where all the disks attached to a node become unavailable. 39 | - Then we have several kinds of group failures: rack failure, room failure, 40 | datacenter failure, and so on.
41 | 42 | We expect the disk failure to be the most common mode, and the one 43 | for which we will optimize, and then the other modes in decreasing 44 | order of probability. This allows us to define a natural tree-like 45 | hierarchy based on failure modes, for which the levels are: 46 | 47 | - Regions 48 | - Zones 49 | - Racks 50 | - Nodes 51 | - Disks 52 | - Shards 53 | 54 | 55 | -------------------------------------------------------------------------------- /docs/03_partitioning/21_rebar.md: -------------------------------------------------------------------------------- 1 | # Rendezvous based routing algorithm 2 | 3 | The Rendezvous Based Routing (ReBaR) algorithm 4 | is used to navigate a hierarchy tree, to find 5 | the shard responsible for a record. It is named 6 | after the rendezvous hashing 7 | approach it uses to agree on a set of choices with 8 | minimal information exchange. 9 | 10 | Given a record in the form: 11 | 12 | `account@driver:collection/record/subrecord` 13 | 14 | and a hierarchy tree, how do we find the 15 | shards responsible? 16 | 17 | Before answering that, we need the user to tell us: 18 | - The region(s) involved 19 | - The type of replication (single-region or multi-region) 20 | - The sharding factor `s` 21 | - The replication factor `r` 22 | 23 | 24 | In total, we will need to find `s*r` shards, spread across 25 | one or multiple regions according to the replication type. 26 | To do that we split the algorithm into 3 steps: 27 | - Parsing the record 28 | - Finding the shard pool responsible for the collection 29 | - Finding the shards responsible for the record from the pool 30 | 31 | 32 | ## Parsing the record 33 | 34 | Given a record string, it is parsed into a struct like: 35 | 36 | ```go 37 | package main 38 | 39 | type Record struct { 40 | Account string 41 | Driver string 42 | Collection string 43 | Record string 44 | PoolPtr string 45 | ShardPtr string 46 | } 47 | ``` 48 | 49 | // TODO: put link to driver types 50 | The driver [type]() 51 | will determine which types of nodes are involved 52 | in the collection (`linear` or `random` nodes). 53 | 54 | The `PoolPtr` is in the form `account@driver:collection` and 55 | is the same for all the records in a collection. 56 | 57 | The `ShardPtr` is in the form `account@driver:collection/record` and 58 | is the same for all the subrecords of a root record. 59 | 60 | ## Hashing step 61 | 62 | Before describing the next parts, let's define the hashing step, 63 | which is a slight variation of the classic rendezvous hashing. 64 | Given a 65 | 66 | ## Finding the pool 67 | 68 | Each collection is assigned to a pool of shards, determined by 69 | the `PoolPtr` of the record. This has the advantage that it is 70 | possible to enumerate all the shards of a collection, making queries easier. 71 | 72 | We start by picking a region 73 | 74 | ## Finding the shard 75 | 76 | Once the pool is found, we need to find the shards responsible for the record. 77 | -------------------------------------------------------------------------------- /docs/03_partitioning/31_gap.md: -------------------------------------------------------------------------------- 1 | # Gossip agreement protocol 2 | 3 | The GAP is the protocol tasked with building the hierarchy tree. 4 | 5 | 6 | # Optimizations 7 | 8 | The traditional gossip protocol relies on picking a random node 9 | for each round of gossip. This means that, barring lucky choices, 10 | we need more than `o(log(n))` steps to converge.
Please see our 11 | simulations. 12 | 13 | Here we propose a couple of simple improvements, which we call rendezvous 14 | gossip, with the following properties: 15 | 16 | - Guaranteed `o(log(n))` convergence 17 | - Fast failure detection 18 | - Very large cluster size (10^9 nodes) 19 | 20 | Implementation [here](). 21 | 22 | ## Small cluster 23 | 24 | // TODO: put in rebar 25 | 26 | Let's say we have a cluster of 8 nodes and the fanout is 3. 27 | The cluster is already at equilibrium, so each node knows about all the other nodes. 28 | 29 | For the next round, each node performs a round of rendezvous hashing to determine 30 | which node to send the next message to. The hashed string is: 31 | 32 | `nodeName,iteration,peer1,peer2, ... peer8` 33 | 34 | which is built using the lexicographic order of the node hostnames. The 512-bit 35 | hash is split into 8 chunks of 64 bits each, and the next 3 nodes are selected. 36 | 37 | The message is sent over the UDP protocol, and contains a hash of the currently 38 | known list of healthy nodes: 39 | 40 | `peer1,peer2, ... peer8` 41 | 42 | Each peer can then immediately verify if all the senders agree on the list 43 | of peers. If not, a synchronization over TCP is performed. 44 | 45 | This approach guarantees that in `o(log(n))` steps not only will the network 46 | converge, but each node will also receive at least 47 | one confirmation, for extremely fast failure detection. 48 | 49 | ## Large clusters 50 | 51 | Up to 1024 nodes the above strategy is fine, but failures become 52 | increasingly frequent and more expensive, as they will 53 | require a lot of synchronization. 54 | 55 | To scale up we can start using hierarchies and Merkle trees: 56 | 57 | - Nodes are organized in groups 58 | - Groups are organized in Zones 59 | - Groups can have a size from 25 nodes to 250 nodes 60 | - Each node knows about the local nodes, and the hash of other groups. This controls the size of the message. 61 | - Only a few nodes communicate across groups 62 | 63 | Let's say we have a group of 25 nodes, and a fanout of 6. We know that there are 4 groups in 2 zones. 64 | For a node in group 1, our string thus becomes: 65 | 66 | `peer1,peer2, ...,peer25,group2` 67 | 68 | When any node selects group2 as a candidate, it will send the message to one node in group2. 69 | The message is generated using a Merkle tree. 70 | 71 | 72 | 73 | -------------------------------------------------------------------------------- /docs/01_architecture/31_machine.md: -------------------------------------------------------------------------------- 1 | # The distributed machine 2 | 3 | Our approach to solving the task is to build a distributed machine, 4 | with a 512-bit memory pointer and a large word size. 5 | 6 | To build that we need: 7 | 8 | - Memory pointers 9 | - Memory to store data 10 | - A virtual machine with an instruction set 11 | 12 | 13 | ## The memory pointer 14 | 15 | To access the memory we will use a [hash table](../05_yottafs/hashTable.md): given the key, it's possible to calculate 16 | the hash and thus find the position of the value in the memory. 17 | A key is of the form: 18 | 19 | `account/resource/resourceType/primaryKey/sortKey` 20 | 21 | - `account` to allow multi-tenancy on the same machine. 22 | - `resource` can be either the table name or the queue name 23 | - `resourceType` can be either `rnd` for random access keys, backed by NVMe, 24 | or `lnr` for queues, backed by HDDs. 25 | - `primaryKey` represents the key to which a value is associated.
- `sortKey` is optional and is not hashed, like in DynamoDB. 27 | Typically used for versioning, it permits storing different values of 28 | the same primary key together. 29 | 30 | The hash used is `SHA-512`, hence the 512-bit size of the pointer. To determine which 31 | node is responsible for the key, we use [rendezvous hashing](../partitioning/rendezvous.md). 32 | 33 | ## The memory 34 | 35 | To achieve our size goals, we will treat NVMe SSDs as our main memory. Although not as fast as RAM, 36 | NVMe provides excellent throughput and reliability. Having non-volatile random-access memory means that it is 37 | much easier to recover from failures, while still providing exceptional performance characteristics. 38 | 39 | More in [SSD vs RAM performance]() 40 | 41 | ## Word size 42 | 43 | The large word size is due to the nature of the backing memory. For random 44 | access it is the SSD sector size, which is 4 KiB: even if a value is smaller than 45 | that, the controller will still use a 4 KiB sector, so it's better to optimize for that. 46 | 47 | In the case of linear memory, which is typically an 48 | [SMR HDD](https://en.wikipedia.org/wiki/Shingled_magnetic_recording), 49 | the word size is much bigger, up to 32 MiB, to optimize for the 50 | high linear performance characteristics of hard disks. Queues 51 | can be temporarily stored in random memory before being aggregated 52 | on disk, like the metadata needed to read the queue. More [here]() 53 | 54 | 55 | ## The virtual machine 56 | 57 | For simplicity, we can take the WebAssembly virtual machine instruction set as a reference, 58 | with a few important additional instructions: 59 | 60 | - LOAD $POINTER: Load the memory stored at POINTER 61 | - COMPARE $POINTER $VALUE: 62 | - WRITE $POINTER $VALUE 63 | - COMPARE&SWAP $POINTER $VALUE1 $VALUE2: Atomic compare and swap in the global memory 64 | 65 | These instructions are used to build lock-free concurrent data structures, which are the key to the 66 | high availability of this system. For a more detailed discussion read [here]() 67 | 68 | -------------------------------------------------------------------------------- /docs/05_yottafs/hashTable.md: -------------------------------------------------------------------------------- 1 | # Hash Table 2 | 3 | Here we describe a hash table optimized for NVMe storage. For 4 | simplicity, we assume a namespace of 1 TB, and a block size of 4 KB, 5 | which means 2^28 blocks. 6 | 7 | If we were to implement a naive hash table, we would face the following 8 | challenges: 9 | 10 | - Poorly utilized space 11 | - Needs wear leveling 12 | - Values bigger than a block 13 | - Needs to handle collisions 14 | 15 | ## Pointers array 16 | 17 | When we hash the key, we obtain a value which represents the index of an array. 18 | The entry at that index will contain either the value of the key or a 19 | pointer to the value: 20 | 21 | - 512 bits to reference the key hash 22 | - 32 bits to reference the first block containing the value 23 | - 16 bits to represent how many blocks are used to store the value 24 | 25 | Even if we assume a total size of 1024 bits, this means a 4 KiB block 26 | can contain 32 pointers. Let's say we decide to store 8 pointers per block. 27 | 28 | We allocate a large number of blocks to store the buckets of the hash table, let's say 29 | 64 GiB or 2^24 blocks. This means we can address a total of 2^27 keys, or 512 GiB at 30 | 4 KiB per key. The remaining space on the device is used to store the values, 31 | which can be of arbitrary size.
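As a minimal sketch of the pointer entry described above (the field names and the exact padding are assumptions for illustration, not the final on-disk layout):

```go
package yottafs

// PointerEntry sketches one slot of the pointers array described above:
// the 512-bit key hash, the first block holding the value, and the number
// of blocks used, padded to a 1024-bit (128-byte) slot so that a whole
// number of entries fits in a 4 KiB block (32 per block).
type PointerEntry struct {
	KeyHash    [64]byte // SHA-512 of the key (512 bits)
	FirstBlock uint32   // index of the first block containing the value
	NumBlocks  uint16   // how many blocks are used to store the value
	_          [58]byte // reserved/padding up to 128 bytes
}
```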
32 | 33 | To further improve wear leveling, we decide to aggregate 3 blocks together, and rotate writing: 34 | Each block will contain up to 24 pointers, still less than the maximum of 32 pointers, 35 | and every time a record is created or updated, we will rotate the block written. 36 | 37 | A record key is created or updated less often than the value, so this approach should 38 | provide optimal performance and endurance. We now need a way to efficiently 39 | allocate memory pointers. 40 | 41 | ## Memory allocation 42 | 43 | The first 128 blocks are reserved for the disk metadata, and the following 64 GiB are 44 | reserved for the associative array; this means we have around 960 GiB of space to 45 | store the values. 46 | 47 | We need a mechanism to keep track of which blocks are free and which are used. 48 | To do that we can store a list containing the free memory pointers, and the number of blocks. 49 | For example at the beginning it will be: 50 | 51 | `[{0: 2^30}]` 52 | 53 | which means that starting at address 0, we have 2^30 blocks, which are free. 54 | 55 | Out of this space we decide to allocate 1024 blocks for keeping track of the free blocks, so 56 | our free space will look like: 57 | 58 | `[1024: 2^30-1024]` written at blocks 0 and 512. 59 | 60 | This means that the first free block is 1024, and we have 2^30-1024 free blocks. 61 | When we want to save a new key, we can read the first 1024 blocks, detect the most 62 | recent state, and write the new state with a compare and swap. The atomicity of 63 | the compare and swap gives us concurrency without locks, for very high performance. 64 | 65 | To avoid block thrashing, we can decide to allocate a little bit more memory than required, 66 | for example 30% more, rounded up. In case we need to write one block, we will allocate two, 67 | so that successive updates can rotate the blocks without needing another allocation. 68 | 69 | Smart allocation will be used to ensure that the list does not grow unbounded. 70 | -------------------------------------------------------------------------------- /docs/05_yottafs/yfs.md: -------------------------------------------------------------------------------- 1 | # Yottastore File System 2 | 3 | User-space filesystem. Shard-per-core, single-threaded, concurrent. Parallelism is achieved by sharding. 4 | 5 | Listens on a Unix domain socket for requests from the network daemon. 6 | 7 | Garbage collection and compactification are done concurrently on a separate thread (see 8 | [Trim](./trim.md)) 9 | 10 | Keep a big hashmap (hash trie?) in memory, pointing uints to sectors. (resize?) 11 | If the hashmap is 32-bit key + 32-bit value, then 2 GB for every TB of SSD. 12 | The compute node takes care of hashing and issuing consecutive commands. 13 | 14 | ## Loop 15 | 16 | In a for loop, wait for io_uring events. 17 | 18 | In case of a request (socket event), issue the command with a UUID (32-bit unique + 32-bit flag) 19 | and store it in the hashmap. 20 | 21 | In case of IO ready, use the UUID to find the relevant request and issue a send command (on the socket). 22 | 23 | # Operations supported by the filesystem 24 | 25 | - Read 26 | - Read Range 27 | - Read Follow 28 | - Write 29 | - Append 30 | - Delete 31 | - Compare 32 | - Compare and Swap 33 | - Flush 34 | - Verify 35 | 36 | 37 | ## Read 38 | 39 | Read takes a `uint`, and returns the sector which the hash trie points to. 40 | 41 | Read Range takes a list of ranges to read. 42 | 43 | Read Follow returns the sector, and any sector the skip list points to. 44 | 45 | 46 | ## Write 47 | 48 | Write attempt to WAL.
Write a data stream to disk, for the given key. Takes care of using multiple sectors if needed. 50 | If the key exists, it gets rewritten. The compute node is responsible for merging existing data. 51 | If successful, record to the WAL and return true. 52 | 53 | 54 | 55 | Options: 56 | - Compressed: signals that the compute node compressed the data 57 | - Compact: if the size is much less than a sector, signals that compactification with other records is desired 58 | - Revert: possible to mark the write as revertible 59 | - Compare: verify the existing state before writing, fail on mismatch 60 | - Flush: make the write strongly consistent 61 | 62 | ## Append 63 | 64 | Append a data stream to an existing key. Takes care of using multiple sectors if needed. 65 | Useful for columnar records. 66 | 67 | Options: 68 | - Compressed: signals that the compute node compressed the data 69 | - Compact: if the size is much less than a sector, signals that compactification with other records is desired 70 | - Revert: possible to mark the write as revertible 71 | - Compare: verify the existing state before writing, fail on mismatch 72 | - Flush: make the append strongly consistent 73 | 74 | ## Delete 75 | 76 | Given a key, mark it as deleted. 77 | 78 | If the zone is full, compute the share of deleted sectors per zone; if it is above the threshold, then issue a trim command. 79 | 80 | ## Compare 81 | 82 | Verify that a key's sector pointer in the hashmap has not changed. 83 | 84 | Returns true or false. 85 | 86 | ## Compare and swap 87 | 88 | Write temp data. 89 | Verify that a key's sector pointer in the hashmap has not changed. 90 | If it has not, perform a swap. 91 | 92 | Can be multi-word, to support transactions. 93 | 94 | Returns true or false. 95 | 96 | Optional: 97 | 98 | Count failures through a sliding-window bloom filter, to switch from CAS to locks under high contention. 99 | 100 | ## Flush 101 | 102 | Issue a Flush command to empty the write queue. 103 | 104 | ## Verify 105 | 106 | Given a key, verify that it matches the CRC. Done by the compute node. 107 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | Yotta Store is a [next generation]() storage system aiming to **scale out to the yottabyte range** 3 | and **scale up to millions of concurrent reads and writes per record**. 4 | The goal is to have **two orders of magnitude more throughput than DynamoDB**, 5 | dollar per dollar, while maintaining a **sub-ms latency**. 6 | Check our [benchmarks](docs/10_benchmarks/benchmarks.md). 7 | 8 | Yotta Store is built on top of a 512-bit distributed machine, with a 4 KiB word size. 9 | We try to design a system which can exploit the capabilities of 10 | modern hardware and software, like NVMe disks or `io_uring`. 11 | Read more in the [docs](docs/README.md). 12 | 13 | - [Golang implementation](https://github.com/yottaStore/golang) (WIP) backed by Cloud Storage 14 | - [Rust implementation](https://github.com/yottaStore/rust) (WIP) backed by the io_uring NVMe API 15 | - 16 | ## Main features 17 | 18 | - Linear scalability, up to 10^9 nodes and 10 yottabytes of addressable space. 19 | - Anti-fragility: the multi-tenant setup increases reliability and availability with load. 20 | - Self-configuring and self-optimizing, thanks to machine learning. 21 | - Strong consistency guarantees, aiming at sub-ms latency. 22 | - Cheap transactions and indexes, at around `o(n)`. 23 | - Storage decoupled from compute with a serverless architecture.
- Two orders of magnitude faster than DynamoDB, dollar per dollar. 25 | 26 | 27 | ## Techniques used 28 | 29 | Yotta Store does not introduce any new techniques, but uses existing ones in a novel combination: 30 | 31 | | Problem | Solution | Description | 32 | |----------------------------------|---------------------------------|--------------------------------------------------------------------------------------------------------------------------| 33 | | Partitioning and replication | ReBaR: Rendezvous Based Routing | A generalization of consistent hashing, for agreeing on k choices in a stable way and with minimal information exchange | 34 | | Highly available reads and writes | YottaFS | A wait-free userspace filesystem, designed around NVMe devices and CRDTs. | 35 | | Hot records | YottaSelf | Hot records can be sandboxed in dedicated partitions. | 36 | | Transactions | Modified Warp | Thanks to the modified Warp algorithm, no distributed consensus is needed. | 37 | | Indexes, queries across keys | Partitioned indexes | Indexes are sharded and lazily built on the same partition that sees the changes. | 38 | | Membership and failure detection | Gossip agreement protocol | Thanks to GAP and ReBaR, the expected cost is `o(log(n))`, deterministically. | 39 | 40 | 41 | ## Inspirations 42 | 43 | - [Designing Data-Intensive Applications, Martin Kleppmann](https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/) 44 | - [The Dynamo paper, many authors](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) 45 | - [A pragmatic implementation of non-blocking linked-lists, Timothy L. Harris](https://timharris.uk/papers/2001-disc.pdf) 46 | - [Warp: Lightweight Multi-Key Transactions for Key-Value Stores, many authors](https://arxiv.org/pdf/1509.07815.pdf) 47 | - [Thread per core](https://helda.helsinki.fi/bitstream/handle/10138/313642/tpc_ancs19.pdf) 48 | - [Kafka](https://docs.confluent.io/platform/current/kafka/design.html) 49 | 50 | 51 | -------------------------------------------------------------------------------- /docs/03_partitioning/41_examples.md: -------------------------------------------------------------------------------- 1 | # Failure example 2 | 3 | For each hierarchy level there is a maximum number of failures tolerated, 4 | called **quorum**. 5 | Let's say a group of 64 disks is attached to 8 nodes in 2 racks: 6 | if a disk fails, the node will rebalance the load across the remaining 7 | 7 disks without problems. 8 | If instead more than quorum disks fail in a node, the system will trigger 9 | a rebalance across the other nodes in the same rack. 10 | 11 | Failures are local in the sense that recovery is first attempted locally, 12 | and then, if that fails, the recovery is attempted involving only the minimum 13 | number of levels in the hierarchy tree. 14 | 15 | # Scaling example 16 | 17 | Generally using weights with the rendezvous hashing algorithm is not 18 | a good idea, because changing the weights will cause a rebalance across 19 | many keys. 20 | 21 | Instead, we will use weights to accommodate future expansion of the cluster, 22 | which can be planned in advance, or to represent smaller nodes. 23 | We can create fake zones or nodes and assign them a weight of 0. 24 | When we start adding new nodes, we can assign a weight of 1 only to new accounts, to allow for gradually scaling up. 25 | Eventually all the accounts will see a weight of 1, but the rebalancing 26 | will be much more gradual. 27 | 28 | See an example [here](examples.md). For an implementation please see [here]() 29 | 30 | 31 | # Key addressing 32 | 33 | Let's say we have 128 nodes spread across 2 regions and we want 34 | to know the position of a key: 35 | 36 | `account/tableName/resourceType/primaryKey` 37 | 38 | We know that the replication factor is 6, and the partitioning factor 39 | is 16. We decide to store the data in a single region. 40 | 41 | Our hierarchy could be: 42 | 43 | - 2 Regions 44 | - 2 Zones per region 45 | - Each zone has 32 nodes with 16 disks each.
- Each disk has 8 1TB namespaces 47 | - Each namespace has 128 queues 48 | 49 | This means that we have a total of 2^21 combinations. Using a naive rendezvous hashing 50 | would be too expensive. We start by understanding which nodes are responsible for the 51 | table. Given the root of the key, `account/tableName/resourceType`, we need to pick 52 | 16*6 = 96 nodes using rendezvous hashing on this tree: 53 | 54 | - 2 Regions, with weight [1, 0] 55 | - 8 Zones, with weight [1, 1, 0, 0, 0, 0, 0, 0] 56 | - Nodes, without weight 57 | - Disks, without weight 58 | - Queues, without weight 59 | 60 | The first level is the region; the weights force the algorithm to pick the first region, 61 | to express a user preference, and it doesn't need hashing. 62 | The second level is the zone, with 2 real zones and 6 fake zones to plan for future expansion. 63 | Given that we need 96 nodes, we know that both zones will be involved. 64 | The third level is the node, with 32 possibilities times 2 zones. For each zone we will perform 65 | a round of rendezvous hashing, and pick 16 nodes. 66 | The fourth level is the disk, with 8 possibilities times 16 nodes. 67 | 68 | We have thus selected 256 possibilities. We can pick the 96 disks with the highest value and thus 69 | have our pool. We can now build a tree which represents the `account/tableName/resourceType` 70 | hierarchy: 71 | 72 | - 1 region 73 | - 2 zones 74 | - 16 nodes 75 | - 16 disks per node 76 | - 8 namespaces per disk 77 | 78 | Now every time we want to know the position of a key for the table, instead of performing another round 79 | on the cluster tree, we can do a lookup on this smaller tree: 80 | 81 | - 2 zones 82 | - 16 nodes 83 | - 16 disks per node 84 | - 8 namespaces per disk 85 | - 128 queues per namespace 86 | 87 | For a total of 2^19 possibilities. We can build a virtual tree: 88 | 89 | - 32 virtual nodes 90 | - 32 virtual nodes 91 | - 32 virtual nodes 92 | - 16 queues 93 | 94 | To know the position of a key we need 4 steps, or 8 steps if we want to know all the 6 replicas. 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | -------------------------------------------------------------------------------- /docs/05_yottafs/11_introduction.md: -------------------------------------------------------------------------------- 1 | # Filesystem 2 | 3 | It's possible to experiment with several different filesystem configurations; 4 | among the most common options are: 5 | 6 | - **EXT4**: Old, but reliable. 7 | - **BTRFS, NILFS**: Better suited for our case. 8 | - **NOATIME**: To improve performance. 9 | - **RAID**: To control replication and availability. 10 | - **TRIM**: To improve latency. 11 | 12 | Instead I would recommend a different approach, more aligned with our hardware 13 | capabilities, which is to use the **NVMe** specification. 14 | 15 | # NVMe specification 16 | 17 | NVMe is a new specification for storage devices. It gives us a set of commands 18 | to access the SSD as a block device, without any filesystem. It exposes a set 19 | of basic operations like: 20 | 21 | - **READ**: read a number of sectors on the device 22 | - **WRITE**: write a number of sectors on the device 23 | - **FLUSH**: execute all the commands in the queue 24 | - **VERIFY**: verify the data on the device 25 | - **COMPARE**: compare the data on the device with the data in the memory 26 | 27 | On top of that we also have `fused operations`, which allow us to combine 28 | a `compare` and `write` operation in a single atomic command.
This is fundamental for our [instruction set](../architecture/instructionSet.md). 30 | 31 | Discarding a filesystem gives us a lot of performance but also 32 | some new responsibilities: 33 | 34 | - Detect and interface with the NVMe device 35 | - Define a disk format 36 | - Handle wear leveling 37 | - Handle trimming 38 | - Handle replication (RAID) 39 | - Extend this to HDDs 40 | 41 | Let's solve them one by one. 42 | 43 | ## NVMe interface 44 | 45 | Although powerful, the NVMe specification can feel daunting. Thankfully, 46 | there exists a very nice library tailored exactly to our needs: the 47 | [Storage Performance Development Kit](https://spdk.io/). 48 | 49 | Behind a set of simple C primitives, it provides a high level of abstraction 50 | over the NVMe device, while fully residing in user land. This allows us 51 | to exploit the high concurrency of the device, by setting up multiple queues 52 | with high depth. 53 | 54 | The `nvme` CLI can be used as a fallback, without queues or fused commands. 55 | 56 | ## Disk format 57 | 58 | NVMe devices appear to Linux as block devices, which can be partitioned 59 | into several **namespaces**, backed by a number of underlying blocks. For 60 | simplicity of discussion, we will assume a namespace size of 1 TB, 61 | which, at 4 KB per sector, is equivalent to 2^28 blocks. 62 | 63 | ### Disk Metadata 64 | 65 | The disk format is a set of metadata that describes the disk layout, and 66 | allows us to parse the contents of the disk. It's especially important 67 | that this data is not corrupted, as that would make the disk unusable. 68 | 69 | We start by reserving the first 128 blocks, and store 70 | the initial metadata at positions 0 and 64, as copies. It's easy to 71 | imagine that a block of 4 KB is more than enough to 72 | store all the metadata, but eventually we can use 2 blocks instead. 73 | 74 | If for any reason the metadata needs to be updated, we will store the 75 | data in the next blocks, 1 and 65, again as copies. Once a few copies 76 | have been written, let's say 9, we will write a number of zeros to the 77 | oldest block, to mark it as available. 78 | 79 | Whenever a process wants to read the metadata, it can simply ask for 80 | the first 128 blocks of the namespace, and then parse the most recent 81 | block. The double copy is there to provide corruption resilience. 82 | 83 | ### Block format 84 | 85 | Since we are dealing with a low-level device, it's useful to define 86 | a binary format for the blocks. We can take for example `Avro` as a 87 | serialization format, maybe with a couple of additions: 88 | 89 | - We need a pointer to the next block, in case one block is not enough 90 | to store the data, and the next chunk is not in the same block. 91 | - We need a checksum to verify the integrity of the data. 92 | 93 | Read more [here](blockFormat.md). 94 | 95 | ### Wear leveling 96 | 97 | The metadata chapter gives us an idea of how wear leveling can be implemented. The 98 | application can decide to aggregate several blocks, and then rotate the writes 99 | to provide wear leveling. Ideally we would like to have a mechanism 100 | to determine in advance which the possible blocks are, so we can submit the read to multiple 101 | blocks as a single command. 102 | 103 | **Example**: A is in block 0, B is in block 1. The system might decide to write 104 | (A, B) to block 0, and leave block 1 free. When an update comes, we write 105 | it to block 1, and vice versa. Reading either A or B requires scanning both 106 | blocks to find the most recent data.
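To make the rotation above concrete, here is a minimal sketch under stated assumptions: the sequence-number header and the names `blockGroup`, `write` and `readLatest` are illustrative, not the actual implementation. Writes alternate between the blocks of a group, and reads scan the whole group and keep the newest copy.

```go
package yottafs

// versionedBlock pairs a payload with a write sequence number, so that the
// newest copy in a rotation group can be identified on read.
type versionedBlock struct {
	Seq  uint64 // write sequence number, higher means newer
	Data []byte // user payload (A, B, ... packed by the caller)
}

// blockGroup sketches a small set of physical blocks that share a rotation,
// as in the A/B example above.
type blockGroup struct {
	slots []versionedBlock // in-memory stand-in for the physical blocks
	next  int              // index of the slot that receives the next write
	seq   uint64           // last sequence number handed out
}

// write stores the payload in the next slot of the rotation.
func (g *blockGroup) write(data []byte) {
	g.seq++
	g.slots[g.next] = versionedBlock{Seq: g.seq, Data: data}
	g.next = (g.next + 1) % len(g.slots)
}

// readLatest scans every slot and returns the most recent payload,
// mirroring the "read both blocks, pick the newest" step above.
func (g *blockGroup) readLatest() []byte {
	best := -1
	for i, s := range g.slots {
		if s.Seq != 0 && (best == -1 || s.Seq > g.slots[best].Seq) {
			best = i
		}
	}
	if best == -1 {
		return nil
	}
	return g.slots[best].Data
}
```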
### Trimming 109 | 110 | Like with wear leveling, we can also implement a mechanism to trim the disk, inspired 111 | by the metadata chapter. It seems we need a mechanism with the following properties: 112 | 113 | - Keep track of the free blocks 114 | - Trim the unused blocks in an asynchronous way 115 | - Be thread safe 116 | 117 | It seems like we need a non-blocking garbage collector, which will also be useful 118 | for other purposes, like improving our wear leveling mechanism. More information [here](hashTable.md). 119 | 120 | ## Replication 121 | 122 | Normally replication is handled both by RAID and by the application, which is a waste. 123 | We can instead implement replication at the application level, with several advantages: 124 | 125 | - Unlike RAID, replicas can be read and written concurrently, increasing the throughput 126 | - We can guarantee that copies will be spread across machines, increasing reliability 127 | - Replicas are used as voting machines 128 | 129 | Rendezvous hashing allows us to choose in a deterministic way the nodes 130 | that will hold the replicas for our key, and thus to route accordingly. 131 | 132 | ## Hard disks 133 | 134 | We can now extend the disk model to hard disks, which are optimized for linear 135 | performance. If we want to publish to a queue, we can first determine which 136 | nodes are the owners of the queue, and then publish to those nodes. 137 | 138 | The problem is that a message is usually small in size, and we would prefer to write 139 | 32 MB chunks at a time. To achieve that we can use an intermediate cache on the 140 | SSDs: let's say we allocate a chunk of 512 MB. Every time a message arrives 141 | we write a 4 KB chunk to the disk in the first free block. When enough blocks 142 | are written they can be sent to the hard disk. The large chunk allocation allows 143 | for wear leveling and non-blocking concurrent writes. 144 | 145 | The only thing we need is to store the metadata of the disk, with the list of chunks 146 | allocated and which queue each belongs to. For that we can use an SSD, like we did for the original metadata. 147 | More information [here](sata.md). 148 | --------------------------------------------------------------------------------