├── 0001-buffer.md ├── 0002-storage.md ├── 0003-cas-decomposition.md ├── 0004-icas.md ├── 0005-local-blob-access-persistency.md ├── 0006-operation-logging-and-monetary-resource-usage.md ├── 0007-nested-invocations.md ├── 0009-nfsv4.md ├── 0010-file-system-access-cache.md ├── 0011-rendezvous-hashing.md ├── README.md └── images └── 0005 ├── before.png ├── persistent.png └── volatile.png /0001-buffer.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #1: Buffer layer 2 | 3 | Author: Ed Schouten
4 | Date: 2020-01-09 5 | 6 | # Context 7 | 8 | The `BlobAccess` interface that Buildbarn currently uses to abstract 9 | away different kinds of backing stores for the CAS and AC (Redis, S3, 10 | gRPC, etc.) is a bit simplistic, in that contents are always transferred 11 | through `io.ReadCloser` handles. This is causing a couple of problems: 12 | 13 | - There is an unnecessary amount of copying of data in memory. In likely 14 | the worst case (`bb_storage` with a gRPC AC storage backend), data may 15 | be converted from ActionResult Protobuf message → byte slice → 16 | `io.ReadCloser` → byte slice → ActionResult Protobuf message. 17 | 18 | - It makes implementing a replicating `BlobAccess` harder, as 19 | `io.ReadCloser`s may not be duplicated. A replicating `BlobAccess` 20 | could manually copy blobs into a byte slice itself, but this would 21 | only contribute to more unnecessary copying of data. 22 | 23 | - Tracking I/O completion and hooking into errors is not done 24 | accurately and consistently. `MetricsBlobAccess` currently only counts 25 | the amount of time spent in `Get()`, which won't always include the 26 | time actually spent transferring data. `ReadCachingBlobAccess` is 27 | capable of responding to errors returned by `Get()`, but not ones that 28 | occur during the actual transfer. A generic mechanism for hooking into 29 | I/O errors and completion is absent. 30 | 31 | - To implement an efficient FUSE file system for `bb_worker`, we need to 32 | support efficient random access I/O, as userspace applications are 33 | capable of accessing files at random. This is not supported by 34 | `BlobAccess`. An alternative would be to not let a FUSE file system 35 | use `BlobAccess` directly, but this would lead to reduced operational 36 | flexibility. 37 | 38 | - Data integrity checking (checksumming) is currently done through 39 | `MerkleBlobAccess`. To prevent foot-shooting, this `BlobAccess` 40 | decorator is injected into the configuration automatically. The 41 | problem is that `MerkleBlobAccess` is currently only inserted at one 42 | specific level of the system. When Buildbarn is configured to use a 43 | Redis remote data store in combination with local on-disk caching, 44 | there is no way to enable checksum validation for both Redis → local 45 | cache and local cache → consumption. 46 | 47 | - Stretch goal: There is no framework in place to easily implement 48 | decomposition of larger blobs into smaller ones (e.g., using 49 | [VSO hashing](https://github.com/microsoft/BuildXL/blob/master/Documentation/Specs/PagedHash.md)). 50 | Supporting this would allow us to use data stores optimized for small 51 | blobs exclusively (e.g., Redis). 52 | 53 | # Decision 54 | 55 | The decision is to add a new abstraction to Buildbarn, called the buffer 56 | layer, stored in Go package 57 | `github.com/buildbarn/bb-storage/pkg/blobstore/buffer`. Below is a 58 | massively simplified version of what the API will look like: 59 | 60 | ```go 61 | func NewBufferFromActionResult(*remoteexecution.ActionResult) Buffer {} 62 | func NewBufferFromByteSlice([]byte) Buffer {} 63 | func NewBufferFromReader(io.ReadCloser) Buffer {} 64 | 65 | type Buffer interface { 66 | ToActionResult() *remoteexecution.ActionResult 67 | ToByteSlice() []byte 68 | ToReader() io.ReadCloser 69 | } 70 | ``` 71 | 72 | It can be thought of as a union/variant type that automatically does 73 | conversions from one format to another, but only when strictly 74 | necessary. 
Calling one of the `To*()` functions extracts the data from 75 | the buffer, thereby destroying it. 76 | 77 | To facilitate support for replication, the `Buffer` interface may 78 | contain a `Clone()` function. For buffers created from a byte slice, 79 | this function may be a no-op, causing the underlying slice to be shared. 80 | For buffers created from other kinds of sources, the implementation may 81 | be more complex (e.g., converting it to a byte slice on the spot). 82 | 83 | To track I/O completion and to support retrying, every buffer may have a 84 | series of `ErrorHandler`s associated with it: 85 | 86 | ```go 87 | type ErrorHandler interface { 88 | OnError(err error) (Buffer, error) 89 | Done() 90 | } 91 | ``` 92 | 93 | During a transmission, a buffer may call `OnError()` whenever it runs 94 | into an I/O error, asking the `ErrorHandler` to either capture the error 95 | (as done by `MetricsBlobAccess`), substitute the error (as done by 96 | `ExistencePreconditionBlobAccess`) or to continue the transmission using 97 | a different buffer. The latter may be a copy of the same data obtained 98 | from another source (high availability). 99 | 100 | To facilitate fast random access I/O, the `Buffer` interface may 101 | implement `io.ReaderAt` to extract just a part of the data. It is likely 102 | the case that only a subset of the buffer types are capable of 103 | implementing this efficiently, but that is to be expected. 104 | 105 | Data integrity checking could be achieved by having special flavors of 106 | the buffer construction functions: 107 | 108 | ```go 109 | func NewACBufferFromReader(io.ReadCloser, RepairStrategy) Buffer {} 110 | func NewCASBufferFromReader(*util.Digest, io.ReadCloser, RepairStrategy) Buffer {} 111 | ``` 112 | 113 | Buffers created through these functions may enforce that their contents 114 | are valid prior to returning them to their consumer. When detecting 115 | inconsistencies, the provided `RepairStrategy` may contain a callback 116 | that the storage backend can use to repair or delete the inconsistent 117 | object. 118 | 119 | Support for decomposing large blobs into smaller ones and recombining 120 | them may be realized by adding more functions to the `Buffer` interface 121 | or by adding decorator types. The exact details of that are outside the 122 | scope of this ADR. 123 | -------------------------------------------------------------------------------- /0002-storage.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #2: Towards fast and reliable storage 2 | 3 | Author: Ed Schouten
4 | Date: 2020-01-09 5 | 6 | # Context 7 | 8 | Initial versions of Buildbarn as published on GitHub made use of a 9 | combination of Redis and S3 to hold the contents of the Action Cache 10 | (AC) and Content Addressable Storage (CAS). Due to their performance 11 | characteristics, Redis was used to store small objects, while S3 was 12 | used to store large ones (i.e., objects in excess of 2 MiB). This 13 | partitioning was facilitated by the SizeDistinguishingBlobAccess storage 14 | class. 15 | 16 | Current versions of Buildbarn still support Redis and S3, but the 17 | example deployments use bb\_storage with the "circular" storage backend 18 | instead. Though there were valid reasons for this change at the time, 19 | these reasons were not adequately communicated to the community, which 20 | is why users are still tempted to use Redis and S3 up until this day. 21 | 22 | The goal of this document is as follows: 23 | 24 | - To describe the requirements that Bazel and Buildbarn have on storage. 25 | - To present a comparative overview of existing storage backends. 26 | - To propose changes to Buildbarn's storage architecture that allow us 27 | to construct a storage stack that satisfies the needs for most large 28 | users of Buildbarn. 29 | 30 | # Storage requirements of Bazel and Buildbarn 31 | 32 | In the common case, Bazel performs the following operations against 33 | Buildbarn when executing a build action: 34 | 35 | 1. ActionCache.GetActionResult() is called to check whether a build 36 | action has already been executed previously. This call extracts an 37 | ActionResult message from the AC. If such a message is found, Bazel 38 | continues with step 5. 39 | 2. Bazel constructs a [Merkle tree](https://en.wikipedia.org/wiki/Merkle_tree) 40 | of Action, Command and Directory messages and associated input files. 41 | It then calls ContentAddressableStorage.FindMissingBlobs() to 42 | determine which parts of the Merkle tree are not present in the CAS. 43 | 3. Any missing nodes of the Merkle tree are uploaded into the CAS using 44 | ByteStream.Write(). 45 | 4. Execution of the build action is triggered through 46 | Execution.Execute(). Upon successful completion, this function 47 | returns an ActionResult message. 48 | 5. Bazel downloads all of the output files referenced by the 49 | ActionResult message from the CAS to local disk using 50 | ByteStream.Read(). 51 | 52 | By letting Bazel download all output files from the CAS to local disk, 53 | there is a guarantee that Bazel is capable of making forward progress. 54 | If any of the objects referenced by the build action were to disappear 55 | during these steps, the RPCs will either fail with code `NOT_FOUND` or 56 | `FAILED_PRECONDITION`. This instructs Bazel to reupload inputs from 57 | local disk and retry. Bazel therefore never has to perform any 58 | backtracking on its build graph, an assumption that is part of its 59 | design. 60 | 61 | This implementation places relatively weak requirements on Buildbarn in 62 | terms of consistency and persistence. Given a sufficient amount of 63 | storage space, Buildbarn will generally make sure that input files don't 64 | disappear between ContentAddressableStorage.FindMissingBlobs() and the 65 | completion of Execution.Execute(). It will also generally make sure that 66 | output files remain present at least long enough for Bazel to download 67 | them. Violating these weak requirements only affects performance; not 68 | reliability. 
69 | 70 | Bazel recently gained a feature called 71 | ['Builds without the Bytes'](https://github.com/bazelbuild/bazel/issues/6862). 72 | By enabling this feature using the `--remote_download_minimal` command 73 | line flag, Bazel will no longer attempt to download output files to 74 | local disk. This feature causes a significant drop in build times and 75 | network bandwidth consumed. This is especially noticeable for workloads 76 | that yield large output files. Buildbarn should attempt to support those 77 | workloads. 78 | 79 | Even with 'Builds without the Bytes' enabled, Bazel assumes it is always 80 | capable of making forward progress. This strengthens the storage 81 | requirements on Buildbarn as follows: 82 | 83 | 1. Bazel no longer downloads output files referenced by 84 | ActionResult messages, but may use them as inputs for other build 85 | actions at any later point during the build. This means that 86 | ActionCache.GetActionResult() and Execution.Execute() may never return 87 | ActionResult messages that refer to objects that are either not 88 | present or not guaranteed to remain present during the remainder of 89 | the build. 90 | 2. Technically speaking, it only makes sense for Bazel to call 91 | ContentAddressableStorage.FindMissingBlobs() against objects that 92 | Bazel is capable of uploading. With 'Builds without the Bytes' 93 | enabled, this set no longer has to contain output files of dependent 94 | build actions. However, this specific aspect is not implemented by 95 | Bazel. It still passes in the full set. Even worse: returning 96 | these objects as absent tricks Bazel into uploading files that are 97 | not present locally, causing a failure in the Bazel client. 98 | 99 | To tackle the first requirement, Buildbarn recently gained a 100 | CompletenessCheckingBlobAccess decorator that causes bb\_storage to only 101 | return ActionResult entries from the AC in case all output files are 102 | present in the CAS. Presence is checked by calling 103 | BlobAccess.FindMissing() against the CAS, which happens to be the same 104 | function used to implement ContentAddressableStorage.FindMissingBlobs(). 105 | 106 | This places strong requirements on the behaviour of 107 | BlobAccess.FindMissing(). To satisfy the first requirement, 108 | BlobAccess.FindMissing() may not under-report the absence of objects. To 109 | satisfy the second requirement, it may not over-report. In other words, 110 | it has to be *exactly right*. 111 | 112 | Furthermore, BlobAccess.FindMissing() now has an additional 113 | responsibility. Instead of only reporting absence, it now has to touch 114 | objects that are present, ensuring that they don't get evicted during 115 | the remainder of the build. Because Buildbarn itself has no notion of 116 | full build processes (just individual build actions), this generally 117 | means that Buildbarn needs to hold on to the data as long as possible. 118 | This implies that underlying storage needs to use an LRU or pseudo-LRU 119 | eviction policy. 120 | 121 | # Comparison of existing storage backends 122 | 123 | Now that we've had an overview of our storage requirements, let's take a 124 | look at the existing offering of Buildbarn storage backends. In addition 125 | to consistency requirements, we take maintenance aspects into account. 126 | 127 | ## Redis 128 | 129 | Advantages: 130 | 131 | - Lightweight, bandwidth efficient protocol. 132 | - An efficient server implementation exists. 
133 | - Major Cloud providers offer managed versions: Amazon ElastiCache, 134 | Google Cloud Memorystore, Azure Cache for Redis, etc. 135 | - Pipelining permits efficiently implementing FindMissing() by sending a 136 | series of `EXISTS` requests. 137 | - [Supports LRU caching.](https://redis.io/topics/lru-cache) 138 | 139 | Disadvantages: 140 | 141 | - The network protocol isn't multiplexed, meaning that clients need to 142 | open an excessive number of network sockets, sometimes up to dozens 143 | of sockets per worker, per storage backend. Transfers of large objects 144 | (files) may hold up requests for small objects (Action, Command and 145 | Directory messages). 146 | - Redis is not designed to store large objects. The maximum permitted 147 | object size is 512 MiB. If Buildbarn is used to generate installation 148 | media (e.g., DVD images), it is desirable to generate artifacts that 149 | are multiple gigabytes in size. 150 | - The client library that is currently being used by Buildbarn, 151 | [go-redis](https://github.com/go-redis/redis), does not use the 152 | [context](https://golang.org/pkg/context/) package, meaning that 153 | handling of cancelation, timeouts and retries is inconsistent with how 154 | the rest of Buildbarn works. This may cause unnecessary propagation of 155 | transient error conditions and exhaustion of connection pools. Of 156 | [all known client library implementations](https://redis.io/clients#go), 157 | only [redispipe](https://github.com/joomcode/redispipe) uses context. 158 | - go-redis does not support streaming of objects, meaning that large 159 | objects need to at some point be stored in memory contiguously, which 160 | causes heavy fluctuations in memory usage. Only the unmaintained 161 | [shipwire/redis](https://github.com/shipwire/redis) library has an 162 | architecture that would support streaming. 163 | 164 | Additional disadvantages of Redis setups that have replication enabled, 165 | include: 166 | 167 | - They violate consistency requirements. A `PUT` operation does not 168 | block until the object is replicated to read replicas, meaning that a 169 | successive `GET` may fail. Restarts of read replicas may cause objects 170 | to be reported as absent, even though they are present on the master. 171 | - It has been observed that performance of replicated setups decreases 172 | heavily under high eviction rates. 173 | 174 | ## Amazon S3, Google Cloud Storage, Microsoft Azure Blob, etc. 175 | 176 | Advantages: 177 | 178 | - Virtually free of maintenance. 179 | - Decent SLA. 180 | - Relatively inexpensive. 181 | 182 | Disadvantages: 183 | 184 | - Amazon S3 is an order of magnitude slower than Amazon ElastiCache 185 | (Redis). 186 | - Amazon S3 does not provide the right consistency model, as 187 | [it only guarantees eventual consistency](https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel) 188 | for our access pattern. 189 | [Google Cloud Storage offers stronger consistency.](https://cloud.google.com/blog/products/gcp/how-google-cloud-storage-offers-strongly-consistent-object-listing-thanks-to-spanner) 190 | - There are no facilities for performing bulk existence queries, meaning 191 | it is hard to efficiently implement FindMissing(). 192 | - Setting TTLs on objects will make a bucket behave like a FIFO cache, 193 | which is incompatible with 'Builds without the Bytes'. 194 | Turning this into an LRU cache requires triggering `COPY` requests 195 | upon access. 
Again, due to the lack of bulk queries, calls like 196 | FindMissing() will become even slower. 197 | 198 | ## bb\_storage with the "circular" storage backend 199 | 200 | Advantages: 201 | 202 | - By using gRPC, handling of cancelation, timeouts and retries is either 203 | straightforward or dealt with automatically. 204 | - [gRPC has excellent support for OpenTracing](https://github.com/grpc-ecosystem/grpc-opentracing), 205 | meaning that it will be possible to trace requests all the way from 206 | the user to the storage backend. 207 | - gRPC has excellent support for multiplexing requests, meaning the 208 | number of network sockets used is reduced dramatically. 209 | - By using gRPC all across the board, there is rich and non-ambiguous 210 | error message propagation. 211 | - bb\_storage exposes decent Prometheus metrics. 212 | - The "circular" storage backend provides high storage density. All 213 | objects are concatenated into a single file without any form of 214 | padding. 215 | - The "circular" storage backend has a hard limit on the amount of 216 | storage space it uses, meaning it is very unlikely storage nodes will 217 | run out of memory or disk space, even under high load. 218 | - The performance of the "circular" storage backend doesn't noticeably 219 | degrade over time. The data file doesn't become fragmented. The hash 220 | table that is used to look up objects uses a scheme that is inspired 221 | by [cuckoo hashing](https://en.wikipedia.org/wiki/Cuckoo_hashing). It 222 | prefers displacing older entries over newer ones. This makes the hash 223 | table self-cleaning. 224 | 225 | Disadvantages: 226 | 227 | - The "circular" storage backend implements a FIFO eviction scheme, 228 | which is incompatible with 'Builds without the Bytes'. 229 | - The "circular" storage backend is known to have some design bugs that 230 | may cause it to return corrupted data when reading blobs close to the 231 | write cursors. 232 | - When running Buildbarn in the Cloud, a setup like this has more 233 | administrative overhead than Redis and S3. Scheduling bb\_storage on 234 | top of Kubernetes using stateful sets and persistent volume claims may 235 | require administrators to periodically deal with stuck pods and 236 | corrupted volumes. Though there are likely various means to automate 237 | these steps (e.g., by running recovery cron jobs), it is not free in 238 | terms of effort. 239 | 240 | ## bb\_storage with the "inMemory" storage backend 241 | 242 | Advantages: 243 | 244 | - The eviction policy can be set to FIFO, LRU or random eviction, though 245 | LRU should be preferred for centralized storage. 246 | - Due to there being no persistent state, it is possible to run 247 | bb\_storage through a simple Kubernetes deployment, meaning long 248 | periods of downtime are unlikely. 249 | - Given a sufficient amount of memory, it deals well with objects that 250 | are gigabytes in size. Both the client and server are capable of 251 | streaming results, meaning memory usage is predictable. 252 | 253 | In addition to the above, it shares the gRPC-related advantages of 254 | bb\_storage with the "circular" storage backend. 255 | 256 | Disadvantages: 257 | 258 | - There is no persistent storage, meaning all data is lost upon crashes 259 | and power failure. This will cause running builds to fail. 260 | - By placing all objects in separate memory allocations, it puts a lot 261 | of pressure on the garbage collector provided by the Go runtime. 
For 262 | worker-level caches that are relatively small, this is acceptable. For 263 | centralized caching, this becomes problematic. Because Go does not 264 | ship with a defragmenting garbage collector, heap fragmentation grows 265 | over time. It seems impossible to set storage limits anywhere close to 266 | the available amount of RAM. 267 | 268 | # A new storage stack for Buildbarn 269 | 270 | Based on the comparison above, replicated Redis, S3 and bb\_storage with 271 | the "circular" storage backend should **not** be used for Buildbarn's 272 | centralized storage. They do not provide the right consistency 273 | guarantees to satisfy Bazel. That said, there may still be valid use 274 | cases for these backends in the larger Buildbarn ecosystem. For example, 275 | S3 may be a good fit for archiving historical build actions, so that 276 | they may remain accessible through bb\_browser. Such use cases are out 277 | of scope for this document. 278 | 279 | This means that only non-replicated Redis and bb\_storage with the 280 | "inMemory" storage backend remain viable from a consistency standpoint, 281 | though both of these also have enough practical disadvantages that they 282 | cannot be used (no support for large objects and excessive heap 283 | fragmentation, respectively). 284 | 285 | This calls for the creation of a new storage stack for Buildbarn. Let's 286 | first discuss what such a stack could look like for simple, single-node 287 | setups. 288 | 289 | ## Single-node storage 290 | 291 | For single-node storage, let's start off with designing a new storage 292 | backend named "local" that is initially aimed at replacing the 293 | "inMemory" and later on the "circular" backends. Instead of coming up 294 | with a brand-new design for such a storage backend, let's reuse the 295 | overall architecture of "circular", but fix some of the design flaws it 296 | had: 297 | 298 | - **Preventing data corruption:** Data corruption in the "circular" 299 | backend stems from the fact that we overwrite old blobs. Those old 300 | blobs may still be in the process of being downloaded by a user. An 301 | easy way to prevent data corruption is thus to stop overwriting 302 | existing data. Instead, we can let it keep track of its data by 303 | storing it in a small set of blocks (e.g., 10 blocks that are 1/10th 304 | the size of total storage). Whenever all blocks become full, the 305 | oldest one is discarded and replaced by a brand new block. By 306 | implementing reference counting (or relying on the garbage collector), 307 | requests for old data may complete without experiencing any data 308 | corruption. 309 | - **Pseudo-LRU:** Whereas "circular" uses FIFO eviction, we can let our 310 | new backend provide LRU-like eviction by copying blobs from older 311 | blocks to newer ones upon access. To both provide adequate performance 312 | and reduce redundancy, this should only be performed on blobs coming 313 | from blocks that are beyond a certain age. By only applying this 314 | scheme to the oldest 1/4 of data, only up to 1/3 of data may be stored 315 | twice. 316 | - **Preventing 'tidal waves':** The "circular" backend would be prone to 317 | 'tidal waves' of writes. When objects disappear from storage, it is 318 | highly probable that many related files disappear at the same time. 
319 | With the pseudo-LRU design, the same thing applies to refreshing 320 | blobs: when BlobAccess.FindMissing() needs to refresh a single file 321 | that is part of an SDK, it will likely need to refresh the entire SDK. 322 | We can amortize the cost of this process by smearing writes across 323 | multiple blocks, so that they do not need to be refreshed at the same 324 | time. 325 | 326 | Other Buildbarn components (e.g., bb\_worker) communicate to the 327 | single-node storage server by using the following blobstore 328 | configuration: 329 | 330 | ```jsonnet 331 | { 332 | local config = { grpc: { address: 'bb-storage.example.com:12345' } }, 333 | contentAddressableStorage: config, 334 | actionCache: config, 335 | } 336 | ``` 337 | 338 | ## Adding scalability 339 | 340 | For larger setups it may be desired to store more data in cache than 341 | fits in memory of a single system. For these setups it is suggested that 342 | the already existing ShardingBlobAccess is used: 343 | 344 | ```jsonnet 345 | { 346 | local config = { 347 | sharding: { 348 | hashInitialization: 3151213777095999397, // A random 64-bit number. 349 | shards: [ 350 | { 351 | backend: { grpc: { address: 'bb-storage-0.example.com:12345' } }, 352 | weight: 1, 353 | }, 354 | { 355 | backend: { grpc: { address: 'bb-storage-1.example.com:12345' } }, 356 | weight: 1, 357 | }, 358 | { 359 | backend: { grpc: { address: 'bb-storage-2.example.com:12345' } }, 360 | weight: 1, 361 | }, 362 | ], 363 | }, 364 | }, 365 | contentAddressableStorage: config, 366 | actionCache: config, 367 | } 368 | ``` 369 | 370 | In the example above, the keyspace is divided across three storage 371 | backends that will each receive approximately 33% of traffic. 372 | 373 | ## Adding fault tolerance 374 | 375 | The setup described above is able to recover from failures rapidly, due 376 | to all bb\_storage processes being stateful, but non-persistent. When an 377 | instance becomes unavailable, a system like Kubernetes will be able to 378 | quickly spin up a replacement, allowing new builds to take place once 379 | again. Still, there are two problems: 380 | 381 | - Builds running at the time of the failure all have to terminate, as 382 | results from previous build actions may have disappeared. 383 | - Due to the random nature of object digests, the loss of a single shard 384 | likely means that most cached build actions end up being incomplete, 385 | due to them depending on output files and logs stored across different 386 | shards. 387 | 388 | To mitigate this, we can introduce a new BlobAccess decorator named 389 | MirroredBlobAccess that supports a basic replication strategy between 390 | pairs of servers, similar to [RAID 1](https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_1). 391 | When combined with ShardingBlobAccess, it allows the creation of a 392 | storage stack that roughly resembles [RAID 10](https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10_(RAID_1+0)). 393 | 394 | - BlobAccess.Get() operations may be tried against both backends. If it 395 | is detected that only one of the backends possesses a copy of the 396 | object, it is replicated on the spot. 397 | - BlobAccess.Put() operations are performed against both backends. 398 | - BlobAccess.FindMissing() operations are performed against both 399 | backends. Any inconsistencies in the results of both backends are 400 | resolved by replicating objects in both directions. The intersection of 401 | the results is then returned. 
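To make the intended FindMissing() behaviour concrete, below is a minimal sketch of the merge logic described above. The `blobAccess` and `digestSet` types are simplified stand-ins for illustration only; they are not the actual bb-storage interfaces.

```go
package mirrored

import "context"

// digestSet is a simplified stand-in for a set of object digests.
type digestSet map[string]struct{}

// blobAccess is a simplified stand-in for the subset of BlobAccess used here.
type blobAccess interface {
	FindMissing(ctx context.Context, digests digestSet) (digestSet, error)
	Get(ctx context.Context, digest string) ([]byte, error)
	Put(ctx context.Context, digest string, data []byte) error
}

// replicate copies a single object from one backend to the other.
func replicate(ctx context.Context, from, to blobAccess, digest string) error {
	data, err := from.Get(ctx, digest)
	if err != nil {
		return err
	}
	return to.Put(ctx, digest, data)
}

// findMissingMirrored queries both halves of a mirrored pair, repairs
// inconsistencies by replicating objects in both directions, and only
// reports objects as missing when they are absent from both backends.
func findMissingMirrored(ctx context.Context, backendA, backendB blobAccess, digests digestSet) (digestSet, error) {
	missingFromA, err := backendA.FindMissing(ctx, digests)
	if err != nil {
		return nil, err
	}
	missingFromB, err := backendB.FindMissing(ctx, digests)
	if err != nil {
		return nil, err
	}
	missing := digestSet{}
	for digest := range missingFromA {
		if _, alsoMissingFromB := missingFromB[digest]; alsoMissingFromB {
			// Absent from both backends: truly missing.
			missing[digest] = struct{}{}
			continue
		}
		// Only absent from backend A: repair by copying B -> A.
		if err := replicate(ctx, backendB, backendA, digest); err != nil {
			return nil, err
		}
	}
	for digest := range missingFromB {
		if _, alsoMissingFromA := missingFromA[digest]; !alsoMissingFromA {
			// Only absent from backend B: repair by copying A -> B.
			if err := replicate(ctx, backendA, backendB, digest); err != nil {
				return nil, err
			}
		}
	}
	return missing, nil
}
```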
402 | 403 | The BlobAccess configuration would look something like this: 404 | 405 | ```jsonnet 406 | { 407 | local config = { 408 | sharding: { 409 | hashInitialization: 3151213777095999397, // A random 64-bit number. 410 | shards: [ 411 | { 412 | backend: { 413 | mirrored: { 414 | backendA: { grpc: { address: 'bb-storage-a0.example.com:12345' } }, 415 | backendB: { grpc: { address: 'bb-storage-b0.example.com:12345' } }, 416 | }, 417 | }, 418 | weight: 1, 419 | }, 420 | { 421 | backend: { 422 | mirrored: { 423 | backendA: { grpc: { address: 'bb-storage-a1.example.com:12345' } }, 424 | backendB: { grpc: { address: 'bb-storage-b1.example.com:12345' } }, 425 | }, 426 | }, 427 | weight: 1, 428 | }, 429 | { 430 | backend: { 431 | mirrored: { 432 | backendA: { grpc: { address: 'bb-storage-a2.example.com:12345' } }, 433 | backendB: { grpc: { address: 'bb-storage-b2.example.com:12345' } }, 434 | }, 435 | }, 436 | weight: 1, 437 | }, 438 | ], 439 | }, 440 | }, 441 | contentAddressableStorage: config, 442 | actionCache: config, 443 | } 444 | ``` 445 | 446 | With these semantics in place, it is completely safe to replace one half 447 | of each pair of servers with an empty instance without causing builds to 448 | fail. This also permits rolling upgrades without serious loss of data by 449 | only upgrading half of the fleet during every release cycle. Between 450 | cycles, one half of systems is capable of repopulating the other half. 451 | 452 | This strategy may even be used to grow the storage pool without large 453 | amounts of downtime. By changing the order of ShardingBlobAccess and 454 | MirroredBlobAccess, it is possible to temporarily turn this setup into 455 | something that resembles [RAID 01](https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_01_(RAID_0+1)), 456 | allowing the pool to be resharded in two separate stages. 457 | 458 | To implement this efficiently, a minor extension needs to be made to 459 | Buildbarn's buffer layer. To implement MirroredBlobAccess.Put(), buffer 460 | objects need to be cloned. The existing Buffer.Clone() realises this by 461 | creating a full copy of the buffer, so that writing to each of the 462 | backends can take place at its own pace. For large objects copying may 463 | be expensive, which is why Buffer.Clone() should be replaced by one 464 | flavour that copies and one that multiplexes the underlying stream. 465 | 466 | # Alternatives considered 467 | 468 | Instead of building our own storage solution, we considered switching to 469 | distributed database systems, such as 470 | [CockroachDB](https://www.cockroachlabs.com) and 471 | [FoundationDB](https://www.foundationdb.org). Though they will solve all 472 | consistency problems we're experiencing, they are by no means designed 473 | for use cases that are as bandwidth intensive as ours. These systems are 474 | designed to only process up to about a hundred megabytes of data per 475 | second. They are also not designed to serve as caches, meaning separate 476 | garbage collector processes need to be run periodically. 477 | 478 | Data stores that guarantee eventual consistency can be augmented to 479 | provide the required consistency requirements by placing an index in 480 | front of them that is fully consistent. Experiments where a Redis-based 481 | index was placed in front of S3 proved successful, but were by no means 482 | fault-tolerant. 
483 | 484 | # Future work 485 | 486 | For Buildbarn setups that are not powered on permanently or where 487 | RAM-backed storage is simply too expensive, there may still be a desire 488 | to have disk-backed storage for the CAS and AC. We should extend the new 489 | "local" storage backend to support on-disk storage, just like "circular" 490 | does. 491 | 492 | To keep the implementation simple and understandable, MirroredBlobAccess 493 | will initially be written in such a way that it requires both backends 494 | to be available. This is acceptable, because unavailable backends can 495 | easily be replaced by new non-persistent nodes. For persistent setups, 496 | this may not be desired. MirroredBlobAccess would need to be extended to 497 | work properly in degraded environments. 498 | 499 | Many copies of large output files with only minor changes to them cause 500 | cache thrashing. This may be prevented by decomposing large output files 501 | into smaller chunks, so that deduplication on the common parts may be 502 | performed. Switching from SHA-256 to a recursive hashing scheme, such as 503 | [VSO-Hash](https://github.com/microsoft/BuildXL/blob/master/Documentation/Specs/PagedHash.md), 504 | makes it possible to implement this while retaining the Merkle tree 505 | property. 506 | -------------------------------------------------------------------------------- /0003-cas-decomposition.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #3: Decomposition of large CAS objects 2 | 3 | Author: Ed Schouten
4 | Date: 2020-04-16 5 | 6 | # Context 7 | 8 | [The Remote Execution protocol](https://github.com/bazelbuild/remote-apis/blob/master/build/bazel/remote/execution/v2/remote_execution.proto) 9 | that is implemented by Buildbarn defines two data stores: the Action 10 | Cache (AC) and the Content Addressable Storage (CAS). The CAS is where 11 | almost all data is stored in terms of size (99%+), as it is used to 12 | store input files and output files of build actions, but also various 13 | kinds of serialized Protobuf messages that define the structure of build 14 | actions. 15 | 16 | As the name implies, the CAS is [content addressed](https://en.wikipedia.org/wiki/Content-addressable_storage), 17 | meaning that all objects are identified by their content, in this case a 18 | cryptographic checksum (typically [SHA-256](https://en.wikipedia.org/wiki/SHA-2)). 19 | This permits Buildbarn to automatically deduplicate identical objects 20 | and discard objects that are corrupted. 21 | 22 | Whereas most objects stored in the AC are all similar in size 23 | (kilobytes), sizes of objects stored in the CAS may vary a lot. Small 24 | [Directory](https://github.com/bazelbuild/remote-apis/blob/b5123b1bb2853393c7b9aa43236db924d7e32d61/build/bazel/remote/execution/v2/remote_execution.proto#L673-L685) 25 | objects are hundreds of bytes in size, while container layers created 26 | with [rules\_docker](https://github.com/bazelbuild/rules_docker) may 27 | consume a gigabyte of space. This is observed to be problematic for 28 | Buildbarn setups that use sharded storage. Each shard will receive a 29 | comparable number of requests, but the amount of network traffic and CPU 30 | load to which those requests translate fluctuates. 31 | 32 | Buildbarn has code in place to checksum validate data coming from 33 | untrusted sources. For example, bb\_worker validates all data received 34 | from bb\_storage. bb\_storage validates all data obtained from clients 35 | and data read from disk. This is done for security reasons, but also to 36 | ensure that bugs in storage code don't lead to execution of malformed 37 | build actions. Unfortunately, there are an increasing number of use 38 | cases where partial reads need to be performed (lazy-loading file 39 | systems for bb\_worker, Bazel's ability to resume aborted transfers, 40 | etc.). Such partial reads currently still cause entire objects to be 41 | read, simply to satisfy the checksum validation process. This is 42 | wasteful. 43 | 44 | To be able to address these issues, this ADR proposes the addition of 45 | some new features to Buildbarn's codebase to eliminate the 46 | existence of such large CAS objects altogether. 47 | 48 | # Methods for decomposing large objects 49 | 50 | In a nutshell, the idea is to partition files whose size exceeds a 51 | certain threshold into smaller blocks. Each of the blocks will have a 52 | fixed size, except the last block in a file, which may be smaller. The 53 | question then becomes what naming scheme is used to store them in the 54 | CAS. 55 | 56 | The simplest naming scheme would be to give all blocks a suffix that 57 | indicates the offset in the file. For example, a file that is decomposed 58 | into three blocks could use digests containing hashes `<hash>-0`, 59 | `<hash>-1` and `<hash>-2`, where `<hash>` refers to the original file 60 | hash. Calls to `BlobAccess.Get()`, `Put()` and `FindMissing()` could be 61 | implemented to expand to a series of digests containing the block number 62 | suffix. 
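As an illustration, digest expansion under this naive scheme could look something like the hypothetical helper below; it is not part of the actual proposal.

```go
package naive

import "fmt"

// blockDigests expands the hash of a large file into the per-block digests
// used by this naive naming scheme. Purely illustrative; the next section
// explains why the scheme is insufficient.
func blockDigests(fileHash string, sizeBytes, blockSizeBytes int64) []string {
	blockCount := (sizeBytes + blockSizeBytes - 1) / blockSizeBytes
	digests := make([]string, 0, blockCount)
	for i := int64(0); i < blockCount; i++ {
		digests = append(digests, fmt.Sprintf("%s-%d", fileHash, i))
	}
	return digests
}
```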
63 | 64 | Unfortunately, this approach is too simplistic. Because `<hash>` refers 65 | to the checksum of the file as a whole, there is no way the consistency 66 | of individual blocks can be validated. If blocks are partitioned across 67 | shards, there is no way an individual shard can verify the integrity of 68 | its own data. 69 | 70 | A solution would be to store each of the blocks using their native 71 | digest, meaning a hash of each of them is computed. This on its own, 72 | however, would only allow storing data. There would be no way to look up 73 | blocks afterwards, as only the combined hash is known. To be able to 74 | retrieve the data, we'd need to store a separate manifest that contains a 75 | list of the digests of all blocks in the file. 76 | 77 | Even though such an approach would allow us to validate the integrity of 78 | individual blocks of data, conventional hashing algorithms wouldn't 79 | allow us to verify manifests themselves. There is no way SHA-256 hashes 80 | of individual blocks may be combined into a SHA-256 hash of the file as 81 | a whole. SHA-256's [Merkle-Damgård construction](https://en.wikipedia.org/wiki/Merkle–Damgård_construction) 82 | does not allow it. 83 | 84 | # VSO-Hash: Paged File Hash 85 | 86 | As part of their [BuildXL infrastructure](https://github.com/microsoft/BuildXL), 87 | Microsoft published a hashing algorithm called ["VSO-Hash"](https://github.com/microsoft/BuildXL/blob/master/Documentation/Specs/PagedHash.md) 88 | or "Paged File Hash". VSO-Hash works by first computing SHA-256 hashes 89 | for every 64 KiB page of data. Then, 32 page hashes (for 2 MiB of data) 90 | are taken together and hashed using SHA-256 to form a block hash. 91 | Finally, all block hashes are combined by [folding](https://en.wikipedia.org/wiki/Fold_(higher-order_function)) 92 | them. 93 | 94 | By letting clients use VSO-Hash as a digest function, Buildbarn could be 95 | altered to decompose blobs either into 64 KiB or 2 MiB blocks, or both 96 | (i.e., decomposing into 64 KiB chunks, but having two levels of 97 | manifests). This would turn large files into [Merkle trees](https://en.wikipedia.org/wiki/Merkle_tree) 98 | of depth two or three. Pages, blocks and their manifests could all be 99 | stored in the CAS and validated trivially. 100 | 101 | Some work has already been done in the Bazel ecosystem to 102 | facilitate this. For example, the Remote Execution protocol already 103 | contains an enumeration value for VSO-Hash, which was 104 | [added by Microsoft in 2019](https://github.com/bazelbuild/remote-apis/pull/84). 105 | There are, however, various reasons not to use VSO-Hash: 106 | 107 | - Microsoft seemingly only added support for VSO-Hash as a migration 108 | aid. There are efforts to migrate BuildXL from VSO-Hash to plain 109 | SHA-256. The author has mentioned that support for VSO-Hash will likely 110 | be retracted from the protocol once this migration is complete. 111 | - Bazel itself doesn't implement it. 112 | - It is somewhat inefficient. Because the full SHA-256 algorithm is used 113 | at every level, the Merkle-Damgård construction is used excessively. 114 | The algorithm could have been simpler by defining its scheme directly 115 | on top of the SHA-256 compression function. 116 | - Algorithms for validating files as a whole, (manifests of) pages and 117 | (manifests of) blocks are all different. 
Each of these needs a 118 | separate implementation, as well as a distinct digest format, so that 119 | storage nodes know which validation algorithm to use upon access. 120 | - It effectively leaks the decomposition strategy into the client. 121 | VSO-Hash only allows decomposition at the 64 KiB and 2 MiB level. 122 | Decomposition at any other power of two would require the use of an 123 | entirely different hashing algorithm. 124 | 125 | # BLAKE3 126 | 127 | At around the time this document was written, a new hashing algorithm 128 | was published, called [BLAKE3](https://github.com/BLAKE3-team/BLAKE3/). 129 | As the name implies, BLAKE3 is a successor of BLAKE, an algorithm that 130 | took part in the [SHA-3 competition](https://en.wikipedia.org/wiki/NIST_hash_function_competition). 131 | In the end BLAKE lost to Keccak, not because it was in any way insecure, 132 | but because it was too similar to SHA-2. The goal of the SHA-3 133 | competition was to have a backup in case fundamental security issues in 134 | SHA-2 are found. 135 | 136 | An interesting improvement of BLAKE3 over its predecessor BLAKE2 is 137 | that it no longer uses the Merkle-Damgård construction. Instead, 138 | chaining is only used for chunks up to 1 KiB in size. When hashing 139 | objects that are larger than 1 KiB, a binary Merkle tree of 1 KiB chunks 140 | is constructed. The final hash value is derived from the Merkle tree's 141 | root node. If clients were to use BLAKE3, Buildbarn would thus be able 142 | to decompose files into blocks that are any power of two, at least 1 KiB 143 | in size. 144 | 145 | BLAKE3 uses an Extendable-Output Function (XOF) to post-process the 146 | state of the hasher into an output sequence of arbitrary length. Because 147 | this process invokes the BLAKE3 compression function, it is not 148 | reversible. This means that manifests cannot contain hashes of its 149 | blocks, as that would not allow independent integrity checking of 150 | the manifest. Instead, manifest entries should hold the state of the 151 | hasher (i.e., the raw Merkle tree node). That way it's possible both to 152 | validate and convert to a block digest. 153 | 154 | For chunk nodes (that represent 1 KiB of data or less), the hasher state 155 | can be stored in 97 bytes (256 bit initialization vector, 512 bit message, 156 | 7 bit size, 1 bit chunk-start flag). For parent nodes (that 157 | represent more than 1 KiB of data), the hasher state is only 64 bytes 158 | (512 bit message), as other parameters of the compression function are 159 | implied. Because of that, decomposition of files into 1 KiB blocks 160 | should be discouraged. Buildbarn should only support decomposition into 161 | blocks that are at least 2 KiB in size. That way manifests only contain 162 | 64 byte entries, except for the last entry, which may be 64 or 97 bytes 163 | in size. 164 | 165 | # BLAKE3ZCC: BLAKE3 with a Zero Chunk Counter 166 | 167 | The BLAKE3 compression function takes a `t` argument that is for two 168 | different purposes. First of all, it is used as a counter for every 512 169 | bits of hash output generated by the XOF. Secondly, it contains the 170 | Chunk Counter when compressing input data. Effectively, the Chunk 171 | Counter causes every 1 KiB chunk of data to be hashed in a different way 172 | depending on the offset at which it is stored within the file. 173 | 174 | For our specific use case, it is not desirable that hashing of data is 175 | offset dependent. 
It would require that decomposed blocks contain 176 | additional metadata that specify at which offset the data was stored in 177 | the original file. Otherwise, there would be no way to validate the 178 | integrity of the block independently. It also rules out the possibility 179 | of deduplicating large sections of repetitive data (e.g., deduplicating 180 | a large file that contains only null bytes to just a single chunk). 181 | 182 | According to section 7.5 of the BLAKE3 specification, the Chunk Counter 183 | is not strictly necessary for security, but discourages optimizations 184 | that would introduce timing attacks. Though timing attacks are a serious 185 | problem, we can argue that in the case of the Remote Execution protocol 186 | such timing attacks already exist. For example, identical files stored 187 | in a single Directory hierarchy will only be uploaded to/downloaded from 188 | the CAS once. 189 | 190 | For this reason, this ADR proposes adding support for a custom hashing 191 | algorithm to Buildbarn, which we will call BLAKE3ZCC. BLAKE3ZCC is 192 | identical to regular BLAKE3, except that it uses a Zero Chunk Counter. 193 | BLAKE3ZCC generates exactly the same hash output as BLAKE3 for files 194 | that are 1 KiB or less in size, as those files fit in a single chunk. 195 | 196 | # Changes to the Remote Execution protocol 197 | 198 | All of the existing hashing algorithms supported by the Remote Execution 199 | protocol have a different hash size. Buildbarn makes use of this 200 | assumption, meaning that if it encounters a [Digest](https://github.com/bazelbuild/remote-apis/blob/b5123b1bb2853393c7b9aa43236db924d7e32d61/build/bazel/remote/execution/v2/remote_execution.proto#L779-L786) 201 | message whose `hash` value is 64 characters in size, it assumes it 202 | refers to an object that is SHA-256 hashed. Adding support for BLAKE3ZCC 203 | would break this assumption. BLAKE3 hashes may have any size, which 204 | makes matching by length impossible. The default hash length of 256 bits 205 | would also collide with SHA-256. 206 | 207 | To solve this, we could extend Digest to make it explicit which hashing 208 | algorithm is used. The existing `string hash = 1` field could be 209 | replaced with the following: 210 | 211 | ```protobuf 212 | message Digest { 213 | oneof hash { 214 | // Used by all of the existing hashing algorithms. 215 | string other = 1; 216 | // Used to address BLAKE3ZCC hashed files, or individual blocks in 217 | // case the file has been decomposed. 218 | bytes blake3zcc = 3; 219 | // Used to address manifests of BLAKE3ZCC hashed files, containing 220 | // Merkle tree nodes of BLAKE3ZCC hashed blocks. Only to be used by 221 | // Buildbarn, as Bazel will only request files as a whole. 222 | bytes blake3zcc_manifest = 4; 223 | } 224 | ... 225 | } 226 | ``` 227 | 228 | By using type `bytes` for these new fields instead of storing a base16 229 | encoded hash in a `string`, we cut down the size of Digest objects by 230 | almost 50%. This causes a significant space reduction for Action, 231 | Directory and Tree objects. 232 | 233 | Unfortunately, `oneof` cannot be used here, because Protobuf 234 | implementations such as [go-protobuf](https://github.com/golang/protobuf/issues/395) 235 | don't guarantee that fields are serialized in tag order when `oneof` 236 | fields are used. This property is required by the Remote Execution 237 | protocol. 
To work around this, we simply declare three separate fields, 238 | where implementations should ensure only one field is set. 239 | 240 | ```protobuf 241 | message Digest { 242 | string hash_other = 1; 243 | bytes hash_blake3zcc = 3; 244 | bytes hash_blake3zcc_manifest = 4; 245 | ... 246 | } 247 | ``` 248 | 249 | Digests are also encoded into pathname strings used by the ByteStream 250 | API. To distinguish BLAKE3ZCC files and manifests from other hashing 251 | algorithms, `B3Z:` and `B3ZM:` prefixes are added to the base16 encoded 252 | hash values, respectively. 253 | 254 | In addition to that, a `BLAKE3ZCC` constant is added to the 255 | DigestFunction enum, so that Buildbarn can announce support for 256 | BLAKE3ZCC to clients. 257 | 258 | # Changes to Buildbarn 259 | 260 | Even though BLAKE3 only came out recently, 261 | [one library for computing BLAKE3 hashes in Go exists](https://github.com/lukechampine/blake3). 262 | Though this library is of good quality, it cannot be used to compute 263 | BLAKE3ZCC hashes without making local code changes. Furthermore, this 264 | library only provides public interfaces for converting a byte stream to 265 | a hash value. In our case we need separate interfaces for converting 266 | chunks to Merkle tree nodes, computing the root node for a larger Merkle 267 | tree and obtaining Merkle tree nodes to hash values. Without such 268 | features, we'd be unable to generate and parse manifests. We will 269 | therefore design our own BLAKE3ZCC hashing library, which we will 270 | package at `github.com/buildbarn/bb-storage/pkg/digest/blake3zcc`. 271 | 272 | Buildbarn has an internal `util.Digest` data type that extends upon the 273 | Remote Execution Digest message by storing the instance name, thereby 274 | making it a fully qualified identifier of an object. It also has many 275 | operations that allow computing, deriving and transforming them. Because 276 | supporting BLAKE3ZCC and decomposition makes this type more complex, it 277 | should first be moved out of `pkg/util` into its own `pkg/digest` 278 | package. 279 | 280 | In addition to being extended to support BLAKE3ZCC hashing, the Digest 281 | data type will gain a new method: 282 | 283 | ```go 284 | func (d Digest) ToManifest(blockSizeBytes int64) (manifestDigest Digest, parser ManifestParser, ok bool) {} 285 | ``` 286 | 287 | When called on a digest object that uses BLAKE3ZCC that may be 288 | decomposed into multiple blocks, this function returns the digest of its 289 | manifest. In addition to that, it returns a ManifestParser object that 290 | may be used to extract block digests from manifest payloads or construct 291 | them. ManifestParser will look like this: 292 | 293 | ```go 294 | type ManifestParser interface { 295 | // Perform lookups of blocks on existing manifests. 296 | GetBlockDigest(manifest []byte, off int64) (blockDigest Digest, actualOffset int64) 297 | // Construct new manifests. 298 | AppendBlockDigest(manifest *[]byte, block []byte) Digest 299 | } 300 | ``` 301 | 302 | One implementation of ManifestParser for BLAKE3ZCC shall be provided. 303 | 304 | The `Digest.ToManifest()` function and its resulting ManifestParser 305 | shall be used by a new BlobAccess decorator, called 306 | DecomposingBlobAccess. The operations of this decorator shall be 307 | implemented as follows: 308 | 309 | - `BlobAccess.Get()` will simply forward the call if the provided digest 310 | does not correspond to a blob that can be decomposed. 
Otherwise, it 311 | will load the associated manifest from storage. It will then return a 312 | Buffer object that dynamically loads individual blocks from the CAS 313 | when accessed. 314 | - `BlobAccess.Put()` will simply forward the call if the provided digest 315 | does not correspond to a blob that can be decomposed. Otherwise, it 316 | will decompose the input buffer into smaller buffers for every block 317 | and write those into the CAS. In addition to that, it will write a 318 | manifest object into the CAS. 319 | - `BlobAccess.FindMissing()` will forward the call, except that digests 320 | of composed objects will be translated to the digests of their 321 | manifests. Manifests that are present will then be loaded from the 322 | CAS, followed by checking the existence of each of the individual 323 | blocks. 324 | 325 | To be able to implement DecomposingBlobAccess, the Buffer layer will 326 | need two minor extensions: 327 | 328 | - `BlobAccess.Get()` needs to return a buffer that needs to be backed by 329 | the concatenation of a sequence of block buffers. A new buffer type 330 | that implements this functionality shall be added, which can be 331 | created by calling `NewCASConcatenatingBuffer()`. 332 | - `BlobAccess.Put()` needs to read the input buffer in chunks equal to 333 | the block size. The `Buffer.ToChunkReader()` function currently takes 334 | a maximum chunk size argument, but provides no facilities for 335 | specifying a lower bound. This mechanism should be extended, so that 336 | `ToChunkReader()` can be used to read input one block at a time. 337 | 338 | With all of these changes in place, Buildbarn will have basic support 339 | for decomposing large objects in place. 340 | 341 | # Future work 342 | 343 | With the features proposed above, support for decomposition should be 344 | complete enough to provide better spreading of load for sharded setups. 345 | In addition to that, workers with lazy-loading file systems will be able 346 | to perform I/O on large files without faulting them in entirely. There 347 | are still two areas where performance may be improved further: 348 | 349 | - Using BatchedStoreBlobAccess, workers currently call `FindMissing()` 350 | before uploading output files. This is done at the output file level, 351 | which is suboptimal. When decomposition is enabled, it causes workers 352 | to load manifests from storage, even though those could be derived 353 | locally. A single absent block will cause the entire output file to be 354 | re-uploaded. To prevent this, `BlobAccess.Put()` should likely be 355 | decomposed into two flavours: one for streaming uploads and one for 356 | uploads of local files. 357 | - When decomposition is enabled, `FindMissing()` will likely become 358 | slower, due to it requiring more roundtrips to storage. This may be 359 | solved by adding yet another BlobAccess decorator that does short-term 360 | caching of `FindMissing()` results. 361 | -------------------------------------------------------------------------------- /0004-icas.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #4: Indirect Content Addressable Storage 2 | 3 | Author: Ed Schouten
4 | Date: 2020-06-17 5 | 6 | # Context 7 | 8 | In an ideal build environment, build actions scheduled through the REv2 9 | protocol are fully isolated from the outside world. They either rely on 10 | data that is provided by the client (source files and resources declared 11 | in Bazel's `WORKSPACE` file), or use files that are outputs of previous 12 | build actions. In addition to improving reproducibility of work, it 13 | makes it easier for Buildbarn and the build client to distinguish 14 | actual build failures (compiler errors) from infrastructure failures. 15 | 16 | One pain point of this model has always been network bandwidth 17 | consumption. Buildbarn clusters can be run in data centers that have 18 | excellent network connectivity, while build clients may run on a laptop 19 | that uses public WiFi connectivity. These clients are still responsible 20 | for downloading external artifacts from their upstream location, 21 | followed by uploading them into Buildbarn's Content Addressable Storage 22 | (CAS). 23 | 24 | People within the Bazel community are working on solving this problem by 25 | adding a new [Remote Asset API](https://github.com/bazelbuild/remote-apis/blob/master/build/bazel/remote/asset/v1/remote_asset.proto). 26 | When used, Bazel is capable of sending RPCs to a dedicated service to 27 | download external artifacts, optionally extract them and upload them 28 | into the CAS. Bazel is then capable of passing the resulting objects by 29 | reference to build actions that consume them. 30 | 31 | Though it makes a lot of sense to fully ingest remote assets into the 32 | CAS in the general case, there are certain workloads where doing this is 33 | undesirable. If objects are already stored on a file share or cloud 34 | storage bucket close to a Buildbarn cluster, copying these objects into 35 | Buildbarn's centralized storage based on LocalBlobAccess may be 36 | wasteful. It may be smarter to let workers fetch these assets remotely 37 | only as part of the input root population. This ADR proposes changes to 38 | Buildbarn to facilitate this. 39 | 40 | # Changes proposed at the architectural level 41 | 42 | Ideally, we would like to extend the CAS to be able to store references 43 | to remote assets. A CAS object would then either contain a blob or a 44 | reference describing where the asset is actually located. When a worker 45 | attempts to read an object from the CAS and receives a reference, it 46 | retries the operation against the remote service. This is unfortunately 47 | hard to realise for a couple of reasons: 48 | 49 | - None of the Buildbarn code was designed with this in mind. 50 | [The Buffer interface](0001-buffer.md) assumes that every data store 51 | is homogeneous, in that all objects stored within have the same type. 52 | The CAS may only hold content addressed blobs, while the AC may only 53 | contain ActionResult messages. 54 | - The ByteStream and ContentAddressableStorage protocols have no 55 | mechanisms for returning references. This is problematic, because we 56 | also use these protocols inside of our clusters. 57 | 58 | This is why we propose adding a separate data store that contains these 59 | references, called the Indirect Content Addressable Storage (ICAS). When 60 | attempting to load an object, a worker will first query the CAS, falling 61 | back to loading a reference from the ICAS and following it. References 62 | are stored in the form of Protobuf messages. 
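As a rough illustration, such a reference message could look something like the sketch below. The actual schema is left to the implementation and depends on the kinds of backing stores that need to be supported; all field names here are hypothetical.

```protobuf
// Hypothetical sketch of an ICAS reference; not the actual schema.
message Reference {
  // Location from which the object can be fetched. Additional media
  // (e.g., cloud storage buckets or file shares) could be added as
  // extra fields.
  string http_url = 1;

  // Range of the remote resource that holds the object, in case it is
  // embedded in a larger file.
  int64 offset_bytes = 2;
  int64 size_bytes = 3;
}
```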
63 | 64 | In terms of semantics, the ICAS can be seen as a mixture between the CAS 65 | and the Action Cache (AC). Like the AC, it holds Protobuf messages. Like 66 | the CAS, references expand to data that is content addressed and needs 67 | checksum validation. It also needs to support efficient bulk existence 68 | queries (i.e., `FindMissingBlobs()`). 69 | 70 | # Changes proposed at the implementation level 71 | 72 | Because the ICAS is similar to both the CAS and AC, we should integrate 73 | support for it into bb-storage's `pkg/blobstore`. This would allow us to 74 | reuse most of the existing storage backends, including LocalBlobAccess, 75 | MirroredBlobAccess and ShardingBlobAccess. This requires two major 76 | changes: 77 | 78 | - The Buffer interface has only been designed with the CAS and the AC as 79 | data sources and sinks in mind. In order to support ICAS references as 80 | sources, the `NewACBuffer*()` functions will be replaced with generic 81 | `NewProtoBuffer*()` functions that take arbitrary Protobuf messages. 82 | To support ICAS references as sinks, the `Buffer.ToActionResult()` 83 | method will be replaced with the generic method `Buffer.ToProto()`. 84 | - All of the code in `pkg/blobstore/configuration` is already fairly 85 | messy because it needs to distinguish between the CAS and AC. This is 86 | only going to get worse once ICAS support is added. Work will be 87 | performed to let the generic configuration code call into 88 | Blob{Access,Replicator}Creator interfaces. All of the existing backend 89 | specific code will be moved into {AC,CAS}Blob{Access,Replicator}Creator 90 | implementations. 91 | 92 | With those changes in place, it should become easier to add support for 93 | arbitrary storage backends, not just the ICAS. To add support for the 94 | ICAS, the following classes will be added: 95 | 96 | - ICASStorageType, which will be used by most generic BlobAccess 97 | implementations to construct ICAS specific Buffer objects. 98 | - ICASBlob{Access,Replicator}Creator, which will be used to construct 99 | ICAS specific BlobAccess instances from Jsonnet configuration. 100 | - ICASBlobAccess and IndirectContentAddressableStorageServer, which act 101 | as gRPC clients and servers for the ICAS protocol. This protocol will 102 | be similar to REv2's ActionCache and ContentAddressableStorage. 103 | 104 | The above will only bring in support for the ICAS, but won't add any 105 | logic to let workers fall back to reading ICAS entries when objects are 106 | absent from the CAS. To realise that, two new implementations of 107 | BlobAccess are added: 108 | 109 | - ReferenceExpandingBlobAccess, which effectively converts an ICAS 110 | BlobAccess to a CAS BlobAccess. It loads references from the ICAS, 111 | followed by loading the referenced object from its target location. 112 | - ReadFallbackBlobAccess, which forwards all requests to a primary 113 | backend, followed by sending requests to a secondary backend on 114 | absence. 115 | 116 | Below is an example blobstore configuration that can be used to let 117 | workers access both the CAS and ICAS: 118 | 119 | ```jsonnet 120 | { 121 |   contentAddressableStorage: { 122 |     readFallback: { 123 |       primary: { 124 |         // CAS backend goes here. 125 |         grpc: ..., 126 |       }, 127 |       secondary: { 128 |         referenceExpanding: { 129 |           // ICAS backend goes here. 130 |           grpc: ..., 131 |         }, 132 |       }, 133 |     }, 134 |   }, 135 |   actionCache: { 136 |     // AC backend goes here. 
137 | grpc: ..., 138 | }, 139 | } 140 | ``` 141 | 142 | # How do I integrate this into my environment? 143 | 144 | Though care has been taken to make this entire implementation as generic 145 | as possible, the inherent problem is that the process of expanding 146 | references stored in the ICAS is specific to the environment in which 147 | Buildbarn is deployed. We will make sure that the ICAS protocol and 148 | ReferenceExpandingBlobAccess have support for some commonly used 149 | protocols (e.g., HTTP), but it is not unlikely that operators of 150 | Buildbarn clusters need to maintain forks if they want to integrate this 151 | with their own infrastructure. This is likely unavoidable. 152 | 153 | Another open question is how entries are written into the ICAS. This is 154 | again specific to the environment. Users will likely need to implement 155 | their own gRPC service that implements the Remote Asset API, parsing 156 | environment specific URL schemes and storing the resulting references as ICAS entries. 157 | 158 | Unfortunately, Bazel does not implement the Remote Asset API at this 159 | point in time. As a workaround, one may create a custom `bb_worker`-like 160 | process that captures certain types of build actions. Instead of going 161 | through the regular execution process, this worker directly interprets 162 | these requests, treating them as remote asset requests. This can be 163 | achieved by implementing a custom BuildExecutor. 164 | 165 | # Future work 166 | 167 | When we added `Buffer.ToActionResult()`, we discovered that we no longer 168 | needed the ActionCache interface. Now that we have a general purpose 169 | `Buffer.ToProto()` method, it looks like the ContentAddressableStorage 170 | interface also becomes less useful. It may make sense to investigate 171 | what it takes to remove that interface as well. 172 | -------------------------------------------------------------------------------- /0005-local-blob-access-persistency.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #5: Persistency for LocalBlobAccess 2 | 3 | Author: Ed Schouten<br/>
4 | Date: 2020-09-12 5 | 6 | # Context 7 | 8 | In addition to all of the storage backends that Buildbarn provides to 9 | access the contents of the Content Addressable Storage (CAS) and Action 10 | Cache (AC) remotely, there are two backends that store data locally: 11 | CircularBlobAccess and LocalBlobAccess. These two storage backends have 12 | various use cases: 13 | 14 | - They can be used to build fast centralized storage, especially when 15 | used in combination with MirroredBlobAccess and ShardingBlobAccess. 16 | See [ADR #2](0002-storage.md) for more details. 17 | - They can be used to set up local caches when used in combination with 18 | ReadCachingBlobAccess. Such caches can speed up access to the CAS in a 19 | given site or on an individual system. 20 | 21 | Buildbarn originally shipped with CircularBlobAccess (written in 2018). 22 | As part of the work on ADR #2, LocalBlobAccess was added. 23 | LocalBlobAccess has a couple of advantages over CircularBlobAccess: 24 | 25 | - When sized appropriately, it is capable of supporting Bazel's 26 | [Builds without the Bytes](https://docs.google.com/document/d/11m5AkWjigMgo9wplqB8zTdDcHoMLEFOSH0MdBNCBYOE/edit). 27 | - It doesn't suffer from the data corruption bugs that 28 | CircularBlobAccess has. 29 | - It can store its data on raw block devices, which tends to be a lot 30 | faster than using files on a file system. It also ensures that the 31 | maximum amount of space used is bounded. 32 | - Block devices are memory mapped, making reads of blobs that are 33 | already loaded from disk nearly instantaneous. 34 | - It supports random offset reads, while CircularBlobAccess can only 35 | read data sequentially. 36 | 37 | There are, however, a couple of disadvantages to using LocalBlobAccess: 38 | 39 | - It only supports raw block devices. This can be problematic in case 40 | Buildbarn is deployed in environments where raw block device access is 41 | unsupported (e.g., inside unprivileged containers). 42 | - It doesn't persist data across restarts, as the digest-location map, 43 | the hash table that indexes data, is only stored in memory. 44 | 45 | Let's work towards bringing these missing features to LocalBlobAccess, 46 | so that we can finally remove CircularBlobAccess from the tree. 47 | 48 | # Supporting file based storage 49 | 50 | A naïve way of adding file based storage to LocalBlobAccess is to 51 | provide a new implementation of the BlockAllocator type that is backed 52 | by a filesystem.Directory. Each of the blocks created through the 53 | NewBlock() function could be backed by an individual file. The downside 54 | of this approach is that releasing blocks can now fail, as it's not 55 | guaranteed that we can always successfully delete files on disk. 56 | 57 | Instead, let's generalize the code in [bb-storage/pkg/blockdevice](https://pkg.go.dev/github.com/buildbarn/bb-storage/pkg/blockdevice) 58 | to support both raw block devices and fixed-size files. Because we only 59 | care about supporting fixed-size files, we can reuse the existing memory 60 | mapping logic. 61 | 62 | More concretely, let's replace MemoryMapBlockDevice() with two separate 63 | NewBlockDeviceFrom{Device,File}() functions, mapping raw block devices 64 | and regular files, respectively. Let's also introduce a generic 65 | configuration message, so that any component in Buildbarn that uses 66 | block devices supports both modes of access. 
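Sketched below is roughly what the generalized API could look like. The exact signatures are assumptions rather than a description of the final code; only the existence of the two constructors and of a handle that supports reads, writes and syncing (renamed to BlockDevice in the refactoring section later in this document) follows from this ADR.

```go
package blockdevice

import "io"

// BlockDevice is a handle to a memory mapped regular file or raw
// block device. Sync() flushes written data to stable storage.
type BlockDevice interface {
	io.ReaderAt
	io.WriterAt
	Sync() error
}

// NewBlockDeviceFromDevice memory maps an existing raw block device.
func NewBlockDeviceFromDevice(path string) (BlockDevice, error) {
	// Implementation omitted from this sketch.
	return nil, nil
}

// NewBlockDeviceFromFile memory maps a regular file, creating it and
// growing it to the requested fixed size if needed.
func NewBlockDeviceFromFile(path string, sizeBytes int64) (BlockDevice, error) {
	// Implementation omitted from this sketch.
	return nil, nil
}
```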
67 | 68 | # Supporting persistency 69 | 70 | The first step to add persistency to LocalBlobAccess is to allow the 71 | digest-location map to be backed by a file. To achieve that, we can 72 | import PR [bb-storage#62](https://github.com/buildbarn/bb-storage/pull/62), 73 | which adds a FileBasedLocationRecordArray. While there, we should modify 74 | it to be built on top of bb-storage/pkg/blockdevice, so that it can use 75 | both raw block devices and files. Let's call this type 76 | BlockDeviceBackedLocationRecordArray instead. 77 | 78 | Storing the digest-location map on disk isn't sufficient to make 79 | LocalBlobAccess persistent. One crucial piece of information that is 80 | missing is the list of Block objects that LocalBlobAccess extracted from 81 | the underlying BlockAllocator, and the write cursors that 82 | LocalBlobAccess tracks for them. PR [bb-storage#63](https://github.com/buildbarn/bb-storage/pull/63) 83 | attempted to solve this by letting LocalBlobAccess write this state into 84 | a text file, so that it can be reloaded on startup. 85 | 86 | The main concern with this approach is that it increases the complexity 87 | of LocalBlobAccess, while that class is already bigger than it should 88 | be. It is hard to get decent unit testing coverage of LocalBlobAccess, 89 | and this change would make that even harder. Let's solve this by moving 90 | the bottom half that is responsible for managing Block objects and write 91 | cursors from LocalBlobAccess into a separate interface, called 92 | BlockList: 93 | 94 | ```go 95 | type BlockList interface { 96 | // Managing the blocks in the list. 97 | PopFront() 98 | PushBack() error 99 | // Reading data from blocks. 100 | Get(blockIndex int, ...) buffer.Buffer 101 | // Writing data to blocks. 102 | HasSpace(blockIndex int, sizeBytes int64) bool 103 | Put(blockIndex int, sizeBytes int64) BlockListPutWriter 104 | } 105 | ``` 106 | 107 | For non-persistent workloads, we can add a basic implementation called 108 | VolatileBlockList that maintains the existing behaviour. For persistent 109 | workloads, we can provide a more complex implementation named 110 | PersistentBlockList. 111 | 112 | With the changes in PR bb-storage#63 applied, LocalBlobAccess writes a 113 | text file to disk every time data is stored. First of all, we should 114 | consider using a Protobuf message for this, as this makes it easier to add 115 | structure and extend the format where needed. Second, we should only 116 | emit the state file periodically and asynchronously, so that I/O and 117 | lock contention remain minimal. Let's solve this by adding the following 118 | two interfaces: 119 | 120 | ```go 121 | // This interface needs to be implemented by PersistentBlockList, so 122 | // that its layout can be extracted. 123 | type PersistentStateSource interface { 124 | // Returns a channel on which the caller can wait for writes to 125 | // occur. This ensures that the system can remain fully idle if no 126 | // I/O takes place. 127 | GetBlockPutWakeup() <-chan struct{} 128 | // Returns the current layout of the PersistentBlockList. 129 | GetPersistentState() *pb.PersistentState 130 | ... 131 | } 132 | 133 | // An implementation of this interface could, for example, write the 134 | // Protobuf message to a file on disk.
135 | type PersistentStateStore interface { 136 | WritePersistentState(persistentState *pb.PersistentState) error 137 | } 138 | ``` 139 | 140 | A PeriodicSyncer helper type can provide a run loop that takes data from 141 | PersistentStateSource and passes it to PersistentStateStore. 142 | 143 | # Persistency with data integrity 144 | 145 | With the changes proposed above, we provide persistency, but don't 146 | guarantee integrity of data in case of unclean system shutdowns. Data 147 | corruption can happen when digest-location map entries are flushed to 148 | disk before all of the data corresponding to the blob is written. This may be 149 | acceptable when LocalBlobAccess is used to hold the CAS (because of 150 | checksum validation), but for the AC it could cause us to interpret 151 | invalid Protobuf data. 152 | 153 | One way to solve this issue is to call `fsync()` (or Linux's own 154 | `sync_file_range()`) between writes of data and digest-location map 155 | insertions, but doing so would severely degrade performance. To prevent the need 156 | for that, we can let a background job (i.e., the PeriodicSyncer that was 157 | described previously) call `fsync()` periodically. By incrementing a 158 | counter value before every `fsync()` call, and embedding its value into 159 | both digest-location map entries and the state file, we can accurately 160 | track which entries are valid and which ones potentially point to 161 | corrupted data. Let's call this counter the 'epoch ID'. References to 162 | blocks in digest-location map entries can now be expressed as follows: 163 | 164 | ```go 165 | type BlockReference struct { 166 | EpochID uint32 167 | // To be subtracted from the index of the last block used at the 168 | // provided epoch. 169 | BlocksFromLast uint16 170 | } 171 | ``` 172 | 173 | One issue with this approach is that once restarts take place and 174 | execution continues at the last persisted epoch ID, the digest-location 175 | map may still contain entries for future epoch IDs. These would need to 176 | be removed, as they would otherwise cause (potentially) corrupted blobs 177 | to become visible again by the time the epoch ID equals the value stored 178 | in the digest-location map entry. 179 | 180 | Removing these entries could be done explicitly by scrubbing the 181 | digest-location map during startup. The disadvantage is that this 182 | introduces a startup delay, whose duration is proportional to the size 183 | of the digest-location map. For larger setups where the digest-location 184 | map is tens of gigabytes in size, this is unacceptable. We should 185 | therefore aim to design a solution that doesn't require any scrubbing. 186 | 187 | To prevent the need for scrubbing, we can associate every epoch ID with 188 | a random number, and enforce that entries in the digest-location map are 189 | based on the same value. This makes it possible to distinguish valid 190 | entries from ones that were written prior to unclean system shutdown. 191 | Instead of storing this value literally as part of digest-location map 192 | entries, it could be used as a seed for a record checksum. This has the 193 | advantage of hardening the digest-location map against silent data 194 | corruption (e.g., bit flips). 195 | 196 | An approach like this does require us to store all these 'hash seeds' 197 | for every epoch with one or more valid blocks as part of the persistent 198 | state file. Fortunately, the number of epochs can only grow relatively 199 | slowly.
If LocalBlobAccess is configured to synchronize data to disk 200 | once every minute and at least some writes take place every minute, 201 | there will still only be 526k epochs in one year. This number can only 202 | be reached if not a single block rotation were to happen during this 203 | timeframe, which is extremely unlikely. There is therefore no need to 204 | put a limit on the maximum number of epochs for the time being. 205 | 206 | Higher levels of LocalBlobAccess code will not be able to operate on 207 | BlockReferences directly. Logic in both HashingDigestLocationMap and 208 | LocalBlobAccess depends on being able to compare block numbers along a 209 | total order. This is why we should make sure that BlockReference is only 210 | used at the storage level, namely inside LocationRecordArray 211 | implementations. To allow these types to look up hash seeds and convert 212 | BlockReferences to integer block indices, we can add the following 213 | helper type: 214 | 215 | ```go 216 | type BlockReferenceResolver interface { 217 | BlockReferenceToBlockIndex(blockReference BlockReference) (int, uint64, bool) 218 | BlockIndexToBlockReference(blockIndex int) (BlockReference, uint64) 219 | } 220 | ``` 221 | 222 | By letting BlockReferenceToBlockIndex() return a boolean value 223 | indicating whether the BlockReference is still valid, we can remove the 224 | existing LocationValidator type, which served a similar purpose. 225 | 226 | One thing that the solution above still lacks is a 227 | feedback mechanism between PeriodicSyncer and PersistentBlockList. If 228 | PeriodicSyncer can't keep up with the changes made to the 229 | PersistentBlockList, there is a chance that PersistentBlockList starts 230 | to recycle blocks that are still referenced by older epochs that are 231 | listed in the persistent state file. This could also cause data 232 | corruption after a restart. 233 | 234 | To address this, PersistentBlockList's PushBack() should be made to fail 235 | explicitly when blocks are recycled too quickly. To reduce the 236 | probability of running into this situation, PersistentBlockList should 237 | offer a GetBlockReleaseWakeup() method that allows PeriodicSyncer to update the 238 | persistent state file without any delay. A 239 | NotifyPersistentStateWritten() method should be added to allow 240 | PeriodicSyncer to notify the PersistentBlockList that it's safe to 241 | recycle blocks that were removed as part of the previous 242 | GetPersistentState() call. 243 | 244 | # Refactoring 245 | 246 | While we're making these drastic changes to LocalBlobAccess, let's spend 247 | some effort to make further cleanups: 248 | 249 | - With the addition of CompactDigest, almost all code of LocalBlobAccess 250 | has been turned into a generic key-value store. It no longer assumes 251 | keys are REv2 digests. Let's rename CompactDigest to Key, and 252 | DigestLocationMap to KeyLocationMap. 253 | 254 | - LocalBlobAccess can be decomposed even further. All of the code that 255 | implements the "old" vs. "current" vs. "new" block distinction 256 | effectively provides a location-to-blob map. Let's move all of this 257 | behind a LocationBlobMap interface, similar to how we have 258 | KeyLocationMap. 259 | 260 | - All of the code in LocalBlobAccess that ties the KeyLocationMap to a 261 | LocationBlobMap can easily be moved into its own helper type. Let's 262 | move that code behind a KeyBlobMap interface, as it simply provides a 263 | mapping from a Key to a blob.
264 | 265 | - What remains in LocalBlobAccess isn't much more than an implementation 266 | of BlobAccess that converts REv2 digests to Keys before calling into 267 | KeyBlobMap. The name LocalBlobAccess now makes little sense, as there 268 | is no requirement that data is stored locally. Let's rename this type 269 | to KeyBlobMapBackedBlobAccess. 270 | 271 | - Now that we have a BlockDeviceBackedLocationRecordArray, let's rename 272 | PartitioningBlockAllocator to BlockDeviceBackedBlockAllocator for 273 | consistency. 274 | 275 | - The handles returned by bb-storage/pkg/blockdevice now need to provide 276 | a Sync() function to flush data to storage. The existing handle type 277 | is named ReadWriterAt, which is a poor fit for an interface that also 278 | provides a Sync() function. Let's rename this to BlockDevice. 279 | 280 | With all of these changes made, it should be a lot easier to achieve 281 | decent unit testing coverage of the code. 282 | 283 | # Architectural diagram of the old and new LocalBlobAccess 284 | 285 | LocalBlobAccess before the proposed changes are made: 286 | 287 |

288 | ![LocalBlobAccess before the proposed changes are made](images/0005/before.png) 289 |

290 | 291 | LocalBlobAccess after the proposed changes are made, but with 292 | persistency still disabled: 293 | 294 |

295 | ![LocalBlobAccess after the proposed changes are made, but with persistency still disabled](images/0005/volatile.png) 296 |

297 | 298 | The classes colored in green perform a similar role to the old 299 | LocalBlobAccess type that they replace. 300 | 301 | LocalBlobAccess after the proposed changes are made, but with 302 | persistency enabled: 303 | 304 |

305 | ![LocalBlobAccess after the proposed changes are made, but with persistency enabled](images/0005/persistent.png) 306 |

307 | 308 | Note that these are not UML class diagrams. The arrows in these diagrams 309 | indicate how objects reference each other. The labels on the arrows 310 | correspond with the interface type that is used. 311 | 312 | # Thanks 313 | 314 | The author would like to thank Tom Coldrick from Codethink Ltd. for his 315 | contributions. His PRs gave good insight in what was needed to add 316 | persistency. 317 | 318 | # Future work 319 | 320 | With bb-storage/pkg/blockdevice generalized to support file storage, 321 | there is no longer any need to keep bb-remote-execution's 322 | DirectoryBackedFilePool, as BlockDeviceBackedFilePool can be used in its 323 | place. We should consider removing DirectoryBackedFilePool. 324 | 325 | Letting the key-location map be backed by a disk may increase access 326 | latency significantly. Should we add a configuration option for calling 327 | `mlock()` on the memory map, so that its contents are guaranteed to 328 | remain in memory? 329 | -------------------------------------------------------------------------------- /0006-operation-logging-and-monetary-resource-usage.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #6: Completed Action Logger and Monetary Resource Usage 2 | 3 | Authors: Travis Takai, Ed Schouten 4 | 5 | Date: 2021-03-23 6 | 7 | # Context 8 | 9 | Action results presented through bb-browser's frontend provide a great way to gain insight into the behavior of remote execution by looking at a single action at a time. The next logical step to this is analyzing build results in aggregate and in real-time. There are many reasons for wanting to analyze remote execution behavior such as identifying build duration trends, estimating the total cost of a build, identifying computationally expensive build targets, evaluating hardware utilization, or correlating completed build actions with data in the [Build Event Protocol](https://docs.bazel.build/versions/master/build-event-protocol.html). bb-browser displays all of the necessary info, but does not allow for programmatic analysis of action results for a number of reasons: 10 | 11 | * Most of the useful information is stored within the [ExecutedActionMetadata's auxiliary_metadata field](https://github.com/bazelbuild/remote-apis/blob/0943dc4e70e1414735a85a3167557392c177ff45/build/bazel/remote/execution/v2/remote_execution.proto#L945-L948) but is not exposed by the bazel client or displayed in the Build Event Protocol output. 12 | 13 | * There aren't any convenient ways for exploring build information via bb-browser as the instance name, action digest, and size of the action result are all needed before querying for an action result. 14 | 15 | * The goal of bb-storage is to offer short-term data storage, which does not allow for any form of querying for historical data. 16 | 17 | * bb-scheduler does not provide a persisted list of build actions and bb-storage has no way of providing access to a sequence of action results in an efficient way. 18 | 19 | An alternative approach that allows for a flexible configuration for clients is to allow for streaming action results along with their associated metadata to an external service. This allows for both long-term data persistence and real-time build action analysis. We should work towards creating this streaming service, which we can call Completed Action Logger. 
20 | 21 | # Completed Action Logger Service 22 | 23 | The protocol will be used by bb_worker to stream details of completed actions, known as CompletedActions, to an external logging service: 24 | 25 | ```protobuf 26 | service CompletedActionLogger { 27 | // Send a CompletedAction to another service as soon as a build action has 28 | // completed. Receiving a message from the return stream indicates that the 29 | // service successfully received the CompletedAction. 30 | rpc LogCompletedActions(stream CompletedAction) 31 | returns (stream google.protobuf.Empty); 32 | } 33 | ``` 34 | 35 | Each worker will have the ability to stream CompletedActions to the logging service once the action result has been created and all metadata has been attached. CompletedActions will take the form of: 36 | 37 | ```protobuf 38 | // CompletedAction wraps a finished build action in order to transmit to 39 | // an external service. 40 | message CompletedAction { 41 | // A wrapper around the action's digest and REv2 ActionResult, which contains 42 | // the action's associated metadata. 43 | buildbarn.cas.HistoricalExecuteResponse historical_execute_response = 1; 44 | 45 | // A unique identifier associated with the CompletedAction, which is 46 | // generated by the build executor. This provides a means by which the 47 | // external logging service may be able to deduplicate incoming 48 | // CompletedActions. The usage of this field is left to the external 49 | // logging service to determine. 50 | string uuid = 2; 51 | 52 | // The REv2 instance name of the remote cluster that workers are returning 53 | // the action result from. 54 | string instance_name = 3; 55 | } 56 | ``` 57 | 58 | The HistoricalExecuteResponse message is simply bb-storage's UncachedActionResult that will be renamed in order to be used more generally. Apart from the Action digest included in the historical_execute_response field, no information about the Action is part of CompletedAction. Implementations of the CompletedActionLogger service can load objects from the Content Addressable Storage in case they need to inspect details of the Action. 59 | 60 | # Monetary Resource Usage 61 | 62 | Now that we've defined what the Completed Action Logger is, let's go ahead and implement one of the desired uses for action analysis: metadata for measuring the cost of build execution. Defining a new message, which we'll call MonetaryResourceUsage, provides a nice way of calculating how much a given build cost based on the amount of time spent executing the action and will take the form of: 63 | 64 | ```protobuf 65 | // A representation of unique factors that may be aggregated to 66 | // compute a given build action's total price. 67 | message MonetaryResourceUsage { 68 | message Expense { 69 | // The type of currency the cost is measured in. Required to be in 70 | // ISO 4217 format: https://en.wikipedia.org/wiki/ISO_4217#Active_codes 71 | string currency = 1; 72 | 73 | // The value of a specific expense for a build action. 74 | double cost = 2; 75 | } 76 | 77 | // A mapping of expense categories to their respective costs. 78 | map<string, Expense> expenses = 1; 79 | } 80 | ``` 81 | 82 | This will be appended to auxiliary_metadata that is part of the REv2 ExecutedActionMetadata message at the end of execution, which will automatically ensure it is cached along with the other build metadata in bb-storage.
While expenses are not significant on a per-action basis, when combined with the CompletedActionLogger service we now have a way to quantify how much a given build invocation or target costs and see how that changes over time. Implementations of the CompletedActionLogger service are responsible for aggregating these MonetaryResourceUsage messages. It is possible to aggregate this data by making use of the fields within the RequestMetadata message, such as [tool_invocation_id](https://github.com/bazelbuild/remote-apis/blob/0943dc4e70e1414735a85a3167557392c177ff45/build/bazel/remote/execution/v2/remote_execution.proto#L1760-L1762) or the [recently added target_id field](https://github.com/bazelbuild/remote-apis/pull/186/files), as the RequestMetadata data is always appended to the auxiliary_metadata message. 83 | -------------------------------------------------------------------------------- /0007-nested-invocations.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #7: Nested invocations 2 | 3 | Author: Ed Schouten
4 | Date: 2021-12-27 5 | 6 | # Context 7 | 8 | On 2020-11-16, bb\_scheduler was extended to take the Bazel invocation 9 | ID into account ([commit](https://github.com/buildbarn/bb-remote-execution/commit/1a8407a0e62bf559abcd4006e0ecc9d3de9838c7)). 10 | With this change applied, bb\_scheduler no longer places all incoming 11 | operations in a single FIFO-like queue (disregarding priorities). 12 | Instead, every invocation of Bazel gets its own queue. All of these 13 | queues combined are placed in another queue that determines the order of 14 | execution. The goal of this change was to improve fairness, by making 15 | the scheduler give every invocation an equal number of workers. 16 | 17 | Under the hood, [`InMemoryBuildQueue` calls into `ActionRouter`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/in_memory_build_queue.go#L378), which then [calls into `invocation.KeyExtractor`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/routing/simple_action_router.go#L40) 18 | to obtain an [`invocation.Key`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/key.go). 19 | The latter acts as [a map key](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/in_memory_build_queue.go#L1201) 20 | for all invocations known to the scheduler. 21 | [The default implementation](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/request_metadata_key_extractor.go) 22 | is one that extracts [the `tool_invocation_id` field from the REv2 `RequestMetadata` message](https://github.com/bazelbuild/remote-apis/blob/636121a32fa7b9114311374e4786597d8e7a69f3/build/bazel/remote/execution/v2/remote_execution.proto#L1836). 23 | 24 | What is pretty elegant about this design is that `InMemoryBuildQueue` 25 | doesn't really care about what kind of logic is used to compute 26 | invocation keys. One may, for example, provide a custom 27 | `invocation.KeyExtractor` that causes operations to be grouped by 28 | username, or by [the `correlated_invocations_id` field](https://github.com/bazelbuild/remote-apis/blob/636121a32fa7b9114311374e4786597d8e7a69f3/build/bazel/remote/execution/v2/remote_execution.proto#L1840). 29 | 30 | On 2021-07-16, bb\_scheduler was extended once more to use Bazel 31 | invocation IDs to provide worker locality ([commit](https://github.com/buildbarn/bb-remote-execution/commit/902aab278baa45c92d48cad4992c445cfc588e32)). 32 | This change causes the scheduler to keep track of invocation ID of the 33 | last operation to run on a worker. When scheduling successive tasks, the 34 | scheduler will try to reuse a worker with a matching invocation ID. This 35 | increases the probability of input root cache hits, as it is not 36 | uncommon for clients to run many actions having similar input root 37 | contents. 38 | 39 | At the same time, this change introduced [the notion of 'stickiness'](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/proto/configuration/bb_scheduler/bb_scheduler.proto#L177-L196), 40 | where a cluster operator can effectively quantify the overhead of 41 | switching between actions belonging to different invocations. 
Such 42 | overhead could include the startup time of virtual machines, download 43 | times of container images specified in REv2 platform properties. In the 44 | case embedded hardware testing, it could include the time it takes to 45 | flash an image to EEPROM/NAND and boot from it. By using a custom 46 | `invocation.KeyExtractor` that extracts these properties from incoming 47 | execution requests, the scheduler can reorder operations in such a way 48 | that the number of restarts, reboots, flash erasure cycles, etc. is 49 | reduced. 50 | 51 | One major limitation of how this feature is implemented right now, is 52 | that invocations are a flat namespace. The scheduler only provides 53 | fairness by keying execution requests by a single property. For example, 54 | there is no way to configure the scheduler to provide the following: 55 | 56 | - Fairness by username, followed by fairness by Bazel invocation ID for builds 57 | with an equal username. 58 | - Fairness by team/project/department, followed by fairness by username. 59 | - Fairness both by correlated invocations ID and tool invocation ID. 60 | - Fairness/stickiness by worker container image, followed by fairness by 61 | Bazel invocation ID. 62 | 63 | In this ADR we propose to remove this limitation, with the goal of 64 | improving fairness of multi-tenant Buildbarn setups. 65 | 66 | # Proposed changes 67 | 68 | This ADR mainly proposes that operations no longer have a single 69 | `invocation.Key`, but a list of them. The flat map of invocations 70 | managed by `InMemoryBuildQueue` will be changed to a 71 | [trie](https://en.wikipedia.org/wiki/Trie), where every level of the 72 | trie corresponds to a single `invocation.Key`. More concretely, if the 73 | scheduler is configured to not apply any fairness whatsoever, all 74 | operations will simply be queued within the root invocation. If the 75 | `ActionRouter` is configured to always return two `invocation.Key`s, the 76 | trie will have height two, having operations queued only at the leaves. 77 | 78 | When changing this data structure, we'll also need to rework all of the 79 | algorithms that are applied against it. For example, we currently use a 80 | [binary heap](https://en.wikipedia.org/wiki/Binary_heap) to store all 81 | invocations that have one or more queued operations, which is used for 82 | selecting the next operation to run. This will be replaced with a heap 83 | of heaps that mimics the layout of the trie. These algorithms will thus 84 | all become a bit more complex than before. 85 | 86 | Fortunately, there are also some places where the algorithms become 87 | simpler. For the worker locality, we currently have to deal with the 88 | case where the last task that ran on a worker was associated with 89 | multiple invocations (due to in-flight deduplication). Or it may not be 90 | associated with any invocation if the worker was created recently. In 91 | the new model we can simply pick the 92 | [lowest common ancestor](https://en.wikipedia.org/wiki/Lowest_common_ancestor) 93 | of all invocations in case of in-flight deduplication, and pick the root 94 | invocation for newly created workers. This ensures that every worker is 95 | always associated with exactly one invocation. 96 | 97 | For `InMemoryBuildQueue` to obtain multiple `invocation.Key`s for a 98 | given operation, we will need to alter `ActionRouter` to return a list 99 | instead of a single instance. 
The underlying `invocation.KeyExtractor` 100 | interface can remain as is, though we can remove 101 | [`invocation.EmptyKeyExtractor`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/empty_key_extractor.go). 102 | The same behaviour can now be achieved by not using any key extractors 103 | at all. 104 | 105 | Finally we have to extend the worker invocation stickiness logic to work 106 | with this nested model. As every level of the invocation tree now has a 107 | different cost when switching to a different branch, we now need to let 108 | the configuration accept a list of durations, each indicating the cost 109 | at that level. 110 | 111 | All of the changes proposed above have been implemented in 112 | [this commit](https://github.com/buildbarn/bb-remote-execution/commit/744c9a65215179f19f20523acd522937a03aad56). 113 | 114 | # Future work 115 | 116 | With `invocation.EmptyKeyExtractor` removed, only 117 | `invocation.RequestMetadataKeyExtractor` remains. This type is capable 118 | of creating an `invocation.Key` based on the `tool_invocation_id`. We 119 | should add more types, such as one capable of extracting the 120 | `correlated_invocations_id` or authentication related properties (e.g., 121 | username). 122 | 123 | Even though the scheduling policy of `InMemoryBuildQueue` is fair, there 124 | is no support for [preemption](https://en.wikipedia.org/wiki/Preemption_(computing)). 125 | At no point will the scheduler actively move operations from the 126 | `EXECUTING` back to the `QUEUED` stage if it prevents other invocations 127 | from making progress. 128 | -------------------------------------------------------------------------------- /0009-nfsv4.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #9: An NFSv4 server for bb\_worker 2 | 3 | Author: Ed Schouten
4 | Date: 2022-05-23 5 | 6 | # Context 7 | 8 | Buildbarn's worker process (bb\_worker) can be configured to populate 9 | build directories of actions in two different ways: 10 | 11 | - By instantiating it as [a native directory on a local file system](https://github.com/buildbarn/bb-remote-execution/blob/78e082c6793904705001a9674bc865f91d7624bf/pkg/proto/configuration/bb_worker/bb_worker.proto#L79-L83). 12 | To speed up this process, it may use a directory where recently used 13 | files are cached. Files are hardlinked from/to this cache directory. 14 | - By instantiating it in-memory, while [using FUSE to make it accessible to the build action](https://github.com/buildbarn/bb-remote-execution/blob/78e082c6793904705001a9674bc865f91d7624bf/pkg/proto/configuration/bb_worker/bb_worker.proto#L85-L98). 15 | An instance of the LocalBlobAccess storage backend needs to be used to 16 | cache file contents. 17 | 18 | While the advantage of the former is that it does not introduce any 19 | overhead while executing, the process may be slow for large input roots, 20 | especially if only a fraction gets used in practice. The FUSE file 21 | system has the advantage that data is loaded lazily, meaning files and 22 | directories of the input root are only downloaded from the CAS if their 23 | contents are read during execution. This is particularly useful for 24 | actions that ship their own SDKs. 25 | 26 | An issue with FUSE is that it remains fairly Linux specific. Other 27 | operating systems also ship with implementations of FUSE or allow it to 28 | be installed as a kernel module/extension, but these implementations 29 | tend to vary in terms of quality and conformance. For example, for macOS 30 | there is [macFUSE (previously called OSXFUSE)](https://osxfuse.github.io). 31 | Though bb\_worker can be configured to work with macFUSE, it does tend 32 | to cause system lockups under high load. Fixing this is not easy, as 33 | macFUSE is no longer Open Source Software. 34 | 35 | For this reason we would like to offer an alternative to FUSE, namely an 36 | integrated NFSv4 server that listens on `localhost` and is mounted on 37 | the same system. FUSE will remain supported and recommended for use on 38 | Linux; NFSv4 should only be used on systems where the use of FUSE is 39 | undesirable and a high-quality NFSv4 client is available. 40 | 41 | We will focus on implementing NFSv4.0 as defined in [RFC 7530](https://datatracker.ietf.org/doc/html/rfc7530#section-16.10). 42 | Implementing newer versions such as NFSv4.1 ([RFC 8881](https://www.rfc-editor.org/rfc/rfc8881#name-obsolete-locking-infrastruc)) 43 | and NFSv4.2 ([RFC 7862](https://datatracker.ietf.org/doc/html/rfc7862)) 44 | is of little use, as clients such as the one shipped with macOS don't 45 | support them. We should also not be using NFSv3 ([RFC 1813](https://datatracker.ietf.org/doc/html/rfc1813)), 46 | as due to its lack of [compound operations](https://datatracker.ietf.org/doc/html/rfc7530#section-1.4.2), 47 | it is far more 'chatty' than NFSv4. This would lead to unnecessary 48 | context switching between bb\_worker and build actions. 49 | 50 | # Reusing as much code as possible between FUSE and NFSv4 51 | 52 | [The code for our existing FUSE file system](https://github.com/buildbarn/bb-remote-execution/tree/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/filesystem/fuse) 53 | is about 7500 lines of code. We have already invested heavily in it, and 54 | it has received many bugfixes for issues that we have observed in 55 | production use. 
The last thing we want to do is to add a brand new 56 | `pkg/filesystem/nfsv4` package that has to reimplement all of this for 57 | NFSv4. Not only will this be undesirable from a maintenance perspective, 58 | it also puts us at risk of introducing behavioral differences between 59 | the FUSE and NFSv4 implementations, making it hard to switch between the 60 | two. 61 | 62 | As FUSE and NFSv4 are conceptually identical (i.e., request-response 63 | based services for accessing a POSIX-like file system), we should try to 64 | move to a shared codebase. This ADR therefore proposes that the existing 65 | `pkg/filesystem/fuse` package is decomposed into three new packages: 66 | 67 | - `pkg/filesystem/virtual`, which will contain the vast majority of code 68 | that can be made independent of [go-fuse](https://github.com/hanwen/go-fuse). 69 | - `pkg/filesystem/virtual/fuse`, which contains the coupling with 70 | go-fuse. 71 | - `pkg/filesystem/virtual/configuration`, which contains the code for 72 | instantiating a virtual file system based on configuration settings 73 | and exposing (mounting) it. 74 | 75 | This decomposition does require us to make various refactoring changes. 76 | Most notably, the [`Directory`](https://github.com/buildbarn/bb-remote-execution/blob/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/filesystem/fuse/directory.go#L38-L73) 77 | and [`Leaf`](https://github.com/buildbarn/bb-remote-execution/blob/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/filesystem/fuse/leaf.go#L13-L23) 78 | interfaces currently depend on many data types that are part of go-fuse, 79 | such as status codes, directory entry structures, and file attribute 80 | structures. All of these need to be replaced with equivalents that are 81 | generic enough to support both the semantics of FUSE and NFSv4. 82 | Differences between these two protocols that need to be bridged include 83 | the following: 84 | 85 | - FUSE's READDIR operation is stateful, in that it's surrounded by 86 | OPENDIR and RELEASEDIR operations. This means that we currently load 87 | all of the directory contents at once, and paginate the results as 88 | part of READDIR. With NFSv4 this operation needs to be stateless. 89 | We solve this by pushing down the pagination into the `Directory` 90 | implementations. The new `VirtualReadDir()` method we add takes a 91 | starting offset and pushes directory entries into a 92 | `DirectoryEntryReporter` until space has run out. This makes both FUSE 93 | and NFSv4 use a stateless approach. 94 | 95 | - FUSE has the distinction between READDIR and READDIRPLUS. The former 96 | only returns filenames, file types and inode numbers, while the latter 97 | returns full `stat` information. NFSv4 only provides a single READDIR 98 | operation, but just like with the GETATTR operation, the caller can 99 | provide a bitmask of attributes that it's interested in receiving. As 100 | FUSE's semantics can be emulated on top of NFSv4's, we'll change our 101 | API to use an `AttributesMask` type as well. 102 | 103 | - When FUSE creates a regular file for reading/writing, it calls CREATE, 104 | providing it the identifier of the parent directory and the filename. 105 | This differs from opening an existing file, which is done through 106 | OPEN, providing it the identifier of the file. NFSv4, however, only 107 | provides a single OPEN operation that is used in all cases. We'll 108 | therefore replace `Directory.FUSECreate()` with 109 | `Directory.VirtualOpenChild()`, which can be used by FUSE CREATE and 110 | NFSv4 OPEN. 
`Leaf.FUSEOpen()` will remain available under the name 111 | `Leaf.VirtualOpenSelf()`, to be used by FUSE OPEN. 112 | 113 | With all of these changes landed, we'll be able to instantiate a virtual 114 | file system as part of bb\_worker that can both be interacted with from 115 | within FUSE and NFSv4 servers. 116 | 117 | **Status:** This work has been completed as part of [commit `c4bbd24`](https://github.com/buildbarn/bb-remote-execution/commit/c4bbd24a8d272267f40b234533f62c8ccf24d568). 118 | The new virtual file system layer can be found in [`pkg/filesystem/virtual`](https://github.com/buildbarn/bb-remote-execution/tree/70cca3b240190be22d3c5866ac5635d14515bf24/pkg/filesystem/virtual). 119 | 120 | # Indexing the virtual file system by file handle 121 | 122 | NFSv4 clients refer to files and directories on the server by file 123 | handle. File handles are byte arrays that can be between 0 and 128 bytes 124 | in size (0 and 64 bytes for NFSv3). Many NFS servers on UNIX-like 125 | platforms construct file handles by concatenating an inode number with a 126 | file generation count to ensure that file handles remain distinct, even 127 | if inodes are recycled. As handles are opaque to the client, a server 128 | can choose any format it desires. 129 | 130 | As NFSv4 is intended to be a (mostly) stateless protocol, the server has 131 | absolutely no information on when a client is going to stop interacting 132 | with a file handle. At any point in time, a request can contain a file 133 | handle that was returned previously. Our virtual file system must 134 | therefore not just allow resolution by path, but also by file handle. 135 | This differs fundamentally from FUSE, where the kernel and userspace 136 | service share knowledge on which subset of the file system is in play 137 | between the two. The kernel issues FORGET calls when purging file 138 | entries from its cache, allowing the userspace service to release the 139 | resource as well. 140 | 141 | Where NFSv4's semantics become problematic is not necessarily in 142 | bb\_worker, but in bb\_clientd. This service provides certain 143 | directories that are infinitely big (i.e., `/cas/*`, which allow you to 144 | access arbitrary CAS contents). Unlike build directories, files in these 145 | directories have an infinite lifetime, meaning that any dynamic 146 | allocation scheme for file handles would result in memory leaks. For 147 | files in these directories we will need to dump their state (i.e., the 148 | REv2 digest) into the file handle itself, allowing the file to be 149 | reconstructed when needed. 150 | 151 | To achieve the above, we will add a new `HandleAllocator` API that is 152 | capable of decorating `Directory` and `Leaf` objects to give them their 153 | own identity and perform lifecycle management. In practice, this means 154 | that a plain `Directory` or `Leaf` implementation will no longer have an 155 | inode number or link count. Only by wrapping the object using 156 | `HandleAllocator` will this information become available through 157 | `VirtualGetAttributes()`. In the case of FUSE, `HandleAllocator` will do 158 | little more than compute an inode number, just like [`InodeNumberTree`](https://github.com/buildbarn/bb-remote-execution/blob/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/filesystem/fuse/inode_number_tree.go) 159 | already did. 
For NFSv4, it will additionally generate a file handle and 160 | store the object in a global map, so that the object can be resolved by 161 | the NFSv4 server if the client performs a PUTFH operation. 162 | 163 | The `HandleAllocator` API will distinguish between three types of 164 | allocations: 165 | 166 | - **Stateful objects:** Files or directories that are mutable. Each 167 | instance has its own dynamically allocated inode number/file handle. 168 | - **Stateless objects:** Files or directories that are immutable, but 169 | have a state that cannot be reproduced from just a file handle. 170 | Examples may include symbolic links created by build actions. As a 171 | symlink's target can be larger than 128 bytes, there is no way to 172 | embed a symlink's state into a file handle. This means that even 173 | though the symlink may have a deterministic inode number/file handle, 174 | its lifecycle should be tracked explicitly. 175 | - **Resolvable objects:** Files or directories that are immutable, and 176 | have state that is small enough to embed into the file handle. 177 | Examples include CAS backed files, as a SHA-256 sum, file size and 178 | executable bit can easily fit in a file handle. 179 | 180 | **Status:** This work has also been completed as part of [commit `c4bbd24`](https://github.com/buildbarn/bb-remote-execution/commit/c4bbd24a8d272267f40b234533f62c8ccf24d568). 181 | Implementations of `HandleAllocator` have been added for [FUSE](https://github.com/buildbarn/bb-remote-execution/blob/70cca3b240190be22d3c5866ac5635d14515bf24/pkg/filesystem/virtual/fuse_handle_allocator.go) 182 | and [NFSv4](https://github.com/buildbarn/bb-remote-execution/blob/70cca3b240190be22d3c5866ac5635d14515bf24/pkg/filesystem/virtual/nfs_handle_allocator.go). 183 | 184 | # XDR description to Go compilation 185 | 186 | All versions of NFS are built on top of ONC RPCv2 ([RFC 5531](https://datatracker.ietf.org/doc/html/rfc5531)), 187 | also known as "Sun RPC". Like most protocols built on top of ONC RPCv2, 188 | NFS uses XDR ([RFC 4506](https://datatracker.ietf.org/doc/html/rfc4506)) 189 | for encoding request and response payloads. The XDR description for 190 | NFSv4.0 can be found in [RFC 7531](https://datatracker.ietf.org/doc/html/rfc7531). 191 | 192 | A nice feature of schema languages like XDR is that they can be used to 193 | perform code generation. For each of the types described in the RFC, we 194 | may emit an equivalent native type in Go, together with serialization 195 | and deserialization methods. In addition to making our server's code 196 | more readable, it is less error prone than attempting to read/write raw 197 | bytes from/to a socket. 198 | 199 | A disadvantage of this approach is that it does add overhead. When 200 | converted to native Go types, requests are no longer stored contiguously 201 | in some buffer, but may be split up into multiple objects, which may 202 | become heap allocated (thus garbage collector backed). Though this is a 203 | valid concern, we will initially assume that this overhead is 204 | acceptable. In-kernel implementations tend to make a different trade-off 205 | in this regard, but this is likely a result of memory management in 206 | kernel space being far more restricted. 
207 | 208 | A couple of implementations of XDR for Go exist: 209 | 210 | - [github.com/davecgh/go-xdr](https://github.com/davecgh/go-xdr) 211 | - [github.com/stellar/go/xdr](https://github.com/stellar/go/tree/master/xdr) 212 | - [github.com/xdrpp/goxdr](https://github.com/xdrpp/goxdr) 213 | 214 | Unfortunately, none of these implementations are complete enough to be 215 | of use for this specific use case. We will therefore design our own 216 | implementation, which we will release as a separate project that does 217 | not depend on any Buildbarn code. 218 | 219 | **Status:** [The XDR to Go compiler has been released on GitHub.](https://github.com/buildbarn/go-xdr) 220 | 221 | # The actual NFSv4 server 222 | 223 | With all of the previous tasks completed, we have all of the building 224 | blocks in place to be able to add an NFSv4 server to the 225 | bb-remote-execution codebase. All that is left is to write an 226 | implementation of [program `NFS4_PROGRAM`](https://datatracker.ietf.org/doc/html/rfc7531#page-36), 227 | for which the XDR to Go compiler automatically generates the following 228 | interface: 229 | 230 | ```go 231 | type Nfs4Program interface { 232 | NfsV4Nfsproc4Null(context.Context) error 233 | NfsV4Nfsproc4Compound(context.Context, *Compound4args) (*Compound4res, error) 234 | } 235 | ``` 236 | 237 | This implementation, which we will call `baseProgram`, needs to process 238 | the operations provided in `Compound4args` by translating them to calls 239 | against instances of `Directory` and `Leaf`. 240 | 241 | For most NFSv4 operations this implementation will be relatively simple. 242 | For example, for RENAME it involves little more than extracting the 243 | directory objects and filenames from the request, followed by calling 244 | `Directory.VirtualRename()`. 245 | 246 | Most of the complexity of `baseProgram` will lie in how operations like 247 | OPEN, CLOSE, LOCK, and LOCKU are implemented. These operations establish 248 | and alter state on the server, meaning that they need to be guarded 249 | against replays and out-of-order execution. To solve this, NFSv4.0 250 | requires that these operations are executed in the context of 251 | open-owners and lock-owners. Each open-owner and lock-owner has a 252 | sequence ID associated with it, which gets incremented whenever an 253 | operation succeeds. The server can thus detect replays of previous 254 | requests by comparing the sequence ID in the client's request with the 255 | value stored on the server. If the sequence ID corresponds to the last 256 | operation to execute, a cached response is returned. A new transaction 257 | will only be performed if the sequence ID is one larger than the last 258 | observed. 259 | 260 | The exact semantics of this sequencing model is fairly complex. It 261 | is covered extensively in [chapter 9 of RFC 7530](https://datatracker.ietf.org/doc/html/rfc7530#section-9), 262 | which is about 40 pages long. The following attempts to summarize which 263 | data types we have declared as part of `baseProgram`, and how they map 264 | to the NFSv4.0 sequencing model. 265 | 266 | - `baseProgram`: In the case of bb\_worker, zero or more build 267 | directories are declared to be exposed through an NFSv4 server. 268 | - `clientState`: Zero or more clients may be connected to the NFSv4 269 | server. 270 | - `clientConfirmationState`: The client may have one or more client 271 | records, which are created through SETCLIENTID. 
Multiple client 272 | records can exist if the client loses all state and reconnects 273 | (e.g., due to a reboot). 274 | - `confirmedClientState`: Up to one of these client records can be 275 | confirmed using SETCLIENTID\_CONFIRM. This structure stores the 276 | state of a healthy client that is capable of opening files and 277 | acquiring locks. 278 | - `openOwnerState`: Confirmed clients may have zero or more 279 | open-owners. This structure stores the current sequence 280 | number of the open-owner. It also holds the response of the last 281 | CLOSE, OPEN, OPEN\_CONFIRM or OPEN\_DOWNGRADE call for replays. 282 | - `openOwnerFileState`: An open-owner may have zero or more 283 | open files. The first time a file is opened through this 284 | open-owner, the client needs to call OPEN\_CONFIRM. 285 | - `lockOwnerState`: Confirmed clients may have zero or more 286 | lock-owners. This structure stores the current sequence number 287 | of the lock-owner. It also holds the response of the last LOCK or 288 | LOCKU call for replays. 289 | - `lockOwnerFileState`: A lock-owner may have one or more 290 | files with lock state. 291 | - `ByteRangeLock`: A lock state may hold locks on 292 | byte ranges in the file. 293 | 294 | Even though NFSv4.0 does provide a RELEASE\_LOCKOWNER operation for 295 | removing lock-owners, no facilities are provided for removing unused 296 | open-owners and client records. `baseProgram` will be implemented in 297 | such a way that these objects are removed automatically if a 298 | configurable amount of time passes. This is done as part of general lock 299 | acquisition, meaning clients are collectively responsible for cleaning 300 | stale state. 301 | 302 | A disadvantage of NFSv4.0's sequencing model is that open-owners are not 303 | capable of sending OPEN requests in parallel. It is not expected that 304 | this causes a bottleneck in our situation, as running the NFSv4 server 305 | on the worker itself means that latency is virtually nonexistent. It is 306 | worth noting that [NFSv4.1 has completely overhauled this part of the protocol](https://www.rfc-editor.org/rfc/rfc8881.html#section-8.8), 307 | thereby removing this restriction. Implementing this model is, as 308 | explained earlier on, out of scope. 309 | 310 | **Status:** This work has been completed as part of [commit `f00b857`](https://github.com/buildbarn/bb-remote-execution/commit/f00b8570c2cb0223f1faedcd83f84709a268b457). 311 | The new NFSv4 server can be found in [`pkg/filesystem/virtual/nfsv4`](https://github.com/buildbarn/bb-remote-execution/blob/f00b8570c2cb0223f1faedcd83f84709a268b457/pkg/filesystem/virtual/nfsv4/base_program.go). 312 | 313 | # Changes to the configuration schema 314 | 315 | The FUSE file system is currently configured through 316 | [the `MountConfiguration` message](https://github.com/buildbarn/bb-remote-execution/blob/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/proto/configuration/fuse/fuse.proto#L9-L91). 317 | Our plan is to split this message up, moving all FUSE specific options 318 | into a `FUSEMountConfiguration` message. This message can then be placed 319 | in a `oneof` together with an `NFSv4MountConfiguration` message that 320 | enables the use of the NFSv4 server. Switching back and forth between 321 | FUSE and NFSv4 should thus be trivial. 322 | 323 | **Status:** This work has been completed as part of [commit `f00b857`](https://github.com/buildbarn/bb-remote-execution/commit/f00b8570c2cb0223f1faedcd83f84709a268b457). 
324 | The new `MountConfiguration` message with separate FUSE and NFSv4 backends can 325 | be found [here](https://github.com/buildbarn/bb-remote-execution/blob/f00b8570c2cb0223f1faedcd83f84709a268b457/pkg/proto/configuration/filesystem/virtual/virtual.proto#L13-L31). 326 | 327 | # Future work 328 | 329 | In this ADR we have mainly focused on the use of NFSv4 for bb\_worker. 330 | These changes will also make it possible to launch bb\_clientd with an 331 | NFSv4 server. bb\_clientd's use case differs from bb\_worker's, in that 332 | the use of the [Remote Output Service](https://github.com/buildbarn/bb-clientd#-to-perform-remote-builds-without-the-bytes) 333 | heavily depends on being able to quickly invalidate directory entries 334 | when lazy-loading files are inserted into the file system. FUSE is 335 | capable of facilitating this by sending FUSE\_NOTIFY\_INVAL\_ENTRY 336 | messages to the kernel. A similar feature named CB\_NOTIFY is present in 337 | NFSv4.1 and later, but rarely implemented by clients. 338 | 339 | Maybe bb\_clientd can be made to work by disabling client-side directory 340 | caching. Would performance still be acceptable in that case? 341 | -------------------------------------------------------------------------------- /0010-file-system-access-cache.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #10: The File System Access Cache 2 | 3 | Author: Ed Schouten
4 | Date: 2023-01-20 5 | 6 | # Context 7 | 8 | Buildbarn's workers can be configured to expose input roots of actions 9 | through a virtual file system, using either FUSE or NFSv4. One major 10 | advantage of this feature is that it causes workers to only download 11 | objects from the Content Addressable Storage (CAS) if they are accessed 12 | by actions as part of their execution. For example, a compilation action 13 | that ships its own SDK will only cause the worker to download the 14 | compiler binary and standard libraries/headers that are used by the code 15 | to be compiled, as opposed to downloading the entire SDK. 16 | 17 | Using the virtual file system generally makes actions run faster. We 18 | have observed that in aggregate, it allows actions to run at >95% of 19 | their original speed, while eliminating the downloading phase that 20 | normally leads up to execution. Also, the reduction in bandwidth against 21 | storage nodes allows clusters to scale further. 22 | 23 | Unfortunately, there are also certain workloads for which the virtual 24 | file system performs poorly. As most traditional software reads files 25 | from disk using synchronous APIs, actions that observe poor cache hit 26 | rates can easily become blocked when the worker needs to make many round 27 | trips to the CAS. Increased latency (distance) between storage and 28 | workers only amplifies these delays. 29 | 30 | A more traditional worker that loads all data up front would not be 31 | affected by this issue, because it can use parallelism to 32 | load many objects from the CAS. This means that one could avoid this 33 | issue by creating a cluster consisting of workers both with and without 34 | a virtual file system. Tags could be added to build targets declared in 35 | the user's repository to indicate which kind of worker should be used 36 | for a given action, based on observed performance characteristics. 37 | However, such a solution would provide a poor user experience. 38 | 39 | # Introducing the File System Access Cache (FSAC) 40 | 41 | What if there were a way for workers to load data up front, but somehow 42 | restrict this process to data that the worker **thinks** is going to be 43 | used? This would be optimal, both in terms of throughput and bandwidth 44 | usage. Assuming the number of inaccuracies is low, this shouldn't cause 45 | any harm: 46 | 47 | - **False negatives:** In case a file or directory is not loaded up 48 | front, but does get accessed by the action, it can be loaded on demand 49 | using the logic that is already provided by the virtual file system 50 | and storage layer. 51 | 52 | - **False positives:** In case a file or directory is loaded up front, 53 | but does not get used, it merely leads to some wasted network traffic 54 | and worker-level cache usage. 55 | 56 | Even though it is possible to use heuristics for certain kinds of 57 | actions to derive which paths they will access as part of their 58 | execution, this cannot be done in the general case. Because of that, we 59 | will depend on file system access profiles obtained from 60 | historical executions of similar actions. These profiles will be stored 61 | in a new data store, named the File System Access Cache (FSAC).
62 | 
63 | When we added the Initial Size Class Cache (ISCC) for automatic
64 | worker size selection, we observed that, when combined,
65 | [the Command digest](https://github.com/bazelbuild/remote-apis/blob/f6089187c6580bc27cee25b01ff994a74ae8658d/build/bazel/remote/execution/v2/remote_execution.proto#L458-L461)
66 | and [platform properties](https://github.com/bazelbuild/remote-apis/blob/f6089187c6580bc27cee25b01ff994a74ae8658d/build/bazel/remote/execution/v2/remote_execution.proto#L514-L522)
67 | are very suitable keys for grouping similar actions together. The FSAC
68 | will thus use the same 'reduced Action digests' as the ISCC for its keys.
69 | 
70 | As the set of paths accessed by some actions can be large, we're going
71 | to apply two techniques to ensure the profiles stored in the FSAC remain
72 | small:
73 | 
74 | 1. Instead of storing all paths in literal string form, we're going to
75 |    store them as a Bloom filter. Bloom filters are probabilistic data
76 |    structures, which have a bounded and configurable false positive
77 |    rate. Using Bloom filters, we only need 14 bits of space per path to
78 |    achieve a 0.1% false positive rate.
79 | 1. For actions that access a very large number of paths, we're going to
80 |    limit the maximum size of the Bloom filter. Even though this will
81 |    increase the false positive rate, this is likely acceptable. It is
82 |    uncommon to have actions that access many different paths, **and**
83 |    leave a large portion of the input root unaccessed.
84 | 
85 | # Tying the FSAC into bb\_worker
86 | 
87 | To let bb\_worker make use of the FSAC, we're going to add a
88 | PrefetchingBuildExecutor that does two things in parallel:
89 | 
90 | 1. Immediately call into an underlying BuildExecutor to launch the
91 |    action as usual.
92 | 1. Load the file system access profile from the FSAC, and fetch files in
93 |    the input root that are matched by the Bloom filter contained in the
94 |    profile.
95 | 
96 | The worker and the action will thus 'race' with each other to see which
97 | can fetch the input files first. If the action wins, we terminate the
98 | prefetching.
99 | 
100 | Furthermore, we will extend the virtual file system layer to include the
101 | hooks that are necessary to measure which parts of the input root are
102 | being read. We can wrap the existing InitialContentsFetcher and Leaf
103 | types to detect reads against directories and files, respectively. As
104 | these types are strongly coupled to the virtual file system layer, we
105 | will add new {Unread,Read}DirectoryMonitor interfaces that
106 | PrefetchingBuildExecutor can implement to listen for file system
107 | activity in a loosely coupled fashion.
108 | 
109 | Once execution has completed and PrefetchingBuildExecutor has captured
110 | all of the paths that were accessed, it can write a new file system
111 | access profile into the FSAC. This only needs to be done if the profile
112 | differs from the one that was fetched from the FSAC at the start of the
113 | build.
114 | 
115 | # Tying the FSAC into bb\_browser
116 | 
117 | The file system access profiles in the FSAC may also be of use for
118 | people designing their own build rules. Namely, the profiles may be used
119 | to determine whether the input root contains data that can be omitted.
120 | Because of that, we will extend bb\_browser to overlay this information
121 | on top of the input root that's displayed on the action's page.
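
Both of these consumers ultimately rely on the same primitive: testing whether a
given input root path is matched by the Bloom filter stored in a profile. The
sketch below illustrates that primitive in Go. It is only a sketch: the function
name, its parameters, and the double-hashing scheme are hypothetical and do not
describe the actual encoding of FSAC profiles. As an aside, with the sizing
mentioned earlier (roughly 14 bits per path for a 0.1% false positive rate),
standard Bloom filter theory suggests using about ten hash functions.

```go
// Illustration only: probing a Bloom filter based file system access
// profile. Names and the hashing scheme are hypothetical; the real
// format is defined by the FSAC's Protobuf messages.
package main

import (
	"fmt"
	"hash/fnv"
)

// pathMaybeAccessed returns true if the path is possibly contained in
// the Bloom filter. False positives are possible, false negatives are
// not.
func pathMaybeAccessed(filter []byte, hashFunctions uint32, path string) bool {
	bits := uint64(len(filter)) * 8
	if bits == 0 {
		return false
	}
	// Derive two base hashes from the path and combine them to obtain
	// the probe positions (Kirsch-Mitzenmacher double hashing).
	h := fnv.New64a()
	h.Write([]byte(path))
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31
	for i := uint64(0); i < uint64(hashFunctions); i++ {
		bit := (h1 + i*h2) % bits
		if filter[bit/8]&(1<<(bit%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// An all-zero filter matches no paths at all.
	fmt.Println(pathMaybeAccessed(make([]byte, 16), 10, "external/sdk/bin/compiler"))
}
```

A false positive in such a lookup merely causes bb\_worker to prefetch a file
that ends up unused, or bb\_browser to mark an unused file as accessed, which
matches the tradeoffs described at the start of this ADR.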
122 | 
123 | # Future work
124 | 
125 | Another interesting use case for the profiles stored in the FSAC is that
126 | they could be used to perform weaker lookups against the Action Cache.
127 | This would allow us to skip execution of actions in cases where we know
128 | that changes have only been made to files that are not going to be
129 | accessed anyway.
130 | 
--------------------------------------------------------------------------------
/0011-rendezvous-hashing.md:
--------------------------------------------------------------------------------
1 | # Buildbarn Architecture Decision Record #11: Rendezvous Hashing
2 | 
3 | Author: Benjamin Ingberg<br/>
4 | Date: 2025-04-08
5 | 
6 | # Context
7 | 
8 | Resharding a Buildbarn cluster, that is, changing the number, order, or weight
9 | of its shards, is a very disruptive process today. It essentially shuffles all
10 | the blobs around to new shards, forcing you either to drop all old state or to
11 | keep a period with a read fallback configuration where the old state can be fetched.
12 | 
13 | Buildbarn supports using drained backends for sharding, and even adding an unused
14 | placeholder backend at the end of its sharding array, to mitigate this situation.
15 | Resharding is, however, a rare operation, which means that without a significant
16 | amount of automation there is a significant risk of it being performed incorrectly.
17 | 
18 | This ADR describes how we can make resharding more accessible by changing the
19 | underlying algorithm to one that is stable across resharding operations.
20 | 
21 | # Issues During Resharding
22 | 
23 | You might not have the ability to spin up a secondary set of storage shards to
24 | perform a read fallback configuration over. This is a common situation in an
25 | on-prem environment, where running two copies of your production environment may
26 | not be feasible.
27 | 
28 | This is not necessarily a blocker. You can reuse shards from the old topology
29 | in your new topology. However, this has a risk of significantly reducing your
30 | retention time, since data must be stored according to the addressing schema of
31 | both the new and the old topology simultaneously.
32 | 
33 | While it is possible to reduce the amount of address space that is resharded
34 | by using drained backends, this requires advance planning.
35 | 
36 | # Improvements
37 | 
38 | In this ADR we attempt to improve the resharding experience by minimizing the
39 | difference between different sharding topologies.
40 | 
41 | ## Better Overlap Between Sharding Topologies
42 | 
43 | Currently, two different sharding topologies, even if they share nodes, will
44 | have only a small overlap between their addressing schemas. This can be
45 | significantly improved by using a different sharding algorithm.
46 | 
47 | For this purpose, we replace the implementation of
48 | `ShardingBlobAccessConfiguration` with one that uses [Rendezvous
49 | hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
50 | is a lightweight and stateless technique for distributed hash tables. It has
51 | low overhead and causes minimal disruption during resharding.
52 | 
53 | Sharding with Rendezvous hashing gives us the following properties:
54 | * Removing a shard is _guaranteed_ to only require resharding for the blobs
55 |   that resolved to the removed shard.
56 | * Adding a shard will reshard any blob to the new shard with a probability of
57 |   `weight/total_weight`.
58 | 
59 | This effectively means adding or removing a shard triggers a predictable,
60 | minimal amount of resharding, eliminating the need for drained backends.
61 | 
62 | ```proto
63 | message ShardingBlobAccessConfiguration {
64 |   message Shard {
65 |     // unchanged
66 |     BlobAccessConfiguration backend = 1;
67 |     // unchanged
68 |     uint32 weight = 2;
69 |   }
70 |   // NEW:
71 |   // Shards are identified by a key within the context of this sharding
72 |   // configuration. The key is a freeform string which describes the identity
73 |   // of the shard in the context of the current sharding configuration.
74 |   // Shards are chosen via Rendezvous hashing based on the digest, weight and
75 |   // key of the configuration.
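  // Concretely, each shard is scored as -weight / log(h), where h is a
  // hash of the blob digest and the shard key mapped into (0, 1), and
  // the blob is placed on the shard with the highest score (see the
  // 'Implementation' section below).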
76 |   //
77 |   // When removing a shard from the map it is guaranteed that only blobs
78 |   // which resolved to the removed shard will get a different shard. When
79 |   // adding shards there is a weight/total_weight probability that any given
80 |   // blob will be resolved to the new shard.
81 |   map<string, Shard> shards = 2;
82 |   // NEW:
83 |   message Legacy {
84 |     // Order of the shards for the legacy schema. Each key here refers to
85 |     // a corresponding key in the 'shards' map, or null for drained backends.
86 |     repeated string shard_order = 1;
87 |     // Hash initialization seed used for the legacy schema.
88 |     uint64 hash_initialization = 2;
89 |   }
90 |   // NEW:
91 |   // A temporary legacy mode which allows clients to use storage backends which
92 |   // are sharded with the old sharding topology implementation. Consumers are
93 |   // expected to migrate in a timely fashion; support for the legacy schema
94 |   // will be removed by 2025-12-31.
95 |   Legacy legacy = 3;
96 | }
97 | ```
98 | 
99 | ## Migration Instructions
100 | 
101 | If you are fine with dropping your data and repopulating your cache, you can
102 | start using the new sharding algorithm right away. For clarity, we include an
103 | example of how to perform a non-destructive migration.
104 | 
105 | ### Configuration before Migration
106 | 
107 | Given the following configuration showing an unmirrored sharding topology of two
108 | shards:
109 | 
110 | ```jsonnet
111 | {
112 |   ...
113 |   contentAddressableStorage: {
114 |     sharding: {
115 |       hashInitialization: 11946695773637837490,
116 |       shards: [
117 |         {
118 |           backend: {
119 |             grpc: {
120 |               address: 'storage-0:8980',
121 |             },
122 |           },
123 |           weight: 1,
124 |         },
125 |         {
126 |           backend: {
127 |             grpc: {
128 |               address: 'storage-1:8980',
129 |             },
130 |           },
131 |           weight: 1,
132 |         },
133 |       ],
134 |     },
135 |   },
136 |   ...
137 | }
138 | ```
139 | 
140 | ### Equivalent Configuration in New Format
141 | 
142 | We first want to non-destructively represent the exact same topology with the
143 | new configuration. The legacy parameter used here makes Buildbarn internally
144 | use the old sharding algorithm; the keys can be arbitrary string values.
145 | 
146 | ```jsonnet
147 | {
148 |   ...
149 |   contentAddressableStorage: {
150 |     sharding: {
151 |       shards: {
152 |         "0": {
153 |           backend: { grpc: { address: 'storage-0:8980' } },
154 |           weight: 1,
155 |         },
156 |         "1": {
157 |           backend: { grpc: { address: 'storage-1:8980' } },
158 |           weight: 1,
159 |         },
160 |       },
161 |       legacy: {
162 |         shardOrder: ["0", "1"],
163 |         hashInitialization: 11946695773637837490,
164 |       },
165 |     },
166 |   },
167 |   ...
168 | }
169 | ```
170 | 
171 | ### Intermediate State with Fallback (Optional)
172 | 
173 | If you want to preserve cache content during the migration, you can add this
174 | intermediate step, which uses the new addressing schema as its primary
175 | configuration but falls back to the old addressing schema for missing blobs.
176 | 
177 | If you are willing to drop the cache, this step can be skipped.
178 | 
179 | ```jsonnet
180 | {
181 |   ...
182 |   contentAddressableStorage: {
183 |     readFallback: {
184 |       primary: {
185 |         sharding: {
186 |           shards: {
187 |             "0": {
188 |               backend: { grpc: { address: 'storage-0:8980' } },
189 |               weight: 1,
190 |             },
191 |             "1": {
192 |               backend: { grpc: { address: 'storage-1:8980' } },
193 |               weight: 1,
194 |             },
195 |           },
196 |         },
197 |       },
198 |       secondary: {
199 |         sharding: {
200 |           shards: {
201 |             "0": {
202 |               backend: { grpc: { address: 'storage-0:8980' } },
203 |               weight: 1,
204 |             },
205 |             "1": {
206 |               backend: { grpc: { address: 'storage-1:8980' } },
207 |               weight: 1,
208 |             },
209 |           },
210 |           legacy: {
211 |             shardOrder: ["0", "1"],
212 |             hashInitialization: 11946695773637837490,
213 |           },
214 |         },
215 |       },
216 |       replicator: { ... }
217 |     },
218 |   },
219 |   ...
220 | }
221 | ```
222 | 
223 | ### Final Configuration
224 | 
225 | When the cluster has used the read fallback configuration for an acceptable
226 | amount of time, you can drop the fallback and keep only the new configuration.
227 | 
228 | ```jsonnet
229 | {
230 |   ...
231 |   contentAddressableStorage: {
232 |     sharding: {
233 |       shards: {
234 |         "0": {
235 |           backend: { grpc: { address: 'storage-0:8980' } },
236 |           weight: 1,
237 |         },
238 |         "1": {
239 |           backend: { grpc: { address: 'storage-1:8980' } },
240 |           weight: 1,
241 |         },
242 |       },
243 |     },
244 |   },
245 |   ...
246 | }
247 | ```
248 | 
249 | ## Implementation
250 | 
251 | Rendezvous hashing was chosen for its simplicity of implementation, its
252 | mathematical elegance, and the fact that it needs no parameterization in configuration.
253 | 
254 | > Of all shards, pick the one with the largest $-\frac{weight}{\log(h)}$, where `h` is a
255 | > hash of the key/node pair mapped to (0, 1).
256 | 
257 | The implementation uses an optimized fixed-point integer calculation. Checking a
258 | single shard takes approximately 5 ns on a Ryzen 7 7840U. Running integration
259 | benchmarks on large `FindMissingBlobs` calls shows that the performance is
260 | within variance until we reach the 10k+ shard mark.
261 | 
262 | ```
263 | goos: linux
264 | goarch: amd64
265 | cpu: AMD Ryzen 7 7840U w/ Radeon 780M Graphics
266 | BenchmarkSharding10-16       17619    677008 ns/op
267 | BenchmarkSharding100-16       6465   1636342 ns/op
268 | BenchmarkSharding1000-16      1488   8842389 ns/op
269 | BenchmarkSharding10000-16      783  18426834 ns/op
270 | BenchmarkLegacy10-16         16959    704753 ns/op
271 | BenchmarkLegacy100-16         6562   1574817 ns/op
272 | BenchmarkLegacy1000-16        1558   8045423 ns/op
273 | BenchmarkLegacy10000-16        958  12146523 ns/op
274 | ```
275 | 
276 | ## Other Algorithms
277 | 
278 | There are some minor advantages and disadvantages to the other algorithms, in
279 | particular with regard to how computationally heavy they are. However, as noted
280 | above, the performance characteristics are not relevant for clusters below 10k
281 | shards.
282 | 
283 | Other algorithms considered were [Consistent
284 | hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
285 | [Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).
286 | 
287 | ### Consistent Hashing
288 | 
289 | Consistent Hashing can be seen as a special case of Rendezvous hashing, with the
290 | advantage that it can be precomputed into a sorted list, in which we can binary
291 | search for the target shard in `log(W)` time. This is in contrast to Rendezvous
292 | hashing, which takes `N` time to search.
293 | 
294 | * `N` is the number of shards
295 | * `W` is the sum of weights in its reduced form (i.e. smallest weight is 1 and
296 |   all weights are integers)
297 | 
298 | There are some concrete disadvantages to Consistent Hashing. A major one is that
299 | it is only unbiased on average; any given sharding configuration with Consistent
300 | Hashing will have a bias towards one of the shards.
301 | 
302 | The performance advantages of Consistent Hashing are also unlikely to
303 | materialize in practice. A straight linear search with Rendezvous hashing is often
304 | faster than a binary search for reasonable N. Rendezvous hashing is also independent
305 | of the weight of the individual shards, which improves its relative performance
306 | for shards that are not evenly weighted.
307 | 
308 | ### Maglev
309 | 
310 | Maglev, in turn, uses an iterative method that precomputes an evenly distributed
311 | lookup table of an arbitrary size. After the precomputation step, the target
312 | shard is found by a simple lookup of the key modulo the size of the table.
313 | 
314 | The disadvantage of Maglev is that it aims for minimal resharding rather than
315 | optimal resharding. When removing a shard from the set of all shards, the
316 | algorithm tolerates a small amount of unnecessary resharding. This is on the
317 | order of a few percent, depending on the number of shards and the size of the
318 | lookup table.
319 | 
320 | Precomputing the lookup table is also a fairly expensive operation, which impacts
321 | the startup time of components or requires managing the table state externally.
322 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Buildbarn Architecture Decision Records
2 | 
3 | This repository contains a set of
4 | [Architecture Decision Records](https://github.com/joelparkerhenderson/architecture_decision_record)
5 | that apply to Buildbarn. The idea is that every time we make a
6 | large-scale change to Buildbarn, we add an article to this repository
7 | that explains the reasoning behind it.
8 | 
9 | These articles don't act as user documentation, but at least explain the
10 | reasoning that went into designing some of Buildbarn's features. They
11 | are also not updated over time, meaning that each of the articles is
12 | written in the context of what Buildbarn looked like at the time.
13 | 
--------------------------------------------------------------------------------
/images/0005/before.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/buildbarn/bb-adrs/bf2066633e1712a3ef7c295d37cd52e65867391d/images/0005/before.png
--------------------------------------------------------------------------------
/images/0005/persistent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/buildbarn/bb-adrs/bf2066633e1712a3ef7c295d37cd52e65867391d/images/0005/persistent.png
--------------------------------------------------------------------------------
/images/0005/volatile.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/buildbarn/bb-adrs/bf2066633e1712a3ef7c295d37cd52e65867391d/images/0005/volatile.png
--------------------------------------------------------------------------------