├── 0001-buffer.md ├── 0002-storage.md ├── 0003-cas-decomposition.md ├── 0004-icas.md ├── 0005-local-blob-access-persistency.md ├── 0006-operation-logging-and-monetary-resource-usage.md ├── 0007-nested-invocations.md ├── 0009-nfsv4.md ├── 0010-file-system-access-cache.md ├── 0011-rendezvous-hashing.md ├── README.md └── images └── 0005 ├── before.png ├── persistent.png └── volatile.png /0001-buffer.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #1: Buffer layer 2 | 3 | Author: Ed Schouten
4 | Date: 2020-01-09 5 | 6 | # Context 7 | 8 | The `BlobAccess` interface that Buildbarn currently uses to abstract 9 | away different kinds of backing stores for the CAS and AC (Redis, S3, 10 | gRPC, etc.) is a bit simplistic, in that contents are always transferred 11 | through `io.ReadCloser` handles. This is causing a couple of problems: 12 | 13 | - There is an unnecessary amount of copying of data in memory. In likely 14 | the worst case (`bb_storage` with a gRPC AC storage backend), data may 15 | be converted from ActionResult Protobuf message → byte slice → 16 | `io.ReadCloser` → byte slice → ActionResult Protobuf message. 17 | 18 | - It makes implementing a replicating `BlobAccess` harder, as 19 | `io.ReadCloser`s may not be duplicated. A replicating `BlobAccess` 20 | could manually copy blobs into a byte slice itself, but this would 21 | only contribute to more unnecessary copying of data. 22 | 23 | - Tracking I/O completion and hooking into errors is not done 24 | accurately and consistently. `MetricsBlobAccess` currently only counts 25 | the amount of time spent in `Get()`, which won't always include the 26 | time actually spent transferring data. `ReadCachingBlobAccess` is 27 | capable of responding to errors returned by `Get()`, but not ones that 28 | occur during the actual transfer. A generic mechanism for hooking into 29 | I/O errors and completion is absent. 30 | 31 | - To implement an efficient FUSE file system for `bb_worker`, we need to 32 | support efficient random access I/O, as userspace applications are 33 | capable of accessing files at random. This is not supported by 34 | `BlobAccess`. An alternative would be to not let a FUSE file system 35 | use `BlobAccess` directly, but this would lead to reduced operational 36 | flexibility. 37 | 38 | - Data integrity checking (checksumming) is currently done through 39 | `MerkleBlobAccess`. To prevent foot-shooting, this `BlobAccess` 40 | decorator is injected into the configuration automatically. The 41 | problem is that `MerkleBlobAccess` is currently only inserted at one 42 | specific level of the system. When Buildbarn is configured to use a 43 | Redis remote data store in combination with local on-disk caching, 44 | there is no way to enable checksum validation for both Redis → local 45 | cache and local cache → consumption. 46 | 47 | - Stretch goal: There is no framework in place to easily implement 48 | decomposition of larger blobs into smaller ones (e.g., using 49 | [VSO hashing](https://github.com/microsoft/BuildXL/blob/master/Documentation/Specs/PagedHash.md)). 50 | Supporting this would allow us to use data stores optimized for small 51 | blobs exclusively (e.g., Redis). 52 | 53 | # Decision 54 | 55 | The decision is to add a new abstraction to Buildbarn, called the buffer 56 | layer, stored in Go package 57 | `github.com/buildbarn/bb-storage/pkg/blobstore/buffer`. Below is a 58 | massively simplified version of what the API will look like: 59 | 60 | ```go 61 | func NewBufferFromActionResult(*remoteexecution.ActionResult) Buffer {} 62 | func NewBufferFromByteSlice([]byte) Buffer {} 63 | func NewBufferFromReader(io.ReadCloser) Buffer {} 64 | 65 | type Buffer interface { 66 | ToActionResult() *remoteexecution.ActionResult 67 | ToByteSlice() []byte 68 | ToReader() io.ReadCloser 69 | } 70 | ``` 71 | 72 | It can be thought of as a union/variant type that automatically does 73 | conversions from one format to another, but only when strictly 74 | necessary. 
Calling one of the `To*()` functions extracts the data from 75 | the buffer, thereby destroying it. 76 | 77 | To facilitate support for replication, the `Buffer` interface may 78 | contain a `Clone()` function. For buffers created from a byte slice, 79 | this function may be a no-op, causing the underlying slice to be shared. 80 | For buffers created from other kinds of sources, the implementation may 81 | be more complex (e.g., converting it to a byte slice on the spot). 82 | 83 | To track I/O completion and to support retrying, every buffer may have a 84 | series of `ErrorHandler`s associated with it: 85 | 86 | ```go 87 | type ErrorHandler interface { 88 | OnError(err error) (Buffer, error) 89 | Done() 90 | } 91 | ``` 92 | 93 | During a transmission, a buffer may call `OnError()` whenever it runs 94 | into an I/O error, asking the `ErrorHandler` to either capture the error 95 | (as done by `MetricsBlobAccess`), substitute the error (as done by 96 | `ExistencePreconditionBlobAccess`) or to continue the transmission using 97 | a different buffer. The latter may be a copy of the same data obtained 98 | from another source (high availability). 99 | 100 | To facilitate fast random access I/O, the `Buffer` interface may 101 | implement `io.ReaderAt` to extract just a part of the data. It is likely 102 | the case that only a subset of the buffer types are capable of 103 | implementing this efficiently, but that is to be expected. 104 | 105 | Data integrity checking could be achieved by having special flavors of 106 | the buffer construction functions: 107 | 108 | ```go 109 | func NewACBufferFromReader(io.ReadCloser, RepairStrategy) Buffer {} 110 | func NewCASBufferFromReader(*util.Digest, io.ReadCloser, RepairStrategy) Buffer {} 111 | ``` 112 | 113 | Buffers created through these functions may enforce that their contents 114 | are valid prior to returning them to their consumer. When detecting 115 | inconsistencies, the provided `RepairStrategy` may contain a callback 116 | that the storage backend can use to repair or delete the inconsistent 117 | object. 118 | 119 | Support for decomposing large blobs into smaller ones and recombining 120 | them may be realized by adding more functions to the `Buffer` interface 121 | or by adding decorator types. The exact details of that are outside the 122 | scope of this ADR. 123 | -------------------------------------------------------------------------------- /0002-storage.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #2: Towards fast and reliable storage 2 | 3 | Author: Ed Schouten
4 | Date: 2020-01-09 5 | 6 | # Context 7 | 8 | Initial versions of Buildbarn as published on GitHub made use of a 9 | combination of Redis and S3 to hold the contents of the Action Cache 10 | (AC) and Content Addressable Storage (CAS). Due to their performance 11 | characteristics, Redis was used to store small objects, while S3 was 12 | used to store large ones (i.e., objects in excess of 2 MiB). This 13 | partitioning was facilitated by the SizeDistinguishingBlobAccess storage 14 | class. 15 | 16 | Current versions of Buildbarn still support Redis and S3, but the 17 | example deployments use bb\_storage with the "circular" storage backend 18 | instead. Though there were valid reasons for this change at the time, 19 | these reasons were not adequately communicated to the community, which 20 | is why users are still tempted to use Redis and S3 up until this day. 21 | 22 | The goal of this document is as follows: 23 | 24 | - To describe the requirements that Bazel and Buildbarn have on storage. 25 | - To present a comparative overview of existing storage backends. 26 | - To propose changes to Buildbarn's storage architecture that allow us 27 | to construct a storage stack that satisfies the needs for most large 28 | users of Buildbarn. 29 | 30 | # Storage requirements of Bazel and Buildbarn 31 | 32 | In the common case, Bazel performs the following operations against 33 | Buildbarn when executing a build action: 34 | 35 | 1. ActionCache.GetActionResult() is called to check whether a build 36 | action has already been executed previously. This call extracts an 37 | ActionResult message from the AC. If such a message is found, Bazel 38 | continues with step 5. 39 | 2. Bazel constructs a [Merkle tree](https://en.wikipedia.org/wiki/Merkle_tree) 40 | of Action, Command and Directory messages and associated input files. 41 | It then calls ContentAddressableStorage.FindMissingBlobs() to 42 | determine which parts of the Merkle tree are not present in the CAS. 43 | 3. Any missing nodes of the Merkle tree are uploaded into the CAS using 44 | ByteStream.Write(). 45 | 4. Execution of the build action is triggered through 46 | Execution.Execute(). Upon successful completion, this function 47 | returns an ActionResult message. 48 | 5. Bazel downloads all of the output files referenced by the 49 | ActionResult message from the CAS to local disk using 50 | ByteStream.Read(). 51 | 52 | By letting Bazel download all output files from the CAS to local disk, 53 | there is a guarantee that Bazel is capable of making forward progress. 54 | If any of the objects referenced by the build action were to disappear 55 | during these steps, the RPCs will either fail with code `NOT_FOUND` or 56 | `FAILED_PRECONDITION`. This instructs Bazel to reupload inputs from 57 | local disk and retry. Bazel therefore never has to perform any 58 | backtracking on its build graph, an assumption that is part of its 59 | design. 60 | 61 | This implementation places relatively weak requirements on Buildbarn in 62 | terms of consistency and persistence. Given a sufficient amount of 63 | storage space, Buildbarn will generally make sure that input files don't 64 | disappear between ContentAddressableStorage.FindMissingBlobs() and the 65 | completion of Execution.Execute(). It will also generally make sure that 66 | output files remain present at least long enough for Bazel to download 67 | them. Violating these weak requirements only affects performance; not 68 | reliability. 
69 | 70 | Bazel recently gained a feature called 71 | ['Builds without the Bytes'](https://github.com/bazelbuild/bazel/issues/6862). 72 | By enabling this feature using the `--remote_download_minimal` command 73 | line flag, Bazel will no longer attempt to download output files to 74 | local disk. This feature causes a significant drop in build times and 75 | network bandwidth consumed. This is especially noticeable for workloads 76 | that yield large output files. Buildbarn should attempt to support those 77 | workloads. 78 | 79 | Even with 'Builds without the Bytes' enabled, Bazel assumes it is always 80 | capable of making forward progress. This strengthens the storage 81 | requirements on Buildbarn as follows: 82 | 83 | 1. Bazel no longer downloads output files referenced by 84 | ActionResult messages, but may use them as inputs for other build 85 | actions at any later point during the build. This means that 86 | ActionCache.GetActionResult() and Execution.Execute() may never return 87 | ActionResult messages that refer to objects that are either not 88 | present or not guaranteed to remain present during the remainder of 89 | the build. 90 | 2. Technically speaking, it only makes sense for Bazel to call 91 | ContentAddressableStorage.FindMissingBlobs() against objects that 92 | Bazel is capable of uploading. With 'Builds without the Bytes' 93 | enabled, this set no longer has to contain output files of dependent 94 | build actions. However, this specific aspect is not implemented by 95 | Bazel. It still passes in the full set. Even worse: returning 96 | these objects as absent tricks Bazel into uploading files that are 97 | not present locally, causing a failure in the Bazel client. 98 | 99 | To tackle the first requirement, Buildbarn recently gained a 100 | CompletenessCheckingBlobAccess decorator that causes bb\_storage to only 101 | return ActionResult entries from the AC in case all output files are 102 | present in the CAS. Presence is checked by calling 103 | BlobAccess.FindMissing() against the CAS, which happens to be the same 104 | function used to implement ContentAddressableStorage.FindMissingBlobs(). 105 | 106 | This places strong requirements on the behaviour of 107 | BlobAccess.FindMissing(). To satisfy the first requirement, 108 | BlobAccess.FindMissing() may not under-report the absence of objects. To 109 | satisfy the second requirement, it may not over-report. In other words, 110 | it has to be *exactly right*. 111 | 112 | Furthermore, BlobAccess.FindMissing() now has an additional 113 | responsibility. Instead of only reporting absence, it now has to touch 114 | objects that are present, ensuring that they don't get evicted during 115 | the remainder of the build. Because Buildbarn itself has no notion of 116 | full build processes (just individual build actions), this generally 117 | means that Buildbarn needs to hold on to the data as long as possible. 118 | This implies that underlying storage needs to use an LRU or pseudo-LRU 119 | eviction policy. 120 | 121 | # Comparison of existing storage backends 122 | 123 | Now that we've had an overview of our storage requirements, let's take a 124 | look at the existing offering of Buildbarn storage backends. In addition 125 | to consistency requirements, we take maintenance aspects into account. 126 | 127 | ## Redis 128 | 129 | Advantages: 130 | 131 | - Lightweight, bandwidth efficient protocol. 132 | - An efficient server implementation exists. 
133 | - Major Cloud providers offer managed versions: Amazon ElastiCache, 134 | Google Cloud Memorystore, Azure Cache for Redis, etc. 135 | - Pipelining permits efficiently implementing FindMissing() by sending a 136 | series of `EXISTS` requests. 137 | - [Supports LRU caching.](https://redis.io/topics/lru-cache) 138 | 139 | Disadvantages: 140 | 141 | - The network protocol isn't multiplexed, meaning that clients need to 142 | open an excessive number of network sockets, sometimes up to dozens 143 | of sockets per worker, per storage backend. Transfers of large objects 144 | (files) may hold up requests for small objects (Action, Command and 145 | Directory messages). 146 | - Redis is not designed to store large objects. The maximum permitted 147 | object size is 512 MiB. If Buildbarn is used to generate installation 148 | media (e.g., DVD images), it is desirable to generate artifacts that 149 | are multiple gigabytes in size. 150 | - The client library that is currently being used by Buildbarn, 151 | [go-redis](https://github.com/go-redis/redis), does not use the 152 | [context](https://golang.org/pkg/context/) package, meaning that 153 | handling of cancelation, timeouts and retries is inconsistent with how 154 | the rest of Buildbarn works. This may cause unnecessary propagation of 155 | transient error conditions and exhaustion of connection pools. Of 156 | [all known client library implementations](https://redis.io/clients#go), 157 | only [redispipe](https://github.com/joomcode/redispipe) uses context. 158 | - go-redis does not support streaming of objects, meaning that large 159 | objects need to at some point be stored in memory contiguously, which 160 | causes heavy fluctuations in memory usage. Only the unmaintained 161 | [shipwire/redis](https://github.com/shipwire/redis) library has an 162 | architecture that would support streaming. 163 | 164 | Additional disadvantages of Redis setups that have replication enabled, 165 | include: 166 | 167 | - They violate consistency requirements. A `PUT` operation does not 168 | block until the object is replicated to read replicas, meaning that a 169 | successive `GET` may fail. Restarts of read replicas may cause objects 170 | to be reported as absent, even though they are present on the master. 171 | - It has been observed that performance of replicated setups decreases 172 | heavily under high eviction rates. 173 | 174 | ## Amazon S3, Google Cloud Storage, Microsoft Azure Blob, etc. 175 | 176 | Advantages: 177 | 178 | - Virtually free of maintenance. 179 | - Decent SLA. 180 | - Relatively inexpensive. 181 | 182 | Disadvantages: 183 | 184 | - Amazon S3 is an order of magnitude slower than Amazon ElastiCache 185 | (Redis). 186 | - Amazon S3 does not provide the right consistency model, as 187 | [it only guarantees eventual consistency](https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel) 188 | for our access pattern. 189 | [Google Cloud Storage offers stronger consistency.](https://cloud.google.com/blog/products/gcp/how-google-cloud-storage-offers-strongly-consistent-object-listing-thanks-to-spanner) 190 | - There are no facilities for performing bulk existence queries, meaning 191 | it is hard to efficiently implement FindMissing(). 192 | - Setting TTLs on objects will make a bucket behave like a FIFO cache, 193 | which is incompatible with 'Builds without the Bytes'. 194 | Turning this into an LRU cache requires triggering `COPY` requests 195 | upon access. 
Again, due to the lack of bulk queries, calls like 196 | FindMissing() will become even slower. 197 | 198 | ## bb\_storage with the "circular" storage backend 199 | 200 | Advantages: 201 | 202 | - By using gRPC, handling of cancelation, timeouts and retries is either 203 | straightforward or dealt with automatically. 204 | - [gRPC has excellent support for OpenTracing](https://github.com/grpc-ecosystem/grpc-opentracing), 205 | meaning that it will be possible to trace requests all the way from 206 | the user to the storage backend. 207 | - gRPC has excellent support for multiplexing requests, meaning the 208 | number of network sockets used is reduced dramatically. 209 | - By using gRPC all across the board, there is rich and non-ambiguous 210 | error message propagation. 211 | - bb\_storage exposes decent Prometheus metrics. 212 | - The "circular" storage backend provides high storage density. All 213 | objects are concatenated into a single file without any form of 214 | padding. 215 | - The "circular" storage backend has a hard limit on the amount of 216 | storage space it uses, meaning it is very unlikely storage nodes will 217 | run out of memory or disk space, even under high load. 218 | - The performance of the "circular" storage backend doesn't noticeably 219 | degrade over time. The data file doesn't become fragmented. The hash 220 | table that is used to look up objects uses a scheme that is inspired 221 | by [cuckoo hashing](https://en.wikipedia.org/wiki/Cuckoo_hashing). It 222 | prefers displacing older entries over newer ones. This makes the hash 223 | table self-cleaning. 224 | 225 | Disadvantages: 226 | 227 | - The "circular" storage backend implements a FIFO eviction scheme, 228 | which is incompatible with 'Builds without the Bytes'. 229 | - The "circular" storage backend is known to have some design bugs that 230 | may cause it to return corrupted data when reading blobs close to the 231 | write cursors. 232 | - When running Buildbarn in the Cloud, a setup like this has more 233 | administrative overhead than Redis and S3. Scheduling bb\_storage on 234 | top of Kubernetes using stateful sets and persistent volume claims may 235 | require administrators to periodically deal with stuck pods and 236 | corrupted volumes. Though there are likely various means to automate 237 | these steps (e.g., by running recovery cron jobs), it is not free in 238 | terms of effort. 239 | 240 | ## bb\_storage with the "inMemory" storage backend 241 | 242 | Advantages: 243 | 244 | - The eviction policy can be set to FIFO, LRU or random eviction, though 245 | LRU should be preferred for centralized storage. 246 | - Due to there being no persistent state, it is possible to run 247 | bb\_storage through a simple Kubernetes deployment, meaning long 248 | periods of downtime are unlikely. 249 | - Given a sufficient amount of memory, it deals well with objects that 250 | are gigabytes in size. Both the client and server are capable of 251 | streaming results, meaning memory usage is predictable. 252 | 253 | In addition to the above, it shares the gRPC-related advantages of 254 | bb\_storage with the "circular" storage backend. 255 | 256 | Disadvantages: 257 | 258 | - There is no persistent storage, meaning all data is lost upon crashes 259 | and power failure. This will cause running builds to fail. 260 | - By placing all objects in separate memory allocations, it puts a lot 261 | of pressure on the garbage collector provided by the Go runtime. 
For 262 | worker-level caches that are relatively small, this is acceptable. For 263 | centralized caching, this becomes problematic. Because Go does not 264 | ship with a defragmenting garbage collector, heap fragmentation grows 265 | over time. It seems impossible to set storage limits anywhere close to 266 | the available amount of RAM. 267 | 268 | # A new storage stack for Buildbarn 269 | 270 | Based on the comparison above, replicated Redis, S3 and bb\_storage with 271 | the "circular" storage backend should **not** be used for Buildbarn's 272 | centralized storage. They do not provide the right consistency 273 | guarantees to satisfy Bazel. That said, there may still be valid use 274 | cases for these backends in the larger Buildbarn ecosystem. For example, 275 | S3 may be a good fit for archiving historical build actions, so that 276 | they may remain accessible through bb\_browser. Such use cases are out 277 | of scope for this document. 278 | 279 | This means that only non-replicated Redis and bb\_storage with the 280 | "inMemory" storage backend remain viable from a consistency standpoint, 281 | though both of these also have enough practical disadvantages that they 282 | cannot be used (no support for large objects and excessive heap 283 | fragmentation, respectively). 284 | 285 | This calls for the creation of a new storage stack for Buildbarn. Let's 286 | first discuss what such a stack could look like for simple, single-node 287 | setups. 288 | 289 | ## Single-node storage 290 | 291 | For single-node storage, let's start off with designing a new storage 292 | backend named "local" that is initially aimed at replacing the 293 | "inMemory" and later on the "circular" backends. Instead of coming up 294 | with a brand-new design for such a storage backend, let's reuse the 295 | overall architecture of "circular", but fix some of the design flaws it 296 | had: 297 | 298 | - **Preventing data corruption:** Data corruption in the "circular" 299 | backend stems from the fact that we overwrite old blobs. Those old 300 | blobs may still be in the process of being downloaded by a user. An 301 | easy way to prevent data corruption is thus to stop overwriting 302 | existing data. Instead, we can let it keep track of its data by 303 | storing it in a small set of blocks (e.g., 10 blocks that are 1/10th 304 | the size of total storage). Whenever all blocks become full, the 305 | oldest one is discarded and replaced by a brand new block. By 306 | implementing reference counting (or relying on the garbage collector), 307 | requests for old data may complete without experiencing any data 308 | corruption. 309 | - **Pseudo-LRU:** Whereas "circular" uses FIFO eviction, we can let our 310 | new backend provide LRU-like eviction by copying blobs from older 311 | blocks to newer ones upon access. To both provide adequate performance 312 | and reduce redundancy, this should only be performed on blobs coming 313 | from blocks that are beyond a certain age. By only applying this 314 | scheme to the oldest 1/4 of data, only up to 1/3 of data may be stored 315 | twice. 316 | - **Preventing 'tidal waves':** The "circular" backend would be prone to 317 | 'tidal waves' of writes. When objects disappear from storage, it is 318 | highly probable that many related files disappear at the same time. 
319 | With the pseudo-LRU design, the same thing applies to refreshing 320 | blobs: when BlobAccess.FindMissing() needs to refresh a single file 321 | that is part of an SDK, it will likely need to refresh the entire SDK. 322 | We can amortize the cost of this process by smearing writes across 323 | multiple blocks, so that they do not need to be refreshed at the same 324 | time. 325 | 326 | Other Buildbarn components (e.g., bb\_worker) communicate to the 327 | single-node storage server by using the following blobstore 328 | configuration: 329 | 330 | ```jsonnet 331 | { 332 | local config = { grpc: { address: 'bb-storage.example.com:12345' } }, 333 | contentAddressableStorage: config, 334 | actionCache: config, 335 | } 336 | ``` 337 | 338 | ## Adding scalability 339 | 340 | For larger setups it may be desired to store more data in cache than 341 | fits in memory of a single system. For these setups it is suggested that 342 | the already existing ShardingBlobAccess is used: 343 | 344 | ```jsonnet 345 | { 346 | local config = { 347 | sharding: { 348 | hashInitialization: 3151213777095999397, // A random 64-bit number. 349 | shards: [ 350 | { 351 | backend: { grpc: { address: 'bb-storage-0.example.com:12345' } }, 352 | weight: 1, 353 | }, 354 | { 355 | backend: { grpc: { address: 'bb-storage-1.example.com:12345' } }, 356 | weight: 1, 357 | }, 358 | { 359 | backend: { grpc: { address: 'bb-storage-2.example.com:12345' } }, 360 | weight: 1, 361 | }, 362 | ], 363 | }, 364 | }, 365 | contentAddressableStorage: config, 366 | actionCache: config, 367 | } 368 | ``` 369 | 370 | In the example above, the keyspace is divided across three storage 371 | backends that will each receive approximately 33% of traffic. 372 | 373 | ## Adding fault tolerance 374 | 375 | The setup described above is able to recover from failures rapidly, due 376 | to all bb\_storage processes being stateful, but non-persistent. When an 377 | instance becomes unavailable, a system like Kubernetes will be able to 378 | quickly spin up a replacement, allowing new builds to take place once 379 | again. Still, there are two problems: 380 | 381 | - Builds running at the time of the failure all have to terminate, as 382 | results from previous build actions may have disappeared. 383 | - Due to the random nature of object digests, the loss of a single shard 384 | likely means that most cached build actions end up being incomplete, 385 | due to them depending on output files and logs stored across different 386 | shards. 387 | 388 | To mitigate this, we can introduce a new BlobAccess decorator named 389 | MirroredBlobAccess that supports a basic replication strategy between 390 | pairs of servers, similar to [RAID 1](https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_1). 391 | When combined with ShardingBlobAccess, it allows the creation of a 392 | storage stack that roughly resembles [RAID 10](https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10_(RAID_1+0)). 393 | 394 | - BlobAccess.Get() operations may be tried against both backends. If it 395 | is detected that only one of the backends possesses a copy of the 396 | object, it is replicated on the spot. 397 | - BlobAccess.Put() operations are performed against both backends. 398 | - BlobAccess.FindMissing() operations are performed against both 399 | backends. Any inconsistencies in the results of both backends are 400 | resolved by replicating objects in both directions. The intersection of 401 | the results is then returned. 
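To make the intended FindMissing() behaviour concrete, below is a minimal sketch of the merge logic described above. The `blobAccess` and `digestSet` types are simplified stand-ins for illustration only; they are not the actual bb-storage interfaces.

```go
package mirrored

import "context"

// digestSet is a simplified stand-in for a set of object digests.
type digestSet map[string]struct{}

// blobAccess is a simplified stand-in for the subset of BlobAccess used here.
type blobAccess interface {
	FindMissing(ctx context.Context, digests digestSet) (digestSet, error)
	Get(ctx context.Context, digest string) ([]byte, error)
	Put(ctx context.Context, digest string, data []byte) error
}

// replicate copies a single object from one backend to the other.
func replicate(ctx context.Context, from, to blobAccess, digest string) error {
	data, err := from.Get(ctx, digest)
	if err != nil {
		return err
	}
	return to.Put(ctx, digest, data)
}

// findMissingMirrored queries both halves of a mirrored pair, repairs
// inconsistencies by replicating objects in both directions, and only
// reports objects as missing when they are absent from both backends.
func findMissingMirrored(ctx context.Context, backendA, backendB blobAccess, digests digestSet) (digestSet, error) {
	missingFromA, err := backendA.FindMissing(ctx, digests)
	if err != nil {
		return nil, err
	}
	missingFromB, err := backendB.FindMissing(ctx, digests)
	if err != nil {
		return nil, err
	}
	missing := digestSet{}
	for digest := range missingFromA {
		if _, alsoMissingFromB := missingFromB[digest]; alsoMissingFromB {
			// Absent from both backends: truly missing.
			missing[digest] = struct{}{}
			continue
		}
		// Only absent from backend A: repair by copying B -> A.
		if err := replicate(ctx, backendB, backendA, digest); err != nil {
			return nil, err
		}
	}
	for digest := range missingFromB {
		if _, alsoMissingFromA := missingFromA[digest]; !alsoMissingFromA {
			// Only absent from backend B: repair by copying A -> B.
			if err := replicate(ctx, backendA, backendB, digest); err != nil {
				return nil, err
			}
		}
	}
	return missing, nil
}
```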
402 | 403 | The BlobAccess configuration would look something like this: 404 | 405 | ```jsonnet 406 | { 407 | local config = { 408 | sharding: { 409 | hashInitialization: 3151213777095999397, // A random 64-bit number. 410 | shards: [ 411 | { 412 | backend: { 413 | mirrored: { 414 | backendA: { grpc: { address: 'bb-storage-a0.example.com:12345' } }, 415 | backendB: { grpc: { address: 'bb-storage-b0.example.com:12345' } }, 416 | }, 417 | }, 418 | weight: 1, 419 | }, 420 | { 421 | backend: { 422 | mirrored: { 423 | backendA: { grpc: { address: 'bb-storage-a1.example.com:12345' } }, 424 | backendB: { grpc: { address: 'bb-storage-b1.example.com:12345' } }, 425 | }, 426 | }, 427 | weight: 1, 428 | }, 429 | { 430 | backend: { 431 | mirrored: { 432 | backendA: { grpc: { address: 'bb-storage-a2.example.com:12345' } }, 433 | backendB: { grpc: { address: 'bb-storage-b2.example.com:12345' } }, 434 | }, 435 | }, 436 | weight: 1, 437 | }, 438 | ], 439 | }, 440 | }, 441 | contentAddressableStorage: config, 442 | actionCache: config, 443 | } 444 | ``` 445 | 446 | With these semantics in place, it is completely safe to replace one half 447 | of each pair of servers with an empty instance without causing builds to 448 | fail. This also permits rolling upgrades without serious loss of data by 449 | only upgrading half of the fleet during every release cycle. Between 450 | cycles, one half of systems is capable of repopulating the other half. 451 | 452 | This strategy may even be used to grow the storage pool without large 453 | amounts of downtime. By changing the order of ShardingBlobAccess and 454 | MirroredBlobAccess, it is possible to temporarily turn this setup into 455 | something that resembles [RAID 01](https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_01_(RAID_0+1)), 456 | allowing the pool to be resharded in two separate stages. 457 | 458 | To implement this efficiently, a minor extension needs to be made to 459 | Buildbarn's buffer layer. To implement MirroredBlobAccess.Put(), buffer 460 | objects need to be cloned. The existing Buffer.Clone() realises this by 461 | creating a full copy of the buffer, so that writing to each of the 462 | backends can take place at its own pace. For large objects copying may 463 | be expensive, which is why Buffer.Clone() should be replaced by one 464 | flavour that copies and one that multiplexes the underlying stream. 465 | 466 | # Alternatives considered 467 | 468 | Instead of building our own storage solution, we considered switching to 469 | distributed database systems, such as 470 | [CockroachDB](https://www.cockroachlabs.com) and 471 | [FoundationDB](https://www.foundationdb.org). Though they will solve all 472 | consistency problems we're experiencing, they are by no means designed 473 | for use cases that are as bandwidth intensive as ours. These systems are 474 | designed to only process up to about a hundred megabytes of data per 475 | second. They are also not designed to serve as caches, meaning separate 476 | garbage collector processes need to be run periodically. 477 | 478 | Data stores that guarantee eventual consistency can be augmented to 479 | provide the required consistency requirements by placing an index in 480 | front of them that is fully consistent. Experiments where a Redis-based 481 | index was placed in front of S3 proved successful, but were by no means 482 | fault-tolerant. 
483 | 484 | # Future work 485 | 486 | For Buildbarn setups that are not powered on permanently or where 487 | RAM-backed storage is simply too expensive, there may still be a desire 488 | to have disk-backed storage for the CAS and AC. We should extend the new 489 | "local" storage backend to support on-disk storage, just like "circular" 490 | does. 491 | 492 | To keep the implementation simple and understandable, MirroredBlobAccess 493 | will initially be written in such a way that it requires both backends 494 | to be available. This is acceptable, because unavailable backends can 495 | easily be replaced by new non-persistent nodes. For persistent setups, 496 | this may not be desired. MirroredBlobAccess would need to be extended to 497 | work properly in degraded environments. 498 | 499 | Many copies of large output files with only minor changes to them cause 500 | cache thrashing. This may be prevented by decomposing large output files 501 | into smaller chunks, so that deduplication on the common parts may be 502 | performed. Switching from SHA-256 to a recursive hashing scheme, such as 503 | [VSO-Hash](https://github.com/microsoft/BuildXL/blob/master/Documentation/Specs/PagedHash.md), 504 | makes it possible to implement this while retaining the Merkle tree 505 | property. 506 | -------------------------------------------------------------------------------- /0003-cas-decomposition.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #3: Decomposition of large CAS objects 2 | 3 | Author: Ed Schouten
4 | Date: 2020-04-16 5 | 6 | # Context 7 | 8 | [The Remote Execution protocol](https://github.com/bazelbuild/remote-apis/blob/master/build/bazel/remote/execution/v2/remote_execution.proto) 9 | that is implemented by Buildbarn defines two data stores: the Action 10 | Cache (AC) and the Content Addressable Storage (CAS). The CAS is where 11 | almost all data is stored in terms of size (99%+), as it is used to 12 | store input files and output files of build actions, but also various 13 | kinds of serialized Protobuf messages that define the structure of build 14 | actions. 15 | 16 | As the name implies, the CAS is [content addressed](https://en.wikipedia.org/wiki/Content-addressable_storage), 17 | meaning that all objects are identified by their content, in this case a 18 | cryptographic checksum (typically [SHA-256](https://en.wikipedia.org/wiki/SHA-2)). 19 | This permits Buildbarn to automatically deduplicate identical objects 20 | and discard objects that are corrupted. 21 | 22 | Whereas most objects stored in the AC are all similar in size 23 | (kilobytes), sizes of objects stored in the CAS may vary a lot. Small 24 | [Directory](https://github.com/bazelbuild/remote-apis/blob/b5123b1bb2853393c7b9aa43236db924d7e32d61/build/bazel/remote/execution/v2/remote_execution.proto#L673-L685) 25 | objects are hundreds of bytes in size, while container layers created 26 | with [rules\_docker](https://github.com/bazelbuild/rules_docker) may 27 | consume a gigabyte of space. This is observed to be problematic for 28 | Buildbarn setups that use sharded storage. Each shard will receive a 29 | comparable number of requests, but the amount of network traffic and CPU 30 | load to which those requests translate fluctuates. 31 | 32 | Buildbarn has code in place to checksum validate data coming from 33 | untrusted sources. For example, bb\_worker validates all data received 34 | from bb\_storage. bb\_storage validates all data obtained from clients 35 | and data read from disk. This is done for security reasons, but also to 36 | ensure that bugs in storage code don't lead to execution of malformed 37 | build actions. Unfortunately, there are an increasing number of use 38 | cases where partial reads need to be performed (lazy-loading file 39 | systems for bb\_worker, Bazel's ability to resume aborted transfers, 40 | etc.). Such partial reads currently still cause entire objects to be 41 | read, simply to satisfy the checksum validation process. This is 42 | wasteful. 43 | 44 | To be able to address these issues, this ADR proposes the addition of 45 | some new features to Buildbarn's codebase to eliminate the 46 | existence of such large CAS objects altogether. 47 | 48 | # Methods for decomposing large objects 49 | 50 | In a nutshell, the idea is to partition files whose size exceeds a 51 | certain threshold into smaller blocks. Each of the blocks will have a 52 | fixed size, except the last block in a file, which may be smaller. The 53 | question then becomes what naming scheme is used to store them in the 54 | CAS. 55 | 56 | The simplest naming scheme would be to give all blocks a suffix that 57 | indicates the offset in the file. For example, a file that is decomposed 58 | into three blocks could use digests containing hashes `<hash>-0`, 59 | `<hash>-1` and `<hash>-2`, where `<hash>` refers to the original file 60 | hash. Calls to `BlobAccess.Get()`, `Put()` and `FindMissing()` could be 61 | implemented to expand to a series of digests containing the block number 62 | suffix. 
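As an illustration, digest expansion under this naive scheme could look something like the hypothetical helper below; it is not part of the actual proposal.

```go
package naive

import "fmt"

// blockDigests expands the hash of a large file into the per-block digests
// used by this naive naming scheme. Purely illustrative; the next section
// explains why the scheme is insufficient.
func blockDigests(fileHash string, sizeBytes, blockSizeBytes int64) []string {
	blockCount := (sizeBytes + blockSizeBytes - 1) / blockSizeBytes
	digests := make([]string, 0, blockCount)
	for i := int64(0); i < blockCount; i++ {
		digests = append(digests, fmt.Sprintf("%s-%d", fileHash, i))
	}
	return digests
}
```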
63 | 64 | Unfortunately, this approach is too simplistic. Because `<hash>` refers 65 | to the checksum of the file as a whole, there is no way the consistency 66 | of individual blocks can be validated. If blocks are partitioned across 67 | shards, there is no way an individual shard can verify the integrity of 68 | its own data. 69 | 70 | A solution would be to store each of the blocks using their native 71 | digest, meaning a hash of each of them is computed. This on its own, 72 | however, would only allow storing data. There would be no way to look up 73 | blocks afterwards, as only the combined hash is known. To be able to 74 | retrieve the data, we'd need to store a separate manifest that contains a 75 | list of the digests of all blocks in the file. 76 | 77 | Even though such an approach would allow us to validate the integrity of 78 | individual blocks of data, conventional hashing algorithms wouldn't 79 | allow us to verify manifests themselves. There is no way SHA-256 hashes 80 | of individual blocks may be combined into a SHA-256 hash of the file as 81 | a whole. SHA-256's [Merkle-Damgård construction](https://en.wikipedia.org/wiki/Merkle–Damgård_construction) 82 | does not allow it. 83 | 84 | # VSO-Hash: Paged File Hash 85 | 86 | As part of their [BuildXL infrastructure](https://github.com/microsoft/BuildXL), 87 | Microsoft published a hashing algorithm called ["VSO-Hash"](https://github.com/microsoft/BuildXL/blob/master/Documentation/Specs/PagedHash.md) 88 | or "Paged File Hash". VSO-Hash works by first computing SHA-256 hashes 89 | for every 64 KiB page of data. Then, 32 page hashes (for 2 MiB of data) 90 | are taken together and hashed using SHA-256 to form a block hash. 91 | Finally, all block hashes are combined by [folding](https://en.wikipedia.org/wiki/Fold_(higher-order_function)) 92 | them. 93 | 94 | By letting clients use VSO-Hash as a digest function, Buildbarn could be 95 | altered to decompose blobs either into 64 KiB or 2 MiB blocks, or both 96 | (i.e., decomposing into 64 KiB chunks, but having two levels of 97 | manifests). This would turn large files into [Merkle trees](https://en.wikipedia.org/wiki/Merkle_tree) 98 | of depth two or three. Pages, blocks and their manifests could all be 99 | stored in the CAS and validated trivially. 100 | 101 | Some work has already been done in the Bazel ecosystem to 102 | facilitate this. For example, the Remote Execution protocol already 103 | contains an enumeration value for VSO-Hash, which was 104 | [added by Microsoft in 2019](https://github.com/bazelbuild/remote-apis/pull/84). 105 | There are, however, various reasons not to use VSO-Hash: 106 | 107 | - Microsoft seemingly only added support for VSO-Hash as a migration 108 | aid. There are efforts to migrate BuildXL from VSO-Hash to plain 109 | SHA-256. The author has mentioned that support for VSO-Hash will likely 110 | be retracted from the protocol once this migration is complete. 111 | - Bazel itself doesn't implement it. 112 | - It is somewhat inefficient. Because the full SHA-256 algorithm is used 113 | at every level, the Merkle-Damgård construction is used excessively. 114 | The algorithm could have been simpler by defining its scheme directly 115 | on top of the SHA-256 compression function. 116 | - Algorithms for validating files as a whole, (manifests of) pages and 117 | (manifests of) blocks are all different. 
Each of these needs a 118 | separate implementation, as well as a distinct digest format, so that 119 | storage nodes know which validation algorithm to use upon access. 120 | - It effectively leaks the decomposition strategy into the client. 121 | VSO-Hash only allows decomposition at the 64 KiB and 2 MiB level. 122 | Decomposition at any other power of two would require the use of an 123 | entirely different hashing algorithm. 124 | 125 | # BLAKE3 126 | 127 | At around the time this document was written, a new hashing algorithm 128 | was published, called [BLAKE3](https://github.com/BLAKE3-team/BLAKE3/). 129 | As the name implies, BLAKE3 is a successor of BLAKE, an algorithm that 130 | took part in the [SHA-3 competition](https://en.wikipedia.org/wiki/NIST_hash_function_competition). 131 | In the end BLAKE lost to Keccak, not because it was in any way insecure, 132 | but because it was too similar to SHA-2. The goal of the SHA-3 133 | competition was to have a backup in case fundamental security issues in 134 | SHA-2 are found. 135 | 136 | An interesting improvement of BLAKE3 over its predecessor BLAKE2 is 137 | that it no longer uses the Merkle-Damgård construction. Instead, 138 | chaining is only used for chunks up to 1 KiB in size. When hashing 139 | objects that are larger than 1 KiB, a binary Merkle tree of 1 KiB chunks 140 | is constructed. The final hash value is derived from the Merkle tree's 141 | root node. If clients were to use BLAKE3, Buildbarn would thus be able 142 | to decompose files into blocks that are any power of two, at least 1 KiB 143 | in size. 144 | 145 | BLAKE3 uses an Extendable-Output Function (XOF) to post-process the 146 | state of the hasher into an output sequence of arbitrary length. Because 147 | this process invokes the BLAKE3 compression function, it is not 148 | reversible. This means that manifests cannot contain hashes of its 149 | blocks, as that would not allow independent integrity checking of 150 | the manifest. Instead, manifest entries should hold the state of the 151 | hasher (i.e., the raw Merkle tree node). That way it's possible both to 152 | validate and convert to a block digest. 153 | 154 | For chunk nodes (that represent 1 KiB of data or less), the hasher state 155 | can be stored in 97 bytes (256 bit initialization vector, 512 bit message, 156 | 7 bit size, 1 bit chunk-start flag). For parent nodes (that 157 | represent more than 1 KiB of data), the hasher state is only 64 bytes 158 | (512 bit message), as other parameters of the compression function are 159 | implied. Because of that, decomposition of files into 1 KiB blocks 160 | should be discouraged. Buildbarn should only support decomposition into 161 | blocks that are at least 2 KiB in size. That way manifests only contain 162 | 64 byte entries, except for the last entry, which may be 64 or 97 bytes 163 | in size. 164 | 165 | # BLAKE3ZCC: BLAKE3 with a Zero Chunk Counter 166 | 167 | The BLAKE3 compression function takes a `t` argument that is for two 168 | different purposes. First of all, it is used as a counter for every 512 169 | bits of hash output generated by the XOF. Secondly, it contains the 170 | Chunk Counter when compressing input data. Effectively, the Chunk 171 | Counter causes every 1 KiB chunk of data to be hashed in a different way 172 | depending on the offset at which it is stored within the file. 173 | 174 | For our specific use case, it is not desirable that hashing of data is 175 | offset dependent. 
It would require that decomposed blocks contain 176 | additional metadata that specify at which offset the data was stored in 177 | the original file. Otherwise, there would be no way to validate the 178 | integrity of the block independently. It also rules out the possibility 179 | of deduplicating large sections of repetitive data (e.g., deduplicating 180 | a large file that contains only null bytes to just a single chunk). 181 | 182 | According to section 7.5 of the BLAKE3 specification, the Chunk Counter 183 | is not strictly necessary for security, but discourages optimizations 184 | that would introduce timing attacks. Though timing attacks are a serious 185 | problem, we can argue that in the case of the Remote Execution protocol 186 | such timing attacks already exist. For example, identical files stored 187 | in a single Directory hierarchy will only be uploaded to/downloaded from 188 | the CAS once. 189 | 190 | For this reason, this ADR proposes adding support for a custom hashing 191 | algorithm to Buildbarn, which we will call BLAKE3ZCC. BLAKE3ZCC is 192 | identical to regular BLAKE3, except that it uses a Zero Chunk Counter. 193 | BLAKE3ZCC generates exactly the same hash output as BLAKE3 for files 194 | that are 1 KiB or less in size, as those files fit in a single chunk. 195 | 196 | # Changes to the Remote Execution protocol 197 | 198 | All of the existing hashing algorithms supported by the Remote Execution 199 | protocol have a different hash size. Buildbarn makes use of this 200 | assumption, meaning that if it encounters a [Digest](https://github.com/bazelbuild/remote-apis/blob/b5123b1bb2853393c7b9aa43236db924d7e32d61/build/bazel/remote/execution/v2/remote_execution.proto#L779-L786) 201 | message whose `hash` value is 64 characters in size, it assumes it 202 | refers to an object that is SHA-256 hashed. Adding support for BLAKE3ZCC 203 | would break this assumption. BLAKE3 hashes may have any size, which 204 | makes matching by length impossible. The default hash length of 256 bits 205 | would also collide with SHA-256. 206 | 207 | To solve this, we could extend Digest to make it explicit which hashing 208 | algorithm is used. The existing `string hash = 1` field could be 209 | replaced with the following: 210 | 211 | ```protobuf 212 | message Digest { 213 | oneof hash { 214 | // Used by all of the existing hashing algorithms. 215 | string other = 1; 216 | // Used to address BLAKE3ZCC hashed files, or individual blocks in 217 | // case the file has been decomposed. 218 | bytes blake3zcc = 3; 219 | // Used to address manifests of BLAKE3ZCC hashed files, containing 220 | // Merkle tree nodes of BLAKE3ZCC hashed blocks. Only to be used by 221 | // Buildbarn, as Bazel will only request files as a whole. 222 | bytes blake3zcc_manifest = 4; 223 | } 224 | ... 225 | } 226 | ``` 227 | 228 | By using type `bytes` for these new fields instead of storing a base16 229 | encoded hash in a `string`, we cut down the size of Digest objects by 230 | almost 50%. This causes a significant space reduction for Action, 231 | Directory and Tree objects. 232 | 233 | Unfortunately, `oneof` cannot be used here, because Protobuf 234 | implementations such as [go-protobuf](https://github.com/golang/protobuf/issues/395) 235 | don't guarantee that fields are serialized in tag order when `oneof` 236 | fields are used. This property is required by the Remote Execution 237 | protocol. 
To work around this, we simply declare three separate fields, 238 | where implementations should ensure only one field is set. 239 | 240 | ```protobuf 241 | message Digest { 242 | string hash_other = 1; 243 | bytes hash_blake3zcc = 3; 244 | bytes hash_blake3zcc_manifest = 4; 245 | ... 246 | } 247 | ``` 248 | 249 | Digests are also encoded into pathname strings used by the ByteStream 250 | API. To distinguish BLAKE3ZCC files and manifests from other hashing 251 | algorithms, `B3Z:` and `B3ZM:` prefixes are added to the base16 encoded 252 | hash values, respectively. 253 | 254 | In addition to that, a `BLAKE3ZCC` constant is added to the 255 | DigestFunction enum, so that Buildbarn can announce support for 256 | BLAKE3ZCC to clients. 257 | 258 | # Changes to Buildbarn 259 | 260 | Even though BLAKE3 only came out recently, 261 | [one library for computing BLAKE3 hashes in Go exists](https://github.com/lukechampine/blake3). 262 | Though this library is of good quality, it cannot be used to compute 263 | BLAKE3ZCC hashes without making local code changes. Furthermore, this 264 | library only provides public interfaces for converting a byte stream to 265 | a hash value. In our case we need separate interfaces for converting 266 | chunks to Merkle tree nodes, computing the root node for a larger Merkle 267 | tree and obtaining Merkle tree nodes to hash values. Without such 268 | features, we'd be unable to generate and parse manifests. We will 269 | therefore design our own BLAKE3ZCC hashing library, which we will 270 | package at `github.com/buildbarn/bb-storage/pkg/digest/blake3zcc`. 271 | 272 | Buildbarn has an internal `util.Digest` data type that extends upon the 273 | Remote Execution Digest message by storing the instance name, thereby 274 | making it a fully qualified identifier of an object. It also has many 275 | operations that allow computing, deriving and transforming them. Because 276 | supporting BLAKE3ZCC and decomposition makes this type more complex, it 277 | should first be moved out of `pkg/util` into its own `pkg/digest` 278 | package. 279 | 280 | In addition to being extended to support BLAKE3ZCC hashing, the Digest 281 | data type will gain a new method: 282 | 283 | ```go 284 | func (d Digest) ToManifest(blockSizeBytes int64) (manifestDigest Digest, parser ManifestParser, ok bool) {} 285 | ``` 286 | 287 | When called on a digest object that uses BLAKE3ZCC that may be 288 | decomposed into multiple blocks, this function returns the digest of its 289 | manifest. In addition to that, it returns a ManifestParser object that 290 | may be used to extract block digests from manifest payloads or construct 291 | them. ManifestParser will look like this: 292 | 293 | ```go 294 | type ManifestParser interface { 295 | // Perform lookups of blocks on existing manifests. 296 | GetBlockDigest(manifest []byte, off int64) (blockDigest Digest, actualOffset int64) 297 | // Construct new manifests. 298 | AppendBlockDigest(manifest *[]byte, block []byte) Digest 299 | } 300 | ``` 301 | 302 | One implementation of ManifestParser for BLAKE3ZCC shall be provided. 303 | 304 | The `Digest.ToManifest()` function and its resulting ManifestParser 305 | shall be used by a new BlobAccess decorator, called 306 | DecomposingBlobAccess. The operations of this decorator shall be 307 | implemented as follows: 308 | 309 | - `BlobAccess.Get()` will simply forward the call if the provided digest 310 | does not correspond to a blob that can be decomposed. 
Otherwise, it 311 | will load the associated manifest from storage. It will then return a 312 | Buffer object that dynamically loads individual blocks from the CAS 313 | when accessed. 314 | - `BlobAccess.Put()` will simply forward the call if the provided digest 315 | does not correspond to a blob that can be decomposed. Otherwise, it 316 | will decompose the input buffer into smaller buffers for every block 317 | and write those into the CAS. In addition to that, it will write a 318 | manifest object into the CAS. 319 | - `BlobAccess.FindMissing()` will forward the call, except that digests 320 | of composed objects will be translated to the digests of their 321 | manifests. Manifests that are present will then be loaded from the 322 | CAS, followed by checking the existence of each of the individual 323 | blocks. 324 | 325 | To be able to implement DecomposingBlobAccess, the Buffer layer will 326 | need two minor extensions: 327 | 328 | - `BlobAccess.Get()` needs to return a buffer that needs to be backed by 329 | the concatenation of a sequence of block buffers. A new buffer type 330 | that implements this functionality shall be added, which can be 331 | created by calling `NewCASConcatenatingBuffer()`. 332 | - `BlobAccess.Put()` needs to read the input buffer in chunks equal to 333 | the block size. The `Buffer.ToChunkReader()` function currently takes 334 | a maximum chunk size argument, but provides no facilities for 335 | specifying a lower bound. This mechanism should be extended, so that 336 | `ToChunkReader()` can be used to read input one block at a time. 337 | 338 | With all of these changes in place, Buildbarn will have basic support 339 | for decomposing large objects in place. 340 | 341 | # Future work 342 | 343 | With the features proposed above, support for decomposition should be 344 | complete enough to provide better spreading of load for sharded setups. 345 | In addition to that, workers with lazy-loading file systems will be able 346 | to perform I/O on large files without faulting them in entirely. There 347 | are still two areas where performance may be improved further: 348 | 349 | - Using BatchedStoreBlobAccess, workers currently call `FindMissing()` 350 | before uploading output files. This is done at the output file level, 351 | which is suboptimal. When decomposition is enabled, it causes workers 352 | to load manifests from storage, even though those could be derived 353 | locally. A single absent block will cause the entire output file to be 354 | re-uploaded. To prevent this, `BlobAccess.Put()` should likely be 355 | decomposed into two flavours: one for streaming uploads and one for 356 | uploads of local files. 357 | - When decomposition is enabled, `FindMissing()` will likely become 358 | slower, due to it requiring more roundtrips to storage. This may be 359 | solved by adding yet another BlobAccess decorator that does short-term 360 | caching of `FindMissing()` results. 361 | -------------------------------------------------------------------------------- /0004-icas.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #4: Indirect Content Addressable Storage 2 | 3 | Author: Ed Schouten
4 | Date: 2020-06-17 5 | 6 | # Context 7 | 8 | In an ideal build environment, build actions scheduled through the REv2 9 | protocol are fully isolated from the outside world. They either rely on 10 | data that is provided by the client (source files and resources declared 11 | in Bazel's `WORKSPACE` file), or use files that are outputs of previous 12 | build actions. In addition to improving reproducibility of work, it 13 | makes it easier for Buildbarn and the build client to distinguish 14 | actual build failures (compiler errors) from infrastructure failures. 15 | 16 | One pain point of this model has always been network bandwidth 17 | consumption. Buildbarn clusters can be run in data centers that have 18 | excellent network connectivity, while build clients may run on a laptop 19 | that uses public WiFi connectivity. These clients are still responsible 20 | for downloading external artifacts from their upstream location, 21 | followed by uploading them into Buildbarn's Content Addressable Storage 22 | (CAS). 23 | 24 | People within the Bazel community are working on solving this problem by 25 | adding a new [Remote Asset API](https://github.com/bazelbuild/remote-apis/blob/master/build/bazel/remote/asset/v1/remote_asset.proto). 26 | When used, Bazel is capable of sending RPCs to a dedicated service to 27 | download external artifacts, optionally extract them and upload them 28 | into the CAS. Bazel is then capable of passing the resulting objects by 29 | reference to build actions that consume them. 30 | 31 | Though it makes a lot of sense to fully ingest remote assets into the 32 | CAS in the general case, there are certain workloads where doing this is 33 | undesirable. If objects are already stored on a file share or cloud 34 | storage bucket close to a Buildbarn cluster, copying these objects into 35 | Buildbarn's centralized storage based on LocalBlobAccess may be 36 | wasteful. It may be smarter to let workers fetch these assets remotely 37 | only as part of the input root population. This ADR proposes changes to 38 | Buildbarn to facilitate this. 39 | 40 | # Changes proposed at the architectural level 41 | 42 | Ideally, we would like to extend the CAS to be able to store references 43 | to remote assets. A CAS object would then either contain a blob or a 44 | reference describing where the asset is actually located. When a worker 45 | attempts to read an object from the CAS and receives a reference, it 46 | retries the operation against the remote service. This is unfortunately 47 | hard to realise for a couple of reasons: 48 | 49 | - None of the Buildbarn code was designed with this in mind. 50 | [The Buffer interface](0001-buffer.md) assumes that every data store 51 | is homogeneous, in that all objects stored within have the same type. 52 | The CAS may only hold content addressed blobs, while the AC may only 53 | contain ActionResult messages. 54 | - The ByteStream and ContentAddressableStorage protocols have no 55 | mechanisms for returning references. This is problematic, because we 56 | also use these protocols inside of our clusters. 57 | 58 | This is why we propose adding a separate data store that contains these 59 | references, called the Indirect Content Addressable Storage (ICAS). When 60 | attempting to load an object, a worker will first query the CAS, falling 61 | back to loading a reference from the ICAS and following it. References 62 | are stored in the form of Protobuf messages. 
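As a rough illustration, such a reference message could look something like the sketch below. The actual schema is left to the implementation and depends on the kinds of backing stores that need to be supported; all field names here are hypothetical.

```protobuf
// Hypothetical sketch of an ICAS reference; not the actual schema.
message Reference {
  // Location from which the object can be fetched. Additional media
  // (e.g., cloud storage buckets or file shares) could be added as
  // extra fields.
  string http_url = 1;

  // Range of the remote resource that holds the object, in case it is
  // embedded in a larger file.
  int64 offset_bytes = 2;
  int64 size_bytes = 3;
}
```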
63 | 64 | In terms of semantics, the ICAS can be seen as a mixture between the CAS 65 | and the Action Cache (AC). Like the AC, it holds Protobuf messages. Like 66 | the CAS, references expand to data that is content addressed and needs 67 | checksum validation. It also needs to support efficient bulk existence 68 | queries (i.e., `FindMissingBlobs()`). 69 | 70 | # Changes proposed at the implementation level 71 | 72 | Because the ICAS is similar to both the CAS and AC, we should integrate 73 | support for it into bb-storage's `pkg/blobstore`. This would allow us to 74 | reuse most of the existing storage backends, including LocalBlobAccess, 75 | MirroredBlobAccess and ShardingBlobAccess. This requires two major 76 | changes: 77 | 78 | - The Buffer interface has only been designed with the CAS and the AC as 79 | data sources and sinks in mind. In order to support ICAS references as 80 | sources, the `NewACBuffer*()` functions will be replaced with generic 81 | `NewProtoBuffer*()` functions that take arbitrary Protobuf messages. 82 | To support ICAS references as sinks, the `Buffer.ToActionResult()` 83 | method will be replaced with the generic method `Buffer.ToProto()`. 84 | - All of the code in `pkg/blobstore/configuration` is already fairly 85 | messy because it needs to distinguish between the CAS and AC. This is 86 | only going to get worse once ICAS support is added. Work will be 87 | performed to let the generic configuration code call into 88 | Blob{Access,Replicator}Creator interfaces. All of the existing backend 89 | specific code will be moved into {AC,CAS}Blob{Access,Replicator}Creator 90 | implementations. 91 | 92 | With those changes in place, it should become easier to add support for 93 | arbitrary storage backends, not just the ICAS. To add support for the 94 | ICAS, the following classes will be added: 95 | 96 | - ICASStorageType, which will be used by most generic BlobAccess 97 | implementations to construct ICAS specific Buffer objects. 98 | - ICASBlob{Access,Replicator}Creator, which will be used to construct 99 | ICAS specific BlobAccess instances from Jsonnet configuration. 100 | - ICASBlobAccess and IndirectContentAddressableStorageServer, which act 101 | as gRPC clients and servers for the ICAS protocol. This protocol will 102 | be similar to REv2's ActionCache and ContentAddressableStorage. 103 | 104 | The above will only bring in support for the ICAS, but won't add any 105 | logic to let workers fall back to reading ICAS entries when objects are 106 | absent from the CAS. To realise that, two new implementations of 107 | BlobAccess are added: 108 | 109 | - ReferenceExpandingBlobAccess, which effectively converts an ICAS 110 | BlobAccess to a CAS BlobAccess. It loads references from the ICAS, 111 | followed by loading the referenced object from its target location. 112 | - ReadFallbackBlobAccess, which forwards all requests to a primary 113 | backend, followed by sending requests to a secondary backend on 114 | absence. 115 | 116 | Below is an example blobstore configuration that can be used to let 117 | workers access both the CAS and ICAS: 118 | 119 | ```jsonnet 120 | { 121 |   contentAddressableStorage: { 122 |     readFallback: { 123 |       primary: { 124 |         // CAS backend goes here. 125 |         grpc: ..., 126 |       }, 127 |       secondary: { 128 |         referenceExpanding: { 129 |           // ICAS backend goes here. 130 |           grpc: ..., 131 |         }, 132 |       }, 133 |     }, 134 |   }, 135 |   actionCache: { 136 |     // AC backend goes here. 
137 | grpc: ..., 138 | }, 139 | } 140 | ``` 141 | 142 | # How do I integrate this into my environment? 143 | 144 | Though care has been taken to make this entire implementation as generic 145 | as possible, the inherent problem is that the process of expanding 146 | references stored in the ICAS is specific to the environment in which 147 | Buildbarn is deployed. We will make sure that the ICAS protocol and 148 | ReferenceExpandingBlobAccess have support for some commonly used 149 | protocols (e.g., HTTP), but it is not unlikely that operators of 150 | Buildbarn clusters need to maintain forks if they want to integrate this 151 | with their own infrastructure. This is likely unavoidable. 152 | 153 | Another open question is how entries are written into the ICAS. This is 154 | again specific to the environment. Users will likely need to implement 155 | their own gRPC service that implements the Remote Asset API, parsing 156 | environment specific URL schemes and storing the resulting references as ICAS entries. 157 | 158 | Unfortunately, Bazel does not implement the Remote Asset API at this 159 | point in time. As a workaround, one may create a custom `bb_worker`-like 160 | process that captures certain types of build actions. Instead of going 161 | through the regular execution process, this worker directly interprets 162 | these requests, treating them as remote asset requests. This can be 163 | achieved by implementing a custom BuildExecutor. 164 | 165 | # Future work 166 | 167 | When we added `Buffer.ToActionResult()`, we discovered that we no longer 168 | needed the ActionCache interface. Now that we have a general purpose 169 | `Buffer.ToProto()` method, it looks like the ContentAddressableStorage 170 | interface also becomes less useful. It may make sense to investigate 171 | what it takes to remove that interface as well. 172 | -------------------------------------------------------------------------------- /0005-local-blob-access-persistency.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #5: Persistency for LocalBlobAccess 2 | 3 | Author: Ed Schouten<br/>
4 | Date: 2020-09-12 5 | 6 | # Context 7 | 8 | In addition to all of the storage backends that Buildbarn provides to 9 | access the contents of the Content Addressable Storage (CAS) and Action 10 | Cache (AC) remotely, there are two backends that store data locally: 11 | CircularBlobAccess and LocalBlobAccess. These two storage backends have 12 | various use cases: 13 | 14 | - They can be used to build fast centralized storage, especially when 15 | used in combination with MirroredBlobAccess and ShardingBlobAccess. 16 | See [ADR #2](0002-storage.md) for more details. 17 | - They can be used to set up local caches when used in combination with 18 | ReadCachingBlobAccess. Such caches can speed up access to the CAS in a 19 | given site or on an individual system. 20 | 21 | Buildbarn originally shipped with CircularBlobAccess (written in 2018). 22 | As part of the work on ADR #2, LocalBlobAccess was added. 23 | LocalBlobAccess has a couple of advantages over CircularBlobAccess: 24 | 25 | - When sized appropriately, it is capable of supporting Bazel's 26 | [Builds without the Bytes](https://docs.google.com/document/d/11m5AkWjigMgo9wplqB8zTdDcHoMLEFOSH0MdBNCBYOE/edit). 27 | - It doesn't suffer from the data corruption bugs that 28 | CircularBlobAccess has. 29 | - It can store its data on raw block devices, which tends to be a lot 30 | faster than using files on a file system. It also ensures that the 31 | maximum amount of space used is bounded. 32 | - Block devices are memory mapped, making reads of blobs that are 33 | already loaded from disk nearly instantaneous. 34 | - It supports random offset reads, while CircularBlobAccess can only 35 | read data sequentially. 36 | 37 | There are, however, a couple of disadvantages to using LocalBlobAccess: 38 | 39 | - It only supports raw block devices. This can be problematic in case 40 | Buildbarn is deployed in environments where raw block device access is 41 | unsupported (e.g., inside unprivileged containers). 42 | - It doesn't persist data across restarts, as the digest-location map, 43 | the hash table that indexes data, is only stored in memory. 44 | 45 | Let's work towards bringing these missing features to LocalBlobAccess, 46 | so that we can finally remove CircularBlobAccess from the tree. 47 | 48 | # Supporting file based storage 49 | 50 | A naïve way of adding file based storage to LocalBlobAccess is to 51 | provide a new implementation of the BlockAllocator type that is backed 52 | by a filesystem.Directory. Each of the blocks created through the 53 | NewBlock() function could be backed by an individual file. The downside 54 | of this approach is that releasing blocks can now fail, as it's not 55 | guaranteed that we can always successfully delete files on disk. 56 | 57 | Instead, let's generalize the code in [bb-storage/pkg/blockdevice](https://pkg.go.dev/github.com/buildbarn/bb-storage/pkg/blockdevice) 58 | to support both raw block devices and fixed-size files. Because we only 59 | care about supporting fixed-size files, we can reuse the existing memory 60 | mapping logic. 61 | 62 | More concretely, let's replace MemoryMapBlockDevice() with two separate 63 | NewBlockDeviceFrom{Device,File}() functions, mapping raw block devices 64 | and regular files, respectively. Let's also introduce a generic 65 | configuration message, so that any component in Buildbarn that uses 66 | block devices supports both modes of access. 
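Sketched below is roughly what the generalized API could look like. The exact signatures are assumptions rather than a description of the final code; only the existence of the two constructors and of a handle that supports reads, writes and syncing (renamed to BlockDevice in the refactoring section later in this document) follows from this ADR.

```go
package blockdevice

import "io"

// BlockDevice is a handle to a memory mapped regular file or raw
// block device. Sync() flushes written data to stable storage.
type BlockDevice interface {
	io.ReaderAt
	io.WriterAt
	Sync() error
}

// NewBlockDeviceFromDevice memory maps an existing raw block device.
func NewBlockDeviceFromDevice(path string) (BlockDevice, error) {
	// Implementation omitted from this sketch.
	return nil, nil
}

// NewBlockDeviceFromFile memory maps a regular file, creating it and
// growing it to the requested fixed size if needed.
func NewBlockDeviceFromFile(path string, sizeBytes int64) (BlockDevice, error) {
	// Implementation omitted from this sketch.
	return nil, nil
}
```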
67 | 68 | # Supporting persistency 69 | 70 | The first step to add persistency to LocalBlobAccess is to allow the 71 | digest-location map to be backed by a file. To achieve that, we can 72 | import PR [bb-storage#62](https://github.com/buildbarn/bb-storage/pull/62), 73 | which adds a FileBasedLocationRecordArray. While there, we should modify 74 | it to be built on top of bb-storage/pkg/blockdevice, so that it can use 75 | both raw block devices and files. Let's call this type 76 | BlockDeviceBackedLocationRecordArray instead. 77 | 78 | Storing the digest-location map on disk isn't sufficient to make 79 | LocalBlobAccess persistent. One crucial piece of information that is 80 | missing is the list of Block objects that LocalBlobAccess extracted from 81 | the underlying BlockAllocator, and the write cursors that 82 | LocalBlobAccess tracks for them. PR [bb-storage#63](https://github.com/buildbarn/bb-storage/pull/63) 83 | attempted to solve this by letting LocalBlobAccess write this state into 84 | a text file, so that it can be reloaded on startup. 85 | 86 | The main concern with this approach is that it increases the complexity 87 | of LocalBlobAccess, while that class is already bigger than it should 88 | be. It is hard to get decent unit testing coverage of LocalBlobAccess, 89 | and this change would make that even harder. Let's solve this by moving 90 | the bottom half that is responsible for managing Block objects and write 91 | cursors from LocalBlobAccess into a separate interface, called 92 | BlockList: 93 | 94 | ```go 95 | type BlockList interface { 96 | // Managing the blocks in the list. 97 | PopFront() 98 | PushBack() error 99 | // Reading data from blocks. 100 | Get(blockIndex int, ...) buffer.Buffer 101 | // Writing data to blocks. 102 | HasSpace(blockIndex int, sizeBytes int64) bool 103 | Put(blockIndex int, sizeBytes int64) BlockListPutWriter 104 | } 105 | ``` 106 | 107 | For non-persistent workloads, we can add a basic implementation called 108 | VolatileBlockList that maintains the existing behaviour. For persistent 109 | workloads, we can provide a more complex implementation named 110 | PersistentBlockList. 111 | 112 | With the changes in PR bb-storage#63 applied, LocalBlobAccess writes a 113 | text file to disk every time data is stored. First of all, we should 114 | consider using a Protobuf message for this, as this makes it easier to add 115 | structure and extend the format where needed. Second, we should only 116 | emit the state file periodically and asynchronously, so that I/O and 117 | lock contention remain minimal. Let's solve this by adding the following 118 | two interfaces: 119 | 120 | ```go 121 | // This interface needs to be implemented by PersistentBlockList, so 122 | // that its layout can be extracted. 123 | type PersistentStateSource interface { 124 | // Returns a channel on which the caller can wait for writes to 125 | // occur. This ensures that the system can remain fully idle if no 126 | // I/O takes place. 127 | GetBlockPutWakeup() <-chan struct{} 128 | // Returns the current layout of the PersistentBlockList. 129 | GetPersistentState() *pb.PersistentState 130 | ... 131 | } 132 | 133 | // An implementation of this interface could, for example, write the 134 | // Protobuf message to a file on disk.
135 | type PersistentStateStore interface { 136 | WritePersistentState(persistentState *pb.PersistentState) error 137 | } 138 | ``` 139 | 140 | A PeriodicSyncer helper type can provide a run loop that takes data from 141 | PersistentStateSource and passes it to PersistentStateStore. 142 | 143 | # Persistency with data integrity 144 | 145 | With the changes proposed above, we provide persistency, but don't 146 | guarantee integrity of data in case of unclean system shutdowns. Data 147 | corruption can happen when digest-location map entries are flushed to 148 | disk before all of the data corresponding to the blob is written. This may be 149 | acceptable when LocalBlobAccess is used to hold the CAS (because of 150 | checksum validation), but for the AC it could cause us to interpret 151 | invalid Protobuf data. 152 | 153 | One way to solve this issue is to call `fsync()` (or Linux's own 154 | `sync_file_range()`) between writes of data and digest-location map 155 | insertions, but doing so would severely degrade performance. To prevent the need 156 | for that, we can let a background job (i.e., the PeriodicSyncer that was 157 | described previously) call `fsync()` periodically. By incrementing a 158 | counter value before every `fsync()` call, and embedding its value into 159 | both digest-location map entries and the state file, we can accurately 160 | track which entries are valid and which ones potentially point to 161 | corrupted data. Let's call this counter the 'epoch ID'. References to 162 | blocks in digest-location map entries can now be expressed as follows: 163 | 164 | ```go 165 | type BlockReference struct { 166 | EpochID uint32 167 | // To be subtracted from the index of the last block used at the 168 | // provided epoch. 169 | BlocksFromLast uint16 170 | } 171 | ``` 172 | 173 | One issue with this approach is that once restarts take place and 174 | execution continues at the last persisted epoch ID, the digest-location 175 | map may still contain entries for future epoch IDs. These would need to 176 | be removed, as they would otherwise cause (potentially) corrupted blobs 177 | to become visible again by the time the epoch ID equals the value stored 178 | in the digest-location map entry. 179 | 180 | Removing these entries could be done explicitly by scrubbing the 181 | digest-location map during startup. The disadvantage is that this 182 | introduces a startup delay, whose duration is proportional to the size 183 | of the digest-location map. For larger setups where the digest-location 184 | map is tens of gigabytes in size, this is unacceptable. We should 185 | therefore aim to design a solution that doesn't require any scrubbing. 186 | 187 | To prevent the need for scrubbing, we can associate every epoch ID with 188 | a random number, and enforce that entries in the digest-location map are 189 | based on the same value. This makes it possible to distinguish valid 190 | entries from ones that were written prior to unclean system shutdown. 191 | Instead of storing this value literally as part of digest-location map 192 | entries, it could be used as a seed for a record checksum. This has the 193 | advantage of hardening the digest-location map against silent data 194 | corruption (e.g., bit flips). 195 | 196 | An approach like this does require us to store all these 'hash seeds' 197 | for every epoch with one or more valid blocks as part of the persistent 198 | state file. Fortunately, the number of epochs can only grow relatively 199 | slowly.
If LocalBlobAccess is configured to synchronize data to disk 200 | once every minute and at least some writes take place every minute, 201 | there will still only be 526k epochs in one year. This number can only 202 | be reached if not a single block rotation were to happen during this 203 | timeframe, which is extremely unlikely. There is therefore no need to 204 | put a limit on the maximum number of epochs for the time being. 205 | 206 | Higher levels of LocalBlobAccess code will not be able to operate on 207 | BlockReferences directly. Logic in both HashingDigestLocationMap and 208 | LocalBlobAccess depends on being able to compare block numbers along a 209 | total order. This is why we should make sure that BlockReference is only 210 | used at the storage level, namely inside LocationRecordArray 211 | implementations. To allow these types to look up hash seeds and convert 212 | BlockReferences to integer block indices, we can add the following 213 | helper type: 214 | 215 | ```go 216 | type BlockReferenceResolver interface { 217 | BlockReferenceToBlockIndex(blockReference BlockReference) (int, uint64, bool) 218 | BlockIndexToBlockReference(blockIndex int) (BlockReference, uint64) 219 | } 220 | ``` 221 | 222 | By letting BlockReferenceToBlockIndex() return a boolean value 223 | indicating whether the BlockReference is still valid, we can remove the 224 | existing LocationValidator type, which served a similar purpose. 225 | 226 | One thing that the solution above still lacks is a 227 | feedback mechanism between PeriodicSyncer and PersistentBlockList. If 228 | PeriodicSyncer can't keep up with the changes made to the 229 | PersistentBlockList, there is a chance that PersistentBlockList starts 230 | to recycle blocks that are still referenced by older epochs that are 231 | listed in the persistent state file. This could also cause data 232 | corruption after a restart. 233 | 234 | To address this, PersistentBlockList's PushBack() should be made to fail 235 | explicitly when blocks are recycled too quickly. To reduce the 236 | probability of running into this situation, PersistentBlockList should 237 | offer a GetBlockReleaseWakeup() method that allows PeriodicSyncer to update the 238 | persistent state file without any delay. A 239 | NotifyPersistentStateWritten() method should be added to allow 240 | PeriodicSyncer to notify the PersistentBlockList that it's safe to 241 | recycle blocks that were removed as part of the previous 242 | GetPersistentState() call. 243 | 244 | # Refactoring 245 | 246 | While we're making these drastic changes to LocalBlobAccess, let's spend 247 | some effort to make further cleanups: 248 | 249 | - With the addition of CompactDigest, almost all code of LocalBlobAccess 250 | has been turned into a generic key-value store. It no longer assumes 251 | keys are REv2 digests. Let's rename CompactDigest to Key, and 252 | DigestLocationMap to KeyLocationMap. 253 | 254 | - LocalBlobAccess can be decomposed even further. All of the code that 255 | implements the "old" vs. "current" vs. "new" block distinction 256 | effectively provides a location-to-blob map. Let's move all of this 257 | behind a LocationBlobMap interface, similar to how we have 258 | KeyLocationMap. 259 | 260 | - All of the code in LocalBlobAccess that ties the KeyLocationMap to a 261 | LocationBlobMap can easily be moved into its own helper type. Let's 262 | move that code behind a KeyBlobMap interface, as it simply provides a 263 | mapping from a Key to a blob.
264 | 265 | - What remains in LocalBlobAccess isn't much more than an implementation 266 | of BlobAccess that converts REv2 digests to Keys before calling into 267 | KeyBlobMap. The name LocalBlobAccess now makes little sense, as there 268 | is no requirement that data is stored locally. Let's rename this type 269 | to KeyBlobMapBackedBlobAccess. 270 | 271 | - Now that we have a BlockDeviceBackedLocationRecordArray, let's rename 272 | PartitioningBlockAllocator to BlockDeviceBackedBlockAllocator for 273 | consistency. 274 | 275 | - The handles returned by bb-storage/pkg/blockdevice now need to provide 276 | a Sync() function to flush data to storage. The existing handle type 277 | is named ReadWriterAt, which is a poor fit for an interface that also 278 | provides a Sync() function. Let's rename this to BlockDevice. 279 | 280 | With all of these changes made, it should be a lot easier to achieve 281 | decent unit testing coverage of the code. 282 | 283 | # Architectural diagram of the old and new LocalBlobAccess 284 | 285 | LocalBlobAccess before the proposed changes are made: 286 | 287 |

288 | ![LocalBlobAccess before the proposed changes are made](images/0005/before.png) 289 |

290 | 291 | LocalBlobAccess after the proposed changes are made, but with 292 | persistency still disabled: 293 | 294 |

295 | ![LocalBlobAccess after the proposed changes are made, but with persistency still disabled](images/0005/volatile.png) 296 |

297 | 298 | The classes colored in green perform a similar role to the old 299 | LocalBlobAccess type that they replace. 300 | 301 | LocalBlobAccess after the proposed changes are made, but with 302 | persistency enabled: 303 | 304 |

305 | ![LocalBlobAccess after the proposed changes are made, but with persistency enabled](images/0005/persistent.png) 306 |

307 | 308 | Note that these are not UML class diagrams. The arrows in these diagrams 309 | indicate how objects reference each other. The labels on the arrows 310 | correspond with the interface type that is used. 311 | 312 | # Thanks 313 | 314 | The author would like to thank Tom Coldrick from Codethink Ltd. for his 315 | contributions. His PRs gave good insight in what was needed to add 316 | persistency. 317 | 318 | # Future work 319 | 320 | With bb-storage/pkg/blockdevice generalized to support file storage, 321 | there is no longer any need to keep bb-remote-execution's 322 | DirectoryBackedFilePool, as BlockDeviceBackedFilePool can be used in its 323 | place. We should consider removing DirectoryBackedFilePool. 324 | 325 | Letting the key-location map be backed by a disk may increase access 326 | latency significantly. Should we add a configuration option for calling 327 | `mlock()` on the memory map, so that its contents are guaranteed to 328 | remain in memory? 329 | -------------------------------------------------------------------------------- /0006-operation-logging-and-monetary-resource-usage.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #6: Completed Action Logger and Monetary Resource Usage 2 | 3 | Authors: Travis Takai, Ed Schouten 4 | 5 | Date: 2021-03-23 6 | 7 | # Context 8 | 9 | Action results presented through bb-browser's frontend provide a great way to gain insight into the behavior of remote execution by looking at a single action at a time. The next logical step to this is analyzing build results in aggregate and in real-time. There are many reasons for wanting to analyze remote execution behavior such as identifying build duration trends, estimating the total cost of a build, identifying computationally expensive build targets, evaluating hardware utilization, or correlating completed build actions with data in the [Build Event Protocol](https://docs.bazel.build/versions/master/build-event-protocol.html). bb-browser displays all of the necessary info, but does not allow for programmatic analysis of action results for a number of reasons: 10 | 11 | * Most of the useful information is stored within the [ExecutedActionMetadata's auxiliary_metadata field](https://github.com/bazelbuild/remote-apis/blob/0943dc4e70e1414735a85a3167557392c177ff45/build/bazel/remote/execution/v2/remote_execution.proto#L945-L948) but is not exposed by the bazel client or displayed in the Build Event Protocol output. 12 | 13 | * There aren't any convenient ways for exploring build information via bb-browser as the instance name, action digest, and size of the action result are all needed before querying for an action result. 14 | 15 | * The goal of bb-storage is to offer short-term data storage, which does not allow for any form of querying for historical data. 16 | 17 | * bb-scheduler does not provide a persisted list of build actions and bb-storage has no way of providing access to a sequence of action results in an efficient way. 18 | 19 | An alternative approach that allows for a flexible configuration for clients is to allow for streaming action results along with their associated metadata to an external service. This allows for both long-term data persistence and real-time build action analysis. We should work towards creating this streaming service, which we can call Completed Action Logger. 
20 | 21 | # Completed Action Logger Service 22 | 23 | The protocol will be used by bb_worker to stream details of completed actions, known as CompletedActions, to an external logging service: 24 | 25 | ```protobuf 26 | service CompletedActionLogger { 27 | // Send a CompletedAction to another service as soon as a build action has 28 | // completed. Receiving a message from the return stream indicates that the 29 | // service successfully received the CompletedAction. 30 | rpc LogCompletedActions(stream CompletedAction) 31 | returns (stream google.protobuf.Empty); 32 | } 33 | ``` 34 | 35 | Each worker will have the ability to stream CompletedActions to the logging service once the action result has been created and all metadata has been attached. CompletedActions will take the form of: 36 | 37 | ```protobuf 38 | // CompletedAction wraps a finished build action in order to transmit to 39 | // an external service. 40 | message CompletedAction { 41 | // A wrapper around the action's digest and REv2 ActionResult, which contains 42 | // the action's associated metadata. 43 | buildbarn.cas.HistoricalExecuteResponse historical_execute_response = 1; 44 | 45 | // A unique identifier associated with the CompletedAction, which is 46 | // generated by the build executor. This provides a means by which the 47 | // external logging service may be able to deduplicate incoming 48 | // CompletedActions. The usage of this field is left to the external 49 | // logging service to determine. 50 | string uuid = 2; 51 | 52 | // The REv2 instance name of the remote cluster that workers are returning 53 | // the action result from. 54 | string instance_name = 3; 55 | } 56 | ``` 57 | 58 | The HistoricalExecuteResponse message is simply bb-storage's UncachedActionResult that will be renamed in order to be used more generally. Apart from the Action digest included in the historical_execute_response field, no information about the Action is part of CompletedAction. Implementations of the CompletedActionLogger service can load objects from the Content Addressable Storage in case they need to inspect details of the Action. 59 | 60 | # Monetary Resource Usage 61 | 62 | Now that we've defined what the Completed Action Logger is, let's go ahead and implement one of the desired uses for action analysis: metadata for measuring the cost of build execution. Defining a new message, which we'll call MonetaryResourceUsage, provides a nice way of calculating how much a given build cost based on the amount of time spent executing the action and will take the form of: 63 | 64 | ```protobuf 65 | // A representation of unique factors that may be aggregated to 66 | // compute a given build action's total price. 67 | message MonetaryResourceUsage { 68 | message Expense { 69 | // The type of currency the cost is measured in. Required to be in 70 | // ISO 4217 format: https://en.wikipedia.org/wiki/ISO_4217#Active_codes 71 | string currency = 1; 72 | 73 | // The value of a specific expense for a build action. 74 | double cost = 2; 75 | } 76 | 77 | // A mapping of expense categories to their respective costs. 78 | map<string, Expense> expenses = 1; 79 | } 80 | ``` 81 | 82 | This will be appended to auxiliary_metadata that is part of the REv2 ExecutedActionMetadata message at the end of execution, which will automatically ensure it is cached along with the other build metadata in bb-storage.
While expenses are not significant on a per-action basis, when combined with the CompletedActionLogger service we now have a way to quantify how much a given build invocation or target costs and see how that changes over time. Implementations of the CompletedActionLogger service are responsible for aggregating these MonetaryResourceUsage messages. It is possible to aggregate this data by making use of the fields within the RequestMetadata message, such as [tool_invocation_id](https://github.com/bazelbuild/remote-apis/blob/0943dc4e70e1414735a85a3167557392c177ff45/build/bazel/remote/execution/v2/remote_execution.proto#L1760-L1762) or the [recently added target_id field](https://github.com/bazelbuild/remote-apis/pull/186/files), as the RequestMetadata data is always appended to the auxiliary_metadata message. 83 | -------------------------------------------------------------------------------- /0007-nested-invocations.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #7: Nested invocations 2 | 3 | Author: Ed Schouten
4 | Date: 2021-12-27 5 | 6 | # Context 7 | 8 | On 2020-11-16, bb\_scheduler was extended to take the Bazel invocation 9 | ID into account ([commit](https://github.com/buildbarn/bb-remote-execution/commit/1a8407a0e62bf559abcd4006e0ecc9d3de9838c7)). 10 | With this change applied, bb\_scheduler no longer places all incoming 11 | operations in a single FIFO-like queue (disregarding priorities). 12 | Instead, every invocation of Bazel gets its own queue. All of these 13 | queues combined are placed in another queue that determines the order of 14 | execution. The goal of this change was to improve fairness, by making 15 | the scheduler give every invocation an equal number of workers. 16 | 17 | Under the hood, [`InMemoryBuildQueue` calls into `ActionRouter`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/in_memory_build_queue.go#L378), which then [calls into `invocation.KeyExtractor`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/routing/simple_action_router.go#L40) 18 | to obtain an [`invocation.Key`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/key.go). 19 | The latter acts as [a map key](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/in_memory_build_queue.go#L1201) 20 | for all invocations known to the scheduler. 21 | [The default implementation](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/request_metadata_key_extractor.go) 22 | is one that extracts [the `tool_invocation_id` field from the REv2 `RequestMetadata` message](https://github.com/bazelbuild/remote-apis/blob/636121a32fa7b9114311374e4786597d8e7a69f3/build/bazel/remote/execution/v2/remote_execution.proto#L1836). 23 | 24 | What is pretty elegant about this design is that `InMemoryBuildQueue` 25 | doesn't really care about what kind of logic is used to compute 26 | invocation keys. One may, for example, provide a custom 27 | `invocation.KeyExtractor` that causes operations to be grouped by 28 | username, or by [the `correlated_invocations_id` field](https://github.com/bazelbuild/remote-apis/blob/636121a32fa7b9114311374e4786597d8e7a69f3/build/bazel/remote/execution/v2/remote_execution.proto#L1840). 29 | 30 | On 2021-07-16, bb\_scheduler was extended once more to use Bazel 31 | invocation IDs to provide worker locality ([commit](https://github.com/buildbarn/bb-remote-execution/commit/902aab278baa45c92d48cad4992c445cfc588e32)). 32 | This change causes the scheduler to keep track of invocation ID of the 33 | last operation to run on a worker. When scheduling successive tasks, the 34 | scheduler will try to reuse a worker with a matching invocation ID. This 35 | increases the probability of input root cache hits, as it is not 36 | uncommon for clients to run many actions having similar input root 37 | contents. 38 | 39 | At the same time, this change introduced [the notion of 'stickiness'](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/proto/configuration/bb_scheduler/bb_scheduler.proto#L177-L196), 40 | where a cluster operator can effectively quantify the overhead of 41 | switching between actions belonging to different invocations. 
Such 42 | overhead could include the startup time of virtual machines, download 43 | times of container images specified in REv2 platform properties. In the 44 | case embedded hardware testing, it could include the time it takes to 45 | flash an image to EEPROM/NAND and boot from it. By using a custom 46 | `invocation.KeyExtractor` that extracts these properties from incoming 47 | execution requests, the scheduler can reorder operations in such a way 48 | that the number of restarts, reboots, flash erasure cycles, etc. is 49 | reduced. 50 | 51 | One major limitation of how this feature is implemented right now, is 52 | that invocations are a flat namespace. The scheduler only provides 53 | fairness by keying execution requests by a single property. For example, 54 | there is no way to configure the scheduler to provide the following: 55 | 56 | - Fairness by username, followed by fairness by Bazel invocation ID for builds 57 | with an equal username. 58 | - Fairness by team/project/department, followed by fairness by username. 59 | - Fairness both by correlated invocations ID and tool invocation ID. 60 | - Fairness/stickiness by worker container image, followed by fairness by 61 | Bazel invocation ID. 62 | 63 | In this ADR we propose to remove this limitation, with the goal of 64 | improving fairness of multi-tenant Buildbarn setups. 65 | 66 | # Proposed changes 67 | 68 | This ADR mainly proposes that operations no longer have a single 69 | `invocation.Key`, but a list of them. The flat map of invocations 70 | managed by `InMemoryBuildQueue` will be changed to a 71 | [trie](https://en.wikipedia.org/wiki/Trie), where every level of the 72 | trie corresponds to a single `invocation.Key`. More concretely, if the 73 | scheduler is configured to not apply any fairness whatsoever, all 74 | operations will simply be queued within the root invocation. If the 75 | `ActionRouter` is configured to always return two `invocation.Key`s, the 76 | trie will have height two, having operations queued only at the leaves. 77 | 78 | When changing this data structure, we'll also need to rework all of the 79 | algorithms that are applied against it. For example, we currently use a 80 | [binary heap](https://en.wikipedia.org/wiki/Binary_heap) to store all 81 | invocations that have one or more queued operations, which is used for 82 | selecting the next operation to run. This will be replaced with a heap 83 | of heaps that mimics the layout of the trie. These algorithms will thus 84 | all become a bit more complex than before. 85 | 86 | Fortunately, there are also some places where the algorithms become 87 | simpler. For the worker locality, we currently have to deal with the 88 | case where the last task that ran on a worker was associated with 89 | multiple invocations (due to in-flight deduplication). Or it may not be 90 | associated with any invocation if the worker was created recently. In 91 | the new model we can simply pick the 92 | [lowest common ancestor](https://en.wikipedia.org/wiki/Lowest_common_ancestor) 93 | of all invocations in case of in-flight deduplication, and pick the root 94 | invocation for newly created workers. This ensures that every worker is 95 | always associated with exactly one invocation. 96 | 97 | For `InMemoryBuildQueue` to obtain multiple `invocation.Key`s for a 98 | given operation, we will need to alter `ActionRouter` to return a list 99 | instead of a single instance. 
The underlying `invocation.KeyExtractor` 100 | interface can remain as is, though we can remove 101 | [`invocation.EmptyKeyExtractor`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/empty_key_extractor.go). 102 | The same behaviour can now be achieved by not using any key extractors 103 | at all. 104 | 105 | Finally we have to extend the worker invocation stickiness logic to work 106 | with this nested model. As every level of the invocation tree now has a 107 | different cost when switching to a different branch, we now need to let 108 | the configuration accept a list of durations, each indicating the cost 109 | at that level. 110 | 111 | All of the changes proposed above have been implemented in 112 | [this commit](https://github.com/buildbarn/bb-remote-execution/commit/744c9a65215179f19f20523acd522937a03aad56). 113 | 114 | # Future work 115 | 116 | With `invocation.EmptyKeyExtractor` removed, only 117 | `invocation.RequestMetadataKeyExtractor` remains. This type is capable 118 | of creating an `invocation.Key` based on the `tool_invocation_id`. We 119 | should add more types, such as one capable of extracting the 120 | `correlated_invocations_id` or authentication related properties (e.g., 121 | username). 122 | 123 | Even though the scheduling policy of `InMemoryBuildQueue` is fair, there 124 | is no support for [preemption](https://en.wikipedia.org/wiki/Preemption_(computing)). 125 | At no point will the scheduler actively move operations from the 126 | `EXECUTING` back to the `QUEUED` stage if it prevents other invocations 127 | from making progress. 128 | -------------------------------------------------------------------------------- /0009-nfsv4.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #9: An NFSv4 server for bb\_worker 2 | 3 | Author: Ed Schouten
4 | Date: 2022-05-23 5 | 6 | # Context 7 | 8 | Buildbarn's worker process (bb\_worker) can be configured to populate 9 | build directories of actions in two different ways: 10 | 11 | - By instantiating it as [a native directory on a local file system](https://github.com/buildbarn/bb-remote-execution/blob/78e082c6793904705001a9674bc865f91d7624bf/pkg/proto/configuration/bb_worker/bb_worker.proto#L79-L83). 12 | To speed up this process, it may use a directory where recently used 13 | files are cached. Files are hardlinked from/to this cache directory. 14 | - By instantiating it in-memory, while [using FUSE to make it accessible to the build action](https://github.com/buildbarn/bb-remote-execution/blob/78e082c6793904705001a9674bc865f91d7624bf/pkg/proto/configuration/bb_worker/bb_worker.proto#L85-L98). 15 | An instance of the LocalBlobAccess storage backend needs to be used to 16 | cache file contents. 17 | 18 | While the advantage of the former is that it does not introduce any 19 | overhead while executing, the process may be slow for large input roots, 20 | especially if only a fraction gets used in practice. The FUSE file 21 | system has the advantage that data is loaded lazily, meaning files and 22 | directories of the input root are only downloaded from the CAS if their 23 | contents are read during execution. This is particularly useful for 24 | actions that ship their own SDKs. 25 | 26 | An issue with FUSE is that it remains fairly Linux specific. Other 27 | operating systems also ship with implementations of FUSE or allow it to 28 | be installed as a kernel module/extension, but these implementations 29 | tend to vary in terms of quality and conformance. For example, for macOS 30 | there is [macFUSE (previously called OSXFUSE)](https://osxfuse.github.io). 31 | Though bb\_worker can be configured to work with macFUSE, it does tend 32 | to cause system lockups under high load. Fixing this is not easy, as 33 | macFUSE is no longer Open Source Software. 34 | 35 | For this reason we would like to offer an alternative to FUSE, namely an 36 | integrated NFSv4 server that listens on `localhost` and is mounted on 37 | the same system. FUSE will remain supported and recommended for use on 38 | Linux; NFSv4 should only be used on systems where the use of FUSE is 39 | undesirable and a high-quality NFSv4 client is available. 40 | 41 | We will focus on implementing NFSv4.0 as defined in [RFC 7530](https://datatracker.ietf.org/doc/html/rfc7530#section-16.10). 42 | Implementing newer versions such as NFSv4.1 ([RFC 8881](https://www.rfc-editor.org/rfc/rfc8881#name-obsolete-locking-infrastruc)) 43 | and NFSv4.2 ([RFC 7862](https://datatracker.ietf.org/doc/html/rfc7862)) 44 | is of little use, as clients such as the one shipped with macOS don't 45 | support them. We should also not be using NFSv3 ([RFC 1813](https://datatracker.ietf.org/doc/html/rfc1813)), 46 | as due to its lack of [compound operations](https://datatracker.ietf.org/doc/html/rfc7530#section-1.4.2), 47 | it is far more 'chatty' than NFSv4. This would lead to unnecessary 48 | context switching between bb\_worker and build actions. 49 | 50 | # Reusing as much code as possible between FUSE and NFSv4 51 | 52 | [The code for our existing FUSE file system](https://github.com/buildbarn/bb-remote-execution/tree/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/filesystem/fuse) 53 | is about 7500 lines of code. We have already invested heavily in it, and 54 | it has received many bugfixes for issues that we have observed in 55 | production use. 
The last thing we want to do is to add a brand new 56 | `pkg/filesystem/nfsv4` package that has to reimplement all of this for 57 | NFSv4. Not only will this be undesirable from a maintenance perspective, 58 | it also puts us at risk of introducing behavioral differences between 59 | the FUSE and NFSv4 implementations, making it hard to switch between the 60 | two. 61 | 62 | As FUSE and NFSv4 are conceptually identical (i.e., request-response 63 | based services for accessing a POSIX-like file system), we should try to 64 | move to a shared codebase. This ADR therefore proposes that the existing 65 | `pkg/filesystem/fuse` package is decomposed into three new packages: 66 | 67 | - `pkg/filesystem/virtual`, which will contain the vast majority of code 68 | that can be made independent of [go-fuse](https://github.com/hanwen/go-fuse). 69 | - `pkg/filesystem/virtual/fuse`, which contains the coupling with 70 | go-fuse. 71 | - `pkg/filesystem/virtual/configuration`, which contains the code for 72 | instantiating a virtual file system based on configuration settings 73 | and exposing (mounting) it. 74 | 75 | This decomposition does require us to make various refactoring changes. 76 | Most notably, the [`Directory`](https://github.com/buildbarn/bb-remote-execution/blob/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/filesystem/fuse/directory.go#L38-L73) 77 | and [`Leaf`](https://github.com/buildbarn/bb-remote-execution/blob/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/filesystem/fuse/leaf.go#L13-L23) 78 | interfaces currently depend on many data types that are part of go-fuse, 79 | such as status codes, directory entry structures, and file attribute 80 | structures. All of these need to be replaced with equivalents that are 81 | generic enough to support both the semantics of FUSE and NFSv4. 82 | Differences between these two protocols that need to be bridged include 83 | the following: 84 | 85 | - FUSE's READDIR operation is stateful, in that it's surrounded by 86 | OPENDIR and RELEASEDIR operations. This means that we currently load 87 | all of the directory contents at once, and paginate the results as 88 | part of READDIR. With NFSv4 this operation needs to be stateless. 89 | We solve this by pushing down the pagination into the `Directory` 90 | implementations. The new `VirtualReadDir()` method we add takes a 91 | starting offset and pushes directory entries into a 92 | `DirectoryEntryReporter` until space has run out. This makes both FUSE 93 | and NFSv4 use a stateless approach. 94 | 95 | - FUSE has the distinction between READDIR and READDIRPLUS. The former 96 | only returns filenames, file types and inode numbers, while the latter 97 | returns full `stat` information. NFSv4 only provides a single READDIR 98 | operation, but just like with the GETATTR operation, the caller can 99 | provide a bitmask of attributes that it's interested in receiving. As 100 | FUSE's semantics can be emulated on top of NFSv4's, we'll change our 101 | API to use an `AttributesMask` type as well. 102 | 103 | - When FUSE creates a regular file for reading/writing, it calls CREATE, 104 | providing it the identifier of the parent directory and the filename. 105 | This differs from opening an existing file, which is done through 106 | OPEN, providing it the identifier of the file. NFSv4, however, only 107 | provides a single OPEN operation that is used in all cases. We'll 108 | therefore replace `Directory.FUSECreate()` with 109 | `Directory.VirtualOpenChild()`, which can be used by FUSE CREATE and 110 | NFSv4 OPEN. 
`Leaf.FUSEOpen()` will remain available under the name 111 | `Leaf.VirtualOpenSelf()`, to be used by FUSE OPEN. 112 | 113 | With all of these changes landed, we'll be able to instantiate a virtual 114 | file system as part of bb\_worker that can both be interacted with from 115 | within FUSE and NFSv4 servers. 116 | 117 | **Status:** This work has been completed as part of [commit `c4bbd24`](https://github.com/buildbarn/bb-remote-execution/commit/c4bbd24a8d272267f40b234533f62c8ccf24d568). 118 | The new virtual file system layer can be found in [`pkg/filesystem/virtual`](https://github.com/buildbarn/bb-remote-execution/tree/70cca3b240190be22d3c5866ac5635d14515bf24/pkg/filesystem/virtual). 119 | 120 | # Indexing the virtual file system by file handle 121 | 122 | NFSv4 clients refer to files and directories on the server by file 123 | handle. File handles are byte arrays that can be between 0 and 128 bytes 124 | in size (0 and 64 bytes for NFSv3). Many NFS servers on UNIX-like 125 | platforms construct file handles by concatenating an inode number with a 126 | file generation count to ensure that file handles remain distinct, even 127 | if inodes are recycled. As handles are opaque to the client, a server 128 | can choose any format it desires. 129 | 130 | As NFSv4 is intended to be a (mostly) stateless protocol, the server has 131 | absolutely no information on when a client is going to stop interacting 132 | with a file handle. At any point in time, a request can contain a file 133 | handle that was returned previously. Our virtual file system must 134 | therefore not just allow resolution by path, but also by file handle. 135 | This differs fundamentally from FUSE, where the kernel and userspace 136 | service share knowledge on which subset of the file system is in play 137 | between the two. The kernel issues FORGET calls when purging file 138 | entries from its cache, allowing the userspace service to release the 139 | resource as well. 140 | 141 | Where NFSv4's semantics become problematic is not necessarily in 142 | bb\_worker, but in bb\_clientd. This service provides certain 143 | directories that are infinitely big (i.e., `/cas/*`, which allow you to 144 | access arbitrary CAS contents). Unlike build directories, files in these 145 | directories have an infinite lifetime, meaning that any dynamic 146 | allocation scheme for file handles would result in memory leaks. For 147 | files in these directories we will need to dump their state (i.e., the 148 | REv2 digest) into the file handle itself, allowing the file to be 149 | reconstructed when needed. 150 | 151 | To achieve the above, we will add a new `HandleAllocator` API that is 152 | capable of decorating `Directory` and `Leaf` objects to give them their 153 | own identity and perform lifecycle management. In practice, this means 154 | that a plain `Directory` or `Leaf` implementation will no longer have an 155 | inode number or link count. Only by wrapping the object using 156 | `HandleAllocator` will this information become available through 157 | `VirtualGetAttributes()`. In the case of FUSE, `HandleAllocator` will do 158 | little more than compute an inode number, just like [`InodeNumberTree`](https://github.com/buildbarn/bb-remote-execution/blob/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/filesystem/fuse/inode_number_tree.go) 159 | already did. 
For NFSv4, it will additionally generate a file handle and 160 | store the object in a global map, so that the object can be resolved by 161 | the NFSv4 server if the client performs a PUTFH operation. 162 | 163 | The `HandleAllocator` API will distinguish between three types of 164 | allocations: 165 | 166 | - **Stateful objects:** Files or directories that are mutable. Each 167 | instance has its own dynamically allocated inode number/file handle. 168 | - **Stateless objects:** Files or directories that are immutable, but 169 | have a state that cannot be reproduced from just a file handle. 170 | Examples may include symbolic links created by build actions. As a 171 | symlink's target can be larger than 128 bytes, there is no way to 172 | embed a symlink's state into a file handle. This means that even 173 | though the symlink may have a deterministic inode number/file handle, 174 | its lifecycle should be tracked explicitly. 175 | - **Resolvable objects:** Files or directories that are immutable, and 176 | have state that is small enough to embed into the file handle. 177 | Examples include CAS backed files, as a SHA-256 sum, file size and 178 | executable bit can easily fit in a file handle. 179 | 180 | **Status:** This work has also been completed as part of [commit `c4bbd24`](https://github.com/buildbarn/bb-remote-execution/commit/c4bbd24a8d272267f40b234533f62c8ccf24d568). 181 | Implementations of `HandleAllocator` have been added for [FUSE](https://github.com/buildbarn/bb-remote-execution/blob/70cca3b240190be22d3c5866ac5635d14515bf24/pkg/filesystem/virtual/fuse_handle_allocator.go) 182 | and [NFSv4](https://github.com/buildbarn/bb-remote-execution/blob/70cca3b240190be22d3c5866ac5635d14515bf24/pkg/filesystem/virtual/nfs_handle_allocator.go). 183 | 184 | # XDR description to Go compilation 185 | 186 | All versions of NFS are built on top of ONC RPCv2 ([RFC 5531](https://datatracker.ietf.org/doc/html/rfc5531)), 187 | also known as "Sun RPC". Like most protocols built on top of ONC RPCv2, 188 | NFS uses XDR ([RFC 4506](https://datatracker.ietf.org/doc/html/rfc4506)) 189 | for encoding request and response payloads. The XDR description for 190 | NFSv4.0 can be found in [RFC 7531](https://datatracker.ietf.org/doc/html/rfc7531). 191 | 192 | A nice feature of schema languages like XDR is that they can be used to 193 | perform code generation. For each of the types described in the RFC, we 194 | may emit an equivalent native type in Go, together with serialization 195 | and deserialization methods. In addition to making our server's code 196 | more readable, it is less error prone than attempting to read/write raw 197 | bytes from/to a socket. 198 | 199 | A disadvantage of this approach is that it does add overhead. When 200 | converted to native Go types, requests are no longer stored contiguously 201 | in some buffer, but may be split up into multiple objects, which may 202 | become heap allocated (thus garbage collector backed). Though this is a 203 | valid concern, we will initially assume that this overhead is 204 | acceptable. In-kernel implementations tend to make a different trade-off 205 | in this regard, but this is likely a result of memory management in 206 | kernel space being far more restricted. 
207 | 208 | A couple of implementations of XDR for Go exist: 209 | 210 | - [github.com/davecgh/go-xdr](https://github.com/davecgh/go-xdr) 211 | - [github.com/stellar/go/xdr](https://github.com/stellar/go/tree/master/xdr) 212 | - [github.com/xdrpp/goxdr](https://github.com/xdrpp/goxdr) 213 | 214 | Unfortunately, none of these implementations are complete enough to be 215 | of use for this specific use case. We will therefore design our own 216 | implementation, which we will release as a separate project that does 217 | not depend on any Buildbarn code. 218 | 219 | **Status:** [The XDR to Go compiler has been released on GitHub.](https://github.com/buildbarn/go-xdr) 220 | 221 | # The actual NFSv4 server 222 | 223 | With all of the previous tasks completed, we have all of the building 224 | blocks in place to be able to add an NFSv4 server to the 225 | bb-remote-execution codebase. All that is left is to write an 226 | implementation of [program `NFS4_PROGRAM`](https://datatracker.ietf.org/doc/html/rfc7531#page-36), 227 | for which the XDR to Go compiler automatically generates the following 228 | interface: 229 | 230 | ```go 231 | type Nfs4Program interface { 232 | NfsV4Nfsproc4Null(context.Context) error 233 | NfsV4Nfsproc4Compound(context.Context, *Compound4args) (*Compound4res, error) 234 | } 235 | ``` 236 | 237 | This implementation, which we will call `baseProgram`, needs to process 238 | the operations provided in `Compound4args` by translating them to calls 239 | against instances of `Directory` and `Leaf`. 240 | 241 | For most NFSv4 operations this implementation will be relatively simple. 242 | For example, for RENAME it involves little more than extracting the 243 | directory objects and filenames from the request, followed by calling 244 | `Directory.VirtualRename()`. 245 | 246 | Most of the complexity of `baseProgram` will lie in how operations like 247 | OPEN, CLOSE, LOCK, and LOCKU are implemented. These operations establish 248 | and alter state on the server, meaning that they need to be guarded 249 | against replays and out-of-order execution. To solve this, NFSv4.0 250 | requires that these operations are executed in the context of 251 | open-owners and lock-owners. Each open-owner and lock-owner has a 252 | sequence ID associated with it, which gets incremented whenever an 253 | operation succeeds. The server can thus detect replays of previous 254 | requests by comparing the sequence ID in the client's request with the 255 | value stored on the server. If the sequence ID corresponds to the last 256 | operation to execute, a cached response is returned. A new transaction 257 | will only be performed if the sequence ID is one larger than the last 258 | observed. 259 | 260 | The exact semantics of this sequencing model is fairly complex. It 261 | is covered extensively in [chapter 9 of RFC 7530](https://datatracker.ietf.org/doc/html/rfc7530#section-9), 262 | which is about 40 pages long. The following attempts to summarize which 263 | data types we have declared as part of `baseProgram`, and how they map 264 | to the NFSv4.0 sequencing model. 265 | 266 | - `baseProgram`: In the case of bb\_worker, zero or more build 267 | directories are declared to be exposed through an NFSv4 server. 268 | - `clientState`: Zero or more clients may be connected to the NFSv4 269 | server. 270 | - `clientConfirmationState`: The client may have one or more client 271 | records, which are created through SETCLIENTID. 
Multiple client 272 | records can exist if the client loses all state and reconnects 273 | (e.g., due to a reboot). 274 | - `confirmedClientState`: Up to one of these client records can be 275 | confirmed using SETCLIENTID\_CONFIRM. This structure stores the 276 | state of a healthy client that is capable of opening files and 277 | acquiring locks. 278 | - `openOwnerState`: Confirmed clients may have zero or more 279 | open-owners. This structure stores the current sequence 280 | number of the open-owner. It also holds the response of the last 281 | CLOSE, OPEN, OPEN\_CONFIRM or OPEN\_DOWNGRADE call for replays. 282 | - `openOwnerFileState`: An open-owner may have zero or more 283 | open files. The first time a file is opened through this 284 | open-owner, the client needs to call OPEN\_CONFIRM. 285 | - `lockOwnerState`: Confirmed clients may have zero or more 286 | lock-owners. This structure stores the current sequence number 287 | of the lock-owner. It also holds the response of the last LOCK or 288 | LOCKU call for replays. 289 | - `lockOwnerFileState`: A lock-owner may have one or more 290 | files with lock state. 291 | - `ByteRangeLock`: A lock state may hold locks on 292 | byte ranges in the file. 293 | 294 | Even though NFSv4.0 does provide a RELEASE\_LOCKOWNER operation for 295 | removing lock-owners, no facilities are provided for removing unused 296 | open-owners and client records. `baseProgram` will be implemented in 297 | such a way that these objects are removed automatically if a 298 | configurable amount of time passes. This is done as part of general lock 299 | acquisition, meaning clients are collectively responsible for cleaning 300 | stale state. 301 | 302 | A disadvantage of NFSv4.0's sequencing model is that open-owners are not 303 | capable of sending OPEN requests in parallel. It is not expected that 304 | this causes a bottleneck in our situation, as running the NFSv4 server 305 | on the worker itself means that latency is virtually nonexistent. It is 306 | worth noting that [NFSv4.1 has completely overhauled this part of the protocol](https://www.rfc-editor.org/rfc/rfc8881.html#section-8.8), 307 | thereby removing this restriction. Implementing this model is, as 308 | explained earlier on, out of scope. 309 | 310 | **Status:** This work has been completed as part of [commit `f00b857`](https://github.com/buildbarn/bb-remote-execution/commit/f00b8570c2cb0223f1faedcd83f84709a268b457). 311 | The new NFSv4 server can be found in [`pkg/filesystem/virtual/nfsv4`](https://github.com/buildbarn/bb-remote-execution/blob/f00b8570c2cb0223f1faedcd83f84709a268b457/pkg/filesystem/virtual/nfsv4/base_program.go). 312 | 313 | # Changes to the configuration schema 314 | 315 | The FUSE file system is currently configured through 316 | [the `MountConfiguration` message](https://github.com/buildbarn/bb-remote-execution/blob/fa8080bad8a8f761855d828f8dd517ed731bfdd6/pkg/proto/configuration/fuse/fuse.proto#L9-L91). 317 | Our plan is to split this message up, moving all FUSE specific options 318 | into a `FUSEMountConfiguration` message. This message can then be placed 319 | in a `oneof` together with an `NFSv4MountConfiguration` message that 320 | enables the use of the NFSv4 server. Switching back and forth between 321 | FUSE and NFSv4 should thus be trivial. 322 | 323 | **Status:** This work has been completed as part of [commit `f00b857`](https://github.com/buildbarn/bb-remote-execution/commit/f00b8570c2cb0223f1faedcd83f84709a268b457). 
324 | The new `MountConfiguration` message with separate FUSE and NFSv4 backends can 325 | be found [here](https://github.com/buildbarn/bb-remote-execution/blob/f00b8570c2cb0223f1faedcd83f84709a268b457/pkg/proto/configuration/filesystem/virtual/virtual.proto#L13-L31). 326 | 327 | # Future work 328 | 329 | In this ADR we have mainly focused on the use of NFSv4 for bb\_worker. 330 | These changes will also make it possible to launch bb\_clientd with an 331 | NFSv4 server. bb\_clientd's use case differs from bb\_worker's, in that 332 | the use of the [Remote Output Service](https://github.com/buildbarn/bb-clientd#-to-perform-remote-builds-without-the-bytes) 333 | heavily depends on being able to quickly invalidate directory entries 334 | when lazy-loading files are inserted into the file system. FUSE is 335 | capable of facilitating this by sending FUSE\_NOTIFY\_INVAL\_ENTRY 336 | messages to the kernel. A similar feature named CB\_NOTIFY is present in 337 | NFSv4.1 and later, but rarely implemented by clients. 338 | 339 | Maybe bb\_clientd can be made to work by disabling client-side directory 340 | caching. Would performance still be acceptable in that case? 341 | -------------------------------------------------------------------------------- /0010-file-system-access-cache.md: -------------------------------------------------------------------------------- 1 | # Buildbarn Architecture Decision Record #10: The File System Access Cache 2 | 3 | Author: Ed Schouten
4 | Date: 2023-01-20 5 | 6 | # Context 7 | 8 | Buildbarn's workers can be configured to expose input roots of actions 9 | through a virtual file system, using either FUSE or NFSv4. One major 10 | advantage of this feature is that it causes workers to only download 11 | objects from the Content Addressable Storage (CAS) if they are accessed 12 | by actions as part of their execution. For example, a compilation action 13 | that ships its own SDK will only cause the worker to download the 14 | compiler binary and standard libraries/headers that are used by the code 15 | to be compiled, as opposed to downloading the entire SDK. 16 | 17 | Using the virtual file system generally makes actions run faster. We 18 | have observed that in aggregate, it allows actions to run at >95% of 19 | their original speed, while eliminating the downloading phase that 20 | normally leads up to execution. Also, the reduction in bandwidth against 21 | storage nodes allows clusters to scale further. 22 | 23 | Unfortunately, there are also certain workloads for which the virtual 24 | file system performs poorly. As most traditional software reads files 25 | from disk using synchronous APIs, actions that observe poor cache hit 26 | rates can easily become blocked when the worker needs to make many round 27 | trips to the CAS. Increased latency (distance) between storage and 28 | workers only amplifies these delays. 29 | 30 | A more traditional worker that loads all data up front would not be 31 | affected by this issue, because it can use parallelism to 32 | load many objects from the CAS. This means that one could avoid this 33 | issue by creating a cluster consisting of workers both with and without 34 | a virtual file system. Tags could be added to build targets declared in 35 | the user's repository to indicate which kind of worker should be used 36 | for a given action, based on observed performance characteristics. 37 | However, such a solution would provide a poor user experience. 38 | 39 | # Introducing the File System Access Cache (FSAC) 40 | 41 | What if there were a way for workers to load data up front, but somehow 42 | restrict this process to data that the worker **thinks** is going to be 43 | used? This would be optimal, both in terms of throughput and bandwidth 44 | usage. Assuming the number of inaccuracies is low, this shouldn't cause 45 | any harm: 46 | 47 | - **False negatives:** In case a file or directory is not loaded up 48 | front, but does get accessed by the action, it can be loaded on demand 49 | using the logic that is already provided by the virtual file system 50 | and storage layer. 51 | 52 | - **False positives:** In case a file or directory is loaded up front, 53 | but does not get used, it merely leads to some wasted network traffic 54 | and worker-level cache usage. 55 | 56 | Even though it is possible to use heuristics for certain kinds of 57 | actions to derive which paths they will access as part of their 58 | execution, this cannot be done in the general case. Because of that, we 59 | will depend on file system access profiles obtained from 60 | historical executions of similar actions. These profiles will be stored 61 | in a new data store, named the File System Access Cache (FSAC).
62 | 
63 | When we added the Initial Size Class Cache (ISCC) for automatic
64 | worker size selection, we observed that, when combined,
65 | [the Command digest](https://github.com/bazelbuild/remote-apis/blob/f6089187c6580bc27cee25b01ff994a74ae8658d/build/bazel/remote/execution/v2/remote_execution.proto#L458-L461)
66 | and [platform properties](https://github.com/bazelbuild/remote-apis/blob/f6089187c6580bc27cee25b01ff994a74ae8658d/build/bazel/remote/execution/v2/remote_execution.proto#L514-L522)
67 | are very suitable keys for grouping similar actions together. The FSAC
68 | will thus use the same 'reduced Action digests' as the ISCC for its keys.
69 | 
70 | As the set of paths accessed by some actions can be large, we're going
71 | to apply two techniques to ensure the profiles stored in the FSAC remain
72 | small:
73 | 
74 | 1. Instead of storing all paths in literal string form, we're going to
75 |    store them as a Bloom filter. Bloom filters are probabilistic data
76 |    structures, which have a bounded and configurable false positive
77 |    rate. Using Bloom filters, we only need 14 bits of space per path to
78 |    achieve a 0.1% false positive rate.
79 | 1. For actions that access a very large number of paths, we're going to
80 |    limit the maximum size of the Bloom filter. Even though this will
81 |    increase the false positive rate, this is likely acceptable. It is
82 |    uncommon to have actions that access many different paths, **and**
83 |    leave a large portion of the input root unaccessed.
84 | 
85 | # Tying the FSAC into bb\_worker
86 | 
87 | To let bb\_worker make use of the FSAC, we're going to add a
88 | PrefetchingBuildExecutor that does two things in parallel:
89 | 
90 | 1. Immediately call into an underlying BuildExecutor to launch the
91 |    action as usual.
92 | 1. Load the file system access profile from the FSAC, and fetch files in
93 |    the input root that are matched by the Bloom filter contained in the
94 |    profile.
95 | 
96 | The worker and the action will thus 'race' with each other to see which
97 | can fetch the input files first. If the action wins, we terminate the
98 | prefetching.
99 | 
100 | Furthermore, we will extend the virtual file system layer to include the
101 | hooks that are necessary to measure which parts of the input root are
102 | being read. We can wrap the existing InitialContentsFetcher and Leaf
103 | types to detect reads against directories and files, respectively. As
104 | these types are strongly coupled to the virtual file system layer, we
105 | will add new {Unread,Read}DirectoryMonitor interfaces that
106 | PrefetchingBuildExecutor can implement to listen for file system
107 | activity in a loosely coupled fashion.
108 | 
109 | Once execution has completed and PrefetchingBuildExecutor has captured
110 | all of the paths that were accessed, it can write a new file system
111 | access profile into the FSAC. This only needs to be done if the profile
112 | differs from the one that was fetched from the FSAC at the start of the
113 | build.
114 | 
115 | # Tying the FSAC into bb\_browser
116 | 
117 | The file system access profiles in the FSAC may also be of use for
118 | people designing their own build rules. Namely, the profiles may be used
119 | to determine whether the input root contains data that can be omitted.
120 | Because of that, we will extend bb\_browser to overlay this information
121 | on top of the input root that's displayed on the action's page.
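
Both of these consumers ultimately rely on the same primitive: testing whether a
given input root path is matched by the Bloom filter stored in a profile. The
sketch below illustrates that primitive in Go. It is only a sketch: the function
name, its parameters, and the double-hashing scheme are hypothetical and do not
describe the actual encoding of FSAC profiles. As an aside, with the sizing
mentioned earlier (roughly 14 bits per path for a 0.1% false positive rate),
standard Bloom filter theory suggests using about ten hash functions.

```go
// Illustration only: probing a Bloom filter based file system access
// profile. Names and the hashing scheme are hypothetical; the real
// format is defined by the FSAC's Protobuf messages.
package main

import (
	"fmt"
	"hash/fnv"
)

// pathMaybeAccessed returns true if the path is possibly contained in
// the Bloom filter. False positives are possible, false negatives are
// not.
func pathMaybeAccessed(filter []byte, hashFunctions uint32, path string) bool {
	bits := uint64(len(filter)) * 8
	if bits == 0 {
		return false
	}
	// Derive two base hashes from the path and combine them to obtain
	// the probe positions (Kirsch-Mitzenmacher double hashing).
	h := fnv.New64a()
	h.Write([]byte(path))
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31
	for i := uint64(0); i < uint64(hashFunctions); i++ {
		bit := (h1 + i*h2) % bits
		if filter[bit/8]&(1<<(bit%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// An all-zero filter matches no paths at all.
	fmt.Println(pathMaybeAccessed(make([]byte, 16), 10, "external/sdk/bin/compiler"))
}
```

A false positive in such a lookup merely causes bb\_worker to prefetch a file
that ends up unused, or bb\_browser to mark an unused file as accessed, which
matches the tradeoffs described at the start of this ADR.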
122 | 
123 | # Future work
124 | 
125 | Another interesting use case for the profiles stored in the FSAC is that
126 | they could be used to perform weaker lookups against the Action Cache.
127 | This would allow us to skip execution of actions in cases where we know
128 | that changes have only been made to files that are not going to be
129 | accessed anyway.
130 | 
--------------------------------------------------------------------------------
/0011-rendezvous-hashing.md:
--------------------------------------------------------------------------------
1 | # Buildbarn Architecture Decision Record #11: Rendezvous Hashing
2 | 
3 | Author: Benjamin Ingberg<br/>
4 | Date: 2025-04-08
5 | 
6 | # Context
7 | 
8 | Resharding a Buildbarn cluster, that is, changing the number, order, or weight
9 | of its shards, is a very disruptive process today. It essentially shuffles all
10 | the blobs around to new shards, forcing you either to drop all old state or to
11 | keep a period with a read fallback configuration where the old state can be fetched.
12 | 
13 | Buildbarn supports using drained backends for sharding, and even adding an unused
14 | placeholder backend at the end of its sharding array, to mitigate this situation.
15 | Resharding is, however, a rare operation, which means that without a significant
16 | amount of automation there is a significant risk of it being performed incorrectly.
17 | 
18 | This ADR describes how we can make resharding more accessible by changing the
19 | underlying algorithm to one that is stable across resharding operations.
20 | 
21 | # Issues During Resharding
22 | 
23 | You might not have the ability to spin up a secondary set of storage shards to
24 | perform a read fallback configuration over. This is a common situation in an
25 | on-prem environment, where running two copies of your production environment may
26 | not be feasible.
27 | 
28 | This is not necessarily a blocker. You can reuse shards from the old topology
29 | in your new topology. However, this has a risk of significantly reducing your
30 | retention time, since data must be stored according to the addressing schema of
31 | both the new and the old topology simultaneously.
32 | 
33 | While it is possible to reduce the amount of address space that is resharded
34 | by using drained backends, this requires advance planning.
35 | 
36 | # Improvements
37 | 
38 | In this ADR we attempt to improve the resharding experience by minimizing the
39 | difference between different sharding topologies.
40 | 
41 | ## Better Overlap Between Sharding Topologies
42 | 
43 | Currently, two different sharding topologies, even if they share nodes, will
44 | have only a small overlap between their addressing schemas. This can be
45 | significantly improved by using a different sharding algorithm.
46 | 
47 | For this purpose, we replace the implementation of
48 | `ShardingBlobAccessConfiguration` with one that uses [Rendezvous
49 | hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
50 | is a lightweight and stateless technique for distributed hash tables. It has
51 | low overhead and causes minimal disruption during resharding.
52 | 
53 | Sharding with Rendezvous hashing gives us the following properties:
54 | * Removing a shard is _guaranteed_ to only require resharding for the blobs
55 |   that resolved to the removed shard.
56 | * Adding a shard will reshard any blob to the new shard with a probability of
57 |   `weight/total_weight`.
58 | 
59 | This effectively means adding or removing a shard triggers a predictable,
60 | minimal amount of resharding, eliminating the need for drained backends.
61 | 
62 | ```proto
63 | message ShardingBlobAccessConfiguration {
64 |   message Shard {
65 |     // unchanged
66 |     BlobAccessConfiguration backend = 1;
67 |     // unchanged
68 |     uint32 weight = 2;
69 |   }
70 |   // NEW:
71 |   // Shards are identified by a key within the context of this sharding
72 |   // configuration. The key is a freeform string which describes the identity
73 |   // of the shard in the context of the current sharding configuration.
74 |   // Shards are chosen via Rendezvous hashing based on the digest, weight and
75 |   // key of the configuration.
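  // Concretely, each shard is scored as -weight / log(h), where h is a
  // hash of the blob digest and the shard key mapped into (0, 1), and
  // the blob is placed on the shard with the highest score (see the
  // 'Implementation' section below).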
76 |   //
77 |   // When removing a shard from the map it is guaranteed that only blobs
78 |   // which resolved to the removed shard will get a different shard. When
79 |   // adding shards there is a weight/total_weight probability that any given
80 |   // blob will be resolved to the new shard.
81 |   map<string, Shard> shards = 2;
82 |   // NEW:
83 |   message Legacy {
84 |     // Order of the shards for the legacy schema. Each key here refers to
85 |     // a corresponding key in the 'shards' map, or null for drained backends.
86 |     repeated string shard_order = 1;
87 |     // Hash initialization seed used for the legacy schema.
88 |     uint64 hash_initialization = 2;
89 |   }
90 |   // NEW:
91 |   // A temporary legacy mode which allows clients to use storage backends which
92 |   // are sharded with the old sharding topology implementation. Consumers are
93 |   // expected to migrate in a timely fashion; support for the legacy schema
94 |   // will be removed by 2025-12-31.
95 |   Legacy legacy = 3;
96 | }
97 | ```
98 | 
99 | ## Migration Instructions
100 | 
101 | If you are fine with dropping your data and repopulating your cache, you can
102 | start using the new sharding algorithm right away. For clarity, we include an
103 | example of how to perform a non-destructive migration.
104 | 
105 | ### Configuration before Migration
106 | 
107 | Given the following configuration showing an unmirrored sharding topology of two
108 | shards:
109 | 
110 | ```jsonnet
111 | {
112 |   ...
113 |   contentAddressableStorage: {
114 |     sharding: {
115 |       hashInitialization: 11946695773637837490,
116 |       shards: [
117 |         {
118 |           backend: {
119 |             grpc: {
120 |               address: 'storage-0:8980',
121 |             },
122 |           },
123 |           weight: 1,
124 |         },
125 |         {
126 |           backend: {
127 |             grpc: {
128 |               address: 'storage-1:8980',
129 |             },
130 |           },
131 |           weight: 1,
132 |         },
133 |       ],
134 |     },
135 |   },
136 |   ...
137 | }
138 | ```
139 | 
140 | ### Equivalent Configuration in New Format
141 | 
142 | We first want to non-destructively represent the exact same topology with the
143 | new configuration. The legacy parameter used here makes Buildbarn internally
144 | use the old sharding algorithm; the keys can be arbitrary string values.
145 | 
146 | ```jsonnet
147 | {
148 |   ...
149 |   contentAddressableStorage: {
150 |     sharding: {
151 |       shards: {
152 |         "0": {
153 |           backend: { grpc: { address: 'storage-0:8980' } },
154 |           weight: 1,
155 |         },
156 |         "1": {
157 |           backend: { grpc: { address: 'storage-1:8980' } },
158 |           weight: 1,
159 |         },
160 |       },
161 |       legacy: {
162 |         shardOrder: ["0", "1"],
163 |         hashInitialization: 11946695773637837490,
164 |       },
165 |     },
166 |   },
167 |   ...
168 | }
169 | ```
170 | 
171 | ### Intermediate State with Fallback (Optional)
172 | 
173 | If you want to preserve cache content during the migration, you can add this
174 | intermediate step, which uses the new addressing schema as its primary
175 | configuration but falls back to the old addressing schema for missing blobs.
176 | 
177 | If you are willing to drop the cache, this step can be skipped.
178 | 
179 | ```jsonnet
180 | {
181 |   ...
182 |   contentAddressableStorage: {
183 |     readFallback: {
184 |       primary: {
185 |         sharding: {
186 |           shards: {
187 |             "0": {
188 |               backend: { grpc: { address: 'storage-0:8980' } },
189 |               weight: 1,
190 |             },
191 |             "1": {
192 |               backend: { grpc: { address: 'storage-1:8980' } },
193 |               weight: 1,
194 |             },
195 |           },
196 |         },
197 |       },
198 |       secondary: {
199 |         sharding: {
200 |           shards: {
201 |             "0": {
202 |               backend: { grpc: { address: 'storage-0:8980' } },
203 |               weight: 1,
204 |             },
205 |             "1": {
206 |               backend: { grpc: { address: 'storage-1:8980' } },
207 |               weight: 1,
208 |             },
209 |           },
210 |           legacy: {
211 |             shardOrder: ["0", "1"],
212 |             hashInitialization: 11946695773637837490,
213 |           },
214 |         },
215 |       },
216 |       replicator: { ... }
217 |     },
218 |   },
219 |   ...
220 | }
221 | ```
222 | 
223 | ### Final Configuration
224 | 
225 | When the cluster has used the read fallback configuration for an acceptable
226 | amount of time, you can drop the fallback and keep only the new configuration.
227 | 
228 | ```jsonnet
229 | {
230 |   ...
231 |   contentAddressableStorage: {
232 |     sharding: {
233 |       shards: {
234 |         "0": {
235 |           backend: { grpc: { address: 'storage-0:8980' } },
236 |           weight: 1,
237 |         },
238 |         "1": {
239 |           backend: { grpc: { address: 'storage-1:8980' } },
240 |           weight: 1,
241 |         },
242 |       },
243 |     },
244 |   },
245 |   ...
246 | }
247 | ```
248 | 
249 | ## Implementation
250 | 
251 | Rendezvous hashing was chosen for its simplicity of implementation, its
252 | mathematical elegance, and the fact that it needs no parameterization in configuration.
253 | 
254 | > Of all shards, pick the one with the largest $-\frac{weight}{\log(h)}$, where `h` is a
255 | > hash of the key/node pair mapped to (0, 1).
256 | 
257 | The implementation uses an optimized fixed-point integer calculation. Checking a
258 | single shard takes approximately 5 ns on a Ryzen 7 7840U. Running integration
259 | benchmarks on large `FindMissingBlobs` calls shows that the performance is
260 | within variance until we reach the 10k+ shard mark.
261 | 
262 | ```
263 | goos: linux
264 | goarch: amd64
265 | cpu: AMD Ryzen 7 7840U w/ Radeon 780M Graphics
266 | BenchmarkSharding10-16       17619    677008 ns/op
267 | BenchmarkSharding100-16       6465   1636342 ns/op
268 | BenchmarkSharding1000-16      1488   8842389 ns/op
269 | BenchmarkSharding10000-16      783  18426834 ns/op
270 | BenchmarkLegacy10-16         16959    704753 ns/op
271 | BenchmarkLegacy100-16         6562   1574817 ns/op
272 | BenchmarkLegacy1000-16        1558   8045423 ns/op
273 | BenchmarkLegacy10000-16        958  12146523 ns/op
274 | ```
275 | 
276 | ## Other Algorithms
277 | 
278 | There are some minor advantages and disadvantages to the other algorithms, in
279 | particular with regard to how computationally heavy they are. However, as noted
280 | above, the performance characteristics are not relevant for clusters below 10k
281 | shards.
282 | 
283 | Other algorithms considered were [Consistent
284 | hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
285 | [Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).
286 | 
287 | ### Consistent Hashing
288 | 
289 | Consistent Hashing can be seen as a special case of Rendezvous hashing, with the
290 | advantage that it can be precomputed into a sorted list, in which we can binary
291 | search for the target shard in `log(W)` time. This is in contrast to Rendezvous
292 | hashing, which takes `N` time to search.
293 | 
294 | * `N` is the number of shards
295 | * `W` is the sum of weights in its reduced form (i.e. smallest weight is 1 and
296 |   all weights are integers)
297 | 
298 | There are some concrete disadvantages to Consistent Hashing. A major one is that
299 | it is only unbiased on average; any given sharding configuration with Consistent
300 | Hashing will have a bias towards one of the shards.
301 | 
302 | The performance advantages of Consistent Hashing are also unlikely to
303 | materialize in practice. A straight linear search with Rendezvous hashing is often
304 | faster than a binary search for reasonable N. Rendezvous hashing is also independent
305 | of the weight of the individual shards, which improves its relative performance
306 | for shards that are not evenly weighted.
307 | 
308 | ### Maglev
309 | 
310 | Maglev, in turn, uses an iterative method that precomputes an evenly distributed
311 | lookup table of an arbitrary size. After the precomputation step, the target
312 | shard is found by a simple lookup of the key modulo the size of the table.
313 | 
314 | The disadvantage of Maglev is that it aims for minimal resharding rather than
315 | optimal resharding. When removing a shard from the set of all shards, the
316 | algorithm tolerates a small amount of unnecessary resharding. This is on the
317 | order of a few percent, depending on the number of shards and the size of the
318 | lookup table.
319 | 
320 | Precomputing the lookup table is also a fairly expensive operation, which impacts
321 | the startup time of components or requires managing the table state externally.
322 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Buildbarn Architecture Decision Records
2 | 
3 | This repository contains a set of
4 | [Architecture Decision Records](https://github.com/joelparkerhenderson/architecture_decision_record)
5 | that apply to Buildbarn. The idea is that every time we make a
6 | large-scale change to Buildbarn, we add an article to this repository
7 | that explains the reasoning behind it.
8 | 
9 | These articles don't act as user documentation, but at least explain the
10 | reasoning that went into designing some of Buildbarn's features. They
11 | are also not updated over time, meaning that each of the articles is
12 | written in the context of what Buildbarn looked like at the time.
13 | 
--------------------------------------------------------------------------------
/images/0005/before.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/buildbarn/bb-adrs/bf2066633e1712a3ef7c295d37cd52e65867391d/images/0005/before.png
--------------------------------------------------------------------------------
/images/0005/persistent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/buildbarn/bb-adrs/bf2066633e1712a3ef7c295d37cd52e65867391d/images/0005/persistent.png
--------------------------------------------------------------------------------
/images/0005/volatile.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/buildbarn/bb-adrs/bf2066633e1712a3ef7c295d37cd52e65867391d/images/0005/volatile.png
--------------------------------------------------------------------------------