├── .gitignore ├── CODEOWNERS ├── LICENSE ├── NOTICE ├── README.md ├── adr ├── 0000-use-architectural-decision-records.md ├── README.md ├── index.md └── template.md ├── notes ├── 113340231-spike-benchmarks-cf-mysql.md ├── 114130447-cf-mysql-stress-tests.md ├── 115893407-diego-and-consul-failure-modes.md ├── 117935875-cf-mysql-with-consul.md ├── 123482771-spike-influxdb-graphana.md ├── 125596231-usability-of-context.md ├── debug-app-crash.md ├── diego_deployments_with_longer_term_period.md ├── distributed-locking-and-presence.md ├── grafana-sample.json ├── logging-guidance.md └── lrp-task-states-and-transitions.md └── proposals ├── README.md ├── bbs-migration-stories.prolific.md ├── bbs-migrations.md ├── better-buildpack-caching.md ├── bind-mounting-downloads-stories.prolific.md ├── bind-mounting-downloads.md ├── bosh-deployments.graffle ├── bosh-deployments.md ├── bosh-deployments.png ├── desired_lrp_update_API.md ├── desired_lrp_update_extension.md ├── diego_versioning.svg ├── docker_registry_caching.md ├── docker_registry_configuration.md ├── faster-missing-cell-recovery.md ├── go-modules-migrations.md ├── lifecycle-design.md ├── measuring_performance.md ├── per-application-crash-configuration-stories.csv ├── per-application-crash-configuration.md ├── placement-constraints-stories.csv ├── placement_pools.md ├── private-docker-registry.md ├── relational-bbs-db-stories-2016-02-16.prolific.md ├── relational-bbs-db-stories-2016-04-18.prolific.md ├── relational-bbs-db-stories-2016-05-10.prolific.md ├── relational-bbs-db.md ├── release-versioning-testing-stories.prolific.md ├── release-versioning-testing-v2.md ├── release-versioning-testing.md ├── rolling-out-diego.md ├── routing.md ├── schema-migrations-for-locket.md ├── secure-auctioneer-cell-apis.md ├── ssh-one-time-auth-code.md ├── tuning-health-checks-stories.prolific.md ├── tuning-health-checks.md ├── updatable-lrp-api.md └── versioning.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store -------------------------------------------------------------------------------- /CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @cloudfoundry/wg-app-runtime-platform-diego-approvers 2 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015-Present CloudFoundry.org Foundation, Inc. All Rights Reserved. 2 | 3 | This project contains software that is Copyright (c) 2014-2015 Pivotal Software, Inc. 4 | 5 | Licensed under the Apache License, Version 2.0 (the "License"); 6 | you may not use this file except in compliance with the License. 7 | You may obtain a copy of the License at 8 | 9 | http://www.apache.org/licenses/LICENSE-2.0 10 | 11 | Unless required by applicable law or agreed to in writing, software 12 | distributed under the License is distributed on an "AS IS" BASIS, 13 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | See the License for the specific language governing permissions and 15 | limitations under the License. 16 | 17 | This project may include a number of subcomponents with separate 18 | copyright notices and license terms. Your use of these subcomponents 19 | is subject to the terms and conditions of each subcomponent's license, 20 | as noted in the LICENSE file. 
21 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Diego Notes 2 | 3 | This folder contains development notes on the Diego project. 4 | 5 | - The [adr](./adr) folder contains architectural decision records. See its 6 | [README.md](./adr/README.md) for more information. 7 | - The [notes](./notes) folder contains archived documentation related to a 8 | specific subject or story. The purpose of these documents is to give an idea of 9 | the initial motivation and architectural decisions related to that topic at 10 | that point in time. These documents have not been updated since the work on a 11 | story was finished and should not be treated as a source of truth for the 12 | current state of the system. 13 | - The [proposals](./proposals) folder contains documents that describe ideas and 14 | solutions that could solve a particular problem. 15 | 16 | If you would like to see documentation on a specific topic, please open an 17 | issue in this repo. 18 | -------------------------------------------------------------------------------- /adr/0000-use-architectural-decision-records.md: -------------------------------------------------------------------------------- 1 | # Use Markdown Architectural Decision Records 2 | 3 | ## Context and Problem Statement 4 | 5 | We want to record architectural decisions made in this project. 6 | Which format and structure should these records follow? 7 | 8 | ## Considered Options 9 | 10 | * [MADR](https://adr.github.io/madr/) 2.1.0 - The Markdown Architectural Decision Records 11 | * [Michael Nygard's template](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions) - The first incarnation of the term "ADR" 12 | * [Sustainable Architectural Decisions](https://www.infoq.com/articles/sustainable-architectural-design-decisions) - The Y-Statements 13 | * Other templates listed at 14 | * Formless - No conventions for file format and structure 15 | 16 | ## Decision Outcome 17 | 18 | Chosen option: "MADR 2.1.0", because 19 | 20 | * Implicit assumptions should be made explicit. 21 | Design documentation is important to enable people to understand the decisions later on. 22 | See also [A rational design process: How and why to fake it](https://doi.org/10.1109/TSE.1986.6312940). 23 | * The MADR format is lean and fits our development style. 24 | * The MADR structure is comprehensible and facilitates usage & maintenance. 25 | * The MADR project is actively maintained. 26 | * Version 2.1.0 was the latest version available when we started to document ADRs. 27 | -------------------------------------------------------------------------------- /adr/README.md: -------------------------------------------------------------------------------- 1 | # Diego Architectural Decision Records 2 | 3 | This folder contains [ADRs (Architectural Decision Records)](https://adr.github.io/) for the Diego project. 4 | 5 | ## Create a new ADR 6 | 7 | 1. Copy `template.md` to `NNNN-title-with-dashes.md`, where `NNNN` indicates the next number in sequence. 8 | 1. Edit `NNNN-title-with-dashes.md`. 9 | 1. Update `index.md`, e.g., by executing `adr-log -i -d .`. You can get [adr-log](https://github.com/adr/adr-log) by executing `npm install -g adr-log`. 
10 | -------------------------------------------------------------------------------- /adr/index.md: -------------------------------------------------------------------------------- 1 | # Architectural Decision Log 2 | 3 | This lists the architectural decisions for Diego. 4 | 5 | 6 | 7 | - [ADR-0000](0000-use-architectural-decision-records.md) - Use Markdown Architectural Decision Records 8 | 9 | 10 | 11 | [template.md](template.md) contains the template. 12 | More information on MADR is available at . 13 | -------------------------------------------------------------------------------- /adr/template.md: -------------------------------------------------------------------------------- 1 | # [short title of solved problem and solution] 2 | 3 | * Status: [accepted | superseeded by [ADR-0005](0005-example.md) | deprecated | …] 4 | * Deciders: [list everyone involved in the decision] 5 | * Date: [YYYY-MM-DD when the decision was last updated] 6 | 7 | Technical Story: [description | ticket/issue URL] 8 | 9 | ## Context and Problem Statement 10 | 11 | [Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.] 12 | 13 | ## Decision Drivers 14 | 15 | * [driver 1, e.g., a force, facing concern, …] 16 | * [driver 2, e.g., a force, facing concern, …] 17 | * … 18 | 19 | ## Considered Options 20 | 21 | * [option 1] 22 | * [option 2] 23 | * [option 3] 24 | * … 25 | 26 | ## Decision Outcome 27 | 28 | Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)]. 29 | 30 | ### Positive Consequences 31 | 32 | * [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …] 33 | * … 34 | 35 | ### Negative consequences 36 | 37 | * [e.g., compromising quality attribute, follow-up decisions required, …] 38 | * … 39 | 40 | ## Pros and Cons of the Options 41 | 42 | ### [option 1] 43 | 44 | [example | description | pointer to more information | …] 45 | 46 | * Good, because [argument a] 47 | * Good, because [argument b] 48 | * Bad, because [argument c] 49 | * … 50 | 51 | ### [option 2] 52 | 53 | [example | description | pointer to more information | …] 54 | 55 | * Good, because [argument a] 56 | * Good, because [argument b] 57 | * Bad, because [argument c] 58 | * … 59 | 60 | ### [option 3] 61 | 62 | [example | description | pointer to more information | …] 63 | 64 | * Good, because [argument a] 65 | * Good, because [argument b] 66 | * Bad, because [argument c] 67 | * … 68 | 69 | ## Links 70 | 71 | * [Link type] [Link to ADR] 72 | * … 73 | -------------------------------------------------------------------------------- /notes/113340231-spike-benchmarks-cf-mysql.md: -------------------------------------------------------------------------------- 1 | Running everything from the same VM is hard. We had to implement semaphores to 2 | get around a problem with having too many connections to the BBS in parallel. 3 | We had satisfactory results with the CF-MySQL release even though they're 4 | slightly worse than RDS. We think that once the real BBS schema is fully 5 | flushed out we can optimize certain aspects of the complexity we have with 6 | convergence, and the other bulkers. 7 | 8 | One thing to note from the experiment is that there might be artificial 9 | slowdowns because of the single VM acting as multiple components. 
We decided 10 | that for now the complexity of devising a distributed benchmarks suite is too 11 | costly to be justifiable. 12 | 13 | ## Experiment #5 (full benchmark run) 14 | 15 | ### Config: 16 | 17 | - CF-MySQL 3 nodes talking to one node directly 18 | - `num_populate_workers`: `500` 19 | - `num_reps`: `1000` 20 | - `desired_lrps`: `200000` 21 | - Semaphore around all HTTP calls on the BBS client (25,000 resources) 22 | - `MaxIdleConnsPerHost` set to 25,000 as well on the BBS client 23 | 24 | ### Results: 25 | 26 | ``` 27 | Running Suite: Benchmark BBS Suite 28 | ================================== 29 | Random Seed: 1457549669 30 | Will run 4 of 4 specs 31 | 32 | • [MEASUREMENT] 33 | Fetching for Route Emitter 34 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/route_emitter_fetch_test.go:33 35 | data for route emitter 36 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/route_emitter_fetch_test.go:32 37 | 38 | Ran 10 samples: 39 | {FetchActualLRPsAndSchedulingInfos} 40 | fetch all actualLRPs: 41 | Fastest Time: 17.671s 42 | Slowest Time: 51.122s 43 | Average Time: 37.447s ± 15.995s 44 | ------------------------------ 45 | [====================================================================================================] 10000/10000 46 | Done with all loops, waiting for queue to clear out... 47 | [====================================================================================================] 2000000/2000000 48 | • [MEASUREMENT] 49 | Fetching for rep bulk loop 50 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:114 51 | data for rep bulk 52 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:113 53 | 54 | Ran 1 samples: 55 | {RepBulkFetching} 56 | rep bulk fetch: 57 | Fastest Time: 0.006s 58 | Slowest Time: 7.160s 59 | Average Time: 0.078s ± 0.269s 60 | {RepBulkLoop} 61 | rep bulk loop: 62 | Fastest Time: 0.008s 63 | Slowest Time: 7.163s 64 | Average Time: 0.089s ± 0.299s 65 | {RepStartActualLRP} 66 | start actual LRP: 67 | Fastest Time: 0.002s 68 | Slowest Time: 120.636s 69 | Average Time: 0.163s ± 2.066s 70 | {RepClaimActualLRP} 71 | claim actual LRP: 72 | Fastest Time: 0.007s 73 | Slowest Time: 120.522s 74 | Average Time: 0.434s ± 0.896s 75 | ------------------------------ 76 | • [MEASUREMENT] 77 | Fetching for nsync bulker 78 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/nsync_bulker_fetch_test.go:29 79 | DesiredLRPs 80 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/nsync_bulker_fetch_test.go:28 81 | 82 | Ran 10 samples: 83 | {NsyncBulkerFetching} 84 | fetch all desired LRP scheduling info: 85 | Fastest Time: 14.950s 86 | Slowest Time: 15.450s 87 | Average Time: 15.127s ± 0.140s 88 | ------------------------------ 89 | • [MEASUREMENT] 90 | Gathering 91 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/lrp_convergence_gather_test.go:33 92 | data for convergence 93 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/lrp_convergence_gather_test.go:32 94 | 95 | Ran 10 samples: 96 | {ConvergenceGathering} 97 | BBS' internal gathering of LRPs: 98 | Fastest Time: 21.318s 99 | Slowest Time: 28.017s 100 | Average Time: 26.993s ± 1.904s 101 | ------------------------------ 102 | 103 | Ran 4 of 4 Specs in 3514.024 seconds 104 | SUCCESS! 
-- 4 Passed | 0 Failed | 0 Pending | 0 Skipped PASS 105 | 106 | Ginkgo ran 1 suite in 58m35.914623967s 107 | Test Suite Passed 108 | ``` 109 | 110 | ### Conclusions: 111 | 112 | We can make the HA MySQL release works assuming they solve the proxy/load 113 | balancing issues. Convergence is a little slow still, but there's hope that 114 | with a different schema we can really improve how we do convergence in general, 115 | same holds true for nsync and route-emitter. 116 | 117 | ## Experiment #4 (focused on rep-bulk cycles) 118 | 119 | ### Config: 120 | 121 | - CF-MySQL 3 nodes talking to one node directly 122 | - `num_populate_workers`: `500` 123 | - `num_reps`: `1000` 124 | - `desired_lrps`: `200000` 125 | - Semaphore around all HTTP calls on the BBS client (25,000 resources) 126 | - `MaxIdleConnsPerHost` set to 25,000 as well on the BBS client 127 | 128 | ### Results: 129 | 130 | ``` 131 | • [MEASUREMENT] 132 | Fetching for rep bulk loop 133 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:114 134 | data for rep bulk 135 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:113 136 | 137 | Ran 1 samples: 138 | {RepBulkFetching} 139 | rep bulk fetch: 140 | Fastest Time: 0.005s 141 | Slowest Time: 3.409s 142 | Average Time: 0.060s ± 0.171s 143 | {RepBulkLoop} 144 | rep bulk loop: 145 | Fastest Time: 0.008s 146 | Slowest Time: 3.413s 147 | Average Time: 0.066s ± 0.175s 148 | {RepStartActualLRP} 149 | start actual LRP: 150 | Fastest Time: 0.002s 151 | Slowest Time: 121.628s 152 | Average Time: 0.116s ± 2.025s 153 | {RepClaimActualLRP} 154 | claim actual LRP: 155 | Fastest Time: 0.007s 156 | Slowest Time: 121.213s 157 | Average Time: 0.345s ± 0.979s 158 | ------------------------------ 159 | SS 160 | Ran 1 of 4 Specs in 1820.280 seconds 161 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 3 Skipped PASS | FOCUSED 162 | ``` 163 | 164 | ### Conclusions: 165 | 166 | "Slowest" on claim/start is due to the artificial limits we have with the 167 | semaphore on the BBS client. Could be mitigated by having a distributed 168 | benchmark suite. Going forward we should probably move that semaphore up to 169 | the benchmark code, since it doesn't make too much sense for it to exist on the 170 | BBS client (that many requests in parallel is not a realistic use case). 
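For reference, the "semaphore" in these configs is just a bounded-concurrency guard around the BBS client calls. Below is a minimal, self-contained sketch of the pattern (illustrative only, not the actual benchmark-bbs code, and scaled far down from the 25,000 slots used above):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// A buffered channel acts as a counting semaphore; its capacity bounds
	// the number of in-flight "BBS requests" (25,000 in the runs above).
	sem := make(chan struct{}, 3)

	callBBS := func(id int) {
		sem <- struct{}{}        // acquire a slot, blocking once the limit is hit
		defer func() { <-sem }() // release the slot when the call returns

		time.Sleep(10 * time.Millisecond) // stand-in for an HTTP call to the BBS
		fmt.Println("finished call", id)
	}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			callBBS(id)
		}(i)
	}
	wg.Wait()
}
```

Whether this guard belongs in the BBS client or in the benchmark suite is exactly the question raised in the conclusion above.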
171 | 172 | ## Experiment #3 (focused on rep-bulk cycles) 173 | 174 | ### Config: 175 | 176 | - CF-MySQL 3 nodes talking to one node directly 177 | - `num_populate_workers`: `500` 178 | - `num_reps`: `250` 179 | - `desired_lrps`: `50000` 180 | - Semaphore around all HTTP calls on the BBS client (25,000 resources) 181 | - `MaxIdleConnsPerHost` set to 25,000 as well on the BBS client 182 | 183 | ### Results: 184 | 185 | ``` 186 | ------------------------------ 187 | • [MEASUREMENT] 188 | Fetching for rep bulk loop 189 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:114 190 | data for rep bulk 191 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:113 192 | 193 | Ran 1 samples: 194 | {RepBulkFetching} 195 | rep bulk fetch: 196 | Fastest Time: 0.005s 197 | Slowest Time: 1.093s 198 | Average Time: 0.011s ± 0.048s 199 | {RepBulkLoop} 200 | rep bulk loop: 201 | Fastest Time: 0.007s 202 | Slowest Time: 1.096s 203 | Average Time: 0.014s ± 0.049s 204 | {RepStartActualLRP} 205 | start actual LRP: 206 | Fastest Time: 0.002s 207 | Slowest Time: 120.516s 208 | Average Time: 0.157s ± 0.609s 209 | {RepClaimActualLRP} 210 | claim actual LRP: 211 | Fastest Time: 0.007s 212 | Slowest Time: 3.162s 213 | Average Time: 0.028s ± 0.087s 214 | ------------------------------ 215 | S 216 | Ran 1 of 4 Specs in 661.736 seconds 217 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 3 Skipped PASS | FOCUSED 218 | ``` 219 | 220 | Conclusions: 221 | 222 | Semaphores work, let's move on to 200,000 again. 223 | 224 | ## Experiment #2 (focused on rep-bulk cycles) 225 | 226 | ### Config: 227 | 228 | - CF-MySQL 3 nodes talking to one node directly 229 | - `num_populate_workers`: `100` 230 | - `num_reps`: `250` 231 | - `desired_lrps`: `50000` 232 | - Workpool of size 25,000 only around BBS client calls 233 | - `MaxIdleConnsPerHost` set to 25,000 as well on the BBS client 234 | 235 | ### Results: 236 | 237 | **None** 238 | 239 | ### Conclusions: 240 | 241 | We cancelled this experiment because our workpool implementation seems to have 242 | issues with large pool sizes. The tests were taking way too long to finish and 243 | we aborted them. Next experiment will replace the workpool with a semaphore. 
244 | 245 | ## Experiment #1 (focused on rep-bulk cycles) 246 | 247 | ### Config: 248 | 249 | - CF-MySQL 3 nodes talking to one node directly 250 | - `num_populate_workers`: `100` 251 | - `num_reps`: `250` 252 | - `desired_lrps`: `50000` 253 | - No workpool 254 | 255 | ### Results: 256 | 257 | ``` 258 | • [MEASUREMENT] 259 | Fetching for rep bulk loop 260 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:84 261 | data for rep bulk 262 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:83 263 | 264 | Ran 1 samples: 265 | {RepBulkFetching} 266 | rep bulk fetch: 267 | Fastest Time: 0.005s 268 | Slowest Time: 1.610s 269 | Average Time: 0.016s ± 0.086s 270 | {RepBulkLoop} 271 | rep bulk loop: 272 | Fastest Time: 0.007s 273 | Slowest Time: 1.612s 274 | Average Time: 0.021s ± 0.088s 275 | {RepStartActualLRP} 276 | start actual LRP: 277 | Fastest Time: 0.003s 278 | Slowest Time: 4.093s 279 | Average Time: 0.054s ± 0.164s 280 | {RepClaimActualLRP} 281 | claim actual LRP: 282 | Fastest Time: 0.007s 283 | Slowest Time: 1.639s 284 | Average Time: 0.038s ± 0.128s 285 | ------------------------------ 286 | SSS 287 | Ran 1 of 4 Specs in 661.140 seconds 288 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 3 Skipped PASS | FOCUSED 289 | ``` 290 | 291 | ### Conclusions: 292 | 293 | Given the success running the experiment at this size, we thought 50,000/250 294 | would be a decent size to split the experiment on if we were to distribute it. 295 | We thought about running multiple instances of the benchmark-bbs errand in 296 | parallel, but that is not supported by BOSH. 297 | The alternative would be making multiple deployments with the same errand with 298 | different parameters and synchronizing the startup of those. Seems fragile. 299 | We could also try to devise an smarter orchestrated test suite that spins up 300 | tasks on a Diego deployment, but also seems hard. Trying with a workpool first. 301 | 302 | -------------------------------------------------------------------------------- /notes/114130447-cf-mysql-stress-tests.md: -------------------------------------------------------------------------------- 1 | We want to know how SQL behaves in golang when using the CF-MySQL HA release. 2 | We're also interested in knowing what behavior the current proxy (switchboard) 3 | presents in face of a cluster outage. 4 | 5 | We devised scenarios to observe those behaviors. Those are documented 6 | [here](https://github.com/luan/cf-mysql-proxy-stress-test). 7 | 8 | ## Grand Conclusion 9 | 10 | The go sql library does the right thing: it fails existing connections when the 11 | pipe is broken, and recreates them next time you try to use it. Switchboard 12 | seems to react pretty quickly to a server going down, allowing the go sql 13 | library to do its job. No major conclusions as far as next steps, which is 14 | good, looks like we don't have to do much or don't have a lot to complain about 15 | to MySQL just now. 16 | 17 | ## Scenario 1 18 | 19 | success on all cases 20 | 21 | ## Scenario 2 22 | stop non leader: success 23 | 24 | stop leader: flakes when you restart because the connection gets dropped, moves 25 | on to succeeding since switchboard starts routing them to the new leader. 
26 | 27 | ## Scenario 3 28 | stop non leader: success 29 | 30 | stop leader: 2 records lost 31 | 32 | ``` 33 | Kill the leader now and press ENTER2016/03/15 22:32:39 failed to commit Transaction: Error 1047: WSREP has not yet prepared node for application use 34 | [mysql] 2016/03/15 22:32:39 packets.go:33: unexpected EOF 35 | [mysql] 2016/03/15 22:32:39 packets.go:33: unexpected EOF 36 | [mysql] 2016/03/15 22:32:39 packets.go:124: write tcp 10.10.22.11:37713->10.10.22.11:3306: write: broken pipe 37 | [mysql] 2016/03/15 22:32:41 packets.go:33: unexpected EOF 38 | 2016/03/15 22:32:41 failed to begin transaction: driver: bad connection 39 | [mysql] 2016/03/15 22:32:41 packets.go:33: unexpected EOF 40 | 41 | 2016/03/15 22:32:50 Expected 42 | : 2799 43 | to equal 44 | : 2801 45 | ``` 46 | 47 | ## Scenario 4 48 | 49 | stop non leader: success 50 | restart non leader: success 51 | 52 | stop leader: flakes when you restart because the connection gets dropped, moves 53 | on to succeeding since switchboard starts routing them to the new leader. 54 | 55 | ## Scenario 5 56 | 57 | stop non leader: lost 8 records (with 10 maxConnections) lost 98 records (with 100 maxConnections) 58 | 59 | stop leader: lost ~1000 records 60 | restart leader: Causes the VM to be stuck in monit hell. We cannot stop/start even after reloading the monit deamon. Possible bug in ctl script? 61 | 62 | Often it drops n-1 connections and thus has that many fewer records. ex 10, 20 100 63 | At 500, only 495 failed. it's great! When the db dies, the the error 64 | 65 | ``` 66 | 2016/03/15 18:26:12 failed write: Error 1213: Deadlock found when trying to get lock; try restarting transaction 67 | ``` 68 | 69 | is given. It doesn't matter which db is killed, the result is the same. Note 70 | that this is consistent with the results from scenario 3, because =1, thus 71 | n-1=0=number of errors that occured. 72 | 73 | At 1000 connections, there is the following error: 74 | 75 | ``` 76 | [mysql] 2016/03/15 18:28:36 statement.go:27: invalid connection 77 | [mysql] 2016/03/15 18:28:36 packets.go:33: unexpected EOF 78 | ``` 79 | 80 | ## Scenario 6 81 | 82 | same as above 83 | 84 | -------------------------------------------------------------------------------- /notes/115893407-diego-and-consul-failure-modes.md: -------------------------------------------------------------------------------- 1 | # Diego and Consul Failure Modes 2 | 3 | ## Presence (Cell) 4 | 5 | ### Experiment 1 6 | In this experiment, we performed a rolling deploy of a 3 node 7 | consul cluster. We took the consul nodes down one-by-one and noticed that once 8 | we are in the middle of doing leader election, reps stop functioning. 9 | 10 | We also noted that leader election may occasionally take longer than usual in 11 | which case the system took even longer to get back to normal.A 12 | 13 | ## Stories to Create 14 | 15 | 1. If rep cannot set Presence do not indicate it is ready [#118500389](https://www.pivotaltracker.com/story/show/118500389) 16 | 1. It is possible to set duplicate presence similar to BBS locket error seen in 17 | [#117967437](https://www.pivotaltracker.com/story/show/117967437), [#118957221](https://www.pivotaltracker.com/story/show/118957221) 18 | 19 | 20 | ## Locks (route_emitter, bbs, auctioneer, tps-watcher, nsync-bulker, converger) 21 | 22 | ### Experiment 1 23 | In this experiment, we performed a rolling deploy of a 3 node consul cluster. 
24 | We took the consul nodes down one-by-one and noticed that once we are in the 25 | middle of doing leader election, the component currently holding the lock lost 26 | the lock. The components then attempt to reacquire the lock and during the 27 | leader election they are not able to but will retry correctly and eventually 28 | once consul has a new leader the lock can be acquired by one of the instances. 29 | 30 | Impact is that the leaders may change and components will be down until the 31 | leader election completes. 32 | 33 | ## Stories to Create 34 | 35 | 1. Losing the lock during migrations of the BBS does not cause the migration to 36 | exit immediately [#118957301](https://www.pivotaltracker.com/story/show/118957301) 37 | 38 | ## Service registration 39 | 40 | ### Experiment 1 41 | In this experiment, we performed a rolling deploy of a 3 node 42 | consul cluster. We took the consul nodes down one-by-one and noticed that once 43 | we are in the middle of doing leader election the service registrations seem to 44 | persist but are not available during the leader election. 45 | 46 | Once a new leader is elected we can see the service registrations again. 47 | 48 | ## Stories to Create 49 | 50 | 1. Service registration should reregister if consul drops the service 51 | registration [#118957335](https://www.pivotaltracker.com/story/show/118957335) 52 | -------------------------------------------------------------------------------- /notes/117935875-cf-mysql-with-consul.md: -------------------------------------------------------------------------------- 1 | We want to know how the CF-MySQL release can take advantage of consul for 2 | service discovery. 3 | 4 | The release ships with a proxy job that can be horizontally scaled for high 5 | availability, their responsibility is to chose one "master" node from the 6 | Galera cluster and only connect clients to that one until it's no longer 7 | available. 8 | 9 | There is no way to make the proxy HA, though. The CF-MySQL team suggests using 10 | a Load Balancer like Amazon ELB or HAProxy but cautions against it since load 11 | balancing across multiple proxies might cause clients to connect to different 12 | "master" MySQL nodes. That is a problem because it can cause "deadlock" errors 13 | when writing consistently to multiple masters on Galera. 14 | 15 | We first considered removing the proxies entirely and relying on consul 16 | directly on the MySQL nodes. However, if a MySQL node becomes unavailable, 17 | consul will only prevent new connections to the dead node. There's no way for 18 | the consul agent to close existing client connections like the proxies do. 19 | Another problem is when shutting down the master MySQL node the MySQL server 20 | can get into an irrecoverable state that causes established connections from 21 | the clients to hang forever. This seems to be a known issue with tcp 22 | connections that never really got addressed, see 23 | [this](http://www.evanjones.ca/tcp-stuck-connection-mystery.html) for more 24 | details. 25 | 26 | So we experimented with putting consul agents on the proxies instead of the 27 | consul nodes so that only one proxy is being used at a time. It seems that a 28 | single proxy node will consistently connect to one MySQL node (as long as that 29 | node is up), so we shouldn't run into deadlocks. 30 | 31 | The first few experiments were not properly tracked so we're starting from the 32 | middle here. 
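Several of the experiments below run the proxies with consul's lock functionality so that only one proxy is active at a time. For orientation, here is a rough sketch of taking such a lock with the `hashicorp/consul/api` client; the key name and TTL are made up for illustration, and the real proxy/consul wiring is not shown in these notes:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig()) // talks to the local consul agent
	if err != nil {
		log.Fatal(err)
	}

	// Key and TTL are illustrative; whichever proxy holds this lock is "active".
	lock, err := client.LockOpts(&api.LockOptions{
		Key:        "cf-mysql/proxy-leader",
		SessionTTL: "15s",
	})
	if err != nil {
		log.Fatal(err)
	}

	stopCh := make(chan struct{})
	lostCh, err := lock.Lock(stopCh) // blocks until this process holds the lock
	if err != nil {
		log.Fatal(err)
	}
	defer lock.Unlock()

	log.Println("this proxy is now the active one")
	<-lostCh // closes if the session expires or the lock is otherwise lost
	log.Println("lock lost; stop accepting connections")
}
```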
33 | 34 | ## Experiment 1 35 | 36 | ### Config 37 | 38 | * 2 MySQL Nodes 39 | * 2 Proxies (1 with consul agent running) 40 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 41 | * Patched BBS with a retry logic on Deadlock errors 42 | 43 | We stopped the master MySQL during seeding roughly after 20k insertions and 44 | restarted it after 30k. 45 | 46 | The results seem to indicate that this is a viable solution. There were some 47 | deadlock errors when bringing the MySQL node back online but the retry logic 48 | spiked on the BBS mitigated that. 49 | 50 | ### Results 51 | 52 | ```ruby 53 | • [MEASUREMENT] 54 | main benchmark test 55 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_test.go:210 56 | data for benchamrks 57 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_test.go:209 58 | 59 | Ran 1 samples: 60 | map[MetricName:RepBulkFetching] 61 | rep bulk fetch: 62 | Fastest Time: 0.008s 63 | Slowest Time: 1.831s 64 | Average Time: 0.087s ± 0.159s 65 | map[MetricName:RepBulkLoop] 66 | rep bulk loop: 67 | Fastest Time: 0.014s 68 | Slowest Time: 1.837s 69 | Average Time: 0.099s ± 0.161s 70 | map[MetricName:RepStartActualLRP] 71 | start actual LRP: 72 | Fastest Time: 0.010s 73 | Slowest Time: 30.599s 74 | Average Time: 0.292s ± 1.438s 75 | map[MetricName:RepClaimActualLRP] 76 | claim actual LRP: 77 | Fastest Time: 0.018s 78 | Slowest Time: 1.738s 79 | Average Time: 0.118s ± 0.133s 80 | map[MetricName:FetchActualLRPsAndSchedulingInfos] 81 | fetch all actualLRPs: 82 | Fastest Time: 5.633s 83 | Slowest Time: 5.633s 84 | Average Time: 5.633s ± 0.000s 85 | map[MetricName:ConvergenceGathering] 86 | BBS internal gathering of LRPs: 87 | Fastest Time: 6.244s 88 | Slowest Time: 6.244s 89 | Average Time: 6.244s ± 0.000s 90 | map[MetricName:NsyncBulkerFetching] 91 | fetch all desired LRP scheduling info: 92 | Fastest Time: 2.002s 93 | Slowest Time: 2.002s 94 | Average Time: 2.002s ± 0.000s 95 | ------------------------------ 96 | 97 | Ran 1 of 1 Specs in 627.672 seconds 98 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped 99 | 100 | Ginkgo ran 1 suite in 10m31.388780701s 101 | ``` 102 | 103 | ## Experiment 2 104 | 105 | ### Config 106 | 107 | * 2 MySQL Nodes 108 | * 2 Proxies (using the consul lock functionality) 109 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 110 | * Patched BBS with a retry logic on Deadlock errors 111 | 112 | We stopped the master Proxy during seeding roughly after 20k insertions and 113 | restarted it after 30k. 114 | 115 | The results seem to indicate that this is a viable solution. There were some 116 | deadlock errors when bringing the MySQL node back online but the retry logic 117 | spiked on the BBS mitigated that. 
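The deadlock retry patch itself is not included in these notes; the following is a minimal sketch of what retrying on Galera deadlock and WSREP errors can look like with the `go-sql-driver/mysql` driver (helper and package names are ours, not the actual BBS change):

```go
// Illustrative helpers only, not the actual BBS patch.
package bbsretry

import (
	"database/sql"

	"github.com/go-sql-driver/mysql"
)

// retryable reports whether err is one of the transient Galera errors seen in
// these experiments: 1213 (deadlock) or 1047 (WSREP not yet prepared).
func retryable(err error) bool {
	if mysqlErr, ok := err.(*mysql.MySQLError); ok {
		return mysqlErr.Number == 1213 || mysqlErr.Number == 1047
	}
	return false
}

// inTxWithRetries runs fn in a fresh transaction, retrying up to attempts
// times when the transaction fails with a retryable error.
func inTxWithRetries(db *sql.DB, attempts int, fn func(*sql.Tx) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		var tx *sql.Tx
		tx, err = db.Begin()
		if err != nil {
			if retryable(err) {
				continue
			}
			return err
		}

		if err = fn(tx); err == nil {
			err = tx.Commit()
		}
		if err == nil {
			return nil
		}

		tx.Rollback()
		if !retryable(err) {
			return err
		}
	}
	return err
}
```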
118 | 119 | ### Results 120 | 121 | ```ruby 122 | • [MEASUREMENT] 123 | main benchmark test 124 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_test.go:210 125 | data for benchamrks 126 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_test.go:209 127 | 128 | Ran 1 samples: 129 | map[MetricName:RepBulkFetching] 130 | rep bulk fetch: 131 | Fastest Time: 0.007s 132 | Slowest Time: 1.772s 133 | Average Time: 0.084s ± 0.160s 134 | map[MetricName:RepStartActualLRP] 135 | start actual LRP: 136 | Fastest Time: 0.009s 137 | Slowest Time: 4.640s 138 | Average Time: 0.168s ± 0.247s 139 | map[MetricName:RepBulkLoop] 140 | rep bulk loop: 141 | Fastest Time: 0.014s 142 | Slowest Time: 1.779s 143 | Average Time: 0.097s ± 0.163s 144 | map[MetricName:RepClaimActualLRP] 145 | claim actual LRP: 146 | Fastest Time: 0.017s 147 | Slowest Time: 0.456s 148 | Average Time: 0.100s ± 0.070s 149 | map[MetricName:NsyncBulkerFetching] 150 | fetch all desired LRP scheduling info: 151 | Fastest Time: 2.940s 152 | Slowest Time: 2.940s 153 | Average Time: 2.940s ± 0.000s 154 | map[MetricName:ConvergenceGathering] 155 | BBS' internal gathering of LRPs: 156 | Fastest Time: 5.598s 157 | Slowest Time: 5.598s 158 | Average Time: 5.598s ± 0.000s 159 | map[MetricName:FetchActualLRPsAndSchedulingInfos] 160 | fetch all actualLRPs: 161 | Fastest Time: 2.627s 162 | Slowest Time: 2.627s 163 | Average Time: 2.627s ± 0.000s 164 | ------------------------------ 165 | 166 | Ran 1 of 1 Specs in 712.872 seconds 167 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped 168 | 169 | Ginkgo ran 1 suite in 11m56.614700533s 170 | Test Suite Passed 171 | ``` 172 | 173 | ## Experiment 3 174 | 175 | ### Config 176 | 177 | * 2 MySQL Nodes 178 | * 2 Proxies (using the consul lock functionality) 179 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 180 | * Patched BBS with a retry logic on Deadlock errors 181 | 182 | We stopped the master MySQL node during seeding roughly after 20k insertions and 183 | restarted it after 30k. 184 | 185 | The Proxy successfully transferred all of the active connections to the available MySQL 186 | node immediately after bringing the master node down. 187 | 188 | However, this time we watched the status of the MySQL node on restart, 189 | and noticed that the MySQL node was unable to start back up until there was no 190 | load on the system. Every time that the MySQL node failed to start, it would 191 | stop/start the internal mysqld server which would result in a decent number of 192 | deadlock errors. 
193 | 194 | ### Results 195 | 196 | ```ruby 197 | Failure [592.772 seconds] 198 | [BeforeSuite] BeforeSuite 199 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 200 | 201 | Expected error: 202 | <*errors.errorString | 0xc825ad59c0>: { 203 | s: "Error rate of 0.060 for actuals exceeds tolerance of 0.050", 204 | } 205 | Error rate of 0.060 for actuals exceeds tolerance of 0.050 206 | not to have occurred 207 | 208 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:244 209 | ------------------------------ 210 | Failure [592.781 seconds] 211 | [BeforeSuite] BeforeSuite 212 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 213 | 214 | BeforeSuite on Node 1 failed 215 | 216 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 217 | ------------------------------ 218 | Failure [592.775 seconds] 219 | [BeforeSuite] BeforeSuite 220 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 221 | 222 | BeforeSuite on Node 1 failed 223 | 224 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 225 | ------------------------------ 226 | Failure [592.782 seconds] 227 | [BeforeSuite] BeforeSuite 228 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 229 | ``` 230 | 231 | ## Experiment 4 232 | 233 | ### Config 234 | 235 | * 2 MySQL Nodes 236 | * 2 Proxies (using the consul lock functionality) 237 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 238 | * Patched BBS with a retry logic on Deadlock errors 239 | 240 | We did a rolling deploy during the seeding phase. 241 | 242 | ### Results 243 | 244 | The rolling deploy never finished. A mysql node is unable to restart during the seeding phase. 245 | 246 | ## Experiment 5 247 | 248 | ### Config 249 | 250 | * 2 MySQL Nodes 251 | * 2 Proxies (using the consul lock functionality) 252 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 253 | * Patched BBS with a retry logic on Deadlock errors 254 | 255 | We did a rolling deploy during the measurement phase with 50k lrps and 8% writes. 256 | 257 | ### Results 258 | 259 | It worked. 260 | -------------------------------------------------------------------------------- /notes/123482771-spike-influxdb-graphana.md: -------------------------------------------------------------------------------- 1 | A version of the influxdb-firehose-nozzle that works with current loggregator 2 | can be found at: https://github.com/luan/influxdb-firehose-nozzle. That fork 3 | updates loggregator libraries and code with most recent datadog-firehose-nozzle 4 | 5 | The nozzle could be made into a bosh release or deployed as a CF app, problem 6 | with CF app is that it can interfere with performance tests. 7 | 8 | Grafana (http://grafana.org) can be either deployed with bosh 9 | (https://github.com/vito/grafana-boshrelease) or ran locally and pointed to a 10 | remote InfluxDB. 11 | 12 | InfluxDB can be deployed via bosh 13 | (https://github.com/vito/influxdb-boshrelease) its configuration is really 14 | minimal and should straightforward. 15 | 16 | There is sample dashboard for Grafana in the same directory as this document as 17 | `grafana-sample.json`. 
18 | -------------------------------------------------------------------------------- /notes/125596231-usability-of-context.md: -------------------------------------------------------------------------------- 1 | ## Story 2 | See [story](https://www.pivotaltracker.com/story/show/125596231) 3 | 4 | ## Background 5 | #### Advantage of `context.Context` 6 | 7 | 1. It is an accepted, standard way of doing things. 8 | 1. We force the user to provide a loggregator, when they may not want anything 9 | logged. With context, the user doesn't have to pass loggregator and can just not 10 | specify a value on the `context.Context` object. 11 | 1. Concurrency issues can be handled using `context.Context`. Currently, we use `ifrit` to handle concurrency. `context.Context` can be used instead. 12 | 13 | #### Disadvantage of `context.Context` 14 | 15 | 1. Implicit arguments. We can place request-scoped variables in the `context.Context` object and retrieve them with `ctx.Value("key")`. It might seem odd to have something passed implicitly 16 | instead of just saying that this function uses this argument. 17 | However, according to [this guy] 18 | (https://medium.com/@cep21/how-to-correctly-use-context-context-in-go-1-7-8f2c0fafdf39#.9o93w6x2m), 19 | using `context.Context` for logging is an acceptable paradigm, as the presence/lack 20 | of a logger should not affect our program flow. The obvious candidates for placing in `context.Context` objects are: 21 | * logger 22 | * metadata objects that are not necessary for program flow, such as the `checksum` variable on the `cacheddownloader` 23 | 1. Type-safety is a possible concern for passing varibles in `context.Context` object. 24 | 1. Need to update to Go 1.7 if we want to use `Context` from the standard library. 25 | Otherwise, we can just import from here `golang.org/x/net/context` pre-1.7. 26 | 27 | ## Benefits of Context in Diego Release 28 | 29 | ### 1. Cross Cutting Concerns 30 | Context is useful when we want to pass aspects that are not defining the behavior of the code. This includes cross-cutting 31 | concerns such as logging and security checks. Also meta-data related to the behavior of the an object can be passed in 32 | through context. The benefit is that, metadata is not directly affecting the behavior of the system and over time may grow 33 | or shrink. Passing metadata through context allows the code signature to stay the same and potentially for the complexity of the code to be reduced. 34 | 35 | #### Logging 36 | One of the immediate advantages of using context, is to allow for passing the `logger` object through to the handlers, without having the object be a parameter for every endpoint. 
37 | 38 | There are two options for consideration: 39 | 40 | *Option 1:* 41 | 42 | This can be easily achieved by modifying the `middleware` in `bbs` where the following change is possible: 43 | 44 | ```go 45 | func LogWrap(logger lager.Logger, loggableHandlerFunc LoggableHandlerFunc) http.HandlerFunc { 46 | return func(w http.ResponseWriter, r *http.Request) { 47 | requestLog := logger.Session("request", lager.Data{ 48 | "method": r.Method, 49 | "request": r.URL.String(), 50 | }) 51 | 52 | // two new lines below 53 | ctx := context.WithValue(req.Context(), "logger", &requestLog) 54 | r = r.WithContext(ctx) 55 | 56 | requestLog.Debug("serving") 57 | loggableHandlerFunc(requestLog, w, r) 58 | requestLog.Debug("done") 59 | } 60 | } 61 | ``` 62 | 63 | And then in the handler, the logger can be used like the following: 64 | 65 | ```go 66 | // the logger is removed from the function signature below 67 | func (h *PingHandler) Ping(w http.ResponseWriter, req *http.Request) { 68 | // the logger is read from the context of the Request 69 | // passing it with pointer, allows for the session information to persist 70 | logger := req.Context().Value("logger").(*lager.Logger) 71 | response := &models.PingResponse{} 72 | response.Available = true 73 | writeResponse(w, response) 74 | } 75 | ``` 76 | 77 | Similar change can be done to `auctioneer`'s `handlers.go` as well. Alternatively the logger can be taken out of `LogWrap` and context to be passed in. 78 | 79 | *Option 2:* 80 | 81 | We can modify each handler to have a `context` object passed to it as a first argument. This will be in line with the pattern Google follows as [discussed here](https://news.ycombinator.com/item?id=8103128) 82 | 83 | Overall, the code change involves passing a Context object instead of a 84 | `loggregator` object around, and it doesn't seem like a difficult change, just a 85 | tedious one. 86 | 87 | In case of the above handler the change would be as follows: 88 | ```go 89 | // the logger is removed from the function signature below 90 | func (h *PingHandler) Ping(ctx context.Context, w http.ResponseWriter, req *http.Request) { 91 | // the logger is read from the context of the Request 92 | // passing it with pointer, allows for the session information to persist 93 | logger := ctx.Value("logger").(*lager.Logger) 94 | response := &models.PingResponse{} 95 | response.Available = true 96 | writeResponse(w, response) 97 | } 98 | ``` 99 | 100 | *Option 3:* 101 | 102 | Rather than including the `logger` object into the `context`, we can alternatively pass the session information and create a new logger object in each handler. On the negative side, this would result in having multiple `logger` objects created in different functions throughout the flow of code execution, which may not necessarily be desirable. On the positive side, passing session names around results in having immutable objects stored in the `context` which is potentially a better practice. 103 | 104 | **Suggestion**: We advocate for the second option, because it prevents unnecessary objects getting created, and also offers a cleaner implementation. Reconstructing the logging object seems to be unnecessary and of little value. 105 | 106 | This is however a fairly big change the signature of functions in `bbs` and `auctioneer`. The following repos will be affected" 107 | 108 | 1. `bbs` (all the handlers need to be changed to have `Context` and not loggregator 109 | and related test code. 
Our main function will create a `context.Background` and 110 | we can get rid of the middleware.`LogWrap` function.) 111 | 112 | 1. `auctioneer` (same as `bbs`) 113 | 114 | These need to be changed because they use the `bbsClient`. 115 | 116 | 1. `benchmarkbbs` 117 | 1. `cfdot` 118 | 1. `route-emitter` 119 | 1. `diego-ssh` 120 | 1. `inigo` 121 | 1. `rep` 122 | 1. `vizzini` 123 | 124 | ### 2. Concurrency 125 | 126 | Functionality offered by `Context` can also be used to simplify some of the logic when dealing with concurrency. We explored a few areas in the code where `Context` can be helpful with simplifying the code. 127 | 128 | _For an example, consider the scenario below:_ 129 | 130 | In the `executor`, the `StepRunner` and all the associated steps from the 131 | `Step` interface can be simplified by having the `Perform` function 132 | take a `context.Context` as a variable. This in turn, would allow us 133 | to get rid of the `Cancel` function, because signalling a cancel can be 134 | handled by using the context. The same context, can also be used to push 135 | the logger into each of the steps, hence simplifying the structs that we 136 | use to define each step. 137 | 138 | To prove this, we spiked on modifying the `timeoutStep` to take advantage of the 139 | new model, and the code changed as follows: 140 | 141 | The struct changed from: 142 | 143 | ```go 144 | type timeoutStep struct { 145 | substep Step 146 | timeout time.Duration 147 | cancelChan chan struct{} 148 | logger lager.Logger 149 | } 150 | ``` 151 | 152 | to the following: 153 | 154 | ```go 155 | type timeoutStep struct { 156 | substep Step 157 | timeout time.Duration 158 | } 159 | ``` 160 | 161 | Also the `Perform` function was modified from the following: 162 | ```go 163 | func (step *timeoutStep) Perform() error { 164 | resultChan := make(chan error, 1) 165 | timer := time.NewTimer(step.timeout) 166 | defer timer.Stop() 167 | 168 | go func() { 169 | resultChan <- step.substep.Perform() 170 | }() 171 | 172 | for { 173 | select { 174 | case err := <-resultChan: 175 | return err 176 | 177 | case <-timer.C: 178 | step.logger.Error("timed-out", nil) 179 | 180 | step.substep.Cancel() 181 | 182 | err := <-resultChan 183 | return NewEmittableError(err, emittableMessage(step.timeout, err)) 184 | } 185 | } 186 | } 187 | 188 | func (step *timeoutStep) Cancel() { 189 | step.substep.Cancel() 190 | } 191 | ``` 192 | 193 | to the simplified version below: 194 | 195 | ```go 196 | func (step *timeoutStep) Perform(ctxt context.Context) error { 197 | resultChan := make(chan error, 1) 198 | 199 | ctxt, _ := context.WithTimeout(ctxt, step.timeout) 200 | 201 | go func() { 202 | resultChan <- step.substep.Perform(ctxt) 203 | }() 204 | 205 | for { 206 | select { 207 | case err := <-resultChan: 208 | return err 209 | 210 | case <-ctxt.Done(): 211 | err := <-resultChan 212 | return NewEmittableError(err, emittableMessage(step.timeout, err)) 213 | } 214 | } 215 | } 216 | 217 | ``` 218 | 219 | We no longer have a logger in the struct as it is retrieved by the 220 | `context`, nor a `cancelChan` since a context provides us with a mechanism 221 | for canceling by using `context.WithCancel`. Instead, we would pass a context object to the `Perform` function of 222 | `timeoutStep`. We also don't have to create timers, since `context` 223 | has support for `Timeout`. 
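To make the cancellation behavior concrete, here is a tiny standalone example (not Diego code) showing that cancelling a parent context is observed by any work using a derived or shared context, which is the same mechanism the `timeoutStep` above relies on and that the next paragraph describes for substeps:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// substep blocks until it either finishes its "work" or its context is done.
func substep(ctx context.Context) error {
	select {
	case <-time.After(time.Second): // stand-in for real work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	errs := make(chan error, 1)
	go func() { errs <- substep(ctx) }()

	cancel() // cancelling the parent is seen by everything using ctx
	fmt.Println("substep returned:", <-errs) // context.Canceled
}
```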
224 | 225 | More interestingly, `context` gets passed to the substeps, so since these `context` 226 | is shared, canceling the parent `context` will propagate through and to the children, 227 | which exempts us from having to write code below and make sure we call it on the child steps. 228 | 229 | ```go 230 | func (step *timeoutStep) Cancel() { 231 | step.substep.Cancel() 232 | } 233 | ``` 234 | 235 | The above is only one possibility for how `context` can be used in `diego release`. Our investigation only looked at a few places including the `StepRunner` and `CachedDownloader`. In case we decide that `context` is useful, we can dig deeper and look into other places where `context` can be helpful. 236 | 237 | ## Conclusion 238 | 239 | There are clear benefits in using `context` and achieving simplicity in the code. However the cost of refactoring could be rather expensive. For example, adding `logger` to the code is a fairly trivial change that can be easily achieved. On the other hand, using `context` to achieve concurrency allows for better code but at the cost of more significant change to the code base. 240 | -------------------------------------------------------------------------------- /notes/debug-app-crash.md: -------------------------------------------------------------------------------- 1 | # Understanding App Crashes in diego 2 | 3 | ## Overview 4 | When an App (ActualLRP) crashes, Rep will notify BBS of the event using 5 | `/v1/actual_lrps/crash` endpoint. In turn, BBS will generate an 6 | [ActualLRPCrashedEvent](https://github.com/cloudfoundry/bbs/blob/master/doc/events.md#actuallrpcrashedevent) that is then consumed by [tps](https://github.com/cloudfoundry/tps) to record and notify the clients. 7 | 8 | ## Logs 9 | - Start by looking for a `bbs-start-actual-lrp` log in Rep for the application 10 | in question. This will tell us when the instance was started. 11 | ```json 12 | {"timestamp":"some-time","level":"info","source":"rep","message":"rep.executing-container-operation.ordinary-lrp-processor.process-running-container.bbs-start-actual-lrp","data":....} 13 | ``` 14 | - The next thing to look for would be `run-step.process-exit` log line for the 15 | application. This will tell us when the process exited the container. 16 | ```json 17 | {"timestamp":"some-time","level":"info","source":"rep","message":"rep.executing-container-operation.ordinary-lrp-processor.process-reserved-container.run-container.containerstore-run.node-run.action.run-step.process-exit","data":{"cancelled":true,...}} 18 | ``` 19 | - If `log_level: debug` is set for Rep, we would even see when the rep called 20 | the crash endpoint 21 | ```json 22 | {"timestamp":"some-time","level":"debug","source":"rep", 23 | "message":"rep.executing-container-operation.ordinary-lrp-processor.process-c 24 | ompleted-container.do-request.complete","data":{.... ,"request_path":"/v1/actual_lrps/crash","session":"61.1.1.1"}} 25 | ``` 26 | - Next, we would want to look at the BBS logs and look for `crash-actual-lrp`. This will let us know when the BBS received notification and when it generated the event. 
27 | ```json 28 | {"timestamp":"some-time","level":"info","source":"bbs","message":"bbs.request.crash-actual-lrp.db-crash-actual-lrp.starting","data":{"crash_reason":"APP/PROC/WEB: Exited with status 2",....}} 29 | {"timestamp":"some-time","level":"info","source":"bbs","message":"bbs.request.crash-actual-lrp.db-crash-actual-lrp.complete","data":{"crash_reason":"APP/PROC/WEB: Exited with status 2",...}} 30 | ``` 31 | - `route_emitter` is one of those consumers. The following would be captured 32 | by route-emitter when crash has happened. 33 | ```json 34 | {"timestamp":"some-time","level":"error","source":"route-emitter","message":"route-emitter.watcher.handling-event.did-not-handle-unrecognizable-event","data":{"error":"unrecognizable-event","event-type":"actual_lrp_crashed","session":"8.123"}} 35 | {"timestamp":"some-time","level":"debug","source":"route-emitter","message":"route-emitter.watcher.handling-event.received-actual-lrp-instance-removed-event","data":{}} 36 | {"timestamp":"some-time","level":"info","source":"route-emitter","message":"route-emitter.watcher.handling-event.removed","data":{}} 37 | {"timestamp":"some-time","level":"debug","source":"route-emitter","message":"route-emitter.watcher.handling-event.emit-messages","data":{}} 38 | {"timestamp":"some-time","level":"debug","source":"route-emitter","message":"route-emitter.nats-emitter.emit","data":{}} 39 | ``` 40 | 41 | - tps-watcher is the another component in line for retrieving the 42 | [ActualLRPCrashedEvent](https://github.com/cloudfoundry/bbs/blob/master/doc/events.md#actuallrpcrashedevent) and notifying CAPI. 43 | ```json 44 | {"timestamp":"some-time","level":"info","source":"tps-watcher","message":"tps-watcher.watcher.app-crashed","data":{}} 45 | {"timestamp":"some-time","level":"info","source":"tps-watcher","message":"tps-watcher.watcher.recording-app-crashed","data":{}} 46 | ``` 47 | -------------------------------------------------------------------------------- /notes/distributed-locking-and-presence.md: -------------------------------------------------------------------------------- 1 | # Distributed Locking/Presence with Consul 2 | 3 | Diego defers to Consul/etcd for the distributed systems magic. We have two basic needs that are continuing to give us trouble. These are both very different and we should pick the best tool/abstraction for the job. 4 | 5 | ## Distributed Locking 6 | 7 | We have a handful of processes that are actually meant to be singletons: `nsync-bulker`, `tps-watcher`, `converger`, `auctioneer`, and `route-emitter`. 8 | 9 | Some terms: 10 | 11 | - **worker** will refer to a process that is doing work. 12 | - **slumberer** will refer to a process that is not doing work. **slumberers** wait to become **workers** in the event of failure. 13 | - *immediately* means "ASAP" or "in under a second" 14 | - *eventually* means within some predefined threshold. Ideal targets here are 5-10 seconds. 15 | 16 | The desired behavior here is: 17 | 18 | 1. Given a set of processes, at most one may be the **worker** 19 | 2. If the **worker** shuts down cleanly (`SIGINT`, `SIGTERM`, `monit stop`, etc..) exactly one **slumberer** should become the next **worker** *immediately*. 20 | 3. If the **worker** dies hard (`SIGKILL`, VM explodes, etc...) exactly one **slumberer** should become the next **worker** *eventually*. 21 | 4. If a **worker** loses confidence that it is the *only* **worker**, the **worker** should shut down. 22 | 23 | #### Where we are today: 24 | 25 | 1 and 2 are working. 
3 is taking ~30s despite TTLs of 10s having been explicitly specified. Onsi has not reproduced 4 yet. 26 | 27 | ## Presence 28 | 29 | Cells maintain their presence in a centralized store. This serves two purposes: 30 | 1. the converger uses this information to determine whether the Cell is still running. If it is not, then applications associated with that Cell are rescheduled. 31 | 2. the auctioneer uses this information to discover and communicate with the Cell. 32 | 33 | The desired behavior here is: 34 | 35 | 1. When a Cell dies hard (`SIGKILL`, VM explodes, etc...): 36 | a) the converger should be notified within 5-10 seconds. i.e. the converger should not wait until a convergence loop to act. If it is notified that a cell has disappeared it should act immediately. 37 | b) when the convergence loop runs the Cell should not be present 38 | c) when the auctioneer schedules work it should not attempt to talk to the (now missing) Cell 39 | 2. When a Cell is shut down cleanly (`SIGTERM`, `monit stop`, rolling deploy): 40 | a) the converger should not be notified immediately. There is no need to trigger a convergence loop. 41 | b) when the convergence loop runs the Cell should not be present 42 | c) when the auctioneer schedules work it should not attempt to talk to the (now missing) Cell 43 | 44 | #### Where we are today: 45 | 46 | 2b and 2c are working. 2a does not work as specified (a convergence loop is triggered when the cell presence is taken away) -- this is OK if not ideal. 47 | 48 | 1a, 1b, and 1c are not working at all. This is a substantial regression and we should add inigo coverage to make sure we have this important edge case covered. 49 | 50 | ## Current implementation of locks using Consul API 51 | 52 | Distributed Locking and Presence are using the same set of APIs. The Consul Lock API is used to acquire and release some piece of data. In the case of a lock, the data is meaningless. For presence, any useful presence information is stored. 53 | 54 | At the center of Consul is the concept of a Session. A session some important characteristics including a TTL, LockDelay, Checks, and a Behavior. 55 | If a session expires, exceeds its TTL, it is invalidated. 56 | The Behavior describes how data owned by the session is handled when invalidated. It can be released (default) or deleted. 57 | The LockDelay is used to mark certain locks as unacquirable. When a lock is forcefully released (failing health check, destroyed session, etc), it is subject to the LockDelay impossed by the session. This prevents another session from acquiring the lock for some period of time as a protection against split-brains. 58 | It is not clear how Checks interact with Sessions. They do refer to defined checks but I am not sure if they are on the node or a service, so we will ignore them for now since I do not believe we need to be concerned with them (yet). 59 | 60 | The current code creates locks to represent a lock or presence. The lock api will create a session if one is not provided. If a session is created, it is also automatically renewed on an interval of ttl/2. A key/value is then acquired. If already acquired, the lock will keep retrying until it becomes available. Any error will fail the Lock(). A conflict may be returned if you are locking a semaphore. The retry has a fixed delay of 5s. On acquisition, the lock is monitored. If for any reason, this fails, you are notified that the lock is lost. 
Note that if the connection to the agent times out (which it does), you will think you lost the lock prematurely. 61 | 62 | An Unlock() does the reverse. If a session is being automatically renewed, it is stopped. The key/value is released, not deleted. 63 | 64 | A Destroy() assumes you have Unlock()ed and remove the key/value if possible. 65 | 66 | ## Where do we need to go 67 | 68 | Its pretty obvious that we need to center around a Session. There should only be one session per process. The ideal solution would allow us to create or re-acquire the session for a process. Our behavior would be to delete rather than release. For presence, a lock delay of 0 makes perfect sense since there is no concern for split brain. We can argue whether we need some lock delay for locks but we can begin with 0. If we require a lock delay for locks, then we will need a separate session for locks. 69 | 70 | Locks and presence become nothing more than acquiring and releasing a key/value beyond this point. The focus is all about keeping the session current. 71 | 72 | A limitation or possible enhancement to discuss with Hashicorp would be to perform a release&delete which is currently two operations. This may not be needed if all we ever do is just whack the session which will auto-destruct all our data. 73 | 74 | It seems that if data exists, we should also be checking the Leader(Session) of the data to determine if its still owned. Given the data may still exist in the KV store, it may not however be acquired (see above, 2 step process). 75 | 76 | To prevent multiple WATCHes, the parent location should be created. If there is nothing to watch with a given prefix, Consul will return immediately with a 404. Our only recourse is to spin or put a slight delay, or inject the root node. 77 | 78 | ### Presence 79 | 80 | An alternative for presence can be done via Services. One can register/deregister a service instance that will be accumulated into a given Service. Health checks can be added to manage the lifetime of the service. The check can be script, HTTP or TTL. Script and HTTP checks are performed on an interval. TTL checks must be updated via the UpdateTTL with a status. 81 | 82 | We will need to decide how much we allow Consul to do versus ourselves. There is obviously a balance of coupling between the ability to change our implementation and relying on something else. 83 | -------------------------------------------------------------------------------- /notes/logging-guidance.md: -------------------------------------------------------------------------------- 1 | # Logging Guidance in Diego 2 | 3 | ## Formatting 4 | 5 | Logger session names and messages should be hyphenated. Keys in a message data payload should use snake-case (`key_name`) instead of hyphens (`key-name`) or camel-case (`keyName`). 6 | 7 | ```go 8 | logger = outer_logger.Session("handling-request") 9 | logger.Info("message-to-show", lager.Data{"key_to_show": data}) 10 | ``` 11 | 12 | ## Sinks 13 | 14 | Typically each Diego component has only one logging sink registered, which writes all log messages to stdout. 15 | 16 | 17 | ## Levels 18 | 19 | The `lager` Logger type supports the following logging levels: 20 | 21 | * Fatal 22 | * Error 23 | * Info 24 | * Debug 25 | 26 | 27 | ### Fatal 28 | 29 | Log a message at the fatal level to cause the logger to panic, writing a log line with the stack trace at the panic site. Reserve this level only for errors on which it is impossible for the program to continue operating normally. 
These should be used in place of `panic(err)` and `os.Exit(1)`. 30 | 31 | ### Error 32 | 33 | Log a message at the error level to indicate something important failed. 34 | 35 | 36 | ### Info 37 | 38 | Log a message at the info level to indicate some normal but significant event, such as beginning or ending an important section of logic. Info logs are also usually appropriate to mark the boundaries of various internal and external APIs. 39 | 40 | If we're calling a function in our own code base, we should be wary of wrapping functions that log in additional logging. For example, if we have a synchronous function that logs using the preferred started/finished pattern: 41 | ```go 42 | func foo(logger lager.Logger) error { 43 | logger := logger.Session("foo") 44 | logger.Info("started") 45 | defer logger.Info("finished") 46 | return nil 47 | } 48 | ``` 49 | We should be careful not to log around its success when we call out to it: 50 | ```go 51 | logger.Info("calling-foo") // this line is excessive logging 52 | err := foo(logger) 53 | if err != nil { 54 | logger.Error("failed-fooing", err) // this line is great! 55 | } 56 | logger.Info("finished-fooing") // this line is excessive logging 57 | ``` 58 | Instead, this should be: 59 | ```go 60 | err := foo(logger) 61 | if err != nil { 62 | logger.Error("failed-fooing", err) // this line is great! 63 | } 64 | ``` 65 | 66 | 67 | 68 | ### Debug 69 | 70 | Log a message at the Debug level to trace ordinary or frequent events, such as pings, metrics, heartbeats, subscriptions, polling, and notifications. If it happens on a timer in a component that is usually driven by a client, internal or otherwise, it should probably log at the Debug level. If you can't imagine caring, but some other developer did in the past, you might want to log at Debug level. 71 | -------------------------------------------------------------------------------- /proposals/bbs-migration-stories.prolific.md: -------------------------------------------------------------------------------- 1 | [CHORE] Remove legacyBBS from the Auctioneer 2 | 3 | The Auctioneer is still referencing `legacyBBS` (i.e. `runtime-schema`). It does this to fetch the set of Cells. This should be an API on the BBS (it can reach out to consul to get the relevant data). 4 | 5 | L: versioning, diego:ga 6 | 7 | --- 8 | 9 | Remove the Receptor 10 | 11 | Acceptance: 12 | - The access VM should not have a Receptor! 13 | 14 | L: versioning, diego:ga 15 | 16 | --- 17 | 18 | All access to the BBS should go through one, master-elected, BBS server 19 | 20 | We set this up to drive out the rest of the migration work. 21 | 22 | All slave BBS servers should redirect to the master. 23 | 24 | Acceptance: 25 | - I get a redirect when I talk to a slave. 26 | 27 | L: versioning, diego:ga 28 | 29 | --- 30 | 31 | The master-elected BBS server should never allow multiple simultaneous LRP/Task Convergence runs 32 | 33 | Currently, the endpoints that trigger convergence will always spin up a convergence goroutine. Instead, if a convergence goroutine is still running we should queue up at most *one* additional convergence request to run. (i.e. if during a given convergence run *multiple convergence requests come in* we should only queue up *one* additional convergence request). 
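One way to picture this coalescing behavior is a buffered channel of capacity one. The sketch below is purely illustrative (the type and function names are made up and this is not the actual BBS implementation):

```go
// Illustrative only: coalesce convergence triggers so that at most one extra
// run is queued while a run is in flight. Names here are hypothetical.
type convergenceCoalescer struct {
	pending chan struct{} // capacity 1: holds at most one queued request
}

func newConvergenceCoalescer() *convergenceCoalescer {
	return &convergenceCoalescer{pending: make(chan struct{}, 1)}
}

// Trigger records a convergence request; any requests arriving while one is
// already queued are dropped, collapsing them into a single follow-up run.
func (c *convergenceCoalescer) Trigger() {
	select {
	case c.pending <- struct{}{}:
	default: // a follow-up run is already queued; drop the duplicate
	}
}

// Run performs convergence whenever at least one trigger has arrived.
func (c *convergenceCoalescer) Run(converge func()) {
	for range c.pending {
		converge()
	}
}
```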
34 | 35 | L: versioning, diego:ga 36 | 37 | --- 38 | 39 | BBS server should emit metrics 40 | 41 | - Convergence-related metrics (these were lost when we gutted the converger) 42 | - BBS requests (both # of requests so we can compute request rates, and request latencies) 43 | - All metrics emitted by the metrics server could, and should, just be emitted by the BBS during the convergence loop. 44 | 45 | L: versioning, diego:ga 46 | 47 | --- 48 | 49 | After a BOSH deploy, all data in the BBS should be stored in base64 encoded protobuf format 50 | 51 | This drives out (see https://github.com/onsi/migration-proposal#the-bbs-migration-mechanism): 52 | 53 | - introducing a database version 54 | - teaching the BBS server to run the migration on start 55 | - teaching the BBS server to bump the database version upon a succesful migration 56 | - removing the command line arguments that control the data encoding 57 | 58 | Acceptance: 59 | - After completing a BOSH deploy I see Task data in the BBS stored in base64 encoded protobuf format 60 | - I can (manually) fetch a database version key from etcd 61 | 62 | L: versioning, diego:ga, migrations 63 | 64 | --- 65 | 66 | [BUG] If the BBS is killed mid-migration, bringing the BBS back completes the migration 67 | 68 | This drives out managing the database version in the case of a failed migration. 69 | 70 | Acceptance: 71 | - while true; do 72 | - I fill etcd with many JSON-encoded tasks 73 | - I perform a BOSH deploy 74 | - I kill the BBS mid-migration 75 | - I bring the BBS back 76 | - I see protobuf-encoded tasks, only 77 | 78 | L: versioning, diego:ga, migrations 79 | 80 | --- 81 | 82 | During a migration, the BBS should return 503 for all requests 83 | 84 | - CC-Bridge components should forward the 503 error along where appropriate. 85 | - Route-Emitter should maintain the routing table and continue emitting it 86 | 87 | Acceptance: 88 | - I set up a long-running migration (perhaps I have a lot of data) 89 | - I see requests to the BBS return 503 during the migration 90 | 91 | L: versioning, diego:ga, migrations 92 | 93 | --- 94 | 95 | If a migration fails, I should be able to BOSH deploy the previously deployed release and recover. 96 | 97 | If we keep the `/vN` root node then this becomes trivial. We simply don't delete the `/vN-1` root node until after the migration. 98 | 99 | L: versioning, diego:ga, migrations 100 | 101 | --- 102 | 103 | If the Rep repeatedly fails to mark its ActualLRPs as EVACUATING it should fail to drain and the BOSH deploy should abort. 104 | 105 | This is a safety valve. It ensures we don't catastrophically lose all the applications if the BBS is unavailable. We should be somewhat generous and retry our evacuation requests repeatedly and perhaps require that all the requests fail or 503. The Cell should exit EVACUATING mode if this happens. 106 | 107 | L: versioning, diego:ga, migrations 108 | 109 | --- 110 | 111 | DesiredLRP data should be split across separate records 112 | 113 | This should bump the DB version. All LRP data should get migrated via a BOSH deploy. 114 | 115 | We do not modify the API, but - instead - do whatever work is necessary under the hood to maintain the existing API. 116 | 117 | Acceptance: 118 | - After a BOSH deploy I can see that all LRP data has been split in two. 
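To make "split in two" concrete, one plausible shape is a small, frequently-read scheduling record alongside a larger run-info record. The story does not prescribe the schema, so the field grouping below is purely illustrative:

```go
// Hypothetical split of a DesiredLRP record; the exact field grouping is
// illustrative, not the final schema.
type DesiredLRPSchedulingInfo struct {
	ProcessGuid string
	Domain      string
	Instances   int32
	MemoryMB    int32
	DiskMB      int32
	RootFs      string
}

type DesiredLRPRunInfo struct {
	ProcessGuid string
	RunPayload  []byte // serialized actions, env vars, egress rules, etc.
}
```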
119 | 120 | L: perf, versioning, diego:ga 121 | 122 | --- 123 | 124 | As a BBS client, I can efficiently get frequently accessed data for all DesiredLRPs in a domain 125 | 126 | Add a new API endpoint 127 | 128 | L: perf, versioning, diego:ga 129 | 130 | --- 131 | 132 | NSYNC's bulker should fetch the minimal set of DesiredLRP data 133 | 134 | L: perf, versioning, diego:ga 135 | 136 | --- 137 | 138 | Route-Emitter's bulk loop should fetch the minimal set of DesiredLRP data 139 | 140 | L: perf, versioning, diego:ga 141 | 142 | --- 143 | 144 | 145 | All Diego components should communicate securely via mutually-authenticated SSL 146 | 147 | L: versioning, diego:ga 148 | 149 | --- 150 | 151 | Cut Diego 0.9.0 152 | 153 | We draw a line in the sand. All subsequent work should ensure a clean upgrade path from 0.9.0. 154 | 155 | We call this 0.9.0 not to signify that all our GA goals have been met (performance is still somewhat lacking). However this gives us a reference point to be disciplined about maintaining backward compatibility during subsequent deploys. 156 | 157 | L: versioning, diego:ga 158 | 159 | --- 160 | 161 | [CHORE] Diego has an integration suite/environment that validates that upgrades from 0.9.0 to HEAD satisfy the minimal-downtime requirement 162 | 163 | This ensures that subsequent work migrates cleanly. 164 | 165 | L: versioning 166 | 167 | --- 168 | 169 | Set up a test suite that runs all migrations from 0.9.0 up for 100k records. 170 | 171 | Test suite should fail if migration time exceeds some threshold (2 minutes?) 172 | 173 | This can help us determine whether or not we want to implement background-migrations for things like rotating encryption keys. 174 | 175 | L: perf 176 | 177 | --- 178 | 179 | The converger should run convergence, not the BBS. 180 | 181 | To do this we'll want to: 182 | 183 | 1. Separate Task & Desired/Actual LRP convergence from the database-cleanup convergence. 184 | 2. Have the BBS perform database-cleanup periodically. 185 | 3. Have the Converger *fetch* data from the BBS, harmonize it, then update the BBS as relevant. 186 | 187 | L: versioning, cleanup 188 | 189 | --- 190 | 191 | I can rollback a migration to an earlier release 192 | 193 | We can either run a bosh errand that triggers down-migrations to the specified version. Or (better?) we can write a drain script on the BBS that triggers the down-migration -- is there any way to specify the target version this way? If we go that direction how do we prevent other BBSes from rerunning up migrations. 194 | 195 | L: versioning, needs-discussion 196 | -------------------------------------------------------------------------------- /proposals/better-buildpack-caching.md: -------------------------------------------------------------------------------- 1 | # Better Buildpack Caching 2 | 3 | Things are coming to a head: 4 | - staging on Diego is slow because we're copying buildpacks around. 5 | - the latest buildpacks are massive -- so much so that staging fails on Diego because we exceed the disk quota in the container! 6 | 7 | The buildpacks team is slimming down the buildpacks but this only solves a particular usecase. Customers with their own not-so-lithe buildpacks are going to run into similar issues. 8 | 9 | What can we do to improve Diego's staging performance? Here's a proposal. It isn't blocked on any new API from Garden though we can transparently transition to the Volumes API at some point in the future if it makes sense. 
10 | 11 | ## Abstractions vs Optimizations 12 | 13 | We are faced with an optimization problem that we need to adapt our abstraction to support. To do this I propose we extend Diego's cached downloading mechanism to support the notion of a named cache that can be mounted and shared between containers at container creation time. 14 | 15 | ### The new API 16 | 17 | We modify Tasks and DesiredLRPs to include a set of shared download caches to mount: 18 | 19 | ``` 20 | Task/DesiredLRP = { 21 | ... 22 | SharedDownloadCaches: [ 23 | { 24 | SharedCacheName: "buildpacks", 25 | MountPoint: "/buildpacks", 26 | } 27 | ] 28 | ... 29 | } 30 | ``` 31 | 32 | Then we update the `DownloadAction` to reference a shared cache. There are many paths we can go down here but I'll put up a strawman. I propose we introduce a new `CachedDownloadAction` so that we have: 33 | 34 | ``` 35 | DownloadAction = { 36 | From: "http://foo", 37 | To: "/absolute/path/in/container" 38 | } 39 | ``` 40 | 41 | ``` 42 | CachedDownloadAction = { 43 | From: "http://foo", 44 | To: "/path/relative/to/mount/point", 45 | CacheKey: "foo", 46 | SharedCacheName: "buildpacks", 47 | } 48 | ``` 49 | 50 | With this split the semantics become clear. A `DownloadAction` is never cached and always incurs the cost of a copy *into* the container. A `CachedDownloadAction` is always cached and mounted directly into the container. For the asset downloaded by a `CachedDownloadAction` to make it into the container the user *must* add a mount entry in the `SharedDownloadCaches` array. 51 | 52 | ### Implementation Notes 53 | 54 | The basic idea here is that executor will use the information in `SharedDownloadCaches` to bindmount the cache directories into containers. The cached download action will then download and untar/unzip assets directly into the cache directory. The cache directory will be assumed to live on the local box - though with the Volumes API in Garden we will eventually have a path towards having Garden live on another box. This is not a priority for now. 55 | 56 | Open questions: 57 | 58 | - today's Cached Downloader will delete files when it is running low on resources. We will at least need to teach it to only do this if a directory is not actively bound to a container. 59 | - what happens if an asset in the cache needs to be updated (for example: a new ruby buildpack has arrived)? Do we overwrite the existing asset even though the (shared) directory is bound into an existing container? What does the DEA do here? 60 | - what about cached assets where the copy-in is not a big deal (e.g. the App-Lifecycle binaries). Do we want to have separate `SharedCacheDownloadAction`s and `CachedDownloadAction`s? (The latter would behave like the existing `CachedDownloadAction` and incur the copy-in penalty). -------------------------------------------------------------------------------- /proposals/bind-mounting-downloads-stories.prolific.md: -------------------------------------------------------------------------------- 1 | Garden-Windows can support read-only bind-mounts specified at container creation 2 | 3 | For the Greenhouse team to implement. 4 | 5 | 6 | L: bind-mount-downloads 7 | 8 | --- 9 | 10 | [CHORE] Spike: verify buildpack-app staging works when buildpacks are available from bind-mounted directories 11 | 12 | Make sure staging tasks still work correctly when buildpacks are available in bind-mounted directories instead of streamed into containers. 
The files should have the permissions set in the current buildpack archives, and the UIDs that result from the rep/executor expanding them as the host `vcap` user. 13 | 14 | 15 | L: bind-mount-downloads 16 | 17 | --- 18 | 19 | As a cached-downloader client, I would like to retrieve a cached downloaded asset as an expanded archive in a directory 20 | 21 | 22 | The cacheddownloader should be able to provide cached assets as expanded archives in a directory, in addition to providing them as a tar stream. To be eligible to be provided in this form, a cache key must accompany the request for the asset, and the download response must provide an ETag header or a Last-Modified header (or both), which are later used as criteria for invalidating cached assets. 23 | 24 | When a client obtains an asset directory, it is also understood to obtain a reference to the directory itself. When it is done using the directory, it must explicitly signal to the cacheddownloader that it has released its reference to the directory, so that the cacheddownloader can eventually clean up unused directories from the cache. 25 | 26 | - If an asset is requested as a directory and is a cache hit as a directory, the existing directory path is provided. 27 | - If an asset is requested as a directory and is a cache hit as an archive, the asset is expanded into a new directory, and that directory is provided to the client. The archive asset may be deleted only after all previous readers for that archive are closed (as happens today). 28 | - If an existing asset is requested as a directory, but is a cache miss, the asset is downloaded into a new directory and provided to the client. Any prior asset for that cache key (directory or archive) should not be removed from disk until all other references to it are released. 29 | 30 | The cacheddownloader must also still be able to provide these cached assets as tar streams to clients: 31 | 32 | - If an asset is requested as an archive, but is a cache hit as a directory, the cacheddownloader should provide the contents of the directory as a tar stream, rather than redownloading the asset archive. 33 | 34 | The cacheddownloader should also account for the size of assets in directories correctly enough, so that it can correctly account for disk usage when deciding to remove the least-recently items from the cache to make space for new assets. 35 | 36 | However we implement this, it must work correctly on Windows as well as on Linux and Darwin. 37 | 38 | 39 | L: bind-mount-downloads 40 | 41 | --- 42 | 43 | As a BBS API client, I can specify that my DownloadAction on a Task or DesiredLRP is to be provided via a read-only bind-mount, and updated Diego Cells will respect that 44 | 45 | On the BBS and its models: Add a new optional, string-valued `BindMount` field to the `DownloadAction` model. If the value is specified as `readonly`, validate that the `CacheKey` has been set, in addition to any other existing `DownloadAction` validations. 46 | 47 | Verify that this is not a breaking change to the BBS API. We do not expect this to require any new endpoints. 48 | 49 | The rep/executor must recognize the presence of a `readonly` value in the optional `BindMount` value on a container's DownloadAction. In this case, before container creation, the executor should use the cacheddownloader to download the asset into a directory on the host system. 
The executor should then bind-mount that directory to the desired container-side directory at the time of container creation, using the `BindMounts` field on the garden `ContainerSpec`. 50 | 51 | We can impose a stricter failure policy for bind-mounted downloads: even if their failure as a streamed-in asset would not cause a failure of the action flow, their failure to download and attach as a bind-mount should cause the container creation to fail. 52 | 53 | On deletion of the container, the executor should release references to all the cacheddownloader-provided directories that it has bind-mounted to the container. 54 | 55 | 56 | L: bind-mount-downloads 57 | 58 | --- 59 | 60 | Stager uses read-only bind-mounts for buildpack and lifecycle downloads 61 | 62 | The stager should construct its tasks to use bind-mounting DownloadActions for the buildpack archives it downloads from Cloud Controller, and for the lifecycle binaries it downloads from the Diego file-server. 63 | 64 | 65 | L: bind-mount-downloads 66 | 67 | --- 68 | 69 | Nsync recipe generation uses read-only bind-mounts for lifecycle downloads 70 | 71 | The nsync LRP recipe generation should construct its DesiredLRPs to use bind-mounting DownloadActions for the lifecycle binaries it downloads from the Diego file-server. It should not use them for other assets, such as the app droplet in the case of buildpack apps. 72 | 73 | 74 | L: bind-mount-downloads 75 | 76 | -------------------------------------------------------------------------------- /proposals/bosh-deployments.md: -------------------------------------------------------------------------------- 1 | # The Road to Composing BOSH Releases 2 | 3 | A number of pivots met to discuss a vision for moving away from monolithic BOSH releases towards smaller BOSH releases. 4 | 5 | ## Where we are today 6 | 7 | CF-Release is a monolithic BOSH release containing just about *everything*. Diego-Release is also a monolithic BOSH release. 8 | 9 | #### Problems with this 10 | 11 | - There is duplicate code shared between CF-Release and Diego-Release that has probably already diverged (e.g. etcd) 12 | - Deploying Diego involves a complicated dance around merging CF's templates into Diego's templates. 13 | - It is currently difficult to track compatible versions of Diego + CF 14 | - It is difficult and awkward to describe how one might BOSH deploy Lattice (i.e. Diego + Router + LAMB) 15 | - Teams are all up in each other's business - constantly bumping CF-Release 16 | - CF-Release being a monolith hides circular dependencies (e.g. LAMB depends on CC which depends on LAMB) 17 | - CF-Release and Diego-Release both conflate BOSH releases with BOSH deployment templates 18 | 19 | ## Where we want to end up 20 | 21 | We envision a world where every major component (or logical grouping of components) is its own BOSH Release. These releases would contain code, tests, and job & packaging specs with appropriately namespaced properties. 22 | 23 | A separate repository called a *deployment* repository that acts as the glue that combines disparate releases together into a vetted grouping. The deployment repository would contain templates (potentially for a variety of targets), integration tests that validate the deployment, and pointers to concrete versions of all the releases it depends on. 24 | 25 | Teams will be responsible for their own Releases and collaborate on keeping the deployment(s) up-to-date. Presumably one team (e.g. 
runtime) would be responsible for keeping the deployment green (typically by poking the team that contributed a component that broke the release). 26 | 27 | BOSH may add some tooling around easing someworkflows here (e.g. downloading releases via URLs, knowing how to create and upload multiple releases given a deployment) 28 | 29 | Here's what the world might look like someday: 30 | 31 | ![the future?](bosh-deployments.png) 32 | 33 | ## FAQ 34 | 35 | - *I'm an OSS consumer of CF. Is my life now harder with all these releases?* 36 | 37 | No. As an OSS consumer you still only download one repository (`cf-deployment`). This will have all the information necessary to perform a BOSH deploy. It may even have baked-in URLs pointing to final-release tarballs that you can just download and deploy with a single `bosh deploy`. 38 | 39 | - *I'm an OSS developer. Is my life now harder with all these releases?* 40 | 41 | Not necessarily. You can still tweak code in an individual release. You then create & upload said release and then run `bosh deploy` with the manifest generated by the deployment. You will have to keep `cf-deployment` in-sync with the release if you make changes to the properties however, so that might be a new point of friction. 42 | 43 | - *I'm an OSS team member working on CF every day. Is my life now harder with all these releases?* 44 | 45 | Hopefully not. There are now cleaner delineations of responsibility and dependencies become more explicit. With the new decoupling you will rarely be blocked on CF-Release being red for reasons outside of your control: you can continue to work on your release and defer integration until CF-Release is green again (unless, of course, you're the reason CF is red). 46 | 47 | - *What about shared jobs*? 48 | 49 | We think this actually gets a lot nicer with releases broken out in this way. Let's take etcd as a concrete example. 50 | 51 | Runtime, LAMB and Diego all rely on etcd. We run two different etcd clusters (one for Diego and one for Runtime+LAMB). By sharing an etcd release we reduce code duplication and can invest unified effort in maintaining a correct BOSH release of etcd. When Diego needs to update etcd Runtime+LAMB need *not* update etcd as it will be possible for the two to reference different versions of the release in cf-deployment. 52 | 53 | - *What about shared packages*? 54 | 55 | We propose either duplicating shared compile-time packages or using git submodules to solve this problem. If this becomes a pain-point BOSH can try to help. 56 | 57 | ## Concrete baby steps for Diego 58 | 59 | 1. Create `diego-deployment` by pulling out all our Diego-Release templates out of Diego-Release and into `diego-deployment`. Evaluate the developer and consumer experience once this is done. `diego-deployment` points to vetted compatible versions of CF-Release and Diego-Release. CF-Release will still have its templates. `diego-deployment` lays the groundwork for someday becoming `cf-deployment`. 60 | 2. Start consuming `garden-linux` via `garden-linux-release` instead of duplicating `garden-linux` code. `garden-linux` moves out of `diego-release` and into `diego-deployment`. 61 | 3. Pull out `etcd` into its own release and have `diego-deployment` refer to it. 62 | 4. Present our results to other teams and evaluate next steps. 
63 | -------------------------------------------------------------------------------- /proposals/bosh-deployments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cloudfoundry/diego-notes/80926d1963c8b75cd2cd6e36f342a79ccf7afaa9/proposals/bosh-deployments.png -------------------------------------------------------------------------------- /proposals/desired_lrp_update_extension.md: -------------------------------------------------------------------------------- 1 | # Proposal for zero downtime updates of DesiredLRPs 2 | 3 | ## What Can be Updated? 4 | 5 | 1) App Update - Droplets, env vars etc - RunInfo 6 | 7 | 1) App Update - Memory Quota, Disk Quota - ScheduleInfo 8 | 9 | 1) Update exposed Ports - RunInfo But see below 10 | 11 | * Routing information can already be updated in a DesiredLRP. But no versions. 12 | 13 | * What does it mean to update Ports when some instances use port A others use Port B? 14 | 15 | * How would the GORouter manage this traffic? 16 | 17 | * How do we emit these new routes? How does an app user go to different ports? If we change ports does this not make old ports useless? 18 | 19 | 1) Asset URLs - RunInfo 20 | 21 | 1) Actions Updated - Start command, healthcheck etc - RunInfo 22 | 23 | 1) App SecurityGroups - RunInfo 24 | 25 | 1) Privileged Flag - RunInfo 26 | 27 | ## How do we version in the Model? 28 | 29 | RunInfo is the part of DesiredLRP stored as a BLOB in SQL schema. 30 | 31 | We could have multiple runInfos for a single desiredLRP - is this a refactor of the 32 | desiredLRP SQL. 33 | 34 | Have a RunInfo table and references to the desiredLRP. 35 | 36 | Inside the desiredLRP we could have active, old1, old2 references to RunInfo entries 37 | 38 | 39 | Link between actual LRP and the runInfo it was started with. 40 | 41 | DLRP 1-> Many(3) RunInfos 42 | ALRP 1-> 1 RunInfos 43 | DLRP 1-> Many ALRPs 44 | 45 | We may have to split some of the items that are in schedule info (disk, memory quota) stuff to be able to 46 | version those pieces. 47 | 48 | We should use locking and transactions to ensure we can update both the runInfo and the DesiredLRP and ActualLRP tables in a single step. 49 | 50 | ## Backwards Compatibility 51 | 52 | Pulling the RunInfo out will have some backwards compatibility effects. 53 | 54 | 1) Migration required to convert current BLOB to a single entry in the table. 55 | 1) API versioning. Should be fairly simple to have the legacy functionality having a single RunInfo per DLRP. 56 | 57 | ## What manages Behaviour? 58 | 59 | The behaviour is similar to evacuation. We want to ensure a new version of an instance is up prior to taking down the old version of an instance. We also want to serially roll all instances of an app. 60 | 61 | Probably use the BBS to manage the overall flow of this for a particular app. 62 | 63 | Can we reuse the evacuation flag in the actual LRP? This may cause conflicts with how cells are working as well as how convergence is running and looking at evacuating LRPs. 64 | * How would this work with cf delete commands from the cloud controller? If the versioning is meant to be transparent to users, we might want one evacuate call that targets LRPs by app name and one that can target `{name, version}` tuples. 65 | 66 | When the BBS and the Rep are in the process of switching a group of ActualLRP instances from the old version to the new version, the system is in a transition state. 
If the BBS receives a delete command while in this state, it should begin evacuating without trying to finish transitioning all the LRPs to the new version.
67 | 
68 | ## Events
69 | 
70 | What RunInfo is required for eventing?
71 | 
72 | 1) Need to look into usage of RunInfo in route-emitter and nsync bulker to ensure this can work correctly.
73 | 
74 | Looks like RunInfo does not directly get passed to route-emitter or nsync. Instead, it is the scheduling info that is used in route-emitter to reconstruct the routing tables. The scheduling info has desiredLRP resources (Memory / Disk / Rootfs) as well as the Route information. The route information does not include the port number though, just a string of routes.
75 | 
76 | 
77 | ## How do we clean up unused RunInfos?
78 | 
79 | New convergence could clean up unused RunInfos. Any RunInfo that is not the "active" RunInfo for a DesiredLRP and not referenced by a running ActualLRP can be removed.
80 | 
81 | What happens if cleanup occurs before a new version launches any ActualLRPs? If we don't have any kind of ordered versioning, then maybe we should spare `UNCLAIMED` and `CLAIMED` ActualLRPs as well.
82 | 
83 | 
84 | ## Proposed API in bbs
85 | 
86 | UpdateDesiredLRP(logger lager.Logger, processGuid string, update *models.DesiredLRPUpdate) error
87 | 
88 | models.DesiredLRPUpdate would now have the following information:
89 | 
90 | ```go
91 | type DesiredLRPUpdate struct {
92 | 	Instances        *int32
93 | 	Routes           *Routes
94 | 	Annotation       *string
95 | 	DesiredLRP       *DesiredLRP
96 | 	UpdateIdentifier *string
97 | }
98 | ```
99 | 
100 | To maintain backwards compatibility we will leave the Instances, Routes, and Annotation fields as top-level entries in the DesiredLRPUpdate. The DesiredLRP field will be a full DesiredLRP containing all the other information to be updated.
101 | The UpdateIdentifier will be the named version of this DesiredLRP, allowing an ActualLRP to be mapped to a specific DesiredLRP version. On creation of a DesiredLRP we can create a tag for the initial version.
102 | 
103 | Nsync will be able to create this easily as it already builds up the new DesiredLRP structure using the recipe builder. It will simply send the full DesiredLRP to the endpoint.
104 | 
105 | We could also think of creating a NEW endpoint that just takes the full new DesiredLRP.
106 | 
107 | UpdateDesiredLRP(logger lager.Logger, processGuid string, update *models.DesiredLRP) error
108 | 
109 | This would be a NEW endpoint and would make backwards compatibility easier.
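As a rough illustration of how a caller such as nsync might exercise the extended update (a sketch against the proposed fields, which do not exist yet; the client variable and version value are assumed):

```go
// Sketch of a caller using the proposed extended update. The UpdateIdentifier
// and embedded DesiredLRP fields are the additions proposed above.
func updateToNewVersion(logger lager.Logger, bbsClient bbs.Client, processGuid string, newLRP *models.DesiredLRP) error {
	updateID := "some-new-version-guid" // named version tag for this definition

	update := &models.DesiredLRPUpdate{
		// existing top-level fields stay as-is for backwards compatibility
		Instances:  &newLRP.Instances,
		Routes:     newLRP.Routes,
		Annotation: &newLRP.Annotation,

		// proposed additions: the full new definition plus its version tag
		DesiredLRP:       newLRP,
		UpdateIdentifier: &updateID,
	}

	return bbsClient.UpdateDesiredLRP(logger, processGuid, update)
}
```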
110 | 111 | ## Proposed New model for DesiredLRP and ActualLRP 112 | 113 | ### DesiredLRP 114 | 115 | Current: 116 | 117 | ```go 118 | type DesiredLRP struct { 119 | ProcessGuid string 120 | Domain string 121 | RootFs string 122 | Instances int32 123 | EnvironmentVariables []*EnvironmentVariable 124 | Setup *Action 125 | Action *Action 126 | StartTimeoutMs int64 127 | DeprecatedStartTimeoutS uint32 128 | Monitor *Action 129 | DiskMb int32 130 | MemoryMb int32 131 | CpuWeight uint32 132 | Privileged bool 133 | Ports []uint32 134 | Routes *Routes 135 | LogSource string 136 | LogGuid string 137 | MetricsGuid string 138 | Annotation string 139 | EgressRules []*SecurityGroupRule 140 | ModificationTag *ModificationTag 141 | CachedDependencies []*CachedDependency 142 | LegacyDownloadUser string 143 | TrustedSystemCertificatesPath string 144 | VolumeMounts []*VolumeMount 145 | Network *Network 146 | } 147 | ``` 148 | 149 | New: 150 | 151 | ```go 152 | type LRPDefinition struct { 153 | DefinitionIdentifier string 154 | RootFs string 155 | Instances int32 156 | EnvironmentVariables []*EnvironmentVariable 157 | Setup *Action 158 | Action *Action 159 | StartTimeoutMs int64 160 | DeprecatedStartTimeoutS uint32 161 | Monitor *Action 162 | DiskMb int32 163 | MemoryMb int32 164 | CpuWeight uint32 165 | Privileged bool 166 | Ports []uint32 167 | Routes *Routes 168 | LogSource string 169 | LogGuid string 170 | MetricsGuid string 171 | EgressRules []*SecurityGroupRule 172 | CachedDependencies []*CachedDependency 173 | LegacyDownloadUser string 174 | TrustedSystemCertificatesPath string 175 | VolumeMounts []*VolumeMount 176 | Network *Network 177 | } 178 | 179 | type DesiredLRP struct { 180 | ProcessGuid string 181 | Domain string 182 | models.LRPDefinition 183 | Annotation string 184 | LRPDefinition2 *models.LRPDefinition 185 | LRPDefinition3 *models.LRPDefinition 186 | } 187 | ``` 188 | 189 | 190 | ### ActualLRP 191 | 192 | Current: 193 | 194 | ```go 195 | type ActualLRP struct { 196 | ActualLRPKey 197 | ActualLRPInstanceKey 198 | ActualLRPNetInfo 199 | CrashCount int32 200 | CrashReason string 201 | State string 202 | PlacementError string 203 | Since int64 204 | ModificationTag ModificationTag 205 | } 206 | ``` 207 | 208 | New: 209 | 210 | ```go 211 | type ActualLRP struct { 212 | ActualLRPKey 213 | ActualLRPInstanceKey 214 | ActualLRPNetInfo 215 | DesiredLRPDefinitionIdentifier string 216 | CrashCount int32 217 | CrashReason string 218 | State string 219 | PlacementError string 220 | Since int64 221 | ModificationTag ModificationTag 222 | } 223 | ``` 224 | 225 | The main change here is to have a DefinitionIdentifier to link the ActualLRP to the definition for that DLRP 226 | -------------------------------------------------------------------------------- /proposals/docker_registry_caching.md: -------------------------------------------------------------------------------- 1 | # Background 2 | 3 | Private Docker Registry aims to: 4 | - guarantee that we fetch the same layers when the user scales an application up. 
5 | - guarantee uptime (if the docker hub goes down we shall be able to start a new instance)
6 | - support private docker images
7 | 
8 | # Private Registry Configuration
9 | 
10 | The configuration of the registry is proposed [here](https://github.com/pivotal-cf-experimental/diego-dev-notes/blob/master/proposals/docker_registry_configuration.md)
11 | 
12 | # Image Caching
13 | 
14 | To guarantee predictable scaling and uptime we have to copy the Docker image from the public Docker Hub to the private registry.
15 | 
16 | The basic steps we need to perform are:
17 | 
18 | * pull the image from the public Docker Hub
19 | * tag the image
20 | * push the tagged image to the Private Registry
21 | * clean up the pulled image
22 | 
23 | The tagging has two purposes:
24 | 
25 | 1. We target the private registry host (since the tag includes the host)
26 | 1. It associates the image with the desired application:
27 | - generate a guid
28 | - send the guid back to CC as a URL pointing back to the private docker registry
29 | 
30 | ## Pull location
31 | 
32 | There are two ways to do this, based on the location where we store the (intermediate) pulled image:
33 | 
34 | ### Builder
35 | 
36 | The Builder pulls the image and stores it locally in the container. Then it tags and pushes it to the private registry.
37 | 
38 | **Pros:**
39 | 
40 | - staging remains isolated with respect to disk space and CPU usage
41 | - easily scale the Docker image staging
42 | 
43 | **Cons:**
44 | 
45 | - limited disk space in the staging container (currently 4096 MB)
46 | 
47 | *We may need to increase the default quota for staging of docker images.*
48 | 
49 | 
50 | ### Private Registry
51 | 
52 | We can use the docker daemon running in the [Private Docker Registry job](https://github.com/pivotal-cf-experimental/docker-registry-release), instructing it to pull, tag and push the image to the registry. To do this we need to either export the daemon's own API or provide a new endpoint that knows how to call it locally. Since the registry runs on the same machine as the daemon, this should be a fairly cheap operation.
53 | 
54 | The Builder can drive the caching steps (pull, tag, push) by using the API remotely exposed by the Registry Release.
55 | 
56 | **Pros:**
57 | 
58 | - saves bandwidth
59 | 
60 | *The image is not transferred between the Builder task and the Registry*
61 | 
62 | - caches layers
63 | 
64 | *Since the docker graph is common for all cache requests, the layers are cached in the same place*
65 | 
66 | **Cons:**
67 | 
68 | - disk space for pulling images
69 | 
70 | *The Docker graph will quickly fill up with layer data for different images. While this serves as a cache and can significantly speed up the pull step, it also creates a problem if we want to get rid of the unused layers. We might use the CC URL pointing to the tag guid for cleanup.*
71 | 
72 | - no isolation between staging tasks
73 | 
74 | *A single big image might block other staging tasks. To isolate the staging tasks we can limit the size of pulled images. This can be done by creating a file-based file system for every cache request.*
75 | 
76 | - need to scale the registry with staging tasks
77 | 
78 | ## Implementation details
79 | 
80 | ### Caching process
81 | 
82 | We can take two approaches to implement the caching process (the pull, tag and push steps):
83 | - programmatically drive [docker code](https://github.com/docker/)
84 | - use the docker client & daemon
85 | 
86 | Both approaches require a privileged process to be able to mount the docker graph file system.
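For the daemon-based option (detailed further below), a minimal sketch of the pull/tag/push/cleanup sequence, shelling out to the `docker` CLI, could look like the following; the image name, GUID, and registry address are placeholders and the error handling is simplified:

```go
// Minimal sketch of the caching steps against a local docker daemon.
// dockerPath, image, registryAddr, and guid are placeholders.
func cacheImage(dockerPath, image, registryAddr, guid string) error {
	cachedTag := fmt.Sprintf("%s/%s:latest", registryAddr, guid)

	steps := [][]string{
		{"pull", image},           // fetch the image from the public Docker Hub
		{"tag", image, cachedTag}, // retarget the image at the private registry
		{"push", cachedTag},       // upload the tagged image
		{"rmi", image},            // clean up the pulled image
	}

	for _, args := range steps {
		cmd := exec.Command(dockerPath, args...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("docker %s failed: %v", args[0], err)
		}
	}
	return nil
}
```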
87 | 
88 | #### Programmatic
89 | 
90 | We can extend the code in garden-linux that is used to fetch the rootfs by adding pull-image, tag-image and push-image functionality.
91 | 
92 | **Pros:**
93 | 
94 | - efficiency: reduced memory consumption, number of processes
95 | - no need to provision a Docker daemon
96 | 
97 | **Cons:**
98 | 
99 | - dependency on Docker internals
100 | 
101 | 
102 | #### Docker daemon
103 | 
104 | We can run the Docker daemon and call its public REST API. To do this we shall run the daemon as root or in a privileged container.
105 | 
106 | **Pros:**
107 | 
108 | - stable public API (does not depend on internal implementation)
109 | 
110 | **Cons:**
111 | 
112 | - additional process and memory overhead
113 | - more complex provisioning (configuration, scripts, manifests)
114 | - root access **and** privileged container
115 | 
116 | ##### Processes
117 | 
118 | We need two processes:
119 | - Docker daemon
120 | - Builder
121 | 
122 | We have to launch the two processes in parallel since there is no point in waiting for the daemon to finish.
123 | 
124 | The Builder is responsible for:
125 | - waiting for the daemon to start by reading the response from the daemon's `_ping` endpoint
126 | - fetching metadata
127 | - caching the image by invoking the `docker` CLI:
128 |   - pull image
129 |   - tag image
130 |   - push image
131 | 
132 | The Docker daemon accepts requests made directly by the Builder or triggered by the `docker` client.
133 | 
134 | The processes are launched as an [ifrit](https://github.com/tedsuo/ifrit) group, which guarantees that if one of them exits or crashes the other one will be terminated. We also use ifrit to be able to terminate both of them if the parent process is signalled.
135 | 
136 | ## Cached images
137 | 
138 | ### URL
139 | 
140 | The images that are pushed to the private registry are stored as `<registry>/<GUID>:latest`, where GUID is a generated V4 uuid.
141 | 
142 | ### Tag
143 | 
144 | The cached images are stored in `library/<GUID>` with the tag `latest`.
145 | 
146 | ## Cloud Controller
147 | 
148 | The Cloud Controller table `droplets` is extended with a `cached_docker_image` column that stores the image URL returned by the staging process.
149 | 
150 | When the desire-app request is generated, the `cached_docker_image` is sent instead of the user-provided `docker_image` in the `app` table.
151 | 
152 | If the user opts out of caching the image, on restage we have to nil the `cached_docker_image` since this will disable the caching on the desire-app request.
153 | 
-------------------------------------------------------------------------------- /proposals/docker_registry_configuration.md: --------------------------------------------------------------------------------
1 | # Configuration settings for using private Docker Registry
2 | 
3 | Private Docker Registry aims to:
4 | - guarantee that we fetch the same layers when the user scales an application up.
5 | - guarantee uptime (if the docker hub goes down we shall be able to start a new instance)
6 | - support private docker images
7 | 
8 | ## Security
9 | 
10 | The Docker Registry can be run in 3 flavours with regard to security:
11 | - HTTP
12 | - HTTPS with a self-signed certificate
13 | - HTTPS with a CA-signed certificate
14 | 
15 | The first two are considered insecure by the docker registry client. The client needs additional configuration to allow access to insecure registries.
16 | 
17 | ## Configuration
18 | 
19 | ### Components
20 | 
21 | The following components are involved in the staging and running of a Docker image:
22 | - Builder
23 | The Docker app life-cycle builder needs to access the registry to fetch image metadata. To do this it uses the Docker Registry Client. Therefore the client has to be configured to allow insecure connections if this is needed.
24 | - Stager
25 | The builder runs inside a container. This container should allow access to the internally hosted registry (in the protected, isolated CF network). The container is configured by the stager that requests the launch of the docker builder task.
26 | 
27 | ### Goals
28 | 
29 | The configuration goals are:
30 | - to allow access to the private Docker Registry
31 | - to enable the Docker Registry Client to access the host
32 | 
33 | ### Proposal
34 | 
35 | #### Consul service
36 | 
37 | We shall register the private docker registry with Consul (as we do with the file server). The Docker Registry shall be registered as `-dockerRegistryURL=http(s)://docker_registry.service.consul:8080`.
38 | 
39 | This will help us easily discover the service instances. We also do not need to specify concrete IPs of the service nodes in the BOSH manifests.
40 | 
41 | #### Stager
42 | 
43 | The Docker Registry instance IPs are obtained from the Consul cluster. The Stager shall be provided with the URL of the Consul Agent via `-consulCluster=http://localhost:8500`.
44 | 
45 | The Stager shall be instructed to add the registry service instances as insecure with the help of the `-insecureDockerRegistry` flag. This flag shall be provided if the registry is accessed over HTTP or over HTTPS with a self-signed certificate.
46 | 
47 | #### Builder
48 | 
49 | The `-dockerRegistryAddresses` argument is used to provide a comma-separated list of `host:port` pairs to access all Docker Registry instances.
50 | 
51 | The Docker client needs a list of all insecure registry service instances. That's why the Builder shall read the optional command line argument `-insecureDockerRegistries` to configure the Docker Registry Client and allow access to any insecure hosts. The argument shall accept a comma-separated list of `host:port` pairs.
52 | 
53 | The list is available in the Consul cluster, but the Builder is running inside a container with no access to the Consul Agent. That's why the `-insecureDockerRegistries` list is built by the Stager.
54 | 
55 | As a side effect the docker app life-cycle builder may provide access to public registries that are insecure (either HTTP or self-signed cert HTTPS) if they are listed in `-insecureDockerRegistries`. This, however, will also require forking/modifying the Stager code.
56 | 
57 | The `-dockerDaemonExecutablePath` argument is used to configure the correct path to the Docker executable in different environments (Inigo, different Cells, unit testing).
58 | 
59 | **Pros:**
60 | 
61 | - Simpler configuration
62 | - Consistent with the CF CLI
63 | 
64 | **Cons:**
65 | 
66 | - No support for external insecure registries
67 | 
68 | ## Egress rules
69 | 
70 | To be able to access the private Docker Registry we have to open up the container. We have these options:
71 | 
72 | - The Stager fetches a list of all registered `docker_registry` services. This would return all registered IPs and ports, and we shall poke holes allowing access to all those IPs and ports.
73 | - We pick a unique port for the private docker registry that doesn't conflict with anything else in the VPC (hard/tricky!) and then open up that port on the entire VPC.
74 | 75 | If we use the first option, there's a small race in that a new Docker registry may appear/disappear while we are staging. This would result in a staging failure but this should be very infrequent. 76 | 77 | ### Staging 78 | 79 | To enable discovery of all service instances we shall use [Consul service nodes endpoint](http://www.consul.io/docs/agent/http.html#_v1_catalog_service_lt_service_gt_) (/v1/catalog/service/). This will return the following JSON: 80 | ``` 81 | [ 82 | { 83 | "Node": "foobar", 84 | "Address": "10.1.10.12", 85 | "ServiceID": "redis", 86 | "ServiceName": "redis", 87 | "ServiceTags": null, 88 | "ServicePort": 8000 89 | } 90 | ] 91 | ``` 92 | 93 | To execute the above request we use the `-consulAgentURL` flag. Currently `ServicePort` is always 0 since we do not register a concrete port and rely on hardcoded one. Using different ports requires changes in: 94 | 95 | - consul agent scripts (to register service with different ports) 96 | - stager task create request (to add several egress rules) 97 | 98 | For MVP0 we can hardcode the port and make the needed changes for all services later. 99 | 100 | Once we find all service instances of the registry we add them explicitly to the staging task container. This presents some problems since we are opening access to the registry even if the desired egress rules (the default CF configuration) does not allow this. 101 | 102 | ### Running 103 | 104 | We don't need to add egress rules to allow access to the Docker Registry instances since Garden fetches the image outside of the container. 105 | 106 | ## Opt-in 107 | 108 | Image caching in private docker registry will not be enabled by default. User can opt-in by: 109 | 110 | ``` 111 | cf set-env DIEGO_DOCKER_CACHE true 112 | ``` 113 | 114 | The app environment is propagated all the way to the stager, which configures the docker lifecycle builder accordingly with the help of the `-cacheDockerImage` command line flag. 115 | 116 | If the `-cacheDockerImage` is present, the builder: 117 | - generates cached docker image GUID 118 | - caches the image in the private registry (pull, tag & push) 119 | - includes the cached image metadata (`:/:latest`) in the staging response 120 | 121 | The stager returns the generated staging response back to Cloud Controller (CC). CC stores the cached image metadata in the CCDB. 122 | 123 | If the user opt-out then the cached image metadata is cleared to enable restage with the user-provided docker image. 124 | 125 | ## Private images 126 | 127 | Users need to provide credentials to access the Docker Hub private images. The default flow is as follows: 128 | * user provides credentials to `docker login -u -p -e ` 129 | * the login generates ~/.dockercfg file with the following content: 130 | ``` 131 | { 132 | "https://index.docker.io/v1/": { 133 | "auth": "bXlVc2VyOm15c2VjUmV0", 134 | "email": "" 135 | } 136 | } 137 | ``` 138 | * `docker pull` can now use the authentication token in the configuration file 139 | 140 | Therefore we can implement two possible flows in Diego using `docker login`: 141 | * user provides user/password 142 | * user provides authentication token and email 143 | 144 | The second option should be safer since the authentication token is supposed to be temporary, while user/password credentials can be used to generate new token and gain access to Docker Hub account. 145 | 146 | ### Findings / Issues 147 | 148 | #### Docker authentication token 149 | 150 | The current (May 2015) authentication token contains base64 encoded `:`. 
For example the token `bXlVc2VyOm15c2VjUmV0` above is actually `myUser:mysecRet`. This compromises the idea of using the token to avoid storing the user credentials.
151 | 
152 | The Cloud Controller models produce a debug log entry with the arguments used to start the application. Since these arguments contain the authentication token, this presents a security risk, as operators and admins should not be able to see the user's credentials.
153 | 
154 | #### Docker email
155 | 
156 | `docker login` always requires an email, although it is not used if the user/password login succeeds. This is already reported as https://github.com/docker/docker/issues/6400 and may be fixed in the future.
157 | 
158 | #### In-memory storage of credentials or token
159 | 
160 | We cannot store application metadata in memory since we may have more than one CC. The `create app` and `start app` requests to CC may reach different instances, so the authentication info stored in memory will not be available to all CC instances.
161 | 
162 | #### OAuth
163 | 
164 | Docker seems to support OAuth. To use OAuth one should either:
165 | * programmatically use Docker golang code
166 | * use the Docker REST endpoints with a custom client
167 | 
168 | Both approaches seem to bring too much overhead. We either have to adapt to Docker code changes or create a custom client that can log in and pull.
169 | 
170 | ### Proposed design
171 | 
172 | Since we cannot use the Docker authentication token as a secure replacement for user/password, we shall do the following on the CC side:
173 | * accept user, password & email as the credentials needed to pull a private Docker image
174 | * encrypt the credentials in the database
175 | * propagate the credentials to Diego via the stage-app request
176 | 
177 | On the Diego side we shall:
178 | * propagate the credentials to the `stager` and the Docker `builder`
179 | * (`builder`) propagate the credentials to the Docker golang API to fetch the private image metadata
180 | * (`builder`) perform `docker login` with the credentials to enable `docker pull` of the private image
181 | 
182 | 
-------------------------------------------------------------------------------- /proposals/faster-missing-cell-recovery.md: --------------------------------------------------------------------------------
1 | # Faster Missing Cell Recovery
2 | 
3 | Today, Cell reps maintain a presence key in etcd. This key has a TTL of `PRESENCE_TTL` (currently set to 30s). When a Cell disappears this key eventually expires. When the Converger next performs its convergence loop (every `CONVERGENCE_INTERVAL` seconds) it notices that there are [ActualLRPs](https://github.com/pivotal-cf-experimental/diego-dev-notes#harmonizing-desiredlrps-with-actual-lrps-converger) and [Tasks](https://github.com/pivotal-cf-experimental/diego-dev-notes#maintaining-consistency-converger) in the BBS that claim to be running on non-existent Cells. It responds by restarting the ActualLRPs and failing the Tasks.
4 | 
5 | Today, both `PRESENCE_TTL` and `CONVERGENCE_INTERVAL` are set to 30s. In the worst-case scenario, the time between a Cell going down and the ActualLRPs being restarted can be as high as `PRESENCE_TTL + CONVERGENCE_INTERVAL` -- i.e. 1 minute. We can do better.
6 | 
7 | #### Why bother?
8 | 
9 | - Doing better makes for good demos.
10 | - Doing better is better :stuck_out_tongue_winking_eye:.
11 | 
12 | ## A better way
13 | 
14 | One option is to reduce `PRESENCE_TTL` and `CONVERGENCE_INTERVAL`.
The former is practical, the latter is not:
15 | 
16 | - Since we *only* heartbeat the presence information into etcd we can probably easily sustain a `PRESENCE_TTL` in the neighborhood of `5 seconds`. This knob may need to be tweaked down as the number of Cells increases, however.
17 | - It is, however, impractical to substantially decrease the `CONVERGENCE_INTERVAL`. Convergence is a very expensive operation that places heavy strain on etcd.
18 | 
19 | I propose the following:
20 | 
21 | - We decrease `PRESENCE_TTL` to 5 seconds.
22 | - We modify the Converger to watch for expiring Cell presences and respond immediately by triggering a convergence loop.
23 | 
24 | In this way we can reduce the maximum recovery time from 1 minute to ~5 seconds.
25 | 
26 | #### But I thought we were moving away from etcd watches?
27 | 
28 | Yes and no.
29 | 
30 | We are moving away from etcd watches for three reasons:
31 | 
32 | - They easily lead to the thundering-herd problem (e.g. a Task is complete, all receptors head to etcd to mark the Task as RESOLVING)
33 | - They generally perform poorly under heavy write loads
34 | - They make it harder to move away from etcd
35 | 
36 | But we are primarily concerned with not watching for changes in our *data* (DesiredLRPs, ActualLRPs, Tasks).
37 | 
38 | Watching for *services* (e.g. CellPresence) is fine. In the event that we move our data out of etcd we are likely to continue to use etcd (or consul, which also supports watch-like semantics) for our services. Note that moving our data out of etcd also dramatically reduces the write pressure, which makes reducing the `PRESENCE_TTL` to 5 seconds a lot more feasible even in very large deployments (>1000 cells).
39 | 
-------------------------------------------------------------------------------- /proposals/go-modules-migrations.md: --------------------------------------------------------------------------------
1 | # BOSH release migration path to use Go modules
2 | 
3 | ## Summary
4 | 
5 | After [proposing
6 | different](https://docs.google.com/document/d/1MeiXIqzsj_j1ziYAfhVXfCCFJLiKhi7HiOyF3g-7Wpk/edit#heading=h.9jamy725425y)
7 | routes for converting bosh releases to use Go modules, we are going to move
8 | forward with the following implementation:
9 | 
10 | - Single go module under `src/code.cloudfoundry.org` with a generic name
11 |   `code.cloudfoundry.org`
12 | 
13 | As far as estimation goes, I estimate 1-2 weeks of work for
14 | diego-release, an example of a release with over 130 submodules.
15 | 
16 | ## Migration Path
17 | 
18 | Here are the steps taken for conversion:
19 | 
20 | ```
21 | cd ./src/code.cloudfoundry.org
22 | go mod init code.cloudfoundry.org
23 | go mod vendor -e #ignore errors
24 | # run tests
25 | # if tests failed due to mismatch on version, try the submoduled version instead
26 | ```
27 | 
28 | ### Add Dependabot for independent modules (e.g. code.cloudfoundry.org/localip)
29 | 
30 | When a new module is created, let's also add dependabot so that the dependencies
31 | can remain updated.
32 | [archiver](https://github.com/cloudfoundry/archiver/blob/2762da2677ce6ba931a6e9fff947c7541f470615/.dependabot/config.yml)
33 | is an example of the dependabot config.
34 | 
35 | ### Add GitHub Actions integration for running the tests for independent modules
36 | 
37 | When a new module is created, let's also add GitHub Actions integration for
38 | running the tests.
39 | 
40 | ### Module dependency versions
41 | 
42 | BOSH releases have specific SHAs checked out for each dependency.
[Some 43 | repositories](https://github.com/cloudfoundry/diego-release/blob/68b60677acffd6ab241e2698f581c52f5da3ed83/.dependabot/config.yml) 44 | are setup to receive security update (if dependabot security bump is working as 45 | expected), but we are not bumping dependencies to the latest all the time. When 46 | these are converted to Go module's dependency: 47 | 48 | - Go mod tooling will automatically get the latest one from the remote 49 | repository 50 | - Ensure tests pass with the latest version 51 | - If Broken, let's lock it back to the submoduled SHA and exclude that from 52 | dependabot 53 | 54 | 55 | ### Vendoring external modules that use replace directive 56 | 57 | [`replace` 58 | directive](https://thewebivore.com/using-replace-in-go-mod-to-point-to-your-local-module/) 59 | is very handy when a developer is making changes to modules for BOSH releases 60 | and wants to see the changes applied after deploy without having to commit the 61 | change to upstream. Using `replace` directive works when it's internal to the 62 | release, but when the component is used outside of the release (e.g. 63 | `diego-release` uses `guardian` from `garden-runc-release`) we would have to 64 | manually add the replace commands before being able to vendor `guardian`. In 65 | order to use `guardian`, you must clone the needed repositories using submodules. Here 66 | is an example workflow when adding those components 67 | 68 | ``` 69 | cd inigo 70 | go mod init 71 | go mod edit \ 72 | -replace code.cloudfoundry.org/guardian=./guardian \ 73 | -replace code.cloudfoundry.org/garden=./garden \ 74 | -replace code.cloudfoundry.org/grootfs=./grootfs \ 75 | -replace code.cloudfoundry.org/idmapper=./idmapper 76 | go mod vendor 77 | ``` 78 | 79 | ### Add a script to update golang version for each module 80 | 81 | Each module would have a script to update go version. [More 82 | info](https://golang.org/ref/mod#go-mod-file-go) for how to do that. 83 | 84 | 85 | ## Work-In-Progress 86 | 87 | - [x] diego-release v2.51.0. 
dependabot is not syncing because of [this 88 | issue](https://github.com/dependabot/dependabot-core/issues/4249) 89 | - [x] routing-release 90 | [cloudfoundry/routing-release#211](https://github.com/cloudfoundry/routing-release/pull/211) 91 | - [x] cf-networking-release 92 | [cloudfoundry/cf-networking-release#90](https://github.com/cloudfoundry/cf-networking-release/pull/90) 93 | - [x] silk-release 94 | [cloudfoundry/silk-release#32](https://github.com/cloudfoundry/silk-release/pull/32) 95 | - [x] nats-release 96 | [cloudfoundry/nats-release#38](https://github.com/cloudfoundry/nats-release/pull/38) 97 | 98 | - [ ] code.cloudfoundry.org/lager need to be updated in diego-release and 99 | silk-release and all of the other consumers since the 2.0.0 tag was added we 100 | need to figure out a way to update this dependency 101 | - [ ] Make sure all scripts that bump a submodule also run `go mod tidy & go mod 102 | vendor` 103 | - [x] Make sure norsk pipelines are all green 104 | 105 | ### List of newly converted modules that have github-actions and dependabot 106 | - code.cloudfoundry.org/archiver 107 | - code.cloudfoundry.org/cf-networking-helpers 108 | - code.cloudfoundry.org/silk (CI on concourse instead) 109 | - code.cloudfoundry.org/diego-logging-client 110 | - code.cloudfoundry.org/certsplitter 111 | - code.cloudfoundry.org/bytefmt 112 | - code.cloudfoundry.org/durationjson 113 | - code.cloudfoundry.org/consuladapter 114 | - code.cloudfoundry.org/eventhub 115 | - code.cloudfoundry.org/debugserver 116 | - code.cloudfoundry.org/clock 117 | - code.cloudfoundry.org/localip 118 | 119 | ## TODO As a different track 120 | - Add a dependabot configuration for releases so they get the latest version of 121 | the dependencies. 122 | - Go mod fetches `code.cloudfoundry.org/lager` v2.0.0 as the latest version of 123 | library, instead of master/main branch with the latest updates. This is leading to 124 | a situation where we have to use a [replace 125 | directive](https://github.com/cloudfoundry/diego-release/blob/2eb134ea1227064fd4fc15671a34552a20a6e3f3/src/code.cloudfoundry.org/go.mod#L13). 126 | This works for now, but it has a side-effect of never getting updated when 127 | there is a commit on lager. [The following is a list of 128 | repositories](https://pkg.go.dev/code.cloudfoundry.org/lager?tab=importedby) that 129 | needs to remove the replace directive when we have a different solution 130 | - containernetworking/cni package needs to be updated on silk-release 131 | - Figure out a way to sync github action YAMLs. Currently we have an independent 132 | configuration that are mostly the same. If we need to update one, it would be 133 | better to be able to have a list of those repositories so that they all get 134 | the same version of improvement. This is also true to dependabot.yml. 135 | - certauthority is now a repository, let's move cert creation functionality out 136 | of inigo so that other repositories don't have to import inigo. 
Here is the 137 | list of repositories that do similar cert generation for tests: 138 | - diego-release 139 | - silk-release 140 | - cf-networking-helpers 141 | - cf-networking-release 142 | 143 | ## End Goal 144 | 145 | After migration is completed: 146 | 147 | - `GOPATH` in `.envrc` is removed 148 | - Release is using Go 1.16+ 149 | - `GO111MODULE` is unset 150 | - When we `ag GOPATH`, all results should be converted/removed including docs 151 | -------------------------------------------------------------------------------- /proposals/lifecycle-design.md: -------------------------------------------------------------------------------- 1 | ## App Lifecycles: Static or Dynamic? 2 | 3 | Lifecycles consist of two parts: a trinity of binaries (Builder, Runner, Healthcheck), and a Recipe-Generator interface which can generate a Diego recipe for the Staging Task (Builder) and DesiredLRP (Runner + Healthcheck). 4 | 5 | The binaries can simply be found in the blobstore. But the Recipe-Generator is invoked directly by the CC(-Bridge). So how do we identify the presence of a Lifecycle, and then invoke its Recipe-Generator? AKA, how do we load Lifecycles? We see three possible choices: 6 | 7 | **Choice #1:** A set of Lifecycles is part of the CC(-Bridge) codebase. The Recipe-Generators are directly compiled into the CC(-Bridge). This is very simple and reliable. But it is not very extensible without forking the code. 8 | 9 | **Choice #2:** The CC(-Bridge) performs some kind of DLL / shelling-out shenanigans to a local set of Recipe-Generator binaries. The options here are somewhat language-specific. How would we load these binaries? Co-located bosh jobs? 10 | 11 | **Choice #3:** Each Recipe-Generator runs as a service, which the CC(-Bridge) can discover dynamically. This requires service discovery, but is otherwise cleanly decoupled. 12 | 13 | The current implementation is a sort of hodgepodge of the above choices. The available lifecycle are passed in as arguments to the CC-Bridge, but the Recipe-Generators are hardwired into the code. 14 | -------------------------------------------------------------------------------- /proposals/per-application-crash-configuration-stories.csv: -------------------------------------------------------------------------------- 1 | Id,Title,Labels,Iteration,Iteration Start,Iteration End,Type,Estimate,Current State,Created at,Accepted at,Deadline,Requested By,Description,URL,Owned By,Owned By,Owned By 2 | 106626406,⬇ Crash Restart Policy ⬇,"",,,,release,,unscheduled,"Oct 26, 2015",,,Eric Malm,,https://www.pivotaltracker.com/story/show/106626406,,, 3 | 87479698,When creating a DesiredLRP I can specify a Crash Restart Policy,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"This story is purely about the Receptor API - nothing is actually going to use these values yet... 
4 | 5 | ``` 6 | DesiredLRP.CrashRestartPolicy = { 7 | NumberOfImmediateRestarts: 3, 8 | MaxSecondsBetweenRestarts: 960, 9 | MaxRestartAttempts: 200 10 | } 11 | ``` 12 | 13 | The Receptor applies the following validations: 14 | 15 | - `NumberOfImmediateRestarts`: 16 | + Allowed to be `0` (means ""never restart immediately"") 17 | + Must be less than `MaxRestartAttempts` 18 | + Must be less than `10` 19 | - `MaxSecondsBetweenRestarts`: 20 | + Must be greater than `30` seconds 21 | - `MaxRestartAttempts` 22 | + Allowed to be 0 (means ""never stop restarting"") 23 | 24 | `CrashRestartPolicy` is optional.",https://www.pivotaltracker.com/story/show/87479698,,, 25 | 87479700,Diego honors the `CrashRestartPolicy` on a DesiredLRP when determining whether or not to restart an `ActualLRP`,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"Currently, Diego relies on the hard-coded default `CrashRestartPolicy` in runtime-schema. Instead, if the DesiredLRP has a `CrashRestartPolicy` Diego should use that policy. 26 | 27 | If the DesiredLRP has no `CrashRestartPolicy`, Diego should use the hard-coded default. 28 | 29 | > This affects both the Rep and the Converger but if we've done our homework, the change should be purely in Runtime-Schema.",https://www.pivotaltracker.com/story/show/87479700,,, 30 | 87479702,I can update a DesiredLRP's crash policy,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,The `DesiredLRPUpdateRequest` should support modifying the `CrashRestartPolicy`. Note that no further action needs to be taken. Runtime-Schema always looks up the `CrashRestartPolicy` on the Desired LRP when taking actions.,https://www.pivotaltracker.com/story/show/87479702,,, 31 | 87479704,The `DefaultCrashRestartPolicy` is stored in the BBS and can be modified via the Receptor API on a per-domain basis.,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"When upserting a Domain, I should be able to specify an optional `DefaultCrashRestartPolicy` on the Domain. 32 | 33 | When determining whether or not to restart the application Diego should do the following: 34 | 35 | - Use the `CrashRestartPolicy` on the `DesiredLRP` 36 | - If there is none, use the `DefaultCrashRestartPolicy` associated with the LRP's `domain` 37 | - If there is none, fallback on using the hard-coded `CrashRestartPolicy` in runtime-schema",https://www.pivotaltracker.com/story/show/87479704,,, 38 | 87479706,As a CF Admin I can CRUD the `DefaultCrashRestartPolicy` using the CC API,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"This should end up in the CCDB. 39 | 40 | To flow this into Diego and keep Diego in sync with CF, NSYNC's bulker should fetch the `DefaultCrashRestartPolicy` and update it when it upserts the domain.",https://www.pivotaltracker.com/story/show/87479706,,, 41 | 87479708,"As a CF Admin, I can CRUD `CrashRestartPolicy`",crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,A la ApplicationSecurityGroups,https://www.pivotaltracker.com/story/show/87479708,,, 42 | 87479710,"As a CF Admin, I can associate a `CrashRestartPolicy` with a space and the see the new `CrashRestartPolicy` take effect",crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"This has the effect of updating all applications in said space with the `CrashRestartPolicy`. This should emit a desire message to NSYNC's listener, which should update the `CrashRestartPolicy` associated with the `DesiredLRP`. 
43 | 44 | In addition, NSYNC's bukler should eventually sync changes associated with `CrashRestartPolicy` (i.e. this should modify the etag associated with the applications in the space).",https://www.pivotaltracker.com/story/show/87479710,,, 45 | 87479712,The CF cli should support specifying the `DefaultCrashRestartPolicy`,"cli, crash-restart-policy",,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"``` 46 | cf default-crash-restart-policy --> gets the default crash restart policy 47 | cf set-default-crash-restart-policy --> sets the default crash restart policy 48 | ```",https://www.pivotaltracker.com/story/show/87479712,,, 49 | 87479714,The CF cli should support creating `CrashRestartPolicies` and associating them with spaces,"cli, crash-restart-policy",,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"A la Application Security Groups 50 | ``` 51 | cf crash-restart-policy --> get a particular crash restart policy 52 | cf crash-restart-policies --> get all crash restart policies 53 | cf create-crash-restart-policy --> creates a crash restart policy 54 | cf update-crash-restart-policy --> updates an existing crash restart policy 55 | cf delete-crash-restart-policy --> removes an existing crash restart policy 56 | cf bind-crash-restart-policy --> associates a crash restart policy with a space 57 | cf unbind-crash-restart-policy --> removes a crash restart policy from a space 58 | ```",https://www.pivotaltracker.com/story/show/87479714,,, 59 | 106626404,⬆︎ Crash Restart Policy ⬆︎,"",,,,release,,unscheduled,"Oct 26, 2015",,,Eric Malm,,https://www.pivotaltracker.com/story/show/106626404,,, 60 | -------------------------------------------------------------------------------- /proposals/per-application-crash-configuration.md: -------------------------------------------------------------------------------- 1 | # Per-Application Crash Configuration 2 | 3 | A common feature request in CF is to have finer-grained control over how applications should be restarted in the face of crashes. 4 | 5 | Diego can easily support this. 6 | 7 | ## Crash Restart Policy 8 | 9 | A CrashRestartPolicy in Diego can be expressed as: 10 | 11 | ``` 12 | { 13 | NumberOfImmediateRestarts: 3, 14 | MaxSecondsBetweenRestarts: 960, 15 | MaxRestartAttempts: 200, 16 | } 17 | ``` 18 | 19 | This particular set of numbers constitutes Diego's default policy. Here's what they mean: 20 | 21 | - An application is immediately restarted for the first `NumberOfImmediateRestarts = 3` crashes. 22 | - After that we wait 30s, 60s, 120s, up-to `MaxSecondsBetweenRestarts = 960` between subsequent restarts. 23 | - After 200 restart attempts we give up on the application. 24 | 25 | > Note: the consumer cannot modify the minimum time between restarts (30s) or the fact that we impose an exponential backoff. Also the consumer cannot modify the 5-minute rule (i.e. we reset the crash count if you've been running for 5 minutes). 26 | 27 | ### Allowed Values: 28 | 29 | #### `NumberOfImmediateRestarts` 30 | 31 | - Allowed to be 0 (means never restart immediately) 32 | - Must be less than `MaxRestartAttempts` 33 | - Must be less than 10 34 | 35 | #### `MaxSecondsBetweenRestarts` 36 | 37 | - Must be greater then 30s 38 | 39 | #### `MaxRestartAttempts` 40 | 41 | - Allowed to be 0 (means never stop restarting) 42 | 43 | ## Setting the Crash Policy 44 | 45 | The crash policy is part of the `DesiredLRP`: 46 | 47 | ``` 48 | { 49 | ProcessGuid:..., 50 | ... 51 | CrashRestartPolicy: { 52 | ... 53 | }, 54 | ... 
55 | } 56 | ``` 57 | 58 | - The crash policy can be updated after-the fact and, therefore, is part of `DesiredLRPUpdateRequest`. 59 | - `CrashRestartPolicy` is optional. If unspecified, the default will be used (see below). 60 | 61 | ## Setting the Default Crash Policy 62 | 63 | The `DefaultCrashRestartPolicy` is stored in the BBS and can be modified via the Receptor API on a *per-domain-basis* (note: this is not a BOSH property - the `DefaultCrashRestartPolicy` can be modified at runtime!) 64 | 65 | Only `DesiredLRP`s with *no* `CrashRestartPolicy` use the `DefaultCrashRestartPolicy`. If the `DefaultCrashRestartPolicy` is not specified via the Receptor API, Diego will use the hard-coded values listed above. 66 | 67 | ## Required CF Work 68 | 69 | We propose that only CF admins/operators will be allowed to set/modify crash policies. So: 70 | 71 | - As a CF Admin/Operator I can use the CC API to set the `DefaultCrashRestartPolicy`. 72 | - As a CF Admin/Operator I can set up a CrashRestartPolicy on an Org/Space/App level. 73 | 74 | > Notes: in addition to flowing an event through to NSYNC's listeneer, we'll probably want the NSYNC bulker to periodically (re)set the `DefaultCrashRestartPolicy` to make sure it's up-to-date. 75 | -------------------------------------------------------------------------------- /proposals/placement-constraints-stories.csv: -------------------------------------------------------------------------------- 1 | Id,Title,Labels,Iteration,Iteration Start,Iteration End,Type,Estimate,Current State,Created at,Accepted at,Deadline,Requested By,Description,URL,Owned By,Owned By,Owned By,Comment,Comment 2 | 106613240,⬇ Placement Constraints ⬇,"",,,,release,,unscheduled,"Oct 25, 2015",,,Eric Malm,,https://www.pivotaltracker.com/story/show/106613240,,, 3 | 90748240,"As a consumer of the Receptor API, I should be able to specify a Constraint when requesting a Task or DesiredLRP",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,"1. A constraint looks like: 4 | 5 | ``` 6 | type Constraint struct{ 7 | Require []string, 8 | } 9 | ``` 10 | 11 | 2. Cells (in particular, the Rep) should be BOSH configurable with a list of `Tags`. 12 | 13 | 3. When scheduling the DesiredLRP/Task the auction should only allocate the application on Cells that satisfy the constraint. If there are no Cells that satisfy the Constraint, the Auctioneer should attach a `diego_errors.CELL_MISMATCH_MESSAGE` placement error. 14 | 15 | To satisfy the constraint a Cell **MUST** have any tags in the `Require` list 16 | 17 | 4. Empty Constraints are allowed. 18 | 19 | 5. 
The Constraint cannot be modified on a `DesiredLRP` 20 | 21 | Here's the [accompanying Diego-Dev-Notes proposal](https://github.com/pivotal-cf-experimental/diego-dev-notes/blob/master/accepted_proposals/placement_pools.md).",https://www.pivotaltracker.com/story/show/90748240,,, 22 | 90748242,"As a consumer of the Receptor API, I should see that tags associated with the Cells I get from the `/v1/cells` endpoint",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,Here's the [accompanying Diego-Dev-Notes proposal](https://github.com/pivotal-cf-experimental/diego-dev-notes/blob/master/accepted_proposals/placement_pools.md).,https://www.pivotaltracker.com/story/show/90748242,,, 23 | 90748244,"As a consumer of the CC API, I can CRUD PlacementConstraint rules",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,"This should mirror the organizational structure of [Application Security Groups](http://apidocs.cloudfoundry.org/203/). 24 | 25 | After this story I can: 26 | 27 | - Create a PlacementConstraint 28 | - Delete a PlacementConstraint 29 | - List all PlacementConstraints 30 | - Retrieve a Particular PlacementConstraint 31 | - Update a PlacementConstraint 32 | 33 | Add this point these are just objects in CCDB. No behavior changes (other than the presence of this new API). 34 | 35 | **PlacementConstraints should be under the `/v3` API**",https://www.pivotaltracker.com/story/show/90748244,,, 36 | 90748246,"As a consumer of the CC API, I can associate a PlacementConstraint with a Space (Staging)",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,"A space can have one Staging PlacementConstraint. The staging tasks for applications in this space should be run on Cells that satisfy this constraint. 37 | 38 | After this story I can (If this is not what CC's DSL naturally gives us by default let me know): 39 | 40 | - Retrieve the Staging PlacementConstraint for a Space 41 | - Associate a PlacementConstraint as the Staging PlacementConstraint for a Space 42 | - Disassociate a PlacementConstraint as the Staging PlacementConstraint for a Space 43 | - Given a PlacementConstraint, list all Spaces that have it as a Staging PlacementConstraint 44 | 45 | Behaviorally: 46 | 47 | - If a Space has an associated PlacementConstraint then Diego should be informed of the PlacementConstraint when the staging task is created. 48 | - If a Space has no associated PlacementConstraint there should be no PlacementConstraint sent to Diego.",https://www.pivotaltracker.com/story/show/90748246,,, 49 | 90748666,"As a consumer of the CC API, I can associate a PlacementConstraint with a Space (Running) 50 | ",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,"A space can have one Running PlacementConstraint. The staging tasks for applications in this space should be run on Cells that satisfy this constraint. 51 | 52 | After this story I can (If this is not what CC's DSL naturally gives us by default let me know): 53 | 54 | - Retrieve the Running PlacementConstraint for a Space 55 | - Associate a PlacementConstraint as the Running PlacementConstraint for a Space 56 | - Disassociate a PlacementConstraint as the Running PlacementConstraint for a Space 57 | - Given a PlacementConstraint, list all Spaces that have it as a Running PlacementConstraint 58 | 59 | Behaviorally: 60 | 61 | - If a Space has an associated PlacementConstraint then Diego should be informed of the PlacementConstraint when the DesiredLRP is created. 
62 | - If a Space has no associated PlacementConstraint there should be no PlacementConstraint sent to Diego. 63 | 64 | Changing the PlacementConstraint associated with a space is allowed. At this point the applications in the space need to be restarted but CC has no way to orchestrate this. The same issue exists with the Application Security Groups.",https://www.pivotaltracker.com/story/show/90748666,,, 65 | 90748248,As a consumer of the CC API I can specify a default Staging PlacementConstraint,placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,This should be unspecified out of the box. Once specified it applies to any spaces that have no Staging PlacementConstraint.,https://www.pivotaltracker.com/story/show/90748248,,, 66 | 90748250,As a consumer of the CC API I can specify a default Running PlacementConstraint,placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,This should be unspecified out of the box. Once specified it applies to any spaces that have no Running PlacementConstraint.,https://www.pivotaltracker.com/story/show/90748250,,, 67 | 94711650,It should be possible to have different sets of routers service different pools of placement constraints,"placeholder, placement-constraints",,,,feature,,unscheduled,"May 15, 2015",,,Onsi Fakhouri,This is a placeholder that Onsi is injecting in that needs definition.,https://www.pivotaltracker.com/story/show/94711650,,,,"/cc @emalm -- this came up today and it's something that just needs to be on the roadmap so I've added this placeholder and injected it into the rough priority order. Happy to chat about it. (Onsi Fakhouri - May 15, 2015)","(also feel free to slap me on the wrist - am just trying to get it off of my plate and into tracker somewhere) (Onsi Fakhouri - May 15, 2015)" 68 | 106613356,⬆︎ Placement Constraints ⬆︎,"",,,,release,,unscheduled,"Oct 25, 2015",,,Eric Malm,,https://www.pivotaltracker.com/story/show/106613356,,, 69 | -------------------------------------------------------------------------------- /proposals/placement_pools.md: -------------------------------------------------------------------------------- 1 | # Supporting multiple RootFSes. AKA what is Stack? 2 | 3 | Stack has been a confusing concept in the CC. I'd like to propose that stack simply correlates with the RootFS. I also propose that a single Cell should be able to support multiple RootFSes (this has many benefits including, for example, simplifying the process of upgrading from one RootFS to another). 4 | 5 | Where we need to head in the short-term is support for the following: 6 | 7 | - `lucid64` - a preloaded tarball based on `lucid` 8 | - `cflinxfs2` - a preloaded tarball based on `trusty` 9 | - `docker` - a dynamically download RootFS. 10 | 11 | I propose making this clearer in the Diego API by dropping `Stack` from the `DesiredLRP/Task` and, instead, relying on the existing `RootFS` field which would take on the values (e.g.): 12 | 13 | ``` 14 | lucid64: { 15 | RootFS: "preloaded://lucid64", 16 | } 17 | 18 | cflinxfs2: { 19 | RootFS: "preloaded://cflinuxfs2", 20 | } 21 | 22 | docker: { 23 | RootFS: "docker:///foo/bar#baz", 24 | } 25 | ``` 26 | 27 | ## Supporting multiple RootFSes 28 | 29 | Once the `DesiredLRP/Task` has this information we would modify the components as follows. 30 | 31 | ### Rep 32 | 33 | The Rep would take a list of preloaded RootFSes and a list of supported RootFSProviders. 
For example: 34 | 35 | ``` 36 | rep --preloaded-rootfses='{"lucid64":"/path/to/lucid64", "cflinuxfs2":"/path/to/cflinuxfs2"} --supported-root-fs-providers="docker" 37 | ``` 38 | 39 | This information could make its way onto the Cell Presence. Not strictly necessary at this point, though it is convenient to have access to this information from the API. 40 | 41 | ### Auction 42 | 43 | During the auction the Auctioneer will be given the RootFS and the Rep will furnish its available preloaded RootFSes and supported RootFS providers in its response to `State` requests: 44 | 45 | ``` 46 | State: { 47 | RootFSProviders = ["docker"], 48 | PreloadedRootFSes = ["lucid64", "cflinuxfs2"], 49 | } 50 | ``` 51 | 52 | The Auctioneer would then know whether or not a Cell could support the requested RootFS. In pseudocode: 53 | 54 | ``` 55 | func (c Cell) CanRunTask(task Task) bool { 56 | if task.RootFS.Scheme == "preloaded" { 57 | return c.PreloadedRootFSes.Contains(task.RootFSResource) 58 | } else { 59 | return c.RootFSProviders.Contains(task.RootFS.Scheme) 60 | } 61 | } 62 | ``` 63 | 64 | ### Stager/NSYNC 65 | 66 | Stager/NSYNC would translate CC's "Stack" requests to appropriate values for the two RootFS fields. 67 | 68 | --- 69 | 70 | # Placement Constraints (née Placement Pools) 71 | 72 | Placement Constraints are going to be one of the first new features that Diego brings to the platform. This proposal is intended to get that ball rolling. 73 | 74 | ## What are Placement Constraints? 75 | 76 | At the highest level Placement Constraints will allow operators to group Cells arbitrarily by painting them with tags. Diego workloads can then be constrainted to run on a certain set of tags. 77 | 78 | #### Tags 79 | 80 | A `tag` is a an arbitrary case-insensitive string (e.g. `"staging"`, `"production"`, `"skynet"`). Cells can be painted with arbitrarily many `tags`. For example 81 | 82 | Cell 1 | Cell 2 | Cell 3 | Cell 4 | Cell 5 | Cell 6 | Cell 7 | Cell 8 | Cell 9 83 | -------|--------|--------|--------|--------|--------|--------|--------|------- 84 | staging|staging|staging|staging|production|production|production|production| 85 | skynet|skynet|||skynet|skynet|||skynet 86 | 87 | A `tag` must be less than 64 characters long (arbitrary!) 88 | 89 | #### Constraints 90 | 91 | `constraint` is a set of rules associated with a Task or DesiredLRP. The `constraint` determines which Cells are eligible for running the given workload. 92 | 93 | Here is an empty `constraint`: 94 | ``` 95 | Constraint: { 96 | Require: [], 97 | } 98 | ``` 99 | This says that *no* tags are required. Cells 1-9 satisfy this `constraint`. 100 | 101 | Here is a `constraint` that requires a workload run on `staging` Cells: 102 | ``` 103 | Constraint: { 104 | Require: ["staging"], 105 | } 106 | ``` 107 | Cells 1-4 satisfy this `constraint`. 108 | 109 | Here is a `constraint` that only runs on `staging` Cells allocated to the `skynet` corporation: 110 | ``` 111 | Constraint: { 112 | Require: ["staging", "skynet"], 113 | } 114 | ``` 115 | Cells 1-2 satisfy this `constraint`. 116 | 117 | Finally, here are some `constraint`s that cannot be satisfied with the given Cell setup: 118 | ``` 119 | Constraint: { 120 | Require: ["alfalfa"], 121 | } 122 | ``` 123 | No cells could possiby satisfy these particular `constraint`s. Diego is not in the business of identifying these sorts of inconsistencies -- it is up to the consumer to coordinate their Placement Constraint `tags` and `constraint`s. 
Diego will, however, inform the user (asynchronously, after a failed attempt to auction) when it fails to satisfy a constraint.
124 | 
125 | #### How does this interact with `Stack`?
126 | 
127 | It does not need to. CC's `Stack` is related to the RootFS (discussed above).
128 | 
129 | #### Are Placement Constraints dynamic?
130 | 
131 | No. Not for MVP.
132 | 
133 | You cannot change the `constraint` on a DesiredLRP. You must request a new DesiredLRP.
134 | 
135 | Also, you cannot change the tags on a running Cell. You will need to perform a rolling deploy to change the tags.
136 | 
137 | Because of these constraints we do not need to make the Converger aware of Placement Constraints: they can't change, so there's nothing to keep consistent once ActualLRPs are scheduled on Cells.
138 | 
139 | #### Querying Diego for `tags`
140 | 
141 | For MVP we propose using the `/v1/cells` Receptor API to fetch all Cells and derive the set of tags.
142 | 
143 | If/when we switch to a relational database we will be able to support `/v1/tags` to fetch all tags cheaply.
144 | 
145 | ## Changes to Diego
146 | 
147 | #### Task/DesiredLRP
148 | 
149 | We add `constraint` to Tasks and DesiredLRPs. `constraint` will be **immutable** and is optional (leaving it off means "run this anywhere").
150 | 
151 | #### Rep
152 | 
153 | The `rep` should accept a new command-line flag `-tags` -- a comma-separated list of tags. We should make these tags BOSH-configurable.
154 | 
155 | The `rep` will include the list of `tags` in `CellPresence` and in responses to `State` requests from the Auctioneer.
156 | 
157 | #### Auctioneer
158 | 
159 | The `auctioneer` will be responsible for enforcing `constraint`s in addition to `rootfs` (see above). Extending it to apply the rules outlined above should be fairly straightforward (see the tag-matching sketch further below).
160 | 
161 | The only subtlety here is around the `PlacementError` that the `auctioneer` applies to ActualLRPs and Tasks that fail to be placed. There are two, and they should be interpreted strictly as follows:
162 | 
163 | - `diego_errors.CELL_MISMATCH_MESSAGE`: should be returned if there are *no* Cells satisfying the required `constraint`.
164 | - `diego_errors.INSUFFICIENT_RESOURCES_MESSAGE`: should be returned only if there *are* Cells satisfying the required `constraint` but those Cells do not have sufficient capacity to run the requested work.
165 | 
166 | ## Changes to CF/CC
167 | 
168 | Like Application Security Groups (ASGs), Placement Constraints (PCs) will be assigned on a per-space basis. I imagine we would mirror the organization of ASGs as closely as possible, with the difference that the PC associated with an application will apply to both staging and running applications. Looking at the [CC API docs](http://apidocs.cloudfoundry.org/197/) for ASGs, this would entail APIs that support:
169 | 
170 | - CRUDding PCs
171 | - A space has one StagingPC and one RunningPC
172 | - Specifying a default StagingPC and a default RunningPC
173 | 
174 | For CC, a PC would look identical to a Diego `constraint`:
175 | 
176 | ```
177 | PC = {
178 |   Require: [],
179 | }
180 | ```
181 | 
182 | As with ASGs, modifications to a PC will only go through once applications are restarted.
183 | 
184 | #### Validations
185 | 
186 | For MVP I don't think that CC should enforce any rules on PlacementConstraints. One could imagine a world in which the CC is taught by an operator about which `tags` are deployed (or reaches out to Diego to learn about available tags) -- I don't think we need to go there quite yet.
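As referenced in the Auctioneer section above, here is a minimal Go sketch of the tag-matching rule the `auctioneer` would apply. The `Cell` and `Constraint` types here are illustrative stand-ins, not the real auction types:

```go
package placement

import "strings"

// Constraint is the set of tags a Task or DesiredLRP requires.
type Constraint struct {
	Require []string
}

// Cell carries the tags an operator painted onto it.
type Cell struct {
	Tags []string
}

// SatisfiesConstraint reports whether the Cell carries every required tag.
// Tags are compared case-insensitively, per the proposal; an empty Require
// list is satisfied by every Cell.
func (c Cell) SatisfiesConstraint(constraint Constraint) bool {
	tags := make(map[string]bool, len(c.Tags))
	for _, tag := range c.Tags {
		tags[strings.ToLower(tag)] = true
	}
	for _, required := range constraint.Require {
		if !tags[strings.ToLower(required)] {
			return false
		}
	}
	return true
}
```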
187 | 188 | #### Placement Constraints & Restarts vs Restages 189 | 190 | The CC is somewhat unclear (and confused) about what actions must trigger restages and what actions must trigger restarts. 191 | 192 | We propose that modifications to Placement Constraints need only trigger a restart. 193 | -------------------------------------------------------------------------------- /proposals/private-docker-registry.md: -------------------------------------------------------------------------------- 1 | # Private Docker Registry 2 | 3 | Diego's docker support today is limited to fetching images from the public docker registry. This has a number of issues: 4 | 5 | - No way to guarantee that we fetch the same layers when the user scales an application up. 6 | - No way to guarantee uptime (if the docker hub goes down we can't start a new instance) 7 | - No (good) way to support private docker registries 8 | 9 | To overcome these issues we propose running a private Docker registry backed by Riak CS. One of the most compelling aspects of our proposed solution is that it requires *no* changes to any core Diego components (including Garden-Linux). Instead, only CC and the Docker-Circus need to change. 10 | 11 | **Goals** 12 | - solve the aforementioned problems 13 | - do so without making Diego more docker-specific 14 | 15 | **None-Goals** 16 | - providing CF consumers with a docker registry that they `docker push` to 17 | 18 | ## The Proposal 19 | 20 | The basic premise here is that we would augment the existing staging step in the Docker flow to *fetch* the Docker layers and then re-upload them to a private Docker registry -- we would push to the registry using a uniquely generated guid (e.g the ProcessGuid) to guarantee that scaling the instance up/down always results in an identical set of layers. 21 | 22 | To implement this we would modify the Docker-Tailor and Stager/CC to do the following: 23 | 24 | 1. When staging a Dockerimage the Docker-Tailor fetches the layers and uploads them to the internal registry. 25 | - For a V0 implementation we would always fetch all the layers and attempt to push them to the internal registry. Docker's built-in caching support will naturally ensure we only push new layers, however will be required to download all the layers. 26 | - For a V1 implementation we might try to minimize the layers we *download* as well. Perhaps we can query the registry for the set of layers that comprise the image and only download ones that are *not* in our private registry? 27 | 2. Upon staging the Docker-Tailor should return a Dockerimage url that uniquely points to the image created in the internal registry. This will be unique to each CF push. 28 | 3. CC should store off this Dockerimage url and use it when requesting instances of the application. 29 | 30 | Note that the Task the Stager requests will need to be able to communicate with the private docker registry (presumably in the VPC). We'll need to use the Receptor's egress rules to poke holes to communicate with the private docker registry. We'll need to discover these at runtime in the Stager in order to poke *just* the holes to communicate with the docker registry. This is not an issue when it comes to *fetching* the Dockerimage as Garden-Linux does the downloading and has complete access to the VPC. 31 | 32 | ## The Road to MVP 33 | 34 | We begin by using the [Python docker-registry](https://github.com/docker/docker-registry). 
We can use the community-generated [docker-registry BOSH release](https://github.com/cloudfoundry-community/docker-registry-boshrelease) to BOSH-deploy the Python registry. The first iteration would be a spike in which we back the docker-registry with local storage.
35 | 
36 | With an internal registry in place we can work on the code necessary to have the Tailor download and re-upload the Dockerimage. It would be best if this could be optional (maybe even on an app-by-app basis, for flexibility for now). Applications that opt into the internal registry will be copied into it; apps that aren't will always be fetched from the Dockerhub.
37 | 
38 | Once this approach is validated we can make the registry more robust and add new features.
39 | 
40 | ## The Roadmap past MVP
41 | 
42 | - MVP: private docker registry backed by local disk
43 | - MVP: opt-in support for storing Dockerimages in the private registry
44 | - private docker registry backed by [Riak CS](https://github.com/cloudfoundry/cf-riak-cs-release)
45 | - highly-available private docker registry
46 | - ability to stage *private* images (i.e. images that require auth to download; basic idea: the user supplies credentials which we use *at staging time* to fetch their Dockerimage - a short-lived token is OK since we only need it for staging)
47 | - support for [Docker's V2 API](https://github.com/docker/distribution) -- currently not ready for primetime, but when it is we'll want our private registry to speak V2 and use the new Docker Go codebase
48 | - ability to prune the private registry of unused Dockerimages (e.g. grab DesiredLRPs in fresh domains and flag any unused layers for removal)?
-------------------------------------------------------------------------------- /proposals/relational-bbs-db-stories-2016-02-16.prolific.md: --------------------------------------------------------------------------------
1 | As a Diego developer, I expect the Diego BBS-benchmark test suite to include per-record reads and writes in the rep bulk loop
2 | 
3 | The rep bulk loop tests in the BBS benchmark suite should perform individual reads and some writes (perhaps proportional to the number of instances) after their bulk loop, to match the actual rep code more closely.
4 | 
5 | For 200K instances, can we do these requests all from the same ginkgo process, if its host has enough CPU/bandwidth/connections available, or do we need to distribute this test across several VMs to avoid resource bottlenecks at the test site?
6 | 
7 | Acceptance: Benchmark test suites running in CI include these per-record writes and reads in their rep bulk loops.
8 | 
9 | L: perf, perf:breadth
10 | 
11 | ---
12 | 
13 | As a Diego developer, I can run BBS unit tests against MySQL on my local workstation
14 | 
15 | The BBS should be able to pass its unit tests when run locally against a MySQL database, as well as against etcd. For the component integration test, pass in the required MySQL connection information via flags.
16 | 
17 | This should include support for encrypting the parts of Task, DesiredLRP, and ActualLRP records that should be encrypted and managing the encryption key name in the MySQL database.
18 | 
19 | Also document the steps needed to install MySQL on a development workstation.
20 | 
21 | Acceptance: I can follow the MySQL installation steps in the CONTRIBUTING document in diego-release and then successfully run the BBS unit tests.
22 | 23 | L: bbs:relational 24 | 25 | --- 26 | 27 | As a Diego developer, I expect BBS unit tests to run against both MySQL and etcd in CI 28 | 29 | Now that the BBS unit tests run against both MySQL and etcd, configure CI to run them both against etcd and MySQL in the 'units' concourse job. This also entails installing any new dependencies in the appropriate pipeline Docker image. 30 | 31 | Acceptance: I can verify from the 'units' job output that the expected tests suites are run. 32 | 33 | L: bbs:relational 34 | 35 | --- 36 | 37 | As a Diego developer, I expect inigo and component integration test suites to run against both MySQL and etcd in CI 38 | 39 | Extend MySQL support to the other non-BOSH-based integration tests that our components run against the BBS. This includes inigo, the 'cmd' integration tests in each component, and a few other test suites (such as the operation test suites in the rep). 40 | 41 | Acceptance: I can verify from the 'units' and 'inigo' job output that the expected tests suites are run. 42 | 43 | L: bbs:relational 44 | 45 | --- 46 | 47 | As a Diego operator, I can run a CF+Diego deployment backed by a MySQL DB instance 48 | 49 | diego-release should allow the BBS to be BOSH-configured to use a MySQL database outside the Diego deployment as its store, with etcd still the default. This option should be configurable in the spiff-based manifest generation through the property-overrides stub. 50 | 51 | Provide instructions and an auto-generated manifest for deploying a singleton MySQL instance to BOSH-Lite and configuring Diego to use it as its backend. 52 | 53 | Acceptance: I can follow instructions to deploy CF+Diego to BOSH-Lite using a MySQL DB and then run CATs and vizzini against it successfully. 54 | 55 | L: bbs:relational 56 | 57 | --- 58 | 59 | As a Diego developer, I expect Diego CI to run CATs and vizzini against a 'catsup' AWS environment with the BBS backed by an RDS MySQL instance 60 | 61 | Create a new 'catsup' CF+Diego deployment on AWS analogous to ketchup, with a MySQL RDS instance as its backing store. Set up a pipeline parallel to ketchup that runs CATs and vizzini against catsup, both on changes to diego-release or cf-release and periodically. 62 | 63 | Acceptance: CATs and vizzini against catsup are green. 64 | 65 | L: bbs:relational 66 | 67 | --- 68 | 69 | [RELEASE] Diego with relational BBS functional but experimental 70 | 71 | L: bbs:relational 72 | 73 | --- 74 | 75 | As a Diego developer, I expect to run Diego BBS benchmarks against an AWS environment with the BBS backed by an RDS MySQL instance 76 | Create a new CF+Diego deployment on AWS analogous to the 'diego-3' environment and set up the BBS benchmarks to run against it, both on changes to diego-release at the appropriate level of stability (release-candidate?) and periodically. Configure it to emit metrics to Datadog and to store the test-run artifacts in a separate S3 bucket. 77 | 78 | Aim to support 200K instances across 1000 reps initially. 79 | 80 | Acceptance: BBS benchmarks pass consistently at the appropriate scale, and results are recorded in S3. 81 | 82 | L: bbs:relational 83 | 84 | --- 85 | 86 | As a Diego developer, I expect 'catsup' and the relational-benchmarks deployment to be backed by HA MySQL deployments 87 | 88 | Deploy the CF MySQL release in a suitably HA configuration to catsup and the new relational benchmarks environment and configure the Diego deployment to use that as its backing store instead of RDS. 
Avoid usage of anything specific to AWS, such as ELBs, to make the MySQL deployment HA. 89 | 90 | Acceptance: Catsup and the relational benchmarks environment function without RDS instances provisioned. 91 | 92 | L: bbs:relational 93 | 94 | --- 95 | 96 | BBS communicates to the relational store over SSL 97 | 98 | The BBS can connect to its relational store using SSL. Any additional relevant information should be configurable via BOSH properties. Do we need to be able to supply a separate CA for use with that connection, or should we rely on the globally trusted or BOSH-deployed trusted certificates? 99 | 100 | Acceptance: catsup and the relational benchmarks connect to their relational stores over SSL. 101 | 102 | L: bbs:relational, security 103 | 104 | --- 105 | 106 | [RELEASE] Diego with relational BBS supports 200K instances 107 | 108 | L: bbs:relational, perf, perf:breadth 109 | 110 | --- 111 | 112 | Placeholder: remaining relational-BBS topics (scale, Postgres, migration from etcd, manifest generation) 113 | 114 | - BBS can use either MySQL or Postgres as a BBS store (BBS units run against all 3 backends) 115 | - Diego manifest generation supports relational stores (BOSH-Lite: CF's postgres; catsup: HA MySQL) 116 | - Diego migrates existing BBS data from etcd to its relational store 117 | - Diego CI runs another DUSTs suite in etcd-to-relational migration mode 118 | - [RELEASE] Diego with relational BBS fully supported 119 | - Diego CI runs a DUSTs suite with only MySQL 120 | - ketchup migrates to relational BBS store, catsup obsolete 121 | 122 | 123 | L: bbs:relational, placeholder 124 | -------------------------------------------------------------------------------- /proposals/relational-bbs-db-stories-2016-04-18.prolific.md: -------------------------------------------------------------------------------- 1 | As a Diego operator, I expect the BBS to migrate existing data from etcd to the relational store 2 | 3 | If the BBS is configured with connection information and credentials for both etcd and a relational store, it will migrate existing data in etcd to the relational store and then serve data only from the relational store. The migration mechanism should also ensure that if a BBS server becomes master that understands only how to serve data from etcd, it should realize that it is not current and relinquish the BBS lock. It should also ensure that this migration happens exactly once, so that subsequent BBS servers that start with the dual 4 | 5 | This likely needs to be done using the versioning system already implemented in the etcd store, so that BBSes all the way back to 0.1434.0 will function correctly. Consider the `CurrentVersion`/`TargetVersion` rules implemented in [the existing BBS migration mechanism](https://github.com/cloudfoundry/diego-dev-notes/blob/master/accepted_proposals/bbs-migrations.md#the-bbs-migration-mechanism). 6 | 7 | Acceptance: 8 | - I can configure a multi-instance BBS Diego first to populate etcd with data, then redeploy it with a relational configuration and observe that data is migrated from etcd to the relational store correctly (preserved, migration happens only once). 9 | - I can arrange the deployment with BBS both in etcd and in etcd+relational configurations and observe that once the data is migrated to the relational store, the etcd-only BBSes do not retain the lock during BBS elections. 
10 | - I can cause the etcd-to-relational migration to fail on one BBS node, and observe that it does not prevent etcd-only BBSes from taking over and serving requests or other etcd-and-relational BBSes from performing the migration later. 11 | - If relational credentials are supplied to the BBS but a BBS node cannot connect on start-up, the deploy should fail on that node. 12 | 13 | 14 | L: bbs:relational 15 | 16 | --- 17 | 18 | As a Diego operator, I should be able to generate a Diego manifest with connection info for a relational store 19 | 20 | An operator should be able to supply connection information to connect to the relational store they have provisioned externally. This information should come in as parameters in a stub (either in property-overrides or a new optional stub argument). 21 | 22 | Acceptance: 23 | - I can follow documentation in diego-release to place the relational connection information in the appropriate stub as an input to manifest-generation, and then use that to deploy Diego against MySQL on BOSH-Lite. 24 | 25 | L: bbs:relational 26 | 27 | --- 28 | 29 | Explore Diego system behavior with several CF MySQL nodes to discover how deadlock errors and rollbacks affect Diego correctness 30 | 31 | We'd like to know how Diego deals with consistency errors if it's communicating with several different CF MySQL nodes simultaneously. Core Services has said we'll get deadlocks, but they will roll back and manifest as a failure client-side. Investigate how this affects the correctness of the Diego system, and generate stories with recommendations of how to mitigate the problems in the BBS, if possible. If this is impossibly bad, we may need to change the switchboard proxy to do leader election via, say, consul locks. 32 | 33 | Also determine if we can configure the mysql client to accept several IPs, or if it requires only a single hostname. If the latter, we should explore colocating the consul agent on the proxy nodes to register them via Consul DNS. 34 | 35 | Timebox to 2 days initially. 36 | 37 | L: bbs:relational, charter 38 | 39 | --- 40 | 41 | As a Diego operator, I should be able to follow documentation to deploy an single-node standalone CF-MySQL cluster on my infrastructure 42 | 43 | The diego-release documentation should explain how to produce a deployment manifest for a single-node standalone CF-MySQL cluster. Ideally, we can link to documentation in the cf-mysql release itself with a high-level explanation of the required deployment parameters (instance counts, job types). If the cf-mysql-release makes this real difficult, talk to Core Services about getting it changed in their release. (Spoiler: Luan says they need this help for a minimal+standalone deployment option.) 44 | 45 | Acceptance: 46 | - I can follow documentation to provision a single-node standalone CF-MySQL cluster on AWS and then successfully deploy Diego to use that as its datastore via the manifest-generation scripts. 47 | 48 | L: bbs:relational 49 | 50 | --- 51 | 52 | As a Diego operator, I should be able to follow documentation to provision an RDS MySQL instance to support my AWS Diego deployment 53 | 54 | The diego-release AWS documentation should explain how to provision an RDS MySQL instance manually in the VPC created for the BOSH+CF+Diego stack, and then how to provide the connection information to the Diego manifest-generation scripts. This configuration should be optional for now. 
55 | 56 | Let's also make this manual for now, and consider how to provision this with CloudFormation/BOOSH/bbl later. 57 | 58 | Acceptance: 59 | - I can follow the AWS-specific documentation in diego-release to provision this RDS instance and connect my AWS Diego deployment to it. 60 | 61 | L: bbs:relational 62 | 63 | --- 64 | 65 | As a Diego operator, I should be able to follow documentation to deploy an HA standalone CF-MySQL cluster on my infrastructure 66 | 67 | The diego-release documentation should explain how to produce a deployment manifest for an HA standalone CF-MySQL cluster. Ideally, we can link to documentation in the cf-mysql release itself with a high-level explanation of the required deployment parameters (instance counts, job types). 68 | 69 | BLOCKED on outcome of the charter #117854575 to explore Diego behavior when interacting with multiple masters. 70 | 71 | Acceptance: 72 | - I can follow documentation to provision an HA standalone CF-MySQL cluster on AWS and then successfully deploy Diego to use that as its datastore via the manifest-generation scripts. 73 | 74 | L: bbs:relational 75 | 76 | --- 77 | 78 | As a Diego developer, CI should be configured to run the DUSTs to upgrade the BBS from Diego 0.1434.0 targeting etcd to latest targeting MySQL 79 | 80 | This DUSTs run should not yet block delivery of diego-release. 81 | 82 | L: bbs:relational 83 | 84 | --- 85 | 86 | As a Diego operator, I should be able to generate a Diego manifest that uses only a relational store 87 | 88 | As a Diego operator who either has finished migrating an existing etcd-backed deployment to MySQL or is starting a new deployment with MySQL only, I should be able to produce a Diego deployment manifest that does not require the etcd-release and does not set etcd deployment properties. The manifest-generation script should also require minimal relational configuration to be provided. 89 | 90 | Acceptance: 91 | - I can follow documentation to generate a deployment manifest intended to support only a relational store, and then deploy Diego successfully without an etcd release present on the BOSH director. The manifest should also not contain any extraneous etcd-specific parameters, and should not configure the BBS to communicate with etcd at all. 92 | 93 | L: bbs:relational 94 | 95 | --- 96 | 97 | Placeholder: remaining relational-BBS topics (pipeline work, scale validation, validation of robustness to failure) 98 | 99 | - Validate Diego system correctness in face of MySQL deployment failures 100 | - warp-drive, etcd-to-MySQL DUST failures block delivery 101 | - basic end-to-end perf validation? 102 | - make sure BBS can run SQL migrations (need migration to drive this for sure, could use the timeout-units one) 103 | - [RELEASE] Diego with relational BBS fully supported 104 | - Diego CI runs a DUSTs suite with only MySQL 105 | - ketchup migrates to relational BBS store, catsup obsolete 106 | 107 | 108 | L: bbs:relational, placeholder 109 | -------------------------------------------------------------------------------- /proposals/relational-bbs-db-stories-2016-05-10.prolific.md: -------------------------------------------------------------------------------- 1 | As a Diego operator, I should be able to use Postgres as a backing relational store for the Diego BBS 2 | 3 | We have received feedback that some operators would prefer to use Postgres as their backing relational store for the BBS API, as they have operational familiarity with that database instead of MySQL. 
CC and UAA also provide top-level support for both MySQL and postgres, so supporting both for Diego's BBS database is reasonable. 4 | 5 | ### Acceptance 6 | 7 | - I can follow documentation to install and configure postgres on my local workstation and then run BBS test suites against it. 8 | - I can observe from test runs in CI that appropriate test coverage is run against Postgres. 9 | 10 | L: bbs:relational 11 | 12 | --- 13 | 14 | As a Diego PM, I expect that failures in the super-laser tests and etcd-to-MySQL DUSTs run should block delivery of diego-release 15 | 16 | In order to guarantee support for Diego when backed by a relational store, we should arrange for test runs in CI against a relational store to gate delivery of diego-release. 17 | 18 | BLOCKED on #117843841: running an etcd-to-MySQL DUSTs run. 19 | 20 | L: bbs:relational, blocked 21 | 22 | --- 23 | 24 | [RELEASE] Diego with relational BBS fully supported 25 | 26 | L: bbs:relational 27 | 28 | --- 29 | 30 | Update performance-testing protocol for 250K-instance, 1000+-cell end-to-end performance test 31 | 32 | Update the [performance-testing protocol in the diego-dev-notes](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/measuring_performance.md) for the 250K-instance experiment we intend to run. Address the following points: 33 | 34 | - Distribution of and resource limits for app instances: at only 1000 cells, container density will be 250 per cell, which is right at the default limit for garden-linux (and garden-runc?) 35 | - Prescribed IOPS configuration for cell disk at that scale 36 | - Consistency of app behavior (we seem to have turned off substantial app logging in later experiments?) 37 | - Collection of logs and metrics from the environment: for example, we learned that Datadog is not ideal for later analysis of metrics 38 | 39 | Let's have a separate story to add extra experiments that we think would be valuable to run. 40 | 41 | 42 | ### Acceptance 43 | 44 | The performance testing protocol has explicit configuration parameters for the 250K-instance test and is up-to-date and consistent in its Diego terminology. It also should account for any deficiencies we observed in data processing capabilities after the 10K-instance test run. 45 | 46 | L: bbs:relational, perf, perf:breadth 47 | 48 | --- 49 | 50 | As a Diego operator, I expect to be able to upgrade my MySQL-backed deployment from the earliest supported version to the latest without downtime 51 | 52 | We can ensure this by configuring CI to run the DUSTs from a MySQL-backed deployment on that earliest supported version to a MySQL-backed deployment on the latest candidate version. 53 | 54 | BLOCKED on declaring support for a relational store in a future Diego final release. 55 | 56 | L: bbs:relational, blocked, pipeline 57 | 58 | --- 59 | 60 | [CHORE] Migrate ketchup to a relational store and destroy the super-laser environment 61 | 62 | BLOCKED on declaring support for a relational store in a future Diego final release. 
63 | 64 | L: bbs:relational, blocked, pipeline 65 | 66 | --- 67 | 68 | Placeholder: remaining relational-BBS topics (migration validation, especially on failure) 69 | 70 | ### Pre-support 71 | 72 | - charter: investigate migration failure scenarios 73 | - make sure BBS can run SQL migrations (need migration to drive this for sure, could use the timeout-units one) 74 | 75 | 76 | L: bbs:relational, placeholder 77 | -------------------------------------------------------------------------------- /proposals/release-versioning-testing-stories.prolific.md: -------------------------------------------------------------------------------- 1 | As a Diego developer, I can deploy CF v220 + Diego 0.1434.0 + GL 0.307.0 ('V0') as a specific collection of coordinated piecewise deployments 2 | 3 | This CF + Diego deployment should consist of 5 separate BOSH deployments: 4 | 5 | - CF 6 | - `D1`: 1 database VM 7 | - `D2`: 1 brain, 1 cc-bridge, 1 access, 1 route-emitter 8 | - `D3`: 1 cell 9 | - `D4`: 1 cell 10 | 11 | Acceptance: 12 | 13 | - I can run a script to generate these manifests that I can easily configure to produce BOSH-Lite-specific manifests. 14 | - After uploading the designated releases, I can use the manifests to deploy CF + Diego at these versions to a BOSH-Lite, even if other releases exist on its BOSH director. 15 | - I can validate that the deployment is correct by running DATs against it successfully. 16 | 17 | 18 | L: pipeline:upgrade-stable 19 | 20 | --- 21 | 22 | As a Diego developer, I can upgrade my piecewise V0 deployment to a given CF/Diego/Garden/ETCD version combination later than V0 (V-prime) 23 | 24 | For a given collection of CF and Diego versions, I can use scripts to generate a set of manifests designed to upgrade the piecewise V0 deployment to the new CF/Diego versions. 25 | 26 | 27 | Acceptance: 28 | 29 | - After checking out cf-release and diego-release at later versions, generating releases, and uploading them, I can generate these CF and Diego manifests and use them to upgrade my piecewise V0 deployment to V-prime. 30 | 31 | 32 | L: pipeline:upgrade-stable 33 | 34 | --- 35 | 36 | As a Diego developer, I have an 'upgrade-from-stable' test suite to create a piecewise V0 deployment against a BOSH director 37 | 38 | This test suite should take the piecewise V0 deployment and deploy CF and Diego to the target director. We can assume that the correct V0 releases are already uploaded to the director. 39 | 40 | Acceptance: 41 | 42 | - After uploading the V0 releases to my BOSH-Lite, I can run the test suite with the appropriate configuration to deploy the piecewise V0 deployment. 43 | - The suite should fail if the deployments already exist or if the BOSH commands themselves fail. 44 | 45 | 46 | L: pipeline:upgrade-stable 47 | 48 | --- 49 | 50 | As a Diego developer, I expect that the 'upgrade-from-stable' suite upgrades my V0 deployment to V-prime 51 | 52 | The suite should assume that the V-prime releases are also already uploaded to the target BOSH director. The upgrade should proceed in the following order: 53 | 54 | - Upgrade `D1` (databases) 55 | - Upgrade `D3` (cell bank 1) 56 | - Upgrade `D2` (brain and pals) 57 | - Upgrade CF 58 | - Upgrade `D4` (cell bank 2) 59 | 60 | Acceptance: 61 | 62 | - I can run this test suite against a BOSH-Lite after uploading the required V0 and V-prime releases and verify that it deploys a piecewise V0 deployment and then upgrades it to V-prime. 
63 | 64 | 65 | L: pipeline:upgrade-stable 66 | 67 | --- 68 | 69 | As a Diego developer, I expect that the 'upgrade-from-stable' suite runs smoke tests against my piecewise deployment at each step between upgrades from V0 to V-prime 70 | 71 | We also need to change the deployment strategy to stop and start groups of cells intentionally to ensure that the apps in the smoke tests are running on cells of the intended version. 72 | 73 | Deployment changes: 74 | 75 | - After upgrading `D3`, stop `D4`. 76 | - After upgrading `D2`, start `D4` and stop `D3`. 77 | - Before upgrading `D4`, start `D3`. 78 | 79 | Smoke test to run after each set of deployment changes: 80 | - Push 2 instances of new test app (e.g., Dora) with SSH enabled 81 | - Verify all instances run, are routable, have visible logs, and are accessible over SSH (run some trivial command) 82 | - Destroy the instances and their routes 83 | 84 | Acceptance: I can still run the test suite against a local BOSH-Lite and verify from the suite output and inspecting the system during the run that the smoke tests are running. 85 | 86 | 87 | L: pipeline:upgrade-stable 88 | 89 | --- 90 | 91 | As a Diego developer, I expect that the 'upgrade-from-stable' suite pushes an app to my piecewise deployment after V0 is deployed and checks its scalability at each step between upgrades from V0 to V-prime 92 | 93 | After the suite deploys V0, it pushes one instance of a test app (e.g., Dora) and verifies it is running and routable. Call it A0. 94 | 95 | After each step in the deployment upgrade, the suite scales the app up to 2 instances, verifies they are both running and routable, and then scales the app back down to 1 instance and verifies the second instance is gone. 96 | 97 | Acceptance: I can still run the test suite against a local BOSH-Lite and verify from the suite output that the A0 app is created and exercised throughout the test run. 98 | 99 | 100 | L: pipeline:upgrade-stable 101 | 102 | --- 103 | 104 | As a Diego developer, I expect the 'upgrade-from-stable' suite to check the initial V0 app for continual routability during Diego deployment upgrades (but not CF) 105 | 106 | During all of the BOSH operations on the Diego deployment, the test suite should continually verify that the A0 app is routable. The BOSH operations from previous stories are already structured so that the instance is always able to evacuate to an active cell. 107 | 108 | This routing verification should not be done during the CF deployment, as parts of the routing tier (HAproxy and/or the gorouter) may not be available then. 109 | 110 | Acceptance: I can still run the test suite against a local BOSH-Lite and verify from the test output (and by tailing logs on the A0 app) that these routing tests are performed. 111 | 112 | 113 | L: pipeline:upgrade-stable 114 | 115 | --- 116 | 117 | As a Diego developer, I expect to run the 'upgrade-from-stable' suite in CI against a new BOSH-Lite instance provisioned on AWS 118 | 119 | In this case, V-prime is the collection of releases in the candidate build in the pipeline. 120 | 121 | A job in the Diego CI should provision a new BOSH-Lite instance on AWS, upload the V0 and V-prime releases to it, then generate the V0 and V-prime piecewise manifests and run the 'upgrade-from-stable' test suite. Once it is done, we should destroy the BOSH-Lite instance. 122 | 123 | We should run this job as part of the main Diego pipeline, in parallel with CATs and DATs after the ketchup deployment and smoke tests succeed. 
If it fails, we fail to promote the diego-release candidate through the pipeline. 124 | 125 | 126 | L: pipeline:upgrade-stable 127 | 128 | -------------------------------------------------------------------------------- /proposals/release-versioning-testing-v2.md: -------------------------------------------------------------------------------- 1 | # Upgrade Testing V2 2 | 3 | ## Updates from [the previous version](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/release-versioning-testing.md) 4 | 5 | We revisit release version testing and the current upgrade testing strategy (the DUSTs) with the following changes and considerations in mind: 6 | 7 | - The recommended way to deploy Cloud Foundry is now [cf-deployment](https://github.com/cloudfoundry/cf-deployment) instead of the manifest-generation systems in [cf-release](https://github.com/cloudfoundry/cf-release) and [diego-release](https://github.com/cloudfoundry/diego-release), which will be removed soon. 8 | - As a result, the instance groups are implicitly spread across AZs, instead of explicitly grouped by AZ, although some instance groups may still end up being parallelized. 9 | - Breaking changes to other systems in a CF deployment have caused the DUSTs to fail much more often than an actual breaking change between the APIs or components under the control of the Diego team. 10 | - Reliance on BOSH makes the initial deploy and subsequent updates slow. An inigo-style test suite written in Ginkgo could allow the Diego team to iterate faster and to run a more focused subset of the test suite. 11 | 12 | ## Aspects remaining from [the previous version](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/release-versioning-testing.md) 13 | 14 | - We perceive there to be value in maintaining deployments at different heterogeneous combinations of versions and running an appropriate set of tests to validate functionality, instead of coincidentally running tests while proceeding through an upgrade of a single BOSH deployment. 15 | - Operational guarantees have not changed regarding compatibility of versions of diego-release. Refer to [this section](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/release-versioning-testing.md#operational-guarantees) in the original version of the document. 16 | - We still do not require testing every possible upgrade path (i.e. every `(V0, V1)` pair). See [this section](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/release-versioning-testing.md#selection-of-versions-for-testing) for the rationale. 17 | 18 | # Test configurations 19 | 20 | The new DUSTs suite separates two concerns, explained in the following sections: app routability during upgrades and smoke tests. 21 | 22 | We select two main configurations of importance, based on the initial major versions of the release and major architectural changes: 23 | 24 | - Diego starting at v1.0.0, configured to use Consul and global route-emitters, 25 | - Diego starting at v1.25.2, configured to use Locket with local route-emitters. 26 | 27 | ## App Routability during Upgrades 28 | 29 | As in the current [DUSTs](https://github.com/cloudfoundry/diego-upgrade-stability-tests), there will always be an app pushed right after the initial configuration (C0) is deployed and constantly checked for routability. 30 | 31 | ### BBS running with Consul (from v1.0.0 to develop) 32 | 33 | - Initial Diego version is v1.0.0. 34 | - Infrastructure components: MySQL database, NATS, Consul, GoRouter.
35 | - Route-emitter is configured in global mode, with Consul lock. 36 | - Each Cell includes the Diego cell rep and garden-runc. 37 | 38 | | Configuration | BBS | BBS Client | Auctioneer | Cell 0 | Cell 1 | RouteEmitter | Notes | 39 | |---------------|-----|------------|------------|--------|---------|--------------|-------------------------------------| 40 | | C0 | v0 | v0 | v0 | v0 | v0 | v0 | Initial configuration | 41 | | C1 | v1 | v0 | v0 | v0 | v0 | v0 | Simulates upgrading diego-api | 42 | | C2 | v1 | v1 | v0 | v0 | v0 | v0 | Simulates API upgrading | 43 | | C3 | v1 | v1 | v1 | v0 | v0 | v0 | Simulates scheduler upgrading | 44 | | C4' | v1 | v1 | v1 | v1 | v0 | v0 | Simulates Cell evacuation/upgrading | 45 | | C4 | v1 | v1 | v1 | v1 | v1 | v0 | Simulates cell evacuation/upgrading | 46 | | C5 | v1 | v1 | v1 | v1 | v1 | v1 | Simulates route-emitter upgrade | 47 | 48 | ### BBS running with Locket (from v1.25.2 to develop) 49 | 50 | - Initial Diego version is v1.25.2. 51 | - Infrastructure components: MySQL database, NATS, GoRouter. 52 | - Route-emitter is configured in local mode. 53 | - Each Cell includes the Diego cell rep, garden-runc, grootfs, and local route-emitters. 54 | 55 | | Configuration | Locket | BBS | BBS Client (Vizzini) | Auctioneer | Cell 0 | Cell 1 | Notes | 56 | |---------------|--------|-----|----------------------|------------|--------|--------|-------------------------------------| 57 | | C0 | v0 | v0 | v0 | v0 | v0 | v0 | Initial configuration | 58 | | C1 | v1 | v0 | v0 | v0 | v0 | v0 | Simulates upgrading diego-api | 59 | | C2 | v0 | v1 | v0 | v0 | v0 | v0 | | 60 | | C3 | v1 | v1 | v1 | v0 | v0 | v0 | Simulates API upgrading | 61 | | C4 | v1 | v1 | v1 | v1 | v0 | v0 | Simulates scheduler upgrading | 62 | | C5' | v1 | v1 | v1 | v1 | v1 | v0 | Simulates cell evacuation/upgrading | 63 | | C5 | v1 | v1 | v1 | v1 | v1 | v1 | Simulates cell evacuation/upgrading | 64 | 65 | ## Smoke Tests 66 | 67 | Run the vizzini test suite at the appropriate version to verify core functionality of the Diego BBS API. This test could be structured as its own isolated context, in which it starts the appropriate versions of the components and runs the vizzini tests, instead of coupling it to the sequential upgrade of the components. 68 | 69 | ### BBS running with Consul (from v1.0.0 to develop) 70 | 71 | - Initial Diego version is v1.0.0. 72 | - Infrastructure components: MySQL database, NATS, Consul, GoRouter. 73 | - Route-emitter is configured in global mode, with Consul lock. 74 | - Each Cell includes the Diego cell rep and garden-runc. 75 | 76 | | Configuration | BBS | BBS Client | SSH proxy + Auctioneer | Cell | RouteEmitter | Notes | 77 | |---------------|-----|------------|------------------------|------|--------------|---------------------------------| 78 | | C0 | v0 | v0 | v0 | v0 | v0 | Initial configuration | 79 | | C1 | v1 | v0 | v0 | v0 | v0 | Simulates upgrading diego-api | 80 | | C2 | v1 | v1 | v0 | v0 | v0 | Simulates API upgrading | 81 | | C3 | v1 | v1 | v1 | v0 | v0 | Simulates scheduler upgrading | 82 | | C4 | v1 | v1 | v1 | v1 | v0 | Simulates cell upgrade | 83 | | C5 | v1 | v1 | v1 | v1 | v1 | Simulates route-emitter upgrade | 84 | 85 | ### BBS running with Locket (from v1.25.2 to develop) 86 | 87 | - Initial Diego version is v1.25.2. 88 | - Infrastructure components: MySQL database, NATS. 89 | - Route-emitter is configured in local mode. 90 | - Each Cell includes the Diego cell rep, garden-runc, grootfs, and local route-emitters. 
91 | 92 | | Configuration | Locket | BBS | BBS Client (Vizzini) | SSH proxy + Auctioneer | Cell | Notes | 93 | |---------------|--------|-----|----------------------|------------------------|------|-------------------------------| 94 | | C0 | v0 | v0 | v0 | v0 | v0 | Initial configuration | 95 | | C1 | v1 | v0 | v0 | v0 | v0 | Simulates upgrading diego-api | 96 | | C2 | v0 | v1 | v0 | v0 | v0 | | 97 | | C3 | v1 | v1 | v1 | v0 | v0 | Simulates API upgrading | 98 | | C4 | v1 | v1 | v1 | v1 | v0 | Simulates scheduler upgrading | 99 | | C5 | v1 | v1 | v1 | v1 | v1 | Simulates cell upgrading | 100 | 101 | ## Notes 102 | 103 | - The above configurations have the added benefit of testing the contract between rep and BBS regarding cell presences (namely, that the newer BBS can still parse old rep cell presences in both Consul or Locket). 104 | - Deploy only one instance for all instance groups except the Cell to ensure routability to the test app at all times. 105 | - Cells are to be updated gracefuly (that is, each cell evacuates before it stops) to ensure routability to the test app at all times. 106 | - Configurations follow the order of instance group updates as of [this version of cf-deployment](https://github.com/cloudfoundry/cf-deployment/commit/9be2644da8de08540891e24856bbdb88f9a83f67). 107 | - We should test both combinations of different versions of BBS and Locket since different versions can exist during a rolling update. 108 | - The SSH proxy and auctioneer do not communicate with each other and exist on the same VM, hence we update them at the same time. 109 | 110 | 111 | # Concerns 112 | 113 | - Routing: We think testing HTTP routing is sufficient, as that is the main routing tier of importance in CF at present. The Routing team is primarily responsible for ensuring interoperability of the clients and servers in the routing control plane. 114 | - Locket: We test with a separate configuration that includes Locket because of the importance of this architectural change. 115 | - Cell update order: The most common update order in the deployment is to update the Diego control plane first (BBS, auctioneer, locket) and then the cells, so we focus on those configurations. 116 | - BOSH upgradability: We do lose test coverage for ensuring that the components restart correctly, and we have encountered issues around pid-file management, but we could handle this with a different test suite. 117 | - Quota enforcement: We did regress on enforcing instance quotas with a breaking change to the cell rep API, but there were vizzini tests that would have caught that regression. 118 | - Diego client: Running the appropriate version of vizzini will provide sufficient coverage of the Diego BBS client functionality. 119 | -------------------------------------------------------------------------------- /proposals/release-versioning-testing.md: -------------------------------------------------------------------------------- 1 | # Release Versioning 2 | 3 | ## Operational guarantees 4 | 5 | We intend the BOSH release of Diego to have a simple policy regarding upgradability, communicated clearly through its version: operators will have a stable upgrade path through versions of Diego as long as they deploy every major version of the release. This stability includes keeping application instances running through upgrades and maintaining the integrity of data submitted through Diego's BBS API. 
6 | 7 | To signal to operators the ramifications of upgrading their deployments to new versions of the release, we will follow `MAJOR.MINOR.PATCH`-style [semantic versioning](http://semver.org) with the following conventions: 8 | 9 | - A **major version increment** may include the removal of deprecated APIs or functionality from a previous major release version. In accordance with our goal of allowing operators to upgrade seamlessly across major versions, we will typically remove functionality only after it is deprecated for a full major version. 10 | 11 | - A **minor version increment** will indicate the addition of new functionality to the release, typically in the form of new APIs or behavior. 12 | 13 | - A **patch version increment** will indicate no new addition of functionality, and the introduction only of bugfixes and refinements to behavior of existing APIs. 14 | 15 | We call a pair of versions `(V1, V2)` *upgradable* if `V1 <= V2` lexicographically and if `0 <= MAJOR(V2) - MAJOR(V1) <= 1`. An upgradable version pair is *strict* if `V1 < V2`. 16 | 17 | ## Timelines 18 | 19 | We have yet to set a precedent for publishing major versions of Diego, and to a large extent our cadence there will depend on the scope and future direction of the Diego project within the CF ecosystem. That said, we understand that there is a natural tension between our desire to keep the Diego codebase and features clean and coherent, and the needs of the CF ecosystem and community to have a sufficiently stable set of APIs on which to build our platform. To that end, we expect to release major versions of Diego no more frequently than every three months, and more realistically approximately every six months to a year. 20 | 21 | 22 | ## Testing strategies 23 | 24 | For a given upgradable version pair `(V1, V2)`, we intend to conduct a battery of tests to ensure that deployments with different combinations of components at `V1` and `V2` function correctly. These tests should include being able to run, scale up, and scale down existing apps created in the `V1` deployment, and to stage and run apps created in the mixed deployment. It will be most important to verify that `V1` cells are compatible with `V2` inputs and vice-versa, so we will test these operations against the following combinations of components: 25 | 26 | 27 | | Configuration | BBS | Cells | Brain/Bridge | CF | 28 | |----------------|------|-------|--------------|------| 29 | | `C0` (Initial) | `V1` | `V1` | `V1` | `V1` | 30 | | `C1` | `V2` | `V1` | `V1` | `V1` | 31 | | `C2` | `V2` | `V2` | `V1` | `V1` | 32 | | `C3` | `V2` | `V1` | `V2` | `V1` | 33 | | `C4` | `V2` | `V1` | `V2` | `V2` | 34 | | `C5` | `V2` | `V2` | `V2` | `V2` | 35 | 36 | 37 | Rather than attempting these operations against a deployment in flight during a monolithic upgrade of the Diego deployment, we can progress through these combinations intentionally and deterministically by arranging the Diego deployment as four separate, coordinated deployments: 38 | 39 | - `D1`: Database nodes 40 | - `D2`: Brain, CC-Bridge, Route-Emitter, and Access nodes 41 | - `D3`: Cell group 1 42 | - `D4`: Cell group 2 43 | 44 | The CF deployment will be separate, as it is today. 
The configurations above can then be achieved with the following sequence of BOSH operations: 45 | 46 | - `C1`: Deploy `V2` to `D1` 47 | - `C2`: Deploy `V2` to `D3`, stop `D4` 48 | - `C3`: Deploy `V2` to `D2`, start `D4`, stop `D3` 49 | - `C4`: Deploy `V2` to CF 50 | - `C5`: Start `D3`, deploy `V2` to `D4` 51 | 52 | Assuming that BOSH stopping a VM triggers draining, so that the Diego cells evacuate correctly, these operations should allow existing app instances to be routable continually. 53 | 54 | We detail the tests for each one of these configurations below: 55 | 56 | ### `C0`: System at `V1` 57 | 58 | With the system in its initial state, we seed it with an instance of Dora, Grace, or another test application asset and verify it is routable. For the sake of illustration below, assume that the app is Dora. 59 | 60 | 61 | ### `C1`: BBS at `V2`, Cells and Brain/Bridge at `V1` 62 | 63 | During the upgrade of the BBS to `V2`, we verify that the Dora instance is always routable. 64 | 65 | After upgrading the BBS to `V2`, we verify that the system functions correctly: 66 | 67 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 68 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 69 | 70 | 71 | ### `C2`: BBS and Cells at `V2`, Brain/Bridge/CF at `V1` 72 | 73 | During the upgrade of the `D3` cells to `V2` and the shutdown of the `D4` cells, we verify that the Dora instance is always routable. 74 | 75 | With all the active Cells at `V2`, we verify that the system functions correctly: 76 | 77 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 78 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 79 | 80 | 81 | ### `C3`: BBS and Brain/Bridge at `V2`, CF and Cells at `V1` 82 | 83 | During the upgrade of the Brain and CC-Bridge to `V2`, we verify that the Dora instance is always routable. 84 | 85 | During the re-start of the `D4` cells and the stop of the `D3` cells, the Dora instance should be routable, as it will evacuate from the `D3` cells to the `D4` cells. 86 | 87 | With CF and all the active Cells at `V1` and the rest of the system at `V2`, we verify that the system functions correctly: 88 | 89 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 90 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 91 | 92 | > If we introduce changes in the RunInfo that cannot be represented appropriately in the `V1` BBS API, do we expect that `V1` cells will be able to run that work? If not, is there any way we can ensure that older, incompatible cells won't be assigned work they can't do? 93 | 94 | 95 | ### `C4`: BBS and Brain/Bridge/CF at `V2`, Cells at `V1` 96 | 97 | During the upgrade of the CF deployment to `V2`, the gorouter and HAproxy will likely be offline for some amount of time, so the Dora instance is not expected to be routable externally then. 98 | 99 | With all the active Cells at `V1` and the rest of the system at `V2`, we verify that the system functions correctly: 100 | 101 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 102 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 
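As a concrete illustration of the "always routable" checks called for in configurations `C1` through `C4` above, the continual routability probe against the seeded test app could be as simple as the following sketch. This is a hedged example in Go, not part of the proposal itself; the app URL, the one-second polling interval, and the ten-minute duration are placeholder assumptions.

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

// pollRoutability issues an HTTP GET against the test app's route once per
// second and reports any interval in which the app stops answering 200 OK.
func pollRoutability(appURL string, stop <-chan struct{}) {
	client := &http.Client{Timeout: 5 * time.Second}
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case t := <-ticker.C:
			resp, err := client.Get(appURL)
			if err != nil {
				fmt.Printf("%s: route check failed: %v\n", t.Format(time.RFC3339), err)
				continue
			}
			resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				fmt.Printf("%s: unexpected status %d\n", t.Format(time.RFC3339), resp.StatusCode)
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	// "dora.bosh-lite.com" is a placeholder route for the seeded Dora instance.
	go pollRoutability("http://dora.bosh-lite.com", stop)
	time.Sleep(10 * time.Minute) // cover the BOSH operations for one configuration step
	close(stop)
}
```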
103 | 104 | 105 | ### `C5`: BBS, Cells, and Brain/Bridge/CF all at `V2` 106 | 107 | With the entire system at `V2`, we verify that the system functions correctly: 108 | 109 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 110 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 111 | 112 | 113 | ## Minimal Deployment on BOSH-Lite 114 | 115 | In order to run this test suite effectively in CI, we propose that the target environment be an instance of BOSH-Lite running on AWS, which will be easy to provision on demand and fast to deploy `V1` and then to migrate to `V2`. The deployments above can then consist of the following groups of VMs: 116 | 117 | - `D1`: 3-node database cluster (to ensure minimal downtime during the update to `V2`) 118 | - `D2`: 1 node each of the Brain, CC-Bridge, Access, and Route-Emitter 119 | - `D3`: 1 Cell 120 | - `D4`: 1 Cell 121 | - CF: Minimal CF, with no need for the DEAs or HM9000 122 | 123 | We may even be able to capture an image of the BOSH-Lite instance after deploying `V1` and pushing the initial set of apps, and begin the test suite by restarting a copy of that instance and verifying that the system is still functional before proceeding (although with BOSH-Lite this may require that the instance always have the same IP). 124 | 125 | 126 | ## Selection of Versions for Testing 127 | 128 | While we would ideally conduct these tests for every strict upgradable version pair `(V1, V2)`, as we generate new minor and patch versions of the release this will be prohibitively expensive to implement, and would likely have strongly diminishing returns. Instead, we intend to test upgrades where `V1` is the earliest supported version within its major version, and `V2` is the HEAD of the develop branch (after passing smoke tests or perhaps CATs/DATs on ketchup). In the near-term, these `V1` versions will be: 129 | 130 | - Version 0: earliest supported stable Diego (0.1434.0) 131 | - Version 1: 1.0.0 132 | 133 | When we release Version 2 of Diego, we will cease testing upgrades from 0.1434.0, but continue to test upgrades from 1.0.0 to the latest develop branch. 134 | 135 | We also recognize that the addition of significant new functionality in a minor version may also warrant its promotion to a `V1` against which we test upgrades as part of CI. 136 | -------------------------------------------------------------------------------- /proposals/rolling-out-diego.md: -------------------------------------------------------------------------------- 1 | # Rolling out Diego 2 | 3 | The vision here is simple. We roll out Diego in a given environment incrementally. We'll have a couple of versions where Diego is deployed *alongside* the DEAs. Operators will be told to encourage their developers to try the new backend and send us feedback, etc. Operators will be able to choose whether opting applications into Diego is under *their* control or under their user's control. 4 | 5 | After some period of having both backends deployed, we deploy CF so that Diego is the only show in town. This deployment could be accompanied by a migration that forces all applications onto Diego, though ideally (to avoid downtime) most applications will be running on Diego by then. 6 | 7 | ## How do applications opt into Diego? 8 | 9 | Our use of environment variables is a source of complexity and is difficult to audit. 10 | 11 | We propose introducing a new application column to the DB called, simply, `diego`. 
This would be a boolean settable via the CC API (though see below). The boolean would be around for the transitional CF releases and will then be removed once Diego is the only backend. 12 | 13 | Since Diego can run DEA-staged droplets, modifying the `diego` boolean can automatically move an application from the DEA stack to the Diego stack. I would propose that modifying the `diego` boolean always sends an event to the chosen backend to run the application and relies on eventual consistency to wind down the applications on the former backend. To be clear: 14 | 15 | `diego => true`: emits a start message to Diego and allows the DEAs to clean up naturally via HM9000. 16 | `diego => false`: emits a start message to the DEAs and allows Diego to clean up naturally via NSYNC. 17 | 18 | This minimizes downtime and complexity. 19 | 20 | We also propose removing the configuration flags from CC. Diego support is turned on in CC with no way to turn it off. 21 | 22 | ### Who sets the `diego` boolean? 23 | 24 | Some operators may want to give their users control over this. Others may want to do it themselves. Some may want to have a graduated phase-in plan where they give developers control to begin with but then grab control themselves later on. 25 | 26 | To accomplish this, I propose we add a bosh-configurable option called `users_can_select_backend`. If `true` then all users can modify the `diego` boolean. If `false` then *only* operators can modify the `diego` boolean. Note that flip-flopping this option does not modify where the applications are running -- only *who* can modify this state. 27 | 28 | ### How do operators audit the `diego` boolean? 29 | 30 | Operators/admins will use the CC API (we may need to add to it?) to list all applications with `diego=true` or `diego=false`. 31 | 32 | OSS tooling can be built on top of this to slowly bleed load from the DEAs onto Diego. For example, public clouds might use this tooling to move applications from the DEAs to Diego in a controlled manner. We could investigate downtime-less approaches to doing this. Such a tool would not be necessary for the initial deployment of Diego alongside the DEAs. 33 | 34 | ### How do we get visibility into the chosen backend? 35 | 36 | The CLI could be modified such that `cf apps` and `cf app` list the selected backend. Not sure that this is MVP, however, as it would be transitory (when Diego is the only game in town we'd rip this feature out). 37 | -------------------------------------------------------------------------------- /proposals/routing.md: -------------------------------------------------------------------------------- 1 | # Improved Routing API Proposal 2 | 3 | Diego's story around routing, while sufficient for the CF use-case, has much room for improvement. 4 | 5 | To set the stage, let's clarify some roles and responsibilities. 6 | 7 | ### Roles and Responsibilities 8 | 9 | #### Diego 10 | 11 | In the context of routing, Diego (i.e. Cell+BBS+Brain) is solely responsible for: 12 | 13 | 1. Opening ports on containers. 14 | 2. Making routing-related information available to consumers. 15 | 16 | Here "routing information" refers specifically to: 17 | 18 | - The `DesiredLRP` and its routing-related fields: 19 | + `Routes`: currently, an array of strings 20 | + `Ports`: the set of container-side ports to open up on the container. 21 | - The `ActualLRP` and its routing-related fields: 22 | + `Address`: the IP address of the host machine running the container.
23 | + `Ports`: an array of port mappings of the form `{ "container_port": XXX, "host_port": YYY }`. There will an entry in the `ActualLRP` `Ports` array for every corresponding entry in the `DesiredLRP` `Ports` array. To connect to a particular container-side port, you must first lookup the corresponding host-side port and then construct an address of the form `actualLRP.Address:actualLRP.Ports[i].HostPort` 24 | 25 | Consumers of Diego are free to specify arbitrary `Ports` on the `DesiredLRP`. Diego will dutifully open up the corresponding ports on the container and populate the `ActualLRP` `Ports` array appropriately. Diego does not allow modifying the `DesiredLRP`'s `Ports` array after-the-fact. 26 | 27 | Consumers of Diego are *also* free to specify an arbitrary array of `Routes` strings on the `DesiredLRP`. Diego doesn't actually care about these at all and does nothing with them. It simply holds onto the array on the consumer's behalf. Unlike `Ports`, the `Routes` array can be modified dynamically. 28 | 29 | It is up to Diego's consumer to fetch the set of `DesiredLRP`s and `ActualLRP`s and construct a routing table. This is done by: 30 | 31 | 1. Periodically fetching the entire set of `ActualLRP`s and `DesiredLRP`s via the Receptor API. 32 | 2. Attaching to an event stream emanating from the Receptor API. This event stream emits changes to `ActualLRP`s and `DesiredLRP`s soon after they occur. 33 | 34 | Both are necessary to ensure efficiency (real-time events) and robustness (periodic polling to catch for missed events). 35 | 36 | #### Router & Route-Emitter 37 | 38 | The Route-Emitter communicates with Diego via the Receptor API to construct a routing table. It then emits this routing table to the router via NATS (there are plans to eventually have the routers communicate directly with the Receptor and cut out the route-emitter). 39 | 40 | For a given `ProcessGuid` the route-emitter connects all the FQDNs provided in the `DesiredLRP`s `Routes` array to the **first** port in the `Ports` array on the `ActualLRP`s. 41 | 42 | So, concretely, if the consumer requests a `DesiredLRP` with: 43 | 44 | ``` 45 | { 46 | ... 47 | "routes": ["foo.com", "bar.com"], 48 | "ports": [4000, 5000], 49 | ... 50 | } 51 | 52 | ``` 53 | 54 | and Diego starts an `ActualLRP` with: 55 | ``` 56 | { 57 | ... 58 | "address": "10.10.1.2", 59 | "ports": [ 60 | {"container_port":4000, "host_port":59001}, 61 | {"container_port":5000, "host_port":59002}, 62 | ], 63 | ... 64 | } 65 | ``` 66 | 67 | Given this, requests to the router for `foo.com` and `bar.com` will both proxy through to `10.10.1.2:59001`. There is currently no way to connect to port 5000. 68 | 69 | > Note: the description in this section applies *after* the stories for adding an [event stream](https://www.pivotaltracker.com/story/show/84607000) and updating the [route-emitter](https://www.pivotaltracker.com/story/show/84607028) are complete. 70 | 71 | 72 | ### Routing 2.0 73 | 74 | #### Diego Changes 75 | 76 | [Tracker Story](https://www.pivotaltracker.com/story/show/86337946) 77 | 78 | The fact that `DesiredLRP`'s `Routes` is an array of strings is an accident of history. What Diego supports today is the minimum necessary to get the CF usecase to work. In truth, Diego is routing agnostic and we can leave it up to the consumer to encode routing information as they see fit. This opens up several possibilities, including implementing custom service discovery solutions on top of Diego. 
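For reference, the current route-emitter behavior described in the previous section (every FQDN in `Routes` mapped to the host-side address of the *first* container port) amounts to roughly the following sketch. The types here are simplified stand-ins for illustration only, not the actual receptor models.

```
package main

import "fmt"

// Simplified stand-ins for the routing-related fields described above.
type DesiredLRP struct {
	ProcessGuid string
	Routes      []string // currently an array of FQDNs
	Ports       []uint32 // container-side ports
}

type PortMapping struct {
	ContainerPort uint32
	HostPort      uint32
}

type ActualLRP struct {
	ProcessGuid string
	Address     string // IP of the host machine running the container
	Ports       []PortMapping
}

// routesFor maps every FQDN on the DesiredLRP to the host-side address of the
// first container port, mirroring today's route-emitter behavior.
func routesFor(desired DesiredLRP, actual ActualLRP) map[string][]string {
	table := map[string][]string{}
	if len(actual.Ports) == 0 {
		return table
	}
	backend := fmt.Sprintf("%s:%d", actual.Address, actual.Ports[0].HostPort)
	for _, fqdn := range desired.Routes {
		table[fqdn] = append(table[fqdn], backend)
	}
	return table
}

func main() {
	desired := DesiredLRP{ProcessGuid: "guid-1", Routes: []string{"foo.com", "bar.com"}, Ports: []uint32{4000, 5000}}
	actual := ActualLRP{ProcessGuid: "guid-1", Address: "10.10.1.2", Ports: []PortMapping{{4000, 59001}, {5000, 59002}}}
	fmt.Println(routesFor(desired, actual)) // foo.com and bar.com both map to 10.10.1.2:59001
}
```

Note that, as described above, port 5000 is unreachable through the router in this model, which is exactly the limitation the proposal below addresses.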
79 | 80 | The first part of this proposal, then, is to make Diego less restrictive around what can go into `DesiredLRP.Routes`. I propose turning `DesiredLRP.Routes` into (in Go parlance) a `map[string]string`. The key in the map would correspond to a routes provider and the string would correspond to arbitrary metadata associated with said provider. This allows multiple service-discovery/routing providers to live alongside one another in Diego. It also frees up the consumer to define an arbitrary schema to suit their needs. 81 | 82 | Here's an example that supports the CF Router and a DNS service (e.g. skydns): 83 | 84 | ``` 85 | { 86 | ... 87 | "routes": { 88 | "cf-router": "[{\"port\": 4000, \"routes\": [\"foo.com\", \"bar.com\"]}, {\"port\": 5000, \"routes\": [\"admin.foo.com\"]}]", 89 | "skydns": "[{\"port\":5000, \"host\":\"admin-api.service.skydns.local\", \"priority\":20}]" 90 | }, 91 | "ports": [4000, 5000], 92 | ... 93 | } 94 | ``` 95 | 96 | The important thing here is that Diego does not care about what goes into `DesiredLRP.Routes` at all. This frees the user to cook up whatever schema they deem fit. 97 | 98 | Diego should be defensive and apply a limit to the size of the `routes` entry. I propose a(n arbitrary) 4K limit for now. 99 | 100 | #### Route-Emitter Changes 101 | 102 | ##### Supporting multiple ports 103 | 104 | [Tracker Story](https://www.pivotaltracker.com/story/show/86338588) 105 | 106 | The most basic modification to the route-emitter that I would propose would be to support routing to multiple ports on the same container. 107 | 108 | For this, the schema for `cf-router`'s entry in `DesiredLRP.Routes` would look like (from above): 109 | 110 | ``` 111 | [ 112 | { 113 | "port": 4000, 114 | "routes": ["foo.com", "bar.com"] 115 | }, 116 | { 117 | "port": 5000, 118 | "routes": ["admin.foo.com"] 119 | } 120 | ] 121 | ``` 122 | 123 | Given the sample `ActualLRP` above, requests to `admin.foo.com` will now be routed to `10.10.1.2:59002`. 124 | 125 | ##### Supporting Index-Specific-Routing 126 | 127 | [Tracker Story](https://www.pivotaltracker.com/story/show/86338996) 128 | 129 | Another (straightforward) addition to route-emitter would be support for routing to a *particular* container index. This might be useful (for example) to target admin/metric panels for a particular instance or (alternatively) to deterministically refer to specific instances of a database (e.g. giving individual member addresses to an etcd cluster). The schema for `cf-router` might look like: 130 | 131 | ``` 132 | [ 133 | { 134 | "port": 4000, 135 | "routes": ["foo.com", "bar.com"] 136 | }, 137 | { 138 | "port": 5000, 139 | "routes": ["admin.foo.com"], 140 | "route_to_instances": true 141 | } 142 | ] 143 | ``` 144 | 145 | Now requests to `admin.foo.com` will be balanced across all containers, but requests to `0.admin.foo.com` will *only* go to the container at index 0. 146 | 147 | ##### Future Extensions 148 | 149 | The aforementioned additions can be trivially implemented with the existing gorouter. Potential future features can also be expressed easily with this flexible schema. Consider two features: 150 | 151 | 1. The ability to require `ssl` for a given route. 152 | 2. The ability to route `tcp`/`udp` traffic.
153 | 154 | These could be expressed via (for example) 155 | 156 | ``` 157 | [ 158 | { 159 | "port": 4000, 160 | "protocol": "tcp", 161 | "incoming_port": 62312 162 | }, 163 | { 164 | "port": 4000, 165 | "protocol": "udp", 166 | "incoming_port": 43218 167 | }, 168 | { 169 | "port": 5000, 170 | "routes": ["admin.foo.com"], 171 | "ssl": true 172 | } 173 | ] 174 | ``` 175 | 176 | > As an aside: `incoming_port` here reflects a potential implementation of `tcp/udp` routing whereby a user first checks out a load-balancer port. This checked-out port can then be associated with a particular application - the gorouter could use the incoming port, and the information expresed in the schema above, to figure out which application to route to. 177 | -------------------------------------------------------------------------------- /proposals/ssh-one-time-auth-code.md: -------------------------------------------------------------------------------- 1 | # SSH Access via One-Time Authorization Codes 2 | 3 | The use of a user's access token as the password for SSH access to Diego containers is problematic for two reasons: 4 | 5 | 1. A coincidental one: These access tokens are frequently longer than 1KB in length, but the password buffer on the OpenSSH client is hard-coded to 1KB. The token is then truncated (or not even sent), and authentication fails. 6 | 2. A security one: These access tokens are issued for the 'cf' UAA client, but are being used by a service (the proxy) that is not that client. 7 | 8 | We therefore propose an entirely different authentication flow for CF app instances, similar to that presented in [this proposal](https://docs.google.com/document/d/1DTLNW-0twYnIHs9z7OE0v5BDNQhgNIO9kZ0fyqWiMo0/edit): 9 | 10 | - The Diego SSH Proxy is registered as a UAA client with a specific name (say, 'ssh-proxy'). 11 | - The end user obtains from UAA a one-time authorization code issued for that SSH Proxy client and sends that as the SSH password. 12 | - The Diego SSH Proxy then sends a request to UAA as that client to exchange the one-time code for an access token, and uses that token to authorize the user's access to the CF app instance. 13 | 14 | This flow should be the only supported one for the CF authenticator, and we should remove the access-token-as-SSH-password option that is currently implemented in the SSH proxy. Fortunately, the UAA can now issue such tokens as of [story #102931196](https://www.pivotaltracker.com/story/show/102931196), but this work must be included in the next UAA release, which then must be updated in cf-release. 15 | 16 | Implementing this flow and removing the old one requires the following stories: 17 | 18 | --- 19 | 20 | The Diego SSH Proxy can receive an authorization code as the SSH password to access a CF app instance 21 | 22 | 23 | This requires the proxy to be configured with a client name and secret to present to UAA along with the token. Spiff-based generation of the Diego BOSH manifest should be updated to retrieve this secret from the UAA client configuration in the CF manifest. 24 | 25 | 26 | Acceptance: 27 | 28 | - I can follow documentation in the diego-ssh README to obtain a one-time authorization code from UAA for the SSH proxy's client. 29 | - I can then use the code as the password to establish an SSH connection to a CF app instance for which I am authorized access. 30 | - The existing access-token-as-password behavior should continue to work for now (until we update the SSH plugin to use the new flow). 
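To illustrate the proposed flow from the client side, here is a hedged sketch of an SSH connection that presents a one-time authorization code as the password. The proxy address and the `cf:<app-guid>/<instance>` username are placeholder assumptions for illustration, and obtaining the code from UAA (e.g. via the plugin command described below) is elided.

```
package main

import (
	"fmt"
	"log"

	"golang.org/x/crypto/ssh"
)

// connectWithOneTimeCode dials the Diego SSH proxy, presenting a freshly
// issued one-time authorization code as the SSH password.
func connectWithOneTimeCode(proxyAddr, user, oneTimeCode string) error {
	config := &ssh.ClientConfig{
		User: user,
		Auth: []ssh.AuthMethod{ssh.Password(oneTimeCode)},
		// Host-key verification is elided for brevity; a real client should
		// verify or pin the proxy's host key.
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}
	client, err := ssh.Dial("tcp", proxyAddr, config)
	if err != nil {
		return err
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return err
	}
	defer session.Close()

	out, err := session.Output("uptime") // run some trivial command
	if err != nil {
		return err
	}
	fmt.Printf("%s", out)
	return nil
}

func main() {
	// Placeholders: the one-time code would come from UAA, and the username
	// encodes the target app instance.
	if err := connectWithOneTimeCode("ssh.example.com:2222", "cf:app-guid/0", "one-time-code"); err != nil {
		log.Fatal(err)
	}
}
```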
31 | 32 | L: ssh 33 | 34 | 35 | --- 36 | 37 | CC presents the Diego SSH Proxy client name in the /v2/info endpoint 38 | 39 | - The endpoint response should present this name in its `app_ssh_oauth_client` field. 40 | - This name should be BOSH configurable in the CC configuration. 41 | 42 | L: ssh 43 | 44 | 45 | --- 46 | 47 | The SSH plugin establishes SSH connections to CF app instances by sending an authorization code as the SSH password 48 | 49 | - For each connection attempt, the plugin retrieves a new one-time authorization code for the SSH proxy client and uses it as the password. 50 | - The plugin should not have the SSH proxy UAA client name hardcoded, but should instead read it from the CC `/v2/info` endpoint. 51 | - Release version 0.2.0 of the plugin and submit to the [Community Plugin Repo](https://github.com/cloudfoundry-incubator/cli-plugin-repo). 52 | 53 | L: ssh 54 | 55 | 56 | --- 57 | 58 | The SSH plugin provides a command to print a one-time authorization code issued for the SSH proxy client 59 | 60 | Acceptance: 61 | 62 | - I can run `cf get-ssh-code` and use the output as my password for connection with the OpenSSH `ssh` and `scp` clients. 63 | - This command should also look up client info from CC's `/v2/info` endpoint. 64 | 65 | L: ssh 66 | 67 | 68 | --- 69 | 70 | The Diego SSH Proxy no longer accepts a user's access token as an SSH password for CF app instances 71 | 72 | Acceptance: 73 | 74 | - Version 0.1.x of the SSH plugin no longer allows connections. 75 | 76 | 77 | L: ssh 78 | 79 | --- 80 | 81 | -------------------------------------------------------------------------------- /proposals/tuning-health-checks-stories.prolific.md: -------------------------------------------------------------------------------- 1 | Reproduce slow app startup times during evacuation 2 | 3 | We've observed evacuation taking longer than expected on systems with sufficiently high app-instance densities (say, 50 app-instances per cell or higher). Here's the behavior we've seen from the logs: 4 | 5 | - Previously evacuated cell comes back up while next cell is evacuating 6 | - Auction places lots of containers simultaneously on the new cell 7 | - Container-creation time increases to about 30 seconds 8 | - The cell rep downloads droplets for the new app instances 9 | - The cell rep starts the apps and health-checks them aggressively 10 | - The health-check processes take 3-5 seconds to complete instead of the usual 50ms 11 | - The app processes also take longer to come up, as the system is under load, and some of them fail to come up within the default 60s start timeout and are crashed 12 | - Container deletion time also takes a long time (up to 30s) 13 | - Eventually the cell stabilizes, but this can take as long as 10 to 15 minutes 14 | 15 | Let's try to reproduce this phenomenon in an isolated environment so we can explore ways to mitigate it. Here's a starting suggestion about how to make this reproduction environment realistic: 16 | 17 | - Deploy a Diego cluster to AWS with 4 cells on r3.xlarge VMs. This can likely be a one-AZ deployment, but then let's arrange the deployment manifest so that the cell job updates serially (not in parallel with the CC-Bridge/Access/Route-Emitter VMs, as is the default today). 18 | - Deploy separate CF apps to fill up the cluster to a density of 50 instances per cell. The apps should have only 1 instance each, so that new cells incur a realistic number of droplet downloads (which are cached downloads). 
A large enough proportion of these apps should have large droplets that take a long time to start up, such as Java apps (for example, spring-music, or some Spring Boot examples). 19 | - Deploy a change that causes an update to the cells and therefore evacuation. Measure how long it takes cells to evacuate, and how long it takes individual instances on the new cells to start up and become healthy. 20 | 21 | We could also potentially explore this behavior on a 1-cell deployment by starting a lot of already-staged apps simultaneously, but the time required for SSH-key generation might interfere too much. 22 | 23 | Onsi's [analyzer](https://github.com/onsi/analyzer) may be helpful for analyzing the aggregate timing behavior of these starts, but we'll likely need some assistance to get started with it. [Cicerone](https://github.com/onsi/cicerone) may also help with timeline analysis on a per-instance basis. 24 | 25 | 26 | Acceptance: We can demonstrate reproduction of the described slow evacuation phenomenon. 27 | 28 | 29 | L: charter, perf, net-check-action, start-placement 30 | 31 | --- 32 | 33 | Evaluate the effect of reduced health-check frequency on evacuation time 34 | 35 | Once we can reproduce the slow evacuation times in an isolated environment, let's evaluate how much of an effect increasing the time between health checks for unhealthy instances has. It's currently 500ms, so let's try 1s, 2s, and 5s instead with varying instance densities (50/cell, 100/cell, 150/cell?). Do any of those changes reduce the instance start-up times significantly? 36 | 37 | This timing parameter can already be BOSH-configured on the rep, so this evaluation shouldn't require any code changes. 38 | 39 | 40 | Acceptance: report on the differences in evacuation recovery time with different unhealthy check intervals and different instance densities, as well as differences in other relevant system metrics (such as system load, disk I/O, and CPU utilization). 41 | 42 | 43 | L: charter, perf, net-check-action 44 | 45 | --- 46 | 47 | Evaluate the effect of a spiked-out native net-check action on evacuation time 48 | 49 | Once we can reproduce the slow evacuation times in an isolated environment, let's evaluate what effect a native net-check action has on improving them. For this spike, we need to implement only the `Port` field on the proposed `NetCheckAction`, as we're not concerned about backwards-compatibility, HTTP checks, or timeout configurability. Nsync should then also use it in its LRP generation. 50 | 51 | 52 | Acceptance: report on the differences in evacuation recovery time and system metrics when using this new action compared to the old action. 53 | 54 | L: charter, perf, net-check-action 55 | 56 | 57 | --- 58 | 59 | As a BBS client, I want to be able to specify a NetCheckAction on my LRP Monitor action and have the executor run a net-check directly instead of invoking a container-side process 60 | 61 | As described in the 'Tuning Health Checks' proposal, the BBS should support a new `NetCheckAction` action with the following fields: 62 | 63 | - `Port` 64 | - `Endpoint` 65 | - `TimeoutInMilliseconds` 66 | - `FallbackAction` 67 | 68 | See the proposal for the details about types, optionality, and intent. 69 | 70 | The BBS should also support new endpoints for tasks and LRP run-infos that are capable of returning this new action. If a client requests an LRP or Task from an old endpoint, the BBS should collapse all `NetCheckAction`s to their fallback actions.
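A rough sketch of what that collapse could look like follows, assuming simplified stand-in action types for illustration (the real BBS models and endpoint plumbing differ). The key detail is that the substitution recurses into the fallback itself, since a fallback may in principle contain another `NetCheckAction`.

```
package main

// Minimal stand-ins for the action types discussed in this story.
type Action interface{}

type NetCheckAction struct {
	Port                  uint32
	Endpoint              string
	TimeoutInMilliseconds uint32
	FallbackAction        Action
}

type ParallelAction struct{ Actions []Action }
type TimeoutAction struct{ Action Action }
type RunAction struct{ Path string }

// collapseNetChecks walks an action tree and substitutes every NetCheckAction
// with its fallback action, recursing into the fallback as well.
func collapseNetChecks(action Action) Action {
	switch a := action.(type) {
	case *NetCheckAction:
		return collapseNetChecks(a.FallbackAction)
	case *ParallelAction:
		collapsed := make([]Action, len(a.Actions))
		for i, inner := range a.Actions {
			collapsed[i] = collapseNetChecks(inner)
		}
		return &ParallelAction{Actions: collapsed}
	case *TimeoutAction:
		return &TimeoutAction{Action: collapseNetChecks(a.Action)}
	default:
		return action // leaf actions (RunAction, etc.) pass through unchanged
	}
}

func main() {
	monitor := &NetCheckAction{
		Port:           8080,
		FallbackAction: &RunAction{Path: "healthcheck"},
	}
	_ = collapseNetChecks(monitor) // an old endpoint would see only the RunAction
}
```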
71 | 72 | The executor should be updated to understand this action and to perform a port-connectivity or HTTP endpoint check on the container with its own in-process client, instead of invoking a separate process inside the container to perform this check. We do not want this action to log on every attempt. 73 | 74 | 75 | Acceptance: 76 | 77 | - I can modify vizzini to use the `NetCheckAction` and it still passes against both old and new cells as long as the BBS in the deployment is updated to understand it. 78 | 79 | 80 | L: net-check-action 81 | 82 | 83 | --- 84 | 85 | Nsync should use the native NetCheckAction in the Monitor action for LRPs with a port-based health-check 86 | 87 | 88 | Acceptance: 89 | 90 | - I can use veritas to verify that newly desired CF apps are created using the `NetCheckAction` with an appropriate fallback RunAction for compatibility with old cells. 91 | 92 | L: net-check-action 93 | 94 | --- 95 | 96 | The executor's Monitor step should log on health transitions 97 | 98 | The monitor step should log to the instance's log stream on the following events: 99 | 100 | - Starting the monitor step: "Starting health monitoring of container" 101 | - Failing the monitor step because the container does not become healthy within the N-second start timeout: "Container failed to become healthy within N seconds" 102 | - Verifying the instance is healthy: "Container became healthy" 103 | - Detecting the instance is now unhealthy after having previously been healthy: "Container became unhealthy" 104 | 105 | The log source for these log lines should be the top-level LogSource on the executor container. 106 | 107 | 108 | Acceptance: 109 | 110 | - I can observe these log lines in the log stream for CF app instances when they: 111 | - become healthy within the timeout 112 | - become healthy and then unhealthy 113 | - never become healthy within the start timeout 114 | 115 | 116 | L: logging, net-check-action 117 | 118 | -------------------------------------------------------------------------------- /proposals/tuning-health-checks.md: -------------------------------------------------------------------------------- 1 | # Tuning Health Checks 2 | 3 | From running larger and more public deployments of Diego, the Diego team has observed a few issues with the current approach to health-checking: 4 | 5 | - When starting many containers simultaneously on the same cell, invoking the port-based health-check every half-second puts the system under undue stress, and can cause containers to fail to be verified as healthy before their start timeout. This situation occurs naturally from cell evacuation during a rolling deploy of a Diego cluster with a sufficiently high density of LRP instances. 6 | - App developers have reported that the frequent health log lines are more noisy than helpful, and can prevent them from seeing application logs. This is especially true during the aggressive initial health-checking before instance health is verified, when the volume of logs can overwhelm the loggregator system's buffer of recent log messages. 7 | - Garden-Windows currently has some special one-off argument-processing logic to configure the port-based health-check correctly, which the Greenhouse team has expressed interest in removing. 
8 | 9 | In light of these points, we propose some changes to how Diego handles both the timing and the method of health-checking instances: 10 | 11 | - Cell performs native net-check action without logging 12 | - Cell reduces frequency of initial health-check 13 | - Monitor action logs only health transitions 14 | - BBS client can configure health-check timing parameters on the DesiredLRP 15 | 16 | We will explain these in more detail below. 17 | 18 | 19 | ## Native Net-Check Actions 20 | 21 | We currently perform network-based health checks of services run in containers by invoking a separate process inside the container to connect to the service's port over TCP. While appealing in its simplicity, especially in the context of the plugin model we use for the platform- and app-lifecycle-specific details of performing staging tasks and running app instances, this network task can be done far more efficiently by the executor component itself. This is particularly true with the strong support for these network operations in the Golang standard library. 22 | 23 | The `healthcheck` executable currently operates in two different modes: 24 | 25 | - checking establishment of a TCP connection to a particular port, on the first detected non-loopback IP address, 26 | - making an HTTP GET request to a particular endpoint on this address and port and checking that the response is successful and has a `200 OK` status code. 27 | 28 | A configurable timeout applies to both modes, defaulting to 1s. 29 | 30 | Consequently, we propose introducing a new executor action, `NetCheckAction`, to take on this functionality. It will support the following fields: 31 | 32 | | Field | Required? | Type | Description | 33 | |------------|-----------|------------|-------------| 34 | | `Port` | required | `uint32` | Container-side port on which to check connectivity. | 35 | | `Endpoint` | optional | `string` | If present, endpoint against which to check for a successful HTTP response. | 36 | | `TimeoutInMilliseconds` | optional | `uint32` | If present, different timeout in milliseconds to apply to the connection attempt or request. If absent, the executor instead uses a default duration for the timeout (1 second, perhaps configurable on the executor). | 37 | 38 | 39 | ### Backwards Compatibility 40 | 41 | With the introduction of this new action, we also require cells from previous stable releases to be able to run a sufficiently equivalent action in its place. Unfortunately, the only actions at our disposal are the previous `Download`, `Run`, and `Upload` actions, as well as the various combining and wrapping actions. Two possible options for backwards compatibility are as follows: 42 | 43 | - Backfill an action provided by the BBS that cannot fail, such as a `TryAction` wrapping a `RunAction` that runs `echo "backfill: no-op action"`. 44 | - Backfill an arbitrary action provided by the client, which can be customized to provide equivalent functionality for the desired NetCheckAction behavior in terms of, say, a RunAction. 45 | 46 | We prefer the second option, as it ensures that older cells can perform a real health-check on their app instances correctly. This option results in the following additional field on the NetCheckAction: 47 | 48 | | Field | Required? | Type | Description | 49 | |------------|-----------|------------|-------------| 50 | | `FallbackAction` | required | `Action` | Fallback action for older BBS clients that do not understand this new action type. 
| 51 | 52 | 53 | In any case, the presence of this new action will require either new BBS endpoints for complete DesiredLRP and Task records or a new, optional parameter to be sent in requests to existing endpoints to indicate that the client is capable of understanding this new action. The old endpoints or responses should instead substitute this fallback action for each `NetCheckAction` in the action tree. We will of course have to make sure that the BBS correctly replaces all `NetCheckAction` instances in the tree with their fallbacks, in case a client perversely uses a `NetCheckAction` within the fallback action for another `NetCheckAction`. 54 | 55 | 56 | #### Deprecation Plan for `FallbackAction` field 57 | 58 | Obviously, this `Fallback` field is a wart on this action that we would like to freeze off as soon as possible. We expect the following transition plan to deprecate and eliminate this field: 59 | 60 | - Diego `0.N`: 61 | - Release in which Diego cells understand `NetCheckAction` natively. 62 | - New BBS endpoints are introduced to handle the new action, and older BBS endpoints are deprecated. 63 | - Diego `0.X` (`X` ≥  `N`) and `1.X`: 64 | - BBS validation requires the `Fallback` field, as it must be present for deployed cells from releases prior to `0.N`. Later cells ignore it and perform the native net-check. 65 | - Diego `2.0` and later: 66 | - BBS validation does not require the `Fallback` field, and the BBS now ignores it, as no support is guaranteed for cells from a Diego `0.X` release. 67 | - Previously deprecated BBS endpoints are removed from the API. 68 | - Clients whose upgrades are coordinated with the BBS upgrades (say, the CC-Bridge) may safely drop the `Fallback` field, as they expect the BBS API to be upgraded to version 2 before they are, given Diego deployment order constraints. 69 | 70 | 71 | ### Justification of Effort 72 | 73 | Before we make these changes, we should validate that they mitigate the problematic behavior we have observed during evacuation at scale. We should first characterize this evacuation behavior generally via logs and metrics from production systems where it is observed, then independently reproduce it in an isolated environment. Once we have this control baseline, we can spike out the `NetCheckAction` to validate that it has a significant mitigation benefit before we proceed with the work to implement it in a backward-compatible, test-driven way. 74 | 75 | 76 | ### Implementation Notes 77 | 78 | We expect the implementation of this action to be straightforward. The executor already has access to the external IP and the port mappings of the container, which is all that it requires to connect to a given container-side port from outside the container. We should certainly validate that this network information is correct on Windows cells as well, but we expect it to be, as app instances on Windows cells publish this information to the routing tiers to receive external requests. 79 | 80 | The executor step corresponding to this action will also have to be appropriately cancelable. 81 | 82 | 83 | ## Reduced Frequency of Initial Health-Check 84 | 85 | We have also received feedback that the 500ms duration between health checks while the instance is initially unhealthy may be too aggressive. We intend to increase this duration to a longer 2 seconds by default, both in the executor default configuration and in the property spec for the rep job in the Diego BOSH release. 
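For reference, the in-process check outlined in the Implementation Notes above, combined with the less aggressive two-second unhealthy interval proposed here, might look roughly like the following sketch. This is an illustration only; the function names, the hard-coded intervals, and the example address are assumptions rather than the executor's actual API.

```
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// netCheck performs the two modes described for NetCheckAction: a TCP
// connectivity check against addr and, if endpoint is non-empty, an HTTP GET
// that must return 200 OK. The timeout covers the dial or the request.
func netCheck(addr, endpoint string, timeout time.Duration) error {
	if endpoint == "" {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			return err
		}
		return conn.Close()
	}
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(fmt.Sprintf("http://%s%s", addr, endpoint))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

// waitUntilHealthy retries the check on the proposed 2s unhealthy interval
// until it succeeds or the start timeout elapses.
func waitUntilHealthy(addr, endpoint string, startTimeout time.Duration) error {
	deadline := time.Now().Add(startTimeout)
	for {
		if err := netCheck(addr, endpoint, time.Second); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("container failed to become healthy within %s", startTimeout)
		}
		time.Sleep(2 * time.Second)
	}
}

func main() {
	// 10.0.0.5:60001 stands in for the host-side mapping of the container port.
	if err := waitUntilHealthy("10.0.0.5:60001", "", 60*time.Second); err != nil {
		fmt.Println(err)
	}
}
```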
86 | 87 | 88 | ## Logging Only Health Transitions 89 | 90 | In the absence of any logging from the `NetCheckAction` itself, we intend for the executor's Monitor step to provide a minimal amount of logging, and only at the following health transitions: 91 | 92 | - when it starts monitoring the instance, 93 | - when the monitor step times out before the instance is considered healthy, 94 | - when the instance becomes healthy, 95 | - when a healthy instance becomes unhealthy by failing the monitor action. 96 | 97 | We expect this should provide an adequate level of visibility into the health status of the instance without overwhelming the log stream emanating from the instance. 98 | 99 | One issue that may arise is with the LogSource to be used for this Monitor step logging. We would prefer that the current `HEALTH` log source be used for these logs as well, but it is currently set only on the `RunAction` provided as the LRP's Monitor action. There is a container-wide LogSource field, but it sets the LogSource for any logs emitted by cell or container actions unless explicitly overriden on an action, so it is unsuited for customizing the LogSource only on logs coming from the Monitor step. We may wish to introduce an optional LogSource identifier on the LRP to be used for this Monitor action logging, or with the context included in the Monitor step logs, it may suffice to use the current container-wide LogSource (currently set to `CELL` for CF app instances). 100 | 101 | 102 | ## Configurable Health-Check Timing 103 | 104 | We also realize that the default parameters of the Monitor step may not apply universally to all long-running workloads, especially if CF app developers are allowed to customize the health-check action to, say, run a custom script instead of the default port check. We will therefore expose some of these monitoring timing parameters on the DesiredLRP itself, such as: 105 | 106 | - the duration to wait before initiating health-checking (currently `0`), 107 | - the duration to wait between health checks before the instance is healthy (currently `500ms`, proposed to increase to `2s` by default), 108 | - the duration to wait between health checks after the instance is healthy (currently `30s`). If `0`, the health check is never performed after the instance is considered healthy. 109 | 110 | Each one of these new DesiredLRP properties would be optional, and if not specified, would default to the configuration of the rep (or its internal executor). Corresponding changes would also be required on the executor's Container model. 111 | 112 | As additional motivation, since some of these parameters are configurable on the rep, the Diego BBS client may wish to specify these parameters explicitly on the DesiredLRP it submits, rather than relying on configuration that may vary with the Diego deployment. 113 | 114 | -------------------------------------------------------------------------------- /proposals/versioning.md: -------------------------------------------------------------------------------- 1 | # Versioning 2 | 3 | ## BBS Semantics 4 | 5 | There are likely to be a class of updates that require a modification to the 6 | structure or semantics of the BBS. An example might be the movement of egress 7 | rules from the desired LRP to a separate part of the BBS. This change would 8 | effectively create a new set of keys for egress rules and would change the 9 | semantics of a desired LRP with the requirement to reference the egress rules. 
10 | 
11 | Changes like this would require a more involved upgrade process, such as:
12 | 
13 | 1. Code is deployed with a BBS version that understands old and new semantics
14 | 2. While running, code operates with the old BBS version semantics: it can read
15 |    the new semantics but writes the old
16 | 3. An operator (or some other entity) explicitly bumps the BBS version via an
17 |    errand
18 | 4. The BBS code in each component eventually discovers the new operating mode,
19 |    generates new data structures, and uses the new semantics
20 | 5. A migration phase moves data to the new level: the errand performs the
21 |    migration after setting the new active version
22 | 
23 | The BBS version needs to be stored in a consistent, persistent place outside of
24 | the BBS. We can't put the version directly into the BBS, since a loss of our
25 | persistent store would leave us unable to determine the appropriate semantics
26 | to use when the Diego jobs restart.
27 | 
28 | Options for where to store this version are Consul or the manifest. The problem
29 | with the manifest is that it requires a manual, manifest-only deploy.
30 | 
31 | ## Schema Models
32 | 
33 | Each model has its own version that can grow independently of the Diego release
34 | version. A bump in a model version may (and in most cases must) trigger a major
35 | bump of the Diego release version.
36 | 
37 | A new version of a model requires creating a new structure (code) in the BBS
38 | with the version number in its name. That structure, when marshalled, should
39 | write a `version` field. Unmarshalling of any JSON payload should read the
40 | version and create the appropriate object. Version 1 is implied if no version
41 | is present on the payload.
42 | 
43 | - Adding an optional field does not require bumping the version.
44 | - Removing an optional field does not require bumping the version.
45 | - Adding a model does not require bumping the version.
46 | 
47 | ## Internal APIs
48 | 
49 | APIs themselves are not explicitly versioned; we inherit the version from the
50 | schema models being passed through the requests.
51 | 
52 | Internal servers should respond with the same version that was received.
53 | 
54 | We do not preclude the addition of new endpoints and messages; if the semantics
55 | of an operation change significantly, a new endpoint is likely warranted and
56 | the (updated) client will need a mechanism to detect the new capability. (Can
57 | we use the BBS for this?)
58 | 
59 | ## Upgrades
60 | 
61 | Consider a current major version N. When you deploy version N + 1, it has the
62 | ability to read and write both the new and the old BBS data models. Prior to
63 | migration, it will always write the old data model but it can understand the
64 | new model; after migration, it can still understand the old model but it will
65 | only write the new data model. Once the migration completes, code running
66 | version N will no longer function correctly.
67 | 
68 | This results in two simple rules for operators:
69 | 
70 | 1. You **SHOULD** run the migration errand *after* deploying N + 1.
71 | 2. You **MUST** run the migration errand *before* deploying N + 2.
72 | 
73 | 
74 | ## An Alternative BBS Schema Migration Approach
75 | 
76 | 
77 | ![diagram](diego_versioning.svg)
78 | 
79 | ### Steps
80 | 
81 | 1. BOSH deploy Diego release version N + 1.
82 | 2. Run the migration errand.
83 | 
84 | ### Changes to Jobs
85 | 
86 | Jobs need a way to detect if they're running in a mode where they don't have
87 | to worry about backwards compatibility. We could model that as a flag on the
88 | command line, or we could find a way to do it in the BBS. Either of those
89 | mechanisms could work.
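As a rough illustration only, the sketch below shows one way a job might gate its BBS access on such a flag. All names are hypothetical, and the rules it encodes are the ones spelled out in the list that follows.

```go
package versioning

// Sketch only: hypothetical wiring for a job that must span the version N
// and version N + 1 BBS during a migration.

type Store interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
}

type versionedStore struct {
	oldBBS        Store // version N
	newBBS        Store // version N + 1
	migrationMode bool  // e.g. set by a command-line flag, or discovered via the BBS
}

// Get prefers the version N BBS while the migration flag is on, falling back
// to the version N + 1 BBS if that read fails.
func (s *versionedStore) Get(key string) ([]byte, error) {
	if s.migrationMode {
		if value, err := s.oldBBS.Get(key); err == nil {
			return value, nil
		}
	}
	return s.newBBS.Get(key)
}

// Set writes to both BBS versions while the migration flag is on, and only to
// version N + 1 once it is off.
func (s *versionedStore) Set(key string, value []byte) error {
	if s.migrationMode {
		if err := s.oldBBS.Set(key, value); err != nil {
			return err
		}
	}
	return s.newBBS.Set(key, value)
}
```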
90 | 
91 | With that in place:
92 | 
93 | 1. Version N + 1 jobs read from the version N BBS first, then read from the
94 |    version N + 1 BBS if that operation fails
95 | 2. Version N + 1 jobs write to both the version N and the version N + 1 BBS
96 | 3. Version N + 1 jobs read from and write to only the version N + 1 BBS once
97 |    the migration flag is turned off. The migration flag is turned off when
98 |    the version N BBS is no longer available.
99 | 
100 | 
101 | ### Migration Errand
102 | 
103 | 1. Build the version N + 1 BBS
104 | 2. Scan the version N BBS nodes for discrepancies by comparing each node's
105 |    index to the corresponding version N + 1 BBS node (version N node indices
106 |    are always greater than or equal to version N + 1 node indices)
107 | 3. Update the out-of-sync nodes in the version N + 1 BBS
108 | 4. Build a nodes-in-sync lookup table for nodes with the same node index
109 |    between version N and N + 1
110 | 5. Repeat step 2 until all nodes are in sync
111 | 6. Delete the version N BBS
112 | 
--------------------------------------------------------------------------------