├── .gitignore ├── CODEOWNERS ├── LICENSE ├── NOTICE ├── README.md ├── adr ├── 0000-use-architectural-decision-records.md ├── README.md ├── index.md └── template.md ├── notes ├── 113340231-spike-benchmarks-cf-mysql.md ├── 114130447-cf-mysql-stress-tests.md ├── 115893407-diego-and-consul-failure-modes.md ├── 117935875-cf-mysql-with-consul.md ├── 123482771-spike-influxdb-graphana.md ├── 125596231-usability-of-context.md ├── debug-app-crash.md ├── diego_deployments_with_longer_term_period.md ├── distributed-locking-and-presence.md ├── grafana-sample.json ├── logging-guidance.md └── lrp-task-states-and-transitions.md └── proposals ├── README.md ├── bbs-migration-stories.prolific.md ├── bbs-migrations.md ├── better-buildpack-caching.md ├── bind-mounting-downloads-stories.prolific.md ├── bind-mounting-downloads.md ├── bosh-deployments.graffle ├── bosh-deployments.md ├── bosh-deployments.png ├── desired_lrp_update_API.md ├── desired_lrp_update_extension.md ├── diego_versioning.svg ├── docker_registry_caching.md ├── docker_registry_configuration.md ├── faster-missing-cell-recovery.md ├── go-modules-migrations.md ├── lifecycle-design.md ├── measuring_performance.md ├── per-application-crash-configuration-stories.csv ├── per-application-crash-configuration.md ├── placement-constraints-stories.csv ├── placement_pools.md ├── private-docker-registry.md ├── relational-bbs-db-stories-2016-02-16.prolific.md ├── relational-bbs-db-stories-2016-04-18.prolific.md ├── relational-bbs-db-stories-2016-05-10.prolific.md ├── relational-bbs-db.md ├── release-versioning-testing-stories.prolific.md ├── release-versioning-testing-v2.md ├── release-versioning-testing.md ├── rolling-out-diego.md ├── routing.md ├── schema-migrations-for-locket.md ├── secure-auctioneer-cell-apis.md ├── ssh-one-time-auth-code.md ├── tuning-health-checks-stories.prolific.md ├── tuning-health-checks.md ├── updatable-lrp-api.md └── versioning.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store -------------------------------------------------------------------------------- /CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @cloudfoundry/wg-app-runtime-platform-diego-approvers 2 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015-Present CloudFoundry.org Foundation, Inc. All Rights Reserved. 2 | 3 | This project contains software that is Copyright (c) 2014-2015 Pivotal Software, Inc. 4 | 5 | Licensed under the Apache License, Version 2.0 (the "License"); 6 | you may not use this file except in compliance with the License. 7 | You may obtain a copy of the License at 8 | 9 | http://www.apache.org/licenses/LICENSE-2.0 10 | 11 | Unless required by applicable law or agreed to in writing, software 12 | distributed under the License is distributed on an "AS IS" BASIS, 13 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | See the License for the specific language governing permissions and 15 | limitations under the License. 16 | 17 | This project may include a number of subcomponents with separate 18 | copyright notices and license terms. Your use of these subcomponents 19 | is subject to the terms and conditions of each subcomponent's license, 20 | as noted in the LICENSE file. 
21 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Diego Notes 2 | 3 | This folder contains development notes on the Diego project. 4 | 5 | - The [adr](./adr) folder contains architectural decision records. See its 6 | [README.md](./adr/README.md) for more information. 7 | - The [notes](./notes) folder contains archived documentation related to a 8 | specific subject or story. The purpose of these documents is to give an idea of 9 | the initial motivation and architectural decisions related to that topic at 10 | that point in time. These documents have not been updated since the work on a 11 | story was finished and should not be treated as a source of truth for the 12 | current state of the system. 13 | - The [proposals](./proposals) folder contains documents that describe ideas and 14 | solutions that could solve a particular problem. 15 | 16 | If you would like to see documentation on a specific topic, please open an 17 | issue in this repo. 18 | -------------------------------------------------------------------------------- /adr/0000-use-architectural-decision-records.md: -------------------------------------------------------------------------------- 1 | # Use Markdown Architectural Decision Records 2 | 3 | ## Context and Problem Statement 4 | 5 | We want to record architectural decisions made in this project. 6 | Which format and structure should these records follow? 7 | 8 | ## Considered Options 9 | 10 | * [MADR](https://adr.github.io/madr/) 2.1.0 - The Markdown Architectural Decision Records 11 | * [Michael Nygard's template](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions) - The first incarnation of the term "ADR" 12 | * [Sustainable Architectural Decisions](https://www.infoq.com/articles/sustainable-architectural-design-decisions) - The Y-Statements 13 | * Other templates listed at 14 | * Formless - No conventions for file format and structure 15 | 16 | ## Decision Outcome 17 | 18 | Chosen option: "MADR 2.1.0", because 19 | 20 | * Implicit assumptions should be made explicit. 21 | Design documentation is important to enable people to understand the decisions later on. 22 | See also [A rational design process: How and why to fake it](https://doi.org/10.1109/TSE.1986.6312940). 23 | * The MADR format is lean and fits our development style. 24 | * The MADR structure is comprehensible and facilitates usage & maintenance. 25 | * The MADR project is actively maintained. 26 | * Version 2.1.0 was the latest version available when we started to document ADRs. 27 | -------------------------------------------------------------------------------- /adr/README.md: -------------------------------------------------------------------------------- 1 | # Diego Architectural Decision Records 2 | 3 | This folder contains [ADRs (Architectural Decision Records)](https://adr.github.io/) for the Diego project. 4 | 5 | ## Create a new ADR 6 | 7 | 1. Copy `template.md` to `NNNN-title-with-dashes.md`, where `NNNN` indicates the next number in sequence. 8 | 1. Edit `NNNN-title-with-dashes.md`. 9 | 1. Update `index.md`, e.g., by executing `adr-log -i -d .`. You can get [adr-log](https://github.com/adr/adr-log) by executing `npm install -g adr-log`. 
10 | -------------------------------------------------------------------------------- /adr/index.md: -------------------------------------------------------------------------------- 1 | # Architectural Decision Log 2 | 3 | This lists the architectural decisions for Diego. 4 | 5 | 6 | 7 | - [ADR-0000](0000-use-architectural-decision-records.md) - Use Markdown Architectural Decision Records 8 | 9 | 10 | 11 | [template.md](template.md) contains the template. 12 | More information on MADR is available at . 13 | -------------------------------------------------------------------------------- /adr/template.md: -------------------------------------------------------------------------------- 1 | # [short title of solved problem and solution] 2 | 3 | * Status: [accepted | superseeded by [ADR-0005](0005-example.md) | deprecated | …] 4 | * Deciders: [list everyone involved in the decision] 5 | * Date: [YYYY-MM-DD when the decision was last updated] 6 | 7 | Technical Story: [description | ticket/issue URL] 8 | 9 | ## Context and Problem Statement 10 | 11 | [Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.] 12 | 13 | ## Decision Drivers 14 | 15 | * [driver 1, e.g., a force, facing concern, …] 16 | * [driver 2, e.g., a force, facing concern, …] 17 | * … 18 | 19 | ## Considered Options 20 | 21 | * [option 1] 22 | * [option 2] 23 | * [option 3] 24 | * … 25 | 26 | ## Decision Outcome 27 | 28 | Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)]. 29 | 30 | ### Positive Consequences 31 | 32 | * [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …] 33 | * … 34 | 35 | ### Negative consequences 36 | 37 | * [e.g., compromising quality attribute, follow-up decisions required, …] 38 | * … 39 | 40 | ## Pros and Cons of the Options 41 | 42 | ### [option 1] 43 | 44 | [example | description | pointer to more information | …] 45 | 46 | * Good, because [argument a] 47 | * Good, because [argument b] 48 | * Bad, because [argument c] 49 | * … 50 | 51 | ### [option 2] 52 | 53 | [example | description | pointer to more information | …] 54 | 55 | * Good, because [argument a] 56 | * Good, because [argument b] 57 | * Bad, because [argument c] 58 | * … 59 | 60 | ### [option 3] 61 | 62 | [example | description | pointer to more information | …] 63 | 64 | * Good, because [argument a] 65 | * Good, because [argument b] 66 | * Bad, because [argument c] 67 | * … 68 | 69 | ## Links 70 | 71 | * [Link type] [Link to ADR] 72 | * … 73 | -------------------------------------------------------------------------------- /notes/113340231-spike-benchmarks-cf-mysql.md: -------------------------------------------------------------------------------- 1 | Running everything from the same VM is hard. We had to implement semaphores to 2 | get around a problem with having too many connections to the BBS in parallel. 3 | We had satisfactory results with the CF-MySQL release even though they're 4 | slightly worse than RDS. We think that once the real BBS schema is fully 5 | flushed out we can optimize certain aspects of the complexity we have with 6 | convergence, and the other bulkers. 7 | 8 | One thing to note from the experiment is that there might be artificial 9 | slowdowns because of the single VM acting as multiple components. 
We decided 10 | that for now the complexity of devising a distributed benchmarks suite is too 11 | costly to be justifiable. 12 | 13 | ## Experiment #5 (full benchmark run) 14 | 15 | ### Config: 16 | 17 | - CF-MySQL 3 nodes talking to one node directly 18 | - `num_populate_workers`: `500` 19 | - `num_reps`: `1000` 20 | - `desired_lrps`: `200000` 21 | - Semaphore around all HTTP calls on the BBS client (25,000 resources) 22 | - `MaxIdleConnsPerHost` set to 25,000 as well on the BBS client 23 | 24 | ### Results: 25 | 26 | ``` 27 | Running Suite: Benchmark BBS Suite 28 | ================================== 29 | Random Seed: 1457549669 30 | Will run 4 of 4 specs 31 | 32 | • [MEASUREMENT] 33 | Fetching for Route Emitter 34 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/route_emitter_fetch_test.go:33 35 | data for route emitter 36 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/route_emitter_fetch_test.go:32 37 | 38 | Ran 10 samples: 39 | {FetchActualLRPsAndSchedulingInfos} 40 | fetch all actualLRPs: 41 | Fastest Time: 17.671s 42 | Slowest Time: 51.122s 43 | Average Time: 37.447s ± 15.995s 44 | ------------------------------ 45 | [====================================================================================================] 10000/10000 46 | Done with all loops, waiting for queue to clear out... 47 | [====================================================================================================] 2000000/2000000 48 | • [MEASUREMENT] 49 | Fetching for rep bulk loop 50 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:114 51 | data for rep bulk 52 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:113 53 | 54 | Ran 1 samples: 55 | {RepBulkFetching} 56 | rep bulk fetch: 57 | Fastest Time: 0.006s 58 | Slowest Time: 7.160s 59 | Average Time: 0.078s ± 0.269s 60 | {RepBulkLoop} 61 | rep bulk loop: 62 | Fastest Time: 0.008s 63 | Slowest Time: 7.163s 64 | Average Time: 0.089s ± 0.299s 65 | {RepStartActualLRP} 66 | start actual LRP: 67 | Fastest Time: 0.002s 68 | Slowest Time: 120.636s 69 | Average Time: 0.163s ± 2.066s 70 | {RepClaimActualLRP} 71 | claim actual LRP: 72 | Fastest Time: 0.007s 73 | Slowest Time: 120.522s 74 | Average Time: 0.434s ± 0.896s 75 | ------------------------------ 76 | • [MEASUREMENT] 77 | Fetching for nsync bulker 78 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/nsync_bulker_fetch_test.go:29 79 | DesiredLRPs 80 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/nsync_bulker_fetch_test.go:28 81 | 82 | Ran 10 samples: 83 | {NsyncBulkerFetching} 84 | fetch all desired LRP scheduling info: 85 | Fastest Time: 14.950s 86 | Slowest Time: 15.450s 87 | Average Time: 15.127s ± 0.140s 88 | ------------------------------ 89 | • [MEASUREMENT] 90 | Gathering 91 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/lrp_convergence_gather_test.go:33 92 | data for convergence 93 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/lrp_convergence_gather_test.go:32 94 | 95 | Ran 10 samples: 96 | {ConvergenceGathering} 97 | BBS' internal gathering of LRPs: 98 | Fastest Time: 21.318s 99 | Slowest Time: 28.017s 100 | Average Time: 26.993s ± 1.904s 101 | ------------------------------ 102 | 103 | Ran 4 of 4 Specs in 3514.024 seconds 104 | SUCCESS! 
-- 4 Passed | 0 Failed | 0 Pending | 0 Skipped PASS 105 | 106 | Ginkgo ran 1 suite in 58m35.914623967s 107 | Test Suite Passed 108 | ``` 109 | 110 | ### Conclusions: 111 | 112 | We can make the HA MySQL release works assuming they solve the proxy/load 113 | balancing issues. Convergence is a little slow still, but there's hope that 114 | with a different schema we can really improve how we do convergence in general, 115 | same holds true for nsync and route-emitter. 116 | 117 | ## Experiment #4 (focused on rep-bulk cycles) 118 | 119 | ### Config: 120 | 121 | - CF-MySQL 3 nodes talking to one node directly 122 | - `num_populate_workers`: `500` 123 | - `num_reps`: `1000` 124 | - `desired_lrps`: `200000` 125 | - Semaphore around all HTTP calls on the BBS client (25,000 resources) 126 | - `MaxIdleConnsPerHost` set to 25,000 as well on the BBS client 127 | 128 | ### Results: 129 | 130 | ``` 131 | • [MEASUREMENT] 132 | Fetching for rep bulk loop 133 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:114 134 | data for rep bulk 135 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:113 136 | 137 | Ran 1 samples: 138 | {RepBulkFetching} 139 | rep bulk fetch: 140 | Fastest Time: 0.005s 141 | Slowest Time: 3.409s 142 | Average Time: 0.060s ± 0.171s 143 | {RepBulkLoop} 144 | rep bulk loop: 145 | Fastest Time: 0.008s 146 | Slowest Time: 3.413s 147 | Average Time: 0.066s ± 0.175s 148 | {RepStartActualLRP} 149 | start actual LRP: 150 | Fastest Time: 0.002s 151 | Slowest Time: 121.628s 152 | Average Time: 0.116s ± 2.025s 153 | {RepClaimActualLRP} 154 | claim actual LRP: 155 | Fastest Time: 0.007s 156 | Slowest Time: 121.213s 157 | Average Time: 0.345s ± 0.979s 158 | ------------------------------ 159 | SS 160 | Ran 1 of 4 Specs in 1820.280 seconds 161 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 3 Skipped PASS | FOCUSED 162 | ``` 163 | 164 | ### Conclusions: 165 | 166 | "Slowest" on claim/start is due to the artificial limits we have with the 167 | semaphore on the BBS client. Could be mitigated by having a distributed 168 | benchmark suite. Going forward we should probably move that semaphore up to 169 | the benchmark code, since it doesn't make too much sense for it to exist on the 170 | BBS client (that many requests in parallel is not a realistic use case). 
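For reference, the "semaphore" in these configs is just a bounded-concurrency guard around the BBS client calls. Below is a minimal, self-contained sketch of the pattern (illustrative only, not the actual benchmark-bbs code, and scaled far down from the 25,000 slots used above):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// A buffered channel acts as a counting semaphore; its capacity bounds
	// the number of in-flight "BBS requests" (25,000 in the runs above).
	sem := make(chan struct{}, 3)

	callBBS := func(id int) {
		sem <- struct{}{}        // acquire a slot, blocking once the limit is hit
		defer func() { <-sem }() // release the slot when the call returns

		time.Sleep(10 * time.Millisecond) // stand-in for an HTTP call to the BBS
		fmt.Println("finished call", id)
	}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			callBBS(id)
		}(i)
	}
	wg.Wait()
}
```

Whether this guard belongs in the BBS client or in the benchmark suite is exactly the question raised in the conclusion above.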
171 | 172 | ## Experiment #3 (focused on rep-bulk cycles) 173 | 174 | ### Config: 175 | 176 | - CF-MySQL 3 nodes talking to one node directly 177 | - `num_populate_workers`: `500` 178 | - `num_reps`: `250` 179 | - `desired_lrps`: `50000` 180 | - Semaphore around all HTTP calls on the BBS client (25,000 resources) 181 | - `MaxIdleConnsPerHost` set to 25,000 as well on the BBS client 182 | 183 | ### Results: 184 | 185 | ``` 186 | ------------------------------ 187 | • [MEASUREMENT] 188 | Fetching for rep bulk loop 189 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:114 190 | data for rep bulk 191 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:113 192 | 193 | Ran 1 samples: 194 | {RepBulkFetching} 195 | rep bulk fetch: 196 | Fastest Time: 0.005s 197 | Slowest Time: 1.093s 198 | Average Time: 0.011s ± 0.048s 199 | {RepBulkLoop} 200 | rep bulk loop: 201 | Fastest Time: 0.007s 202 | Slowest Time: 1.096s 203 | Average Time: 0.014s ± 0.049s 204 | {RepStartActualLRP} 205 | start actual LRP: 206 | Fastest Time: 0.002s 207 | Slowest Time: 120.516s 208 | Average Time: 0.157s ± 0.609s 209 | {RepClaimActualLRP} 210 | claim actual LRP: 211 | Fastest Time: 0.007s 212 | Slowest Time: 3.162s 213 | Average Time: 0.028s ± 0.087s 214 | ------------------------------ 215 | S 216 | Ran 1 of 4 Specs in 661.736 seconds 217 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 3 Skipped PASS | FOCUSED 218 | ``` 219 | 220 | Conclusions: 221 | 222 | Semaphores work, let's move on to 200,000 again. 223 | 224 | ## Experiment #2 (focused on rep-bulk cycles) 225 | 226 | ### Config: 227 | 228 | - CF-MySQL 3 nodes talking to one node directly 229 | - `num_populate_workers`: `100` 230 | - `num_reps`: `250` 231 | - `desired_lrps`: `50000` 232 | - Workpool of size 25,000 only around BBS client calls 233 | - `MaxIdleConnsPerHost` set to 25,000 as well on the BBS client 234 | 235 | ### Results: 236 | 237 | **None** 238 | 239 | ### Conclusions: 240 | 241 | We cancelled this experiment because our workpool implementation seems to have 242 | issues with large pool sizes. The tests were taking way too long to finish and 243 | we aborted them. Next experiment will replace the workpool with a semaphore. 
244 | 245 | ## Experiment #1 (focused on rep-bulk cycles) 246 | 247 | ### Config: 248 | 249 | - CF-MySQL 3 nodes talking to one node directly 250 | - `num_populate_workers`: `100` 251 | - `num_reps`: `250` 252 | - `desired_lrps`: `50000` 253 | - No workpool 254 | 255 | ### Results: 256 | 257 | ``` 258 | • [MEASUREMENT] 259 | Fetching for rep bulk loop 260 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:84 261 | data for rep bulk 262 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/rep_bulk_fetch_test.go:83 263 | 264 | Ran 1 samples: 265 | {RepBulkFetching} 266 | rep bulk fetch: 267 | Fastest Time: 0.005s 268 | Slowest Time: 1.610s 269 | Average Time: 0.016s ± 0.086s 270 | {RepBulkLoop} 271 | rep bulk loop: 272 | Fastest Time: 0.007s 273 | Slowest Time: 1.612s 274 | Average Time: 0.021s ± 0.088s 275 | {RepStartActualLRP} 276 | start actual LRP: 277 | Fastest Time: 0.003s 278 | Slowest Time: 4.093s 279 | Average Time: 0.054s ± 0.164s 280 | {RepClaimActualLRP} 281 | claim actual LRP: 282 | Fastest Time: 0.007s 283 | Slowest Time: 1.639s 284 | Average Time: 0.038s ± 0.128s 285 | ------------------------------ 286 | SSS 287 | Ran 1 of 4 Specs in 661.140 seconds 288 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 3 Skipped PASS | FOCUSED 289 | ``` 290 | 291 | ### Conclusions: 292 | 293 | Given the success running the experiment at this size, we thought 50,000/250 294 | would be a decent size to split the experiment on if we were to distribute it. 295 | We thought about running multiple instances of the benchmark-bbs errand in 296 | parallel, but that is not supported by BOSH. 297 | The alternative would be making multiple deployments with the same errand with 298 | different parameters and synchronizing the startup of those. Seems fragile. 299 | We could also try to devise an smarter orchestrated test suite that spins up 300 | tasks on a Diego deployment, but also seems hard. Trying with a workpool first. 301 | 302 | -------------------------------------------------------------------------------- /notes/114130447-cf-mysql-stress-tests.md: -------------------------------------------------------------------------------- 1 | We want to know how SQL behaves in golang when using the CF-MySQL HA release. 2 | We're also interested in knowing what behavior the current proxy (switchboard) 3 | presents in face of a cluster outage. 4 | 5 | We devised scenarios to observe those behaviors. Those are documented 6 | [here](https://github.com/luan/cf-mysql-proxy-stress-test). 7 | 8 | ## Grand Conclusion 9 | 10 | The go sql library does the right thing: it fails existing connections when the 11 | pipe is broken, and recreates them next time you try to use it. Switchboard 12 | seems to react pretty quickly to a server going down, allowing the go sql 13 | library to do its job. No major conclusions as far as next steps, which is 14 | good, looks like we don't have to do much or don't have a lot to complain about 15 | to MySQL just now. 16 | 17 | ## Scenario 1 18 | 19 | success on all cases 20 | 21 | ## Scenario 2 22 | stop non leader: success 23 | 24 | stop leader: flakes when you restart because the connection gets dropped, moves 25 | on to succeeding since switchboard starts routing them to the new leader. 
26 | 27 | ## Scenario 3 28 | stop non leader: success 29 | 30 | stop leader: 2 records lost 31 | 32 | ``` 33 | Kill the leader now and press ENTER2016/03/15 22:32:39 failed to commit Transaction: Error 1047: WSREP has not yet prepared node for application use 34 | [mysql] 2016/03/15 22:32:39 packets.go:33: unexpected EOF 35 | [mysql] 2016/03/15 22:32:39 packets.go:33: unexpected EOF 36 | [mysql] 2016/03/15 22:32:39 packets.go:124: write tcp 10.10.22.11:37713->10.10.22.11:3306: write: broken pipe 37 | [mysql] 2016/03/15 22:32:41 packets.go:33: unexpected EOF 38 | 2016/03/15 22:32:41 failed to begin transaction: driver: bad connection 39 | [mysql] 2016/03/15 22:32:41 packets.go:33: unexpected EOF 40 | 41 | 2016/03/15 22:32:50 Expected 42 | : 2799 43 | to equal 44 | : 2801 45 | ``` 46 | 47 | ## Scenario 4 48 | 49 | stop non leader: success 50 | restart non leader: success 51 | 52 | stop leader: flakes when you restart because the connection gets dropped, moves 53 | on to succeeding since switchboard starts routing them to the new leader. 54 | 55 | ## Scenario 5 56 | 57 | stop non leader: lost 8 records (with 10 maxConnections) lost 98 records (with 100 maxConnections) 58 | 59 | stop leader: lost ~1000 records 60 | restart leader: Causes the VM to be stuck in monit hell. We cannot stop/start even after reloading the monit deamon. Possible bug in ctl script? 61 | 62 | Often it drops n-1 connections and thus has that many fewer records. ex 10, 20 100 63 | At 500, only 495 failed. it's great! When the db dies, the the error 64 | 65 | ``` 66 | 2016/03/15 18:26:12 failed write: Error 1213: Deadlock found when trying to get lock; try restarting transaction 67 | ``` 68 | 69 | is given. It doesn't matter which db is killed, the result is the same. Note 70 | that this is consistent with the results from scenario 3, because =1, thus 71 | n-1=0=number of errors that occured. 72 | 73 | At 1000 connections, there is the following error: 74 | 75 | ``` 76 | [mysql] 2016/03/15 18:28:36 statement.go:27: invalid connection 77 | [mysql] 2016/03/15 18:28:36 packets.go:33: unexpected EOF 78 | ``` 79 | 80 | ## Scenario 6 81 | 82 | same as above 83 | 84 | -------------------------------------------------------------------------------- /notes/115893407-diego-and-consul-failure-modes.md: -------------------------------------------------------------------------------- 1 | # Diego and Consul Failure Modes 2 | 3 | ## Presence (Cell) 4 | 5 | ### Experiment 1 6 | In this experiment, we performed a rolling deploy of a 3 node 7 | consul cluster. We took the consul nodes down one-by-one and noticed that once 8 | we are in the middle of doing leader election, reps stop functioning. 9 | 10 | We also noted that leader election may occasionally take longer than usual in 11 | which case the system took even longer to get back to normal.A 12 | 13 | ## Stories to Create 14 | 15 | 1. If rep cannot set Presence do not indicate it is ready [#118500389](https://www.pivotaltracker.com/story/show/118500389) 16 | 1. It is possible to set duplicate presence similar to BBS locket error seen in 17 | [#117967437](https://www.pivotaltracker.com/story/show/117967437), [#118957221](https://www.pivotaltracker.com/story/show/118957221) 18 | 19 | 20 | ## Locks (route_emitter, bbs, auctioneer, tps-watcher, nsync-bulker, converger) 21 | 22 | ### Experiment 1 23 | In this experiment, we performed a rolling deploy of a 3 node consul cluster. 
24 | We took the consul nodes down one-by-one and noticed that once we are in the 25 | middle of doing leader election, the component currently holding the lock lost 26 | the lock. The components then attempt to reacquire the lock and during the 27 | leader election they are not able to but will retry correctly and eventually 28 | once consul has a new leader the lock can be acquired by one of the instances. 29 | 30 | Impact is that the leaders may change and components will be down until the 31 | leader election completes. 32 | 33 | ## Stories to Create 34 | 35 | 1. Losing the lock during migrations of the BBS does not cause the migration to 36 | exit immediately [#118957301](https://www.pivotaltracker.com/story/show/118957301) 37 | 38 | ## Service registration 39 | 40 | ### Experiment 1 41 | In this experiment, we performed a rolling deploy of a 3 node 42 | consul cluster. We took the consul nodes down one-by-one and noticed that once 43 | we are in the middle of doing leader election the service registrations seem to 44 | persist but are not available during the leader election. 45 | 46 | Once a new leader is elected we can see the service registrations again. 47 | 48 | ## Stories to Create 49 | 50 | 1. Service registration should reregister if consul drops the service 51 | registration [#118957335](https://www.pivotaltracker.com/story/show/118957335) 52 | -------------------------------------------------------------------------------- /notes/117935875-cf-mysql-with-consul.md: -------------------------------------------------------------------------------- 1 | We want to know how the CF-MySQL release can take advantage of consul for 2 | service discovery. 3 | 4 | The release ships with a proxy job that can be horizontally scaled for high 5 | availability, their responsibility is to chose one "master" node from the 6 | Galera cluster and only connect clients to that one until it's no longer 7 | available. 8 | 9 | There is no way to make the proxy HA, though. The CF-MySQL team suggests using 10 | a Load Balancer like Amazon ELB or HAProxy but cautions against it since load 11 | balancing across multiple proxies might cause clients to connect to different 12 | "master" MySQL nodes. That is a problem because it can cause "deadlock" errors 13 | when writing consistently to multiple masters on Galera. 14 | 15 | We first considered removing the proxies entirely and relying on consul 16 | directly on the MySQL nodes. However, if a MySQL node becomes unavailable, 17 | consul will only prevent new connections to the dead node. There's no way for 18 | the consul agent to close existing client connections like the proxies do. 19 | Another problem is when shutting down the master MySQL node the MySQL server 20 | can get into an irrecoverable state that causes established connections from 21 | the clients to hang forever. This seems to be a known issue with tcp 22 | connections that never really got addressed, see 23 | [this](http://www.evanjones.ca/tcp-stuck-connection-mystery.html) for more 24 | details. 25 | 26 | So we experimented with putting consul agents on the proxies instead of the 27 | consul nodes so that only one proxy is being used at a time. It seems that a 28 | single proxy node will consistently connect to one MySQL node (as long as that 29 | node is up), so we shouldn't run into deadlocks. 30 | 31 | The first few experiments were not properly tracked so we're starting from the 32 | middle here. 
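Several of the experiments below run the proxies with consul's lock functionality so that only one proxy is active at a time. For orientation, here is a rough sketch of taking such a lock with the `hashicorp/consul/api` client; the key name and TTL are made up for illustration, and the real proxy/consul wiring is not shown in these notes:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig()) // talks to the local consul agent
	if err != nil {
		log.Fatal(err)
	}

	// Key and TTL are illustrative; whichever proxy holds this lock is "active".
	lock, err := client.LockOpts(&api.LockOptions{
		Key:        "cf-mysql/proxy-leader",
		SessionTTL: "15s",
	})
	if err != nil {
		log.Fatal(err)
	}

	stopCh := make(chan struct{})
	lostCh, err := lock.Lock(stopCh) // blocks until this process holds the lock
	if err != nil {
		log.Fatal(err)
	}
	defer lock.Unlock()

	log.Println("this proxy is now the active one")
	<-lostCh // closes if the session expires or the lock is otherwise lost
	log.Println("lock lost; stop accepting connections")
}
```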
33 | 34 | ## Experiment 1 35 | 36 | ### Config 37 | 38 | * 2 MySQL Nodes 39 | * 2 Proxies (1 with consul agent running) 40 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 41 | * Patched BBS with a retry logic on Deadlock errors 42 | 43 | We stopped the master MySQL during seeding roughly after 20k insertions and 44 | restarted it after 30k. 45 | 46 | The results seem to indicate that this is a viable solution. There were some 47 | deadlock errors when bringing the MySQL node back online but the retry logic 48 | spiked on the BBS mitigated that. 49 | 50 | ### Results 51 | 52 | ```ruby 53 | • [MEASUREMENT] 54 | main benchmark test 55 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_test.go:210 56 | data for benchamrks 57 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_test.go:209 58 | 59 | Ran 1 samples: 60 | map[MetricName:RepBulkFetching] 61 | rep bulk fetch: 62 | Fastest Time: 0.008s 63 | Slowest Time: 1.831s 64 | Average Time: 0.087s ± 0.159s 65 | map[MetricName:RepBulkLoop] 66 | rep bulk loop: 67 | Fastest Time: 0.014s 68 | Slowest Time: 1.837s 69 | Average Time: 0.099s ± 0.161s 70 | map[MetricName:RepStartActualLRP] 71 | start actual LRP: 72 | Fastest Time: 0.010s 73 | Slowest Time: 30.599s 74 | Average Time: 0.292s ± 1.438s 75 | map[MetricName:RepClaimActualLRP] 76 | claim actual LRP: 77 | Fastest Time: 0.018s 78 | Slowest Time: 1.738s 79 | Average Time: 0.118s ± 0.133s 80 | map[MetricName:FetchActualLRPsAndSchedulingInfos] 81 | fetch all actualLRPs: 82 | Fastest Time: 5.633s 83 | Slowest Time: 5.633s 84 | Average Time: 5.633s ± 0.000s 85 | map[MetricName:ConvergenceGathering] 86 | BBS internal gathering of LRPs: 87 | Fastest Time: 6.244s 88 | Slowest Time: 6.244s 89 | Average Time: 6.244s ± 0.000s 90 | map[MetricName:NsyncBulkerFetching] 91 | fetch all desired LRP scheduling info: 92 | Fastest Time: 2.002s 93 | Slowest Time: 2.002s 94 | Average Time: 2.002s ± 0.000s 95 | ------------------------------ 96 | 97 | Ran 1 of 1 Specs in 627.672 seconds 98 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped 99 | 100 | Ginkgo ran 1 suite in 10m31.388780701s 101 | ``` 102 | 103 | ## Experiment 2 104 | 105 | ### Config 106 | 107 | * 2 MySQL Nodes 108 | * 2 Proxies (using the consul lock functionality) 109 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 110 | * Patched BBS with a retry logic on Deadlock errors 111 | 112 | We stopped the master Proxy during seeding roughly after 20k insertions and 113 | restarted it after 30k. 114 | 115 | The results seem to indicate that this is a viable solution. There were some 116 | deadlock errors when bringing the MySQL node back online but the retry logic 117 | spiked on the BBS mitigated that. 
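The deadlock retry patch itself is not included in these notes; the following is a minimal sketch of what retrying on Galera deadlock and WSREP errors can look like with the `go-sql-driver/mysql` driver (helper and package names are ours, not the actual BBS change):

```go
// Illustrative helpers only, not the actual BBS patch.
package bbsretry

import (
	"database/sql"

	"github.com/go-sql-driver/mysql"
)

// retryable reports whether err is one of the transient Galera errors seen in
// these experiments: 1213 (deadlock) or 1047 (WSREP not yet prepared).
func retryable(err error) bool {
	if mysqlErr, ok := err.(*mysql.MySQLError); ok {
		return mysqlErr.Number == 1213 || mysqlErr.Number == 1047
	}
	return false
}

// inTxWithRetries runs fn in a fresh transaction, retrying up to attempts
// times when the transaction fails with a retryable error.
func inTxWithRetries(db *sql.DB, attempts int, fn func(*sql.Tx) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		var tx *sql.Tx
		tx, err = db.Begin()
		if err != nil {
			if retryable(err) {
				continue
			}
			return err
		}

		if err = fn(tx); err == nil {
			err = tx.Commit()
		}
		if err == nil {
			return nil
		}

		tx.Rollback()
		if !retryable(err) {
			return err
		}
	}
	return err
}
```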
118 | 119 | ### Results 120 | 121 | ```ruby 122 | • [MEASUREMENT] 123 | main benchmark test 124 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_test.go:210 125 | data for benchamrks 126 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_test.go:209 127 | 128 | Ran 1 samples: 129 | map[MetricName:RepBulkFetching] 130 | rep bulk fetch: 131 | Fastest Time: 0.007s 132 | Slowest Time: 1.772s 133 | Average Time: 0.084s ± 0.160s 134 | map[MetricName:RepStartActualLRP] 135 | start actual LRP: 136 | Fastest Time: 0.009s 137 | Slowest Time: 4.640s 138 | Average Time: 0.168s ± 0.247s 139 | map[MetricName:RepBulkLoop] 140 | rep bulk loop: 141 | Fastest Time: 0.014s 142 | Slowest Time: 1.779s 143 | Average Time: 0.097s ± 0.163s 144 | map[MetricName:RepClaimActualLRP] 145 | claim actual LRP: 146 | Fastest Time: 0.017s 147 | Slowest Time: 0.456s 148 | Average Time: 0.100s ± 0.070s 149 | map[MetricName:NsyncBulkerFetching] 150 | fetch all desired LRP scheduling info: 151 | Fastest Time: 2.940s 152 | Slowest Time: 2.940s 153 | Average Time: 2.940s ± 0.000s 154 | map[MetricName:ConvergenceGathering] 155 | BBS' internal gathering of LRPs: 156 | Fastest Time: 5.598s 157 | Slowest Time: 5.598s 158 | Average Time: 5.598s ± 0.000s 159 | map[MetricName:FetchActualLRPsAndSchedulingInfos] 160 | fetch all actualLRPs: 161 | Fastest Time: 2.627s 162 | Slowest Time: 2.627s 163 | Average Time: 2.627s ± 0.000s 164 | ------------------------------ 165 | 166 | Ran 1 of 1 Specs in 712.872 seconds 167 | SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped 168 | 169 | Ginkgo ran 1 suite in 11m56.614700533s 170 | Test Suite Passed 171 | ``` 172 | 173 | ## Experiment 3 174 | 175 | ### Config 176 | 177 | * 2 MySQL Nodes 178 | * 2 Proxies (using the consul lock functionality) 179 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 180 | * Patched BBS with a retry logic on Deadlock errors 181 | 182 | We stopped the master MySQL node during seeding roughly after 20k insertions and 183 | restarted it after 30k. 184 | 185 | The Proxy successfully transferred all of the active connections to the available MySQL 186 | node immediately after bringing the master node down. 187 | 188 | However, this time we watched the status of the MySQL node on restart, 189 | and noticed that the MySQL node was unable to start back up until there was no 190 | load on the system. Every time that the MySQL node failed to start, it would 191 | stop/start the internal mysqld server which would result in a decent number of 192 | deadlock errors. 
193 | 194 | ### Results 195 | 196 | ```ruby 197 | Failure [592.772 seconds] 198 | [BeforeSuite] BeforeSuite 199 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 200 | 201 | Expected error: 202 | <*errors.errorString | 0xc825ad59c0>: { 203 | s: "Error rate of 0.060 for actuals exceeds tolerance of 0.050", 204 | } 205 | Error rate of 0.060 for actuals exceeds tolerance of 0.050 206 | not to have occurred 207 | 208 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:244 209 | ------------------------------ 210 | Failure [592.781 seconds] 211 | [BeforeSuite] BeforeSuite 212 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 213 | 214 | BeforeSuite on Node 1 failed 215 | 216 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 217 | ------------------------------ 218 | Failure [592.775 seconds] 219 | [BeforeSuite] BeforeSuite 220 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 221 | 222 | BeforeSuite on Node 1 failed 223 | 224 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 225 | ------------------------------ 226 | Failure [592.782 seconds] 227 | [BeforeSuite] BeforeSuite 228 | /var/vcap/packages/benchmark-bbs/src/github.com/cloudfoundry-incubator/benchmark-bbs/benchmark_bbs_suite_test.go:298 229 | ``` 230 | 231 | ## Experiment 4 232 | 233 | ### Config 234 | 235 | * 2 MySQL Nodes 236 | * 2 Proxies (using the consul lock functionality) 237 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 238 | * Patched BBS with a retry logic on Deadlock errors 239 | 240 | We did a rolling deploy during the seeding phase. 241 | 242 | ### Results 243 | 244 | The rolling deploy never finished. A mysql node is unable to restart during the seeding phase. 245 | 246 | ## Experiment 5 247 | 248 | ### Config 249 | 250 | * 2 MySQL Nodes 251 | * 2 Proxies (using the consul lock functionality) 252 | * 1 Arbitrator (a fake MySQL node that serves as a tie breaker) 253 | * Patched BBS with a retry logic on Deadlock errors 254 | 255 | We did a rolling deploy during the measurement phase with 50k lrps and 8% writes. 256 | 257 | ### Results 258 | 259 | It worked. 260 | -------------------------------------------------------------------------------- /notes/123482771-spike-influxdb-graphana.md: -------------------------------------------------------------------------------- 1 | A version of the influxdb-firehose-nozzle that works with current loggregator 2 | can be found at: https://github.com/luan/influxdb-firehose-nozzle. That fork 3 | updates loggregator libraries and code with most recent datadog-firehose-nozzle 4 | 5 | The nozzle could be made into a bosh release or deployed as a CF app, problem 6 | with CF app is that it can interfere with performance tests. 7 | 8 | Grafana (http://grafana.org) can be either deployed with bosh 9 | (https://github.com/vito/grafana-boshrelease) or ran locally and pointed to a 10 | remote InfluxDB. 11 | 12 | InfluxDB can be deployed via bosh 13 | (https://github.com/vito/influxdb-boshrelease) its configuration is really 14 | minimal and should straightforward. 15 | 16 | There is sample dashboard for Grafana in the same directory as this document as 17 | `grafana-sample.json`. 
18 | -------------------------------------------------------------------------------- /notes/125596231-usability-of-context.md: -------------------------------------------------------------------------------- 1 | ## Story 2 | See [story](https://www.pivotaltracker.com/story/show/125596231) 3 | 4 | ## Background 5 | #### Advantage of `context.Context` 6 | 7 | 1. It is an accepted, standard way of doing things. 8 | 1. We force the user to provide a loggregator, when they may not want anything 9 | logged. With context, the user doesn't have to pass loggregator and can just not 10 | specify a value on the `context.Context` object. 11 | 1. Concurrency issues can be handled using `context.Context`. Currently, we use `ifrit` to handle concurrency. `context.Context` can be used instead. 12 | 13 | #### Disadvantage of `context.Context` 14 | 15 | 1. Implicit arguments. We can place request-scoped variables in the `context.Context` object and retrieve them with `ctx.Value("key")`. It might seem odd to have something passed implicitly 16 | instead of just saying that this function uses this argument. 17 | However, according to [this guy] 18 | (https://medium.com/@cep21/how-to-correctly-use-context-context-in-go-1-7-8f2c0fafdf39#.9o93w6x2m), 19 | using `context.Context` for logging is an acceptable paradigm, as the presence/lack 20 | of a logger should not affect our program flow. The obvious candidates for placing in `context.Context` objects are: 21 | * logger 22 | * metadata objects that are not necessary for program flow, such as the `checksum` variable on the `cacheddownloader` 23 | 1. Type-safety is a possible concern for passing varibles in `context.Context` object. 24 | 1. Need to update to Go 1.7 if we want to use `Context` from the standard library. 25 | Otherwise, we can just import from here `golang.org/x/net/context` pre-1.7. 26 | 27 | ## Benefits of Context in Diego Release 28 | 29 | ### 1. Cross Cutting Concerns 30 | Context is useful when we want to pass aspects that are not defining the behavior of the code. This includes cross-cutting 31 | concerns such as logging and security checks. Also meta-data related to the behavior of the an object can be passed in 32 | through context. The benefit is that, metadata is not directly affecting the behavior of the system and over time may grow 33 | or shrink. Passing metadata through context allows the code signature to stay the same and potentially for the complexity of the code to be reduced. 34 | 35 | #### Logging 36 | One of the immediate advantages of using context, is to allow for passing the `logger` object through to the handlers, without having the object be a parameter for every endpoint. 
37 | 38 | There are two options for consideration: 39 | 40 | *Option 1:* 41 | 42 | This can be easily achieved by modifying the `middleware` in `bbs` where the following change is possible: 43 | 44 | ```go 45 | func LogWrap(logger lager.Logger, loggableHandlerFunc LoggableHandlerFunc) http.HandlerFunc { 46 | return func(w http.ResponseWriter, r *http.Request) { 47 | requestLog := logger.Session("request", lager.Data{ 48 | "method": r.Method, 49 | "request": r.URL.String(), 50 | }) 51 | 52 | // two new lines below 53 | ctx := context.WithValue(req.Context(), "logger", &requestLog) 54 | r = r.WithContext(ctx) 55 | 56 | requestLog.Debug("serving") 57 | loggableHandlerFunc(requestLog, w, r) 58 | requestLog.Debug("done") 59 | } 60 | } 61 | ``` 62 | 63 | And then in the handler, the logger can be used like the following: 64 | 65 | ```go 66 | // the logger is removed from the function signature below 67 | func (h *PingHandler) Ping(w http.ResponseWriter, req *http.Request) { 68 | // the logger is read from the context of the Request 69 | // passing it with pointer, allows for the session information to persist 70 | logger := req.Context().Value("logger").(*lager.Logger) 71 | response := &models.PingResponse{} 72 | response.Available = true 73 | writeResponse(w, response) 74 | } 75 | ``` 76 | 77 | Similar change can be done to `auctioneer`'s `handlers.go` as well. Alternatively the logger can be taken out of `LogWrap` and context to be passed in. 78 | 79 | *Option 2:* 80 | 81 | We can modify each handler to have a `context` object passed to it as a first argument. This will be in line with the pattern Google follows as [discussed here](https://news.ycombinator.com/item?id=8103128) 82 | 83 | Overall, the code change involves passing a Context object instead of a 84 | `loggregator` object around, and it doesn't seem like a difficult change, just a 85 | tedious one. 86 | 87 | In case of the above handler the change would be as follows: 88 | ```go 89 | // the logger is removed from the function signature below 90 | func (h *PingHandler) Ping(ctx context.Context, w http.ResponseWriter, req *http.Request) { 91 | // the logger is read from the context of the Request 92 | // passing it with pointer, allows for the session information to persist 93 | logger := ctx.Value("logger").(*lager.Logger) 94 | response := &models.PingResponse{} 95 | response.Available = true 96 | writeResponse(w, response) 97 | } 98 | ``` 99 | 100 | *Option 3:* 101 | 102 | Rather than including the `logger` object into the `context`, we can alternatively pass the session information and create a new logger object in each handler. On the negative side, this would result in having multiple `logger` objects created in different functions throughout the flow of code execution, which may not necessarily be desirable. On the positive side, passing session names around results in having immutable objects stored in the `context` which is potentially a better practice. 103 | 104 | **Suggestion**: We advocate for the second option, because it prevents unnecessary objects getting created, and also offers a cleaner implementation. Reconstructing the logging object seems to be unnecessary and of little value. 105 | 106 | This is however a fairly big change the signature of functions in `bbs` and `auctioneer`. The following repos will be affected" 107 | 108 | 1. `bbs` (all the handlers need to be changed to have `Context` and not loggregator 109 | and related test code. 
Our main function will create a `context.Background` and 110 | we can get rid of the middleware.`LogWrap` function.) 111 | 112 | 1. `auctioneer` (same as `bbs`) 113 | 114 | These need to be changed because they use the `bbsClient`. 115 | 116 | 1. `benchmarkbbs` 117 | 1. `cfdot` 118 | 1. `route-emitter` 119 | 1. `diego-ssh` 120 | 1. `inigo` 121 | 1. `rep` 122 | 1. `vizzini` 123 | 124 | ### 2. Concurrency 125 | 126 | Functionality offered by `Context` can also be used to simplify some of the logic when dealing with concurrency. We explored a few areas in the code where `Context` can be helpful with simplifying the code. 127 | 128 | _For an example, consider the scenario below:_ 129 | 130 | In the `executor`, the `StepRunner` and all the associated steps from the 131 | `Step` interface can be simplified by having the `Perform` function 132 | take a `context.Context` as a variable. This in turn, would allow us 133 | to get rid of the `Cancel` function, because signalling a cancel can be 134 | handled by using the context. The same context, can also be used to push 135 | the logger into each of the steps, hence simplifying the structs that we 136 | use to define each step. 137 | 138 | To prove this, we spiked on modifying the `timeoutStep` to take advantage of the 139 | new model, and the code changed as follows: 140 | 141 | The struct changed from: 142 | 143 | ```go 144 | type timeoutStep struct { 145 | substep Step 146 | timeout time.Duration 147 | cancelChan chan struct{} 148 | logger lager.Logger 149 | } 150 | ``` 151 | 152 | to the following: 153 | 154 | ```go 155 | type timeoutStep struct { 156 | substep Step 157 | timeout time.Duration 158 | } 159 | ``` 160 | 161 | Also the `Perform` function was modified from the following: 162 | ```go 163 | func (step *timeoutStep) Perform() error { 164 | resultChan := make(chan error, 1) 165 | timer := time.NewTimer(step.timeout) 166 | defer timer.Stop() 167 | 168 | go func() { 169 | resultChan <- step.substep.Perform() 170 | }() 171 | 172 | for { 173 | select { 174 | case err := <-resultChan: 175 | return err 176 | 177 | case <-timer.C: 178 | step.logger.Error("timed-out", nil) 179 | 180 | step.substep.Cancel() 181 | 182 | err := <-resultChan 183 | return NewEmittableError(err, emittableMessage(step.timeout, err)) 184 | } 185 | } 186 | } 187 | 188 | func (step *timeoutStep) Cancel() { 189 | step.substep.Cancel() 190 | } 191 | ``` 192 | 193 | to the simplified version below: 194 | 195 | ```go 196 | func (step *timeoutStep) Perform(ctxt context.Context) error { 197 | resultChan := make(chan error, 1) 198 | 199 | ctxt, _ := context.WithTimeout(ctxt, step.timeout) 200 | 201 | go func() { 202 | resultChan <- step.substep.Perform(ctxt) 203 | }() 204 | 205 | for { 206 | select { 207 | case err := <-resultChan: 208 | return err 209 | 210 | case <-ctxt.Done(): 211 | err := <-resultChan 212 | return NewEmittableError(err, emittableMessage(step.timeout, err)) 213 | } 214 | } 215 | } 216 | 217 | ``` 218 | 219 | We no longer have a logger in the struct as it is retrieved by the 220 | `context`, nor a `cancelChan` since a context provides us with a mechanism 221 | for canceling by using `context.WithCancel`. Instead, we would pass a context object to the `Perform` function of 222 | `timeoutStep`. We also don't have to create timers, since `context` 223 | has support for `Timeout`. 
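To make the cancellation behavior concrete, here is a tiny standalone example (not Diego code) showing that cancelling a parent context is observed by any work using a derived or shared context, which is the same mechanism the `timeoutStep` above relies on and that the next paragraph describes for substeps:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// substep blocks until it either finishes its "work" or its context is done.
func substep(ctx context.Context) error {
	select {
	case <-time.After(time.Second): // stand-in for real work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	errs := make(chan error, 1)
	go func() { errs <- substep(ctx) }()

	cancel() // cancelling the parent is seen by everything using ctx
	fmt.Println("substep returned:", <-errs) // context.Canceled
}
```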
224 | 225 | More interestingly, `context` gets passed to the substeps, so since these `context` 226 | is shared, canceling the parent `context` will propagate through and to the children, 227 | which exempts us from having to write code below and make sure we call it on the child steps. 228 | 229 | ```go 230 | func (step *timeoutStep) Cancel() { 231 | step.substep.Cancel() 232 | } 233 | ``` 234 | 235 | The above is only one possibility for how `context` can be used in `diego release`. Our investigation only looked at a few places including the `StepRunner` and `CachedDownloader`. In case we decide that `context` is useful, we can dig deeper and look into other places where `context` can be helpful. 236 | 237 | ## Conclusion 238 | 239 | There are clear benefits in using `context` and achieving simplicity in the code. However the cost of refactoring could be rather expensive. For example, adding `logger` to the code is a fairly trivial change that can be easily achieved. On the other hand, using `context` to achieve concurrency allows for better code but at the cost of more significant change to the code base. 240 | -------------------------------------------------------------------------------- /notes/debug-app-crash.md: -------------------------------------------------------------------------------- 1 | # Understanding App Crashes in diego 2 | 3 | ## Overview 4 | When an App (ActualLRP) crashes, Rep will notify BBS of the event using 5 | `/v1/actual_lrps/crash` endpoint. In turn, BBS will generate an 6 | [ActualLRPCrashedEvent](https://github.com/cloudfoundry/bbs/blob/master/doc/events.md#actuallrpcrashedevent) that is then consumed by [tps](https://github.com/cloudfoundry/tps) to record and notify the clients. 7 | 8 | ## Logs 9 | - Start by looking for a `bbs-start-actual-lrp` log in Rep for the application 10 | in question. This will tell us when the instance was started. 11 | ```json 12 | {"timestamp":"some-time","level":"info","source":"rep","message":"rep.executing-container-operation.ordinary-lrp-processor.process-running-container.bbs-start-actual-lrp","data":....} 13 | ``` 14 | - The next thing to look for would be `run-step.process-exit` log line for the 15 | application. This will tell us when the process exited the container. 16 | ```json 17 | {"timestamp":"some-time","level":"info","source":"rep","message":"rep.executing-container-operation.ordinary-lrp-processor.process-reserved-container.run-container.containerstore-run.node-run.action.run-step.process-exit","data":{"cancelled":true,...}} 18 | ``` 19 | - If `log_level: debug` is set for Rep, we would even see when the rep called 20 | the crash endpoint 21 | ```json 22 | {"timestamp":"some-time","level":"debug","source":"rep", 23 | "message":"rep.executing-container-operation.ordinary-lrp-processor.process-c 24 | ompleted-container.do-request.complete","data":{.... ,"request_path":"/v1/actual_lrps/crash","session":"61.1.1.1"}} 25 | ``` 26 | - Next, we would want to look at the BBS logs and look for `crash-actual-lrp`. This will let us know when the BBS received notification and when it generated the event. 
27 | ```json 28 | {"timestamp":"some-time","level":"info","source":"bbs","message":"bbs.request.crash-actual-lrp.db-crash-actual-lrp.starting","data":{"crash_reason":"APP/PROC/WEB: Exited with status 2",....}} 29 | {"timestamp":"some-time","level":"info","source":"bbs","message":"bbs.request.crash-actual-lrp.db-crash-actual-lrp.complete","data":{"crash_reason":"APP/PROC/WEB: Exited with status 2",...}} 30 | ``` 31 | - `route_emitter` is one of those consumers. The following would be captured 32 | by route-emitter when crash has happened. 33 | ```json 34 | {"timestamp":"some-time","level":"error","source":"route-emitter","message":"route-emitter.watcher.handling-event.did-not-handle-unrecognizable-event","data":{"error":"unrecognizable-event","event-type":"actual_lrp_crashed","session":"8.123"}} 35 | {"timestamp":"some-time","level":"debug","source":"route-emitter","message":"route-emitter.watcher.handling-event.received-actual-lrp-instance-removed-event","data":{}} 36 | {"timestamp":"some-time","level":"info","source":"route-emitter","message":"route-emitter.watcher.handling-event.removed","data":{}} 37 | {"timestamp":"some-time","level":"debug","source":"route-emitter","message":"route-emitter.watcher.handling-event.emit-messages","data":{}} 38 | {"timestamp":"some-time","level":"debug","source":"route-emitter","message":"route-emitter.nats-emitter.emit","data":{}} 39 | ``` 40 | 41 | - tps-watcher is the another component in line for retrieving the 42 | [ActualLRPCrashedEvent](https://github.com/cloudfoundry/bbs/blob/master/doc/events.md#actuallrpcrashedevent) and notifying CAPI. 43 | ```json 44 | {"timestamp":"some-time","level":"info","source":"tps-watcher","message":"tps-watcher.watcher.app-crashed","data":{}} 45 | {"timestamp":"some-time","level":"info","source":"tps-watcher","message":"tps-watcher.watcher.recording-app-crashed","data":{}} 46 | ``` 47 | -------------------------------------------------------------------------------- /notes/distributed-locking-and-presence.md: -------------------------------------------------------------------------------- 1 | # Distributed Locking/Presence with Consul 2 | 3 | Diego defers to Consul/etcd for the distributed systems magic. We have two basic needs that are continuing to give us trouble. These are both very different and we should pick the best tool/abstraction for the job. 4 | 5 | ## Distributed Locking 6 | 7 | We have a handful of processes that are actually meant to be singletons: `nsync-bulker`, `tps-watcher`, `converger`, `auctioneer`, and `route-emitter`. 8 | 9 | Some terms: 10 | 11 | - **worker** will refer to a process that is doing work. 12 | - **slumberer** will refer to a process that is not doing work. **slumberers** wait to become **workers** in the event of failure. 13 | - *immediately* means "ASAP" or "in under a second" 14 | - *eventually* means within some predefined threshold. Ideal targets here are 5-10 seconds. 15 | 16 | The desired behavior here is: 17 | 18 | 1. Given a set of processes, at most one may be the **worker** 19 | 2. If the **worker** shuts down cleanly (`SIGINT`, `SIGTERM`, `monit stop`, etc..) exactly one **slumberer** should become the next **worker** *immediately*. 20 | 3. If the **worker** dies hard (`SIGKILL`, VM explodes, etc...) exactly one **slumberer** should become the next **worker** *eventually*. 21 | 4. If a **worker** loses confidence that it is the *only* **worker**, the **worker** should shut down. 22 | 23 | #### Where we are today: 24 | 25 | 1 and 2 are working. 
3 is taking ~30s despite TTLs of 10s having been explicitly specified. Onsi has not reproduced 4 yet. 26 | 27 | ## Presence 28 | 29 | Cells maintain their presence in a centralized store. This serves two purposes: 30 | 1. the converger uses this information to determine whether the Cell is still running. If it is not, then applications associated with that Cell are rescheduled. 31 | 2. the auctioneer uses this information to discover and communicate with the Cell. 32 | 33 | The desired behavior here is: 34 | 35 | 1. When a Cell dies hard (`SIGKILL`, VM explodes, etc...): 36 | a) the converger should be notified within 5-10 seconds. i.e. the converger should not wait until a convergence loop to act. If it is notified that a cell has disappeared it should act immediately. 37 | b) when the convergence loop runs the Cell should not be present 38 | c) when the auctioneer schedules work it should not attempt to talk to the (now missing) Cell 39 | 2. When a Cell is shut down cleanly (`SIGTERM`, `monit stop`, rolling deploy): 40 | a) the converger should not be notified immediately. There is no need to trigger a convergence loop. 41 | b) when the convergence loop runs the Cell should not be present 42 | c) when the auctioneer schedules work it should not attempt to talk to the (now missing) Cell 43 | 44 | #### Where we are today: 45 | 46 | 2b and 2c are working. 2a does not work as specified (a convergence loop is triggered when the cell presence is taken away) -- this is OK if not ideal. 47 | 48 | 1a, 1b, and 1c are not working at all. This is a substantial regression and we should add inigo coverage to make sure we have this important edge case covered. 49 | 50 | ## Current implementation of locks using Consul API 51 | 52 | Distributed Locking and Presence are using the same set of APIs. The Consul Lock API is used to acquire and release some piece of data. In the case of a lock, the data is meaningless. For presence, any useful presence information is stored. 53 | 54 | At the center of Consul is the concept of a Session. A session some important characteristics including a TTL, LockDelay, Checks, and a Behavior. 55 | If a session expires, exceeds its TTL, it is invalidated. 56 | The Behavior describes how data owned by the session is handled when invalidated. It can be released (default) or deleted. 57 | The LockDelay is used to mark certain locks as unacquirable. When a lock is forcefully released (failing health check, destroyed session, etc), it is subject to the LockDelay impossed by the session. This prevents another session from acquiring the lock for some period of time as a protection against split-brains. 58 | It is not clear how Checks interact with Sessions. They do refer to defined checks but I am not sure if they are on the node or a service, so we will ignore them for now since I do not believe we need to be concerned with them (yet). 59 | 60 | The current code creates locks to represent a lock or presence. The lock api will create a session if one is not provided. If a session is created, it is also automatically renewed on an interval of ttl/2. A key/value is then acquired. If already acquired, the lock will keep retrying until it becomes available. Any error will fail the Lock(). A conflict may be returned if you are locking a semaphore. The retry has a fixed delay of 5s. On acquisition, the lock is monitored. If for any reason, this fails, you are notified that the lock is lost. 
Note that if the connection to the agent times out (which it does), you will think you lost the lock prematurely. 61 | 62 | An Unlock() does the reverse. If a session is being automatically renewed, it is stopped. The key/value is released, not deleted. 63 | 64 | A Destroy() assumes you have Unlock()ed and remove the key/value if possible. 65 | 66 | ## Where do we need to go 67 | 68 | Its pretty obvious that we need to center around a Session. There should only be one session per process. The ideal solution would allow us to create or re-acquire the session for a process. Our behavior would be to delete rather than release. For presence, a lock delay of 0 makes perfect sense since there is no concern for split brain. We can argue whether we need some lock delay for locks but we can begin with 0. If we require a lock delay for locks, then we will need a separate session for locks. 69 | 70 | Locks and presence become nothing more than acquiring and releasing a key/value beyond this point. The focus is all about keeping the session current. 71 | 72 | A limitation or possible enhancement to discuss with Hashicorp would be to perform a release&delete which is currently two operations. This may not be needed if all we ever do is just whack the session which will auto-destruct all our data. 73 | 74 | It seems that if data exists, we should also be checking the Leader(Session) of the data to determine if its still owned. Given the data may still exist in the KV store, it may not however be acquired (see above, 2 step process). 75 | 76 | To prevent multiple WATCHes, the parent location should be created. If there is nothing to watch with a given prefix, Consul will return immediately with a 404. Our only recourse is to spin or put a slight delay, or inject the root node. 77 | 78 | ### Presence 79 | 80 | An alternative for presence can be done via Services. One can register/deregister a service instance that will be accumulated into a given Service. Health checks can be added to manage the lifetime of the service. The check can be script, HTTP or TTL. Script and HTTP checks are performed on an interval. TTL checks must be updated via the UpdateTTL with a status. 81 | 82 | We will need to decide how much we allow Consul to do versus ourselves. There is obviously a balance of coupling between the ability to change our implementation and relying on something else. 83 | -------------------------------------------------------------------------------- /notes/logging-guidance.md: -------------------------------------------------------------------------------- 1 | # Logging Guidance in Diego 2 | 3 | ## Formatting 4 | 5 | Logger session names and messages should be hyphenated. Keys in a message data payload should use snake-case (`key_name`) instead of hyphens (`key-name`) or camel-case (`keyName`). 6 | 7 | ```go 8 | logger = outer_logger.Session("handling-request") 9 | logger.Info("message-to-show", lager.Data{"key_to_show": data}) 10 | ``` 11 | 12 | ## Sinks 13 | 14 | Typically each Diego component has only one logging sink registered, which writes all log messages to stdout. 15 | 16 | 17 | ## Levels 18 | 19 | The `lager` Logger type supports the following logging levels: 20 | 21 | * Fatal 22 | * Error 23 | * Info 24 | * Debug 25 | 26 | 27 | ### Fatal 28 | 29 | Log a message at the fatal level to cause the logger to panic, writing a log line with the stack trace at the panic site. Reserve this level only for errors on which it is impossible for the program to continue operating normally. 
These should be used in place of `panic(err)` and `os.Exit(1)`. 30 | 31 | ### Error 32 | 33 | Log a message at the error level to indicate something important failed. 34 | 35 | 36 | ### Info 37 | 38 | Log a message at the info level to indicate some normal but significant event, such as beginning or ending an important section of logic. Info logs are also usually appropriate to mark the boundaries of various internal and external APIs. 39 | 40 | If we're calling a function in our own code base, we should be wary of wrapping functions that log in additional logging. For example, if we have a synchronous function that logs using the preferred started/finished pattern: 41 | ```go 42 | func foo(logger lager.Logger) error { 43 | logger := logger.Session("foo") 44 | logger.Info("started") 45 | defer logger.Info("finished") 46 | return nil 47 | } 48 | ``` 49 | We should be careful not to log around its success when we call out to it: 50 | ```go 51 | logger.Info("calling-foo") // this line is excessive logging 52 | err := foo(logger) 53 | if err != nil { 54 | logger.Error("failed-fooing", err) // this line is great! 55 | } 56 | logger.Info("finished-fooing") // this line is excessive logging 57 | ``` 58 | Instead, this should be: 59 | ```go 60 | err := foo(logger) 61 | if err != nil { 62 | logger.Error("failed-fooing", err) // this line is great! 63 | } 64 | ``` 65 | 66 | 67 | 68 | ### Debug 69 | 70 | Log a message at the Debug level to trace ordinary or frequent events, such as pings, metrics, heartbeats, subscriptions, polling, and notifications. If it happens on a timer in a component that is usually driven by a client, internal or otherwise, it should probably log at the Debug level. If you can't imagine caring, but some other developer did in the past, you might want to log at Debug level. 71 | -------------------------------------------------------------------------------- /proposals/bbs-migration-stories.prolific.md: -------------------------------------------------------------------------------- 1 | [CHORE] Remove legacyBBS from the Auctioneer 2 | 3 | The Auctioneer is still referencing `legacyBBS` (i.e. `runtime-schema`). It does this to fetch the set of Cells. This should be an API on the BBS (it can reach out to consul to get the relevant data). 4 | 5 | L: versioning, diego:ga 6 | 7 | --- 8 | 9 | Remove the Receptor 10 | 11 | Acceptance: 12 | - The access VM should not have a Receptor! 13 | 14 | L: versioning, diego:ga 15 | 16 | --- 17 | 18 | All access to the BBS should go through one, master-elected, BBS server 19 | 20 | We set this up to drive out the rest of the migration work. 21 | 22 | All slave BBS servers should redirect to the master. 23 | 24 | Acceptance: 25 | - I get a redirect when I talk to a slave. 26 | 27 | L: versioning, diego:ga 28 | 29 | --- 30 | 31 | The master-elected BBS server should never allow multiple simultaneous LRP/Task Convergence runs 32 | 33 | Currently, the endpoints that trigger convergence will always spin up a convergence goroutine. Instead, if a convergence goroutine is still running we should queue up at most *one* additional convergence request to run. (i.e. if during a given convergence run *multiple convergence requests come in* we should only queue up *one* additional convergence request). 
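One way to picture this coalescing behavior is a buffered channel of capacity one. The sketch below is purely illustrative (the type and function names are made up and this is not the actual BBS implementation):

```go
// Illustrative only: coalesce convergence triggers so that at most one extra
// run is queued while a run is in flight. Names here are hypothetical.
type convergenceCoalescer struct {
	pending chan struct{} // capacity 1: holds at most one queued request
}

func newConvergenceCoalescer() *convergenceCoalescer {
	return &convergenceCoalescer{pending: make(chan struct{}, 1)}
}

// Trigger records a convergence request; any requests arriving while one is
// already queued are dropped, collapsing them into a single follow-up run.
func (c *convergenceCoalescer) Trigger() {
	select {
	case c.pending <- struct{}{}:
	default: // a follow-up run is already queued; drop the duplicate
	}
}

// Run performs convergence whenever at least one trigger has arrived.
func (c *convergenceCoalescer) Run(converge func()) {
	for range c.pending {
		converge()
	}
}
```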
34 | 35 | L: versioning, diego:ga 36 | 37 | --- 38 | 39 | BBS server should emit metrics 40 | 41 | - Convergence-related metrics (these were lost when we gutted the converger) 42 | - BBS requests (both # of requests so we can compute request rates, and request latencies) 43 | - All metrics emitted by the metrics server could, and should, just be emitted by the BBS during the convergence loop. 44 | 45 | L: versioning, diego:ga 46 | 47 | --- 48 | 49 | After a BOSH deploy, all data in the BBS should be stored in base64 encoded protobuf format 50 | 51 | This drives out (see https://github.com/onsi/migration-proposal#the-bbs-migration-mechanism): 52 | 53 | - introducing a database version 54 | - teaching the BBS server to run the migration on start 55 | - teaching the BBS server to bump the database version upon a succesful migration 56 | - removing the command line arguments that control the data encoding 57 | 58 | Acceptance: 59 | - After completing a BOSH deploy I see Task data in the BBS stored in base64 encoded protobuf format 60 | - I can (manually) fetch a database version key from etcd 61 | 62 | L: versioning, diego:ga, migrations 63 | 64 | --- 65 | 66 | [BUG] If the BBS is killed mid-migration, bringing the BBS back completes the migration 67 | 68 | This drives out managing the database version in the case of a failed migration. 69 | 70 | Acceptance: 71 | - while true; do 72 | - I fill etcd with many JSON-encoded tasks 73 | - I perform a BOSH deploy 74 | - I kill the BBS mid-migration 75 | - I bring the BBS back 76 | - I see protobuf-encoded tasks, only 77 | 78 | L: versioning, diego:ga, migrations 79 | 80 | --- 81 | 82 | During a migration, the BBS should return 503 for all requests 83 | 84 | - CC-Bridge components should forward the 503 error along where appropriate. 85 | - Route-Emitter should maintain the routing table and continue emitting it 86 | 87 | Acceptance: 88 | - I set up a long-running migration (perhaps I have a lot of data) 89 | - I see requests to the BBS return 503 during the migration 90 | 91 | L: versioning, diego:ga, migrations 92 | 93 | --- 94 | 95 | If a migration fails, I should be able to BOSH deploy the previously deployed release and recover. 96 | 97 | If we keep the `/vN` root node then this becomes trivial. We simply don't delete the `/vN-1` root node until after the migration. 98 | 99 | L: versioning, diego:ga, migrations 100 | 101 | --- 102 | 103 | If the Rep repeatedly fails to mark its ActualLRPs as EVACUATING it should fail to drain and the BOSH deploy should abort. 104 | 105 | This is a safety valve. It ensures we don't catastrophically lose all the applications if the BBS is unavailable. We should be somewhat generous and retry our evacuation requests repeatedly and perhaps require that all the requests fail or 503. The Cell should exit EVACUATING mode if this happens. 106 | 107 | L: versioning, diego:ga, migrations 108 | 109 | --- 110 | 111 | DesiredLRP data should be split across separate records 112 | 113 | This should bump the DB version. All LRP data should get migrated via a BOSH deploy. 114 | 115 | We do not modify the API, but - instead - do whatever work is necessary under the hood to maintain the existing API. 116 | 117 | Acceptance: 118 | - After a BOSH deploy I can see that all LRP data has been split in two. 
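To make "split in two" concrete, one plausible shape is a small, frequently-read scheduling record alongside a larger run-info record. The story does not prescribe the schema, so the field grouping below is purely illustrative:

```go
// Hypothetical split of a DesiredLRP record; the exact field grouping is
// illustrative, not the final schema.
type DesiredLRPSchedulingInfo struct {
	ProcessGuid string
	Domain      string
	Instances   int32
	MemoryMB    int32
	DiskMB      int32
	RootFs      string
}

type DesiredLRPRunInfo struct {
	ProcessGuid string
	RunPayload  []byte // serialized actions, env vars, egress rules, etc.
}
```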
119 | 120 | L: perf, versioning, diego:ga 121 | 122 | --- 123 | 124 | As a BBS client, I can efficiently get frequently accessed data for all DesiredLRPs in a domain 125 | 126 | Add a new API endpoint 127 | 128 | L: perf, versioning, diego:ga 129 | 130 | --- 131 | 132 | NSYNC's bulker should fetch the minimal set of DesiredLRP data 133 | 134 | L: perf, versioning, diego:ga 135 | 136 | --- 137 | 138 | Route-Emitter's bulk loop should fetch the minimal set of DesiredLRP data 139 | 140 | L: perf, versioning, diego:ga 141 | 142 | --- 143 | 144 | 145 | All Diego components should communicate securely via mutually-authenticated SSL 146 | 147 | L: versioning, diego:ga 148 | 149 | --- 150 | 151 | Cut Diego 0.9.0 152 | 153 | We draw a line in the sand. All subsequent work should ensure a clean upgrade path from 0.9.0. 154 | 155 | We call this 0.9.0 not to signify that all our GA goals have been met (performance is still somewhat lacking). However this gives us a reference point to be disciplined about maintaining backward compatibility during subsequent deploys. 156 | 157 | L: versioning, diego:ga 158 | 159 | --- 160 | 161 | [CHORE] Diego has an integration suite/environment that validates that upgrades from 0.9.0 to HEAD satisfy the minimal-downtime requirement 162 | 163 | This ensures that subsequent work migrates cleanly. 164 | 165 | L: versioning 166 | 167 | --- 168 | 169 | Set up a test suite that runs all migrations from 0.9.0 up for 100k records. 170 | 171 | Test suite should fail if migration time exceeds some threshold (2 minutes?) 172 | 173 | This can help us determine whether or not we want to implement background-migrations for things like rotating encryption keys. 174 | 175 | L: perf 176 | 177 | --- 178 | 179 | The converger should run convergence, not the BBS. 180 | 181 | To do this we'll want to: 182 | 183 | 1. Separate Task & Desired/Actual LRP convergence from the database-cleanup convergence. 184 | 2. Have the BBS perform database-cleanup periodically. 185 | 3. Have the Converger *fetch* data from the BBS, harmonize it, then update the BBS as relevant. 186 | 187 | L: versioning, cleanup 188 | 189 | --- 190 | 191 | I can rollback a migration to an earlier release 192 | 193 | We can either run a bosh errand that triggers down-migrations to the specified version. Or (better?) we can write a drain script on the BBS that triggers the down-migration -- is there any way to specify the target version this way? If we go that direction how do we prevent other BBSes from rerunning up migrations. 194 | 195 | L: versioning, needs-discussion 196 | -------------------------------------------------------------------------------- /proposals/better-buildpack-caching.md: -------------------------------------------------------------------------------- 1 | # Better Buildpack Caching 2 | 3 | Things are coming to a head: 4 | - staging on Diego is slow because we're copying buildpacks around. 5 | - the latest buildpacks are massive -- so much so that staging fails on Diego because we exceed the disk quota in the container! 6 | 7 | The buildpacks team is slimming down the buildpacks but this only solves a particular usecase. Customers with their own not-so-lithe buildpacks are going to run into similar issues. 8 | 9 | What can we do to improve Diego's staging performance? Here's a proposal. It isn't blocked on any new API from Garden though we can transparently transition to the Volumes API at some point in the future if it makes sense. 
10 | 11 | ## Abstractions vs Optimizations 12 | 13 | We are faced with an optimization problem that we need to adapt our abstraction to support. To do this I propose we extend Diego's cached downloading mechanism to support the notion of a named cache that can be mounted and shared between containers at container creation time. 14 | 15 | ### The new API 16 | 17 | We modify Tasks and DesiredLRPs to include a set of shared download caches to mount: 18 | 19 | ``` 20 | Task/DesiredLRP = { 21 | ... 22 | SharedDownloadCaches: [ 23 | { 24 | SharedCacheName: "buildpacks", 25 | MountPoint: "/buildpacks", 26 | } 27 | ] 28 | ... 29 | } 30 | ``` 31 | 32 | Then we update the `DownloadAction` to reference a shared cache. There are many paths we can go down here but I'll put up a strawman. I propose we introduce a new `CachedDownloadAction` so that we have: 33 | 34 | ``` 35 | DownloadAction = { 36 | From: "http://foo", 37 | To: "/absolute/path/in/container" 38 | } 39 | ``` 40 | 41 | ``` 42 | CachedDownloadAction = { 43 | From: "http://foo", 44 | To: "/path/relative/to/mount/point", 45 | CacheKey: "foo", 46 | SharedCacheName: "buildpacks", 47 | } 48 | ``` 49 | 50 | With this split the semantics become clear. A `DownloadAction` is never cached and always incurs the cost of a copy *into* the container. A `CachedDownloadAction` is always cached and mounted directly into the container. For the asset downloaded by a `CachedDownloadAction` to make it into the container the user *must* add a mount entry in the `SharedDownloadCaches` array. 51 | 52 | ### Implementation Notes 53 | 54 | The basic idea here is that executor will use the information in `SharedDownloadCaches` to bindmount the cache directories into containers. The cached download action will then download and untar/unzip assets directly into the cache directory. The cache directory will be assumed to live on the local box - though with the Volumes API in Garden we will eventually have a path towards having Garden live on another box. This is not a priority for now. 55 | 56 | Open questions: 57 | 58 | - today's Cached Downloader will delete files when it is running low on resources. We will at least need to teach it to only do this if a directory is not actively bound to a container. 59 | - what happens if an asset in the cache needs to be updated (for example: a new ruby buildpack has arrived)? Do we overwrite the existing asset even though the (shared) directory is bound into an existing container? What does the DEA do here? 60 | - what about cached assets where the copy-in is not a big deal (e.g. the App-Lifecycle binaries). Do we want to have separate `SharedCacheDownloadAction`s and `CachedDownloadAction`s? (The latter would behave like the existing `CachedDownloadAction` and incur the copy-in penalty). -------------------------------------------------------------------------------- /proposals/bind-mounting-downloads-stories.prolific.md: -------------------------------------------------------------------------------- 1 | Garden-Windows can support read-only bind-mounts specified at container creation 2 | 3 | For the Greenhouse team to implement. 4 | 5 | 6 | L: bind-mount-downloads 7 | 8 | --- 9 | 10 | [CHORE] Spike: verify buildpack-app staging works when buildpacks are available from bind-mounted directories 11 | 12 | Make sure staging tasks still work correctly when buildpacks are available in bind-mounted directories instead of streamed into containers. 
The files should have the permissions set in the current buildpack archives, and the UIDs that result from the rep/executor expanding them as the host `vcap` user. 13 | 14 | 15 | L: bind-mount-downloads 16 | 17 | --- 18 | 19 | As a cached-downloader client, I would like to retrieve a cached downloaded asset as an expanded archive in a directory 20 | 21 | 22 | The cacheddownloader should be able to provide cached assets as expanded archives in a directory, in addition to providing them as a tar stream. To be eligible to be provided in this form, a cache key must accompany the request for the asset, and the download response must provide an ETag header or a Last-Modified header (or both), which are later used as criteria for invalidating cached assets. 23 | 24 | When a client obtains an asset directory, it is also understood to obtain a reference to the directory itself. When it is done using the directory, it must explicitly signal to the cacheddownloader that it has released its reference to the directory, so that the cacheddownloader can eventually clean up unused directories from the cache. 25 | 26 | - If an asset is requested as a directory and is a cache hit as a directory, the existing directory path is provided. 27 | - If an asset is requested as a directory and is a cache hit as an archive, the asset is expanded into a new directory, and that directory is provided to the client. The archive asset may be deleted only after all previous readers for that archive are closed (as happens today). 28 | - If an existing asset is requested as a directory, but is a cache miss, the asset is downloaded into a new directory and provided to the client. Any prior asset for that cache key (directory or archive) should not be removed from disk until all other references to it are released. 29 | 30 | The cacheddownloader must also still be able to provide these cached assets as tar streams to clients: 31 | 32 | - If an asset is requested as an archive, but is a cache hit as a directory, the cacheddownloader should provide the contents of the directory as a tar stream, rather than redownloading the asset archive. 33 | 34 | The cacheddownloader should also account for the size of assets in directories correctly enough, so that it can correctly account for disk usage when deciding to remove the least-recently items from the cache to make space for new assets. 35 | 36 | However we implement this, it must work correctly on Windows as well as on Linux and Darwin. 37 | 38 | 39 | L: bind-mount-downloads 40 | 41 | --- 42 | 43 | As a BBS API client, I can specify that my DownloadAction on a Task or DesiredLRP is to be provided via a read-only bind-mount, and updated Diego Cells will respect that 44 | 45 | On the BBS and its models: Add a new optional, string-valued `BindMount` field to the `DownloadAction` model. If the value is specified as `readonly`, validate that the `CacheKey` has been set, in addition to any other existing `DownloadAction` validations. 46 | 47 | Verify that this is not a breaking change to the BBS API. We do not expect this to require any new endpoints. 48 | 49 | The rep/executor must recognize the presence of a `readonly` value in the optional `BindMount` value on a container's DownloadAction. In this case, before container creation, the executor should use the cacheddownloader to download the asset into a directory on the host system. 
The executor should then bind-mount that directory to the desired container-side directory at the time of container creation, using the `BindMounts` field on the garden `ContainerSpec`. 50 | 51 | We can impose a stricter failure policy for bind-mounted downloads: even if their failure as a streamed-in asset would not cause a failure of the action flow, their failure to download and attach as a bind-mount should cause the container creation to fail. 52 | 53 | On deletion of the container, the executor should release references to all the cacheddownloader-provided directories that it has bind-mounted to the container. 54 | 55 | 56 | L: bind-mount-downloads 57 | 58 | --- 59 | 60 | Stager uses read-only bind-mounts for buildpack and lifecycle downloads 61 | 62 | The stager should construct its tasks to use bind-mounting DownloadActions for the buildpack archives it downloads from Cloud Controller, and for the lifecycle binaries it downloads from the Diego file-server. 63 | 64 | 65 | L: bind-mount-downloads 66 | 67 | --- 68 | 69 | Nsync recipe generation uses read-only bind-mounts for lifecycle downloads 70 | 71 | The nsync LRP recipe generation should construct its DesiredLRPs to use bind-mounting DownloadActions for the lifecycle binaries it downloads from the Diego file-server. It should not use them for other assets, such as the app droplet in the case of buildpack apps. 72 | 73 | 74 | L: bind-mount-downloads 75 | 76 | -------------------------------------------------------------------------------- /proposals/bosh-deployments.md: -------------------------------------------------------------------------------- 1 | # The Road to Composing BOSH Releases 2 | 3 | A number of pivots met to discuss a vision for moving away from monolithic BOSH releases towards smaller BOSH releases. 4 | 5 | ## Where we are today 6 | 7 | CF-Release is a monolithic BOSH release containing just about *everything*. Diego-Release is also a monolithic BOSH release. 8 | 9 | #### Problems with this 10 | 11 | - There is duplicate code shared between CF-Release and Diego-Release that has probably already diverged (e.g. etcd) 12 | - Deploying Diego involves a complicated dance around merging CF's templates into Diego's templates. 13 | - It is currently difficult to track compatible versions of Diego + CF 14 | - It is difficult and awkward to describe how one might BOSH deploy Lattice (i.e. Diego + Router + LAMB) 15 | - Teams are all up in each other's business - constantly bumping CF-Release 16 | - CF-Release being a monolith hides circular dependencies (e.g. LAMB depends on CC which depends on LAMB) 17 | - CF-Release and Diego-Release both conflate BOSH releases with BOSH deployment templates 18 | 19 | ## Where we want to end up 20 | 21 | We envision a world where every major component (or logical grouping of components) is its own BOSH Release. These releases would contain code, tests, and job & packaging specs with appropriately namespaced properties. 22 | 23 | A separate repository called a *deployment* repository that acts as the glue that combines disparate releases together into a vetted grouping. The deployment repository would contain templates (potentially for a variety of targets), integration tests that validate the deployment, and pointers to concrete versions of all the releases it depends on. 24 | 25 | Teams will be responsible for their own Releases and collaborate on keeping the deployment(s) up-to-date. Presumably one team (e.g. 
runtime) would be responsible for keeping the deployment green (typically by poking the team that contributed a component that broke the release). 26 | 27 | BOSH may add some tooling around easing someworkflows here (e.g. downloading releases via URLs, knowing how to create and upload multiple releases given a deployment) 28 | 29 | Here's what the world might look like someday: 30 | 31 | ![the future?](bosh-deployments.png) 32 | 33 | ## FAQ 34 | 35 | - *I'm an OSS consumer of CF. Is my life now harder with all these releases?* 36 | 37 | No. As an OSS consumer you still only download one repository (`cf-deployment`). This will have all the information necessary to perform a BOSH deploy. It may even have baked-in URLs pointing to final-release tarballs that you can just download and deploy with a single `bosh deploy`. 38 | 39 | - *I'm an OSS developer. Is my life now harder with all these releases?* 40 | 41 | Not necessarily. You can still tweak code in an individual release. You then create & upload said release and then run `bosh deploy` with the manifest generated by the deployment. You will have to keep `cf-deployment` in-sync with the release if you make changes to the properties however, so that might be a new point of friction. 42 | 43 | - *I'm an OSS team member working on CF every day. Is my life now harder with all these releases?* 44 | 45 | Hopefully not. There are now cleaner delineations of responsibility and dependencies become more explicit. With the new decoupling you will rarely be blocked on CF-Release being red for reasons outside of your control: you can continue to work on your release and defer integration until CF-Release is green again (unless, of course, you're the reason CF is red). 46 | 47 | - *What about shared jobs*? 48 | 49 | We think this actually gets a lot nicer with releases broken out in this way. Let's take etcd as a concrete example. 50 | 51 | Runtime, LAMB and Diego all rely on etcd. We run two different etcd clusters (one for Diego and one for Runtime+LAMB). By sharing an etcd release we reduce code duplication and can invest unified effort in maintaining a correct BOSH release of etcd. When Diego needs to update etcd Runtime+LAMB need *not* update etcd as it will be possible for the two to reference different versions of the release in cf-deployment. 52 | 53 | - *What about shared packages*? 54 | 55 | We propose either duplicating shared compile-time packages or using git submodules to solve this problem. If this becomes a pain-point BOSH can try to help. 56 | 57 | ## Concrete baby steps for Diego 58 | 59 | 1. Create `diego-deployment` by pulling out all our Diego-Release templates out of Diego-Release and into `diego-deployment`. Evaluate the developer and consumer experience once this is done. `diego-deployment` points to vetted compatible versions of CF-Release and Diego-Release. CF-Release will still have its templates. `diego-deployment` lays the groundwork for someday becoming `cf-deployment`. 60 | 2. Start consuming `garden-linux` via `garden-linux-release` instead of duplicating `garden-linux` code. `garden-linux` moves out of `diego-release` and into `diego-deployment`. 61 | 3. Pull out `etcd` into its own release and have `diego-deployment` refer to it. 62 | 4. Present our results to other teams and evaluate next steps. 
63 | -------------------------------------------------------------------------------- /proposals/bosh-deployments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cloudfoundry/diego-notes/80926d1963c8b75cd2cd6e36f342a79ccf7afaa9/proposals/bosh-deployments.png -------------------------------------------------------------------------------- /proposals/desired_lrp_update_extension.md: -------------------------------------------------------------------------------- 1 | # Proposal for zero downtime updates of DesiredLRPs 2 | 3 | ## What Can be Updated? 4 | 5 | 1) App Update - Droplets, env vars etc - RunInfo 6 | 7 | 1) App Update - Memory Quota, Disk Quota - ScheduleInfo 8 | 9 | 1) Update exposed Ports - RunInfo But see below 10 | 11 | * Routing information can already be updated in a DesiredLRP. But no versions. 12 | 13 | * What does it mean to update Ports when some instances use port A others use Port B? 14 | 15 | * How would the GORouter manage this traffic? 16 | 17 | * How do we emit these new routes? How does an app user go to different ports? If we change ports does this not make old ports useless? 18 | 19 | 1) Asset URLs - RunInfo 20 | 21 | 1) Actions Updated - Start command, healthcheck etc - RunInfo 22 | 23 | 1) App SecurityGroups - RunInfo 24 | 25 | 1) Privileged Flag - RunInfo 26 | 27 | ## How do we version in the Model? 28 | 29 | RunInfo is the part of DesiredLRP stored as a BLOB in SQL schema. 30 | 31 | We could have multiple runInfos for a single desiredLRP - is this a refactor of the 32 | desiredLRP SQL. 33 | 34 | Have a RunInfo table and references to the desiredLRP. 35 | 36 | Inside the desiredLRP we could have active, old1, old2 references to RunInfo entries 37 | 38 | 39 | Link between actual LRP and the runInfo it was started with. 40 | 41 | DLRP 1-> Many(3) RunInfos 42 | ALRP 1-> 1 RunInfos 43 | DLRP 1-> Many ALRPs 44 | 45 | We may have to split some of the items that are in schedule info (disk, memory quota) stuff to be able to 46 | version those pieces. 47 | 48 | We should use locking and transactions to ensure we can update both the runInfo and the DesiredLRP and ActualLRP tables in a single step. 49 | 50 | ## Backwards Compatibility 51 | 52 | Pulling the RunInfo out will have some backwards compatibility effects. 53 | 54 | 1) Migration required to convert current BLOB to a single entry in the table. 55 | 1) API versioning. Should be fairly simple to have the legacy functionality having a single RunInfo per DLRP. 56 | 57 | ## What manages Behaviour? 58 | 59 | The behaviour is similar to evacuation. We want to ensure a new version of an instance is up prior to taking down the old version of an instance. We also want to serially roll all instances of an app. 60 | 61 | Probably use the BBS to manage the overall flow of this for a particular app. 62 | 63 | Can we reuse the evacuation flag in the actual LRP? This may cause conflicts with how cells are working as well as how convergence is running and looking at evacuating LRPs. 64 | * How would this work with cf delete commands from the cloud controller? If the versioning is meant to be transparent to users, we might want one evacuate call that targets LRPs by app name and one that can target `{name, version}` tuples. 65 | 66 | When the BBS and the Rep are in the process of switching a group of ActualLRP instances from the old version to the new version, the system is in a transition state. 
If the BBS receives a delete command while in this state, it should begin evacuating without trying to finish transitioning all the LRPs to the new version.
67 | 
68 | ## Events
69 | 
70 | What RunInfo is required for eventing?
71 | 
72 | 1) Need to look into usage of RunInfo in route-emitter and nsync bulker to ensure this can work correctly.
73 | 
74 | Looks like RunInfo does not directly get passed to route-emitter or nsync. Instead, it is the scheduling info that is used in route-emitter to reconstruct the routing tables. The scheduling info has desiredLRP resources (Memory / Disk / Rootfs) as well as the Route information. The route information does not include the port number though, just a string of routes.
75 | 
76 | 
77 | ## How do we clean up unused RunInfos?
78 | 
79 | New convergence could clean up unused RunInfos. Any RunInfo that is not the "active" RunInfo for a DesiredLRP and not referenced by a running ActualLRP can be removed.
80 | 
81 | What happens if cleanup occurs before a new version launches any ActualLRPs? If we don't have any kind of ordered versioning, then maybe we should spare `UNCLAIMED` and `CLAIMED` ActualLRPs as well.
82 | 
83 | 
84 | ## Proposed API in bbs
85 | 
86 | UpdateDesiredLRP(logger lager.Logger, processGuid string, update *models.DesiredLRPUpdate) error
87 | 
88 | models.DesiredLRPUpdate would now have the following information:
89 | 
90 | ```go
91 | type DesiredLRPUpdate struct {
92 | 	Instances        *int32
93 | 	Routes           *Routes
94 | 	Annotation       *string
95 | 	DesiredLRP       *DesiredLRP
96 | 	UpdateIdentifier *string
97 | }
98 | ```
99 | 
100 | To maintain backwards compatibility we will leave the Instances, Routes, and Annotation fields as top-level entries in the DesiredLRPUpdate. The DesiredLRP field will be a full DesiredLRP containing all the other information to be updated.
101 | The UpdateIdentifier will be the named version of this DesiredLRP, allowing an ActualLRP to be mapped to a specific DesiredLRP version. On creation of a DesiredLRP we can create a tag for the initial version.
102 | 
103 | Nsync will be able to create this easily as it already builds up the new DesiredLRP structure using the recipe builder. It will simply send the full DesiredLRP to the endpoint.
104 | 
105 | We could also think of creating a NEW endpoint that just takes the full new DesiredLRP.
106 | 
107 | UpdateDesiredLRP(logger lager.Logger, processGuid string, update *models.DesiredLRP) error
108 | 
109 | This would be a NEW endpoint and would make backwards compatibility easier.
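As a rough illustration of how a caller such as nsync might exercise the extended update (a sketch against the proposed fields, which do not exist yet; the client variable and version value are assumed):

```go
// Sketch of a caller using the proposed extended update. The UpdateIdentifier
// and embedded DesiredLRP fields are the additions proposed above.
func updateToNewVersion(logger lager.Logger, bbsClient bbs.Client, processGuid string, newLRP *models.DesiredLRP) error {
	updateID := "some-new-version-guid" // named version tag for this definition

	update := &models.DesiredLRPUpdate{
		// existing top-level fields stay as-is for backwards compatibility
		Instances:  &newLRP.Instances,
		Routes:     newLRP.Routes,
		Annotation: &newLRP.Annotation,

		// proposed additions: the full new definition plus its version tag
		DesiredLRP:       newLRP,
		UpdateIdentifier: &updateID,
	}

	return bbsClient.UpdateDesiredLRP(logger, processGuid, update)
}
```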
110 | 111 | ## Proposed New model for DesiredLRP and ActualLRP 112 | 113 | ### DesiredLRP 114 | 115 | Current: 116 | 117 | ```go 118 | type DesiredLRP struct { 119 | ProcessGuid string 120 | Domain string 121 | RootFs string 122 | Instances int32 123 | EnvironmentVariables []*EnvironmentVariable 124 | Setup *Action 125 | Action *Action 126 | StartTimeoutMs int64 127 | DeprecatedStartTimeoutS uint32 128 | Monitor *Action 129 | DiskMb int32 130 | MemoryMb int32 131 | CpuWeight uint32 132 | Privileged bool 133 | Ports []uint32 134 | Routes *Routes 135 | LogSource string 136 | LogGuid string 137 | MetricsGuid string 138 | Annotation string 139 | EgressRules []*SecurityGroupRule 140 | ModificationTag *ModificationTag 141 | CachedDependencies []*CachedDependency 142 | LegacyDownloadUser string 143 | TrustedSystemCertificatesPath string 144 | VolumeMounts []*VolumeMount 145 | Network *Network 146 | } 147 | ``` 148 | 149 | New: 150 | 151 | ```go 152 | type LRPDefinition struct { 153 | DefinitionIdentifier string 154 | RootFs string 155 | Instances int32 156 | EnvironmentVariables []*EnvironmentVariable 157 | Setup *Action 158 | Action *Action 159 | StartTimeoutMs int64 160 | DeprecatedStartTimeoutS uint32 161 | Monitor *Action 162 | DiskMb int32 163 | MemoryMb int32 164 | CpuWeight uint32 165 | Privileged bool 166 | Ports []uint32 167 | Routes *Routes 168 | LogSource string 169 | LogGuid string 170 | MetricsGuid string 171 | EgressRules []*SecurityGroupRule 172 | CachedDependencies []*CachedDependency 173 | LegacyDownloadUser string 174 | TrustedSystemCertificatesPath string 175 | VolumeMounts []*VolumeMount 176 | Network *Network 177 | } 178 | 179 | type DesiredLRP struct { 180 | ProcessGuid string 181 | Domain string 182 | models.LRPDefinition 183 | Annotation string 184 | LRPDefinition2 *models.LRPDefinition 185 | LRPDefinition3 *models.LRPDefinition 186 | } 187 | ``` 188 | 189 | 190 | ### ActualLRP 191 | 192 | Current: 193 | 194 | ```go 195 | type ActualLRP struct { 196 | ActualLRPKey 197 | ActualLRPInstanceKey 198 | ActualLRPNetInfo 199 | CrashCount int32 200 | CrashReason string 201 | State string 202 | PlacementError string 203 | Since int64 204 | ModificationTag ModificationTag 205 | } 206 | ``` 207 | 208 | New: 209 | 210 | ```go 211 | type ActualLRP struct { 212 | ActualLRPKey 213 | ActualLRPInstanceKey 214 | ActualLRPNetInfo 215 | DesiredLRPDefinitionIdentifier string 216 | CrashCount int32 217 | CrashReason string 218 | State string 219 | PlacementError string 220 | Since int64 221 | ModificationTag ModificationTag 222 | } 223 | ``` 224 | 225 | The main change here is to have a DefinitionIdentifier to link the ActualLRP to the definition for that DLRP 226 | -------------------------------------------------------------------------------- /proposals/docker_registry_caching.md: -------------------------------------------------------------------------------- 1 | # Background 2 | 3 | Private Docker Registry aims to: 4 | - guarantee that we fetch the same layers when the user scales an application up. 
5 | - guarantee uptime (if the docker hub goes down we shall be able to start a new instance)
6 | - support private docker images
7 | 
8 | # Private Registry Configuration
9 | 
10 | The configuration of the registry is proposed [here](https://github.com/pivotal-cf-experimental/diego-dev-notes/blob/master/proposals/docker_registry_configuration.md)
11 | 
12 | # Image Caching
13 | 
14 | To guarantee predictable scaling and uptime we have to copy the Docker image from the public Docker Hub to the private registry.
15 | 
16 | The basic steps we need to perform are:
17 | 
18 | * pull the image from the public Docker Hub
19 | * tag the image
20 | * push the tagged image to the Private Registry
21 | * clean up the pulled image
22 | 
23 | The tagging has two purposes:
24 | 
25 | 1. We target the private registry host (since the tag includes the host)
26 | 1. It associates the image with the desired application:
27 | - generate a guid
28 | - send the guid back to CC as a URL pointing back to the private docker registry
29 | 
30 | ## Pull location
31 | 
32 | There are two ways to do this, based on the location where we store the (intermediate) pulled image:
33 | 
34 | ### Builder
35 | 
36 | The Builder pulls the image and stores it locally in the container. Then it tags and pushes it to the private registry.
37 | 
38 | **Pros:**
39 | 
40 | - staging remains isolated with respect to disk space and CPU usage
41 | - easily scale the Docker image staging
42 | 
43 | **Cons:**
44 | 
45 | - limited disk space in the staging container (currently 4096 MB)
46 | 
47 | *We may need to increase the default quota for staging of docker images.*
48 | 
49 | 
50 | ### Private Registry
51 | 
52 | We can use the docker daemon running in the [Private Docker Registry job](https://github.com/pivotal-cf-experimental/docker-registry-release), instructing it to pull, tag and push the image to the registry. To do this we need to either export the daemon's own API or provide a new endpoint that knows how to call it locally. Since the registry runs on the same machine as the daemon, this should be a fairly cheap operation.
53 | 
54 | The Builder can drive the caching steps (pull, tag, push) by using the API remotely exposed by the Registry Release.
55 | 
56 | **Pros:**
57 | 
58 | - saves bandwidth
59 | 
60 | *The image is not transferred between the Builder task and the Registry*
61 | 
62 | - caches layers
63 | 
64 | *Since the docker graph is common for all cache requests, the layers are cached in the same place*
65 | 
66 | **Cons:**
67 | 
68 | - disk space for pulling images
69 | 
70 | *The Docker graph will quickly fill up with layer data for different images. While this serves as a cache and can significantly speed up the pull step, it also creates a problem if we want to get rid of the unused layers. We might use the CC URL pointing to the tag guid for cleanup.*
71 | 
72 | - no isolation between staging tasks
73 | 
74 | *A single big image might block other staging tasks. To isolate the staging tasks we can limit the size of pulled images. This can be done by creating a file-based file system for every cache request.*
75 | 
76 | - need to scale the registry with staging tasks
77 | 
78 | ## Implementation details
79 | 
80 | ### Caching process
81 | 
82 | We can take two approaches to implement the caching process (the pull, tag and push steps):
83 | - programmatically drive [docker code](https://github.com/docker/)
84 | - use the docker client & daemon
85 | 
86 | Both approaches require a privileged process to be able to mount the docker graph file system.
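For the daemon-based option (detailed further below), a minimal sketch of the pull/tag/push/cleanup sequence, shelling out to the `docker` CLI, could look like the following; the image name, GUID, and registry address are placeholders and the error handling is simplified:

```go
// Minimal sketch of the caching steps against a local docker daemon.
// dockerPath, image, registryAddr, and guid are placeholders.
func cacheImage(dockerPath, image, registryAddr, guid string) error {
	cachedTag := fmt.Sprintf("%s/%s:latest", registryAddr, guid)

	steps := [][]string{
		{"pull", image},           // fetch the image from the public Docker Hub
		{"tag", image, cachedTag}, // retarget the image at the private registry
		{"push", cachedTag},       // upload the tagged image
		{"rmi", image},            // clean up the pulled image
	}

	for _, args := range steps {
		cmd := exec.Command(dockerPath, args...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("docker %s failed: %v", args[0], err)
		}
	}
	return nil
}
```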
87 | 
88 | #### Programmatic
89 | 
90 | We can extend the code in garden-linux that is used to fetch the rootfs by adding pull-image, tag-image and push-image functionality.
91 | 
92 | **Pros:**
93 | 
94 | - efficiency: reduced memory consumption, number of processes
95 | - no need to provision a Docker daemon
96 | 
97 | **Cons:**
98 | 
99 | - dependency on Docker internals
100 | 
101 | 
102 | #### Docker daemon
103 | 
104 | We can run the Docker daemon and call its public REST API. To do this we shall run the daemon as root or in a privileged container.
105 | 
106 | **Pros:**
107 | 
108 | - stable public API (does not depend on internal implementation)
109 | 
110 | **Cons:**
111 | 
112 | - additional process and memory overhead
113 | - more complex provisioning (configuration, scripts, manifests)
114 | - root access **and** privileged container
115 | 
116 | ##### Processes
117 | 
118 | We need two processes:
119 | - Docker daemon
120 | - Builder
121 | 
122 | We have to launch the two processes in parallel since there is no point in waiting for the daemon to finish.
123 | 
124 | The Builder is responsible for:
125 | - waiting for the daemon to start by reading the response from the daemon's `_ping` endpoint
126 | - fetching metadata
127 | - caching the image by invoking the `docker` CLI:
128 |   - pull image
129 |   - tag image
130 |   - push image
131 | 
132 | The Docker daemon accepts requests made directly by the Builder or triggered by the `docker` client.
133 | 
134 | The processes are launched as an [ifrit](https://github.com/tedsuo/ifrit) group, which guarantees that if one of them exits or crashes the other one will be terminated. We also use ifrit to be able to terminate both of them if the parent process is signalled.
135 | 
136 | ## Cached images
137 | 
138 | ### URL
139 | 
140 | The images that are pushed to the private registry are stored as `<registry>/<GUID>:latest`, where GUID is a generated V4 uuid.
141 | 
142 | ### Tag
143 | 
144 | The cached images are stored in `library/<GUID>` with the tag `latest`.
145 | 
146 | ## Cloud Controller
147 | 
148 | The Cloud Controller table `droplets` is extended with a `cached_docker_image` column that stores the image URL returned by the staging process.
149 | 
150 | When the desire-app request is generated, the `cached_docker_image` is sent instead of the user-provided `docker_image` in the `app` table.
151 | 
152 | If the user opts out of caching the image, on restage we have to nil the `cached_docker_image` since this will disable the caching on the desire-app request.
153 | 
-------------------------------------------------------------------------------- /proposals/docker_registry_configuration.md: --------------------------------------------------------------------------------
1 | # Configuration settings for using private Docker Registry
2 | 
3 | Private Docker Registry aims to:
4 | - guarantee that we fetch the same layers when the user scales an application up.
5 | - guarantee uptime (if the docker hub goes down we shall be able to start a new instance)
6 | - support private docker images
7 | 
8 | ## Security
9 | 
10 | The Docker Registry can be run in 3 flavours with regard to security:
11 | - HTTP
12 | - HTTPS with a self-signed certificate
13 | - HTTPS with a CA-signed certificate
14 | 
15 | The first two are considered insecure by the docker registry client. The client needs additional configuration to allow access to insecure registries.
16 | 
17 | ## Configuration
18 | 
19 | ### Components
20 | 
21 | The following components are involved in the staging and running of a Docker image:
22 | - Builder
23 | The Docker app life-cycle builder needs to access the registry to fetch image metadata. To do this it uses the Docker Registry Client. Therefore the client has to be configured to allow insecure connections if this is needed.
24 | - Stager
25 | The builder runs inside a container. This container should allow access to the internally hosted registry (in the protected, isolated CF network). The container is configured by the stager that requests the launch of the docker builder task.
26 | 
27 | ### Goals
28 | 
29 | The configuration goals are:
30 | - to allow access to the private Docker Registry
31 | - to enable the Docker Registry Client to access the host
32 | 
33 | ### Proposal
34 | 
35 | #### Consul service
36 | 
37 | We shall register the private docker registry with Consul (as we do with the file server). The Docker Registry shall be registered as `-dockerRegistryURL=http(s)://docker_registry.service.consul:8080`.
38 | 
39 | This will help us easily discover the service instances. We also do not need to specify concrete IPs of the service nodes in the BOSH manifests.
40 | 
41 | #### Stager
42 | 
43 | The Docker Registry instance IPs are obtained from the Consul cluster. The Stager shall be provided with the URL of the Consul Agent via `-consulCluster=http://localhost:8500`.
44 | 
45 | The Stager shall be instructed to add the registry service instances as insecure with the help of the `-insecureDockerRegistry` flag. This flag shall be provided if the registry is accessed over HTTP or over HTTPS with a self-signed certificate.
46 | 
47 | #### Builder
48 | 
49 | The `-dockerRegistryAddresses` argument is used to provide a comma-separated list of `host:port` pairs to access all Docker Registry instances.
50 | 
51 | The Docker client needs a list of all insecure registry service instances. That's why the Builder shall read the optional command line argument `-insecureDockerRegistries` to configure the Docker Registry Client and allow access to any insecure hosts. The argument shall accept a comma-separated list of `host:port` pairs.
52 | 
53 | The list is available in the Consul cluster, but the Builder is running inside a container with no access to the Consul Agent. That's why the `-insecureDockerRegistries` list is built by the Stager.
54 | 
55 | As a side effect the docker app life-cycle builder may provide access to public registries that are insecure (either HTTP or self-signed cert HTTPS) if they are listed in `-insecureDockerRegistries`. This, however, will also require forking/modifying the Stager code.
56 | 
57 | The `-dockerDaemonExecutablePath` argument is used to configure the correct path to the Docker executable in different environments (Inigo, different Cells, unit testing).
58 | 
59 | **Pros:**
60 | 
61 | - Simpler configuration
62 | - Consistent with the CF CLI
63 | 
64 | **Cons:**
65 | 
66 | - No support for external insecure registries
67 | 
68 | ## Egress rules
69 | 
70 | To be able to access the private Docker Registry we have to open up the container. We have these options:
71 | 
72 | - The Stager fetches a list of all registered `docker_registry` services. This would return all registered IPs and ports, and we shall poke holes allowing access to all those IPs and ports.
73 | - We pick a unique port for the private docker registry that doesn't conflict with anything else in the VPC (hard/tricky!) and then open up that port on the entire VPC.
74 | 75 | If we use the first option, there's a small race in that a new Docker registry may appear/disappear while we are staging. This would result in a staging failure but this should be very infrequent. 76 | 77 | ### Staging 78 | 79 | To enable discovery of all service instances we shall use [Consul service nodes endpoint](http://www.consul.io/docs/agent/http.html#_v1_catalog_service_lt_service_gt_) (/v1/catalog/service/). This will return the following JSON: 80 | ``` 81 | [ 82 | { 83 | "Node": "foobar", 84 | "Address": "10.1.10.12", 85 | "ServiceID": "redis", 86 | "ServiceName": "redis", 87 | "ServiceTags": null, 88 | "ServicePort": 8000 89 | } 90 | ] 91 | ``` 92 | 93 | To execute the above request we use the `-consulAgentURL` flag. Currently `ServicePort` is always 0 since we do not register a concrete port and rely on hardcoded one. Using different ports requires changes in: 94 | 95 | - consul agent scripts (to register service with different ports) 96 | - stager task create request (to add several egress rules) 97 | 98 | For MVP0 we can hardcode the port and make the needed changes for all services later. 99 | 100 | Once we find all service instances of the registry we add them explicitly to the staging task container. This presents some problems since we are opening access to the registry even if the desired egress rules (the default CF configuration) does not allow this. 101 | 102 | ### Running 103 | 104 | We don't need to add egress rules to allow access to the Docker Registry instances since Garden fetches the image outside of the container. 105 | 106 | ## Opt-in 107 | 108 | Image caching in private docker registry will not be enabled by default. User can opt-in by: 109 | 110 | ``` 111 | cf set-env DIEGO_DOCKER_CACHE true 112 | ``` 113 | 114 | The app environment is propagated all the way to the stager, which configures the docker lifecycle builder accordingly with the help of the `-cacheDockerImage` command line flag. 115 | 116 | If the `-cacheDockerImage` is present, the builder: 117 | - generates cached docker image GUID 118 | - caches the image in the private registry (pull, tag & push) 119 | - includes the cached image metadata (`:/:latest`) in the staging response 120 | 121 | The stager returns the generated staging response back to Cloud Controller (CC). CC stores the cached image metadata in the CCDB. 122 | 123 | If the user opt-out then the cached image metadata is cleared to enable restage with the user-provided docker image. 124 | 125 | ## Private images 126 | 127 | Users need to provide credentials to access the Docker Hub private images. The default flow is as follows: 128 | * user provides credentials to `docker login -u -p -e ` 129 | * the login generates ~/.dockercfg file with the following content: 130 | ``` 131 | { 132 | "https://index.docker.io/v1/": { 133 | "auth": "bXlVc2VyOm15c2VjUmV0", 134 | "email": "" 135 | } 136 | } 137 | ``` 138 | * `docker pull` can now use the authentication token in the configuration file 139 | 140 | Therefore we can implement two possible flows in Diego using `docker login`: 141 | * user provides user/password 142 | * user provides authentication token and email 143 | 144 | The second option should be safer since the authentication token is supposed to be temporary, while user/password credentials can be used to generate new token and gain access to Docker Hub account. 145 | 146 | ### Findings / Issues 147 | 148 | #### Docker authentication token 149 | 150 | The current (May 2015) authentication token contains base64 encoded `:`. 
For example the token `bXlVc2VyOm15c2VjUmV0` above is actually `myUser:mysecRet`. This compromises the idea of using the token to avoid storing the user credentials.
151 | 
152 | The Cloud Controller models produce a debug log entry with the arguments used to start the application. Since these arguments contain the authentication token, this presents a security risk, as operators and admins should not be able to see the user's credentials.
153 | 
154 | #### Docker email
155 | 
156 | `docker login` always requires an email, although it is not used if the user/password login succeeds. This is already reported as https://github.com/docker/docker/issues/6400 and may be fixed in the future.
157 | 
158 | #### In-memory storage of credentials or token
159 | 
160 | We cannot store application metadata in memory since we may have more than one CC. The `create app` and `start app` requests to CC may reach different instances, so the authentication info stored in memory will not be available to all CC instances.
161 | 
162 | #### OAuth
163 | 
164 | Docker seems to support OAuth. To use OAuth one should either:
165 | * programmatically use Docker golang code
166 | * use the Docker REST endpoints with a custom client
167 | 
168 | Both approaches seem to bring too much overhead. We either have to adapt to Docker code changes or create a custom client that can log in and pull.
169 | 
170 | ### Proposed design
171 | 
172 | Since we cannot use the Docker authentication token as a secure replacement for user/password, we shall do the following on the CC side:
173 | * accept user, password & email as the credentials needed to pull a private Docker image
174 | * encrypt the credentials in the database
175 | * propagate the credentials to Diego via the stage-app request
176 | 
177 | On the Diego side we shall:
178 | * propagate the credentials to the `stager` and the Docker `builder`
179 | * (`builder`) propagate the credentials to the Docker golang API to fetch the private image metadata
180 | * (`builder`) perform `docker login` with the credentials to enable `docker pull` of the private image
181 | 
182 | 
-------------------------------------------------------------------------------- /proposals/faster-missing-cell-recovery.md: --------------------------------------------------------------------------------
1 | # Faster Missing Cell Recovery
2 | 
3 | Today, Cell reps maintain a presence key in etcd. This key has a TTL of `PRESENCE_TTL` (currently set to 30s). When a Cell disappears this key eventually expires. When the Converger next performs its convergence loop (every `CONVERGENCE_INTERVAL` seconds) it notices that there are [ActualLRPs](https://github.com/pivotal-cf-experimental/diego-dev-notes#harmonizing-desiredlrps-with-actual-lrps-converger) and [Tasks](https://github.com/pivotal-cf-experimental/diego-dev-notes#maintaining-consistency-converger) in the BBS that claim to be running on non-existent Cells. It responds by restarting the ActualLRPs and failing the Tasks.
4 | 
5 | Today, both `PRESENCE_TTL` and `CONVERGENCE_INTERVAL` are set to 30s. In the worst-case scenario, the time between a Cell going down and the ActualLRPs being restarted can be as high as `PRESENCE_TTL + CONVERGENCE_INTERVAL` -- i.e. 1 minute. We can do better.
6 | 
7 | #### Why bother?
8 | 
9 | - Doing better makes for good demos.
10 | - Doing better is better :stuck_out_tongue_winking_eye:.
11 | 
12 | ## A better way
13 | 
14 | One option is to reduce `PRESENCE_TTL` and `CONVERGENCE_INTERVAL`.
The former is practical, the latter is not:
15 | 
16 | - Since we *only* heartbeat the presence information into etcd we can probably easily sustain a `PRESENCE_TTL` in the neighborhood of `5 seconds`. This knob may need to be tweaked down as the number of Cells increases, however.
17 | - It is, however, impractical to substantially decrease the `CONVERGENCE_INTERVAL`. Convergence is a very expensive operation that places heavy strain on etcd.
18 | 
19 | I propose the following:
20 | 
21 | - We decrease `PRESENCE_TTL` to 5 seconds.
22 | - We modify the Converger to watch for expiring Cell presences and respond immediately by triggering a convergence loop.
23 | 
24 | In this way we can reduce the maximum recovery time from 1 minute to ~5 seconds.
25 | 
26 | #### But I thought we were moving away from etcd watches?
27 | 
28 | Yes and no.
29 | 
30 | We are moving away from etcd watches for three reasons:
31 | 
32 | - They easily lead to the thundering-herd problem (e.g. a Task is complete, all receptors head to etcd to mark the Task as RESOLVING)
33 | - They generally perform poorly under heavy write loads
34 | - They make it harder to move away from etcd
35 | 
36 | But we are primarily concerned with not watching for changes in our *data* (DesiredLRPs, ActualLRPs, Tasks).
37 | 
38 | Watching for *services* (e.g. CellPresence) is fine. In the event that we move our data out of etcd we are likely to continue to use etcd (or consul, which also supports watch-like semantics) for our services. Note that moving our data out of etcd also dramatically reduces the write pressure, which makes reducing the `PRESENCE_TTL` to 5 seconds a lot more feasible even in very large deployments (>1000 cells).
39 | 
-------------------------------------------------------------------------------- /proposals/go-modules-migrations.md: --------------------------------------------------------------------------------
1 | # BOSH release migration path to use Go modules
2 | 
3 | ## Summary
4 | 
5 | After [proposing
6 | different](https://docs.google.com/document/d/1MeiXIqzsj_j1ziYAfhVXfCCFJLiKhi7HiOyF3g-7Wpk/edit#heading=h.9jamy725425y)
7 | routes for converting bosh releases to use Go modules, we are going to move
8 | forward with the following implementation:
9 | 
10 | - Single go module under `src/code.cloudfoundry.org` with a generic name
11 |   `code.cloudfoundry.org`
12 | 
13 | As far as estimation goes, I estimate 1-2 weeks of work for
14 | diego-release, an example of a release with over 130 submodules.
15 | 
16 | ## Migration Path
17 | 
18 | Here are the steps taken for conversion:
19 | 
20 | ```
21 | cd ./src/code.cloudfoundry.org
22 | go mod init code.cloudfoundry.org
23 | go mod vendor -e #ignore errors
24 | # run tests
25 | # if tests failed due to mismatch on version, try the submoduled version instead
26 | ```
27 | 
28 | ### Add Dependabot for independent modules (e.g. code.cloudfoundry.org/localip)
29 | 
30 | When a new module is created, let's also add dependabot so that the dependencies
31 | can remain updated.
32 | [archiver](https://github.com/cloudfoundry/archiver/blob/2762da2677ce6ba931a6e9fff947c7541f470615/.dependabot/config.yml)
33 | is an example of the dependabot config.
34 | 
35 | ### Add GitHub Actions integration for running the tests for independent modules
36 | 
37 | When a new module is created, let's also add GitHub Actions integration for
38 | running the tests.
39 | 
40 | ### Module dependency versions
41 | 
42 | BOSH releases have specific SHAs checked out for each dependency.
[Some 43 | repositories](https://github.com/cloudfoundry/diego-release/blob/68b60677acffd6ab241e2698f581c52f5da3ed83/.dependabot/config.yml) 44 | are setup to receive security update (if dependabot security bump is working as 45 | expected), but we are not bumping dependencies to the latest all the time. When 46 | these are converted to Go module's dependency: 47 | 48 | - Go mod tooling will automatically get the latest one from the remote 49 | repository 50 | - Ensure tests pass with the latest version 51 | - If Broken, let's lock it back to the submoduled SHA and exclude that from 52 | dependabot 53 | 54 | 55 | ### Vendoring external modules that use replace directive 56 | 57 | [`replace` 58 | directive](https://thewebivore.com/using-replace-in-go-mod-to-point-to-your-local-module/) 59 | is very handy when a developer is making changes to modules for BOSH releases 60 | and wants to see the changes applied after deploy without having to commit the 61 | change to upstream. Using `replace` directive works when it's internal to the 62 | release, but when the component is used outside of the release (e.g. 63 | `diego-release` uses `guardian` from `garden-runc-release`) we would have to 64 | manually add the replace commands before being able to vendor `guardian`. In 65 | order to use `guardian`, you must clone the needed repositories using submodules. Here 66 | is an example workflow when adding those components 67 | 68 | ``` 69 | cd inigo 70 | go mod init 71 | go mod edit \ 72 | -replace code.cloudfoundry.org/guardian=./guardian \ 73 | -replace code.cloudfoundry.org/garden=./garden \ 74 | -replace code.cloudfoundry.org/grootfs=./grootfs \ 75 | -replace code.cloudfoundry.org/idmapper=./idmapper 76 | go mod vendor 77 | ``` 78 | 79 | ### Add a script to update golang version for each module 80 | 81 | Each module would have a script to update go version. [More 82 | info](https://golang.org/ref/mod#go-mod-file-go) for how to do that. 83 | 84 | 85 | ## Work-In-Progress 86 | 87 | - [x] diego-release v2.51.0. 
dependabot is not syncing because of [this 88 | issue](https://github.com/dependabot/dependabot-core/issues/4249) 89 | - [x] routing-release 90 | [cloudfoundry/routing-release#211](https://github.com/cloudfoundry/routing-release/pull/211) 91 | - [x] cf-networking-release 92 | [cloudfoundry/cf-networking-release#90](https://github.com/cloudfoundry/cf-networking-release/pull/90) 93 | - [x] silk-release 94 | [cloudfoundry/silk-release#32](https://github.com/cloudfoundry/silk-release/pull/32) 95 | - [x] nats-release 96 | [cloudfoundry/nats-release#38](https://github.com/cloudfoundry/nats-release/pull/38) 97 | 98 | - [ ] code.cloudfoundry.org/lager need to be updated in diego-release and 99 | silk-release and all of the other consumers since the 2.0.0 tag was added we 100 | need to figure out a way to update this dependency 101 | - [ ] Make sure all scripts that bump a submodule also run `go mod tidy & go mod 102 | vendor` 103 | - [x] Make sure norsk pipelines are all green 104 | 105 | ### List of newly converted modules that have github-actions and dependabot 106 | - code.cloudfoundry.org/archiver 107 | - code.cloudfoundry.org/cf-networking-helpers 108 | - code.cloudfoundry.org/silk (CI on concourse instead) 109 | - code.cloudfoundry.org/diego-logging-client 110 | - code.cloudfoundry.org/certsplitter 111 | - code.cloudfoundry.org/bytefmt 112 | - code.cloudfoundry.org/durationjson 113 | - code.cloudfoundry.org/consuladapter 114 | - code.cloudfoundry.org/eventhub 115 | - code.cloudfoundry.org/debugserver 116 | - code.cloudfoundry.org/clock 117 | - code.cloudfoundry.org/localip 118 | 119 | ## TODO As a different track 120 | - Add a dependabot configuration for releases so they get the latest version of 121 | the dependencies. 122 | - Go mod fetches `code.cloudfoundry.org/lager` v2.0.0 as the latest version of 123 | library, instead of master/main branch with the latest updates. This is leading to 124 | a situation where we have to use a [replace 125 | directive](https://github.com/cloudfoundry/diego-release/blob/2eb134ea1227064fd4fc15671a34552a20a6e3f3/src/code.cloudfoundry.org/go.mod#L13). 126 | This works for now, but it has a side-effect of never getting updated when 127 | there is a commit on lager. [The following is a list of 128 | repositories](https://pkg.go.dev/code.cloudfoundry.org/lager?tab=importedby) that 129 | needs to remove the replace directive when we have a different solution 130 | - containernetworking/cni package needs to be updated on silk-release 131 | - Figure out a way to sync github action YAMLs. Currently we have an independent 132 | configuration that are mostly the same. If we need to update one, it would be 133 | better to be able to have a list of those repositories so that they all get 134 | the same version of improvement. This is also true to dependabot.yml. 135 | - certauthority is now a repository, let's move cert creation functionality out 136 | of inigo so that other repositories don't have to import inigo. 
Here is the 137 | list of repositories that do similar cert generation for tests: 138 | - diego-release 139 | - silk-release 140 | - cf-networking-helpers 141 | - cf-networking-release 142 | 143 | ## End Goal 144 | 145 | After migration is completed: 146 | 147 | - `GOPATH` in `.envrc` is removed 148 | - Release is using Go 1.16+ 149 | - `GO111MODULE` is unset 150 | - When we `ag GOPATH`, all results should be converted/removed including docs 151 | -------------------------------------------------------------------------------- /proposals/lifecycle-design.md: -------------------------------------------------------------------------------- 1 | ## App Lifecycles: Static or Dynamic? 2 | 3 | Lifecycles consist of two parts: a trinity of binaries (Builder, Runner, Healthcheck), and a Recipe-Generator interface which can generate a Diego recipe for the Staging Task (Builder) and DesiredLRP (Runner + Healthcheck). 4 | 5 | The binaries can simply be found in the blobstore. But the Recipe-Generator is invoked directly by the CC(-Bridge). So how do we identify the presence of a Lifecycle, and then invoke its Recipe-Generator? AKA, how do we load Lifecycles? We see three possible choices: 6 | 7 | **Choice #1:** A set of Lifecycles is part of the CC(-Bridge) codebase. The Recipe-Generators are directly compiled into the CC(-Bridge). This is very simple and reliable. But it is not very extensible without forking the code. 8 | 9 | **Choice #2:** The CC(-Bridge) performs some kind of DLL / shelling-out shenanigans to a local set of Recipe-Generator binaries. The options here are somewhat language-specific. How would we load these binaries? Co-located bosh jobs? 10 | 11 | **Choice #3:** Each Recipe-Generator runs as a service, which the CC(-Bridge) can discover dynamically. This requires service discovery, but is otherwise cleanly decoupled. 12 | 13 | The current implementation is a sort of hodgepodge of the above choices. The available lifecycle are passed in as arguments to the CC-Bridge, but the Recipe-Generators are hardwired into the code. 14 | -------------------------------------------------------------------------------- /proposals/per-application-crash-configuration-stories.csv: -------------------------------------------------------------------------------- 1 | Id,Title,Labels,Iteration,Iteration Start,Iteration End,Type,Estimate,Current State,Created at,Accepted at,Deadline,Requested By,Description,URL,Owned By,Owned By,Owned By 2 | 106626406,⬇ Crash Restart Policy ⬇,"",,,,release,,unscheduled,"Oct 26, 2015",,,Eric Malm,,https://www.pivotaltracker.com/story/show/106626406,,, 3 | 87479698,When creating a DesiredLRP I can specify a Crash Restart Policy,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"This story is purely about the Receptor API - nothing is actually going to use these values yet... 
4 | 5 | ``` 6 | DesiredLRP.CrashRestartPolicy = { 7 | NumberOfImmediateRestarts: 3, 8 | MaxSecondsBetweenRestarts: 960, 9 | MaxRestartAttempts: 200 10 | } 11 | ``` 12 | 13 | The Receptor applies the following validations: 14 | 15 | - `NumberOfImmediateRestarts`: 16 | + Allowed to be `0` (means ""never restart immediately"") 17 | + Must be less than `MaxRestartAttempts` 18 | + Must be less than `10` 19 | - `MaxSecondsBetweenRestarts`: 20 | + Must be greater than `30` seconds 21 | - `MaxRestartAttempts` 22 | + Allowed to be 0 (means ""never stop restarting"") 23 | 24 | `CrashRestartPolicy` is optional.",https://www.pivotaltracker.com/story/show/87479698,,, 25 | 87479700,Diego honors the `CrashRestartPolicy` on a DesiredLRP when determining whether or not to restart an `ActualLRP`,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"Currently, Diego relies on the hard-coded default `CrashRestartPolicy` in runtime-schema. Instead, if the DesiredLRP has a `CrashRestartPolicy` Diego should use that policy. 26 | 27 | If the DesiredLRP has no `CrashRestartPolicy`, Diego should use the hard-coded default. 28 | 29 | > This affects both the Rep and the Converger but if we've done our homework, the change should be purely in Runtime-Schema.",https://www.pivotaltracker.com/story/show/87479700,,, 30 | 87479702,I can update a DesiredLRP's crash policy,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,The `DesiredLRPUpdateRequest` should support modifying the `CrashRestartPolicy`. Note that no further action needs to be taken. Runtime-Schema always looks up the `CrashRestartPolicy` on the Desired LRP when taking actions.,https://www.pivotaltracker.com/story/show/87479702,,, 31 | 87479704,The `DefaultCrashRestartPolicy` is stored in the BBS and can be modified via the Receptor API on a per-domain basis.,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"When upserting a Domain, I should be able to specify an optional `DefaultCrashRestartPolicy` on the Domain. 32 | 33 | When determining whether or not to restart the application Diego should do the following: 34 | 35 | - Use the `CrashRestartPolicy` on the `DesiredLRP` 36 | - If there is none, use the `DefaultCrashRestartPolicy` associated with the LRP's `domain` 37 | - If there is none, fallback on using the hard-coded `CrashRestartPolicy` in runtime-schema",https://www.pivotaltracker.com/story/show/87479704,,, 38 | 87479706,As a CF Admin I can CRUD the `DefaultCrashRestartPolicy` using the CC API,crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"This should end up in the CCDB. 39 | 40 | To flow this into Diego and keep Diego in sync with CF, NSYNC's bulker should fetch the `DefaultCrashRestartPolicy` and update it when it upserts the domain.",https://www.pivotaltracker.com/story/show/87479706,,, 41 | 87479708,"As a CF Admin, I can CRUD `CrashRestartPolicy`",crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,A la ApplicationSecurityGroups,https://www.pivotaltracker.com/story/show/87479708,,, 42 | 87479710,"As a CF Admin, I can associate a `CrashRestartPolicy` with a space and the see the new `CrashRestartPolicy` take effect",crash-restart-policy,,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"This has the effect of updating all applications in said space with the `CrashRestartPolicy`. This should emit a desire message to NSYNC's listener, which should update the `CrashRestartPolicy` associated with the `DesiredLRP`. 
43 | 44 | In addition, NSYNC's bukler should eventually sync changes associated with `CrashRestartPolicy` (i.e. this should modify the etag associated with the applications in the space).",https://www.pivotaltracker.com/story/show/87479710,,, 45 | 87479712,The CF cli should support specifying the `DefaultCrashRestartPolicy`,"cli, crash-restart-policy",,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"``` 46 | cf default-crash-restart-policy --> gets the default crash restart policy 47 | cf set-default-crash-restart-policy --> sets the default crash restart policy 48 | ```",https://www.pivotaltracker.com/story/show/87479712,,, 49 | 87479714,The CF cli should support creating `CrashRestartPolicies` and associating them with spaces,"cli, crash-restart-policy",,,,feature,,unscheduled,"Feb 2, 2015",,,Onsi Fakhouri,"A la Application Security Groups 50 | ``` 51 | cf crash-restart-policy --> get a particular crash restart policy 52 | cf crash-restart-policies --> get all crash restart policies 53 | cf create-crash-restart-policy --> creates a crash restart policy 54 | cf update-crash-restart-policy --> updates an existing crash restart policy 55 | cf delete-crash-restart-policy --> removes an existing crash restart policy 56 | cf bind-crash-restart-policy --> associates a crash restart policy with a space 57 | cf unbind-crash-restart-policy --> removes a crash restart policy from a space 58 | ```",https://www.pivotaltracker.com/story/show/87479714,,, 59 | 106626404,⬆︎ Crash Restart Policy ⬆︎,"",,,,release,,unscheduled,"Oct 26, 2015",,,Eric Malm,,https://www.pivotaltracker.com/story/show/106626404,,, 60 | -------------------------------------------------------------------------------- /proposals/per-application-crash-configuration.md: -------------------------------------------------------------------------------- 1 | # Per-Application Crash Configuration 2 | 3 | A common feature request in CF is to have finer-grained control over how applications should be restarted in the face of crashes. 4 | 5 | Diego can easily support this. 6 | 7 | ## Crash Restart Policy 8 | 9 | A CrashRestartPolicy in Diego can be expressed as: 10 | 11 | ``` 12 | { 13 | NumberOfImmediateRestarts: 3, 14 | MaxSecondsBetweenRestarts: 960, 15 | MaxRestartAttempts: 200, 16 | } 17 | ``` 18 | 19 | This particular set of numbers constitutes Diego's default policy. Here's what they mean: 20 | 21 | - An application is immediately restarted for the first `NumberOfImmediateRestarts = 3` crashes. 22 | - After that we wait 30s, 60s, 120s, up-to `MaxSecondsBetweenRestarts = 960` between subsequent restarts. 23 | - After 200 restart attempts we give up on the application. 24 | 25 | > Note: the consumer cannot modify the minimum time between restarts (30s) or the fact that we impose an exponential backoff. Also the consumer cannot modify the 5-minute rule (i.e. we reset the crash count if you've been running for 5 minutes). 26 | 27 | ### Allowed Values: 28 | 29 | #### `NumberOfImmediateRestarts` 30 | 31 | - Allowed to be 0 (means never restart immediately) 32 | - Must be less than `MaxRestartAttempts` 33 | - Must be less than 10 34 | 35 | #### `MaxSecondsBetweenRestarts` 36 | 37 | - Must be greater then 30s 38 | 39 | #### `MaxRestartAttempts` 40 | 41 | - Allowed to be 0 (means never stop restarting) 42 | 43 | ## Setting the Crash Policy 44 | 45 | The crash policy is part of the `DesiredLRP`: 46 | 47 | ``` 48 | { 49 | ProcessGuid:..., 50 | ... 51 | CrashRestartPolicy: { 52 | ... 53 | }, 54 | ... 
55 | } 56 | ``` 57 | 58 | - The crash policy can be updated after-the fact and, therefore, is part of `DesiredLRPUpdateRequest`. 59 | - `CrashRestartPolicy` is optional. If unspecified, the default will be used (see below). 60 | 61 | ## Setting the Default Crash Policy 62 | 63 | The `DefaultCrashRestartPolicy` is stored in the BBS and can be modified via the Receptor API on a *per-domain-basis* (note: this is not a BOSH property - the `DefaultCrashRestartPolicy` can be modified at runtime!) 64 | 65 | Only `DesiredLRP`s with *no* `CrashRestartPolicy` use the `DefaultCrashRestartPolicy`. If the `DefaultCrashRestartPolicy` is not specified via the Receptor API, Diego will use the hard-coded values listed above. 66 | 67 | ## Required CF Work 68 | 69 | We propose that only CF admins/operators will be allowed to set/modify crash policies. So: 70 | 71 | - As a CF Admin/Operator I can use the CC API to set the `DefaultCrashRestartPolicy`. 72 | - As a CF Admin/Operator I can set up a CrashRestartPolicy on an Org/Space/App level. 73 | 74 | > Notes: in addition to flowing an event through to NSYNC's listeneer, we'll probably want the NSYNC bulker to periodically (re)set the `DefaultCrashRestartPolicy` to make sure it's up-to-date. 75 | -------------------------------------------------------------------------------- /proposals/placement-constraints-stories.csv: -------------------------------------------------------------------------------- 1 | Id,Title,Labels,Iteration,Iteration Start,Iteration End,Type,Estimate,Current State,Created at,Accepted at,Deadline,Requested By,Description,URL,Owned By,Owned By,Owned By,Comment,Comment 2 | 106613240,⬇ Placement Constraints ⬇,"",,,,release,,unscheduled,"Oct 25, 2015",,,Eric Malm,,https://www.pivotaltracker.com/story/show/106613240,,, 3 | 90748240,"As a consumer of the Receptor API, I should be able to specify a Constraint when requesting a Task or DesiredLRP",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,"1. A constraint looks like: 4 | 5 | ``` 6 | type Constraint struct{ 7 | Require []string, 8 | } 9 | ``` 10 | 11 | 2. Cells (in particular, the Rep) should be BOSH configurable with a list of `Tags`. 12 | 13 | 3. When scheduling the DesiredLRP/Task the auction should only allocate the application on Cells that satisfy the constraint. If there are no Cells that satisfy the Constraint, the Auctioneer should attach a `diego_errors.CELL_MISMATCH_MESSAGE` placement error. 14 | 15 | To satisfy the constraint a Cell **MUST** have any tags in the `Require` list 16 | 17 | 4. Empty Constraints are allowed. 18 | 19 | 5. 
The Constraint cannot be modified on a `DesiredLRP` 20 | 21 | Here's the [accompanying Diego-Dev-Notes proposal](https://github.com/pivotal-cf-experimental/diego-dev-notes/blob/master/accepted_proposals/placement_pools.md).",https://www.pivotaltracker.com/story/show/90748240,,, 22 | 90748242,"As a consumer of the Receptor API, I should see that tags associated with the Cells I get from the `/v1/cells` endpoint",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,Here's the [accompanying Diego-Dev-Notes proposal](https://github.com/pivotal-cf-experimental/diego-dev-notes/blob/master/accepted_proposals/placement_pools.md).,https://www.pivotaltracker.com/story/show/90748242,,, 23 | 90748244,"As a consumer of the CC API, I can CRUD PlacementConstraint rules",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,"This should mirror the organizational structure of [Application Security Groups](http://apidocs.cloudfoundry.org/203/). 24 | 25 | After this story I can: 26 | 27 | - Create a PlacementConstraint 28 | - Delete a PlacementConstraint 29 | - List all PlacementConstraints 30 | - Retrieve a Particular PlacementConstraint 31 | - Update a PlacementConstraint 32 | 33 | Add this point these are just objects in CCDB. No behavior changes (other than the presence of this new API). 34 | 35 | **PlacementConstraints should be under the `/v3` API**",https://www.pivotaltracker.com/story/show/90748244,,, 36 | 90748246,"As a consumer of the CC API, I can associate a PlacementConstraint with a Space (Staging)",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,"A space can have one Staging PlacementConstraint. The staging tasks for applications in this space should be run on Cells that satisfy this constraint. 37 | 38 | After this story I can (If this is not what CC's DSL naturally gives us by default let me know): 39 | 40 | - Retrieve the Staging PlacementConstraint for a Space 41 | - Associate a PlacementConstraint as the Staging PlacementConstraint for a Space 42 | - Disassociate a PlacementConstraint as the Staging PlacementConstraint for a Space 43 | - Given a PlacementConstraint, list all Spaces that have it as a Staging PlacementConstraint 44 | 45 | Behaviorally: 46 | 47 | - If a Space has an associated PlacementConstraint then Diego should be informed of the PlacementConstraint when the staging task is created. 48 | - If a Space has no associated PlacementConstraint there should be no PlacementConstraint sent to Diego.",https://www.pivotaltracker.com/story/show/90748246,,, 49 | 90748666,"As a consumer of the CC API, I can associate a PlacementConstraint with a Space (Running) 50 | ",placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,"A space can have one Running PlacementConstraint. The staging tasks for applications in this space should be run on Cells that satisfy this constraint. 51 | 52 | After this story I can (If this is not what CC's DSL naturally gives us by default let me know): 53 | 54 | - Retrieve the Running PlacementConstraint for a Space 55 | - Associate a PlacementConstraint as the Running PlacementConstraint for a Space 56 | - Disassociate a PlacementConstraint as the Running PlacementConstraint for a Space 57 | - Given a PlacementConstraint, list all Spaces that have it as a Running PlacementConstraint 58 | 59 | Behaviorally: 60 | 61 | - If a Space has an associated PlacementConstraint then Diego should be informed of the PlacementConstraint when the DesiredLRP is created. 
62 | - If a Space has no associated PlacementConstraint there should be no PlacementConstraint sent to Diego. 63 | 64 | Changing the PlacementConstraint associated with a space is allowed. At this point the applications in the space need to be restarted but CC has no way to orchestrate this. The same issue exists with the Application Security Groups.",https://www.pivotaltracker.com/story/show/90748666,,, 65 | 90748248,As a consumer of the CC API I can specify a default Staging PlacementConstraint,placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,This should be unspecified out of the box. Once specified it applies to any spaces that have no Staging PlacementConstraint.,https://www.pivotaltracker.com/story/show/90748248,,, 66 | 90748250,As a consumer of the CC API I can specify a default Running PlacementConstraint,placement-constraints,,,,feature,,unscheduled,"Mar 19, 2015",,,Onsi Fakhouri,This should be unspecified out of the box. Once specified it applies to any spaces that have no Running PlacementConstraint.,https://www.pivotaltracker.com/story/show/90748250,,, 67 | 94711650,It should be possible to have different sets of routers service different pools of placement constraints,"placeholder, placement-constraints",,,,feature,,unscheduled,"May 15, 2015",,,Onsi Fakhouri,This is a placeholder that Onsi is injecting in that needs definition.,https://www.pivotaltracker.com/story/show/94711650,,,,"/cc @emalm -- this came up today and it's something that just needs to be on the roadmap so I've added this placeholder and injected it into the rough priority order. Happy to chat about it. (Onsi Fakhouri - May 15, 2015)","(also feel free to slap me on the wrist - am just trying to get it off of my plate and into tracker somewhere) (Onsi Fakhouri - May 15, 2015)" 68 | 106613356,⬆︎ Placement Constraints ⬆︎,"",,,,release,,unscheduled,"Oct 25, 2015",,,Eric Malm,,https://www.pivotaltracker.com/story/show/106613356,,, 69 | -------------------------------------------------------------------------------- /proposals/placement_pools.md: -------------------------------------------------------------------------------- 1 | # Supporting multiple RootFSes. AKA what is Stack? 2 | 3 | Stack has been a confusing concept in the CC. I'd like to propose that stack simply correlates with the RootFS. I also propose that a single Cell should be able to support multiple RootFSes (this has many benefits including, for example, simplifying the process of upgrading from one RootFS to another). 4 | 5 | Where we need to head in the short-term is support for the following: 6 | 7 | - `lucid64` - a preloaded tarball based on `lucid` 8 | - `cflinxfs2` - a preloaded tarball based on `trusty` 9 | - `docker` - a dynamically download RootFS. 10 | 11 | I propose making this clearer in the Diego API by dropping `Stack` from the `DesiredLRP/Task` and, instead, relying on the existing `RootFS` field which would take on the values (e.g.): 12 | 13 | ``` 14 | lucid64: { 15 | RootFS: "preloaded://lucid64", 16 | } 17 | 18 | cflinxfs2: { 19 | RootFS: "preloaded://cflinuxfs2", 20 | } 21 | 22 | docker: { 23 | RootFS: "docker:///foo/bar#baz", 24 | } 25 | ``` 26 | 27 | ## Supporting multiple RootFSes 28 | 29 | Once the `DesiredLRP/Task` has this information we would modify the components as follows. 30 | 31 | ### Rep 32 | 33 | The Rep would take a list of preloaded RootFSes and a list of supported RootFSProviders. 
For example: 34 | 35 | ``` 36 | rep --preloaded-rootfses='{"lucid64":"/path/to/lucid64", "cflinuxfs2":"/path/to/cflinuxfs2"} --supported-root-fs-providers="docker" 37 | ``` 38 | 39 | This information could make its way onto the Cell Presence. Not strictly necessary at this point, though it is convenient to have access to this information from the API. 40 | 41 | ### Auction 42 | 43 | During the auction the Auctioneer will be given the RootFS and the Rep will furnish its available preloaded RootFSes and supported RootFS providers in its response to `State` requests: 44 | 45 | ``` 46 | State: { 47 | RootFSProviders = ["docker"], 48 | PreloadedRootFSes = ["lucid64", "cflinuxfs2"], 49 | } 50 | ``` 51 | 52 | The Auctioneer would then know whether or not a Cell could support the requested RootFS. In pseudocode: 53 | 54 | ``` 55 | func (c Cell) CanRunTask(task Task) bool { 56 | if task.RootFS.Scheme == "preloaded" { 57 | return c.PreloadedRootFSes.Contains(task.RootFSResource) 58 | } else { 59 | return c.RootFSProviders.Contains(task.RootFS.Scheme) 60 | } 61 | } 62 | ``` 63 | 64 | ### Stager/NSYNC 65 | 66 | Stager/NSYNC would translate CC's "Stack" requests to appropriate values for the two RootFS fields. 67 | 68 | --- 69 | 70 | # Placement Constraints (née Placement Pools) 71 | 72 | Placement Constraints are going to be one of the first new features that Diego brings to the platform. This proposal is intended to get that ball rolling. 73 | 74 | ## What are Placement Constraints? 75 | 76 | At the highest level Placement Constraints will allow operators to group Cells arbitrarily by painting them with tags. Diego workloads can then be constrainted to run on a certain set of tags. 77 | 78 | #### Tags 79 | 80 | A `tag` is a an arbitrary case-insensitive string (e.g. `"staging"`, `"production"`, `"skynet"`). Cells can be painted with arbitrarily many `tags`. For example 81 | 82 | Cell 1 | Cell 2 | Cell 3 | Cell 4 | Cell 5 | Cell 6 | Cell 7 | Cell 8 | Cell 9 83 | -------|--------|--------|--------|--------|--------|--------|--------|------- 84 | staging|staging|staging|staging|production|production|production|production| 85 | skynet|skynet|||skynet|skynet|||skynet 86 | 87 | A `tag` must be less than 64 characters long (arbitrary!) 88 | 89 | #### Constraints 90 | 91 | `constraint` is a set of rules associated with a Task or DesiredLRP. The `constraint` determines which Cells are eligible for running the given workload. 92 | 93 | Here is an empty `constraint`: 94 | ``` 95 | Constraint: { 96 | Require: [], 97 | } 98 | ``` 99 | This says that *no* tags are required. Cells 1-9 satisfy this `constraint`. 100 | 101 | Here is a `constraint` that requires a workload run on `staging` Cells: 102 | ``` 103 | Constraint: { 104 | Require: ["staging"], 105 | } 106 | ``` 107 | Cells 1-4 satisfy this `constraint`. 108 | 109 | Here is a `constraint` that only runs on `staging` Cells allocated to the `skynet` corporation: 110 | ``` 111 | Constraint: { 112 | Require: ["staging", "skynet"], 113 | } 114 | ``` 115 | Cells 1-2 satisfy this `constraint`. 116 | 117 | Finally, here are some `constraint`s that cannot be satisfied with the given Cell setup: 118 | ``` 119 | Constraint: { 120 | Require: ["alfalfa"], 121 | } 122 | ``` 123 | No cells could possiby satisfy these particular `constraint`s. Diego is not in the business of identifying these sorts of inconsistencies -- it is up to the consumer to coordinate their Placement Constraint `tags` and `constraint`s. 
Diego will, however, inform the user (asynchronously, after a failed attempt to auction) when it fails to satisfy a constraint.
124 | 
125 | #### How does this interact with `Stack`?
126 | 
127 | It does not need to. CC's `Stack` is related to the RootFS (discussed above).
128 | 
129 | #### Are Placement Constraints dynamic?
130 | 
131 | No. Not for MVP.
132 | 
133 | You cannot change the `constraint` on a DesiredLRP. You must request a new DesiredLRP.
134 | 
135 | Also, you cannot change the tags on a running Cell. You will need to perform a rolling deploy to change the tags.
136 | 
137 | Because of these constraints we do not need to make the Converger aware of Placement Constraints: they can't change, so there's nothing to keep consistent once ActualLRPs are scheduled on Cells.
138 | 
139 | #### Querying Diego for `tags`
140 | 
141 | For MVP we propose using the `/v1/cells` Receptor API to fetch all Cells and derive the set of tags.
142 | 
143 | If/when we switch to a relational database we will be able to support `/v1/tags` to fetch all tags cheaply.
144 | 
145 | ## Changes to Diego
146 | 
147 | #### Task/DesiredLRP
148 | 
149 | We add `constraint` to Tasks and DesiredLRPs. `constraint` will be **immutable** and is optional (leaving it off means "run this anywhere").
150 | 
151 | #### Rep
152 | 
153 | The `rep` should accept a new command-line flag `-tags` -- a comma-separated list of tags. We should make these tags BOSH-configurable.
154 | 
155 | The `rep` will include the list of `tags` in `CellPresence` and in responses to `State` requests from the Auctioneer.
156 | 
157 | #### Auctioneer
158 | 
159 | The `auctioneer` will be responsible for enforcing `constraint`s in addition to `rootfs` (see above). Extending it to apply the rules outlined above should be fairly straightforward (see the tag-matching sketch further below).
160 | 
161 | The only subtlety here is around the `PlacementError` that the `auctioneer` applies to ActualLRPs and Tasks that fail to be placed. There are two, and they should be interpreted strictly as follows:
162 | 
163 | - `diego_errors.CELL_MISMATCH_MESSAGE`: should be returned if there are *no* Cells satisfying the required `constraint`.
164 | - `diego_errors.INSUFFICIENT_RESOURCES_MESSAGE`: should be returned only if there *are* Cells satisfying the required `constraint` but those Cells do not have sufficient capacity to run the requested work.
165 | 
166 | ## Changes to CF/CC
167 | 
168 | Like Application Security Groups (ASGs), Placement Constraints (PCs) will be assigned on a per-space basis. I imagine we would mirror the organization of ASGs as closely as possible, with the difference that the PC associated with an application will apply to both staging and running applications. Looking at the [CC API docs](http://apidocs.cloudfoundry.org/197/) for ASGs, this would entail APIs that support:
169 | 
170 | - CRUDding PCs
171 | - A space has one StagingPC and one RunningPC
172 | - Specifying a default StagingPC and a default RunningPC
173 | 
174 | For CC, a PC would look identical to a Diego `constraint`:
175 | 
176 | ```
177 | PC = {
178 |   Require: [],
179 | }
180 | ```
181 | 
182 | As with ASGs, modifications to a PC will only go through once applications are restarted.
183 | 
184 | #### Validations
185 | 
186 | For MVP I don't think that CC should enforce any rules on PlacementConstraints. One could imagine a world in which the CC is taught by an operator about which `tags` are deployed (or reaches out to Diego to learn about available tags) -- I don't think we need to go there quite yet.
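As referenced in the Auctioneer section above, here is a minimal Go sketch of the tag-matching rule the `auctioneer` would apply. The `Cell` and `Constraint` types here are illustrative stand-ins, not the real auction types:

```go
package placement

import "strings"

// Constraint is the set of tags a Task or DesiredLRP requires.
type Constraint struct {
	Require []string
}

// Cell carries the tags an operator painted onto it.
type Cell struct {
	Tags []string
}

// SatisfiesConstraint reports whether the Cell carries every required tag.
// Tags are compared case-insensitively, per the proposal; an empty Require
// list is satisfied by every Cell.
func (c Cell) SatisfiesConstraint(constraint Constraint) bool {
	tags := make(map[string]bool, len(c.Tags))
	for _, tag := range c.Tags {
		tags[strings.ToLower(tag)] = true
	}
	for _, required := range constraint.Require {
		if !tags[strings.ToLower(required)] {
			return false
		}
	}
	return true
}
```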
187 | 188 | #### Placement Constraints & Restarts vs Restages 189 | 190 | The CC is somewhat unclear (and confused) about what actions must trigger restages and what actions must trigger restarts. 191 | 192 | We propose that modifications to Placement Constraints need only trigger a restart. 193 | -------------------------------------------------------------------------------- /proposals/private-docker-registry.md: -------------------------------------------------------------------------------- 1 | # Private Docker Registry 2 | 3 | Diego's docker support today is limited to fetching images from the public docker registry. This has a number of issues: 4 | 5 | - No way to guarantee that we fetch the same layers when the user scales an application up. 6 | - No way to guarantee uptime (if the docker hub goes down we can't start a new instance) 7 | - No (good) way to support private docker registries 8 | 9 | To overcome these issues we propose running a private Docker registry backed by Riak CS. One of the most compelling aspects of our proposed solution is that it requires *no* changes to any core Diego components (including Garden-Linux). Instead, only CC and the Docker-Circus need to change. 10 | 11 | **Goals** 12 | - solve the aforementioned problems 13 | - do so without making Diego more docker-specific 14 | 15 | **None-Goals** 16 | - providing CF consumers with a docker registry that they `docker push` to 17 | 18 | ## The Proposal 19 | 20 | The basic premise here is that we would augment the existing staging step in the Docker flow to *fetch* the Docker layers and then re-upload them to a private Docker registry -- we would push to the registry using a uniquely generated guid (e.g the ProcessGuid) to guarantee that scaling the instance up/down always results in an identical set of layers. 21 | 22 | To implement this we would modify the Docker-Tailor and Stager/CC to do the following: 23 | 24 | 1. When staging a Dockerimage the Docker-Tailor fetches the layers and uploads them to the internal registry. 25 | - For a V0 implementation we would always fetch all the layers and attempt to push them to the internal registry. Docker's built-in caching support will naturally ensure we only push new layers, however will be required to download all the layers. 26 | - For a V1 implementation we might try to minimize the layers we *download* as well. Perhaps we can query the registry for the set of layers that comprise the image and only download ones that are *not* in our private registry? 27 | 2. Upon staging the Docker-Tailor should return a Dockerimage url that uniquely points to the image created in the internal registry. This will be unique to each CF push. 28 | 3. CC should store off this Dockerimage url and use it when requesting instances of the application. 29 | 30 | Note that the Task the Stager requests will need to be able to communicate with the private docker registry (presumably in the VPC). We'll need to use the Receptor's egress rules to poke holes to communicate with the private docker registry. We'll need to discover these at runtime in the Stager in order to poke *just* the holes to communicate with the docker registry. This is not an issue when it comes to *fetching* the Dockerimage as Garden-Linux does the downloading and has complete access to the VPC. 31 | 32 | ## The Road to MVP 33 | 34 | We begin by using the [Python docker-registry](https://github.com/docker/docker-registry). 
We can use the community-generated [docker-registry BOSH release](https://github.com/cloudfoundry-community/docker-registry-boshrelease) to BOSH-deploy the Python registry. The first iteration would be a spike in which we back the docker-registry with local storage.
35 | 
36 | With an internal registry in place we can work on the code necessary to have the Tailor download and re-upload the Dockerimage. It would be best if this could be optional (maybe even on an app-by-app basis, for flexibility for now). Applications that opt into the internal registry will be copied into it; apps that aren't will always be fetched from the Dockerhub.
37 | 
38 | Once this approach is validated we can make the registry more robust and add new features.
39 | 
40 | ## The Roadmap past MVP
41 | 
42 | - MVP: private docker registry backed by local disk
43 | - MVP: opt-in support for storing Dockerimages in the private registry
44 | - private docker registry backed by [Riak CS](https://github.com/cloudfoundry/cf-riak-cs-release)
45 | - highly-available private docker registry
46 | - ability to stage *private* images (i.e. images that require auth to download; basic idea: the user supplies credentials which we use *at staging time* to fetch their Dockerimage - a short-lived token is OK since we only need it for staging)
47 | - support for [Docker's V2 API](https://github.com/docker/distribution) -- currently not ready for primetime, but when it is we'll want our private registry to speak V2 and use the new Docker Go codebase
48 | - ability to prune the private registry of unused Dockerimages (e.g. grab DesiredLRPs in fresh domains and flag any unused layers for removal)?
-------------------------------------------------------------------------------- /proposals/relational-bbs-db-stories-2016-02-16.prolific.md: --------------------------------------------------------------------------------
1 | As a Diego developer, I expect the Diego BBS-benchmark test suite to include per-record reads and writes in the rep bulk loop
2 | 
3 | The rep bulk loop tests in the BBS benchmark suite should perform individual reads and some writes (perhaps proportional to the number of instances) after their bulk loop, to match the actual rep code more closely.
4 | 
5 | For 200K instances, can we do these requests all from the same ginkgo process, if its host has enough CPU/bandwidth/connections available, or do we need to distribute this test across several VMs to avoid resource bottlenecks at the test site?
6 | 
7 | Acceptance: Benchmark test suites running in CI include these per-record writes and reads in their rep bulk loops.
8 | 
9 | L: perf, perf:breadth
10 | 
11 | ---
12 | 
13 | As a Diego developer, I can run BBS unit tests against MySQL on my local workstation
14 | 
15 | The BBS should be able to pass its unit tests when run locally against a MySQL database, as well as against etcd. For the component integration test, pass in the required MySQL connection information via flags.
16 | 
17 | This should include support for encrypting the parts of Task, DesiredLRP, and ActualLRP records that should be encrypted and managing the encryption key name in the MySQL database.
18 | 
19 | Also document the steps needed to install MySQL on a development workstation.
20 | 
21 | Acceptance: I can follow the MySQL installation steps in the CONTRIBUTING document in diego-release and then successfully run the BBS unit tests.
22 | 23 | L: bbs:relational 24 | 25 | --- 26 | 27 | As a Diego developer, I expect BBS unit tests to run against both MySQL and etcd in CI 28 | 29 | Now that the BBS unit tests run against both MySQL and etcd, configure CI to run them both against etcd and MySQL in the 'units' concourse job. This also entails installing any new dependencies in the appropriate pipeline Docker image. 30 | 31 | Acceptance: I can verify from the 'units' job output that the expected tests suites are run. 32 | 33 | L: bbs:relational 34 | 35 | --- 36 | 37 | As a Diego developer, I expect inigo and component integration test suites to run against both MySQL and etcd in CI 38 | 39 | Extend MySQL support to the other non-BOSH-based integration tests that our components run against the BBS. This includes inigo, the 'cmd' integration tests in each component, and a few other test suites (such as the operation test suites in the rep). 40 | 41 | Acceptance: I can verify from the 'units' and 'inigo' job output that the expected tests suites are run. 42 | 43 | L: bbs:relational 44 | 45 | --- 46 | 47 | As a Diego operator, I can run a CF+Diego deployment backed by a MySQL DB instance 48 | 49 | diego-release should allow the BBS to be BOSH-configured to use a MySQL database outside the Diego deployment as its store, with etcd still the default. This option should be configurable in the spiff-based manifest generation through the property-overrides stub. 50 | 51 | Provide instructions and an auto-generated manifest for deploying a singleton MySQL instance to BOSH-Lite and configuring Diego to use it as its backend. 52 | 53 | Acceptance: I can follow instructions to deploy CF+Diego to BOSH-Lite using a MySQL DB and then run CATs and vizzini against it successfully. 54 | 55 | L: bbs:relational 56 | 57 | --- 58 | 59 | As a Diego developer, I expect Diego CI to run CATs and vizzini against a 'catsup' AWS environment with the BBS backed by an RDS MySQL instance 60 | 61 | Create a new 'catsup' CF+Diego deployment on AWS analogous to ketchup, with a MySQL RDS instance as its backing store. Set up a pipeline parallel to ketchup that runs CATs and vizzini against catsup, both on changes to diego-release or cf-release and periodically. 62 | 63 | Acceptance: CATs and vizzini against catsup are green. 64 | 65 | L: bbs:relational 66 | 67 | --- 68 | 69 | [RELEASE] Diego with relational BBS functional but experimental 70 | 71 | L: bbs:relational 72 | 73 | --- 74 | 75 | As a Diego developer, I expect to run Diego BBS benchmarks against an AWS environment with the BBS backed by an RDS MySQL instance 76 | Create a new CF+Diego deployment on AWS analogous to the 'diego-3' environment and set up the BBS benchmarks to run against it, both on changes to diego-release at the appropriate level of stability (release-candidate?) and periodically. Configure it to emit metrics to Datadog and to store the test-run artifacts in a separate S3 bucket. 77 | 78 | Aim to support 200K instances across 1000 reps initially. 79 | 80 | Acceptance: BBS benchmarks pass consistently at the appropriate scale, and results are recorded in S3. 81 | 82 | L: bbs:relational 83 | 84 | --- 85 | 86 | As a Diego developer, I expect 'catsup' and the relational-benchmarks deployment to be backed by HA MySQL deployments 87 | 88 | Deploy the CF MySQL release in a suitably HA configuration to catsup and the new relational benchmarks environment and configure the Diego deployment to use that as its backing store instead of RDS. 
Avoid usage of anything specific to AWS, such as ELBs, to make the MySQL deployment HA. 89 | 90 | Acceptance: Catsup and the relational benchmarks environment function without RDS instances provisioned. 91 | 92 | L: bbs:relational 93 | 94 | --- 95 | 96 | BBS communicates to the relational store over SSL 97 | 98 | The BBS can connect to its relational store using SSL. Any additional relevant information should be configurable via BOSH properties. Do we need to be able to supply a separate CA for use with that connection, or should we rely on the globally trusted or BOSH-deployed trusted certificates? 99 | 100 | Acceptance: catsup and the relational benchmarks connect to their relational stores over SSL. 101 | 102 | L: bbs:relational, security 103 | 104 | --- 105 | 106 | [RELEASE] Diego with relational BBS supports 200K instances 107 | 108 | L: bbs:relational, perf, perf:breadth 109 | 110 | --- 111 | 112 | Placeholder: remaining relational-BBS topics (scale, Postgres, migration from etcd, manifest generation) 113 | 114 | - BBS can use either MySQL or Postgres as a BBS store (BBS units run against all 3 backends) 115 | - Diego manifest generation supports relational stores (BOSH-Lite: CF's postgres; catsup: HA MySQL) 116 | - Diego migrates existing BBS data from etcd to its relational store 117 | - Diego CI runs another DUSTs suite in etcd-to-relational migration mode 118 | - [RELEASE] Diego with relational BBS fully supported 119 | - Diego CI runs a DUSTs suite with only MySQL 120 | - ketchup migrates to relational BBS store, catsup obsolete 121 | 122 | 123 | L: bbs:relational, placeholder 124 | -------------------------------------------------------------------------------- /proposals/relational-bbs-db-stories-2016-04-18.prolific.md: -------------------------------------------------------------------------------- 1 | As a Diego operator, I expect the BBS to migrate existing data from etcd to the relational store 2 | 3 | If the BBS is configured with connection information and credentials for both etcd and a relational store, it will migrate existing data in etcd to the relational store and then serve data only from the relational store. The migration mechanism should also ensure that if a BBS server becomes master that understands only how to serve data from etcd, it should realize that it is not current and relinquish the BBS lock. It should also ensure that this migration happens exactly once, so that subsequent BBS servers that start with the dual 4 | 5 | This likely needs to be done using the versioning system already implemented in the etcd store, so that BBSes all the way back to 0.1434.0 will function correctly. Consider the `CurrentVersion`/`TargetVersion` rules implemented in [the existing BBS migration mechanism](https://github.com/cloudfoundry/diego-dev-notes/blob/master/accepted_proposals/bbs-migrations.md#the-bbs-migration-mechanism). 6 | 7 | Acceptance: 8 | - I can configure a multi-instance BBS Diego first to populate etcd with data, then redeploy it with a relational configuration and observe that data is migrated from etcd to the relational store correctly (preserved, migration happens only once). 9 | - I can arrange the deployment with BBS both in etcd and in etcd+relational configurations and observe that once the data is migrated to the relational store, the etcd-only BBSes do not retain the lock during BBS elections. 
10 | - I can cause the etcd-to-relational migration to fail on one BBS node, and observe that it does not prevent etcd-only BBSes from taking over and serving requests or other etcd-and-relational BBSes from performing the migration later. 11 | - If relational credentials are supplied to the BBS but a BBS node cannot connect on start-up, the deploy should fail on that node. 12 | 13 | 14 | L: bbs:relational 15 | 16 | --- 17 | 18 | As a Diego operator, I should be able to generate a Diego manifest with connection info for a relational store 19 | 20 | An operator should be able to supply connection information to connect to the relational store they have provisioned externally. This information should come in as parameters in a stub (either in property-overrides or a new optional stub argument). 21 | 22 | Acceptance: 23 | - I can follow documentation in diego-release to place the relational connection information in the appropriate stub as an input to manifest-generation, and then use that to deploy Diego against MySQL on BOSH-Lite. 24 | 25 | L: bbs:relational 26 | 27 | --- 28 | 29 | Explore Diego system behavior with several CF MySQL nodes to discover how deadlock errors and rollbacks affect Diego correctness 30 | 31 | We'd like to know how Diego deals with consistency errors if it's communicating with several different CF MySQL nodes simultaneously. Core Services has said we'll get deadlocks, but they will roll back and manifest as a failure client-side. Investigate how this affects the correctness of the Diego system, and generate stories with recommendations of how to mitigate the problems in the BBS, if possible. If this is impossibly bad, we may need to change the switchboard proxy to do leader election via, say, consul locks. 32 | 33 | Also determine if we can configure the mysql client to accept several IPs, or if it requires only a single hostname. If the latter, we should explore colocating the consul agent on the proxy nodes to register them via Consul DNS. 34 | 35 | Timebox to 2 days initially. 36 | 37 | L: bbs:relational, charter 38 | 39 | --- 40 | 41 | As a Diego operator, I should be able to follow documentation to deploy an single-node standalone CF-MySQL cluster on my infrastructure 42 | 43 | The diego-release documentation should explain how to produce a deployment manifest for a single-node standalone CF-MySQL cluster. Ideally, we can link to documentation in the cf-mysql release itself with a high-level explanation of the required deployment parameters (instance counts, job types). If the cf-mysql-release makes this real difficult, talk to Core Services about getting it changed in their release. (Spoiler: Luan says they need this help for a minimal+standalone deployment option.) 44 | 45 | Acceptance: 46 | - I can follow documentation to provision a single-node standalone CF-MySQL cluster on AWS and then successfully deploy Diego to use that as its datastore via the manifest-generation scripts. 47 | 48 | L: bbs:relational 49 | 50 | --- 51 | 52 | As a Diego operator, I should be able to follow documentation to provision an RDS MySQL instance to support my AWS Diego deployment 53 | 54 | The diego-release AWS documentation should explain how to provision an RDS MySQL instance manually in the VPC created for the BOSH+CF+Diego stack, and then how to provide the connection information to the Diego manifest-generation scripts. This configuration should be optional for now. 
55 | 56 | Let's also make this manual for now, and consider how to provision this with CloudFormation/BOOSH/bbl later. 57 | 58 | Acceptance: 59 | - I can follow the AWS-specific documentation in diego-release to provision this RDS instance and connect my AWS Diego deployment to it. 60 | 61 | L: bbs:relational 62 | 63 | --- 64 | 65 | As a Diego operator, I should be able to follow documentation to deploy an HA standalone CF-MySQL cluster on my infrastructure 66 | 67 | The diego-release documentation should explain how to produce a deployment manifest for an HA standalone CF-MySQL cluster. Ideally, we can link to documentation in the cf-mysql release itself with a high-level explanation of the required deployment parameters (instance counts, job types). 68 | 69 | BLOCKED on outcome of the charter #117854575 to explore Diego behavior when interacting with multiple masters. 70 | 71 | Acceptance: 72 | - I can follow documentation to provision an HA standalone CF-MySQL cluster on AWS and then successfully deploy Diego to use that as its datastore via the manifest-generation scripts. 73 | 74 | L: bbs:relational 75 | 76 | --- 77 | 78 | As a Diego developer, CI should be configured to run the DUSTs to upgrade the BBS from Diego 0.1434.0 targeting etcd to latest targeting MySQL 79 | 80 | This DUSTs run should not yet block delivery of diego-release. 81 | 82 | L: bbs:relational 83 | 84 | --- 85 | 86 | As a Diego operator, I should be able to generate a Diego manifest that uses only a relational store 87 | 88 | As a Diego operator who either has finished migrating an existing etcd-backed deployment to MySQL or is starting a new deployment with MySQL only, I should be able to produce a Diego deployment manifest that does not require the etcd-release and does not set etcd deployment properties. The manifest-generation script should also require minimal relational configuration to be provided. 89 | 90 | Acceptance: 91 | - I can follow documentation to generate a deployment manifest intended to support only a relational store, and then deploy Diego successfully without an etcd release present on the BOSH director. The manifest should also not contain any extraneous etcd-specific parameters, and should not configure the BBS to communicate with etcd at all. 92 | 93 | L: bbs:relational 94 | 95 | --- 96 | 97 | Placeholder: remaining relational-BBS topics (pipeline work, scale validation, validation of robustness to failure) 98 | 99 | - Validate Diego system correctness in face of MySQL deployment failures 100 | - warp-drive, etcd-to-MySQL DUST failures block delivery 101 | - basic end-to-end perf validation? 102 | - make sure BBS can run SQL migrations (need migration to drive this for sure, could use the timeout-units one) 103 | - [RELEASE] Diego with relational BBS fully supported 104 | - Diego CI runs a DUSTs suite with only MySQL 105 | - ketchup migrates to relational BBS store, catsup obsolete 106 | 107 | 108 | L: bbs:relational, placeholder 109 | -------------------------------------------------------------------------------- /proposals/relational-bbs-db-stories-2016-05-10.prolific.md: -------------------------------------------------------------------------------- 1 | As a Diego operator, I should be able to use Postgres as a backing relational store for the Diego BBS 2 | 3 | We have received feedback that some operators would prefer to use Postgres as their backing relational store for the BBS API, as they have operational familiarity with that database instead of MySQL. 
CC and UAA also provide top-level support for both MySQL and postgres, so supporting both for Diego's BBS database is reasonable. 4 | 5 | ### Acceptance 6 | 7 | - I can follow documentation to install and configure postgres on my local workstation and then run BBS test suites against it. 8 | - I can observe from test runs in CI that appropriate test coverage is run against Postgres. 9 | 10 | L: bbs:relational 11 | 12 | --- 13 | 14 | As a Diego PM, I expect that failures in the super-laser tests and etcd-to-MySQL DUSTs run should block delivery of diego-release 15 | 16 | In order to guarantee support for Diego when backed by a relational store, we should arrange for test runs in CI against a relational store to gate delivery of diego-release. 17 | 18 | BLOCKED on #117843841: running an etcd-to-MySQL DUSTs run. 19 | 20 | L: bbs:relational, blocked 21 | 22 | --- 23 | 24 | [RELEASE] Diego with relational BBS fully supported 25 | 26 | L: bbs:relational 27 | 28 | --- 29 | 30 | Update performance-testing protocol for 250K-instance, 1000+-cell end-to-end performance test 31 | 32 | Update the [performance-testing protocol in the diego-dev-notes](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/measuring_performance.md) for the 250K-instance experiment we intend to run. Address the following points: 33 | 34 | - Distribution of and resource limits for app instances: at only 1000 cells, container density will be 250 per cell, which is right at the default limit for garden-linux (and garden-runc?) 35 | - Prescribed IOPS configuration for cell disk at that scale 36 | - Consistency of app behavior (we seem to have turned off substantial app logging in later experiments?) 37 | - Collection of logs and metrics from the environment: for example, we learned that Datadog is not ideal for later analysis of metrics 38 | 39 | Let's have a separate story to add extra experiments that we think would be valuable to run. 40 | 41 | 42 | ### Acceptance 43 | 44 | The performance testing protocol has explicit configuration parameters for the 250K-instance test and is up-to-date and consistent in its Diego terminology. It also should account for any deficiencies we observed in data processing capabilities after the 10K-instance test run. 45 | 46 | L: bbs:relational, perf, perf:breadth 47 | 48 | --- 49 | 50 | As a Diego operator, I expect to be able to upgrade my MySQL-backed deployment from the earliest supported version to the latest without downtime 51 | 52 | We can ensure this by configuring CI to run the DUSTs from a MySQL-backed deployment on that earliest supported version to a MySQL-backed deployment on the latest candidate version. 53 | 54 | BLOCKED on declaring support for a relational store in a future Diego final release. 55 | 56 | L: bbs:relational, blocked, pipeline 57 | 58 | --- 59 | 60 | [CHORE] Migrate ketchup to a relational store and destroy the super-laser environment 61 | 62 | BLOCKED on declaring support for a relational store in a future Diego final release. 
63 | 64 | L: bbs:relational, blocked, pipeline 65 | 66 | --- 67 | 68 | Placeholder: remaining relational-BBS topics (migration validation, especially on failure) 69 | 70 | ### Pre-support 71 | 72 | - charter: investigate migration failure scenarios 73 | - make sure BBS can run SQL migrations (need migration to drive this for sure, could use the timeout-units one) 74 | 75 | 76 | L: bbs:relational, placeholder 77 | -------------------------------------------------------------------------------- /proposals/release-versioning-testing-stories.prolific.md: -------------------------------------------------------------------------------- 1 | As a Diego developer, I can deploy CF v220 + Diego 0.1434.0 + GL 0.307.0 ('V0') as a specific collection of coordinated piecewise deployments 2 | 3 | This CF + Diego deployment should consist of 5 separate BOSH deployments: 4 | 5 | - CF 6 | - `D1`: 1 database VM 7 | - `D2`: 1 brain, 1 cc-bridge, 1 access, 1 route-emitter 8 | - `D3`: 1 cell 9 | - `D4`: 1 cell 10 | 11 | Acceptance: 12 | 13 | - I can run a script to generate these manifests that I can easily configure to produce BOSH-Lite-specific manifests. 14 | - After uploading the designated releases, I can use the manifests to deploy CF + Diego at these versions to a BOSH-Lite, even if other releases exist on its BOSH director. 15 | - I can validate that the deployment is correct by running DATs against it successfully. 16 | 17 | 18 | L: pipeline:upgrade-stable 19 | 20 | --- 21 | 22 | As a Diego developer, I can upgrade my piecewise V0 deployment to a given CF/Diego/Garden/ETCD version combination later than V0 (V-prime) 23 | 24 | For a given collection of CF and Diego versions, I can use scripts to generate a set of manifests designed to upgrade the piecewise V0 deployment to the new CF/Diego versions. 25 | 26 | 27 | Acceptance: 28 | 29 | - After checking out cf-release and diego-release at later versions, generating releases, and uploading them, I can generate these CF and Diego manifests and use them to upgrade my piecewise V0 deployment to V-prime. 30 | 31 | 32 | L: pipeline:upgrade-stable 33 | 34 | --- 35 | 36 | As a Diego developer, I have an 'upgrade-from-stable' test suite to create a piecewise V0 deployment against a BOSH director 37 | 38 | This test suite should take the piecewise V0 deployment and deploy CF and Diego to the target director. We can assume that the correct V0 releases are already uploaded to the director. 39 | 40 | Acceptance: 41 | 42 | - After uploading the V0 releases to my BOSH-Lite, I can run the test suite with the appropriate configuration to deploy the piecewise V0 deployment. 43 | - The suite should fail if the deployments already exist or if the BOSH commands themselves fail. 44 | 45 | 46 | L: pipeline:upgrade-stable 47 | 48 | --- 49 | 50 | As a Diego developer, I expect that the 'upgrade-from-stable' suite upgrades my V0 deployment to V-prime 51 | 52 | The suite should assume that the V-prime releases are also already uploaded to the target BOSH director. The upgrade should proceed in the following order: 53 | 54 | - Upgrade `D1` (databases) 55 | - Upgrade `D3` (cell bank 1) 56 | - Upgrade `D2` (brain and pals) 57 | - Upgrade CF 58 | - Upgrade `D4` (cell bank 2) 59 | 60 | Acceptance: 61 | 62 | - I can run this test suite against a BOSH-Lite after uploading the required V0 and V-prime releases and verify that it deploys a piecewise V0 deployment and then upgrades it to V-prime. 
63 | 64 | 65 | L: pipeline:upgrade-stable 66 | 67 | --- 68 | 69 | As a Diego developer, I expect that the 'upgrade-from-stable' suite runs smoke tests against my piecewise deployment at each step between upgrades from V0 to V-prime 70 | 71 | We also need to change the deployment strategy to stop and start groups of cells intentionally to ensure that the apps in the smoke tests are running on cells of the intended version. 72 | 73 | Deployment changes: 74 | 75 | - After upgrading `D3`, stop `D4`. 76 | - After upgrading `D2`, start `D4` and stop `D3`. 77 | - Before upgrading `D4`, start `D3`. 78 | 79 | Smoke test to run after each set of deployment changes: 80 | - Push 2 instances of new test app (e.g., Dora) with SSH enabled 81 | - Verify all instances run, are routable, have visible logs, and are accessible over SSH (run some trivial command) 82 | - Destroy the instances and their routes 83 | 84 | Acceptance: I can still run the test suite against a local BOSH-Lite and verify from the suite output and inspecting the system during the run that the smoke tests are running. 85 | 86 | 87 | L: pipeline:upgrade-stable 88 | 89 | --- 90 | 91 | As a Diego developer, I expect that the 'upgrade-from-stable' suite pushes an app to my piecewise deployment after V0 is deployed and checks its scalability at each step between upgrades from V0 to V-prime 92 | 93 | After the suite deploys V0, it pushes one instance of a test app (e.g., Dora) and verifies it is running and routable. Call it A0. 94 | 95 | After each step in the deployment upgrade, the suite scales the app up to 2 instances, verifies they are both running and routable, and then scales the app back down to 1 instance and verifies the second instance is gone. 96 | 97 | Acceptance: I can still run the test suite against a local BOSH-Lite and verify from the suite output that the A0 app is created and exercised throughout the test run. 98 | 99 | 100 | L: pipeline:upgrade-stable 101 | 102 | --- 103 | 104 | As a Diego developer, I expect the 'upgrade-from-stable' suite to check the initial V0 app for continual routability during Diego deployment upgrades (but not CF) 105 | 106 | During all of the BOSH operations on the Diego deployment, the test suite should continually verify that the A0 app is routable. The BOSH operations from previous stories are already structured so that the instance is always able to evacuate to an active cell. 107 | 108 | This routing verification should not be done during the CF deployment, as parts of the routing tier (HAproxy and/or the gorouter) may not be available then. 109 | 110 | Acceptance: I can still run the test suite against a local BOSH-Lite and verify from the test output (and by tailing logs on the A0 app) that these routing tests are performed. 111 | 112 | 113 | L: pipeline:upgrade-stable 114 | 115 | --- 116 | 117 | As a Diego developer, I expect to run the 'upgrade-from-stable' suite in CI against a new BOSH-Lite instance provisioned on AWS 118 | 119 | In this case, V-prime is the collection of releases in the candidate build in the pipeline. 120 | 121 | A job in the Diego CI should provision a new BOSH-Lite instance on AWS, upload the V0 and V-prime releases to it, then generate the V0 and V-prime piecewise manifests and run the 'upgrade-from-stable' test suite. Once it is done, we should destroy the BOSH-Lite instance. 122 | 123 | We should run this job as part of the main Diego pipeline, in parallel with CATs and DATs after the ketchup deployment and smoke tests succeed. 
If it fails, we fail to promote the diego-release candidate through the pipeline. 124 | 125 | 126 | L: pipeline:upgrade-stable 127 | 128 | -------------------------------------------------------------------------------- /proposals/release-versioning-testing-v2.md: -------------------------------------------------------------------------------- 1 | # Upgrade Testing V2 2 | 3 | ## Updates from [the previous version](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/release-versioning-testing.md) 4 | 5 | We revisit release version testing and the current upgrade testing strategy (the DUSTs) with the following changes and considerations in mind: 6 | 7 | - The recommended way to deploy Cloud Foundry is now [cf-deployment](https://github.com/cloudfoundry/cf-deployment) instead of the manifest-generation systems in [cf-release](https://github.com/cloudfoundry/cf-release) and [diego-release](https://github.com/cloudfoundry/diego-release), which will be removed soon. 8 | - As a result, the instance groups are implicitly spread across AZs, instead of explicitly grouped by AZ, although some instance groups may still end up being parallelized. 9 | - Breaking changes to other systems in a CF deployment have caused the DUSTs to fail much more often than an actual breaking change between the APIs or components under the control of the Diego team. 10 | - Reliance on BOSH makes the initial deploy and subsequent updates slow. An inigo-style test suite written in Ginkgo could allow the Diego team to iterate faster and to run a more focused subset of the test suite. 11 | 12 | ## Aspects remaining from [the previous version](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/release-versioning-testing.md) 13 | 14 | - We perceive there to be value in maintaining deployments at different heterogeneous combinations of versions and running an appropriate set of tests to validate functionality, instead of coincidentally running tests while proceeding through an upgrade of a single BOSH deployment. 15 | - Operational guarantees have not changed regarding compatibility of versions of diego-release. Refer to [this section](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/release-versioning-testing.md#operational-guarantees) in the original version of the document. 16 | - We still do not require testing every possible upgrade path (i.e. every `(V0, V1)` pair). See [this section](https://github.com/cloudfoundry/diego-dev-notes/blob/master/proposals/release-versioning-testing.md#selection-of-versions-for-testing) for the rationale. 17 | 18 | # Test configurations 19 | 20 | The new DUSTs suite separates two concerns, explained in the following sections: app routability during upgrades and smoke tests. 21 | 22 | We select two main configurations of importance, based on the initial major versions of the release and major architectural changes: 23 | 24 | - Diego starting at v1.0.0, configured to use Consul and global route-emitters, 25 | - Diego starting at v1.25.2, configured to use Locket with local route-emitters. 26 | 27 | ## App Routability during Upgrades 28 | 29 | As in the current [DUSTs](https://github.com/cloudfoundry/diego-upgrade-stability-tests), there will always be an app pushed right after the initial configuration (C0) is deployed and constantly checked for routability. 30 | 31 | ### BBS running with Consul (from v1.0.0 to develop) 32 | 33 | - Initial Diego version is v1.0.0. 34 | - Infrastructure components: MySQL database, NATS, Consul, GoRouter.
35 | - Route-emitter is configured in global mode, with Consul lock. 36 | - Each Cell includes the Diego cell rep and garden-runc. 37 | 38 | | Configuration | BBS | BBS Client | Auctioneer | Cell 0 | Cell 1 | RouteEmitter | Notes | 39 | |---------------|-----|------------|------------|--------|---------|--------------|-------------------------------------| 40 | | C0 | v0 | v0 | v0 | v0 | v0 | v0 | Initial configuration | 41 | | C1 | v1 | v0 | v0 | v0 | v0 | v0 | Simulates upgrading diego-api | 42 | | C2 | v1 | v1 | v0 | v0 | v0 | v0 | Simulates API upgrading | 43 | | C3 | v1 | v1 | v1 | v0 | v0 | v0 | Simulates scheduler upgrading | 44 | | C4' | v1 | v1 | v1 | v1 | v0 | v0 | Simulates Cell evacuation/upgrading | 45 | | C4 | v1 | v1 | v1 | v1 | v1 | v0 | Simulates cell evacuation/upgrading | 46 | | C5 | v1 | v1 | v1 | v1 | v1 | v1 | Simulates route-emitter upgrade | 47 | 48 | ### BBS running with Locket (from v1.25.2 to develop) 49 | 50 | - Initial Diego version is v1.25.2. 51 | - Infrastructure components: MySQL database, NATS, GoRouter. 52 | - Route-emitter is configured in local mode. 53 | - Each Cell includes the Diego cell rep, garden-runc, grootfs, and local route-emitters. 54 | 55 | | Configuration | Locket | BBS | BBS Client (Vizzini) | Auctioneer | Cell 0 | Cell 1 | Notes | 56 | |---------------|--------|-----|----------------------|------------|--------|--------|-------------------------------------| 57 | | C0 | v0 | v0 | v0 | v0 | v0 | v0 | Initial configuration | 58 | | C1 | v1 | v0 | v0 | v0 | v0 | v0 | Simulates upgrading diego-api | 59 | | C2 | v0 | v1 | v0 | v0 | v0 | v0 | | 60 | | C3 | v1 | v1 | v1 | v0 | v0 | v0 | Simulates API upgrading | 61 | | C4 | v1 | v1 | v1 | v1 | v0 | v0 | Simulates scheduler upgrading | 62 | | C5' | v1 | v1 | v1 | v1 | v1 | v0 | Simulates cell evacuation/upgrading | 63 | | C5 | v1 | v1 | v1 | v1 | v1 | v1 | Simulates cell evacuation/upgrading | 64 | 65 | ## Smoke Tests 66 | 67 | Run the vizzini test suite at the appropriate version to verify core functionality of the Diego BBS API. This test could be structured as its own isolated context, in which it starts the appropriate versions of the components and runs the vizzini tests, instead of coupling it to the sequential upgrade of the components. 68 | 69 | ### BBS running with Consul (from v1.0.0 to develop) 70 | 71 | - Initial Diego version is v1.0.0. 72 | - Infrastructure components: MySQL database, NATS, Consul, GoRouter. 73 | - Route-emitter is configured in global mode, with Consul lock. 74 | - Each Cell includes the Diego cell rep and garden-runc. 75 | 76 | | Configuration | BBS | BBS Client | SSH proxy + Auctioneer | Cell | RouteEmitter | Notes | 77 | |---------------|-----|------------|------------------------|------|--------------|---------------------------------| 78 | | C0 | v0 | v0 | v0 | v0 | v0 | Initial configuration | 79 | | C1 | v1 | v0 | v0 | v0 | v0 | Simulates upgrading diego-api | 80 | | C2 | v1 | v1 | v0 | v0 | v0 | Simulates API upgrading | 81 | | C3 | v1 | v1 | v1 | v0 | v0 | Simulates scheduler upgrading | 82 | | C4 | v1 | v1 | v1 | v1 | v0 | Simulates cell upgrade | 83 | | C5 | v1 | v1 | v1 | v1 | v1 | Simulates route-emitter upgrade | 84 | 85 | ### BBS running with Locket (from v1.25.2 to develop) 86 | 87 | - Initial Diego version is v1.25.2. 88 | - Infrastructure components: MySQL database, NATS. 89 | - Route-emitter is configured in local mode. 90 | - Each Cell includes the Diego cell rep, garden-runc, grootfs, and local route-emitters. 
91 | 92 | | Configuration | Locket | BBS | BBS Client (Vizzini) | SSH proxy + Auctioneer | Cell | Notes | 93 | |---------------|--------|-----|----------------------|------------------------|------|-------------------------------| 94 | | C0 | v0 | v0 | v0 | v0 | v0 | Initial configuration | 95 | | C1 | v1 | v0 | v0 | v0 | v0 | Simulates upgrading diego-api | 96 | | C2 | v0 | v1 | v0 | v0 | v0 | | 97 | | C3 | v1 | v1 | v1 | v0 | v0 | Simulates API upgrading | 98 | | C4 | v1 | v1 | v1 | v1 | v0 | Simulates scheduler upgrading | 99 | | C5 | v1 | v1 | v1 | v1 | v1 | Simulates cell upgrading | 100 | 101 | ## Notes 102 | 103 | - The above configurations have the added benefit of testing the contract between rep and BBS regarding cell presences (namely, that the newer BBS can still parse old rep cell presences in both Consul or Locket). 104 | - Deploy only one instance for all instance groups except the Cell to ensure routability to the test app at all times. 105 | - Cells are to be updated gracefuly (that is, each cell evacuates before it stops) to ensure routability to the test app at all times. 106 | - Configurations follow the order of instance group updates as of [this version of cf-deployment](https://github.com/cloudfoundry/cf-deployment/commit/9be2644da8de08540891e24856bbdb88f9a83f67). 107 | - We should test both combinations of different versions of BBS and Locket since different versions can exist during a rolling update. 108 | - The SSH proxy and auctioneer do not communicate with each other and exist on the same VM, hence we update them at the same time. 109 | 110 | 111 | # Concerns 112 | 113 | - Routing: We think testing HTTP routing is sufficient, as that is the main routing tier of importance in CF at present. The Routing team is primarily responsible for ensuring interoperability of the clients and servers in the routing control plane. 114 | - Locket: We test with a separate configuration that includes Locket because of the importance of this architectural change. 115 | - Cell update order: The most common update order in the deployment is to update the Diego control plane first (BBS, auctioneer, locket) and then the cells, so we focus on those configurations. 116 | - BOSH upgradability: We do lose test coverage for ensuring that the components restart correctly, and we have encountered issues around pid-file management, but we could handle this with a different test suite. 117 | - Quota enforcement: We did regress on enforcing instance quotas with a breaking change to the cell rep API, but there were vizzini tests that would have caught that regression. 118 | - Diego client: Running the appropriate version of vizzini will provide sufficient coverage of the Diego BBS client functionality. 119 | -------------------------------------------------------------------------------- /proposals/release-versioning-testing.md: -------------------------------------------------------------------------------- 1 | # Release Versioning 2 | 3 | ## Operational guarantees 4 | 5 | We intend the BOSH release of Diego to have a simple policy regarding upgradability, communicated clearly through its version: operators will have a stable upgrade path through versions of Diego as long as they deploy every major version of the release. This stability includes keeping application instances running through upgrades and maintaining the integrity of data submitted through Diego's BBS API. 
6 | 7 | To signal to operators the ramifications of upgrading their deployments to new versions of the release, we will follow `MAJOR.MINOR.PATCH`-style [semantic versioning](http://semver.org) with the following conventions: 8 | 9 | - A **major version increment** may include the removal of deprecated APIs or functionality from a previous major release version. In accordance with our goal of allowing operators to upgrade seamlessly across major versions, we will typically remove functionality only after it is deprecated for a full major version. 10 | 11 | - A **minor version increment** will indicate the addition of new functionality to the release, typically in the form of new APIs or behavior. 12 | 13 | - A **patch version increment** will indicate no new addition of functionality, and the introduction only of bugfixes and refinements to behavior of existing APIs. 14 | 15 | We call a pair of versions `(V1, V2)` *upgradable* if `V1 <= V2` lexicographically and if `0 <= MAJOR(V2) - MAJOR(V1) <= 1`. An upgradable version pair is *strict* if `V1 < V2`. 16 | 17 | ## Timelines 18 | 19 | We have yet to set a precedent for publishing major versions of Diego, and to a large extent our cadence there will depend on the scope and future direction of the Diego project within the CF ecosystem. That said, we understand that there is a natural tension between our desire to keep the Diego codebase and features clean and coherent, and the needs of the CF ecosystem and community to have a sufficiently stable set of APIs on which to build our platform. To that end, we expect to release major versions of Diego no more frequently than every three months, and more realistically approximately every six months to a year. 20 | 21 | 22 | ## Testing strategies 23 | 24 | For a given upgradable version pair `(V1, V2)`, we intend to conduct a battery of tests to ensure that deployments with different combinations of components at `V1` and `V2` function correctly. These tests should include being able to run, scale up, and scale down existing apps created in the `V1` deployment, and to stage and run apps created in the mixed deployment. It will be most important to verify that `V1` cells are compatible with `V2` inputs and vice-versa, so we will test these operations against the following combinations of components: 25 | 26 | 27 | | Configuration | BBS | Cells | Brain/Bridge | CF | 28 | |----------------|------|-------|--------------|------| 29 | | `C0` (Initial) | `V1` | `V1` | `V1` | `V1` | 30 | | `C1` | `V2` | `V1` | `V1` | `V1` | 31 | | `C2` | `V2` | `V2` | `V1` | `V1` | 32 | | `C3` | `V2` | `V1` | `V2` | `V1` | 33 | | `C4` | `V2` | `V1` | `V2` | `V2` | 34 | | `C5` | `V2` | `V2` | `V2` | `V2` | 35 | 36 | 37 | Rather than attempting these operations against a deployment in flight during a monolithic upgrade of the Diego deployment, we can progress through these combinations intentionally and deterministically by arranging the Diego deployment as four separate, coordinated deployments: 38 | 39 | - `D1`: Database nodes 40 | - `D2`: Brain, CC-Bridge, Route-Emitter, and Access nodes 41 | - `D3`: Cell group 1 42 | - `D4`: Cell group 2 43 | 44 | The CF deployment will be separate, as it is today. 
The configurations above can then be achieved with the following sequence of BOSH operations: 45 | 46 | - `C1`: Deploy `V2` to `D1` 47 | - `C2`: Deploy `V2` to `D3`, stop `D4` 48 | - `C3`: Deploy `V2` to `D2`, start `D4`, stop `D3` 49 | - `C4`: Deploy `V2` to CF 50 | - `C5`: Start `D3`, deploy `V2` to `D4` 51 | 52 | Assuming that BOSH stopping a VM triggers draining, so that the Diego cells evacuate correctly, these operations should allow existing app instances to be routable continually. 53 | 54 | We detail the tests for each one of these configurations below: 55 | 56 | ### `C0`: System at `V1` 57 | 58 | With the system in its initial state, we seed it with an instance of Dora, Grace, or another test application asset and verify it is routable. For the sake of illustration below, assume that the app is Dora. 59 | 60 | 61 | ### `C1`: BBS at `V2`, Cells and Brain/Bridge at `V1` 62 | 63 | During the upgrade of the BBS to `V2`, we verify that the Dora instance is always routable. 64 | 65 | After upgrading the BBS to `V2`, we verify that the system functions correctly: 66 | 67 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 68 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 69 | 70 | 71 | ### `C2`: BBS and Cells at `V2`, Brain/Bridge/CF at `V1` 72 | 73 | During the upgrade of the `D3` cells to `V2` and the shutdown of the `D4` cells, we verify that the Dora instance is always routable. 74 | 75 | With all the active Cells at `V2`, we verify that the system functions correctly: 76 | 77 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 78 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 79 | 80 | 81 | ### `C3`: BBS and Brain/Bridge at `V2`, CF and Cells at `V1` 82 | 83 | During the upgrade of the Brain and CC-Bridge to `V2`, we verify that the Dora instance is always routable. 84 | 85 | During the re-start of the `D4` cells and the stop of the `D3` cells, the Dora instance should be routable, as it will evacuate from the `D3` cells to the `D4` cells. 86 | 87 | With CF and all the active Cells at `V1` and the rest of the system at `V2`, we verify that the system functions correctly: 88 | 89 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 90 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 91 | 92 | > If we introduce changes in the RunInfo that cannot be represented appropriately in the `V1` BBS API, do we expect that `V1` cells will be able to run that work? If not, is there any way we can ensure that older, incompatible cells won't be assigned work they can't do? 93 | 94 | 95 | ### `C4`: BBS and Brain/Bridge/CF at `V2`, Cells at `V1` 96 | 97 | During the upgrade of the CF deployment to `V2`, the gorouter and HAproxy will likely be offline for some amount of time, so the Dora instance is not expected to be routable externally then. 98 | 99 | With all the active Cells at `V1` and the rest of the system at `V2`, we verify that the system functions correctly: 100 | 101 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 102 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 
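As a concrete illustration of the "always routable" checks called for in configurations `C1` through `C4` above, the continual routability probe against the seeded test app could be as simple as the following sketch. This is a hedged example in Go, not part of the proposal itself; the app URL, the one-second polling interval, and the ten-minute duration are placeholder assumptions.

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

// pollRoutability issues an HTTP GET against the test app's route once per
// second and reports any interval in which the app stops answering 200 OK.
func pollRoutability(appURL string, stop <-chan struct{}) {
	client := &http.Client{Timeout: 5 * time.Second}
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case t := <-ticker.C:
			resp, err := client.Get(appURL)
			if err != nil {
				fmt.Printf("%s: route check failed: %v\n", t.Format(time.RFC3339), err)
				continue
			}
			resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				fmt.Printf("%s: unexpected status %d\n", t.Format(time.RFC3339), resp.StatusCode)
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	// "dora.bosh-lite.com" is a placeholder route for the seeded Dora instance.
	go pollRoutability("http://dora.bosh-lite.com", stop)
	time.Sleep(10 * time.Minute) // cover the BOSH operations for one configuration step
	close(stop)
}
```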
103 | 104 | 105 | ### `C5`: BBS, Cells, and Brain/Bridge/CF all at `V2` 106 | 107 | With the entire system at `V2`, we verify that the system functions correctly: 108 | 109 | - The existing Dora scales up to 2 instances that are all routable, then back down to 1. 110 | - Stage and run a new instance of Dora, verify that it is routable, scale it up and down, then delete it. 111 | 112 | 113 | ## Minimal Deployment on BOSH-Lite 114 | 115 | In order to run this test suite effectively in CI, we propose that the target environment be an instance of BOSH-Lite running on AWS, which will be easy to provision on demand and fast to deploy `V1` and then to migrate to `V2`. The deployments above can then consist of the following groups of VMs: 116 | 117 | - `D1`: 3-node database cluster (to ensure minimal downtime during the update to `V2`) 118 | - `D2`: 1 node each of the Brain, CC-Bridge, Access, and Route-Emitter 119 | - `D3`: 1 Cell 120 | - `D4`: 1 Cell 121 | - CF: Minimal CF, with no need for the DEAs or HM9000 122 | 123 | We may even be able to capture an image of the BOSH-Lite instance after deploying `V1` and pushing the initial set of apps, and begin the test suite by restarting a copy of that instance and verifying that the system is still functional before proceeding (although with BOSH-Lite this may require that the instance always have the same IP). 124 | 125 | 126 | ## Selection of Versions for Testing 127 | 128 | While we would ideally conduct these tests for every strict upgradable version pair `(V1, V2)`, as we generate new minor and patch versions of the release this will be prohibitively expensive to implement, and would likely have strongly diminishing returns. Instead, we intend to test upgrades where `V1` is the earliest supported version within its major version, and `V2` is the HEAD of the develop branch (after passing smoke tests or perhaps CATs/DATs on ketchup). In the near-term, these `V1` versions will be: 129 | 130 | - Version 0: earliest supported stable Diego (0.1434.0) 131 | - Version 1: 1.0.0 132 | 133 | When we release Version 2 of Diego, we will cease testing upgrades from 0.1434.0, but continue to test upgrades from 1.0.0 to the latest develop branch. 134 | 135 | We also recognize that the addition of significant new functionality in a minor version may also warrant its promotion to a `V1` against which we test upgrades as part of CI. 136 | -------------------------------------------------------------------------------- /proposals/rolling-out-diego.md: -------------------------------------------------------------------------------- 1 | # Rolling out Diego 2 | 3 | The vision here is simple. We roll out Diego in a given environment incrementally. We'll have a couple of versions where Diego is deployed *alongside* the DEAs. Operators will be told to encourage their developers to try the new backend and send us feedback, etc. Operators will be able to choose whether opting applications into Diego is under *their* control or under their user's control. 4 | 5 | After some period of having both backends deployed, we deploy CF so that Diego is the only show in town. This deployment could be accompanied by a migration that forces all applications onto Diego, though ideally (to avoid downtime) most applications will be running on Diego by then. 6 | 7 | ## How do applications opt into Diego? 8 | 9 | Our use of environment variables is a source of complexity and is difficult to audit. 10 | 11 | We propose introducing a new application column to the DB called, simply, `diego`. 
This would be a boolean settable via the CC API (though see below). The boolean would be around for the transitional CF releases and will then be removed once Diego is the only backend. 12 | 13 | Since Diego can run DEA-staged droplets, modifying the `diego` boolean can automatically move an application from the DEA stack to the Diego stack. I would propose that modifying the `diego` boolean always sends an event to the chosen backend to run the application and relies on eventual consistency to wind down the applications on the former backend. To be clear: 14 | 15 | `diego => true`: emits a start message to Diego and allows the DEAs to clean up naturally via HM9000. 16 | `diego => false`: emits a start message to the DEAs and allows Diego to clean up naturally via NSYNC. 17 | 18 | This minimizes downtime and complexity. 19 | 20 | We also propose removing the configuration flags from CC. Diego support is turned on in CC with no way to turn it off. 21 | 22 | ### Who sets the `diego` boolean? 23 | 24 | Some operators may want to give their users control over this. Others may want to do it themselves. Some may want to have a graduated phase-in plan where they give developers control to begin with but then grab control themselves later on. 25 | 26 | To accomplish this, I propose we add a bosh-configurable option called `users_can_select_backend`. If `true` then all users can modify the `diego` boolean. If `false` then *only* operators can modify the `diego` boolean. Note that flip-flopping this option does not modify where the applications are running -- only *who* can modify this state. 27 | 28 | ### How do operators audit the `diego` boolean? 29 | 30 | Operators/admins will use the CC API (we may need to add to it?) to list all applications with `diego=true` or `diego=false`. 31 | 32 | OSS tooling can be built on top of this to slowly bleed load from the DEAs onto Diego. For example, public clouds might use this tooling to move applications from the DEAs to Diego in a controlled manner. We could investigate downtime-less approaches to doing this. Such a tool would not be necessary for the initial deployment of Diego alongside the DEAs. 33 | 34 | ### How do we get visibility into the chosen backend? 35 | 36 | The CLI could be modified such that `cf apps` and `cf app` list the selected backend. Not sure that this is MVP, however, as it would be transitory (when Diego is the only game in town we'd rip this feature out). 37 | -------------------------------------------------------------------------------- /proposals/routing.md: -------------------------------------------------------------------------------- 1 | # Improved Routing API Proposal 2 | 3 | Diego's story around routing, while sufficient for the CF use-case, has much room for improvement. 4 | 5 | To set the stage, let's clarify some roles and responsibilities. 6 | 7 | ### Roles and Responsibilities 8 | 9 | #### Diego 10 | 11 | In the context of routing, Diego (i.e. Cell+BBS+Brain) is solely responsible for: 12 | 13 | 1. Opening ports on containers. 14 | 2. Making routing-related information available to consumers. 15 | 16 | Here "routing information" refers specifically to: 17 | 18 | - The `DesiredLRP` and its routing-related fields: 19 | + `Routes`: currently, an array of strings 20 | + `Ports`: the set of container-side ports to open up on the container. 21 | - The `ActualLRP` and its routing-related fields: 22 | + `Address`: the IP address of the host machine running the container.
23 | + `Ports`: an array of port mappings of the form `{ "container_port": XXX, "host_port": YYY }`. There will an entry in the `ActualLRP` `Ports` array for every corresponding entry in the `DesiredLRP` `Ports` array. To connect to a particular container-side port, you must first lookup the corresponding host-side port and then construct an address of the form `actualLRP.Address:actualLRP.Ports[i].HostPort` 24 | 25 | Consumers of Diego are free to specify arbitrary `Ports` on the `DesiredLRP`. Diego will dutifully open up the corresponding ports on the container and populate the `ActualLRP` `Ports` array appropriately. Diego does not allow modifying the `DesiredLRP`'s `Ports` array after-the-fact. 26 | 27 | Consumers of Diego are *also* free to specify an arbitrary array of `Routes` strings on the `DesiredLRP`. Diego doesn't actually care about these at all and does nothing with them. It simply holds onto the array on the consumer's behalf. Unlike `Ports`, the `Routes` array can be modified dynamically. 28 | 29 | It is up to Diego's consumer to fetch the set of `DesiredLRP`s and `ActualLRP`s and construct a routing table. This is done by: 30 | 31 | 1. Periodically fetching the entire set of `ActualLRP`s and `DesiredLRP`s via the Receptor API. 32 | 2. Attaching to an event stream emanating from the Receptor API. This event stream emits changes to `ActualLRP`s and `DesiredLRP`s soon after they occur. 33 | 34 | Both are necessary to ensure efficiency (real-time events) and robustness (periodic polling to catch for missed events). 35 | 36 | #### Router & Route-Emitter 37 | 38 | The Route-Emitter communicates with Diego via the Receptor API to construct a routing table. It then emits this routing table to the router via NATS (there are plans to eventually have the routers communicate directly with the Receptor and cut out the route-emitter). 39 | 40 | For a given `ProcessGuid` the route-emitter connects all the FQDNs provided in the `DesiredLRP`s `Routes` array to the **first** port in the `Ports` array on the `ActualLRP`s. 41 | 42 | So, concretely, if the consumer requests a `DesiredLRP` with: 43 | 44 | ``` 45 | { 46 | ... 47 | "routes": ["foo.com", "bar.com"], 48 | "ports": [4000, 5000], 49 | ... 50 | } 51 | 52 | ``` 53 | 54 | and Diego starts an `ActualLRP` with: 55 | ``` 56 | { 57 | ... 58 | "address": "10.10.1.2", 59 | "ports": [ 60 | {"container_port":4000, "host_port":59001}, 61 | {"container_port":5000, "host_port":59002}, 62 | ], 63 | ... 64 | } 65 | ``` 66 | 67 | Given this, requests to the router for `foo.com` and `bar.com` will both proxy through to `10.10.1.2:59001`. There is currently no way to connect to port 5000. 68 | 69 | > Note: the description in this section applies *after* the stories for adding an [event stream](https://www.pivotaltracker.com/story/show/84607000) and updating the [route-emitter](https://www.pivotaltracker.com/story/show/84607028) are complete. 70 | 71 | 72 | ### Routing 2.0 73 | 74 | #### Diego Changes 75 | 76 | [Tracker Story](https://www.pivotaltracker.com/story/show/86337946) 77 | 78 | The fact that `DesiredLRP`'s `Routes` is an array of strings is an accident of history. What Diego supports today is the minimum necessary to get the CF usecase to work. In truth, Diego is routing agnostic and we can leave it up to the consumer to encode routing information as they see fit. This opens up several possibilities, including implementing custom service discovery solutions on top of Diego. 
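For reference, the current route-emitter behavior described in the previous section (every FQDN in `Routes` mapped to the host-side address of the *first* container port) amounts to roughly the following sketch. The types here are simplified stand-ins for illustration only, not the actual receptor models.

```
package main

import "fmt"

// Simplified stand-ins for the routing-related fields described above.
type DesiredLRP struct {
	ProcessGuid string
	Routes      []string // currently an array of FQDNs
	Ports       []uint32 // container-side ports
}

type PortMapping struct {
	ContainerPort uint32
	HostPort      uint32
}

type ActualLRP struct {
	ProcessGuid string
	Address     string // IP of the host machine running the container
	Ports       []PortMapping
}

// routesFor maps every FQDN on the DesiredLRP to the host-side address of the
// first container port, mirroring today's route-emitter behavior.
func routesFor(desired DesiredLRP, actual ActualLRP) map[string][]string {
	table := map[string][]string{}
	if len(actual.Ports) == 0 {
		return table
	}
	backend := fmt.Sprintf("%s:%d", actual.Address, actual.Ports[0].HostPort)
	for _, fqdn := range desired.Routes {
		table[fqdn] = append(table[fqdn], backend)
	}
	return table
}

func main() {
	desired := DesiredLRP{ProcessGuid: "guid-1", Routes: []string{"foo.com", "bar.com"}, Ports: []uint32{4000, 5000}}
	actual := ActualLRP{ProcessGuid: "guid-1", Address: "10.10.1.2", Ports: []PortMapping{{4000, 59001}, {5000, 59002}}}
	fmt.Println(routesFor(desired, actual)) // foo.com and bar.com both map to 10.10.1.2:59001
}
```

Note that, as described above, port 5000 is unreachable through the router in this model, which is exactly the limitation the proposal below addresses.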
79 | 80 | The first part of this proposal, then, is to make Diego less restrictive around what can go into `DesiredLRP.Routes`. I propose turning `DesiredLRP.Routes` into (in Go parlance) a `map[string]string`. The key in the map would correspond to a routes provider and the string would correspond to arbitrary metadata associated with said provider. This allows multiple service-discovery/routing providers to live alongside one another in Diego. It also frees up the consumer to define an arbitrary schema to suit their needs. 81 | 82 | Here's an example that supports the CF Router and a DNS service (e.g. skydns): 83 | 84 | ``` 85 | { 86 | ... 87 | "routes": { 88 | "cf-router": "[{\"port\": 4000, \"routes\": [\"foo.com\", \"bar.com\"]}, {\"port\": 5000, \"routes\": [\"admin.foo.com\"]}]", 89 | "skydns": "[{\"port\":5000, \"host\":\"admin-api.service.skydns.local\", \"priority\":20}]" 90 | }, 91 | "ports": [4000, 5000], 92 | ... 93 | } 94 | ``` 95 | 96 | The important thing here is that Diego does not care about what goes into `DesiredLRP.Routes` at all. This frees the user to cook up whatever schema they deem fit. 97 | 98 | Diego should be defensive and apply a limit to the size of the `routes` entry. I propose a(n arbitrary) 4K limit for now. 99 | 100 | #### Route-Emitter Changes 101 | 102 | ##### Supporting multiple ports 103 | 104 | [Tracker Story](https://www.pivotaltracker.com/story/show/86338588) 105 | 106 | The most basic modification to the route-emitter that I would propose would be to support routing to multiple ports on the same container. 107 | 108 | For this, the schema for `cf-router`'s entry in `DesiredLRP.Routes` would look like (from above): 109 | 110 | ``` 111 | [ 112 | { 113 | "port": 4000, 114 | "routes": ["foo.com", "bar.com"] 115 | }, 116 | { 117 | "port": 5000, 118 | "routes": ["admin.foo.com"] 119 | } 120 | ] 121 | ``` 122 | 123 | Given the sample `ActualLRP` above, requests to `admin.foo.com` will now be routed to `10.10.1.2:59002`. 124 | 125 | ##### Supporting Index-Specific-Routing 126 | 127 | [Tracker Story](https://www.pivotaltracker.com/story/show/86338996) 128 | 129 | Another (straightforward) addition to route-emitter would be support for routing to a *particular* container index. This might be useful (for example) to target admin/metric panels for a particular instance or (alternatively) to deterministically refer to specific instances of a database (e.g. giving individual member addresses to an etcd cluster). The schema for `cf-router` might look like: 130 | 131 | ``` 132 | [ 133 | { 134 | "port": 4000, 135 | "routes": ["foo.com", "bar.com"] 136 | }, 137 | { 138 | "port": 5000, 139 | "routes": ["admin.foo.com"], 140 | "route_to_instances": true 141 | } 142 | ] 143 | ``` 144 | 145 | Now requests to `admin.foo.com` will be balanced across all containers, but requests to `0.admin.foo.com` will *only* go to the container at index 0. 146 | 147 | ##### Future Extensions 148 | 149 | The aforementioned additions can be trivially implemented with the existing gorouter. Potential future features can also be expressed easily with this flexible schema. Consider two features: 150 | 151 | 1. The ability to require `ssl` for a given route. 152 | 2. The ability to route `tcp`/`udp` traffic.
153 | 154 | These could be expressed via (for example) 155 | 156 | ``` 157 | [ 158 | { 159 | "port": 4000, 160 | "protocol": "tcp", 161 | "incoming_port": 62312 162 | }, 163 | { 164 | "port": 4000, 165 | "protocol": "udp", 166 | "incoming_port": 43218 167 | }, 168 | { 169 | "port": 5000, 170 | "routes": ["admin.foo.com"], 171 | "ssl": true 172 | } 173 | ] 174 | ``` 175 | 176 | > As an aside: `incoming_port` here reflects a potential implementation of `tcp/udp` routing whereby a user first checks out a load-balancer port. This checked-out port can then be associated with a particular application - the gorouter could use the incoming port, and the information expresed in the schema above, to figure out which application to route to. 177 | -------------------------------------------------------------------------------- /proposals/ssh-one-time-auth-code.md: -------------------------------------------------------------------------------- 1 | # SSH Access via One-Time Authorization Codes 2 | 3 | The use of a user's access token as the password for SSH access to Diego containers is problematic for two reasons: 4 | 5 | 1. A coincidental one: These access tokens are frequently longer than 1KB in length, but the password buffer on the OpenSSH client is hard-coded to 1KB. The token is then truncated (or not even sent), and authentication fails. 6 | 2. A security one: These access tokens are issued for the 'cf' UAA client, but are being used by a service (the proxy) that is not that client. 7 | 8 | We therefore propose an entirely different authentication flow for CF app instances, similar to that presented in [this proposal](https://docs.google.com/document/d/1DTLNW-0twYnIHs9z7OE0v5BDNQhgNIO9kZ0fyqWiMo0/edit): 9 | 10 | - The Diego SSH Proxy is registered as a UAA client with a specific name (say, 'ssh-proxy'). 11 | - The end user obtains from UAA a one-time authorization code issued for that SSH Proxy client and sends that as the SSH password. 12 | - The Diego SSH Proxy then sends a request to UAA as that client to exchange the one-time code for an access token, and uses that token to authorize the user's access to the CF app instance. 13 | 14 | This flow should be the only supported one for the CF authenticator, and we should remove the access-token-as-SSH-password option that is currently implemented in the SSH proxy. Fortunately, the UAA can now issue such tokens as of [story #102931196](https://www.pivotaltracker.com/story/show/102931196), but this work must be included in the next UAA release, which then must be updated in cf-release. 15 | 16 | Implementing this flow and removing the old one requires the following stories: 17 | 18 | --- 19 | 20 | The Diego SSH Proxy can receive an authorization code as the SSH password to access a CF app instance 21 | 22 | 23 | This requires the proxy to be configured with a client name and secret to present to UAA along with the token. Spiff-based generation of the Diego BOSH manifest should be updated to retrieve this secret from the UAA client configuration in the CF manifest. 24 | 25 | 26 | Acceptance: 27 | 28 | - I can follow documentation in the diego-ssh README to obtain a one-time authorization code from UAA for the SSH proxy's client. 29 | - I can then use the code as the password to establish an SSH connection to a CF app instance for which I am authorized access. 30 | - The existing access-token-as-password behavior should continue to work for now (until we update the SSH plugin to use the new flow). 
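To illustrate the proposed flow from the client side, here is a hedged sketch of an SSH connection that presents a one-time authorization code as the password. The proxy address and the `cf:<app-guid>/<instance>` username are placeholder assumptions for illustration, and obtaining the code from UAA (e.g. via the plugin command described below) is elided.

```
package main

import (
	"fmt"
	"log"

	"golang.org/x/crypto/ssh"
)

// connectWithOneTimeCode dials the Diego SSH proxy, presenting a freshly
// issued one-time authorization code as the SSH password.
func connectWithOneTimeCode(proxyAddr, user, oneTimeCode string) error {
	config := &ssh.ClientConfig{
		User: user,
		Auth: []ssh.AuthMethod{ssh.Password(oneTimeCode)},
		// Host-key verification is elided for brevity; a real client should
		// verify or pin the proxy's host key.
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}
	client, err := ssh.Dial("tcp", proxyAddr, config)
	if err != nil {
		return err
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return err
	}
	defer session.Close()

	out, err := session.Output("uptime") // run some trivial command
	if err != nil {
		return err
	}
	fmt.Printf("%s", out)
	return nil
}

func main() {
	// Placeholders: the one-time code would come from UAA, and the username
	// encodes the target app instance.
	if err := connectWithOneTimeCode("ssh.example.com:2222", "cf:app-guid/0", "one-time-code"); err != nil {
		log.Fatal(err)
	}
}
```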
31 | 32 | L: ssh 33 | 34 | 35 | --- 36 | 37 | CC presents the Diego SSH Proxy client name in the /v2/info endpoint 38 | 39 | - The endpoint response should present this name in its `app_ssh_oauth_client` field. 40 | - This name should be BOSH configurable in the CC configuration. 41 | 42 | L: ssh 43 | 44 | 45 | --- 46 | 47 | The SSH plugin establishes SSH connections to CF app instances by sending an authorization code as the SSH password 48 | 49 | - For each connection attempt, the plugin retrieves a new one-time authorization code for the SSH proxy client and uses it as the password. 50 | - The plugin should not have the SSH proxy UAA client name hardcoded, but should instead read it from the CC `/v2/info` endpoint. 51 | - Release version 0.2.0 of the plugin and submit to the [Community Plugin Repo](https://github.com/cloudfoundry-incubator/cli-plugin-repo). 52 | 53 | L: ssh 54 | 55 | 56 | --- 57 | 58 | The SSH plugin provides a command to print a one-time authorization code issued for the SSH proxy client 59 | 60 | Acceptance: 61 | 62 | - I can run `cf get-ssh-code` and use the output as my password for connection with the OpenSSH `ssh` and `scp` clients. 63 | - This command should also look up client info from CC's `/v2/info` endpoint. 64 | 65 | L: ssh 66 | 67 | 68 | --- 69 | 70 | The Diego SSH Proxy no longer accepts a user's access token as an SSH password for CF app instances 71 | 72 | Acceptance: 73 | 74 | - Version 0.1.x of the SSH plugin no longer allows connections. 75 | 76 | 77 | L: ssh 78 | 79 | --- 80 | 81 | -------------------------------------------------------------------------------- /proposals/tuning-health-checks-stories.prolific.md: -------------------------------------------------------------------------------- 1 | Reproduce slow app startup times during evacuation 2 | 3 | We've observed evacuation taking longer than expected on systems with sufficiently high app-instance densities (say, 50 app-instances per cell or higher). Here's the behavior we've seen from the logs: 4 | 5 | - Previously evacuated cell comes back up while next cell is evacuating 6 | - Auction places lots of containers simultaneously on the new cell 7 | - Container-creation time increases to about 30 seconds 8 | - The cell rep downloads droplets for the new app instances 9 | - The cell rep starts the apps and health-checks them aggressively 10 | - The health-check processes take 3-5 seconds to complete instead of the usual 50ms 11 | - The app processes also take longer to come up, as the system is under load, and some of them fail to come up within the default 60s start timeout and are crashed 12 | - Container deletion time also takes a long time (up to 30s) 13 | - Eventually the cell stabilizes, but this can take as long as 10 to 15 minutes 14 | 15 | Let's try to reproduce this phenomenon in an isolated environment so we can explore ways to mitigate it. Here's a starting suggestion about how to make this reproduction environment realistic: 16 | 17 | - Deploy a Diego cluster to AWS with 4 cells on r3.xlarge VMs. This can likely be a one-AZ deployment, but then let's arrange the deployment manifest so that the cell job updates serially (not in parallel with the CC-Bridge/Access/Route-Emitter VMs, as is the default today). 18 | - Deploy separate CF apps to fill up the cluster to a density of 50 instances per cell. The apps should have only 1 instance each, so that new cells incur a realistic number of droplet downloads (which are cached downloads). 
A large enough proportion of these apps should have large droplets that take a long time to start up, such as Java apps (for example, spring-music, or some Spring Boot examples). 19 | - Deploy a change that causes an update to the cells and therefore evacuation. Measure how long it takes cells to evacuate, and how long it takes individual instances on the new cells to start up and become healthy. 20 | 21 | We could also potentially explore this behavior on a 1-cell deployment by starting a lot of already-staged apps simultaneously, but the time required for SSH-key generation might interfere too much. 22 | 23 | Onsi's [analyzer](https://github.com/onsi/analyzer) may be helpful for analyzing the aggregate timing behavior of these starts, but we'll likely need some assistance to get started with it. [Cicerone](https://github.com/onsi/cicerone) may also help with timeline analysis on a per-instance basis. 24 | 25 | 26 | Acceptance: We can demonstrate reproduction of the described slow evacuation phenomenon. 27 | 28 | 29 | L: charter, perf, net-check-action, start-placement 30 | 31 | --- 32 | 33 | Evaluate the effect of reduced health-check frequency on evacuation time 34 | 35 | Once we can reproduce the slow evacuation times in an isolated environment, let's evaluate how much of an effect increasing the time between health checks for unhealthy instances has. It's currently 500ms, so let's try 1s, 2s, and 5s instead with varying instance densities (50/cell, 100/cell, 150/cell?). Do any of those changes reduce the instance start-up times significantly? 36 | 37 | This timing parameter can already be BOSH-configured on the rep, so this evaluation shouldn't require any code changes. 38 | 39 | 40 | Acceptance: report on the differences in evacuation recovery time with different unhealthy check intervals and different instance densities, as well as differences in other relevant system metrics (such as system load, disk I/O, and CPU utilization). 41 | 42 | 43 | L: charter, perf, net-check-action 44 | 45 | --- 46 | 47 | Evaluate the effect of a spiked-out native net-check action on evacuation time 48 | 49 | Once we can reproduce the slow evacuation times in an isolated environment, let's evaluate what effect a native net-check action has on improving them. For this spike, we need to implement only the `Port` field on the proposed `NetCheckAction`, as we're not concerned about backwards-compatibility, HTTP checks, or timeout configurability. Nsync should then also use it in its LRP generation. 50 | 51 | 52 | Acceptance: report on the differences in evacuation recovery time and system metrics when using this new action compared to the old action. 53 | 54 | L: charter, perf, net-check-action 55 | 56 | 57 | --- 58 | 59 | As a BBS client, I want to be able to specify a NetCheckAction on my LRP Monitor action and have the executor run a net-check directly instead of invoking a container-side process 60 | 61 | As described in the 'Tuning Health Checks' proposal, the BBS should support a new `NetCheckAction` action with the following fields: 62 | 63 | - `Port` 64 | - `Endpoint` 65 | - `TimeoutInMilliseconds` 66 | - `FallbackAction` 67 | 68 | See the proposal for the details about types, optionality, and intent. 69 | 70 | The BBS should also support new endpoints for tasks and LRP run-infos that are capable of returning this new action. If a client requests an LRP or Task from an old endpoint, the BBS should collapse all `NetCheckAction`s to their fallback actions.
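A rough sketch of what that collapse could look like follows, assuming simplified stand-in action types for illustration (the real BBS models and endpoint plumbing differ). The key detail is that the substitution recurses into the fallback itself, since a fallback may in principle contain another `NetCheckAction`.

```
package main

// Minimal stand-ins for the action types discussed in this story.
type Action interface{}

type NetCheckAction struct {
	Port                  uint32
	Endpoint              string
	TimeoutInMilliseconds uint32
	FallbackAction        Action
}

type ParallelAction struct{ Actions []Action }
type TimeoutAction struct{ Action Action }
type RunAction struct{ Path string }

// collapseNetChecks walks an action tree and substitutes every NetCheckAction
// with its fallback action, recursing into the fallback as well.
func collapseNetChecks(action Action) Action {
	switch a := action.(type) {
	case *NetCheckAction:
		return collapseNetChecks(a.FallbackAction)
	case *ParallelAction:
		collapsed := make([]Action, len(a.Actions))
		for i, inner := range a.Actions {
			collapsed[i] = collapseNetChecks(inner)
		}
		return &ParallelAction{Actions: collapsed}
	case *TimeoutAction:
		return &TimeoutAction{Action: collapseNetChecks(a.Action)}
	default:
		return action // leaf actions (RunAction, etc.) pass through unchanged
	}
}

func main() {
	monitor := &NetCheckAction{
		Port:           8080,
		FallbackAction: &RunAction{Path: "healthcheck"},
	}
	_ = collapseNetChecks(monitor) // an old endpoint would see only the RunAction
}
```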
71 | 72 | The executor should be updated to understand this action and to perform a port-connectivity or HTTP endpoint check on the container with its own in-process client, instead of invoking a separate process inside the container to perform this check. We do not want this action to log on every attempt. 73 | 74 | 75 | Acceptance: 76 | 77 | - I can modify vizzini to use the `NetCheckAction` and it still passes against both old and new cells as long as the BBS in the deployment is updated to understand it. 78 | 79 | 80 | L: net-check-action 81 | 82 | 83 | --- 84 | 85 | Nsync should use the native NetCheckAction in the Monitor action for LRPs with a port-based health-check 86 | 87 | 88 | Acceptance: 89 | 90 | - I can use veritas to verify that newly desired CF apps are created using the `NetCheckAction` with an appropriate fallback RunAction for compatibility with old cells. 91 | 92 | L: net-check-action 93 | 94 | --- 95 | 96 | The executor's Monitor step should log on health transitions 97 | 98 | The monitor step should log to the instance's log stream on the following events: 99 | 100 | - Starting the monitor step: "Starting health monitoring of container" 101 | - Failing the monitor step because the container does not become healthy within the N-second start timeout: "Container failed to become healthy within N seconds" 102 | - Verifying the instance is healthy: "Container became healthy" 103 | - Detecting the instance is now unhealthy after having previously been healthy: "Container became unhealthy" 104 | 105 | The log source for these log lines should be the top-level LogSource on the executor container. 106 | 107 | 108 | Acceptance: 109 | 110 | - I can observe these log lines in the log stream for CF app instances when they: 111 | - become healthy within the timeout 112 | - become healthy and then unhealthy 113 | - never become healthy within the start timeout 114 | 115 | 116 | L: logging, net-check-action 117 | 118 | -------------------------------------------------------------------------------- /proposals/tuning-health-checks.md: -------------------------------------------------------------------------------- 1 | # Tuning Health Checks 2 | 3 | From running larger and more public deployments of Diego, the Diego team has observed a few issues with the current approach to health-checking: 4 | 5 | - When starting many containers simultaneously on the same cell, invoking the port-based health-check every half-second puts the system under undue stress, and can cause containers to fail to be verified as healthy before their start timeout. This situation occurs naturally from cell evacuation during a rolling deploy of a Diego cluster with a sufficiently high density of LRP instances. 6 | - App developers have reported that the frequent health log lines are more noisy than helpful, and can prevent them from seeing application logs. This is especially true during the aggressive initial health-checking before instance health is verified, when the volume of logs can overwhelm the loggregator system's buffer of recent log messages. 7 | - Garden-Windows currently has some special one-off argument-processing logic to configure the port-based health-check correctly, which the Greenhouse team has expressed interest in removing. 
8 | 9 | In light of these points, we propose some changes to how Diego handles both the timing and the method of health-checking instances: 10 | 11 | - Cell performs native net-check action without logging 12 | - Cell reduces frequency of initial health-check 13 | - Monitor action logs only health transitions 14 | - BBS client can configure health-check timing parameters on the DesiredLRP 15 | 16 | We will explain these in more detail below. 17 | 18 | 19 | ## Native Net-Check Actions 20 | 21 | We currently perform network-based health checks of services run in containers by invoking a separate process inside the container to connect to the service's port over TCP. While appealing in its simplicity, especially in the context of the plugin model we use for the platform- and app-lifecycle-specific details of performing staging tasks and running app instances, this network task can be done far more efficiently by the executor component itself. This is particularly true with the strong support for these network operations in the Golang standard library. 22 | 23 | The `healthcheck` executable currently operates in two different modes: 24 | 25 | - checking establishment of a TCP connection to a particular port, on the first detected non-loopback IP address, 26 | - making an HTTP GET request to a particular endpoint on this address and port and checking that the response is successful and has a `200 OK` status code. 27 | 28 | A configurable timeout applies to both modes, defaulting to 1s. 29 | 30 | Consequently, we propose introducing a new executor action, `NetCheckAction`, to take on this functionality. It will support the following fields: 31 | 32 | | Field | Required? | Type | Description | 33 | |------------|-----------|------------|-------------| 34 | | `Port` | required | `uint32` | Container-side port on which to check connectivity. | 35 | | `Endpoint` | optional | `string` | If present, endpoint against which to check for a successful HTTP response. | 36 | | `TimeoutInMilliseconds` | optional | `uint32` | If present, different timeout in milliseconds to apply to the connection attempt or request. If absent, the executor instead uses a default duration for the timeout (1 second, perhaps configurable on the executor). | 37 | 38 | 39 | ### Backwards Compatibility 40 | 41 | With the introduction of this new action, we also require cells from previous stable releases to be able to run a sufficiently equivalent action in its place. Unfortunately, the only actions at our disposal are the previous `Download`, `Run`, and `Upload` actions, as well as the various combining and wrapping actions. Two possible options for backwards compatibility are as follows: 42 | 43 | - Backfill an action provided by the BBS that cannot fail, such as a `TryAction` wrapping a `RunAction` that runs `echo "backfill: no-op action"`. 44 | - Backfill an arbitrary action provided by the client, which can be customized to provide equivalent functionality for the desired NetCheckAction behavior in terms of, say, a RunAction. 45 | 46 | We prefer the second option, as it ensures that older cells can perform a real health-check on their app instances correctly. This option results in the following additional field on the NetCheckAction: 47 | 48 | | Field | Required? | Type | Description | 49 | |------------|-----------|------------|-------------| 50 | | `FallbackAction` | required | `Action` | Fallback action for older BBS clients that do not understand this new action type. 
| 51 | 52 | 53 | In any case, the presence of this new action will require either new BBS endpoints for complete DesiredLRP and Task records or a new, optional parameter to be sent in requests to existing endpoints to indicate that the client is capable of understanding this new action. The old endpoints or responses should instead substitute this fallback action for each `NetCheckAction` in the action tree. We will of course have to make sure that the BBS correctly replaces all `NetCheckAction` instances in the tree with their fallbacks, in case a client perversely uses a `NetCheckAction` within the fallback action for another `NetCheckAction`. 54 | 55 | 56 | #### Deprecation Plan for `FallbackAction` field 57 | 58 | Obviously, this `Fallback` field is a wart on this action that we would like to freeze off as soon as possible. We expect the following transition plan to deprecate and eliminate this field: 59 | 60 | - Diego `0.N`: 61 | - Release in which Diego cells understand `NetCheckAction` natively. 62 | - New BBS endpoints are introduced to handle the new action, and older BBS endpoints are deprecated. 63 | - Diego `0.X` (`X` ≥  `N`) and `1.X`: 64 | - BBS validation requires the `Fallback` field, as it must be present for deployed cells from releases prior to `0.N`. Later cells ignore it and perform the native net-check. 65 | - Diego `2.0` and later: 66 | - BBS validation does not require the `Fallback` field, and the BBS now ignores it, as no support is guaranteed for cells from a Diego `0.X` release. 67 | - Previously deprecated BBS endpoints are removed from the API. 68 | - Clients whose upgrades are coordinated with the BBS upgrades (say, the CC-Bridge) may safely drop the `Fallback` field, as they expect the BBS API to be upgraded to version 2 before they are, given Diego deployment order constraints. 69 | 70 | 71 | ### Justification of Effort 72 | 73 | Before we make these changes, we should validate that they mitigate the problematic behavior we have observed during evacuation at scale. We should first characterize this evacuation behavior generally via logs and metrics from production systems where it is observed, then independently reproduce it in an isolated environment. Once we have this control baseline, we can spike out the `NetCheckAction` to validate that it has a significant mitigation benefit before we proceed with the work to implement it in a backward-compatible, test-driven way. 74 | 75 | 76 | ### Implementation Notes 77 | 78 | We expect the implementation of this action to be straightforward. The executor already has access to the external IP and the port mappings of the container, which is all that it requires to connect to a given container-side port from outside the container. We should certainly validate that this network information is correct on Windows cells as well, but we expect it to be, as app instances on Windows cells publish this information to the routing tiers to receive external requests. 79 | 80 | The executor step corresponding to this action will also have to be appropriately cancelable. 81 | 82 | 83 | ## Reduced Frequency of Initial Health-Check 84 | 85 | We have also received feedback that the 500ms duration between health checks while the instance is initially unhealthy may be too aggressive. We intend to increase this duration to a longer 2 seconds by default, both in the executor default configuration and in the property spec for the rep job in the Diego BOSH release. 
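For reference, the in-process check outlined in the Implementation Notes above, combined with the less aggressive two-second unhealthy interval proposed here, might look roughly like the following sketch. This is an illustration only; the function names, the hard-coded intervals, and the example address are assumptions rather than the executor's actual API.

```
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// netCheck performs the two modes described for NetCheckAction: a TCP
// connectivity check against addr and, if endpoint is non-empty, an HTTP GET
// that must return 200 OK. The timeout covers the dial or the request.
func netCheck(addr, endpoint string, timeout time.Duration) error {
	if endpoint == "" {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			return err
		}
		return conn.Close()
	}
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(fmt.Sprintf("http://%s%s", addr, endpoint))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

// waitUntilHealthy retries the check on the proposed 2s unhealthy interval
// until it succeeds or the start timeout elapses.
func waitUntilHealthy(addr, endpoint string, startTimeout time.Duration) error {
	deadline := time.Now().Add(startTimeout)
	for {
		if err := netCheck(addr, endpoint, time.Second); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("container failed to become healthy within %s", startTimeout)
		}
		time.Sleep(2 * time.Second)
	}
}

func main() {
	// 10.0.0.5:60001 stands in for the host-side mapping of the container port.
	if err := waitUntilHealthy("10.0.0.5:60001", "", 60*time.Second); err != nil {
		fmt.Println(err)
	}
}
```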
86 | 87 | 88 | ## Logging Only Health Transitions 89 | 90 | In the absence of any logging from the `NetCheckAction` itself, we intend for the executor's Monitor step to provide a minimal amount of logging, and only at the following health transitions: 91 | 92 | - when it starts monitoring the instance, 93 | - when the monitor step times out before the instance is considered healthy, 94 | - when the instance becomes healthy, 95 | - when a healthy instance becomes unhealthy by failing the monitor action. 96 | 97 | We expect this should provide an adequate level of visibility into the health status of the instance without overwhelming the log stream emanating from the instance. 98 | 99 | One issue that may arise is with the LogSource to be used for this Monitor step logging. We would prefer that the current `HEALTH` log source be used for these logs as well, but it is currently set only on the `RunAction` provided as the LRP's Monitor action. There is a container-wide LogSource field, but it sets the LogSource for any logs emitted by cell or container actions unless explicitly overriden on an action, so it is unsuited for customizing the LogSource only on logs coming from the Monitor step. We may wish to introduce an optional LogSource identifier on the LRP to be used for this Monitor action logging, or with the context included in the Monitor step logs, it may suffice to use the current container-wide LogSource (currently set to `CELL` for CF app instances). 100 | 101 | 102 | ## Configurable Health-Check Timing 103 | 104 | We also realize that the default parameters of the Monitor step may not apply universally to all long-running workloads, especially if CF app developers are allowed to customize the health-check action to, say, run a custom script instead of the default port check. We will therefore expose some of these monitoring timing parameters on the DesiredLRP itself, such as: 105 | 106 | - the duration to wait before initiating health-checking (currently `0`), 107 | - the duration to wait between health checks before the instance is healthy (currently `500ms`, proposed to increase to `2s` by default), 108 | - the duration to wait between health checks after the instance is healthy (currently `30s`). If `0`, the health check is never performed after the instance is considered healthy. 109 | 110 | Each one of these new DesiredLRP properties would be optional, and if not specified, would default to the configuration of the rep (or its internal executor). Corresponding changes would also be required on the executor's Container model. 111 | 112 | As additional motivation, since some of these parameters are configurable on the rep, the Diego BBS client may wish to specify these parameters explicitly on the DesiredLRP it submits, rather than relying on configuration that may vary with the Diego deployment. 113 | 114 | -------------------------------------------------------------------------------- /proposals/versioning.md: -------------------------------------------------------------------------------- 1 | # Versioning 2 | 3 | ## BBS Semantics 4 | 5 | There are likely to be a class of updates that require a modification to the 6 | structure or semantics of the BBS. An example might be the movement of egress 7 | rules from the desired LRP to a separate part of the BBS. This change would 8 | effectively create a new set of keys for egress rules and would change the 9 | semantics of a desired LRP with the requirement to reference the egress rules. 
10 | 
11 | Changes like this would require a more involved upgrade process, such as:
12 | 
13 | 1. Code is deployed with a BBS version that understands old and new semantics
14 | 2. While running, code operates with the old BBS version semantics: it can read
15 |    the new semantics but writes the old
16 | 3. An operator (or some other entity) explicitly bumps the BBS version via an
17 |    errand
18 | 4. The BBS code in each component eventually discovers the new operating mode,
19 |    generates new data structures, and uses the new semantics
20 | 5. A migration phase moves data to the new level: the errand performs the
21 |    migration after setting the new active version
22 | 
23 | The BBS version needs to be stored in a consistent, persistent place outside of
24 | the BBS. We can't put the version directly into the BBS, since a loss of our
25 | persistent store would leave us unable to determine the appropriate semantics
26 | to use when the Diego jobs restart.
27 | 
28 | Options for where to store this version are Consul or the manifest. The problem
29 | with the manifest is that it requires a manual, manifest-only deploy.
30 | 
31 | ## Schema Models
32 | 
33 | Each model has its own version that can grow independently of the Diego release
34 | version. A bump in a model version may (and in most cases must) trigger a major
35 | bump of the Diego release version.
36 | 
37 | A new version of a model requires creating a new structure (code) in the BBS
38 | with the version number in its name. That structure, when marshalled, should
39 | write a `version` field. Unmarshalling of any JSON payload should read the
40 | version and create the appropriate object. Version 1 is implied if no version
41 | is present on the payload.
42 | 
43 | - Adding an optional field does not require bumping the version.
44 | - Removing an optional field does not require bumping the version.
45 | - Adding a model does not require bumping the version.
46 | 
47 | ## Internal APIs
48 | 
49 | APIs themselves are not explicitly versioned; we inherit the version from the
50 | schema models being passed through the requests.
51 | 
52 | Internal servers should respond with the same version that was received.
53 | 
54 | We do not preclude the addition of new endpoints and messages; if the semantics
55 | of an operation change significantly, a new endpoint is likely warranted and
56 | the (updated) client will need a mechanism to detect the new capability. (Can
57 | we use the BBS for this?)
58 | 
59 | ## Upgrades
60 | 
61 | Consider a current major version N. When you deploy version N + 1, it has the
62 | ability to read and write both the new and the old BBS data models. Prior to
63 | migration, it will always write the old data model but it can understand the
64 | new model; after migration, it can still understand the old model but it will
65 | only write the new data model. Once the migration completes, code running
66 | version N will no longer function correctly.
67 | 
68 | This results in two simple rules for operators:
69 | 
70 | 1. You **SHOULD** run the migration errand *after* deploying N + 1.
71 | 2. You **MUST** run the migration errand *before* deploying N + 2.
72 | 
73 | 
74 | ## An Alternative BBS Schema Migration Approach
75 | 
76 | 
77 | ![diagram](diego_versioning.svg)
78 | 
79 | ### Steps
80 | 
81 | 1. BOSH deploy Diego release version N + 1.
82 | 2. Run the migration errand.
83 | 
84 | ### Changes to Jobs
85 | 
86 | Jobs need a way to detect if they're running in a mode where they don't have
87 | to worry about backwards compatibility. We could model that as a flag on the
88 | command line, or we could find a way to do it in the BBS. Either of those
89 | mechanisms could work.
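As a rough illustration only, the sketch below shows one way a job might gate its BBS access on such a flag. All names are hypothetical, and the rules it encodes are the ones spelled out in the list that follows.

```go
package versioning

// Sketch only: hypothetical wiring for a job that must span the version N
// and version N + 1 BBS during a migration.

type Store interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
}

type versionedStore struct {
	oldBBS        Store // version N
	newBBS        Store // version N + 1
	migrationMode bool  // e.g. set by a command-line flag, or discovered via the BBS
}

// Get prefers the version N BBS while the migration flag is on, falling back
// to the version N + 1 BBS if that read fails.
func (s *versionedStore) Get(key string) ([]byte, error) {
	if s.migrationMode {
		if value, err := s.oldBBS.Get(key); err == nil {
			return value, nil
		}
	}
	return s.newBBS.Get(key)
}

// Set writes to both BBS versions while the migration flag is on, and only to
// version N + 1 once it is off.
func (s *versionedStore) Set(key string, value []byte) error {
	if s.migrationMode {
		if err := s.oldBBS.Set(key, value); err != nil {
			return err
		}
	}
	return s.newBBS.Set(key, value)
}
```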
90 | 
91 | With that in place:
92 | 
93 | 1. Version N + 1 jobs read from the version N BBS first, then read from the
94 |    version N + 1 BBS if that operation fails
95 | 2. Version N + 1 jobs write to both the version N and the version N + 1 BBS
96 | 3. Version N + 1 jobs read from and write to only the version N + 1 BBS once
97 |    the migration flag is turned off. The migration flag is turned off when
98 |    the version N BBS is no longer available.
99 | 
100 | 
101 | ### Migration Errand
102 | 
103 | 1. Build the version N + 1 BBS
104 | 2. Scan the version N BBS nodes for discrepancies by comparing each node's
105 |    index to the corresponding version N + 1 BBS node (version N node indices
106 |    are always greater than or equal to version N + 1 node indices)
107 | 3. Update the out-of-sync nodes in the version N + 1 BBS
108 | 4. Build a nodes-in-sync lookup table for nodes with the same node index
109 |    between version N and N + 1
110 | 5. Repeat step 2 until all nodes are in sync
111 | 6. Delete the version N BBS
112 | 
--------------------------------------------------------------------------------