├── .github └── workflows │ └── markdownlint.yml ├── .gitignore ├── .markdownlint.json ├── LICENSE ├── README.md ├── media └── kubernetes-dashboard-login.png ├── package-lock.json ├── package.json ├── template.md └── text ├── 2020-09-08-easily-send-http-request-on-workflow-task.md ├── 2020-10-22-authn-and-authz-on-chaos-dashboard.md ├── 2020-11-09-chaos-mesh-workflow.md ├── 2021-03-11-unified-selector.md ├── 2021-07-23-extensible-chaosctl.md ├── 2021-08-11-implement-physical-machine-chaos.md ├── 2021-09-09-physical-machine-auth.md ├── 2021-09-27-refine-error-handling.md ├── 2021-10-08-monitoring-metrics-about-chaos-mesh.md ├── 2021-11-17-ui-monorepo.md ├── 2021-12-09-logging.md ├── 2021-12-29-openapi-to-typescript-api-client-and-forms.md ├── 2022-01-17-keep-a-changelog.md └── 2022-02-21-workflow-status-check.md /.github/workflows/markdownlint.yml: -------------------------------------------------------------------------------- 1 | # This workflow will do a clean install of node dependencies, build the source code and run tests across different versions of node 2 | # For more information see: https://help.github.com/actions/language-and-framework-guides/using-nodejs-with-github-actions 3 | 4 | name: markdownlint 5 | 6 | on: 7 | push: 8 | branches: [main] 9 | pull_request: 10 | branches: [main] 11 | 12 | jobs: 13 | build: 14 | runs-on: ubuntu-latest 15 | steps: 16 | - uses: actions/checkout@v3 17 | - name: Use Node.js 14.x 18 | uses: actions/setup-node@v3 19 | with: 20 | node-version: 14.x 21 | cache: "npm" 22 | - run: npm ci 23 | - run: npm run lint 24 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | node_modules/ 2 | -------------------------------------------------------------------------------- /.markdownlint.json: -------------------------------------------------------------------------------- 1 | { 2 | "default": true, 3 | "MD013": { 4 | "code_blocks": false 5 | }, 6 | "MD033": false, 7 | "MD034": false 8 | } 9 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 
29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Chaos Mesh RFCs 2 | 3 | Many changes, including bug fixes and documentation improvements can be 4 | implemented and reviewed via the normal GitHub pull request workflow. 5 | 6 | Some changes though are "substantial", and we ask that these be put through a 7 | bit of a design process and produce a consensus among the Chaos Mesh community. 8 | 9 | The "RFC" (request for comments) process is intended to provide a consistent 10 | and controlled path for new features to enter the project, so that all 11 | stakeholders can be confident about the direction the project is evolving in. 12 | 13 | ## How to submit an RFC 14 | 15 | 1. Copy `template.md` into `text/YYYY-MM-DD-my-feature.md`. 16 | 2. Write the document and fill in the blanks. 17 | 3. Submit a pull request. 18 | 19 | ## Timeline of an RFC 20 | 21 | 1. An RFC is submitted as a PR. 22 | 2. Discussion takes place, and the text is revised in response. 23 | 3. The PR is merged or closed when at least two project maintainers reach 24 | consensus. 25 | 26 | ## Style of an RFC 27 | 28 | We follow lint rules listed in 29 | [markdownlint](https://github.com/DavidAnson/markdownlint/blob/main/doc/Rules.md). 30 | 31 | Run lints (you must have [Node.js](https://nodejs.org) installed): 32 | 33 | ```bash 34 | # Install linters: npm install 35 | npm run lint 36 | ``` 37 | 38 | ## License 39 | 40 | This content is licensed under Apache License, Version 2.0, 41 | ([LICENSE](LICENSE) or http://www.apache.org/licenses/LICENSE-2.0) 42 | 43 | ## Contributions 44 | 45 | Unless you explicitly state otherwise, any contribution intentionally submitted 46 | for inclusion in the work by you, as defined in the Apache-2.0 license, shall 47 | be dual licensed as above, without any additional terms or conditions. 
48 | -------------------------------------------------------------------------------- /media/kubernetes-dashboard-login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaos-mesh/rfcs/a00e8824b58ead04f39b08d705a481e61a1bf11d/media/kubernetes-dashboard-login.png -------------------------------------------------------------------------------- /package-lock.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "rfcs", 3 | "version": "0.0.0", 4 | "lockfileVersion": 1, 5 | "requires": true, 6 | "dependencies": { 7 | "argparse": { 8 | "version": "1.0.10", 9 | "resolved": "https://registry.npmjs.org/argparse/-/argparse-1.0.10.tgz", 10 | "integrity": "sha512-o5Roy6tNG4SL/FOkCAN6RzjiakZS25RLYFrcMttJqbdd8BWrnA+fGz57iN5Pb06pvBGvl5gQ0B48dJlslXvoTg==", 11 | "requires": { 12 | "sprintf-js": "~1.0.2" 13 | } 14 | }, 15 | "balanced-match": { 16 | "version": "1.0.0", 17 | "resolved": "https://registry.npmjs.org/balanced-match/-/balanced-match-1.0.0.tgz", 18 | "integrity": "sha1-ibTRmasr7kneFk6gK4nORi1xt2c=" 19 | }, 20 | "brace-expansion": { 21 | "version": "1.1.11", 22 | "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-1.1.11.tgz", 23 | "integrity": "sha512-iCuPHDFgrHX7H2vEI/5xpz07zSHB00TpugqhmYtVmMO6518mCuRMoOYFldEBl0g187ufozdaHgWKcYFb61qGiA==", 24 | "requires": { 25 | "balanced-match": "^1.0.0", 26 | "concat-map": "0.0.1" 27 | } 28 | }, 29 | "commander": { 30 | "version": "2.9.0", 31 | "resolved": "https://registry.npmjs.org/commander/-/commander-2.9.0.tgz", 32 | "integrity": "sha1-nJkJQXbhIkDLItbFFGCYQA/g99Q=", 33 | "requires": { 34 | "graceful-readlink": ">= 1.0.0" 35 | } 36 | }, 37 | "concat-map": { 38 | "version": "0.0.1", 39 | "resolved": "https://registry.npmjs.org/concat-map/-/concat-map-0.0.1.tgz", 40 | "integrity": "sha1-2Klr13/Wjfd5OnMDajug1UBdR3s=" 41 | }, 42 | "deep-extend": { 43 | "version": "0.5.1", 44 | "resolved": "https://registry.npmjs.org/deep-extend/-/deep-extend-0.5.1.tgz", 45 | "integrity": "sha512-N8vBdOa+DF7zkRrDCsaOXoCs/E2fJfx9B9MrKnnSiHNh4ws7eSys6YQE4KvT1cecKmOASYQBhbKjeuDD9lT81w==" 46 | }, 47 | "entities": { 48 | "version": "2.0.3", 49 | "resolved": "https://registry.npmjs.org/entities/-/entities-2.0.3.tgz", 50 | "integrity": "sha512-MyoZ0jgnLvB2X3Lg5HqpFmn1kybDiIfEQmKzTb5apr51Rb+T3KdmMiqa70T+bhGnyv7bQ6WMj2QMHpGMmlrUYQ==" 51 | }, 52 | "esprima": { 53 | "version": "4.0.1", 54 | "resolved": "https://registry.npmjs.org/esprima/-/esprima-4.0.1.tgz", 55 | "integrity": "sha512-eGuFFw7Upda+g4p+QHvnW0RyTX/SVeJBDM/gCtMARO0cLuT2HcEKnTPvhjV6aGeqrCB/sbNop0Kszm0jsaWU4A==" 56 | }, 57 | "fs.realpath": { 58 | "version": "1.0.0", 59 | "resolved": "https://registry.npmjs.org/fs.realpath/-/fs.realpath-1.0.0.tgz", 60 | "integrity": "sha1-FQStJSMVjKpA20onh8sBQRmU6k8=" 61 | }, 62 | "get-stdin": { 63 | "version": "5.0.1", 64 | "resolved": "https://registry.npmjs.org/get-stdin/-/get-stdin-5.0.1.tgz", 65 | "integrity": "sha1-Ei4WFZHiH/TFJTAwVpPyDmOTo5g=" 66 | }, 67 | "glob": { 68 | "version": "7.1.6", 69 | "resolved": "https://registry.npmjs.org/glob/-/glob-7.1.6.tgz", 70 | "integrity": "sha512-LwaxwyZ72Lk7vZINtNNrywX0ZuLyStrdDtabefZKAY5ZGJhVtgdznluResxNmPitE0SAO+O26sWTHeKSI2wMBA==", 71 | "requires": { 72 | "fs.realpath": "^1.0.0", 73 | "inflight": "^1.0.4", 74 | "inherits": "2", 75 | "minimatch": "^3.0.4", 76 | "once": "^1.3.0", 77 | "path-is-absolute": "^1.0.0" 78 | } 79 | }, 80 | "graceful-readlink": { 81 | "version": "1.0.1", 82 | "resolved": 
"https://registry.npmjs.org/graceful-readlink/-/graceful-readlink-1.0.1.tgz", 83 | "integrity": "sha1-TK+tdrxi8C+gObL5Tpo906ORpyU=" 84 | }, 85 | "ignore": { 86 | "version": "5.1.8", 87 | "resolved": "https://registry.npmjs.org/ignore/-/ignore-5.1.8.tgz", 88 | "integrity": "sha512-BMpfD7PpiETpBl/A6S498BaIJ6Y/ABT93ETbby2fP00v4EbvPBXWEoaR1UBPKs3iR53pJY7EtZk5KACI57i1Uw==" 89 | }, 90 | "inflight": { 91 | "version": "1.0.6", 92 | "resolved": "https://registry.npmjs.org/inflight/-/inflight-1.0.6.tgz", 93 | "integrity": "sha1-Sb1jMdfQLQwJvJEKEHW6gWW1bfk=", 94 | "requires": { 95 | "once": "^1.3.0", 96 | "wrappy": "1" 97 | } 98 | }, 99 | "inherits": { 100 | "version": "2.0.4", 101 | "resolved": "https://registry.npmjs.org/inherits/-/inherits-2.0.4.tgz", 102 | "integrity": "sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ==" 103 | }, 104 | "ini": { 105 | "version": "1.3.5", 106 | "resolved": "https://registry.npmjs.org/ini/-/ini-1.3.5.tgz", 107 | "integrity": "sha512-RZY5huIKCMRWDUqZlEi72f/lmXKMvuszcMBduliQ3nnWbx9X/ZBQO7DijMEYS9EhHBb2qacRUMtC7svLwe0lcw==" 108 | }, 109 | "js-yaml": { 110 | "version": "3.13.1", 111 | "resolved": "https://registry.npmjs.org/js-yaml/-/js-yaml-3.13.1.tgz", 112 | "integrity": "sha512-YfbcO7jXDdyj0DGxYVSlSeQNHbD7XPWvrVWeVUujrQEoZzWJIRrCPoyk6kL6IAjAG2IolMK4T0hNUe0HOUs5Jw==", 113 | "requires": { 114 | "argparse": "^1.0.7", 115 | "esprima": "^4.0.0" 116 | } 117 | }, 118 | "jsonc-parser": { 119 | "version": "2.2.1", 120 | "resolved": "https://registry.npmjs.org/jsonc-parser/-/jsonc-parser-2.2.1.tgz", 121 | "integrity": "sha512-o6/yDBYccGvTz1+QFevz6l6OBZ2+fMVu2JZ9CIhzsYRX4mjaK5IyX9eldUdCmga16zlgQxyrj5pt9kzuj2C02w==" 122 | }, 123 | "linkify-it": { 124 | "version": "3.0.2", 125 | "resolved": "https://registry.npmjs.org/linkify-it/-/linkify-it-3.0.2.tgz", 126 | "integrity": "sha512-gDBO4aHNZS6coiZCKVhSNh43F9ioIL4JwRjLZPkoLIY4yZFwg264Y5lu2x6rb1Js42Gh6Yqm2f6L2AJcnkzinQ==", 127 | "requires": { 128 | "uc.micro": "^1.0.1" 129 | } 130 | }, 131 | "lodash.differencewith": { 132 | "version": "4.5.0", 133 | "resolved": "https://registry.npmjs.org/lodash.differencewith/-/lodash.differencewith-4.5.0.tgz", 134 | "integrity": "sha1-uvr7yRi1UVTheRdqALsK76rIVLc=" 135 | }, 136 | "lodash.flatten": { 137 | "version": "4.4.0", 138 | "resolved": "https://registry.npmjs.org/lodash.flatten/-/lodash.flatten-4.4.0.tgz", 139 | "integrity": "sha1-8xwiIlqWMtK7+OSt2+8kCqdlph8=" 140 | }, 141 | "markdown-it": { 142 | "version": "11.0.0", 143 | "resolved": "https://registry.npmjs.org/markdown-it/-/markdown-it-11.0.0.tgz", 144 | "integrity": "sha512-+CvOnmbSubmQFSA9dKz1BRiaSMV7rhexl3sngKqFyXSagoA3fBdJQ8oZWtRy2knXdpDXaBw44euz37DeJQ9asg==", 145 | "requires": { 146 | "argparse": "^1.0.7", 147 | "entities": "~2.0.0", 148 | "linkify-it": "^3.0.1", 149 | "mdurl": "^1.0.1", 150 | "uc.micro": "^1.0.5" 151 | } 152 | }, 153 | "markdownlint": { 154 | "version": "0.21.1", 155 | "resolved": "https://registry.npmjs.org/markdownlint/-/markdownlint-0.21.1.tgz", 156 | "integrity": "sha512-8kc88w5dyEzlmOWIElp8J17qBgzouOQfJ0LhCcpBFrwgyYK6JTKvILsk4FCEkiNqHkTxwxopT2RS2DYb/10qqg==", 157 | "requires": { 158 | "markdown-it": "11.0.0" 159 | } 160 | }, 161 | "markdownlint-cli": { 162 | "version": "0.24.0", 163 | "resolved": "https://registry.npmjs.org/markdownlint-cli/-/markdownlint-cli-0.24.0.tgz", 164 | "integrity": "sha512-AusUxaX4sFayUBFTCKeHc8+fq73KFqIUW+ZZZYyQ/BvY0MoGAnE2C/3xiawSE7WXmpmguaWzhrXRuY6IrOLX7A==", 165 | "requires": { 166 | "commander": "~2.9.0", 167 | "deep-extend": "~0.5.1", 
168 | "get-stdin": "~5.0.1", 169 | "glob": "~7.1.2", 170 | "ignore": "~5.1.4", 171 | "js-yaml": "~3.13.1", 172 | "jsonc-parser": "~2.2.0", 173 | "lodash.differencewith": "~4.5.0", 174 | "lodash.flatten": "~4.4.0", 175 | "markdownlint": "~0.21.0", 176 | "markdownlint-rule-helpers": "~0.12.0", 177 | "minimatch": "~3.0.4", 178 | "minimist": "~1.2.5", 179 | "rc": "~1.2.7" 180 | } 181 | }, 182 | "markdownlint-rule-helpers": { 183 | "version": "0.12.0", 184 | "resolved": "https://registry.npmjs.org/markdownlint-rule-helpers/-/markdownlint-rule-helpers-0.12.0.tgz", 185 | "integrity": "sha512-Q7qfAk+AJvx82ZY52OByC4yjoQYryOZt6D8TKrZJIwCfhZvcj8vCQNuwDqILushtDBTvGFmUPq+uhOb1KIMi6A==" 186 | }, 187 | "mdurl": { 188 | "version": "1.0.1", 189 | "resolved": "https://registry.npmjs.org/mdurl/-/mdurl-1.0.1.tgz", 190 | "integrity": "sha1-/oWy7HWlkDfyrf7BAP1sYBdhFS4=" 191 | }, 192 | "minimatch": { 193 | "version": "3.0.4", 194 | "resolved": "https://registry.npmjs.org/minimatch/-/minimatch-3.0.4.tgz", 195 | "integrity": "sha512-yJHVQEhyqPLUTgt9B83PXu6W3rx4MvvHvSUvToogpwoGDOUQ+yDrR0HRot+yOCdCO7u4hX3pWft6kWBBcqh0UA==", 196 | "requires": { 197 | "brace-expansion": "^1.1.7" 198 | } 199 | }, 200 | "minimist": { 201 | "version": "1.2.5", 202 | "resolved": "https://registry.npmjs.org/minimist/-/minimist-1.2.5.tgz", 203 | "integrity": "sha512-FM9nNUYrRBAELZQT3xeZQ7fmMOBg6nWNmJKTcgsJeaLstP/UODVpGsr5OhXhhXg6f+qtJ8uiZ+PUxkDWcgIXLw==" 204 | }, 205 | "once": { 206 | "version": "1.4.0", 207 | "resolved": "https://registry.npmjs.org/once/-/once-1.4.0.tgz", 208 | "integrity": "sha1-WDsap3WWHUsROsF9nFC6753Xa9E=", 209 | "requires": { 210 | "wrappy": "1" 211 | } 212 | }, 213 | "path-is-absolute": { 214 | "version": "1.0.1", 215 | "resolved": "https://registry.npmjs.org/path-is-absolute/-/path-is-absolute-1.0.1.tgz", 216 | "integrity": "sha1-F0uSaHNVNP+8es5r9TpanhtcX18=" 217 | }, 218 | "rc": { 219 | "version": "1.2.8", 220 | "resolved": "https://registry.npmjs.org/rc/-/rc-1.2.8.tgz", 221 | "integrity": "sha512-y3bGgqKj3QBdxLbLkomlohkvsA8gdAiUQlSBJnBhfn+BPxg4bc62d8TcBW15wavDfgexCgccckhcZvywyQYPOw==", 222 | "requires": { 223 | "deep-extend": "^0.6.0", 224 | "ini": "~1.3.0", 225 | "minimist": "^1.2.0", 226 | "strip-json-comments": "~2.0.1" 227 | }, 228 | "dependencies": { 229 | "deep-extend": { 230 | "version": "0.6.0", 231 | "resolved": "https://registry.npmjs.org/deep-extend/-/deep-extend-0.6.0.tgz", 232 | "integrity": "sha512-LOHxIOaPYdHlJRtCQfDIVZtfw/ufM8+rVj649RIHzcm/vGwQRXFt6OPqIFWsm2XEMrNIEtWR64sY1LEKD2vAOA==" 233 | } 234 | } 235 | }, 236 | "sprintf-js": { 237 | "version": "1.0.3", 238 | "resolved": "https://registry.npmjs.org/sprintf-js/-/sprintf-js-1.0.3.tgz", 239 | "integrity": "sha1-BOaSb2YolTVPPdAVIDYzuFcpfiw=" 240 | }, 241 | "strip-json-comments": { 242 | "version": "2.0.1", 243 | "resolved": "https://registry.npmjs.org/strip-json-comments/-/strip-json-comments-2.0.1.tgz", 244 | "integrity": "sha1-PFMZQukIwml8DsNEhYwobHygpgo=" 245 | }, 246 | "uc.micro": { 247 | "version": "1.0.6", 248 | "resolved": "https://registry.npmjs.org/uc.micro/-/uc.micro-1.0.6.tgz", 249 | "integrity": "sha512-8Y75pvTYkLJW2hWQHXxoqRgV7qb9B+9vFEtidML+7koHUFapnVJAZ6cKs+Qjz5Aw3aZWHMC6u0wJE3At+nSGwA==" 250 | }, 251 | "wrappy": { 252 | "version": "1.0.2", 253 | "resolved": "https://registry.npmjs.org/wrappy/-/wrappy-1.0.2.tgz", 254 | "integrity": "sha1-tSQ9jz7BqjXxNkYFvA0QNuMKtp8=" 255 | } 256 | } 257 | } 258 | -------------------------------------------------------------------------------- /package.json: 
-------------------------------------------------------------------------------- 1 | { 2 | "name": "rfcs", 3 | "version": "0.0.0", 4 | "devDependencies": {}, 5 | "scripts": { 6 | "lint": "markdownlint text/*.md" 7 | }, 8 | "dependencies": { 9 | "markdownlint-cli": "^0.24.0" 10 | } 11 | } 12 | -------------------------------------------------------------------------------- /template.md: -------------------------------------------------------------------------------- 1 | # Title 2 | 3 | ## Summary 4 | 5 | One-paragraph explanation of the proposal. 6 | 7 | ## Motivation 8 | 9 | Why are we doing this? What use cases does it support? What is the expected 10 | outcome? 11 | 12 | ## Detailed design 13 | 14 | This is the bulk of the RFC. Explain the design in enough detail that: 15 | 16 | - It is reasonably clear how the feature would be implemented. 17 | - Corner cases are dissected by example. 18 | - It is clear how the feature is used. 19 | 20 | ## Drawbacks 21 | 22 | Why should we not do this? 23 | 24 | ## Alternatives 25 | 26 | - Why is this design the best in the space of possible designs? 27 | - What other designs have been considered and what is the rationale for not 28 | choosing them? 29 | - What is the impact of not doing this? 30 | 31 | ## Unresolved questions 32 | 33 | What parts of the design are still to be determined? 34 | -------------------------------------------------------------------------------- /text/2020-09-08-easily-send-http-request-on-workflow-task.md: -------------------------------------------------------------------------------- 1 | # Easily Send HTTP Request On Workflow Task 2 | 3 | - [Easily Send HTTP Request On Workflow Task](#easily-send-http-request-on-workflow-task) 4 | - [Summary](#summary) 5 | - [Motivation](#motivation) 6 | - [Detailed design](#detailed-design) 7 | - [Rendering `Task` for sending HTTP request](#rendering-task-for-sending-http-request) 8 | - [Why `curl`](#why-curl) 9 | - [Frontend form parameters](#frontend-form-parameters) 10 | - [Load request body from ConfigMap as file](#load-request-body-from-configmap-as-file) 11 | - [Advanced configuration](#advanced-configuration) 12 | - [Frontend component for this type of `Task` and preview of generated result](#frontend-component-for-this-type-of-task-and-preview-of-generated-result) 13 | - [New context variable: `http`](#new-context-variable-http) 14 | - [Structure of `http`](#structure-of-http) 15 | - [Example conditions](#example-conditions) 16 | - [Drawbacks](#drawbacks) 17 | - [parsing curl command inline back into the parameters](#parsing-curl-command-inline-back-into-the-parameters) 18 | - [parser of context variable `http`](#parser-of-context-variable-http) 19 | - [Alternatives](#alternatives) 20 | - [Alternative solution 1: New type of WorkflowNode/Template for sending HTTP request](#alternative-solution-1-new-type-of-workflownodetemplate-for-sending-http-request) 21 | - [Unresolved questions](#unresolved-questions) 22 | 23 | ## Summary 24 | 25 | Render a `curl` command line into a Workflow `Task` from the commonly used 26 | HTTP parameters, and provide more useful context variables like `json` or `http`. 27 | 28 | > I originally designed the rendering logic in pure frontend/TypeScript, but given 29 | > the requirement of "parsing the curl command line", I think using Golang is better 30 | > for reusing the code and round-trip testing.
31 | 32 | ## Motivation 33 | 34 | The design of the `Task` node in Workflow is very general-purpose: users can run any 35 | workload as a `ContainerSpec` inside the `Task`, and select branches with 36 | conditions. But the `Task` is quite complex to use; users need to write an 37 | entire one-line shell command to call the utilities in the Docker image, such as `curl -X 38 | GET https://your.application/health -H 'Required-Token: token'`. And we do not 39 | want to introduce more types of WorkflowNode for doing that, so I think 40 | rendering the original requirement of "sending an HTTP request" into a `Task` is a good 41 | idea. 42 | 43 | Given the specific context of "sending an HTTP request", we could introduce some 44 | dedicated context variables just for the parsed HTTP status code and response 45 | body. In most cases, an HTTP endpoint like `https://your.application/health` 46 | will return `20x` when the application is healthy and `50x` when the application is not 47 | available. 48 | 49 | ## Detailed design 50 | 51 | ### Rendering `Task` for sending HTTP request 52 | 53 | The basic idea: there are utilities called "curl command line builders", 54 | and we could borrow the core logic from them. 55 | 56 | #### Why `curl` 57 | 58 | We will launch a pod inside the Kubernetes cluster, so we should consider 59 | the image size and multi-architecture/platform support. 60 | 61 | I will use `curlimages/curl:7.78.0` as the Docker image. It's the latest stable 62 | curl for now (2021-09-07). It's small (< 4 MB). It supports the most common 63 | architectures: `386`, `amd64`, `arm64` and so on. 64 | 65 | Why not use other tools like `httpie`? 66 | 67 | `httpie` provides friendlier flags and better colorful output of the HTTP response. 68 | But the image is much larger (about 25 MB) and has no support for other architectures. 69 | 70 | Why not use Python/JavaScript script files with the `python` and `node` Docker images? 71 | 72 | Both official images are large (about 60 MB for Python and 40 MB for Node). And 73 | generating code is much harder than generating a command line. 74 | 75 | Why not build another standalone binary tool? 76 | 77 | We do not want to reinvent the wheel. :D 78 | 79 | #### Frontend form parameters 80 | 81 | We will provide the most commonly used parameters on the frontend: 82 | 83 | - HTTP method 84 | - URL 85 | - Custom headers 86 | - Request body as a string 87 | - Path of the file used as the content of the request body 88 | 89 | "Request body as a string" and "Path of the file used as the content of the request body" 90 | are exclusive. 91 | 92 | > I originally mounted a ConfigMap into the pod and used its content as the request body 93 | > directly. But that cannot be implemented now, because the name of the ConfigMap 94 | > would not appear in the command line, so the parser could not rebuild it; the round 95 | > trip would not work. 96 | 97 | #### Load request body from ConfigMap as file 98 | 99 | `curl` supports loading content from a file as the request body with `-d`. We could 100 | load an existing ConfigMap as the content of the request body; the mount path of 101 | the ConfigMap can be configured.
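To make the rendering step concrete, here is a minimal Go sketch of turning the form parameters above into a `curl` argv for the `Task` container. The `HTTPTaskParams` struct and `RenderCurl` function are hypothetical names for illustration, not the actual Chaos Mesh code:

```go
// A minimal sketch of the rendering described above; HTTPTaskParams and
// RenderCurl are hypothetical names, not the actual Chaos Mesh API.
package render

import "fmt"

// HTTPTaskParams mirrors the frontend form fields listed above.
type HTTPTaskParams struct {
	Method   string            // e.g. "GET"
	URL      string            // e.g. "https://your.application/health"
	Headers  map[string]string // custom headers
	Body     string            // request body as a string (exclusive with BodyFile)
	BodyFile string            // path of the file used as the request body
}

// RenderCurl builds the argv for the Task container running the
// curlimages/curl image; "-i" is included so that the response header
// can later be parsed back into the `http` context variable.
func RenderCurl(p HTTPTaskParams) []string {
	args := []string{"curl", "-i", "-X", p.Method}
	for name, value := range p.Headers {
		args = append(args, "-H", fmt.Sprintf("%s: %s", name, value))
	}
	if p.BodyFile != "" {
		// "-d @<path>" makes curl read the request body from a file,
		// e.g. one mounted from a ConfigMap.
		args = append(args, "-d", "@"+p.BodyFile)
	} else if p.Body != "" {
		args = append(args, "-d", p.Body)
	}
	return append(args, p.URL)
}
```

Keeping the result a flat argv like this is also what makes the round-trip parsing discussed later feasible, since every parameter shows up somewhere in the command line.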
102 | 103 | #### Advanced configuration 104 | 105 | And we could provide advanced configurations: 106 | 107 | - flags: custom flags that will be appended directly at the end of the command line 108 | - image: replace the default image `curlimages/curl:7.78.0`, useful in an air-gapped 109 | cluster 110 | 111 | #### Frontend component for this type of `Task` and preview of generated result 112 | 113 | We want to store the state in one or several `annotation`s of the templates, but that 114 | cannot be done in the near future. There is no embedded `metav1.ObjectMeta` 115 | inside the `Template`. We'd better treat `Template` and `WorkflowNode` like 116 | `PodTemplateSpec` and `Pod`. I think we should split the CRDs for workflow into 117 | another subgroup, then upgrade to `v1alpha2`. 118 | 119 | If we want to show the configuration form of the HTTP request in the frontend, we 120 | can only parse the command line of `curl`, and that is awful: it's hard to 121 | keep consistency between the original parameters and the parsed parameters. I would prefer 122 | NOT to do that. But this feature is very important for users modifying 123 | existing workflows, so I have to implement it. Other ways to implement 124 | this are welcome. 125 | 126 | The generated `Task` should show the preview instantly. 127 | 128 | ### New context variable: `http` 129 | 130 | Although we could use `stdout` to collect the output of `curl`, it would be meaningless 131 | not to provide the parsed HTTP context for the following conditional 132 | branches. 133 | 134 | So we will provide another context variable called `http` for introducing the 135 | parsed HTTP context. It contains a much-simplified, commonly used view of the HTTP request 136 | and response, easy to use in the conditions. 137 | 138 | #### Structure of `http` 139 | 140 | ```golang 141 | type ContextVariableHTTP struct { 142 | Request HTTPRequest `json:"request"` 143 | Response HTTPResponse `json:"response"` 144 | } 145 | type HTTPRequest struct { 146 | Method string `json:"method"` 147 | URL string `json:"url"` 148 | Header http.Header `json:"header"` 149 | Body []byte `json:"body"` 150 | } 151 | type HTTPResponse struct { 152 | StatusCode int `json:"statusCode"` 153 | Header http.Header `json:"header"` 154 | Body []byte `json:"body"` 155 | } 156 | 157 | ``` 158 | 159 | #### Example conditions 160 | 161 | Response returns 20x: 162 | 163 | `http.response.statusCode >= 200 && http.response.statusCode < 300` 164 | 165 | Response returns 50x: 166 | 167 | `http.response.statusCode >= 500` 168 | 169 | Users could select these fields from the context variable `http` and use them in the 170 | conditions. 171 | 172 | ## Drawbacks 173 | 174 | There are several things that have bothered me for a long time; please leave comments if 175 | you have any ideas! 176 | 177 | ### parsing curl command inline back into the parameters 178 | 179 | Here is one concern about "parsing curl command inline back into the 180 | parameters". 181 | 182 | Original demands: 183 | 184 | - restore the parameters of the HTTP request, for later display and modification in the 185 | workflow view. 186 | - restore the parameters of the HTTP request, for the context variable 187 | `http.request` 188 | 189 | Expected implementation: 190 | 191 | Store all the parameters in `annotation`s, with a specific prefix like 192 | `chaos-mesh.org/workflow.http.method=GET` 193 | 194 | Actual implementation: 195 | 196 | Because the struct `Template` in WorkflowSpec does not embed `metav1.Object`, there 197 | is nowhere to place the annotations.
So we have to parse the rendered "curl 198 | command line". We need a lot of work and tests to keep the parsed result 199 | consistent with the original parameters. 200 | 201 | ### parser of context variable `http` 202 | 203 | I want to parse the output of `curl` with the flag `-i`. It will print both the response 204 | header and body to `stdout`. So we need a parser, with a lot of test cases covering 205 | `curl`'s output and the expected `HTTPResponse`. 206 | 207 | And regarding `-L` and HTTP `301`/`302`, I think keeping only the last response is 208 | the right way. 209 | 210 | Another thing is that the `stdout` context variable contains not only `stdout`, 211 | but also `stderr`. It might bring confusion for users. But Kubernetes does not 212 | provide a way to split `stdout` and `stderr` in the `log` subresource of a 213 | `Pod`, and we cannot be expected to implement a collector for each container runtime. 214 | It will not break anything yet, but it does bring a little mess. Maybe we need a 215 | rename, from `stdout` to `log`. 216 | 217 | ## Alternatives 218 | 219 | ### Alternative solution 1: New type of WorkflowNode/Template for sending HTTP request 220 | 221 | I really do want to add some annotations as the metadata for rendering this Template, 222 | but I will not turn them into "official" fields of the CRD. The current CRD is powerful 223 | enough for describing the behavior. 224 | 225 | ## Unresolved questions 226 | 227 | - This implementation is not so friendly to users working with the pure command line and 228 | YAML files. 229 | -------------------------------------------------------------------------------- /text/2020-10-22-authn-and-authz-on-chaos-dashboard.md: -------------------------------------------------------------------------------- 1 | # Authentication and authorization on chaos dashboard 2 | 3 | ## Summary 4 | 5 | We need authentication and authorization to restrict users' permissions and 6 | actions. 7 | 8 | ## Motivation 9 | 10 | Users can create, suspend, get, and delete chaos experiments via the chaos-dashboard. 11 | But the chaos-dashboard does NOT contain any features for permission management; users 12 | can do anything, as long as they can access the chaos-dashboard UI. It is a security 13 | issue. 14 | 15 | We need permission management for: 16 | 17 | - access control for **Resource**. 18 | - User A could create/get IO chaos experiments. 19 | - User B could create/get Network chaos experiments. 20 | - access control for **Action**. 21 | - User A could create/get chaos experiments. 22 | - User B could only get chaos experiments. 23 | 24 | ## Detailed design 25 | 26 | ### Login 27 | 28 | Users are asked for a `Service Account Token` to log in, like the Kubernetes dashboard: 29 | 30 | ![kubernetes login](../media/kubernetes-dashboard-login.png) 31 | 32 | ### Create new users 33 | 34 | The Chaos dashboard does NOT provide any features for creating users. System administrators 35 | should manually create a `ServiceAccount` with a certain username, then bind it to a `Role`. 36 | 37 | ### Implementation references 38 | 39 | Things to do: 40 | 41 | - the frontend asks users to input a token to log in 42 | - the frontend attaches the token while sending requests to the backend 43 | - the backend uses a certain token to create a new kube client 44 | - the backend needs to support multiple users 45 | 46 | > We could reference the auth module in kubernetes-dashboard while implementing this. 47 | 48 | We will provide some pre-set `Role`s, like: 49 | 50 | - Admin: could create/get any chaos experiments.
51 | - Viewer: could only get chaos experiments. 52 | 53 | System administrators could also create their own roles for advanced permission 54 | control. 55 | 56 | ### Advantages 57 | 58 | - This solution depends on Kubernetes RBAC, which reduces a lot of logic for permission management. 59 | - System administrators could change each user's permissions according to their requirements; 60 | it's very flexible. 61 | - Using Kubernetes RBAC also restricts permissions when using `kubectl`. 62 | 63 | ## Drawbacks 64 | 65 | - Users should understand basic concepts of Kubernetes RBAC. 66 | 67 | ## Alternatives 68 | 69 | - Implement a full-featured RBAC platform inside the chaos-dashboard. 70 | It's complex and high-cost, and it is NOT cloud-native at all. 71 | 72 | ## Unresolved questions 73 | 74 | None. 75 | -------------------------------------------------------------------------------- /text/2020-11-09-chaos-mesh-workflow.md: -------------------------------------------------------------------------------- 1 | # Workflow 2 | 3 | ## Summary 4 | 5 | A built-in workflow engine with frontend support is needed for Chaos Mesh. 6 | 7 | ## Motivation 8 | 9 | Running chaos experiments both in parallel and sequentially is important for 10 | simulating production errors. Chaos Mesh is shipped with a feature-rich chaos 11 | toolbox, and composing its tools together can make it much more powerful. With a workflow 12 | engine, an experiment can turn from a single chaos into a series of chaos with 13 | some external logic (like checking health). 14 | 15 | Previously, the `duration` and `scheduler` fields have enabled a very 16 | fundamental "workflow" feature, so after implementing the workflow engine, these 17 | fields should be deprecated. 18 | 19 | ## Detailed design 20 | 21 | The specification for a `Workflow` custom resource can be summarized in the 22 | following example: 23 | 24 | ```yaml 25 | spec: 26 | entry: serial-chaos 27 | templates: 28 | - name: networkchaos 29 | type: NetworkChaos 30 | duration: "30m" 31 | selector: 32 | labelSelectors: 33 | "component": "tikv" 34 | delay: 35 | latency: "90ms" 36 | correlation: "25" 37 | jitter: "90ms" 38 | - name: iochaos 39 | type: IOChaos 40 | duration: "10m" 41 | selector: 42 | labelSelectors: 43 | app: "etcd" 44 | volumePath: /var/run/etcd 45 | path: /var/run/etcd/**/* 46 | errno: 5 47 | percent: 50 48 | - name: parallel-chaos 49 | type: Parallel 50 | tasks: 51 | - ref: networkchaos 52 | - ref: iochaos 53 | - name: check 54 | type: Task 55 | task: 56 | image: 'alpine:3.6' 57 | command: ['/app'] 58 | branch: 59 | - when: "stdout == 'continue'" 60 | ref: serial-chaos 61 | - when: "stdout == 'continue'" 62 | ref: serial-chaos 63 | - name: serial-chaos 64 | type: Serial 65 | deadline: "30m" 66 | tasks: 67 | - ref: networkchaos 68 | - ref: iochaos 69 | - ref: parallel-chaos 70 | - type: Suspend 71 | duration: "30m" 72 | - ref: check 73 | ``` 74 | 75 | ### `templates` field 76 | 77 | The `templates` field describes all the templates. One of them is the entry of the 78 | workflow (which is specified by the `entry` field). `templates` is a slice 79 | of templates, each of which is a `*Chaos`, `Parallel`, `Serial` or `Task`. They 80 | are distinguished according to the `type` field. 81 | 82 | The following document will describe all these different kinds of `template`. 83 | 84 | #### `*Chaos` 85 | 86 | They are the same as the specifications of the already defined `*Chaos` types 87 | without the `scheduler` field.
When running into this part, the workflow engine 88 | should create a `*Chaos` resource, and delete it after the duration. If it 89 | deletes the resource successfully, this step finishes and the workflow continues. 90 | 91 | The created `*Chaos` resource should be a "common chaos", as Chaos Mesh doesn't 92 | support setting a duration without a scheduler. 93 | 94 | #### `Parallel` 95 | 96 | For a `Parallel` task, the engine should spawn these tasks at the same time, and 97 | it finishes iff all the tasks have finished. The `tasks` field is a list of tasks, 98 | each of which could be a reference to a template (by name), or an inline template. 99 | 100 | You can find an example of the inline template in the `Suspend` task of 101 | `serial-chaos`. 102 | 103 | If the `deadline` field is set, all the running child tasks will be killed and 104 | turn into the `DeadlineExceeded` status. 105 | 106 | #### `Serial` 107 | 108 | For a `Serial` task, the engine should run these tasks one by one. It enters the 109 | next step iff the former one succeeded. After all these steps have finished, this 110 | task turns into the finished phase. 111 | 112 | If one of the tasks fails or exceeds the deadline, the following tasks will not 113 | run and the current task will turn into the corresponding status directly. 114 | 115 | If the `deadline` field is set, all the running child tasks will be killed and 116 | turn into the `DeadlineExceeded` status. 117 | 118 | #### `Suspend` 119 | 120 | For a `Suspend` task, the engine should do nothing. After a period of time 121 | (specified by the `duration` field), this task turns into the finished phase. 122 | 123 | #### `Task` 124 | 125 | `Task` is a special kind of template to integrate users' processes into the 126 | workflow. In this step, users can set up a container (with a Kubernetes pod) 127 | to run their own process. This step finishes once the pod turns into the 128 | `PodSucceeded` phase. The description of the task could become more complicated with 129 | more fields if needed in the future. Every branch under the `branch` field will 130 | run parallelly iff its `when` expression returns true. There will be a lot of 131 | variables provided by the engine for use in the `when` expression, such as 132 | `stdout`, `stderr`... 133 | 134 | We will provide something like a "Downward API" so that users' processes can see the 135 | status of the workflow. Users could then build any conditions based on workflow status. 136 | 137 | Combining `Parallel` and `Serial` provides powerful expressiveness in task 138 | combination. However, as it can only represent series-parallel graphs, it's still 139 | a subset of a `dag`. 140 | 141 | #### Status 142 | 143 | The status of a `Workflow` and the transitions between different statuses could be 144 | the heart of a workflow engine.
Here is an example: 145 | 146 | ```yaml 147 | status: 148 | finishedAt: null 149 | startedAt: Thu, 05 Nov 2020 14:23:25 +0800 150 | phase: Running 151 | nodes: 152 | - name: entry 153 | displayName: serial-chaos 154 | type: Serial 155 | children: 156 | - entry[0] 157 | - entry[1] 158 | startTime: Thu, 05 Nov 2020 14:23:25 +0800 159 | deadline: 50m 160 | phase: Running 161 | spec: 162 | - ref: networkchaos 163 | - ref: iochaos 164 | - ref: parallel-chaos 165 | - type: Suspend 166 | duration: "30m" 167 | - ref: check 168 | - name: entry[0] 169 | displayName: networkchaos 170 | type: NetworkChaos 171 | startTime: Thu, 05 Nov 2020 14:23:25 +0800 172 | finishedAt: Thu, 05 Nov 2020 14:33:25 +0800 173 | duration: 10m 174 | phase: Succeeded 175 | spec: 176 | selector: 177 | labelSelectors: 178 | "component": "tikv" 179 | delay: 180 | latency: "90ms" 181 | correlation: "25" 182 | jitter: "90ms" 183 | - name: entry[1] 184 | displayName: parallel-chaos 185 | type: Parallel 186 | startTime: Thu, 05 Nov 2020 14:33:25 +0800 187 | deadline: 10m 188 | phase: Running 189 | children: 190 | - entry[1][0] 191 | - entry[1][1] 192 | spec: 193 | - ref: networkchaos 194 | - ref: iochaos 195 | - name: entry[1][0] 196 | displayName: networkchaos 197 | type: NetworkChaos 198 | startTime: Thu, 05 Nov 2020 14:33:25 +0800 199 | duration: 10m 200 | phase: Running 201 | spec: 202 | selector: 203 | labelSelectors: 204 | "component": "tikv" 205 | delay: 206 | latency: "90ms" 207 | correlation: "25" 208 | jitter: "90ms" 209 | - name: entry[1][1] 210 | displayName: iochaos 211 | type: IOChaos 212 | startTime: Thu, 05 Nov 2020 14:33:25 +0800 213 | duration: 10m 214 | phase: Running 215 | spec: 216 | selector: 217 | labelSelectors: 218 | app: "etcd" 219 | volumePath: /var/run/etcd 220 | path: /var/run/etcd/**/* 221 | errno: 5 222 | percent: 50 223 | ``` 224 | 225 | The `displayName` of a node is the corresponding template name. If it's inlined, 226 | the `displayName` will be generated by its parent template (e.g. 227 | `serial-chaos[3]`). 228 | 229 | ### Cronjob 230 | 231 | It has been proved to be a bad idea to manage the running logic and the cron 232 | scheduler in the same resource. From the practice in Chaos Mesh 0.x and 1.x, 233 | managing a "twophase" scheduler is really complicated and full of bugs. As the 234 | status of a `CronWorkflow` and a normal `Workflow` could be much different, 235 | we will split them into two CRDs. Here is an example of the spec of 236 | `CronWorkflow`: 237 | 238 | ```yaml 239 | spec: 240 | cron: "*/5 * * * *" 241 | entry: networkchaos 242 | templates: 243 | - name: networkchaos 244 | type: NetworkChaos 245 | duration: "30m" 246 | selector: 247 | labelSelectors: 248 | "component": "tikv" 249 | delay: 250 | latency: "90ms" 251 | correlation: "25" 252 | jitter: "90ms" 253 | status: 254 | activeWorkflow: 255 | - default/current-cron-job-xxxx 256 | - default/current-cron-job-zzzz 257 | lastScheduleTime: Thu, 05 Nov 2020 14:33:25 +0800 258 | ``` 259 | 260 | ### Implementation references 261 | 262 | `Argo` is a really great workflow engine with a workflow definition. It is a 263 | guideline for us to design and implement this feature. 264 | 265 | #### Reconciler logic 266 | 267 | We should manage "nodes" in the status, and the reconciler should act like a state 268 | machine for these "nodes". 269 | 270 | Creating nodes and updating phases could have side effects, such as 271 | creating/deleting containers or creating/deleting Chaos Mesh resources.
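To make the state-machine idea concrete, here is a minimal Go sketch of a single node transition. The `Node`/`NodePhase` types and the `step` function are hypothetical illustrations, not the actual controller code, which would reconcile the `Workflow` status via controller-runtime:

```go
// A minimal sketch of the node state machine, with hypothetical Node and
// NodePhase types; the real reconciler would work against the Workflow
// status stored in the custom resource.
package workflow

type NodePhase string

const (
	NodePending          NodePhase = "Pending"
	NodeRunning          NodePhase = "Running"
	NodeSucceeded        NodePhase = "Succeeded"
	NodeDeadlineExceeded NodePhase = "DeadlineExceeded"
)

type Node struct {
	Name     string
	Type     string // NetworkChaos, Serial, Parallel, Suspend, Task, ...
	Phase    NodePhase
	Children []string
}

// step advances one node by a single transition and reports whether the
// status changed; the side effects mentioned above would be attached to
// exactly these transitions.
func step(n *Node, childrenFinished, deadlineExceeded bool) bool {
	switch n.Phase {
	case NodePending:
		// Side effect: create the *Chaos resource or spawn the Task pod.
		n.Phase = NodeRunning
		return true
	case NodeRunning:
		if deadlineExceeded {
			// Side effect: kill all running children.
			n.Phase = NodeDeadlineExceeded
			return true
		}
		if childrenFinished {
			// Side effect: delete the created resources.
			n.Phase = NodeSucceeded
			return true
		}
	}
	return false
}
```

Tying every side effect to exactly one phase transition helps keep the reconciler idempotent when it is re-run over the same status.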
272 | 273 | #### Structure definition 274 | 275 | The `Workflow` structure could be really complicated. Defining it in detail with 276 | fields could result in code duplicated from the existing `*Chaos`. Alternatively, we could define 277 | `Workflow` as `unstructured`, or define the `templates` as 278 | `[]k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1.JSON` and decode 279 | it with the `mapstructure` pkg. 280 | 281 | Without defining the detailed type, we cannot get basic validation from 282 | kubebuilder, so we need to write a validating webhook for `Workflow`. 283 | 284 | ## Alternatives 285 | 286 | ### Provide Argo templates 287 | 288 | Providing Argo templates seems a good solution to integrate Chaos Mesh with a 289 | powerful workflow engine. From the technological aspect, it could be the best 290 | solution. However, Chaos Mesh has the ambition to become an integrated platform 291 | for chaos engineering, which should have a workflow engine inside, out of the 292 | box. 293 | 294 | Installing Argo for the users is also not a choice, because the CRD is not 295 | namespace-scoped; installing Argo along with Chaos Mesh could break an existing 296 | Argo in the users' cluster. 297 | 298 | ## Unresolved questions 299 | 300 | None, AFAIK. 301 | -------------------------------------------------------------------------------- /text/2021-03-11-unified-selector.md: -------------------------------------------------------------------------------- 1 | # Unified Selector 2 | 3 | ## Summary 4 | 5 | The whole controller framework should do more things for the implementations of 6 | chaos. Now, every implementation of chaos selects pods by itself. However, 7 | in order to track the running status, the controller framework should know the 8 | concrete status of every selected target, and it would be a disaster to record 9 | these statuses inside the implementation of every chaos :(. A unified selector 10 | framework in the controller framework could solve this problem. This RFC will 11 | talk about the design of this unified selector. 12 | 13 | ## Motivation 14 | 15 | There have been a lot of problems with the current selector. For example, the 16 | IOChaos should select volumes, but it selects containers now (and at first, it 17 | selected pods). Some chaos use `containerName string` to select a container while 18 | some use `containerNames []string` to select several containers at one time. If 19 | we can abstract selectors into one place, these errors won't happen. The 20 | developer of each chaos will not need to consider "selecting" or these platform- 21 | dependent things. 22 | 23 | Another benefit is that it could help the controller to track the status. With 24 | it, the controller would know which target (pod/container) has been injected and 25 | which has not. It's the first step towards the goal of a standalone `Schedule` 26 | CRD. 27 | 28 | ## Detailed Design 29 | 30 | Every chaos specification would define a function to get the specifications of its 31 | selectors: 32 | 33 | ```go 34 | type StatefulObjectWithSelector interface { 35 | v1alpha1.StatefulObject 36 | 37 | GetSelectorSpecs() map[string]interface{} 38 | } 39 | ``` 40 | 41 | The method `GetSelectorSpecs` will return a map from `string` to a selector 42 | specification (like `struct {v1alpha1.PodSelectorSpec, v1alpha1.PodMode, 43 | v1alpha1.Value}`). The key is the identifier of the selector, as there 44 | will be multiple selectors in one chaos specification, for example the `.` and
The controller will iterate this map to 46 | select every `SelectSpec`. It will construct a unified selector first, and then 47 | use this selector to `Select` targets. The unified selector may contain a lot of 48 | implementation inside. The construction would be like: 49 | 50 | ```go 51 | selector := selector.New(selector.SelectorParams{ 52 | PodSelector: pod.New(r.Client, r.Reader, config.ControllerCfg.ClusterScoped, config.ControllerCfg.TargetNamespace, config.ControllerCfg.AllowedNamespaces, config.ControllerCfg.IgnoredNamespaces), 53 | ContainerSelector: container.New(r.Client, r.Reader, config.ControllerCfg.ClusterScoped, config.ControllerCfg.TargetNamespace, config.ControllerCfg.AllowedNamespaces, config.ControllerCfg.IgnoredNamespaces), 54 | }) 55 | ``` 56 | 57 | With the help of dependency injection, it would be constructed easier. The 58 | `Selector` method of `PodSelector` would be like: 59 | 60 | ```go 61 | func (impl *PodSelector) Select(ctx context.Context, ps *v1alpha1.PodSelector) ([]interface{}, error) 62 | ``` 63 | 64 | The type of second parameter will decide which selector to use. For example, if 65 | you want to select with `*v1alpha1.PodSelector`, then the unified selector will 66 | use `*PodSelector`, and if you want to select with 67 | `*v1alpha1.ContainerSelector`, then the unified selector will use 68 | `*ContainerSelector` to select. 69 | 70 | The definition of the unified selector would be: 71 | 72 | ```go 73 | func (s *Selector) Select(ctx context.Context, spec interface{}) ([]interface{}, error) 74 | ``` 75 | 76 | The controller would construct the `Selector` first, and then use the `Selector` 77 | to select the chaos targets. However, as the target may have multiple type (it 78 | could be a pod, a container, a volume or an AWS machine), we can only return an 79 | `interface{}`, and the implementation of chaos would assert the type by 80 | themselves. 81 | 82 | After selecting every `selectSpecs`, the controller will iterate over selected 83 | items, and call `Apply`/`Recover` for them. All selected items, current item and 84 | the identifier of the `selectorSpec` of current item will be passed into the 85 | `Apply`/`Recover` function. 86 | 87 | ## Alternatives 88 | 89 | This is a RFC about internal design, and there is little choice. If you have 90 | better idea, please comment. 91 | -------------------------------------------------------------------------------- /text/2021-07-23-extensible-chaosctl.md: -------------------------------------------------------------------------------- 1 | # Extensible Chaosctl 2 | 3 | ## Summary 4 | 5 | A tool to control the status of Chaos Mesh as much as possible in an extensible 6 | way. 7 | 8 | ## Motivation 9 | 10 | Currently, the chaosctl is a debug tool to collect logs and other information 11 | from several kinds of chaos. However, some of its functions are implemented by 12 | executing commands in the namespace of target pods through the chaos-daemon. 13 | 14 | For example, in the iochaos debug part, the chaosctl executes `cat /proc/mounts` 15 | [command](https://github.com/chaos-mesh/chaos-mesh/blob/4b8fb5ba1518fda0d144c8df9239dcb0381ff485/pkg/chaosctl/debug/iochaos/iochaos.go#L54) 16 | straightly in the namepsace of target pods, which is dangerous and causes 17 | develpment difficulties in the long time. 18 | 19 | To implement more features for chaosctl, we must refactor it for more security 20 | and extensibility. 
21 | 22 | ## Detailed Design 23 | 24 | For more security, functions like reading file or listing processes should be 25 | implemented in the server side. In other words, we need an API server, and make 26 | the chaosctl a pure client. 27 | 28 | So, what APIs do we needs? To implement the features that current chaosctl 29 | supports, maybe we can provide restful APIs like `/networkchaos/iptables` or 30 | `/iochaos/mounts`. However, in some cases we may need `/networkchaos/iptables` 31 | and in other cases we need `/networkchaos/ipset` or both of them, should we 32 | provide API for each of them to avoid unsued information? 33 | 34 | > I think the unsued information will reduce the debug efficiency very much. 35 | 36 | On the other hand, there are too many kinds of resources related with each 37 | other: the scheduler, the xxxchaos, the podcxxxchaos, the pod, the container and 38 | the process, how can we provide all APIs with their arrangement? 39 | 40 | > For example, how can we provide both of `/xxxchaos/podchaos` and 41 | > `/podchaos/xxxchaos` conveniently? 42 | 43 | To resolve these problems, we need structural APIs. 44 | 45 | ### Nested Resources 46 | 47 | Nested resources, also called subresources, may accomplish our goals. For 48 | example, if we register the struture of `networkchaos` resources, its 49 | subresources like `iptables` and `ipset` will be registerd automatically. We can 50 | access the networkchaos resources by path `/networkchaos` and access its 51 | subresources by path `/networkchaos/iptables` or `/networkchaos/ipset`. 52 | 53 | ```go 54 | type NetWorksChaos struct { 55 | Iptables []*IptablesRule 56 | Ipset []*IpsetRule 57 | } 58 | ``` 59 | 60 | However, there are two main drawbacks of this solution. Firstly, it can not 61 | support cascade queries, we must regard `/networkchaos` and 62 | `/networkchaos/` as different resources. Secondly, there is almost no 63 | library to conveniently build nested API servers in the ecosystem of golang. 64 | 65 | ### GraphQL 66 | 67 | The GraphQL is one of the most famous solutions for structural APIs. In above 68 | case, we can easy fetch iptables resources only by query `{ networkchaos 69 | {iptables } }`. However, the main drawback of this solution is that queries are 70 | not convenient to edit in cli, we need translation between resource paths and 71 | GraphQL queries. 72 | 73 | For example, we can translate resource path `/networkchaos/iptables` to query 74 | `{ networkchaos { iptables } }`. 75 | 76 | ### API Server 77 | 78 | We can choose one of the nested resources and GraphQL as our API solution, I 79 | prefer GraphQL as its ecosystem is better than nested resources. 80 | 81 | Moreover, server should provide additional resources like `logs`. 82 | 83 | ### Client 84 | 85 | #### Usage 86 | 87 | - identify resources 88 | 89 | We can identify target resources by path. For example, the path 90 | `/networkchaos//podchaos/` will identify the podnetworkchaos 91 | with `id=` owned by networkchaos with `id=`. If the `id` is 92 | ignore and resource type is plural, like `/networkchaoses/podchaoses`, all 93 | resources belonging to these kinds will be identified. 94 | 95 | - `show` and `desc` 96 | 97 | We provide two basic debug subcommands, `get` and `desc`. The `get` command only 98 | shows key information like `name` or `id` of resources while `desc` command 99 | shows full information. 100 | 101 | - `delete` 102 | 103 | We will provide `delete` subcommand, witch will delete identified resources. 
104 | 
105 | #### Translation
106 | 
107 | If we choose GraphQL as the API solution, we must translate paths into queries.
108 | The plural subpaths will be translated into GraphQL fragments without any
109 | parameters. For example, the resource path `/networkchaoses/iptableses` will be
110 | translated into the query `{ networkchaos { iptables } }`.
111 | 
112 | And the cascade subpaths will be translated into fragments with parameters. For
113 | example, the resource path `/networkchaos/<name>/iptables/<pod-name>` will be
114 | translated into the following query.
115 | 
116 | ```GraphQL
117 | # { "name": "<name>", "podName": "<pod-name>" }
118 | query GetIptables($name: String!, $podName: String!) {
119 |   networkchaos(name: $name) {
120 |     iptables(name: $podName)
121 |   }
122 | }
123 | ```
124 | 
125 | #### Auto-Completion
126 | 
127 | We can improve auto-completion with the schema and data. For example, when the
128 | user types `chaosctl get /network`, we can complete the command to `chaosctl get
129 | /networkchaos/` or `chaosctl get /networkchaoses` using the schema. Then, if the user
130 | chooses `chaosctl get /networkchaos/`, we can send the query
131 | `{ networkchaos { name } }` and complete the command to
132 | `chaosctl get /networkchaos/<name>` using the result.
133 | 
134 | ## Alternatives
135 | 
136 | Implement all APIs one by one as requirements arise.
137 | 
--------------------------------------------------------------------------------
/text/2021-08-11-implement-physical-machine-chaos.md:
--------------------------------------------------------------------------------
 1 | # Implement Physical Machine Chaos in Chaos Mesh
 2 | 
 3 | ## Background
 4 | 
 5 | Now we have implemented some chaos in [Chaosd](https://github.com/chaos-mesh/chaosd),
 6 | which is used to inject faults into physical machines. It's a simple command-line
 7 | tool, and it can also work as a server. It lacks a UI (just like the Chaos Mesh Dashboard)
 8 | to create and manage experiments, and it can't orchestrate experiments like
 9 | the [workflow](https://chaos-mesh.org/docs/create-chaos-mesh-workflow) does.
10 | 
11 | ## Proposal
12 | 
13 | Implement physical machine chaos in Chaos Mesh, and then we can reuse the
14 | Dashboard and workflow for it.
15 | 
16 | There are two ways to implement it:
17 | 
18 | ### 1. Treat physical machines as a new selector
19 | 
20 | In this way, we will reuse the chaos types in Chaos Mesh, and implement a new
21 | selector to choose physical machines.
22 | 
23 | For example, below is a YAML config for NetworkChaos:
24 | 
25 | ```YAML
26 | apiVersion: chaos-mesh.org/v1alpha1
27 | kind: NetworkChaos
28 | metadata:
29 |   name: network-delay
30 |   namespace: busybox
31 | spec:
32 |   action: delay
33 |   mode: all
34 |   selector:
35 |     namespaces:
36 |       - busybox
37 |   delay:
38 |     latency: "5s"
39 | ```
40 | 
41 | It will inject a network delay on all pods in the busybox namespace.
42 | 
43 | For physical machines, we can implement a new selector; the YAML config may look
44 | like:
45 | 
46 | ```YAML
47 | apiVersion: chaos-mesh.org/v1alpha1
48 | kind: NetworkChaos
49 | metadata:
50 |   name: network-delay
51 |   namespace: chaos-testing
52 | spec:
53 |   action: delay
54 |   selector:
55 |     physicalMachines:
56 |       - 123.123.123.123:123
57 |       - 124.124.124.124:124
58 |   delay:
59 |     latency: "5s"
60 | ```
61 | 
62 | We replace the selector `namespaces` with `physicalMachines`, and
63 | `123.123.123.123:123` and `124.124.124.124:124` are the addresses of Chaosd servers.
64 | 
65 | #### Advantage of proposal 1
66 | 
67 | The physical machine experiments and K8s experiments are unified; the only
68 | difference is the selector.
69 | 
70 | #### Disadvantage of proposal 1
71 | 
72 | Higher implementation costs.
73 | 
74 | * We know that Chaos Mesh is designed for K8s, and too much code is coupled
75 | to K8s.
76 | * The chaos type is related to a selector, which means each chaos has only one
77 | target type. For example, DNSChaos injects faults into some containers, and NetworkChaos
78 | injects faults into some pods. This means we need to implement the physical machine
79 | selector for every chaos.
80 | * The config and implementation of the same chaos type for K8s and physical
81 | machines are different, like DNSChaos, JVMChaos, etc. It is difficult to unify them.
82 | 
83 | ### 2. Treat chaos on the physical machine as a new chaos type
84 | 
85 | Implement physical machine chaos as a new chaos type in Chaos Mesh. Add a new
86 | CRD named `PhysicalMachineChaos`; the config includes:
87 | 
88 | * action: the subtype of `PhysicalMachineChaos`; the action can be `stress-cpu`,
89 | `stress-mem`, `network-delay`, `network-loss` and so on.
90 | * address: the addresses of the chaosd servers.
91 | * config related to the action: for example, for the `stress-cpu` action, we need to
92 | set `load` and `workers`.
93 | 
94 | Here is the sample YAML config for network delay:
95 | 
96 | ```YAML
97 | apiVersion: chaos-mesh.org/v1alpha1
98 | kind: PhysicalMachineChaos
99 | metadata:
100 |   name: physical-network-delay
101 |   namespace: chaos-testing
102 | spec:
103 |   action: network-delay
104 |   address:
105 |     - http://172.16.112.130:31767
106 |   network-delay:
107 |     device: "ens33"
108 |     hostname: "baidu.com"
109 |     duration: "5s"
110 | ```
111 | 
112 | Here is the sample YAML config for CPU stress:
113 | 
114 | ```YAML
115 | apiVersion: chaos-mesh.org/v1alpha1
116 | kind: PhysicalMachineChaos
117 | metadata:
118 |   name: physical-stress-cpu
119 |   namespace: chaos-testing
120 | spec:
121 |   action: stress-cpu
122 |   address:
123 |     - http://172.16.112.130:31767
124 |   stress-cpu:
125 |     workers: 3
126 |     load: 10
127 | ```
128 | 
129 | #### Advantage of proposal 2
130 | 
131 | * Experiments for physical machines are relatively independent, and their
132 | implementation has no effect on other chaos types.
133 | * Low development cost.
134 | 
135 | #### Disadvantage of proposal 2
136 | 
137 | We need to define a lot of chaos subtypes in `PhysicalMachineChaos`, which is a
138 | bit complicated.
139 | 
140 | ### Summary
141 | 
142 | I prefer to develop through the second option: keeping the physical machine
143 | experiments separate is more flexible.
144 | 
--------------------------------------------------------------------------------
/text/2021-09-09-physical-machine-auth.md:
--------------------------------------------------------------------------------
 1 | # Physical Machine Auth
 2 | 
 3 | ## Summary
 4 | 
 5 | A PhysicalMachineChaos-based auth solution, including end-user authorization
 6 | and service-to-service authentication.
 7 | 
 8 | ## Motivation
 9 | 
10 | Chaos Dashboard's authentication and authorization scheme is only applicable
11 | to single-cluster Kubernetes, and the Chaos Mesh platform needs to be adapted
12 | to more scenarios, such as multi-cluster Kubernetes, physical machines, cloud
13 | infrastructure, etc.
14 | 
15 | ### Goals
16 | 
17 | - end-user authorization on physical machine chaos
18 | - service-to-service authentication on physical machine chaos
19 | 
20 | ### Non-Goals
21 | 
22 | - end-user authentication on physical machine chaos (the same as other chaos
23 | types, which use a token)
24 | 
25 | ## Detailed design
26 | 
27 | ### End-user authorization
28 | 
29 | #### New Custom Resource: `PhysicalMachine`
30 | 
31 | Because there is no concept of a pod to filter the target physical machine
32 | when performing chaos injection on a physical machine, a resource needs to
33 | be introduced to represent the physical machine, which is the new custom
34 | resource `PhysicalMachine`. Here is a sample of `PhysicalMachine`:
35 | 
36 | ```yaml
37 | apiVersion: chaos-mesh.org/v1alpha1
38 | kind: PhysicalMachine
39 | metadata:
40 |   name: pm-172.16.112.130
41 |   namespace: chaos-testing
42 |   labels:
43 |     chaos-mesh/physical-machine-group: abc
44 |     kubernetes.io/arch: arm64
45 |     kubernetes.io/os: linux
46 | spec:
47 |   address: https://172.16.112.130:31767
48 | ```
49 | 
50 | #### Changes to `PhysicalMachineChaos`
51 | 
52 | Remove the `address` field in `PhysicalMachineChaos`, and use the `selector`
53 | field to filter the injection range of the experiment.
54 | 
55 | ```yaml
56 | apiVersion: chaos-mesh.org/v1alpha1
57 | kind: PhysicalMachineChaos
58 | metadata:
59 |   name: physical-network-delay
60 |   namespace: chaos-testing
61 | spec:
62 |   action: network-delay
63 |   network-delay:
64 |     device: "ens33"
65 |     hostname: "baidu.com"
66 |     duration: "5s"
67 |   selector:
68 |     namespaces:
69 |       - testA
70 |     labelSelectors:
71 |       chaos-mesh/physical-machine-group: abc
72 | ```
73 | 
74 | #### RBAC
75 | 
76 | According to the above design, there is no need to change the existing
77 | `Role`/`ClusterRole` content, because access to `PhysicalMachine` is
78 | included in the api group of `chaos-mesh.org`.
79 | 
80 | We have two different ranges of access rights:
81 | 
82 | - clustered scope: permissions for `PhysicalMachines` in all namespaces,
83 | so you can experiment with all physical machines.
84 | - namespaced scope: permissions for the `PhysicalMachines` in the current
85 | namespace, so you can experiment with multiple physical machines in that
86 | namespace.
87 | 
88 | ### service-to-service authentication
89 | 
90 | To secure the service-to-service communication, establish an mTLS connection
91 | between chaos-controller-manager and chaosd.
92 | 
93 | The following certificates are required:
94 | 
95 | - CA certificate: generated when deploying `Chaos Mesh` with `helm` or
96 | `install.sh`, and saved in the secret named
97 | `chaos-mesh-controller-manager-client-certs` (and on each physical machine)
98 | - certificate of the chaos-controller-manager side: generated when deploying
99 | `Chaos Mesh` with `helm` or `install.sh`, and saved in the secret named
100 | `chaos-mesh-controller-manager-client-certs`
101 | - certificate of the chaosd side: generated automatically or manually when adding
102 | physical machine information to the cluster, saved on each physical machine
103 | 
104 | #### Use Cases
105 | 
106 | Here is a description of three different usage scenarios.
107 | 
108 | #### Case 1: Automatically generate certificates
109 | 
110 | Prerequisites:
111 | 
112 | 1. User deployed `Chaos Mesh` in a Kubernetes cluster with security mode
113 | 1. The node executing the `chaosctl` command can ssh to
114 | the target physical machine
115 | 1. 
The node executing the `chaosctl` command can access the Kubernetes cluster
116 | 
117 | Steps:
118 | 
119 | 1. User prepares the physical machine using `chaosctl`; the command might be
120 | `chaosctl physical-machine init --server=127.0.0.1 --port=2333`. In this step,
121 | `chaosctl` generates the certificates on the physical machine side, copies
122 | all the required certificates to the target physical machine, and then creates the
123 | `PhysicalMachine` CR in the Kubernetes cluster (BTW, `chaosctl pm` can be used
124 | instead of `chaosctl physical-machine`)
125 | 1. User starts the chaosd service on the physical machine
126 | 1. User creates a physical machine experiment on the dashboard
127 | 1. Chaos-controller-manager establishes an mTLS connection when requesting
128 | the chaosd service on the physical machine
129 | 
130 | #### Case 2: Manually generate certificates
131 | 
132 | Prerequisites:
133 | 
134 | 1. User deployed `Chaos Mesh` in a Kubernetes cluster with security mode
135 | 1. The node executing the `chaosctl physical-machine create` command can
136 | access the Kubernetes cluster
137 | 
138 | Steps:
139 | 
140 | 1. User copies the CA certificate from the Kubernetes cluster
141 | to the physical machine
142 | 1. User uses `chaosctl` to generate the certificates on the physical machine;
143 | the command might be `chaosctl physical-machine generate`
144 | 1. User uses `chaosctl` to create the `PhysicalMachine` resource in the Kubernetes
145 | cluster; the command might be
146 | `chaosctl physical-machine create --server=127.0.0.1 --port=2333`
147 | 1. User starts the chaosd service on the physical machine
148 | 1. User creates a physical machine experiment on the dashboard
149 | 1. Chaos-controller-manager establishes an mTLS connection when requesting
150 | the chaosd service on the physical machine
151 | 
152 | #### Case 3: Without mTLS authentication (Not recommended)
153 | 
154 | Prerequisites:
155 | 
156 | 1. User deployed `Chaos Mesh` in a Kubernetes cluster without security mode
157 | 
158 | Steps:
159 | 
160 | 1. User uses `chaosctl` to create the `PhysicalMachine` resource in the Kubernetes
161 | cluster; the command might be
162 | `chaosctl physical-machine create --server=127.0.0.1 --port=2333 --protocol=http`
163 | 1. User starts the chaosd service on the physical machine
164 | 1. User creates a physical machine experiment on the dashboard
165 | 1. Chaos-controller-manager will use HTTP to request the chaosd service
166 | 
167 | ## Drawbacks
168 | 
169 | Because `PhysicalMachine` is a namespaced resource, users can create
170 | the same physical machine information in different namespaces, which may
171 | cause duplicate injection problems, and it requires the chaosd service
172 | API to be idempotent.
173 | 
174 | ## Alternatives
175 | 
176 | NA
177 | 
178 | ## Unresolved questions
179 | 
180 | ### No support for a role to access multiple specified `PhysicalMachines`
181 | 
182 | One option considered is to directly use the `resourceNames` field in
183 | Kubernetes RBAC to control access by specifying the `name` field of the
184 | PhysicalMachine resource. However, in practice, we found that `resourceNames`
185 | only supports the `GET` and `DELETE` APIs, not the `LIST`, `WATCH`, `CREATE` and
186 | `DELETECOLLECTION` APIs, which makes it impossible for users to select the
187 | physical machines they want when operating on the dashboard. If there is a
188 | practical need, we may use `OPA`, `gatekeeper` and other policy frameworks to
189 | control the fine-grained permissions more easily.
190 | 
--------------------------------------------------------------------------------
/text/2021-09-27-refine-error-handling.md:
--------------------------------------------------------------------------------
 1 | # Refine Error Handling
 2 | 
 3 | ## Background
 4 | 
 5 | There are a lot of different error handling patterns in Chaos Mesh. Some of the
 6 | errors in the code are equipped with a backtrace, while some are not. The mess
 7 | in error handling has been blocking us from diagnosing problems and handling
 8 | potential errors. The Chaos Mesh code base imports `github.com/pingcap/errors`,
 9 | `github.com/pkg/errors` and the `errors` package in the standard library.
10 | 
11 | ## Proposal
12 | 
13 | Most of the error handling patterns follow the [uber
14 | go-style](https://github.com/uber-go/guide/blob/master/style.md) guide; some rules
15 | are modified or removed because of the latest updates of `pkg/errors` and
16 | `errors` in the standard library.
17 | 
18 | ### Error Types
19 | 
20 | When returning errors, consider the following to determine the best choice:
21 | 
22 | 1. Does the client need to extract the information in the error? If so, you should
23 | use a custom type and implement the `Error()` method. You should also wrap it
24 | with `errors.WithStack` to make sure it has a backtrace.
25 | 
26 | Example:
27 | 
28 | ```go
29 | type ErrNotFound struct {
30 | 	File string
31 | }
32 | 
33 | func open(file string) error {
34 | 	return errors.WithStack(ErrNotFound{File: file})
35 | }
36 | ```
37 | 
38 | Then the user will be able to use `errors.As` to extract the `ErrNotFound`
39 | error and get the `File` field. However, you should think **carefully** about
40 | whether this field is necessary for the caller to handle the error. In the former
41 | example, including the `File` in the error is really a **bad** case:
42 | 
43 | 1. The caller actually knows the file, so it doesn't provide more information.
44 | 2. The caller doesn't need to know the file to handle the error in most cases.
45 | 
46 | In this case, a simple error variable with a wrap is preferred, e.g.
47 | 
48 | ```go
49 | var ErrNotFound = errors.New("not found")
50 | 
51 | func open(file string) error {
52 | 	return errors.Wrapf(ErrNotFound, "open file %s", file)
53 | }
54 | ```
55 | 
56 | 2. Is it an error which needs to be detected? If so, you should use a global
57 | public variable to store the error, and wrap it with `errors.WithStack` when
58 | returning the error.
59 | 
60 | Example:
61 | 
62 | ```go
63 | var (
64 | 	ErrPodNotFound = errors.New("pod not found")
65 | 
66 | 	ErrPodNotRunning = errors.New("pod not running")
67 | )
68 | 
69 | func handle() error {
70 | 	return errors.WithStack(ErrPodNotFound)
71 | }
72 | ```
73 | 
74 | 3. A combination of rules 1 and 2. The context information is needed by the
75 | caller, but it's also widely used by multiple functions. The variable name
76 | could be more detailed than the type.
77 | 
78 | Example:
79 | 
80 | ```go
81 | type ContainerRuntimeClientConnectErr struct {
82 | 	ContainerRuntime string
83 | }
84 | 
85 | var DockerConnectErr = ContainerRuntimeClientConnectErr{
86 | 	ContainerRuntime: "docker",
87 | }
88 | ```
89 | 
90 | 4. Is this an error which will not be detected but appears in a lot of
91 | functions? If so, you should use a global private variable, and wrap it with
92 | `errors.WithStack`. If you want to share the same error (like `Not Found`) in
93 | multiple packages, but the callers will never need to detect the `Not Found`,
94 | please create a new `notFound` error under every package.
95 | 
96 | ```go
97 | var errNotFound = errors.New("not found")
98 | ```
99 | 
100 | 5. Is this really a simple error that will not appear in other functions? If so,
101 | you should use an inline `errors.New`. The `errors.New` in `pkg/errors` is
102 | already equipped with a stack backtrace, so you don't need to add it again.
103 | 
104 | Example:
105 | 
106 | ```go
107 | func open(file string) error {
108 | 	return errors.New("not found")
109 | }
110 | ```
111 | 
112 | 6. Are you propagating an error returned by other functions? If it's returned by
113 | a function inside Chaos Mesh, we can assume this error is already
114 | equipped with a stack backtrace, so there is no need to call
115 | `errors.WithStack`. However, if it's returned by another library, we should
116 | call `errors.WithStack` to equip it with a stack backtrace. For more
117 | information, see the section on error wrapping.
118 | 
119 | ```go
120 | func startProcess(cmd *exec.Cmd) error {
121 | 	err := cmd.Start()
122 | 	if err != nil {
123 | 		return errors.WithStack(err)
124 | 	}
125 | 
126 | 	return nil
127 | }
128 | ```
129 | 
130 | ### Error Wrapping
131 | 
132 | * Return the original error if there is no additional context to add. If the
133 | original error is not equipped with a stack, return it with `errors.WithStack`.
134 | * Add context using
135 | [`"pkg/errors".Wrap`](https://pkg.go.dev/github.com/pkg/errors#Wrap) so that
136 | the error message provides more context.
137 | 
138 | ```go
139 | var ErrNotFound = errors.New("not found")
140 | 
141 | func open(file string) error {
142 | 	return errors.Wrapf(ErrNotFound, "open file %s", file)
143 | }
144 | ```
145 | 
146 | * Use [`"pkg/errors".Errorf`](https://pkg.go.dev/github.com/pkg/errors#Errorf)
147 | if the callers do not need to detect or handle that specific error case.
148 | 
149 | The [`"pkg/errors".Wrap`](https://pkg.go.dev/github.com/pkg/errors#Wrap) and
150 | [`"pkg/errors".Errorf`](https://pkg.go.dev/github.com/pkg/errors#Errorf) will
151 | add a stack trace, so you don't need to wrap it with `WithStack` again.
152 | 
153 | The context usually includes: what you are doing, and the object of the operation
154 | (like the pod name, the chaos name...).
155 | 
156 | ```go
157 | func startProcess(cmd *exec.Cmd) error {
158 | 	err := cmd.Start()
159 | 	if err != nil {
160 | 		return errors.Errorf("start process: %w", err)
161 | 	}
162 | 
163 | 	return nil
164 | }
165 | ```
166 | 
167 | A more realistic example in `chaos-daemon` is:
168 | 
169 | ```go
170 | func (s *DaemonServer) ExecStressors(ctx context.Context,
171 | 	req *pb.ExecStressRequest) (*pb.ExecStressResponse, error) {
172 | 	...
173 | 
174 | 	control, err := cgroups.Load(daemonCgroups.V1, daemonCgroups.PidPath(int(pid)))
175 | 	if err != nil {
176 | 		return nil, errors.Wrapf(err, "load cgroup of pid %d", pid)
177 | 	}
178 | 
179 | 	...
180 | }
181 | ```
182 | 
183 | When adding context to returned errors, keep the context succinct by avoiding
184 | phrases like "failed to", which state the obvious and pile up as the error
185 | percolates up through the stack:
186 | 
187 | <table>
188 | <thead><tr><th>Bad</th><th>Good</th></tr></thead>
189 | <tbody>
190 | <tr><td>
191 | 
192 | ```go
193 | s, err := store.New()
194 | if err != nil {
195 | 	return errors.Errorf(
196 | 		"failed to create new store: %w", err)
197 | }
198 | ```
199 | 
200 | </td><td>
201 | 
202 | ```go
203 | s, err := store.New()
204 | if err != nil {
205 | 	return errors.Errorf(
206 | 		"new store: %w", err)
207 | }
208 | ```
209 | 
210 | </td></tr>
211 | </tbody>
212 | </table>
213 | 
214 | ### Error Handling
215 | 
216 | The key to error handling is to identify whether an error is an expected one.
217 | For example, if Chaos Mesh is recovering an injection, but Kubernetes
218 | returns `PodNotFound`, we need to "handle" it by just ignoring it.
219 | 
220 | Identifying errors can involve two different situations: if the target error is a
221 | type, you can use `errors.As` to extract the error information, and if the
222 | target error is a variable (created by `errors.New`, in most situations), you
223 | can use `errors.Is`.
224 | 
225 | The `error` type in Go is really messy: a type and a variable can both
226 | represent a kind of error, and you should treat them with different patterns. If it's a
227 | customized error type, it can be verified through `errors.As`.
228 | 
229 | Example:
230 | 
231 | ```go
232 | type ContainerRuntimeClientConnectErr struct {
233 | 	ContainerRuntime string
234 | }
235 | 
236 | var crccErr ContainerRuntimeClientConnectErr
237 | if errors.As(err, &crccErr) {
238 | 	fmt.Println(crccErr.ContainerRuntime)
239 | }
240 | ```
241 | 
242 | If this kind of error is a variable, it can be checked through `errors.Is`:
243 | 
244 | Example:
245 | 
246 | ```go
247 | var DockerConnectErr = ContainerRuntimeClientConnectErr{
248 | 	ContainerRuntime: "docker",
249 | }
250 | 
251 | if errors.Is(err, DockerConnectErr) {
252 | 	fmt.Println("Docker Connect Error!")
253 | }
254 | ```
255 | 
256 | If this kind of error is not exported (e.g. an inline error or an unexported
257 | variable/type), and you really need to detect it, please modify the callee
258 | function to export the error.
259 | 
260 | #### Handling Error from Kubernetes
261 | 
262 | The error from the Kubernetes client is usually derived from an HTTP status. The
263 | only way to identify them is to use `k8sError.Is*`, e.g. `k8sError.IsNotFound`.
264 | 
265 | Wrapping an error from Kubernetes is fine, as the `k8sError.Is*` will finally
266 | use `errors.As` to extract the information from the error.
267 | 
268 | #### Handling Error from Grpc
269 | 
270 | The error returned from a gRPC call is also too simple, and loses the
271 | hierarchical error stack. The only way to identify an error returned from a gRPC
272 | call is to use the `IsXXX` functions provided by the callee. For example:
273 | 
274 | ```go
275 | _, err = pbClient.RecoverTimeOffset(ctx, &pb.TimeRequest{
276 | 	ContainerId: containerId,
277 | })
278 | if err != nil {
279 | 	if chaosdaemonerr.IsContainerNotFound(err) {
280 | 		// ignore the container not found error
281 | 		return v1alpha1.NotInjected, nil
282 | 	}
283 | 	return v1alpha1.Injected, errors.WithStack(err)
284 | }
285 | ```
286 | 
287 | Wrapping an error from gRPC is also fine; the `IsContainerNotFound` function
288 | should handle the situation where this error is wrapped multiple times.
289 | 
290 | ### The End of Error
291 | 
292 | All errors will finally be consumed by something. In this section, we will
293 | discuss what you should do when an error is not fully handled (fully handling
294 | all errors is not possible in most situations, as you don't know the full set
295 | of error kinds in Go).
296 | 
297 | #### Before the logger is set up
298 | 
299 | 1. If the logger hasn't been set up and the error is not bearable, you should call
300 | `log.Fatal(err)`.
301 | 2. If the logger hasn't been set up and the error is acceptable, you should print
302 | it with `fmt.Printf("%v", err)`, as sketched below.
303 | 
304 | #### Inside a reconciler
305 | 
306 | Use `r.Log.Error` to print the error, with some context, and use
307 | `r.Recorder.Event` to send a Kubernetes event if needed. The `recorder.Failed` is
308 | a simple representation of an error event.
309 | 
310 | #### Inside a grpc function implementation
311 | 
312 | Make sure the error is printed in the log, with the stack information (which is
313 | the default behavior for the zapr error log).
314 | 
315 | 1. If the error doesn't need to be identified by the client, you can simply return
316 | it (and it will become an `Unknown Error` with the `err.Error()` as the message).
317 | 2. If the error should be detected by the client, you should pick a status
318 | code, and return `"grpc/status".Error(code, message)`. The message should be
319 | used to distinguish this error from others. You should also provide an
320 | `IsXXXError` function in a standalone error package to assert the error.
321 | 
322 | A template implementation for `IsXXX` could be:
323 | 
324 | ```go
325 | package error
326 | 
327 | import (
328 | 	"github.com/pkg/errors"
329 | 
330 | 	"google.golang.org/grpc/codes"
331 | 	"google.golang.org/grpc/status"
332 | )
333 | 
334 | type StatError interface {
335 | 	error
336 | 	GRPCStatus() *status.Status
337 | }
338 | 
339 | func IsNotFound(err error) bool {
340 | 	if grpcError := StatError(nil); errors.As(err, &grpcError) {
341 | 		st := grpcError.GRPCStatus()
342 | 		return st.Code() == codes.NotFound && st.Message() == ErrNotFoundMsg
343 | 	}
344 | 	return false
345 | }
346 | ```
347 | 
348 | Adding a customized interface `StatError` is required, as it's not provided by
349 | the `grpc/status` package. Here is a feature request
350 | [issue](https://github.com/grpc/grpc-go/issues/2934#issuecomment-624749630) for
351 | this.
352 | 
353 | It would also be suggested to define a variable to represent the error:
354 | 
355 | ```go
356 | var ErrNotFound = status.Error(codes.NotFound, ErrNotFoundMsg)
357 | ```
358 | 
359 | However, inside the chaos-daemon, the error is still passed around like a normal Go
360 | error, which means you need to do some conversion at the **end** of the execution.
361 | For example:
362 | 
363 | ```go
364 | // Error generation
365 | package crclients
366 | 
367 | var ContainerNotFound = errors.New("container not found")
368 | 
369 | func ContainerKill(containerId string) error {
370 | 	return errors.Wrapf(ContainerNotFound, "not found id: %s", containerId)
371 | }
372 | 
373 | // Convert the error into one which is suitable for grpc
374 | package errors
375 | 
376 | type StatError interface {
377 | 	error
378 | 	GRPCStatus() *status.Status
379 | }
380 | 
381 | var ErrContainerNotFound = status.Error(codes.NotFound, crclients.ContainerNotFound.Error())
382 | 
383 | func IsContainerNotFound(err error) bool {
384 | 	if grpcError := StatError(nil); errors.As(err, &grpcError) {
385 | 		st := grpcError.GRPCStatus()
386 | 		return st.Code() == codes.NotFound && st.Message() == crclients.ContainerNotFound.Error()
387 | 	}
388 | 	return false
389 | }
390 | 
391 | package chaosdaemon
392 | 
393 | func (s *DaemonServer) ContainerKill(ctx context.Context, req *pb.ContainerRequest) (*empty.Empty, error) {
394 | 	err := ContainerKill(req.ContainerId)
395 | 
396 | 	if errors.Is(err, crclients.ContainerNotFound) {
397 | 		// this log is necessary to keep the stack trace, as the stack trace and all additional information
398 | 		// will be lost when returning to the grpc caller.
399 | 		log.Error(err, "kill container")
400 | 		return nil, chaosdaemonErr.ErrContainerNotFound
401 | 	}
402 | }
403 | ```
404 | 
405 | Then the client ("chaos-controller-manager") can check this error with the help
406 | of `IsContainerNotFound`:
407 | 
408 | ```go
409 | _, err := pbClient.ContainerKill(ctx, &pb.ContainerRequest{
410 | 	Action: &pb.ContainerAction{
411 | 		Action: pb.ContainerAction_KILL,
412 | 	},
413 | 	ContainerId: containerId,
414 | })
415 | 
416 | if chaosdaemonErr.IsContainerNotFound(err) {
417 | 	fmt.Println("It's container not found!")
418 | }
419 | ```
420 | 
421 | It's inconvenient to extract any information from the error, as it's just a
422 | string (and an error code), which is only enough to assert the "kind". However,
423 | it's enough for the current Chaos Mesh codebase (though I don't know whether we will
424 | need to find ways to extract information in the future).
425 | 
426 | #### Inside the dashboard apiserver
427 | 
428 | All error types have been defined in
429 | `/pkg/dashboard/apiserver/utils/error.go`, each with a comment for its status code.
430 | If you need to return an error, please choose one of them, and send it
431 | through
432 | `"github.com/chaos-mesh/chaos-mesh/pkg/dashboard/apiserver/utils".SetAPIError`.
433 | For example:
434 | 
435 | ```go
436 | utils.SetAPIError(c, utils.ErrInternalServer.WrapWithNoMessage(err))
437 | ```
438 | 
439 | The `c` is the `gin.Context`. After setting the error, the request will return
440 | a JSON error message:
441 | 
442 | ```json
443 | {
444 |   "code": code,
445 |   "type": typeName,
446 |   "message": err.Error(),
447 |   "full_text": fmt.Sprintf("%+v", err)
448 | }
449 | ```
450 | 
451 | ### Print
452 | 
453 | If you are returning the error to the parent function, it's suggested not to print
454 | it out, as it will be printed by one of its ancestors.
455 | 
456 | #### Console
457 | 
458 | If the logger has been set up, you can use `log.Error` to print the error. The
459 | `zapr` implementation of `logr` will add `errorVerbose` and `errorCause` fields
460 | to represent the full information of the error (with `%+v`).
461 | 
462 | If the logger hasn't been set up, you can use `fmt.Printf("%+v", err)` to print
463 | the error, possibly with additional context.
464 | 
465 | #### Kubernetes Events
466 | 
467 | All Kubernetes events should be sent through `recorder.Event`, to make sure all
468 | information can be extracted from the event attributes. The message of Kubernetes
469 | events shouldn't be long, so please do not add the full error stack.
470 | 
471 | ## Alternatives
472 | 
473 | There are tons of error handling styles. Feel free to propose your opinion
474 | through the comments, and let us discuss and decide the most suitable one for
475 | Chaos Mesh.
476 | 
--------------------------------------------------------------------------------
/text/2021-10-08-monitoring-metrics-about-chaos-mesh.md:
--------------------------------------------------------------------------------
 1 | # Monitoring Metrics about Chaos Mesh
 2 | 
 3 | ## Summary
 4 | 
 5 | More metrics are needed to improve the observability of `chaos-controller-manager`,
 6 | `chaos-daemon` and `chaos-dashboard`. We already have several metrics in
 7 | `chaos-controller-manager` and `chaos-daemon`, but they are not enough.
 8 | 
 9 | ## Motivation
10 | 
11 | At present, we only collect a few metrics: some about webhooks like
12 | `chaos_mesh_injections_total`, chaos experiment information in
13 | `chaos-controller-manager`, and HTTP and gRPC metrics in `chaos-daemon`.
14 | These metrics can hardly reflect the overall state of the Chaos Mesh system,
15 | so we need more metrics to improve the observability.
16 | 
17 | According to the proposal https://github.com/chaos-mesh/chaos-mesh/issues/2198,
18 | below are several metrics about logic patterns and performance that should be
19 | implemented and exposed on the `/metrics` HTTP endpoint:
20 | 
21 | `chaos-controller-manager`:
22 | 
23 | - time histogram for each `Reconcile()` in `Reconciler` (already provided by `controller-runtime`)
24 | - count of Chaos Experiments, Schedules, and Workflows
25 | - count of events emitted by `chaos-controller-manager`
26 | - common metrics of the gRPC client
27 | - metrics for the Kubernetes webhook (already provided by `controller-runtime`)
28 | 
29 | `chaos-daemon`:
30 | 
31 | - common metrics of the gRPC server (already provided)
32 | - count of processes controlled by `bpm` (background process manager)
33 | - count of iptables/ipset/tc rules
34 | 
35 | `chaos-dashboard`:
36 | 
37 | - time histogram for each HTTP query
38 | - count of archived objects
39 | - time histogram for the archive reconciler (already provided by `controller-runtime`)
40 | 
41 | ## Detailed design
42 | 
43 | ### Metrics Plan
44 | 
45 | The metrics design is as follows. Here are a few things that should be noted:
46 | 
47 | - The implementations of `chaos_controller_manager_chaos_experiments` and
48 | `chaos_daemon_grpc_server_handling_seconds` have already been provided
49 | (as `chaos_mesh_experiments` and `grpc_server_handling_seconds`); we just
50 | hope to modify the names to standardize the naming. For more about naming, see this
51 | guideline: [Metric and label naming](https://prometheus.io/docs/practices/naming/).
52 | - Time histogram metrics for each `Reconcile()` in `Reconciler` have been provided
53 | by `controller-runtime` as `controller_runtime_reconcile_time_seconds`.
54 | - Metrics for the Kubernetes webhook have also been provided by `controller-runtime`.
55 | They are called `controller_runtime_webhook_latency_seconds`, `controller_runtime_webhook_requests_total`,
56 | and `controller_runtime_webhook_requests_in_flight`.
57 | 
58 | 
59 | 
60 | | Name | Description | Type | Label | Buckets |
61 | | ---- | ----------- | ---- | ----- | ------- |
62 | | chaos_controller_manager_chaos_experiments | Total number of chaos experiments and their phases | GaugeVec | namespace, kind, phase | / |
63 | | chaos_controller_manager_chaos_schedules | Total number of chaos schedules | GaugeVec | namespace | / |
64 | | chaos_controller_manager_chaos_workflows | Total number of chaos workflows | GaugeVec | namespace | / |
65 | | chaos_controller_manager_emitted_event_total | Total number of events emitted by chaos-controller-manager | CounterVec | type, reason, namespace | / |
66 | | chaos_controller_manager_grpc_client_handling_seconds | Common metrics of the gRPC client | HistogramVec | grpc_code, grpc_method, grpc_service, grpc_type | DefBuckets |
67 | | chaos_daemon_grpc_server_handling_seconds | Common metrics of the gRPC server | HistogramVec | grpc_code, grpc_method, grpc_service, grpc_type | ChaosDaemonGrpcServerBuckets |
68 | | chaos_daemon_bpm_controlled_process_total | Total count of bpm controlled processes | Counter | / | / |
69 | | chaos_daemon_bpm_controlled_processes | Current number of bpm controlled processes | Gauge | / | / |
70 | | chaos_daemon_iptables_packets | Total number of iptables packets | GaugeVec | namespace, pod, container, table, chain, policy, rule | / |
71 | | chaos_daemon_iptables_packet_bytes | Total bytes of iptables packets | GaugeVec | namespace, pod, container, table, chain, policy, rule | / |
72 | | chaos_daemon_ipset_members | Total number of ipset members | GaugeVec | namespace, pod, container | / |
73 | | chaos_daemon_tcs | Total number of tc rules | GaugeVec | namespace, pod, container | / |
74 | | chaos_dashboard_http_request_duration_seconds | Time histogram for each HTTP query | HistogramVec | path, method, status | DefBuckets |
75 | | chaos_dashboard_archived_experiments | Total number of archived chaos experiments | GaugeVec | namespace, type | / |
76 | | chaos_dashboard_archived_schedules | Total number of archived chaos schedules | GaugeVec | namespace | / |
77 | | chaos_dashboard_archived_workflows | Total number of archived chaos workflows | GaugeVec | namespace | / |
78 | 
79 | 
80 | 
81 | The current design of the buckets is shown in the table below. The distribution of
82 | these data needs to be obtained later in order to adjust the values so that the
83 | number of samples in each bucket is similar.
84 | 
85 | 
86 | 
87 | | Buckets Name | Value | Description |
88 | | ------------ | ----- | ----------- |
89 | | DefBuckets | `[]float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}` | default prometheus buckets |
90 | | ChaosDaemonGrpcServerBuckets | `[]float64{0.001, 0.01, 0.1, 0.3, 0.6, 1, 3, 6, 10}` | the bucket settings have already been implemented; just set constants for clarity |
91 | 
92 | 
93 | ### Collecting Plan
94 | 
95 | This section introduces how metrics are collected in complex scenarios. Common
96 | collection methods such as pull mode can be found in `controllers/metrics/metrics.go`.
97 | 
98 | #### gRPC
99 | 
100 | The implementation of `chaos_controller_manager_grpc_client_handling_seconds`
101 | and `chaos_daemon_grpc_server_handling_seconds` will be provided by
102 | [go-grpc-prometheus](https://github.com/grpc-ecosystem/go-grpc-prometheus).
It should be noted that
103 | the metric name needs to be replaced in the original implementation of `chaos-daemon`.
104 | 
105 | ```go
106 | // pkg/chaosdaemon/server.go
107 | func newGRPCServer(reg prometheus.Registerer, ...) (*grpc.Server, error) {
108 | 	withHistogramName := func(name string) func(opts *prometheus.HistogramOpts) {
109 | 		return func(opts *prometheus.HistogramOpts) {
110 | 			opts.Name = name
111 | 		}
112 | 	}
113 | 
114 | 	grpcMetrics := grpc_prometheus.NewServerMetrics()
115 | 	grpcMetrics.EnableHandlingTimeHistogram(
116 | 		// here we set the customized ChaosDaemonGrpcServerBuckets and a customized histogram name
117 | 		grpc_prometheus.WithHistogramBuckets(ChaosDaemonGrpcServerBuckets),
118 | 		withHistogramName("chaos_daemon_grpc_server_handling_seconds"),
119 | 	)
120 | 	reg.MustRegister(grpcMetrics)
121 | 
122 | 	// ...
123 | }
124 | ```
125 | 
126 | For the implementation of `chaos_controller_manager_grpc_client_handling_seconds`,
127 | add an option function to `GrpcBuilder` in `pkg/grpc/utils.go` and register
128 | `grpc_prometheus.DefaultClientMetrics` for `controllermetrics.Registry`:
129 | 
130 | ```go
131 | // pkg/grpc/utils.go
132 | func (it *GrpcBuilder) WithGrpcMetricsCollection() *GrpcBuilder {
133 | 	it.options = append(it.options,
134 | 		grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
135 | 		grpc.WithStreamInterceptor(grpc_prometheus.StreamClientInterceptor),
136 | 	)
137 | 	return it
138 | }
139 | 
140 | // cmd/chaos-controller-manager/main.go
141 | func Run(params RunParams) error {
142 | 	// ...
143 | 	// register grpc_prometheus client metrics
144 | 	controllermetrics.Registry.MustRegister(grpc_prometheus.DefaultClientMetrics)
145 | 	// ...
146 | }
147 | ```
148 | 
149 | #### BPM
150 | 
151 | To collect `chaos_daemon_bpm_controlled_processes`, we observe that in BPM each
152 | identifier corresponds to a process, so we can take the number of identifiers
153 | in BPM as the metric value.
154 | 
155 | #### iptables / ipset / tc
156 | 
157 | For metrics such as `chaos_daemon_iptables_packets`, we should enter the
158 | container network namespace to collect them, e.g.
159 | `/usr/bin/nsenter -n/proc/%d/ns/net -- iptables-save`. In order to collect these
160 | metrics, first we need to list the PIDs of all containers, then run commands
161 | such as `iptables-save -c` using BPM, and parse the output to obtain the information.
162 | 
163 | 1. First, `crclients.ContainerRuntimeInfoClient` needs to provide a new
164 | interface `ListContainerPIDs() []uint32` for each runtime:
165 |    - Docker: call `ContainerAPIClient.ContainerList` to get the PIDs directly.
167 |    - containerd: call `Client.Containers` to get the PIDs directly.
169 |    - CRI-O: call the gRPC interface `ListContainers` to obtain the container IDs,
170 | and then get the PIDs via `GetPidFromContainerID`, which is already
171 | implemented.
172 | 
173 | 2. Then, we need the pod name and container name for each container PID, so
174 | `GetLabelsFromContainerID` should be provided by `crclients.ContainerRuntimeInfoClient`,
175 | similarly to the above.
176 | 3. Run the command and parse the output:
177 |    - For iptables metrics: use [iptables_exporter](https://github.com/retailnext/iptables_exporter/blob/master/iptables/parser.go)
178 | to parse the result of `iptables-save -c` to get the number of chains and
179 | rules, packet number, and byte number.
180 |    - For ipset metrics: parse `ipset list` to obtain the name and members of each
181 | set in the ipset.
182 |    - For tc metrics: parse `tc qdisc` to get the number of rules.
--------------------------------------------------------------------------------
/text/2021-11-17-ui-monorepo.md:
--------------------------------------------------------------------------------
 1 | # Scalable codebase in UI via monorepo
 2 | 
 3 | ## Summary
 4 | 
 5 | Use [Yarn Workspaces](https://classic.yarnpkg.com/lang/en/docs/workspaces/)
 6 | to easily extend the front-end codebase.
 7 | 
 8 | ## Motivation
 9 | 
10 | > Why are we doing this?
11 | 
12 | As the project grows, so does the number of features that need to be supported
13 | on the front-end. But the current architecture of the front-end is a single repo,
14 | which means all dependencies are intertwined. With this system, isolating applications
15 | and dependencies is difficult. You can only put the common logic under a certain
16 | folder, like `lib` or `components`; it's not possible to put the logic outside the
17 | `src` folder.
18 | 
19 | So we decided to use a monorepo to separate the front-end codebase.
20 | 
21 | > What use cases does it support?
22 | 
23 | In the future, the front-end plans to generate
24 | the corresponding TypeScript definitions from CRDs. Code like this is
25 | application-agnostic, and putting it together with the original code can be a huge
26 | hindrance to testing and management.
27 | 
28 | > What is the expected outcome?
29 | 
30 | Before the monorepo, the codebase looks like this:
31 | 
32 | ```shell
33 | ui/
34 | ├── src/
35 | │   ├── api/
36 | │   ├── components/
37 | │   ├── components-mui/
38 | │   ├── lib/
39 | │   ├── pages/
40 | ├── package.json
41 | ```
42 | 
43 | After using the monorepo:
44 | 
45 | ```shell
46 | ui/
47 | ├── app/ # original src
48 | │   ├── api/
49 | │   ├── components/
50 | │   ├── lib/
51 | │   ├── pages/
52 | ├── packages/
53 | │   ├── mui-extends/ # components-mui
54 | │   ├── crd/ # generate ts code from CRDs
55 | ├── package.json
56 | ```
57 | 
58 | ## Detailed design
59 | 
60 | First, we need to treat the current `src` folder as a package; we decided to
61 | rename it to `app` and move it to `ui/app`.
62 | 
63 | Then extract the shared tooling, like `husky` and `lint-staged`, which we will use
64 | at the root:
65 | 
66 | ```shell
67 | yarn add husky lint-staged prettier prettier-plugin-import-sort import-sort-style-eslint -DW # --dev --ignore-workspace-root-check
68 | ```
69 | 
70 | Then add the below into `ui/package.json`:
71 | 
72 | ```json
73 | {
74 |   "workspaces": [
75 |     "app",
76 |     "packages/*"
77 |   ]
78 | }
79 | ```
80 | 
81 | Finally, we are ready to create packages:
82 | 
83 | ```shell
84 | mkdir -p packages/mui-extends
85 | cd packages/mui-extends
86 | npm init
87 | ```
88 | 
89 | > Note:
90 | >
91 | > Some commands will be changed, like:
92 | >
93 | > - `yarn start:default` -> `yarn workspace @ui/app start:default`
94 | 
95 | ## Drawbacks
96 | 
97 | A good explanation:
98 | 
99 | https://fossa.com/blog/pros-cons-using-monorepos/
100 | 
101 | ## Alternatives
102 | 
103 | There should be no better way to manage future non-application-related logic; after
104 | all, the front-end code is also attached to the entire chaos-mesh codebase.
105 | 
106 | ## Unresolved questions
107 | 
108 | No.
109 | 
--------------------------------------------------------------------------------
/text/2021-12-09-logging.md:
--------------------------------------------------------------------------------
 1 | # Logging
 2 | 
 3 | ## Summary
 4 | 
 5 | 
6 | 7 | ## Motivation 8 | 9 | 11 | 12 | Logging in each component in Chaos Mesh is bad. Related issue: 13 | https://github.com/chaos-mesh/chaos-mesh/issues/2149. Because of the mess on 14 | logging, the debugging and profiling of the components are difficult. 15 | 16 | This proposal aims to improve the logging observability of the components and 17 | the development experience on printing logs. 18 | 19 | After passed this proposal, I would write a developing guide for logging in the 20 | repo of Chaos Mesh. 21 | 22 | ## Requirements 23 | 24 | - structured logging 25 | - leveled logging 26 | - runtime level configure 27 | - could logging in everywhere (every line of code) 28 | 29 | ## Detailed design 30 | 31 | 36 | 37 | We are going to use [logr](https://github.com/go-logr/logr) and 38 | [zap](https://github.com/uber-go/zap) (with 39 | [zapr](https://github.com/go-logr/zapr) as the shim between them) to construct 40 | the logging facilities. 41 | 42 | [logr](https://github.com/go-logr/logr) is a library that provides a common 43 | interface for logging, in other words, logging facade. It is designed to be used 44 | by different logging frameworks(likes the SLF4J API in Java), with using logr, 45 | we could switch from one logging backend to another without many changes. 46 | 47 | [zap](https://github.com/uber-go/zap) is a high performance logging framework. 48 | [zapr](https://github.com/go-logr/zapr) is the logr implementation using zap. 49 | 50 | ### Global Logger 51 | 52 | Here would exists a global logger at `pkg/log`, which marked as deprecated and 53 | only for the compatibility with the old code. Global Logger could be accessed by 54 | `log.L()`. 55 | 56 | The initial value of the global logger is `logr.DiscardLogger`, which means no 57 | log would be printed anywhere. The global logger should be replaced ONLY ONCE 58 | when the application starts. 59 | 60 | ### Logger as dependency 61 | 62 | The suggested way to use the logger is to use it as a dependency, and make 63 | dependency explicit. 64 | 65 | GOOD Example: 66 | 67 | ```go 68 | type SomeStruct struct { 69 | state string 70 | otherField string 71 | logger logr.Logger 72 | } 73 | 74 | func NewSomeStruct(state string, otherField string, logger logr.Logger) *SomeStruct { 75 | return &SomeStruct{ 76 | state: state, 77 | otherField: otherField, 78 | logger: logger, 79 | } 80 | } 81 | 82 | func (it *SomeStruct) DoSomething() { 83 | it.logger.Info("Doing something") 84 | } 85 | ``` 86 | 87 | MUST NOT use the implicitly logger in the constructor. 88 | 89 | BAD example: 90 | 91 | ```go 92 | func NewSomeStruct(state string, otherField string) *SomeStruct { 93 | return &SomeStruct{ 94 | state: state, 95 | otherField: otherField, 96 | logger: somewhere.DefaultLogger(), 97 | } 98 | } 99 | ``` 100 | 101 | And logger should be a parameter of the function, not a global variable. 102 | 103 | GOOD example: 104 | 105 | ```go 106 | func doSomething(logger logr.Logger) { 107 | logger.Info("Doing something") 108 | } 109 | ``` 110 | 111 | BAD Example: 112 | 113 | ```go 114 | var logger = somewhere.DefaultLogger() 115 | 116 | func doSomething() { 117 | logger.Info("Doing something") 118 | } 119 | ``` 120 | 121 | > As you could see, it is actually a closure with an extern/global state. 122 | 123 | ### Logging in Chaos Mesh Dashboard and gin 124 | 125 | Chaos Mesh Dashboard use [gin](https://github.com/gin-gonic/gin) as thew web 126 | framework, and there are some logging printed when requests coming in. 
127 | 
128 | There are two gin middlewares for logging, called `Logger()` and `Recovery()`,
129 | included in the `gin.Default()` function.
130 | 
131 | So we should replace these 2 middlewares with other logger middlewares, and
132 | thanks to the community, there are already some libraries for this:
133 | https://github.com/alron/ginlogr and https://github.com/gin-contrib/zap. We
134 | can choose one of them.
135 | 
136 | In the other code of Chaos Mesh Dashboard, we should use the logger in the same
137 | way as in the other components.
138 | 
139 | ### Logging in cli tools
140 | 
141 | For CLI tools, `stdout` and `stderr` are very important for the user experience, so
142 | you cannot print as many logs to `stdout` as a service application would. But this
143 | does not mean that you should not print anything with the logger; you can still use
144 | logging to show progress, or as a profiling tool.
145 | 
146 | ## Practices
147 | 
148 | ### Should I use the global logger
149 | 
150 | NO. At least, avoid using it in new code.
151 | 
152 | The only reason to use the global logger is that you have nowhere to access an
153 | instance of the logger, which means you might be writing some code that relies on
154 | global state, like some "registration pattern", or `func init()`, or some
155 | extremely simple function like `func min(a, b int) int`.
156 | 
157 | Please avoid using global/package-level variables/state when coding; once a
158 | "simple method" needs a logger to print some message, it probably is not
159 | "simple" anymore, so please refactor it.
160 | 
161 | ### Using logger as function parameter is UGLY! Is there another way
162 | 
163 | YES. Consider refactoring the one simple function into a struct with a method, and
164 | introduce the logger as a dependency of the struct.
165 | 
166 | Or use "Functional Options", see:
167 | https://github.com/uber-go/guide/blob/master/style.md#functional-options.
168 | 
169 | If you are facing a choice between keeping the API clean and writing fewer
170 | lines of code, please choose the former.
171 | 
172 | ### Should I log in library codes
173 | 
174 | YES. When you write code as a library, or write exported code under `pkg/`,
175 | logging not only helps development, but also helps the
176 | users understand the logic of the library. Logging is also an important
177 | part of the observability of the library.
178 | 
179 | And using a logging facade (like logr) gives us flexibility without burdening
180 | library users with a certain logging framework.
181 | 
182 | Only use "high level" `V(n).Info()` (n > 0) in library code; do not use
183 | `Error()`, `V(0).Info()` or `Info()`, because `V(0).Info()` (or an undefined
184 | level) and `Error()` would most probably print something to `stdout`, which
185 | might break the design of CLI tools.
186 | 
187 | ### Logger name
188 | 
189 | The name of the logger should not be defined by the component itself, nor
190 | defined in the constructor. It should be defined by the caller of the
191 | constructor.
192 | 
193 | GOOD Example:
194 | 
195 | ```go
196 | func main() {
197 | 	logger := initializeApplicationRootLogger()
198 | 	NewWebModule(logger.WithName("web")).StartServe()
199 | }
200 | ```
201 | 
202 | BAD Example:
203 | 
204 | ```go
205 | func NewWebModule(logger logr.Logger) *WebModule {
206 | 	return &WebModule{
207 | 		logger: logger.WithName("web"),
208 | 	}
209 | }
210 | ```
211 | 
212 | The name of the logger should follow these rules:
213 | 
214 | - use kebab-case for the name
215 | - follow the hierarchical dot-separated pattern shown in the zapr example below
216 | 
217 | For `zapr` (the logr adapter for zap), the name of a logger is prefixed with its
218 | parent logger's name, which means:
219 | 
220 | ```go
221 | rootLogger := initializeApplicationRootLogger()
222 | chaosDaemonLogger := rootLogger.WithName("chaos-daemon")
223 | daemonServerLogger := chaosDaemonLogger.WithName("daemon-server")
224 | ```
225 | 
226 | The name of `chaosDaemonLogger` is `chaos-daemon`, and the name of
227 | `daemonServerLogger` is `chaos-daemon.daemon-server`.
228 | 
229 | ### Which level should I use
230 | 
231 | TL;DR, for most code please follow these suggestions:
232 | 
233 | - use `logger.Error()` for logging errors at the ERROR level
234 | - use `logger.Info()` (or `logger.V(0).Info()`) for logging at the INFO or WARN
235 | level
236 | - use `logger.V(1).Info()` for logging at the DEBUG level
237 | - use `logger.V(5).Info()` for logging at the TRACE level
238 | 
239 | As for the detailed meaning of each V level: we do not restrict the number of
240 | levels for the logger; here is a list of the levels we recommend to use:
241 | 
242 | - Error() - Error should be used to indicate unexpected errors, for example,
243 | unexpected errors returned by subroutine function calls.
244 | 
245 | - Info() - Info should be used as a human journal or diary. It can also be used
246 | to log expected errors as warnings. Info() has multiple levels:
247 |   - V(0) - This is the default level, ALWAYS visible to users.
248 |     - CLI argument handling, print args before the program starts
249 |     - Application configuration
250 |     - Expected errors as warning messages
251 |     - Journal of the application's major behavior
252 |       - service endpoint: resolve requests
253 |       - long-time batch-job: job start, job complete
254 |     - Acquisition of system resources: listen on a port, persist to a file, etc.
255 |   - V(1) - Default log level for debug.
256 |     - Expected errors that repeat frequently and relate to conditions that will
257 | be corrected, like `StatusReasonConflict` when reconciling
258 |     - Component configuration and initialization
259 |   - V(2) - Useful state changes and conditional branches.
260 |     - Changes of state that are useful for debugging
261 |     - Choosing a branch in a conditional statement
262 |   - V(3) - Extended information about changes
263 |     - Logging with state updates like `err := updateState(); err == nil`
264 |     - Logging an error with more context before wrapping and returning it
265 |   - V(4) - Debug level verbosity, any other behavior that changes the state
266 |   - V(5) - Default log level for trace.
267 |     - Progress in a for-loop
268 |     - Things omitted within a "best effort" pattern
269 |     - Falling back to a default value when one is not configured
270 |   - V(6) - Communication with other components
271 |     - Logging in a handler before processing/resolving a request
272 |     - RPC calls to other components/services
273 |   - V(7) - Detailed information about communication
274 |     - Detailed payload and status of a request/response within RPC
275 | 
276 | The number of levels is still limited by the backend logging framework; for
277 | example, you can only create 128 levels in zap.
278 | 
279 | ### Relations between error and logging
280 | 
281 | You can resolve an error by logging it with `Error()`; this usually means the
282 | program cannot handle the error properly, and it lets the user know. But if your
283 | code can handle the error, including "throwing" it to the upper
284 | level, you should not log it with `Error()`.
285 | 
286 | You can also use debug-level `Info()` to log errors, with more detailed context
287 | and behaviors. This does not conflict with the former rule.
288 | 
289 | ## Drawbacks
290 | 
291 | 
292 | 
293 | ## Alternatives
294 | 
295 | 
299 | 
300 | ### Why zap? And why not another logging framework as the logr backend
301 | 
302 | There are still many other logging frameworks supported by logr:
303 | https://github.com/go-logr/logr#implementations-non-exhaustive. I selected zap
304 | only because I am familiar with it, and I think the API of zap is easy to use
305 | and has lots of configuration options for features.
306 | 
307 | Please comment on it if you have any suggestions.
308 | 
309 | ## Unresolved questions
310 | 
311 | 
312 | 
313 | logr just released its v1.x.x stable version, which brings BREAKING CHANGES
314 | compared with v0.x.x: it changes `logr.Logger` from an `interface` to a `struct`, see
315 | https://github.com/go-logr/logr/pull/42.
316 | 
317 | Some libraries we use in Chaos Mesh, like `controller-runtime`, have upgraded to
318 | logr v1.x.x on the master branch, but have not released a stable version. The
319 | latest version of `controller-runtime` is `0.10.3`, and `0.11.0-beta.0` was
320 | released on `2021-11-10`; it still needs several months to be stable.
321 | 
322 | So we have to use logr v0.4.0 for a while, and several months later, we will
323 | upgrade to logr v1.x.x.
324 | 
--------------------------------------------------------------------------------
/text/2021-12-29-openapi-to-typescript-api-client-and-forms.md:
--------------------------------------------------------------------------------
 1 | # OpenAPI to TypeScript API Client and Forms
 2 | 
 3 | > Updated on 2022-11-03:
 4 | >
 5 | > After much practice, I found that using [Orval](https://orval.dev/) instead of
 6 | > [OpenAPITools/openapi-generator](https://github.com/OpenAPITools/openapi-generator)
 7 | > eliminates the dependency on the JRE, which solves the cons mentioned below.
 8 | >
 9 | > So I plan to use Orval in the future to generate the client.
10 | >
11 | > The other parts in this RFC still remain unchanged. :)
12 | 
13 | ## Summary
14 | 
15 | Use the OpenAPI specification to generate a TypeScript API client and forms.
16 | 
17 | ## Motivation
18 | 
19 | Ref: https://github.com/chaos-mesh/chaos-mesh/issues/2615.
20 | 
21 | Currently, the chaos types are split between the front-end and back-end, and we
22 | can't reuse types that have already been defined.
23 | 
24 | You can find these **extra** type definitions in https://github.com/chaos-mesh/chaos-mesh/blob/release-2.1/ui/app/src/components/NewExperimentNext/data/types.ts.
25 | 
26 | This leads to several problems:
27 | 
28 | 1. The front-end needs to be manually synchronized when there is an update to the
29 | type definition.
30 | 2. As `1` continues to happen, there are more and more things to maintain manually.
31 | 
32 | Considering our existing maintenance cost and past experience, we are not able
33 | to synchronize the corresponding descriptions to the front-end in a timely manner.
34 | 
35 | This also works against those who want to contribute to Chaos Mesh, and it would
36 | be great if modifying the API alone were enough to synchronize the changes to the UI.
37 | 
38 | So the best solution for now is to automate the generation of the schemas needed
39 | for the front-end.
40 | 
41 | ## Detailed design
42 | 
43 | I will split the details into two parts:
44 | 
45 | - The first part is the generation of the TypeScript schemas.
46 | - The second part is how to use the generated files to produce a forms skeleton.
47 | 
48 | ### Generate TypeScript Schemas
49 | 
50 | Luckily, there are several tools that can help us generate the schemas. All of
51 | them have pros and cons; I finally chose
52 | [OpenAPITools/openapi-generator](https://github.com/OpenAPITools/openapi-generator),
53 | because:
54 | 
55 | - Pros
56 |   - It's official.
57 |   - It's a very popular tool and has a huge community.
58 | - Cons
59 |   - The Node.js package is just a wrapper around the JAR, so a JRE is needed.
60 | 
61 | Other tools I tried:
62 | 
63 | - [openapi-typescript](https://www.npmjs.com/package/openapi-typescript)
64 | - [openapi-typescript-codegen](https://www.npmjs.com/package/openapi-typescript-codegen)
65 | 
66 | Both of these tools have their own pros and cons, roughly the opposite of those
67 | described above.
68 | 
69 | Then we can create a new package to handle the generation:
70 | 
71 | ```sh
72 | mkdir -p ui/packages/openapi
73 | cd ui/packages/openapi
74 | yarn init
75 | ```
76 | 
77 | Install dependencies and add the `generate` script:
78 | 
79 | ```json
80 | {
81 |   "scripts": {
82 |     "generate": "export TS_POST_PROCESS_FILE='../../node_modules/.bin/prettier --write'; openapi-generator-cli generate -c openapiconfig.json -i ../../../pkg/dashboard/swaggerdocs/swagger.yaml -g typescript-axios -o ../../app/src/openapi --enable-post-process-file"
83 |   }
84 | }
85 | ```
86 | 
87 | This script outputs the files directly into the `ui/app/src/openapi` directory
88 | to avoid additional compilation.
89 | 
90 | ### Use TypeScript Compiler API to Generate Forms
91 | 
92 | To generate forms, we have to think about these things before we write the code:
93 | 
94 | - Construct the interface of a form field.
95 | - How to handle dependencies between fields?
96 | - How to find shared fields?
97 | - Some special cases.
98 | 
99 | For the first item, a front-end form field component should normally contain at
100 | least these properties:
101 | 
102 | - type
103 | - label
104 | - value
105 | - description
106 | 
107 | Converted to JSON:
108 | 
109 | ```json
110 | {
111 |   "field": "text",
112 |   "label": "Name",
113 |   "value": "",
114 |   "helperText": "Fill your name"
115 | }
116 | ```
117 | 
118 | The types we currently have are:
119 | 
120 | ```ts
121 | type FieldType =
122 |   | "text"
123 |   | "textarea"
124 |   | "number"
125 |   | "select"
126 |   | "label"
127 |   | "autocomplete";
128 | ```
129 | 
130 | Most of them are inherited from the HTML input types, except `label` and `autocomplete`.
131 | The `label` is represented as `string[]`. The `autocomplete` can be ignored for
132 | now because none of the generated fields will use it.
133 | 
134 | Done. Next we need to think about how to handle dependencies between fields.
135 | 
136 | #### Dependencies between fields
137 | 
138 | Mostly, we use an `action` field to distinguish exactly what we want a chaos to
139 | do with the injection. So I'll start from here, but how can we find the different actions?
139 | The key problem is that the OpenAPI generator can't convert a Go `type alias` to
140 | TS `enums`. For example:
141 | 
142 | ```go
143 | const (
144 | 	Ec2Stop AWSChaosAction = "ec2-stop"
145 | 	Ec2Restart AWSChaosAction = "ec2-restart"
146 | 	DetachVolume AWSChaosAction = "detach-volume"
147 | )
148 | ```
149 | 
150 | We expect these to convert to:
151 | 
152 | ```ts
153 | enum AWSChaosAction {
154 |   Ec2Stop = "ec2-stop",
155 |   Ec2Restart = "ec2-restart",
156 |   DetachVolume = "detach-volume",
157 | }
158 | ```
159 | 
160 | But unfortunately, what we actually get is only:
161 | 
162 | ```ts
163 | export interface V1alpha1AWSChaosSpec {
164 |   /**
165 |    * @type {string}
166 |    * @memberof V1alpha1AWSChaosSpec
167 |    */
168 |   action?: string;
169 | 
170 |   //...
171 | }
172 | ```
173 | 
174 | This prevents us from using the TS compiler API to read the enum, but the solution
175 | is simple: **since we can't get the key information through the code, we need
176 | to define it manually**.
177 | 
178 | How about defining a JSON file? That was my first thought, but @STRRL reminded
179 | me that **we can write it in the comments**. Yes, like `kubebuilder`'s markers, we
180 | can define our own markers.
181 | 
182 | So the following are all the markers we will be using:
183 | 
184 | - +kubebuilder:validation:Enum=action1;action2;action3
185 | 
186 |   +ui:form:enum=action1;action2;action3
187 | 
188 |   > Reuse the kubebuilder validation marker to indicate what actions we have.
189 |   > For uniformity, you can also write it as `+ui:form:enum`.
190 |   >
191 |   > Besides `action`, this can also be reused to define a `select` field.
192 | 
193 | - +ui:form:when=action=='action1'
194 | 
195 |   > Indicates which action this property belongs to.
196 |   >
197 |   > The value is an `expression` which needs to be evaluated at runtime.
198 | 
199 | - +ui:form:ignore
200 | 
201 |   > Ignores this property.
202 | 
203 | For example:
204 | 
205 | ```go
206 | type AWSChaosSpec struct {
207 | 	// +ui:form:enum=ec2-stop;ec2-restart;detach-volume
208 | 	// +kubebuilder:validation:Enum=ec2-stop;ec2-restart;detach-volume
209 | 	Action AWSChaosAction `json:"action"`
210 | 
211 | 	//...
212 | }
213 | 
214 | type AWSSelector struct {
215 | 	// Endpoint indicates the endpoint of the aws server. Just used it in test now.
216 | 	// +ui:form:ignore
217 | 	// +optional
218 | 	Endpoint *string `json:"endpoint,omitempty"`
219 | 
220 | 	//...
221 | 
222 | 	// DeviceName indicates the name of the device.
223 | 	// +ui:form:when=action=='detach-volume'
224 | 	// +optional
225 | 	DeviceName *string `json:"deviceName,omitempty" webhook:"AWSDeviceName,nilable"`
226 | }
227 | ```
228 | 
229 | The rest of the steps are simple: use regular expressions to read them:
230 | 
231 | ```js
232 | // Part of the code
233 | const UI_FORM_ENUM = /\+ui:form:enum=(.+)\s/;
234 | const KUBEBUILDER_VALIDATION_ENUM = /\+kubebuilder:validation:Enum=(.+)\s/;
235 | 
236 | /**
237 |  * Get enum array from jsdoc comment.
238 |  *
239 |  * @export
240 |  * @param {string} s
241 |  * @return {string[]}
242 |  */
243 | export function getUIFormEnum(s) {
244 |   let matched = s.match(UI_FORM_ENUM) || s.match(KUBEBUILDER_VALIDATION_ENUM);
245 | 
246 |   return matched ? matched[1].split(";") : [];
247 | }
248 | 
249 | // ...
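// Not part of the original RFC: a minimal sketch (assuming `ts` is the
// imported "typescript" module) of how such a Node could be located, by
// walking the generated .ts file with the compiler API's visitor pattern.
function visit(node) {
  // Property signatures like `action?: string;` carry the jsdoc markers.
  if (ts.isPropertySignature(node) && node.jsDoc) {
    handleProperty(node); // hypothetical handler that runs the logic below
  }
  ts.forEachChild(node, visit);
}
// Entry point: ts.forEachChild(sourceFile, visit);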
250 | 
251 | // Assuming that the Node is found
252 | const { escapedText: identifier } = node.name; // identifier
253 | const comment = node.jsDoc[0].comment;
254 | 
255 | if (identifier === "action") {
256 |   // Get all actions
257 |   actions = getUIFormEnum(comment);
258 | }
259 | ```
260 | 
261 | Similarly, we can handle the other markers in this way. So far we have solved this problem.
262 | 
263 | #### Shared fields
264 | 
265 | Now this problem also becomes simple. Since we have defined markers, we can adopt
266 | the default rule:
267 | 
268 | **The fields without `+ui:form:when=action=='xxx'` and `+ui:form:ignore` are shared**.
269 | 
270 | I think we can skip the code details for this part; it is enough to understand
271 | the rule above.
272 | 
273 | #### Chaos without action
274 | 
275 | Here is another case: what if a chaos doesn't have an action, like `KernelChaos`
276 | and `TimeChaos`?
277 | 
278 | Such a chaos will output an empty actions array, `export const actions = []`, as a placeholder.
279 | 
280 | #### Non-primitive type
281 | 
282 | If a field is not a primitive type, TypeScript will use a `TypeReference` to represent
283 | it. Unlike primitive types, once a type reference appears, we need to recursively
284 | resolve the corresponding type symbol. We will use the type checker to
285 | achieve this:
286 | 
287 | ```ts
288 | const program = ts.createProgram([source], {
289 |   target: ts.ScriptTarget.ES2015,
290 | });
291 | const sourceFile = program.getSourceFile(source);
292 | const checker = program.getTypeChecker(); // this is what we need
293 | 
294 | const type = checker.getTypeAtLocation(typeRef); // get the final type
295 | ```
296 | 
297 | So we still need a new field type to represent it. Here is an example:
298 | 
299 | ```js
300 | {
301 |   field: "ref",
302 |   label: "callchain",
303 |   multiple: true,
304 |   children: [
305 |     {
306 |       field: "text",
307 |       label: "funcname",
308 |       value: "",
309 |       helperText: ""
310 |     },
311 |     {
312 |       field: "text",
313 |       label: "parameters",
314 |       value: "",
315 |       helperText: ""
316 |     },
317 |     {
318 |       field: "text",
319 |       label: "predicate",
320 |       value: "",
321 |       helperText: ""
322 |     }
323 |   ]
324 | },
325 | ```
326 | 
327 | When a field is represented as a `ref`, it must contain the `children` key,
328 | which means it will render a group of children in the interface.
329 | 
330 | There is also a `multiple` key to indicate whether the children are rendered repeatedly.
331 | 
332 | #### Result
333 | 
334 | Finally, we will generate `AWSChaos` like this:
335 | 
336 | ```ts
337 | /**
338 |  * This file was auto-generated by @ui/openapi.
339 |  * Do not make direct changes to the file.
340 |  */
341 | 
342 | export const actions = [],
343 |   data = [
344 |     {
345 |       field: "text",
346 |       label: "awsRegion",
347 |       value: "",
348 |       helperText: "AWSRegion defines the region of aws.",
349 |     },
350 |     {
351 |       field: "text",
352 |       label: "deviceName",
353 |       value: "",
354 |       helperText:
355 |         "Optional. DeviceName indicates the name of the device. Needed in detach-volume.",
356 |       when: "action=='detach-volume'",
357 |     },
358 |     {
359 |       field: "text",
360 |       label: "ec2Instance",
361 |       value: "",
362 |       helperText: "Ec2Instance indicates the ID of the ec2 instance.",
363 |     },
364 |     {
365 |       field: "text",
366 |       label: "secretName",
367 |       value: "",
368 |       helperText: "Optional. SecretName defines the name of kubernetes secret.",
369 |     },
370 |     {
371 |       field: "text",
372 |       label: "volumeID",
373 |       value: "",
374 |       helperText:
375 |         "Optional. EbsVolume indicates the ID of the EBS volume. Needed in detach-volume.",
376 |       when: "action=='detach-volume'",
377 |     },
378 |   ];
379 | ```
380 | 
381 | Also `KernelChaos`:
382 | 
383 | ```ts
384 | export const actions = [],
385 |   data = [
386 |     {
387 |       field: "ref",
388 |       label: "failKernRequest",
389 |       children: [
390 |         {
391 |           field: "ref",
392 |           label: "callchain",
393 |           multiple: true,
394 |           children: [
395 |             {
396 |               field: "text",
397 |               label: "funcname",
398 |               value: "",
399 |               helperText: "xxx",
400 |             },
401 |             {
402 |               field: "text",
403 |               label: "parameters",
404 |               value: "",
405 |               helperText: "xxx",
406 |             },
407 |             {
408 |               field: "text",
409 |               label: "predicate",
410 |               value: "",
411 |               helperText: "xxx",
412 |             },
413 |           ],
414 |         },
415 |         {
416 |           field: "text",
417 |           label: "failtype",
418 |           value: 0,
419 |           helperText: "xxx",
420 |         },
421 |         {
422 |           field: "label",
423 |           label: "headers",
424 |           value: [],
425 |           helperText: "xxx",
426 |         },
427 |         {
428 |           field: "text",
429 |           label: "probability",
430 |           value: 0,
431 |           helperText: "xxx",
432 |         },
433 |         {
434 |           field: "text",
435 |           label: "times",
436 |           value: 0,
437 |           helperText: "xxx",
438 |         },
439 |       ],
440 |     },
441 |   ];
442 | 
443 | export default shared;
444 | ```
445 | 
446 | But this is not an ideal result, and there are still some details that need to be
447 | addressed, like:
448 | 
449 | - ~~Some useless markers remain in the `helperText`.~~
450 | - ~~If an action only uses shared fields, the spread operator will be excess.~~
451 | - Output components directly? (enhancement)
452 | - ~~$ref siblings aren't supported in OpenAPI v2~~
453 | 
454 | For example:
455 | 
456 | ```yaml
457 | attr:
458 |   $ref: "#/definitions/v1alpha1.AttrOverrideSpec"
459 |   description: |-
460 |     Attr defines the overrided attribution
461 |     +ui:form:when=action=='attrOverride'
462 |     +optional
463 |   type: object
464 | ```
465 | 
466 | The above definition will be converted to:
467 | 
468 | ```ts
469 | /**
470 |  *
471 |  * @type {V1alpha1AttrOverrideSpec}
472 |  * @memberof V1alpha1IOChaosSpec
473 |  */
474 | attr?: V1alpha1AttrOverrideSpec
475 | ```
476 | 
477 | The `description` will be lost.
478 | 
479 | ## Drawbacks
480 | 
481 | The biggest disadvantage of automation is that we generate a lot of useless structures,
482 | but luckily we can fix this in the compilation phase.
483 | 
484 | There is also the fact that we have to adapt the existing code to what the automation
485 | generates, which can still be a lot of work. Automation reduces the coding burden,
486 | but it can make debugging more difficult.
487 | 
488 | ## Alternatives
489 | 
490 | There is still another way to do these things: use Go code to generate the code.
491 | But it is not the best way, because **it only benefits the part that converts
492 | native Go types to TypeScript types; it is hard to generate real TypeScript
493 | types this way (they would have to be written from scratch)**.
494 | 
495 | ## Unresolved questions
496 | 
497 | Already described in [Result](#result).
--------------------------------------------------------------------------------
/text/2022-01-17-keep-a-changelog.md:
--------------------------------------------------------------------------------
 1 | # Keep A Changelog
 2 | 
 3 | ## Summary
 4 | 
 5 | 
 6 | 
 7 | Proposal for keeping a changelog for https://github.com/chaos-mesh/chaos-mesh.
 8 | 
 9 | ## Motivation
10 | 
11 | 
13 | 
14 | It takes a lot of hand work to collect all the changes into release notes
15 | before releasing a new version. It might be better to keep a changelog, committing
16 | the changes with each PR.
17 | 
18 | Thanks to @yangkeao, who introduced us to https://keepachangelog.com/ and its
19 | best practices.
20 | 
21 | ## Detailed design
22 | 
23 | 
28 | 
29 | This proposal contains no technical details or code; it only
30 | affects several steps in creating and reviewing PRs.
31 | 
32 | Most of the following content is copied from https://keepachangelog.com/; we will
33 | follow its best practices and patterns.
34 | 
35 | ### Guiding Principles
36 | 
37 | - Changelogs are for humans, not machines.
38 | - There should be an entry for every single version.
39 | - The same types of changes should be grouped.
40 | - Versions and sections should be linkable.
41 | - The latest version comes first.
42 | - The release date of each version is displayed.
43 | - Mention whether you follow [Semantic Versioning](https://semver.org/).
44 | 
45 | ### Types of changes
46 | 
47 | - `Added` for new features.
48 | - `Changed` for changes in existing functionality.
49 | - `Deprecated` for soon-to-be removed features.
50 | - `Removed` for now removed features.
51 | - `Fixed` for any bug fixes.
52 | - `Security` in case of vulnerabilities.
53 | 
54 | When creating a new `Unreleased` section, we keep all the types of changes
55 | with "Nothing" entries, and trim the empty entries when releasing a new version.
56 | 
57 | ### When to create new items in the changelog
58 | 
59 | - As a contributor, when opening a new PR, if you think this PR should be
60 |   considered in the changelog/release notes, please create a new item in the
61 |   changelog.
62 | - As a reviewer, if you think a PR that does not update the changelog is
63 |   important enough to deserve an entry, please ask the author of the PR to create one.
64 | - As a contributor, if you find a change that is important enough to deserve a
65 |   changelog entry but does not have one, please open a new PR to update the changelog.
66 | - As a reviewer/committer/maintainer, when releasing a new version, sync the
67 |   changelog with the released version.
68 | 
69 | ### CHANGELOG.md in release-* branches
70 | 
71 | We should maintain a changelog for each active `release-*` branch; for now,
72 | they are `release-2.0` and `release-2.1`. I will create a new file
73 | `CHANGELOG.md` in each branch from the existing release notes. `CHANGELOG.md`
74 | should also contain an `Unreleased` section for the next patch/bugfix release.
75 | 
76 | When we cherry-pick a PR into a `release-*` branch, if the original PR already
77 | has items in the `Unreleased` section, they can also be cherry-picked there.
78 | 
79 | #### When releasing major, minor, or bugfix/patch version
80 | 
81 | The detailed steps for creating a new release should be documented in the release
82 | guide `RELEASE.md`. We should manually change the `Unreleased` section to `[x.y.z] -
83 | YYYY-MM-DD` in the `release-*` branch, and manually update `CHANGELOG.md` in the
84 | `master` branch.
85 | 
86 | So `CHANGELOG.md` in `release-*` only contains sections for minor and
87 | bugfix/patch releases, and `CHANGELOG.md` in `master` contains sections for all
88 | the releases.
89 | 
90 | One principle is "The latest version comes first", and by semantic
91 | versioning's comparison rules, `2.1.2 > 2.0.6`.
So the sections in `CHANGELOG.md` in
92 | `master` should look like:
93 | 
94 | ```markdown
95 | 
96 | - [2.1.2] - 2021-12-29
97 | 
98 | (contents)
99 | 
100 | - [2.1.1] - 2021-12-10
101 | 
102 | (contents)
103 | 
104 | - [2.1.0] - 2021-11-30
105 | 
106 | (contents)
107 | 
108 | - [2.0.6] - 2021-12-29
109 | 
110 | (contents)
111 | 
112 | - [2.0.5] - 2021-11-25
113 | 
114 | (contents)
115 | 
116 | ```
117 | 
118 | ### What would be the first step
119 | 
120 | We will create a file called `CHANGELOG.md` in the `master` branch, with content like:
121 | 
122 | ```markdown
123 | # Chaos Mesh Changelog
124 | 
125 | (descriptions)
126 | 
127 | (guideline, link to this rfc)
128 | 
129 | ## [Unreleased]
130 | 
131 | ### Added
132 | 
133 | - some [#(pr-number)](https://link)
134 | - entries [#(pr-number)](https://link) [#(pr-number)](https://link)
135 | - (I would collect items from commits in master branch)
136 | 
137 | ### Changed
138 | 
139 | - ditto
140 | 
141 | ### Deprecated
142 | 
143 | (leave "Nothing" only in "Unreleased" section)
144 | - Nothing
145 | 
146 | ### Removed
147 | 
148 | - Nothing
149 | 
150 | ### Fixed
151 | 
152 | - Nothing
153 | 
154 | ### Security
155 | 
156 | - Nothing
157 | 
158 | ## [2.1.2] - 2021-12-29
159 | 
160 | ### Changed
161 | 
162 | - some [#(pr-number)](https://link)
163 | - entries [#(pr-number)](https://link) [#(pr-number)](https://link)
164 | - (I would collect items from existing release notes)
165 | 
166 | ### Fixed
167 | 
168 | - ditto
169 | 
170 | ## [2.0.6] - 2021-12-29
171 | 
172 | ### Changed
173 | 
174 | - ditto
175 | 
176 | ### Fixed
177 | 
178 | - ditto
179 | ```
180 | 
181 | And create a `CHANGELOG.md` in each `release-*` branch with similar content.
182 | 
183 | ## Drawbacks
184 | 
185 | 
186 | 
187 | No foreseeable drawbacks block this proposal now.
188 | 
189 | ## Alternatives
190 | 
191 | 
195 | 
196 | No other alternative solutions yet.
197 | 
198 | ## Unresolved questions
199 | 
200 | 
201 | 
202 | Does Chaos Mesh follow Semantic Versioning?
203 | 
204 | I think Chaos Mesh does not follow Semantic Versioning now. We release
205 | major versions more for marketing reasons, and we introduce breaking API
206 | changes in minor releases. It does not matter much now, because
207 | https://github.com/chaos-mesh/chaos-mesh is not designed to be used as a
208 | dependency by other projects. I think we should mention that in the changelog.
209 | 
210 | Regarding "Types of changes", which pattern would be better?
211 | 
212 | Pattern A:
213 | 
214 | - `Added` for new features.
215 | - `Changed` for changes in existing functionality.
216 | - `Deprecated` for soon-to-be removed features.
217 | - `Removed` for now removed features.
218 | - `Fixed` for any bug fixes.
219 | - `Security` in case of vulnerabilities.
220 | 
221 | Pattern B:
222 | 
223 | - `New Features` for new features.
224 | - `Enhancements` for changes in existing functionality.
225 | - `Deprecated` for soon-to-be removed features.
226 | - `Removed` for now removed features.
227 | - `Fixed` for any bug fixes.
228 | - `Security` in case of vulnerabilities.
229 | 
230 | I prefer Pattern A because a "change" might not be an "enhancement". Maybe we could
231 | mix them.
232 | 
--------------------------------------------------------------------------------
/text/2022-02-21-workflow-status-check.md:
--------------------------------------------------------------------------------
 1 | # Status Check in Workflow
 2 | 
 3 | ## Summary
 4 | 
 5 | The status check is responsible for collecting the system status before,
 6 | during, and after the chaos workflow execution, and is used to determine
 7 | whether the chaos workflow is successful or not. The chaos workflow can
 8 | also be stopped automatically when the system becomes unhealthy during
 9 | its execution.
10 | 
11 | ## Motivation
12 | 
13 | Currently, when executing chaos workflows, the user cannot quickly determine
14 | the impact of the chaos workflow on the system.
15 | 
16 | One conceivable path is:
17 | 
18 | 1. click to start the workflow
19 | 1. manually check the key panels on the monitoring system
20 | 1. if the system is observed to become unhealthy
21 | 1. go back to Chaos Dashboard and manually stop the workflow
22 | 
23 | This is clearly a user-unfriendly design. To optimize this process, it is
24 | necessary to introduce status checks into workflows.
25 | 
26 | ## Detailed design
27 | 
28 | ### Concept
29 | 
30 | #### StatusCheck Template
31 | 
32 | `StatusCheck Template` enables a `StatusCheck` to be quickly reused by
33 | multiple workflows. Users can create the `StatusCheck Template` in advance
34 | and then create the `StatusCheck` by referring to the `StatusCheck Template`
35 | when creating the workflow.
36 | 
37 | #### StatusCheck
38 | 
39 | `StatusCheck` defines how the user wants to check the health status of
40 | the system. Users can create a `StatusCheck` by referring to a
41 | `StatusCheck Template` or by customizing it on Chaos Dashboard.
42 | 
43 | ### `StatusCheck` Properties
44 | 
45 | #### General Properties
46 | 
47 | - Execution mode; it can be Continuous or Synchronous
48 | - Overall execution time; it corresponds to the `deadline` of the
49 |   `WorkflowNode`, after which the execution of the `StatusCheck` stops
50 | - Timeout of a single execution; different types of StatusCheck implement this
51 |   differently. For example, in an HTTP StatusCheck it is the response time
52 |   of the request, beyond which the execution is considered to have failed
53 | - Number of retries
54 |   - BTW, a Synchronous StatusCheck also has a retry mechanism, to reduce
55 |     the impact of system jitter
56 | - Retry interval
57 | - Whether a failure aborts the workflow
58 | - How many execution history records are kept
59 | 
60 | Here are the details of the execution mode. Continuous StatusCheck and
61 | Synchronous StatusCheck are both supported as children of a
62 | `Parallel WorkflowNode` or `Serial WorkflowNode`, or as the `EntryNode`
63 | (a StatusCheck as `EntryNode` is not really meaningful, but it can be
64 | written that way).
65 | 
66 | The recommended scenario for a Continuous StatusCheck is as follows:
67 | the Continuous StatusCheck is a child of a Parallel `EntryNode`.
68 | In this scenario, the status check continues
69 | throughout the workflow execution.
70 | 
71 | ```yaml
72 | templates:
73 |   - name: the-entry
74 |     templateType: Parallel
75 |     deadline: 240s
76 |     children:
77 |       - status-check
78 |       - node1
79 |       - node2
80 |   - name: status-check
81 |     templateType: StatusCheck
82 |     ...
83 | ```
84 | 
85 | The YAML example for the rest of the cases is as follows:
86 | 
87 | ```yaml
88 | templates:
89 |   - name: node0
90 |     templateType: Serial
91 |     deadline: 240s
92 |     children:
93 |       - status-check
94 |       - node1
95 |       - node2
96 |   - name: status-check
97 |     templateType: StatusCheck
98 |     ...
99 | ```
100 | 
101 | #### The Status of `StatusCheck`
102 | 
103 | - Conditions of the current StatusCheck
104 | - StatusCheck execution history, including execution times and outcomes
105 | 
106 | ### Status Check with HTTP
107 | 
108 | An HTTP StatusCheck determines the health of the system by the response code,
109 | response time, or response body returned from the request URL.
110 | 
111 | #### HTTP StatusCheck Properties
112 | 
113 | - Request URL
114 |   - For example: system health API, system key API, Grafana alert API
115 | - Request Method
116 | - Request Header
117 |   - For example: AUTH-KEY
118 | - Response Time
119 | - Response Code
120 | - Response Body
121 | 
122 | Here is an example YAML of an HTTP StatusCheck:
123 | 
124 | ```yaml
125 | apiVersion: chaos-mesh.org/v1alpha1
126 | kind: StatusCheck
127 | metadata:
128 |   name: try-workflow-status-check
129 |   annotations:
130 |     "experiment.chaos-mesh.org/abort": "false"
131 |     "experiment.chaos-mesh.org/description": "try-workflow-status-check"
132 | spec:
133 |   mode: Synchronous
134 |   type: HTTP
135 |   deadline: 20s
136 |   timeoutSeconds: 1
137 |   failureThreshold: 3
138 |   periodSeconds: 3
139 |   historyLimit: 10
140 |   abortIfFailed: true
141 |   http:
142 |     url: http://1.1.1.1:8080
143 |     method: GET
144 |     body: ""
145 |     headers:
146 |       - name: a
147 |         value: b
148 |     criteria:
149 |       responseCode: "200-209"
150 | status:
151 |   conditions:
152 |     - type: Abort # ProbeSuccess/Accomplished/DeadlineExceed/Abort
153 |       status: "False"
154 |       reason: "Unknown"
155 |   records:
156 |     - probeTime: 2018-01-01T00:00:00Z
157 |       outcome: Success # Success/Failure
158 | ```
159 | 
160 | HTTP StatusCheck vs. the existing `Workflow HTTP Request Task`:
161 | 
162 | - `Workflow HTTP Request Task` is a one-shot request, not a continuous one
163 | - `Workflow HTTP Request Task` cannot stop the workflow
164 | 
165 | #### How to abort `Workflow`
166 | 
167 | There are two ways to abort a `Workflow`:
168 | 
169 | - When the StatusCheck fails, it can abort the workflow automatically
170 | - Users can abort the Workflow manually, by adding the annotation to the workflow
171 | 
172 | ## Drawbacks
173 | 
174 | - Saving the history of a `StatusCheck` in the `status` of the `StatusCheck`
175 | 
176 |   Since an object in Kubernetes cannot exceed `1M`, we have to consider the
177 |   extreme case of a huge amount of `StatusCheck` history, so we added a
178 |   `HistoryLimit` field to limit the number of records that can be saved.
179 | 
180 | ## Alternatives
181 | 
182 | - `StatusCheck` in `Experiment` or `Schedule`?
183 | 
184 |   For now, putting `StatusCheck` in `Workflow` seems to be the appropriate
185 |   choice. We don't want to inflate `Experiment` or `Schedule` functionality
186 |   too much, so if there are scenarios that require `StatusCheck`,
187 |   then using `Workflow` is recommended.
188 | - `StatusCheck` with other types?
189 | 
190 |   For example, getting data from Prometheus metrics, or executing some
191 |   commands.
192 | 
193 |   If you want to get data from Prometheus metrics, it is more efficient to
194 |   determine whether Grafana triggers an alert through HTTP requests (the data
195 |   from Prometheus needs to be computed with PromQL, while Grafana alerts are
196 |   configured in advance, which seems easier to use).
197 | 
198 |   If the HTTP StatusCheck does not meet your needs, feel free to make
199 |   suggestions (BTW, it is better to explain your usage scenarios,
200 |   so that we can better help you solve the problem).
201 | 
202 | ## Unresolved questions
203 | 
--------------------------------------------------------------------------------