├── .github └── workflows │ └── markdownlint.yml ├── .gitignore ├── .markdownlint.json ├── LICENSE ├── README.md ├── media └── kubernetes-dashboard-login.png ├── package-lock.json ├── package.json ├── template.md └── text ├── 2020-09-08-easily-send-http-request-on-workflow-task.md ├── 2020-10-22-authn-and-authz-on-chaos-dashboard.md ├── 2020-11-09-chaos-mesh-workflow.md ├── 2021-03-11-unified-selector.md ├── 2021-07-23-extensible-chaosctl.md ├── 2021-08-11-implement-physical-machine-chaos.md ├── 2021-09-09-physical-machine-auth.md ├── 2021-09-27-refine-error-handling.md ├── 2021-10-08-monitoring-metrics-about-chaos-mesh.md ├── 2021-11-17-ui-monorepo.md ├── 2021-12-09-logging.md ├── 2021-12-29-openapi-to-typescript-api-client-and-forms.md ├── 2022-01-17-keep-a-changelog.md └── 2022-02-21-workflow-status-check.md /.github/workflows/markdownlint.yml: -------------------------------------------------------------------------------- 1 | # This workflow will do a clean install of node dependencies, build the source code and run tests across different versions of node 2 | # For more information see: https://help.github.com/actions/language-and-framework-guides/using-nodejs-with-github-actions 3 | 4 | name: markdownlint 5 | 6 | on: 7 | push: 8 | branches: [main] 9 | pull_request: 10 | branches: [main] 11 | 12 | jobs: 13 | build: 14 | runs-on: ubuntu-latest 15 | steps: 16 | - uses: actions/checkout@v3 17 | - name: Use Node.js 14.x 18 | uses: actions/setup-node@v3 19 | with: 20 | node-version: 14.x 21 | cache: "npm" 22 | - run: npm ci 23 | - run: npm run lint 24 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | node_modules/ 2 | -------------------------------------------------------------------------------- /.markdownlint.json: -------------------------------------------------------------------------------- 1 | { 2 | "default": true, 3 | "MD013": { 4 | "code_blocks": false 5 | }, 6 | "MD033": false, 7 | "MD034": false 8 | } 9 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 
29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Chaos Mesh RFCs 2 | 3 | Many changes, including bug fixes and documentation improvements can be 4 | implemented and reviewed via the normal GitHub pull request workflow. 5 | 6 | Some changes though are "substantial", and we ask that these be put through a 7 | bit of a design process and produce a consensus among the Chaos Mesh community. 8 | 9 | The "RFC" (request for comments) process is intended to provide a consistent 10 | and controlled path for new features to enter the project, so that all 11 | stakeholders can be confident about the direction the project is evolving in. 12 | 13 | ## How to submit an RFC 14 | 15 | 1. Copy `template.md` into `text/YYYY-MM-DD-my-feature.md`. 16 | 2. Write the document and fill in the blanks. 17 | 3. Submit a pull request. 18 | 19 | ## Timeline of an RFC 20 | 21 | 1. An RFC is submitted as a PR. 22 | 2. Discussion takes place, and the text is revised in response. 23 | 3. The PR is merged or closed when at least two project maintainers reach 24 | consensus. 25 | 26 | ## Style of an RFC 27 | 28 | We follow lint rules listed in 29 | [markdownlint](https://github.com/DavidAnson/markdownlint/blob/main/doc/Rules.md). 30 | 31 | Run lints (you must have [Node.js](https://nodejs.org) installed): 32 | 33 | ```bash 34 | # Install linters: npm install 35 | npm run lint 36 | ``` 37 | 38 | ## License 39 | 40 | This content is licensed under Apache License, Version 2.0, 41 | ([LICENSE](LICENSE) or http://www.apache.org/licenses/LICENSE-2.0) 42 | 43 | ## Contributions 44 | 45 | Unless you explicitly state otherwise, any contribution intentionally submitted 46 | for inclusion in the work by you, as defined in the Apache-2.0 license, shall 47 | be dual licensed as above, without any additional terms or conditions. 
48 | -------------------------------------------------------------------------------- /media/kubernetes-dashboard-login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaos-mesh/rfcs/a00e8824b58ead04f39b08d705a481e61a1bf11d/media/kubernetes-dashboard-login.png -------------------------------------------------------------------------------- /package-lock.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "rfcs", 3 | "version": "0.0.0", 4 | "lockfileVersion": 1, 5 | "requires": true, 6 | "dependencies": { 7 | "argparse": { 8 | "version": "1.0.10", 9 | "resolved": "https://registry.npmjs.org/argparse/-/argparse-1.0.10.tgz", 10 | "integrity": "sha512-o5Roy6tNG4SL/FOkCAN6RzjiakZS25RLYFrcMttJqbdd8BWrnA+fGz57iN5Pb06pvBGvl5gQ0B48dJlslXvoTg==", 11 | "requires": { 12 | "sprintf-js": "~1.0.2" 13 | } 14 | }, 15 | "balanced-match": { 16 | "version": "1.0.0", 17 | "resolved": "https://registry.npmjs.org/balanced-match/-/balanced-match-1.0.0.tgz", 18 | "integrity": "sha1-ibTRmasr7kneFk6gK4nORi1xt2c=" 19 | }, 20 | "brace-expansion": { 21 | "version": "1.1.11", 22 | "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-1.1.11.tgz", 23 | "integrity": "sha512-iCuPHDFgrHX7H2vEI/5xpz07zSHB00TpugqhmYtVmMO6518mCuRMoOYFldEBl0g187ufozdaHgWKcYFb61qGiA==", 24 | "requires": { 25 | "balanced-match": "^1.0.0", 26 | "concat-map": "0.0.1" 27 | } 28 | }, 29 | "commander": { 30 | "version": "2.9.0", 31 | "resolved": "https://registry.npmjs.org/commander/-/commander-2.9.0.tgz", 32 | "integrity": "sha1-nJkJQXbhIkDLItbFFGCYQA/g99Q=", 33 | "requires": { 34 | "graceful-readlink": ">= 1.0.0" 35 | } 36 | }, 37 | "concat-map": { 38 | "version": "0.0.1", 39 | "resolved": "https://registry.npmjs.org/concat-map/-/concat-map-0.0.1.tgz", 40 | "integrity": "sha1-2Klr13/Wjfd5OnMDajug1UBdR3s=" 41 | }, 42 | "deep-extend": { 43 | "version": "0.5.1", 44 | "resolved": "https://registry.npmjs.org/deep-extend/-/deep-extend-0.5.1.tgz", 45 | "integrity": "sha512-N8vBdOa+DF7zkRrDCsaOXoCs/E2fJfx9B9MrKnnSiHNh4ws7eSys6YQE4KvT1cecKmOASYQBhbKjeuDD9lT81w==" 46 | }, 47 | "entities": { 48 | "version": "2.0.3", 49 | "resolved": "https://registry.npmjs.org/entities/-/entities-2.0.3.tgz", 50 | "integrity": "sha512-MyoZ0jgnLvB2X3Lg5HqpFmn1kybDiIfEQmKzTb5apr51Rb+T3KdmMiqa70T+bhGnyv7bQ6WMj2QMHpGMmlrUYQ==" 51 | }, 52 | "esprima": { 53 | "version": "4.0.1", 54 | "resolved": "https://registry.npmjs.org/esprima/-/esprima-4.0.1.tgz", 55 | "integrity": "sha512-eGuFFw7Upda+g4p+QHvnW0RyTX/SVeJBDM/gCtMARO0cLuT2HcEKnTPvhjV6aGeqrCB/sbNop0Kszm0jsaWU4A==" 56 | }, 57 | "fs.realpath": { 58 | "version": "1.0.0", 59 | "resolved": "https://registry.npmjs.org/fs.realpath/-/fs.realpath-1.0.0.tgz", 60 | "integrity": "sha1-FQStJSMVjKpA20onh8sBQRmU6k8=" 61 | }, 62 | "get-stdin": { 63 | "version": "5.0.1", 64 | "resolved": "https://registry.npmjs.org/get-stdin/-/get-stdin-5.0.1.tgz", 65 | "integrity": "sha1-Ei4WFZHiH/TFJTAwVpPyDmOTo5g=" 66 | }, 67 | "glob": { 68 | "version": "7.1.6", 69 | "resolved": "https://registry.npmjs.org/glob/-/glob-7.1.6.tgz", 70 | "integrity": "sha512-LwaxwyZ72Lk7vZINtNNrywX0ZuLyStrdDtabefZKAY5ZGJhVtgdznluResxNmPitE0SAO+O26sWTHeKSI2wMBA==", 71 | "requires": { 72 | "fs.realpath": "^1.0.0", 73 | "inflight": "^1.0.4", 74 | "inherits": "2", 75 | "minimatch": "^3.0.4", 76 | "once": "^1.3.0", 77 | "path-is-absolute": "^1.0.0" 78 | } 79 | }, 80 | "graceful-readlink": { 81 | "version": "1.0.1", 82 | "resolved": 
"https://registry.npmjs.org/graceful-readlink/-/graceful-readlink-1.0.1.tgz", 83 | "integrity": "sha1-TK+tdrxi8C+gObL5Tpo906ORpyU=" 84 | }, 85 | "ignore": { 86 | "version": "5.1.8", 87 | "resolved": "https://registry.npmjs.org/ignore/-/ignore-5.1.8.tgz", 88 | "integrity": "sha512-BMpfD7PpiETpBl/A6S498BaIJ6Y/ABT93ETbby2fP00v4EbvPBXWEoaR1UBPKs3iR53pJY7EtZk5KACI57i1Uw==" 89 | }, 90 | "inflight": { 91 | "version": "1.0.6", 92 | "resolved": "https://registry.npmjs.org/inflight/-/inflight-1.0.6.tgz", 93 | "integrity": "sha1-Sb1jMdfQLQwJvJEKEHW6gWW1bfk=", 94 | "requires": { 95 | "once": "^1.3.0", 96 | "wrappy": "1" 97 | } 98 | }, 99 | "inherits": { 100 | "version": "2.0.4", 101 | "resolved": "https://registry.npmjs.org/inherits/-/inherits-2.0.4.tgz", 102 | "integrity": "sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ==" 103 | }, 104 | "ini": { 105 | "version": "1.3.5", 106 | "resolved": "https://registry.npmjs.org/ini/-/ini-1.3.5.tgz", 107 | "integrity": "sha512-RZY5huIKCMRWDUqZlEi72f/lmXKMvuszcMBduliQ3nnWbx9X/ZBQO7DijMEYS9EhHBb2qacRUMtC7svLwe0lcw==" 108 | }, 109 | "js-yaml": { 110 | "version": "3.13.1", 111 | "resolved": "https://registry.npmjs.org/js-yaml/-/js-yaml-3.13.1.tgz", 112 | "integrity": "sha512-YfbcO7jXDdyj0DGxYVSlSeQNHbD7XPWvrVWeVUujrQEoZzWJIRrCPoyk6kL6IAjAG2IolMK4T0hNUe0HOUs5Jw==", 113 | "requires": { 114 | "argparse": "^1.0.7", 115 | "esprima": "^4.0.0" 116 | } 117 | }, 118 | "jsonc-parser": { 119 | "version": "2.2.1", 120 | "resolved": "https://registry.npmjs.org/jsonc-parser/-/jsonc-parser-2.2.1.tgz", 121 | "integrity": "sha512-o6/yDBYccGvTz1+QFevz6l6OBZ2+fMVu2JZ9CIhzsYRX4mjaK5IyX9eldUdCmga16zlgQxyrj5pt9kzuj2C02w==" 122 | }, 123 | "linkify-it": { 124 | "version": "3.0.2", 125 | "resolved": "https://registry.npmjs.org/linkify-it/-/linkify-it-3.0.2.tgz", 126 | "integrity": "sha512-gDBO4aHNZS6coiZCKVhSNh43F9ioIL4JwRjLZPkoLIY4yZFwg264Y5lu2x6rb1Js42Gh6Yqm2f6L2AJcnkzinQ==", 127 | "requires": { 128 | "uc.micro": "^1.0.1" 129 | } 130 | }, 131 | "lodash.differencewith": { 132 | "version": "4.5.0", 133 | "resolved": "https://registry.npmjs.org/lodash.differencewith/-/lodash.differencewith-4.5.0.tgz", 134 | "integrity": "sha1-uvr7yRi1UVTheRdqALsK76rIVLc=" 135 | }, 136 | "lodash.flatten": { 137 | "version": "4.4.0", 138 | "resolved": "https://registry.npmjs.org/lodash.flatten/-/lodash.flatten-4.4.0.tgz", 139 | "integrity": "sha1-8xwiIlqWMtK7+OSt2+8kCqdlph8=" 140 | }, 141 | "markdown-it": { 142 | "version": "11.0.0", 143 | "resolved": "https://registry.npmjs.org/markdown-it/-/markdown-it-11.0.0.tgz", 144 | "integrity": "sha512-+CvOnmbSubmQFSA9dKz1BRiaSMV7rhexl3sngKqFyXSagoA3fBdJQ8oZWtRy2knXdpDXaBw44euz37DeJQ9asg==", 145 | "requires": { 146 | "argparse": "^1.0.7", 147 | "entities": "~2.0.0", 148 | "linkify-it": "^3.0.1", 149 | "mdurl": "^1.0.1", 150 | "uc.micro": "^1.0.5" 151 | } 152 | }, 153 | "markdownlint": { 154 | "version": "0.21.1", 155 | "resolved": "https://registry.npmjs.org/markdownlint/-/markdownlint-0.21.1.tgz", 156 | "integrity": "sha512-8kc88w5dyEzlmOWIElp8J17qBgzouOQfJ0LhCcpBFrwgyYK6JTKvILsk4FCEkiNqHkTxwxopT2RS2DYb/10qqg==", 157 | "requires": { 158 | "markdown-it": "11.0.0" 159 | } 160 | }, 161 | "markdownlint-cli": { 162 | "version": "0.24.0", 163 | "resolved": "https://registry.npmjs.org/markdownlint-cli/-/markdownlint-cli-0.24.0.tgz", 164 | "integrity": "sha512-AusUxaX4sFayUBFTCKeHc8+fq73KFqIUW+ZZZYyQ/BvY0MoGAnE2C/3xiawSE7WXmpmguaWzhrXRuY6IrOLX7A==", 165 | "requires": { 166 | "commander": "~2.9.0", 167 | "deep-extend": "~0.5.1", 
168 | "get-stdin": "~5.0.1", 169 | "glob": "~7.1.2", 170 | "ignore": "~5.1.4", 171 | "js-yaml": "~3.13.1", 172 | "jsonc-parser": "~2.2.0", 173 | "lodash.differencewith": "~4.5.0", 174 | "lodash.flatten": "~4.4.0", 175 | "markdownlint": "~0.21.0", 176 | "markdownlint-rule-helpers": "~0.12.0", 177 | "minimatch": "~3.0.4", 178 | "minimist": "~1.2.5", 179 | "rc": "~1.2.7" 180 | } 181 | }, 182 | "markdownlint-rule-helpers": { 183 | "version": "0.12.0", 184 | "resolved": "https://registry.npmjs.org/markdownlint-rule-helpers/-/markdownlint-rule-helpers-0.12.0.tgz", 185 | "integrity": "sha512-Q7qfAk+AJvx82ZY52OByC4yjoQYryOZt6D8TKrZJIwCfhZvcj8vCQNuwDqILushtDBTvGFmUPq+uhOb1KIMi6A==" 186 | }, 187 | "mdurl": { 188 | "version": "1.0.1", 189 | "resolved": "https://registry.npmjs.org/mdurl/-/mdurl-1.0.1.tgz", 190 | "integrity": "sha1-/oWy7HWlkDfyrf7BAP1sYBdhFS4=" 191 | }, 192 | "minimatch": { 193 | "version": "3.0.4", 194 | "resolved": "https://registry.npmjs.org/minimatch/-/minimatch-3.0.4.tgz", 195 | "integrity": "sha512-yJHVQEhyqPLUTgt9B83PXu6W3rx4MvvHvSUvToogpwoGDOUQ+yDrR0HRot+yOCdCO7u4hX3pWft6kWBBcqh0UA==", 196 | "requires": { 197 | "brace-expansion": "^1.1.7" 198 | } 199 | }, 200 | "minimist": { 201 | "version": "1.2.5", 202 | "resolved": "https://registry.npmjs.org/minimist/-/minimist-1.2.5.tgz", 203 | "integrity": "sha512-FM9nNUYrRBAELZQT3xeZQ7fmMOBg6nWNmJKTcgsJeaLstP/UODVpGsr5OhXhhXg6f+qtJ8uiZ+PUxkDWcgIXLw==" 204 | }, 205 | "once": { 206 | "version": "1.4.0", 207 | "resolved": "https://registry.npmjs.org/once/-/once-1.4.0.tgz", 208 | "integrity": "sha1-WDsap3WWHUsROsF9nFC6753Xa9E=", 209 | "requires": { 210 | "wrappy": "1" 211 | } 212 | }, 213 | "path-is-absolute": { 214 | "version": "1.0.1", 215 | "resolved": "https://registry.npmjs.org/path-is-absolute/-/path-is-absolute-1.0.1.tgz", 216 | "integrity": "sha1-F0uSaHNVNP+8es5r9TpanhtcX18=" 217 | }, 218 | "rc": { 219 | "version": "1.2.8", 220 | "resolved": "https://registry.npmjs.org/rc/-/rc-1.2.8.tgz", 221 | "integrity": "sha512-y3bGgqKj3QBdxLbLkomlohkvsA8gdAiUQlSBJnBhfn+BPxg4bc62d8TcBW15wavDfgexCgccckhcZvywyQYPOw==", 222 | "requires": { 223 | "deep-extend": "^0.6.0", 224 | "ini": "~1.3.0", 225 | "minimist": "^1.2.0", 226 | "strip-json-comments": "~2.0.1" 227 | }, 228 | "dependencies": { 229 | "deep-extend": { 230 | "version": "0.6.0", 231 | "resolved": "https://registry.npmjs.org/deep-extend/-/deep-extend-0.6.0.tgz", 232 | "integrity": "sha512-LOHxIOaPYdHlJRtCQfDIVZtfw/ufM8+rVj649RIHzcm/vGwQRXFt6OPqIFWsm2XEMrNIEtWR64sY1LEKD2vAOA==" 233 | } 234 | } 235 | }, 236 | "sprintf-js": { 237 | "version": "1.0.3", 238 | "resolved": "https://registry.npmjs.org/sprintf-js/-/sprintf-js-1.0.3.tgz", 239 | "integrity": "sha1-BOaSb2YolTVPPdAVIDYzuFcpfiw=" 240 | }, 241 | "strip-json-comments": { 242 | "version": "2.0.1", 243 | "resolved": "https://registry.npmjs.org/strip-json-comments/-/strip-json-comments-2.0.1.tgz", 244 | "integrity": "sha1-PFMZQukIwml8DsNEhYwobHygpgo=" 245 | }, 246 | "uc.micro": { 247 | "version": "1.0.6", 248 | "resolved": "https://registry.npmjs.org/uc.micro/-/uc.micro-1.0.6.tgz", 249 | "integrity": "sha512-8Y75pvTYkLJW2hWQHXxoqRgV7qb9B+9vFEtidML+7koHUFapnVJAZ6cKs+Qjz5Aw3aZWHMC6u0wJE3At+nSGwA==" 250 | }, 251 | "wrappy": { 252 | "version": "1.0.2", 253 | "resolved": "https://registry.npmjs.org/wrappy/-/wrappy-1.0.2.tgz", 254 | "integrity": "sha1-tSQ9jz7BqjXxNkYFvA0QNuMKtp8=" 255 | } 256 | } 257 | } 258 | -------------------------------------------------------------------------------- /package.json: 
-------------------------------------------------------------------------------- 1 | { 2 | "name": "rfcs", 3 | "version": "0.0.0", 4 | "devDependencies": {}, 5 | "scripts": { 6 | "lint": "markdownlint text/*.md" 7 | }, 8 | "dependencies": { 9 | "markdownlint-cli": "^0.24.0" 10 | } 11 | } 12 | -------------------------------------------------------------------------------- /template.md: -------------------------------------------------------------------------------- 1 | # Title 2 | 3 | ## Summary 4 | 5 | One-paragraph explanation of the proposal. 6 | 7 | ## Motivation 8 | 9 | Why are we doing this? What use cases does it support? What is the expected 10 | outcome? 11 | 12 | ## Detailed design 13 | 14 | This is the bulk of the RFC. Explain the design in enough detail that: 15 | 16 | - It is reasonably clear how the feature would be implemented. 17 | - Corner cases are dissected by example. 18 | - It is clear how the feature is used. 19 | 20 | ## Drawbacks 21 | 22 | Why should we not do this? 23 | 24 | ## Alternatives 25 | 26 | - Why is this design the best in the space of possible designs? 27 | - What other designs have been considered and what is the rationale for not 28 | choosing them? 29 | - What is the impact of not doing this? 30 | 31 | ## Unresolved questions 32 | 33 | What parts of the design are still to be determined? 34 | -------------------------------------------------------------------------------- /text/2020-09-08-easily-send-http-request-on-workflow-task.md: -------------------------------------------------------------------------------- 1 | # Easily Send HTTP Request On Workflow Task 2 | 3 | - [Easily Send HTTP Request On Workflow Task](#easily-send-http-request-on-workflow-task) 4 | - [Summary](#summary) 5 | - [Motivation](#motivation) 6 | - [Detailed design](#detailed-design) 7 | - [Rendering `Task` for sending HTTP request](#rendering-task-for-sending-http-request) 8 | - [Why `curl`](#why-curl) 9 | - [Frontend form parameters](#frontend-form-parameters) 10 | - [Load request body from ConfigMap as file](#load-request-body-from-configmap-as-file) 11 | - [Advanced configuration](#advanced-configuration) 12 | - [Frontend component for this type of `Task` and preview of generated result](#frontend-component-for-this-type-of-task-and-preview-of-generated-result) 13 | - [New context variable: `http`](#new-context-variable-http) 14 | - [Structure of `http`](#structure-of-http) 15 | - [Example conditions](#example-conditions) 16 | - [Drawbacks](#drawbacks) 17 | - [parsing curl command inline back into the parameters](#parsing-curl-command-inline-back-into-the-parameters) 18 | - [parser of context variable `http`](#parser-of-context-variable-http) 19 | - [Alternatives](#alternatives) 20 | - [Alternative solution 1: New type of WorkflowNode/Template for sending HTTP request](#alternative-solution-1-new-type-of-workflownodetemplate-for-sending-http-request) 21 | - [Unresolved questions](#unresolved-questions) 22 | 23 | ## Summary 24 | 25 | Render a `curl` command line into a Workflow `Task` from the commonly used 26 | HTTP parameters, and provide more useful context variables like `json` or `http`. 27 | 28 | > I originally designed the rendering logic in pure frontend/TypeScript, but given 29 | > the requirement of "parsing the curl command line", I think using Golang is better 30 | > for reusing the code and round-trip testing.
31 | 32 | ## Motivation 33 | 34 | The design of the `Task` node in Workflow is very general-purpose: users can run any 35 | workload as a `ContainerSpec` inside the `Task`, and select branches with 36 | conditions. But the `Task` is quite complex to use; users need to write an 37 | entire one-line shell command to call the utilities in the Docker image, such as `curl -X 38 | GET https://your.application/health -H 'Required-Token: token'`. And we do not 39 | want to introduce more types of WorkflowNode for doing that, so I think 40 | rendering the original requirement of "sending an HTTP request" into a `Task` is a good 41 | idea. 42 | 43 | Given the specific context of "sending an HTTP request", we could introduce some 44 | dedicated context variables just for the parsed HTTP status code and response 45 | body. In most cases, an HTTP endpoint like `https://your.application/health` 46 | will return `20x` when the application is healthy and `50x` when the application is not 47 | available. 48 | 49 | ## Detailed design 50 | 51 | ### Rendering `Task` for sending HTTP request 52 | 53 | The basic idea: there are utilities called "curl command line builders", 54 | and we could borrow the core logic from them. 55 | 56 | #### Why `curl` 57 | 58 | We will launch a pod inside the Kubernetes cluster, so we should consider 59 | the image size and multi-architecture/platform support. 60 | 61 | I will use `curlimages/curl:7.78.0` as the Docker image. It's the latest stable 62 | curl for now (2021-09-07). It's small (< 4 MB). It supports the most common 63 | architectures: `386`, `amd64`, `arm64` and so on. 64 | 65 | Why not use other tools like `httpie`? 66 | 67 | `httpie` provides friendlier flags and better colorful output of the HTTP response. 68 | But the image is much larger (about 25 MB) and has no support for other architectures. 69 | 70 | Why not use Python/JavaScript script files with the `python` and `node` Docker images? 71 | 72 | Both official images are large (about 60 MB for Python and 40 MB for Node). And 73 | generating code is much harder than generating a command line. 74 | 75 | Why not build another standalone binary tool? 76 | 77 | We do not want to reinvent the wheel. :D 78 | 79 | #### Frontend form parameters 80 | 81 | We will provide the most commonly used parameters on the frontend: 82 | 83 | - HTTP method 84 | - URL 85 | - Custom headers 86 | - Request body as a string 87 | - Path of the file used as the content of the request body 88 | 89 | "Request body as a string" and "Path of the file used as the content of the request body" 90 | are exclusive. 91 | 92 | > I originally mounted a ConfigMap into the pod and used its content as the request body 93 | > directly. But that cannot be implemented now, because the name of the ConfigMap 94 | > would not appear in the command line, so the parser could not rebuild it; the round 95 | > trip would not work. 96 | 97 | #### Load request body from ConfigMap as file 98 | 99 | `curl` supports loading content from a file as the request body with `-d`. We could 100 | load an existing ConfigMap as the content of the request body; the mount path of 101 | the ConfigMap can be configured.
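To make the rendering step concrete, here is a minimal Go sketch of turning the form parameters above into a `curl` argv for the `Task` container. The `HTTPTaskParams` struct and `RenderCurl` function are hypothetical names for illustration, not the actual Chaos Mesh code:

```go
// A minimal sketch of the rendering described above; HTTPTaskParams and
// RenderCurl are hypothetical names, not the actual Chaos Mesh API.
package render

import "fmt"

// HTTPTaskParams mirrors the frontend form fields listed above.
type HTTPTaskParams struct {
	Method   string            // e.g. "GET"
	URL      string            // e.g. "https://your.application/health"
	Headers  map[string]string // custom headers
	Body     string            // request body as a string (exclusive with BodyFile)
	BodyFile string            // path of the file used as the request body
}

// RenderCurl builds the argv for the Task container running the
// curlimages/curl image; "-i" is included so that the response header
// can later be parsed back into the `http` context variable.
func RenderCurl(p HTTPTaskParams) []string {
	args := []string{"curl", "-i", "-X", p.Method}
	for name, value := range p.Headers {
		args = append(args, "-H", fmt.Sprintf("%s: %s", name, value))
	}
	if p.BodyFile != "" {
		// "-d @<path>" makes curl read the request body from a file,
		// e.g. one mounted from a ConfigMap.
		args = append(args, "-d", "@"+p.BodyFile)
	} else if p.Body != "" {
		args = append(args, "-d", p.Body)
	}
	return append(args, p.URL)
}
```

Keeping the result a flat argv like this is also what makes the round-trip parsing discussed later feasible, since every parameter shows up somewhere in the command line.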
102 | 103 | #### Advanced configuration 104 | 105 | And we could provide advanced configurations: 106 | 107 | - flags: custom flags that will be appended directly at the end of the command line 108 | - image: replace the default image `curlimages/curl:7.78.0`, useful in an air-gapped 109 | cluster 110 | 111 | #### Frontend component for this type of `Task` and preview of generated result 112 | 113 | We want to store the state in one or several `annotation`s of the templates, but that 114 | cannot be done in the near future. There is no embedded `metav1.ObjectMeta` 115 | inside the `Template`. We'd better treat `Template` and `WorkflowNode` like 116 | `PodTemplateSpec` and `Pod`. I think we should split the CRDs for workflow into 117 | another subgroup, then upgrade to `v1alpha2`. 118 | 119 | If we want to show the configuration form of the HTTP request in the frontend, we 120 | can only parse the command line of `curl`, and that is awful: it's hard to 121 | keep consistency between the original parameters and the parsed parameters. I would prefer 122 | NOT to do that. But this feature is very important for users modifying 123 | existing workflows, so I have to implement it. Other ways to implement 124 | this are welcome. 125 | 126 | The generated `Task` should show the preview instantly. 127 | 128 | ### New context variable: `http` 129 | 130 | Although we could use `stdout` to collect the output of `curl`, it would be meaningless 131 | not to provide the parsed HTTP context for the following conditional 132 | branches. 133 | 134 | So we will provide another context variable called `http` for introducing the 135 | parsed HTTP context. It contains a much-simplified, commonly used view of the HTTP request 136 | and response, easy to use in the conditions. 137 | 138 | #### Structure of `http` 139 | 140 | ```golang 141 | type ContextVariableHTTP struct { 142 | Request HTTPRequest `json:"request"` 143 | Response HTTPResponse `json:"response"` 144 | } 145 | type HTTPRequest struct { 146 | Method string `json:"method"` 147 | URL string `json:"url"` 148 | Header http.Header `json:"header"` 149 | Body []byte `json:"body"` 150 | } 151 | type HTTPResponse struct { 152 | StatusCode int `json:"statusCode"` 153 | Header http.Header `json:"header"` 154 | Body []byte `json:"body"` 155 | } 156 | 157 | ``` 158 | 159 | #### Example conditions 160 | 161 | Response returns 20x: 162 | 163 | `http.response.statusCode >= 200 && http.response.statusCode < 300` 164 | 165 | Response returns 50x: 166 | 167 | `http.response.statusCode >= 500` 168 | 169 | Users could select these fields from the context variable `http` and use them in the 170 | conditions. 171 | 172 | ## Drawbacks 173 | 174 | There are several things that have bothered me for a long time; please leave comments if 175 | you have any ideas! 176 | 177 | ### parsing curl command inline back into the parameters 178 | 179 | Here is one concern about "parsing curl command inline back into the 180 | parameters". 181 | 182 | Original demands: 183 | 184 | - restore the parameters of the HTTP request, for later display and modification in the 185 | workflow view. 186 | - restore the parameters of the HTTP request, for the context variable 187 | `http.request` 188 | 189 | Expected implementation: 190 | 191 | Store all the parameters in `annotation`s, with a specific prefix like 192 | `chaos-mesh.org/workflow.http.method=GET` 193 | 194 | Actual implementation: 195 | 196 | Because the struct `Template` in WorkflowSpec does not embed `metav1.Object`, there 197 | is nowhere to place the annotations.
So we have to parse the rendered "curl 198 | command line". We need a lot of work and tests to keep the parsed result 199 | consistent with the original parameters. 200 | 201 | ### parser of context variable `http` 202 | 203 | I want to parse the output of `curl` with the flag `-i`. It will print both the response 204 | header and body to `stdout`. So we need a parser, with a lot of test cases covering 205 | `curl`'s output and the expected `HTTPResponse`. 206 | 207 | And regarding `-L` and HTTP `301`/`302`, I think keeping only the last response is 208 | the right way. 209 | 210 | Another thing is that the `stdout` context variable contains not only `stdout`, 211 | but also `stderr`. It might bring confusion for users. But Kubernetes does not 212 | provide a way to split `stdout` and `stderr` in the `log` subresource of a 213 | `Pod`, and we cannot be expected to implement a collector for each container runtime. 214 | It will not break anything yet, but it does bring a little mess. Maybe we need a 215 | rename, from `stdout` to `log`. 216 | 217 | ## Alternatives 218 | 219 | ### Alternative solution 1: New type of WorkflowNode/Template for sending HTTP request 220 | 221 | I really do want to add some annotations as the metadata for rendering this Template, 222 | but I will not turn them into "official" fields of the CRD. The current CRD is powerful 223 | enough for describing the behavior. 224 | 225 | ## Unresolved questions 226 | 227 | - This implementation is not so friendly to users working with the pure command line and 228 | YAML files. 229 | -------------------------------------------------------------------------------- /text/2020-10-22-authn-and-authz-on-chaos-dashboard.md: -------------------------------------------------------------------------------- 1 | # Authentication and authorization on chaos dashboard 2 | 3 | ## Summary 4 | 5 | We need authentication and authorization to restrict users' permissions and 6 | actions. 7 | 8 | ## Motivation 9 | 10 | Users can create, suspend, get, and delete chaos experiments via the chaos-dashboard. 11 | But the chaos-dashboard does NOT contain any features for permission management; users 12 | can do anything, as long as they can access the chaos-dashboard UI. It is a security 13 | issue. 14 | 15 | We need permission management for: 16 | 17 | - access control for **Resource**. 18 | - User A could create/get IO chaos experiments. 19 | - User B could create/get Network chaos experiments. 20 | - access control for **Action**. 21 | - User A could create/get chaos experiments. 22 | - User B could only get chaos experiments. 23 | 24 | ## Detailed design 25 | 26 | ### Login 27 | 28 | Users are asked for a `Service Account Token` to log in, like the Kubernetes dashboard: 29 | 30 | ![kubernetes login](../media/kubernetes-dashboard-login.png) 31 | 32 | ### Create new users 33 | 34 | The Chaos dashboard does NOT provide any features for creating users. System administrators 35 | should manually create a `ServiceAccount` with a certain username, then bind it to a `Role`. 36 | 37 | ### Implementation references 38 | 39 | Things to do: 40 | 41 | - the frontend asks users to input a token to log in 42 | - the frontend attaches the token while sending requests to the backend 43 | - the backend uses a certain token to create a new kube client 44 | - the backend needs to support multiple users 45 | 46 | > We could reference the auth module in kubernetes-dashboard while implementing this. 47 | 48 | We will provide some pre-set `Role`s, like: 49 | 50 | - Admin: could create/get any chaos experiments.
51 | - Viewer: could only get chaos experiments. 52 | 53 | System administrators could also create their own roles for advanced permission 54 | control. 55 | 56 | ### Advantages 57 | 58 | - This solution depends on Kubernetes RBAC, which reduces a lot of logic for permission management. 59 | - System administrators could change each user's permissions according to their requirements; 60 | it's very flexible. 61 | - Using Kubernetes RBAC also restricts permissions when using `kubectl`. 62 | 63 | ## Drawbacks 64 | 65 | - Users should understand basic concepts of Kubernetes RBAC. 66 | 67 | ## Alternatives 68 | 69 | - Implement a full-featured RBAC platform inside the chaos-dashboard. 70 | It's complex and high-cost, and it is NOT cloud-native at all. 71 | 72 | ## Unresolved questions 73 | 74 | None. 75 | -------------------------------------------------------------------------------- /text/2020-11-09-chaos-mesh-workflow.md: -------------------------------------------------------------------------------- 1 | # Workflow 2 | 3 | ## Summary 4 | 5 | A built-in workflow engine with frontend support is needed for Chaos Mesh. 6 | 7 | ## Motivation 8 | 9 | Running chaos experiments both in parallel and sequentially is important for 10 | simulating production errors. Chaos Mesh is shipped with a feature-rich chaos 11 | toolbox, and composing its tools together can make it much more powerful. With a workflow 12 | engine, an experiment can turn from a single chaos into a series of chaos with 13 | some external logic (like checking health). 14 | 15 | Previously, the `duration` and `scheduler` fields have enabled a very 16 | fundamental "workflow" feature, so after implementing the workflow engine, these 17 | fields should be deprecated. 18 | 19 | ## Detailed design 20 | 21 | The specification for a `Workflow` custom resource can be summarized in the 22 | following example: 23 | 24 | ```yaml 25 | spec: 26 | entry: serial-chaos 27 | templates: 28 | - name: networkchaos 29 | type: NetworkChaos 30 | duration: "30m" 31 | selector: 32 | labelSelectors: 33 | "component": "tikv" 34 | delay: 35 | latency: "90ms" 36 | correlation: "25" 37 | jitter: "90ms" 38 | - name: iochaos 39 | type: IOChaos 40 | duration: "10m" 41 | selector: 42 | labelSelectors: 43 | app: "etcd" 44 | volumePath: /var/run/etcd 45 | path: /var/run/etcd/**/* 46 | errno: 5 47 | percent: 50 48 | - name: parallel-chaos 49 | type: Parallel 50 | tasks: 51 | - ref: networkchaos 52 | - ref: iochaos 53 | - name: check 54 | type: Task 55 | task: 56 | image: 'alpine:3.6' 57 | command: ['/app'] 58 | branch: 59 | - when: "stdout == 'continue'" 60 | ref: serial-chaos 61 | - when: "stdout == 'continue'" 62 | ref: serial-chaos 63 | - name: serial-chaos 64 | type: Serial 65 | deadline: "30m" 66 | tasks: 67 | - ref: networkchaos 68 | - ref: iochaos 69 | - ref: parallel-chaos 70 | - type: Suspend 71 | duration: "30m" 72 | - ref: check 73 | ``` 74 | 75 | ### `templates` field 76 | 77 | The `templates` field describes all the templates. One of them is the entry of the 78 | workflow (which is specified by the `entry` field). `templates` is a slice 79 | of templates, each of which is a `*Chaos`, `Parallel`, `Serial` or `Task`. They 80 | are distinguished according to the `type` field. 81 | 82 | The following document will describe all these different kinds of `template`. 83 | 84 | #### `*Chaos` 85 | 86 | They are the same as the specifications of the already defined `*Chaos` types 87 | without the `scheduler` field.
When running into this part, the workflow engine 88 | should create a `*Chaos` resource, and delete it after the duration. If it 89 | deletes the resource successfully, this step finishes and the workflow continues. 90 | 91 | The created `*Chaos` resource should be a "common chaos", as Chaos Mesh doesn't 92 | support setting a duration without a scheduler. 93 | 94 | #### `Parallel` 95 | 96 | For a `Parallel` task, the engine should spawn these tasks at the same time, and 97 | it finishes iff all the tasks have finished. The `tasks` field is a list of tasks, 98 | each of which could be a reference to a template (by name), or an inline template. 99 | 100 | You can find an example of the inline template in the `Suspend` task of 101 | `serial-chaos`. 102 | 103 | If the `deadline` field is set, all the running child tasks will be killed and 104 | turn into the `DeadlineExceeded` status. 105 | 106 | #### `Serial` 107 | 108 | For a `Serial` task, the engine should run these tasks one by one. It enters the 109 | next step iff the former one succeeded. After all these steps have finished, this 110 | task turns into the finished phase. 111 | 112 | If one of the tasks fails or exceeds the deadline, the following tasks will not 113 | run and the current task will turn into the corresponding status directly. 114 | 115 | If the `deadline` field is set, all the running child tasks will be killed and 116 | turn into the `DeadlineExceeded` status. 117 | 118 | #### `Suspend` 119 | 120 | For a `Suspend` task, the engine should do nothing. After a period of time 121 | (specified by the `duration` field), this task turns into the finished phase. 122 | 123 | #### `Task` 124 | 125 | `Task` is a special kind of template to integrate users' processes into the 126 | workflow. In this step, users can set up a container (with a Kubernetes pod) 127 | to run their own process. This step finishes once the pod turns into the 128 | `PodSucceeded` phase. The description of the task could become more complicated with 129 | more fields if needed in the future. Every branch under the `branch` field will 130 | run parallelly iff its `when` expression returns true. There will be a lot of 131 | variables provided by the engine for use in the `when` expression, such as 132 | `stdout`, `stderr`... 133 | 134 | We will provide something like a "Downward API" so that users' processes can see the 135 | status of the workflow. Users could then build any conditions based on workflow status. 136 | 137 | Combining `Parallel` and `Serial` provides powerful expressiveness in task 138 | combination. However, as it can only represent series-parallel graphs, it's still 139 | a subset of a `dag`. 140 | 141 | #### Status 142 | 143 | The status of a `Workflow` and the transitions between different statuses could be 144 | the heart of a workflow engine.
Here is an example: 145 | 146 | ```yaml 147 | status: 148 | finishedAt: null 149 | startedAt: Thu, 05 Nov 2020 14:23:25 +0800 150 | phase: Running 151 | nodes: 152 | - name: entry 153 | displayName: serial-chaos 154 | type: Serial 155 | children: 156 | - entry[0] 157 | - entry[1] 158 | startTime: Thu, 05 Nov 2020 14:23:25 +0800 159 | deadline: 50m 160 | phase: Running 161 | spec: 162 | - ref: networkchaos 163 | - ref: iochaos 164 | - ref: parallel-chaos 165 | - type: Suspend 166 | duration: "30m" 167 | - ref: check 168 | - name: entry[0] 169 | displayName: networkchaos 170 | type: NetworkChaos 171 | startTime: Thu, 05 Nov 2020 14:23:25 +0800 172 | finishedAt: Thu, 05 Nov 2020 14:33:25 +0800 173 | duration: 10m 174 | phase: Succeeded 175 | spec: 176 | selector: 177 | labelSelectors: 178 | "component": "tikv" 179 | delay: 180 | latency: "90ms" 181 | correlation: "25" 182 | jitter: "90ms" 183 | - name: entry[1] 184 | displayName: parallel-chaos 185 | type: Parallel 186 | startTime: Thu, 05 Nov 2020 14:33:25 +0800 187 | deadline: 10m 188 | phase: Running 189 | children: 190 | - entry[1][0] 191 | - entry[1][1] 192 | spec: 193 | - ref: networkchaos 194 | - ref: iochaos 195 | - name: entry[1][0] 196 | displayName: networkchaos 197 | type: NetworkChaos 198 | startTime: Thu, 05 Nov 2020 14:33:25 +0800 199 | duration: 10m 200 | phase: Running 201 | spec: 202 | selector: 203 | labelSelectors: 204 | "component": "tikv" 205 | delay: 206 | latency: "90ms" 207 | correlation: "25" 208 | jitter: "90ms" 209 | - name: entry[1][1] 210 | displayName: iochaos 211 | type: IOChaos 212 | startTime: Thu, 05 Nov 2020 14:33:25 +0800 213 | duration: 10m 214 | phase: Running 215 | spec: 216 | selector: 217 | labelSelectors: 218 | app: "etcd" 219 | volumePath: /var/run/etcd 220 | path: /var/run/etcd/**/* 221 | errno: 5 222 | percent: 50 223 | ``` 224 | 225 | The `displayName` of a node is the corresponding template name. If it's inlined, 226 | the `displayName` will be generated by its parent template (e.g. 227 | `serial-chaos[3]`). 228 | 229 | ### Cronjob 230 | 231 | It has been proved to be a bad idea to manage the running logic and the cron 232 | scheduler in the same resource. From the practice in Chaos Mesh 0.x and 1.x, 233 | managing a "twophase" scheduler is really complicated and full of bugs. As the 234 | status of a `CronWorkflow` and a normal `Workflow` could be much different, 235 | we will split them into two CRDs. Here is an example of the spec of 236 | `CronWorkflow`: 237 | 238 | ```yaml 239 | spec: 240 | cron: "*/5 * * * *" 241 | entry: networkchaos 242 | templates: 243 | - name: networkchaos 244 | type: NetworkChaos 245 | duration: "30m" 246 | selector: 247 | labelSelectors: 248 | "component": "tikv" 249 | delay: 250 | latency: "90ms" 251 | correlation: "25" 252 | jitter: "90ms" 253 | status: 254 | activeWorkflow: 255 | - default/current-cron-job-xxxx 256 | - default/current-cron-job-zzzz 257 | lastScheduleTime: Thu, 05 Nov 2020 14:33:25 +0800 258 | ``` 259 | 260 | ### Implementation references 261 | 262 | `Argo` is a really great workflow engine with a workflow definition. It is a 263 | guideline for us to design and implement this feature. 264 | 265 | #### Reconciler logic 266 | 267 | We should manage "nodes" in the status, and the reconciler should act like a state 268 | machine for these "nodes". 269 | 270 | Creating nodes and updating phases could have side effects, such as 271 | creating/deleting containers or creating/deleting Chaos Mesh resources.
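To make the state-machine idea concrete, here is a minimal Go sketch of a single node transition. The `Node`/`NodePhase` types and the `step` function are hypothetical illustrations, not the actual controller code, which would reconcile the `Workflow` status via controller-runtime:

```go
// A minimal sketch of the node state machine, with hypothetical Node and
// NodePhase types; the real reconciler would work against the Workflow
// status stored in the custom resource.
package workflow

type NodePhase string

const (
	NodePending          NodePhase = "Pending"
	NodeRunning          NodePhase = "Running"
	NodeSucceeded        NodePhase = "Succeeded"
	NodeDeadlineExceeded NodePhase = "DeadlineExceeded"
)

type Node struct {
	Name     string
	Type     string // NetworkChaos, Serial, Parallel, Suspend, Task, ...
	Phase    NodePhase
	Children []string
}

// step advances one node by a single transition and reports whether the
// status changed; the side effects mentioned above would be attached to
// exactly these transitions.
func step(n *Node, childrenFinished, deadlineExceeded bool) bool {
	switch n.Phase {
	case NodePending:
		// Side effect: create the *Chaos resource or spawn the Task pod.
		n.Phase = NodeRunning
		return true
	case NodeRunning:
		if deadlineExceeded {
			// Side effect: kill all running children.
			n.Phase = NodeDeadlineExceeded
			return true
		}
		if childrenFinished {
			// Side effect: delete the created resources.
			n.Phase = NodeSucceeded
			return true
		}
	}
	return false
}
```

Tying every side effect to exactly one phase transition helps keep the reconciler idempotent when it is re-run over the same status.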
272 | 273 | #### Structure definition 274 | 275 | The `Workflow` structure could be really complicated. Defining it in detail with 276 | fields could result in code duplicated from the existing `*Chaos`. Alternatively, we could define 277 | `Workflow` as `unstructured`, or define the `templates` as 278 | `[]k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1.JSON` and decode 279 | it with the `mapstructure` pkg. 280 | 281 | Without defining the detailed type, we cannot get basic validation from 282 | kubebuilder, so we need to write a validating webhook for `Workflow`. 283 | 284 | ## Alternatives 285 | 286 | ### Provide Argo templates 287 | 288 | Providing Argo templates seems a good solution to integrate Chaos Mesh with a 289 | powerful workflow engine. From the technological aspect, it could be the best 290 | solution. However, Chaos Mesh has the ambition to become an integrated platform 291 | for chaos engineering, which should have a workflow engine inside, out of the 292 | box. 293 | 294 | Installing Argo for the users is also not a choice, because the CRD is not 295 | namespace-scoped; installing Argo along with Chaos Mesh could break an existing 296 | Argo in the users' cluster. 297 | 298 | ## Unresolved questions 299 | 300 | None, AFAIK. 301 | -------------------------------------------------------------------------------- /text/2021-03-11-unified-selector.md: -------------------------------------------------------------------------------- 1 | # Unified Selector 2 | 3 | ## Summary 4 | 5 | The whole controller framework should do more things for the implementations of 6 | chaos. Now, every implementation of chaos selects pods by itself. However, 7 | in order to track the running status, the controller framework should know the 8 | concrete status of every selected target, and it would be a disaster to record 9 | these statuses inside the implementation of every chaos :(. A unified selector 10 | framework in the controller framework could solve this problem. This RFC will 11 | talk about the design of this unified selector. 12 | 13 | ## Motivation 14 | 15 | There have been a lot of problems with the current selector. For example, the 16 | IOChaos should select volumes, but it selects containers now (and at first, it 17 | selected pods). Some chaos use `containerName string` to select a container while 18 | some use `containerNames []string` to select several containers at one time. If 19 | we can abstract selectors into one place, these errors won't happen. The 20 | developer of each chaos will not need to consider "selecting" or these platform- 21 | dependent things. 22 | 23 | Another benefit is that it could help the controller to track the status. With 24 | it, the controller would know which target (pod/container) has been injected and 25 | which has not. It's the first step towards the goal of a standalone `Schedule` 26 | CRD. 27 | 28 | ## Detailed Design 29 | 30 | Every chaos specification would define a function to get the specifications of its 31 | selectors: 32 | 33 | ```go 34 | type StatefulObjectWithSelector interface { 35 | v1alpha1.StatefulObject 36 | 37 | GetSelectorSpecs() map[string]interface{} 38 | } 39 | ``` 40 | 41 | The method `GetSelectorSpecs` will return a map from `string` to a selector 42 | specification (like `struct {v1alpha1.PodSelectorSpec, v1alpha1.PodMode, 43 | v1alpha1.Value}`). The key is the identifier of the selector, as there 44 | will be multiple selectors in one chaos specification, for example the `.` and
The controller will iterate this map to 46 | select every `SelectSpec`. It will construct a unified selector first, and then 47 | use this selector to `Select` targets. The unified selector may contain a lot of 48 | implementation inside. The construction would be like: 49 | 50 | ```go 51 | selector := selector.New(selector.SelectorParams{ 52 | PodSelector: pod.New(r.Client, r.Reader, config.ControllerCfg.ClusterScoped, config.ControllerCfg.TargetNamespace, config.ControllerCfg.AllowedNamespaces, config.ControllerCfg.IgnoredNamespaces), 53 | ContainerSelector: container.New(r.Client, r.Reader, config.ControllerCfg.ClusterScoped, config.ControllerCfg.TargetNamespace, config.ControllerCfg.AllowedNamespaces, config.ControllerCfg.IgnoredNamespaces), 54 | }) 55 | ``` 56 | 57 | With the help of dependency injection, it would be constructed easier. The 58 | `Selector` method of `PodSelector` would be like: 59 | 60 | ```go 61 | func (impl *PodSelector) Select(ctx context.Context, ps *v1alpha1.PodSelector) ([]interface{}, error) 62 | ``` 63 | 64 | The type of second parameter will decide which selector to use. For example, if 65 | you want to select with `*v1alpha1.PodSelector`, then the unified selector will 66 | use `*PodSelector`, and if you want to select with 67 | `*v1alpha1.ContainerSelector`, then the unified selector will use 68 | `*ContainerSelector` to select. 69 | 70 | The definition of the unified selector would be: 71 | 72 | ```go 73 | func (s *Selector) Select(ctx context.Context, spec interface{}) ([]interface{}, error) 74 | ``` 75 | 76 | The controller would construct the `Selector` first, and then use the `Selector` 77 | to select the chaos targets. However, as the target may have multiple type (it 78 | could be a pod, a container, a volume or an AWS machine), we can only return an 79 | `interface{}`, and the implementation of chaos would assert the type by 80 | themselves. 81 | 82 | After selecting every `selectSpecs`, the controller will iterate over selected 83 | items, and call `Apply`/`Recover` for them. All selected items, current item and 84 | the identifier of the `selectorSpec` of current item will be passed into the 85 | `Apply`/`Recover` function. 86 | 87 | ## Alternatives 88 | 89 | This is a RFC about internal design, and there is little choice. If you have 90 | better idea, please comment. 91 | -------------------------------------------------------------------------------- /text/2021-07-23-extensible-chaosctl.md: -------------------------------------------------------------------------------- 1 | # Extensible Chaosctl 2 | 3 | ## Summary 4 | 5 | A tool to control the status of Chaos Mesh as much as possible in an extensible 6 | way. 7 | 8 | ## Motivation 9 | 10 | Currently, the chaosctl is a debug tool to collect logs and other information 11 | from several kinds of chaos. However, some of its functions are implemented by 12 | executing commands in the namespace of target pods through the chaos-daemon. 13 | 14 | For example, in the iochaos debug part, the chaosctl executes `cat /proc/mounts` 15 | [command](https://github.com/chaos-mesh/chaos-mesh/blob/4b8fb5ba1518fda0d144c8df9239dcb0381ff485/pkg/chaosctl/debug/iochaos/iochaos.go#L54) 16 | straightly in the namepsace of target pods, which is dangerous and causes 17 | develpment difficulties in the long time. 18 | 19 | To implement more features for chaosctl, we must refactor it for more security 20 | and extensibility. 
21 | 22 | ## Detailed Design 23 | 24 | For more security, functions like reading file or listing processes should be 25 | implemented in the server side. In other words, we need an API server, and make 26 | the chaosctl a pure client. 27 | 28 | So, what APIs do we needs? To implement the features that current chaosctl 29 | supports, maybe we can provide restful APIs like `/networkchaos/iptables` or 30 | `/iochaos/mounts`. However, in some cases we may need `/networkchaos/iptables` 31 | and in other cases we need `/networkchaos/ipset` or both of them, should we 32 | provide API for each of them to avoid unsued information? 33 | 34 | > I think the unsued information will reduce the debug efficiency very much. 35 | 36 | On the other hand, there are too many kinds of resources related with each 37 | other: the scheduler, the xxxchaos, the podcxxxchaos, the pod, the container and 38 | the process, how can we provide all APIs with their arrangement? 39 | 40 | > For example, how can we provide both of `/xxxchaos/podchaos` and 41 | > `/podchaos/xxxchaos` conveniently? 42 | 43 | To resolve these problems, we need structural APIs. 44 | 45 | ### Nested Resources 46 | 47 | Nested resources, also called subresources, may accomplish our goals. For 48 | example, if we register the struture of `networkchaos` resources, its 49 | subresources like `iptables` and `ipset` will be registerd automatically. We can 50 | access the networkchaos resources by path `/networkchaos` and access its 51 | subresources by path `/networkchaos/iptables` or `/networkchaos/ipset`. 52 | 53 | ```go 54 | type NetWorksChaos struct { 55 | Iptables []*IptablesRule 56 | Ipset []*IpsetRule 57 | } 58 | ``` 59 | 60 | However, there are two main drawbacks of this solution. Firstly, it can not 61 | support cascade queries, we must regard `/networkchaos` and 62 | `/networkchaos/` as different resources. Secondly, there is almost no 63 | library to conveniently build nested API servers in the ecosystem of golang. 64 | 65 | ### GraphQL 66 | 67 | The GraphQL is one of the most famous solutions for structural APIs. In above 68 | case, we can easy fetch iptables resources only by query `{ networkchaos 69 | {iptables } }`. However, the main drawback of this solution is that queries are 70 | not convenient to edit in cli, we need translation between resource paths and 71 | GraphQL queries. 72 | 73 | For example, we can translate resource path `/networkchaos/iptables` to query 74 | `{ networkchaos { iptables } }`. 75 | 76 | ### API Server 77 | 78 | We can choose one of the nested resources and GraphQL as our API solution, I 79 | prefer GraphQL as its ecosystem is better than nested resources. 80 | 81 | Moreover, server should provide additional resources like `logs`. 82 | 83 | ### Client 84 | 85 | #### Usage 86 | 87 | - identify resources 88 | 89 | We can identify target resources by path. For example, the path 90 | `/networkchaos//podchaos/` will identify the podnetworkchaos 91 | with `id=` owned by networkchaos with `id=`. If the `id` is 92 | ignore and resource type is plural, like `/networkchaoses/podchaoses`, all 93 | resources belonging to these kinds will be identified. 94 | 95 | - `show` and `desc` 96 | 97 | We provide two basic debug subcommands, `get` and `desc`. The `get` command only 98 | shows key information like `name` or `id` of resources while `desc` command 99 | shows full information. 100 | 101 | - `delete` 102 | 103 | We will provide `delete` subcommand, witch will delete identified resources. 
104 | 
105 | #### Translation
106 | 
107 | If we choose GraphQL as the API solution, we must translate paths into queries.
108 | The plural subpaths will be translated into GraphQL fragments without any
109 | parameters. For example, the resource path `/networkchaoses/iptableses` will be
110 | translated into the query `{ networkchaos { iptables } }`.
111 | 
112 | And the cascade subpaths will be translated into fragments with parameters. For
113 | example, the resource path `/networkchaos/<name>/iptables/<pod-name>` will be
114 | translated into the following query.
115 | 
116 | ```GraphQL
117 | # { "name": "<name>", "podName": "<pod-name>" }
118 | query GetIptables($name: String!, $podName: String!) {
119 |   networkchaos(name: $name) {
120 |     iptables(name: $podName)
121 |   }
122 | }
123 | ```
124 | 
125 | #### Auto-Completion
126 | 
127 | We can improve auto-completion with the schema and data. For example, when the
128 | user types `chaosctl get /network`, we can complete the command to `chaosctl get
129 | /networkchaos/` or `chaosctl get /networkchaoses` using the schema. Then, if the user
130 | chooses `chaosctl get /networkchaos/`, we can send the query
131 | `{ networkchaos { name } }` and complete the command to
132 | `chaosctl get /networkchaos/<name>` using the result.
133 | 
134 | ## Alternatives
135 | 
136 | Implement all APIs one by one as requirements arise.
137 | 
--------------------------------------------------------------------------------
/text/2021-08-11-implement-physical-machine-chaos.md:
--------------------------------------------------------------------------------
 1 | # Implement Physical Machine Chaos in Chaos Mesh
 2 | 
 3 | ## Background
 4 | 
 5 | Now we have implemented some chaos in [Chaosd](https://github.com/chaos-mesh/chaosd),
 6 | which is used to inject faults into physical machines. It's a simple command-line
 7 | tool, and it can also work as a server. It lacks a UI (just like the Chaos Mesh Dashboard)
 8 | to create and manage experiments, and it can't orchestrate experiments like
 9 | the [workflow](https://chaos-mesh.org/docs/create-chaos-mesh-workflow) does.
10 | 
11 | ## Proposal
12 | 
13 | Implement physical machine chaos in Chaos Mesh, and then we can reuse the
14 | Dashboard and workflow for it.
15 | 
16 | There are two ways to implement it:
17 | 
18 | ### 1. Treat physical machines as a new selector
19 | 
20 | In this way, we will reuse the chaos types in Chaos Mesh, and implement a new
21 | selector to choose physical machines.
22 | 
23 | For example, below is a YAML config for NetworkChaos:
24 | 
25 | ```YAML
26 | apiVersion: chaos-mesh.org/v1alpha1
27 | kind: NetworkChaos
28 | metadata:
29 |   name: network-delay
30 |   namespace: busybox
31 | spec:
32 |   action: delay
33 |   mode: all
34 |   selector:
35 |     namespaces:
36 |       - busybox
37 |   delay:
38 |     latency: "5s"
39 | ```
40 | 
41 | It will inject a network delay on all pods in the busybox namespace.
42 | 
43 | For physical machines, we can implement a new selector; the YAML config may look
44 | like:
45 | 
46 | ```YAML
47 | apiVersion: chaos-mesh.org/v1alpha1
48 | kind: NetworkChaos
49 | metadata:
50 |   name: network-delay
51 |   namespace: chaos-testing
52 | spec:
53 |   action: delay
54 |   selector:
55 |     physicalMachines:
56 |       - 123.123.123.123:123
57 |       - 124.124.124.124:124
58 |   delay:
59 |     latency: "5s"
60 | ```
61 | 
62 | We replace the selector `namespaces` with `physicalMachines`, and
63 | `123.123.123.123:123` and `124.124.124.124:124` are the addresses of Chaosd servers.
64 | 
65 | #### Advantage of proposal 1
66 | 
67 | The physical machine experiments and K8s experiments are unified; the only
68 | difference is the selector.
69 | 
70 | #### Disadvantage of proposal 1
71 | 
72 | Higher implementation costs.
73 | 
74 | * We know that Chaos Mesh is designed for K8s, and too much code is coupled
75 | to K8s.
76 | * The chaos type is related to a selector, which means each chaos has only one
77 | target type. For example, DNSChaos injects faults into some containers, and NetworkChaos
78 | injects faults into some pods. This means we need to implement the physical machine
79 | selector for every chaos.
80 | * The config and implementation of the same chaos type for K8s and physical
81 | machines are different, like DNSChaos, JVMChaos, etc. It is difficult to unify them.
82 | 
83 | ### 2. Treat chaos on the physical machine as a new chaos type
84 | 
85 | Implement physical machine chaos as a new chaos type in Chaos Mesh. Add a new
86 | CRD named `PhysicalMachineChaos`; the config includes:
87 | 
88 | * action: the subtype of `PhysicalMachineChaos`; the action can be `stress-cpu`,
89 | `stress-mem`, `network-delay`, `network-loss` and so on.
90 | * address: the addresses of the chaosd servers.
91 | * config related to the action: for example, for the `stress-cpu` action, we need to
92 | set `load` and `workers`.
93 | 
94 | Here is the sample YAML config for network delay:
95 | 
96 | ```YAML
97 | apiVersion: chaos-mesh.org/v1alpha1
98 | kind: PhysicalMachineChaos
99 | metadata:
100 |   name: physical-network-delay
101 |   namespace: chaos-testing
102 | spec:
103 |   action: network-delay
104 |   address:
105 |     - http://172.16.112.130:31767
106 |   network-delay:
107 |     device: "ens33"
108 |     hostname: "baidu.com"
109 |     duration: "5s"
110 | ```
111 | 
112 | Here is the sample YAML config for CPU stress:
113 | 
114 | ```YAML
115 | apiVersion: chaos-mesh.org/v1alpha1
116 | kind: PhysicalMachineChaos
117 | metadata:
118 |   name: physical-stress-cpu
119 |   namespace: chaos-testing
120 | spec:
121 |   action: stress-cpu
122 |   address:
123 |     - http://172.16.112.130:31767
124 |   stress-cpu:
125 |     workers: 3
126 |     load: 10
127 | ```
128 | 
129 | #### Advantage of proposal 2
130 | 
131 | * Experiments for physical machines are relatively independent, and their
132 | implementation has no effect on other chaos types.
133 | * Low development cost.
134 | 
135 | #### Disadvantage of proposal 2
136 | 
137 | We need to define a lot of chaos subtypes in `PhysicalMachineChaos`, which is a
138 | bit complicated.
139 | 
140 | ### Summary
141 | 
142 | I prefer to develop through the second option: keeping the physical machine
143 | experiments separate is more flexible.
144 | 
--------------------------------------------------------------------------------
/text/2021-09-09-physical-machine-auth.md:
--------------------------------------------------------------------------------
 1 | # Physical Machine Auth
 2 | 
 3 | ## Summary
 4 | 
 5 | A PhysicalMachineChaos-based auth solution, including end-user authorization
 6 | and service-to-service authentication.
 7 | 
 8 | ## Motivation
 9 | 
10 | Chaos Dashboard's authentication and authorization scheme is only applicable
11 | to single-cluster Kubernetes, and the Chaos Mesh platform needs to be adapted
12 | to more scenarios, such as multi-cluster Kubernetes, physical machines, cloud
13 | infrastructure, etc.
14 | 
15 | ### Goals
16 | 
17 | - end-user authorization on physical machine chaos
18 | - service-to-service authentication on physical machine chaos
19 | 
20 | ### Non-Goals
21 | 
22 | - end-user authentication on physical machine chaos (the same as other chaos
23 | types, which use a token)
24 | 
25 | ## Detailed design
26 | 
27 | ### End-user authorization
28 | 
29 | #### New Custom Resource: `PhysicalMachine`
30 | 
31 | Because there is no concept of a pod to filter the target physical machine
32 | when performing chaos injection on a physical machine, a resource needs to
33 | be introduced to represent the physical machine, which is the new custom
34 | resource `PhysicalMachine`. Here is a sample of `PhysicalMachine`:
35 | 
36 | ```yaml
37 | apiVersion: chaos-mesh.org/v1alpha1
38 | kind: PhysicalMachine
39 | metadata:
40 |   name: pm-172.16.112.130
41 |   namespace: chaos-testing
42 |   labels:
43 |     chaos-mesh/physical-machine-group: abc
44 |     kubernetes.io/arch: arm64
45 |     kubernetes.io/os: linux
46 | spec:
47 |   address: https://172.16.112.130:31767
48 | ```
49 | 
50 | #### Changes to `PhysicalMachineChaos`
51 | 
52 | Remove the `address` field in `PhysicalMachineChaos`, and use the `selector`
53 | field to filter the injection range of the experiment.
54 | 
55 | ```yaml
56 | apiVersion: chaos-mesh.org/v1alpha1
57 | kind: PhysicalMachineChaos
58 | metadata:
59 |   name: physical-network-delay
60 |   namespace: chaos-testing
61 | spec:
62 |   action: network-delay
63 |   network-delay:
64 |     device: "ens33"
65 |     hostname: "baidu.com"
66 |     duration: "5s"
67 |   selector:
68 |     namespaces:
69 |       - testA
70 |     labelSelectors:
71 |       chaos-mesh/physical-machine-group: abc
72 | ```
73 | 
74 | #### RBAC
75 | 
76 | According to the above design, there is no need to change the existing
77 | `Role`/`ClusterRole` content, because access to `PhysicalMachine` is
78 | included in the api group of `chaos-mesh.org`.
79 | 
80 | We have two different ranges of access rights:
81 | 
82 | - clustered scope: permissions for `PhysicalMachines` in all namespaces,
83 | so you can experiment with all physical machines.
84 | - namespaced scope: permissions for the `PhysicalMachines` in the current
85 | namespace, so you can experiment with multiple physical machines in that
86 | namespace.
87 | 
88 | ### service-to-service authentication
89 | 
90 | To secure the service-to-service communication, establish an mTLS connection
91 | between chaos-controller-manager and chaosd.
92 | 
93 | The following certificates are required:
94 | 
95 | - CA certificate: generated when deploying `Chaos Mesh` with `helm` or
96 | `install.sh`, and saved in the secret named
97 | `chaos-mesh-controller-manager-client-certs` (and on each physical machine)
98 | - certificate of the chaos-controller-manager side: generated when deploying
99 | `Chaos Mesh` with `helm` or `install.sh`, and saved in the secret named
100 | `chaos-mesh-controller-manager-client-certs`
101 | - certificate of the chaosd side: generated automatically or manually when adding
102 | physical machine information to the cluster, saved on each physical machine
103 | 
104 | #### Use Cases
105 | 
106 | Here is a description of three different usage scenarios.
107 | 
108 | #### Case 1: Automatically generate certificates
109 | 
110 | Prerequisites:
111 | 
112 | 1. User deployed `Chaos Mesh` in a Kubernetes cluster with security mode
113 | 1. The node executing the `chaosctl` command can ssh to
114 | the target physical machine
115 | 1. 
The node executing the `chaosctl` command can access the Kubernetes cluster
116 | 
117 | Steps:
118 | 
119 | 1. User prepares the physical machine using `chaosctl`; the command might be
120 | `chaosctl physical-machine init --server=127.0.0.1 --port=2333`. In this step,
121 | `chaosctl` generates the certificates on the physical machine side, copies
122 | all the required certificates to the target physical machine, and then creates the
123 | `PhysicalMachine` CR in the Kubernetes cluster (BTW, `chaosctl pm` can be used
124 | instead of `chaosctl physical-machine`)
125 | 1. User starts the chaosd service on the physical machine
126 | 1. User creates a physical machine experiment on the dashboard
127 | 1. Chaos-controller-manager establishes an mTLS connection when requesting
128 | the chaosd service on the physical machine
129 | 
130 | #### Case 2: Manually generate certificates
131 | 
132 | Prerequisites:
133 | 
134 | 1. User deployed `Chaos Mesh` in a Kubernetes cluster with security mode
135 | 1. The node executing the `chaosctl physical-machine create` command can
136 | access the Kubernetes cluster
137 | 
138 | Steps:
139 | 
140 | 1. User copies the CA certificate from the Kubernetes cluster
141 | to the physical machine
142 | 1. User uses `chaosctl` to generate the certificates on the physical machine;
143 | the command might be `chaosctl physical-machine generate`
144 | 1. User uses `chaosctl` to create the `PhysicalMachine` resource in the Kubernetes
145 | cluster; the command might be
146 | `chaosctl physical-machine create --server=127.0.0.1 --port=2333`
147 | 1. User starts the chaosd service on the physical machine
148 | 1. User creates a physical machine experiment on the dashboard
149 | 1. Chaos-controller-manager establishes an mTLS connection when requesting
150 | the chaosd service on the physical machine
151 | 
152 | #### Case 3: Without mTLS authentication (Not recommended)
153 | 
154 | Prerequisites:
155 | 
156 | 1. User deployed `Chaos Mesh` in a Kubernetes cluster without security mode
157 | 
158 | Steps:
159 | 
160 | 1. User uses `chaosctl` to create the `PhysicalMachine` resource in the Kubernetes
161 | cluster; the command might be
162 | `chaosctl physical-machine create --server=127.0.0.1 --port=2333 --protocol=http`
163 | 1. User starts the chaosd service on the physical machine
164 | 1. User creates a physical machine experiment on the dashboard
165 | 1. Chaos-controller-manager will use HTTP to request the chaosd service
166 | 
167 | ## Drawbacks
168 | 
169 | Because `PhysicalMachine` is a namespaced resource, users can create
170 | the same physical machine information in different namespaces, which may
171 | cause duplicate injection problems, and it requires the chaosd service
172 | API to be idempotent.
173 | 
174 | ## Alternatives
175 | 
176 | NA
177 | 
178 | ## Unresolved questions
179 | 
180 | ### No support for a role to access multiple specified `PhysicalMachines`
181 | 
182 | One option considered is to directly use the `resourceNames` field in
183 | Kubernetes RBAC to control access by specifying the `name` field of the
184 | PhysicalMachine resource. However, in practice, we found that `resourceNames`
185 | only supports the `GET` and `DELETE` APIs, not the `LIST`, `WATCH`, `CREATE` and
186 | `DELETECOLLECTION` APIs, which makes it impossible for users to select the
187 | physical machines they want when operating on the dashboard. If there is a
188 | practical need, we may use `OPA`, `gatekeeper` and other policy frameworks to
189 | control the fine-grained permissions more easily.
190 | 
--------------------------------------------------------------------------------
/text/2021-09-27-refine-error-handling.md:
--------------------------------------------------------------------------------
 1 | # Refine Error Handling
 2 | 
 3 | ## Background
 4 | 
 5 | There are a lot of different error handling patterns in Chaos Mesh. Some of the
 6 | errors in the code are equipped with a backtrace, while some are not. The mess
 7 | in error handling has been blocking us from diagnosing problems and handling
 8 | potential errors. The Chaos Mesh code base imports `github.com/pingcap/errors`,
 9 | `github.com/pkg/errors` and the `errors` package in the standard library.
10 | 
11 | ## Proposal
12 | 
13 | Most of the error handling patterns follow the [uber
14 | go-style](https://github.com/uber-go/guide/blob/master/style.md) guide; some rules
15 | are modified or removed because of the latest updates of `pkg/errors` and
16 | `errors` in the standard library.
17 | 
18 | ### Error Types
19 | 
20 | When returning errors, consider the following to determine the best choice:
21 | 
22 | 1. Does the client need to extract the information in the error? If so, you should
23 | use a custom type and implement the `Error()` method. You should also wrap it
24 | with `errors.WithStack` to make sure it has a backtrace.
25 | 
26 | Example:
27 | 
28 | ```go
29 | type ErrNotFound struct {
30 | 	File string
31 | }
32 | 
33 | func open(file string) error {
34 | 	return errors.WithStack(ErrNotFound{File: file})
35 | }
36 | ```
37 | 
38 | Then the user will be able to use `errors.As` to extract the `ErrNotFound`
39 | error and get the `File` field. However, you should think **carefully** about
40 | whether this field is necessary for the caller to handle the error. In the former
41 | example, including the `File` in the error is really a **bad** case:
42 | 
43 | 1. The caller actually knows the file, so it doesn't provide more information.
44 | 2. The caller doesn't need to know the file to handle the error in most cases.
45 | 
46 | In this case, a simple error variable with a wrap is preferred, e.g.
47 | 
48 | ```go
49 | var ErrNotFound = errors.New("not found")
50 | 
51 | func open(file string) error {
52 | 	return errors.Wrapf(ErrNotFound, "open file %s", file)
53 | }
54 | ```
55 | 
56 | 2. Is it an error which needs to be detected? If so, you should use a global
57 | public variable to store the error, and wrap it with `errors.WithStack` when
58 | returning the error.
59 | 
60 | Example:
61 | 
62 | ```go
63 | var (
64 | 	ErrPodNotFound = errors.New("pod not found")
65 | 
66 | 	ErrPodNotRunning = errors.New("pod not running")
67 | )
68 | 
69 | func handle() error {
70 | 	return errors.WithStack(ErrPodNotFound)
71 | }
72 | ```
73 | 
74 | 3. A combination of rules 1 and 2. The context information is needed by the
75 | caller, but it's also widely used by multiple functions. The variable name
76 | could be more detailed than the type.
77 | 
78 | Example:
79 | 
80 | ```go
81 | type ContainerRuntimeClientConnectErr struct {
82 | 	ContainerRuntime string
83 | }
84 | 
85 | var DockerConnectErr = ContainerRuntimeClientConnectErr{
86 | 	ContainerRuntime: "docker",
87 | }
88 | ```
89 | 
90 | 4. Is this an error which will not be detected but appears in a lot of
91 | functions? If so, you should use a global private variable, and wrap it with
92 | `errors.WithStack`. If you want to share the same error (like `Not Found`) in
93 | multiple packages, but the callers will never need to detect the `Not Found`,
94 | please create a new `notFound` error under every package.
95 | 
96 | ```go
97 | var errNotFound = errors.New("not found")
98 | ```
99 | 
100 | 5. Is this really a simple error that will not appear in other functions? If so,
101 | you should use an inline `errors.New`. The `errors.New` in `pkg/errors` is
102 | already equipped with a stack backtrace, so you don't need to add it again.
103 | 
104 | Example:
105 | 
106 | ```go
107 | func open(file string) error {
108 | 	return errors.New("not found")
109 | }
110 | ```
111 | 
112 | 6. Are you propagating an error returned by other functions? If it's returned by
113 | a function inside Chaos Mesh, we can assume this error is already
114 | equipped with a stack backtrace, so there is no need to call
115 | `errors.WithStack`. However, if it's returned by another library, we should
116 | call `errors.WithStack` to equip it with a stack backtrace. For more
117 | information, see the section on error wrapping.
118 | 
119 | ```go
120 | func startProcess(cmd *exec.Cmd) error {
121 | 	err := cmd.Start()
122 | 	if err != nil {
123 | 		return errors.WithStack(err)
124 | 	}
125 | 
126 | 	return nil
127 | }
128 | ```
129 | 
130 | ### Error Wrapping
131 | 
132 | * Return the original error if there is no additional context to add. If the
133 | original error is not equipped with a stack, return it with `errors.WithStack`.
134 | * Add context using
135 | [`"pkg/errors".Wrap`](https://pkg.go.dev/github.com/pkg/errors#Wrap) so that
136 | the error message provides more context.
137 | 
138 | ```go
139 | var ErrNotFound = errors.New("not found")
140 | 
141 | func open(file string) error {
142 | 	return errors.Wrapf(ErrNotFound, "open file %s", file)
143 | }
144 | ```
145 | 
146 | * Use [`"pkg/errors".Errorf`](https://pkg.go.dev/github.com/pkg/errors#Errorf)
147 | if the callers do not need to detect or handle that specific error case.
148 | 
149 | The [`"pkg/errors".Wrap`](https://pkg.go.dev/github.com/pkg/errors#Wrap) and
150 | [`"pkg/errors".Errorf`](https://pkg.go.dev/github.com/pkg/errors#Errorf) will
151 | add a stack trace, so you don't need to wrap it with `WithStack` again.
152 | 
153 | The context usually includes: what you are doing, and the object of the operation
154 | (like the pod name, the chaos name...).
155 | 
156 | ```go
157 | func startProcess(cmd *exec.Cmd) error {
158 | 	err := cmd.Start()
159 | 	if err != nil {
160 | 		return errors.Errorf("start process: %w", err)
161 | 	}
162 | 
163 | 	return nil
164 | }
165 | ```
166 | 
167 | A more realistic example in `chaos-daemon` is:
168 | 
169 | ```go
170 | func (s *DaemonServer) ExecStressors(ctx context.Context,
171 | 	req *pb.ExecStressRequest) (*pb.ExecStressResponse, error) {
172 | 	...
173 | 
174 | 	control, err := cgroups.Load(daemonCgroups.V1, daemonCgroups.PidPath(int(pid)))
175 | 	if err != nil {
176 | 		return nil, errors.Wrapf(err, "load cgroup of pid %d", pid)
177 | 	}
178 | 
179 | 	...
180 | }
181 | ```
182 | 
183 | When adding context to returned errors, keep the context succinct by avoiding
184 | phrases like "failed to", which state the obvious and pile up as the error
185 | percolates up through the stack:
186 | 
187 | <table>
188 | <thead><tr><th>Bad</th><th>Good</th></tr></thead>
189 | <tbody>
190 | <tr><td>
191 | 
192 | ```go
193 | s, err := store.New()
194 | if err != nil {
195 | 	return errors.Errorf(
196 | 		"failed to create new store: %w", err)
197 | }
198 | ```
199 | 
200 | </td><td>
201 | 
202 | ```go
203 | s, err := store.New()
204 | if err != nil {
205 | 	return errors.Errorf(
206 | 		"new store: %w", err)
207 | }
208 | ```
209 | 
210 | </td></tr>
211 | </tbody>
212 | </table>
213 | 
214 | ### Error Handling
215 | 
216 | The key to error handling is to identify whether an error is an expected one.
217 | For example, if Chaos Mesh is recovering an injection, but Kubernetes
218 | returns `PodNotFound`, we need to "handle" it by just ignoring it.
219 | 
220 | Identifying errors can involve two different situations: if the target error is a
221 | type, you can use `errors.As` to extract the error information, and if the
222 | target error is a variable (created by `errors.New`, in most situations), you
223 | can use `errors.Is`.
224 | 
225 | The `error` type in Go is really messy: a type and a variable can both
226 | represent a kind of error, and you should treat them with different patterns. If it's a
227 | customized error type, it can be verified through `errors.As`.
228 | 
229 | Example:
230 | 
231 | ```go
232 | type ContainerRuntimeClientConnectErr struct {
233 | 	ContainerRuntime string
234 | }
235 | 
236 | var crccErr ContainerRuntimeClientConnectErr
237 | if errors.As(err, &crccErr) {
238 | 	fmt.Println(crccErr.ContainerRuntime)
239 | }
240 | ```
241 | 
242 | If this kind of error is a variable, it can be checked through `errors.Is`:
243 | 
244 | Example:
245 | 
246 | ```go
247 | var DockerConnectErr = ContainerRuntimeClientConnectErr{
248 | 	ContainerRuntime: "docker",
249 | }
250 | 
251 | if errors.Is(err, DockerConnectErr) {
252 | 	fmt.Println("Docker Connect Error!")
253 | }
254 | ```
255 | 
256 | If this kind of error is not exported (e.g. an inline error or an unexported
257 | variable/type), and you really need to detect it, please modify the callee
258 | function to export the error.
259 | 
260 | #### Handling Error from Kubernetes
261 | 
262 | The error from the Kubernetes client is usually derived from an HTTP status. The
263 | only way to identify them is to use `k8sError.Is*`, e.g. `k8sError.IsNotFound`.
264 | 
265 | Wrapping an error from Kubernetes is fine, as the `k8sError.Is*` will finally
266 | use `errors.As` to extract the information from the error.
267 | 
268 | #### Handling Error from Grpc
269 | 
270 | The error returned from a gRPC call is also too simple, and loses the
271 | hierarchical error stack. The only way to identify an error returned from a gRPC
272 | call is to use the `IsXXX` functions provided by the callee. For example:
273 | 
274 | ```go
275 | _, err = pbClient.RecoverTimeOffset(ctx, &pb.TimeRequest{
276 | 	ContainerId: containerId,
277 | })
278 | if err != nil {
279 | 	if chaosdaemonerr.IsContainerNotFound(err) {
280 | 		// ignore the container not found error
281 | 		return v1alpha1.NotInjected, nil
282 | 	}
283 | 	return v1alpha1.Injected, errors.WithStack(err)
284 | }
285 | ```
286 | 
287 | Wrapping an error from gRPC is also fine; the `IsContainerNotFound` function
288 | should handle the situation where this error is wrapped multiple times.
289 | 
290 | ### The End of Error
291 | 
292 | All errors will finally be consumed by something. In this section, we will
293 | discuss what you should do when an error is not fully handled (fully handling
294 | all errors is not possible in most situations, as you don't know the full set
295 | of error kinds in Go).
296 | 
297 | #### Before the logger is set up
298 | 
299 | 1. If the logger hasn't been set up and the error is not bearable, you should call
300 | `log.Fatal(err)`.
301 | 2. If the logger hasn't been set up and the error is acceptable, you should print
302 | it with `fmt.Printf("%v", err)`, as sketched below.
303 | 
304 | #### Inside a reconciler
305 | 
306 | Use `r.Log.Error` to print the error, with some context, and use
307 | `r.Recorder.Event` to send a Kubernetes event if needed. The `recorder.Failed` is
308 | a simple representation of an error event.
309 | 
310 | #### Inside a grpc function implementation
311 | 
312 | Make sure the error is printed in the log, with the stack information (which is
313 | the default behavior for the zapr error log).
314 | 
315 | 1. If the error doesn't need to be identified by the client, you can simply return
316 | it (and it will become an `Unknown Error` with the `err.Error()` as the message).
317 | 2. If the error should be detected by the client, you should pick a status
318 | code, and return `"grpc/status".Error(code, message)`. The message should be
319 | used to distinguish this error from others. You should also provide an
320 | `IsXXXError` function in a standalone error package to assert the error.
321 | 
322 | A template implementation for `IsXXX` could be:
323 | 
324 | ```go
325 | package error
326 | 
327 | import (
328 | 	"github.com/pkg/errors"
329 | 
330 | 	"google.golang.org/grpc/codes"
331 | 	"google.golang.org/grpc/status"
332 | )
333 | 
334 | type StatError interface {
335 | 	error
336 | 	GRPCStatus() *status.Status
337 | }
338 | 
339 | func IsNotFound(err error) bool {
340 | 	if grpcError := StatError(nil); errors.As(err, &grpcError) {
341 | 		st := grpcError.GRPCStatus()
342 | 		return st.Code() == codes.NotFound && st.Message() == ErrNotFoundMsg
343 | 	}
344 | 	return false
345 | }
346 | ```
347 | 
348 | Adding a customized interface `StatError` is required, as it's not provided by
349 | the `grpc/status` package. Here is a feature request
350 | [issue](https://github.com/grpc/grpc-go/issues/2934#issuecomment-624749630) for
351 | this.
352 | 
353 | It would also be suggested to define a variable to represent the error:
354 | 
355 | ```go
356 | var ErrNotFound = status.Error(codes.NotFound, ErrNotFoundMsg)
357 | ```
358 | 
359 | However, inside the chaos-daemon, the error is still passed around like a normal Go
360 | error, which means you need to do some conversion at the **end** of the execution.
361 | For example:
362 | 
363 | ```go
364 | // Error generation
365 | package crclients
366 | 
367 | var ContainerNotFound = errors.New("container not found")
368 | 
369 | func ContainerKill(containerId string) error {
370 | 	return errors.Wrapf(ContainerNotFound, "not found id: %s", containerId)
371 | }
372 | 
373 | // Convert the error into one which is suitable for grpc
374 | package errors
375 | 
376 | type StatError interface {
377 | 	error
378 | 	GRPCStatus() *status.Status
379 | }
380 | 
381 | var ErrContainerNotFound = status.Error(codes.NotFound, crclients.ContainerNotFound.Error())
382 | 
383 | func IsContainerNotFound(err error) bool {
384 | 	if grpcError := StatError(nil); errors.As(err, &grpcError) {
385 | 		st := grpcError.GRPCStatus()
386 | 		return st.Code() == codes.NotFound && st.Message() == crclients.ContainerNotFound.Error()
387 | 	}
388 | 	return false
389 | }
390 | 
391 | package chaosdaemon
392 | 
393 | func (s *DaemonServer) ContainerKill(ctx context.Context, req *pb.ContainerRequest) (*empty.Empty, error) {
394 | 	err := ContainerKill(req.ContainerId)
395 | 
396 | 	if errors.Is(err, crclients.ContainerNotFound) {
397 | 		// this log is necessary to keep the stack trace, as the stack trace and all additional information
398 | 		// will be lost when returning to the grpc caller.
399 | 		log.Error(err, "kill container")
400 | 		return nil, chaosdaemonErr.ErrContainerNotFound
401 | 	}
402 | }
403 | ```
404 | 
405 | Then the client ("chaos-controller-manager") can check this error with the help
406 | of `IsContainerNotFound`:
407 | 
408 | ```go
409 | _, err := pbClient.ContainerKill(ctx, &pb.ContainerRequest{
410 | 	Action: &pb.ContainerAction{
411 | 		Action: pb.ContainerAction_KILL,
412 | 	},
413 | 	ContainerId: containerId,
414 | })
415 | 
416 | if chaosdaemonErr.IsContainerNotFound(err) {
417 | 	fmt.Println("It's container not found!")
418 | }
419 | ```
420 | 
421 | It's inconvenient to extract any information from the error, as it's just a
422 | string (and an error code), which is only enough to assert the "kind". However,
423 | it's enough for the current Chaos Mesh codebase (though I don't know whether we will
424 | need to find ways to extract information in the future).
425 | 
426 | #### Inside the dashboard apiserver
427 | 
428 | All error types have been defined in
429 | `/pkg/dashboard/apiserver/utils/error.go`, each with a comment for its status code.
430 | If you need to return an error, please choose one of them, and send it
431 | through
432 | `"github.com/chaos-mesh/chaos-mesh/pkg/dashboard/apiserver/utils".SetAPIError`.
433 | For example:
434 | 
435 | ```go
436 | utils.SetAPIError(c, utils.ErrInternalServer.WrapWithNoMessage(err))
437 | ```
438 | 
439 | The `c` is the `gin.Context`. After setting the error, the request will return
440 | a JSON error message:
441 | 
442 | ```json
443 | {
444 |   "code": code,
445 |   "type": typeName,
446 |   "message": err.Error(),
447 |   "full_text": fmt.Sprintf("%+v", err)
448 | }
449 | ```
450 | 
451 | ### Print
452 | 
453 | If you are returning the error to the parent function, it's suggested not to print
454 | it out, as it will be printed by one of its ancestors.
455 | 
456 | #### Console
457 | 
458 | If the logger has been set up, you can use `log.Error` to print the error. The
459 | `zapr` implementation of `logr` will add `errorVerbose` and `errorCause` fields
460 | to represent the full information of the error (with `%+v`).
461 | 
462 | If the logger hasn't been set up, you can use `fmt.Printf("%+v", err)` to print
463 | the error, possibly with additional context.
464 | 
465 | #### Kubernetes Events
466 | 
467 | All Kubernetes events should be sent through `recorder.Event`, to make sure all
468 | information can be extracted from the event attributes. The message of Kubernetes
469 | events shouldn't be long, so please do not add the full error stack.
470 | 
471 | ## Alternatives
472 | 
473 | There are tons of error handling styles. Feel free to propose your opinion
474 | through the comments, and let us discuss and decide the most suitable one for
475 | Chaos Mesh.
476 | 
--------------------------------------------------------------------------------
/text/2021-10-08-monitoring-metrics-about-chaos-mesh.md:
--------------------------------------------------------------------------------
 1 | # Monitoring Metrics about Chaos Mesh
 2 | 
 3 | ## Summary
 4 | 
 5 | More metrics are needed to improve the observability of `chaos-controller-manager`,
 6 | `chaos-daemon` and `chaos-dashboard`. We already have several metrics in
 7 | `chaos-controller-manager` and `chaos-daemon`, but they are not enough.
 8 | 
 9 | ## Motivation
10 | 
11 | At present, we only collect a few metrics: some about webhooks like
12 | `chaos_mesh_injections_total`, chaos experiment information in
13 | `chaos-controller-manager`, and HTTP and gRPC metrics in `chaos-daemon`.
14 | These metrics can hardly reflect the overall state of the Chaos Mesh system,
15 | so we need more metrics to improve the observability.
16 | 
17 | According to the proposal https://github.com/chaos-mesh/chaos-mesh/issues/2198,
18 | below are several metrics about logic patterns and performance that should be
19 | implemented and exposed on the `/metrics` HTTP endpoint:
20 | 
21 | `chaos-controller-manager`:
22 | 
23 | - time histogram for each `Reconcile()` in `Reconciler` (already provided by `controller-runtime`)
24 | - count of Chaos Experiments, Schedules, and Workflows
25 | - count of events emitted by `chaos-controller-manager`
26 | - common metrics of the gRPC client
27 | - metrics for the Kubernetes webhook (already provided by `controller-runtime`)
28 | 
29 | `chaos-daemon`:
30 | 
31 | - common metrics of the gRPC server (already provided)
32 | - count of processes controlled by `bpm` (background process manager)
33 | - count of iptables/ipset/tc rules
34 | 
35 | `chaos-dashboard`:
36 | 
37 | - time histogram for each HTTP query
38 | - count of archived objects
39 | - time histogram for the archive reconciler (already provided by `controller-runtime`)
40 | 
41 | ## Detailed design
42 | 
43 | ### Metrics Plan
44 | 
45 | The metrics design is as follows. Here are a few things that should be noted:
46 | 
47 | - The implementations of `chaos_controller_manager_chaos_experiments` and
48 | `chaos_daemon_grpc_server_handling_seconds` have already been provided
49 | (as `chaos_mesh_experiments` and `grpc_server_handling_seconds`); we just
50 | hope to modify the names to standardize the naming. For more about naming, see this
51 | guideline: [Metric and label naming](https://prometheus.io/docs/practices/naming/).
52 | - Time histogram metrics for each `Reconcile()` in `Reconciler` have been provided
53 | by `controller-runtime` as `controller_runtime_reconcile_time_seconds`.
54 | - Metrics for the Kubernetes webhook have also been provided by `controller-runtime`.
55 | They are called `controller_runtime_webhook_latency_seconds`, `controller_runtime_webhook_requests_total`,
56 | and `controller_runtime_webhook_requests_in_flight`.
57 | 
58 | 
59 | 
60 | | Name | Description | Type | Label | Buckets |
61 | | ---- | ----------- | ---- | ----- | ------- |
62 | | chaos_controller_manager_chaos_experiments | Total number of chaos experiments and their phases | GaugeVec | namespace, kind, phase | / |
63 | | chaos_controller_manager_chaos_schedules | Total number of chaos schedules | GaugeVec | namespace | / |
64 | | chaos_controller_manager_chaos_workflows | Total number of chaos workflows | GaugeVec | namespace | / |
65 | | chaos_controller_manager_emitted_event_total | Total number of events emitted by chaos-controller-manager | CounterVec | type, reason, namespace | / |
66 | | chaos_controller_manager_grpc_client_handling_seconds | Common metrics of the gRPC client | HistogramVec | grpc_code, grpc_method, grpc_service, grpc_type | DefBuckets |
67 | | chaos_daemon_grpc_server_handling_seconds | Common metrics of the gRPC server | HistogramVec | grpc_code, grpc_method, grpc_service, grpc_type | ChaosDaemonGrpcServerBuckets |
68 | | chaos_daemon_bpm_controlled_process_total | Total count of bpm controlled processes | Counter | / | / |
69 | | chaos_daemon_bpm_controlled_processes | Current number of bpm controlled processes | Gauge | / | / |
70 | | chaos_daemon_iptables_packets | Total number of iptables packets | GaugeVec | namespace, pod, container, table, chain, policy, rule | / |
71 | | chaos_daemon_iptables_packet_bytes | Total bytes of iptables packets | GaugeVec | namespace, pod, container, table, chain, policy, rule | / |
72 | | chaos_daemon_ipset_members | Total number of ipset members | GaugeVec | namespace, pod, container | / |
73 | | chaos_daemon_tcs | Total number of tc rules | GaugeVec | namespace, pod, container | / |
74 | | chaos_dashboard_http_request_duration_seconds | Time histogram for each HTTP query | HistogramVec | path, method, status | DefBuckets |
75 | | chaos_dashboard_archived_experiments | Total number of archived chaos experiments | GaugeVec | namespace, type | / |
76 | | chaos_dashboard_archived_schedules | Total number of archived chaos schedules | GaugeVec | namespace | / |
77 | | chaos_dashboard_archived_workflows | Total number of archived chaos workflows | GaugeVec | namespace | / |
78 | 
79 | 
80 | 
81 | The current design of the buckets is shown in the table below. The distribution of
82 | these data needs to be obtained later in order to adjust the values so that the
83 | number of samples in each bucket is similar.
84 | 
85 | 
86 | 
87 | | Buckets Name | Value | Description |
88 | | ------------ | ----- | ----------- |
89 | | DefBuckets | `[]float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}` | default prometheus buckets |
90 | | ChaosDaemonGrpcServerBuckets | `[]float64{0.001, 0.01, 0.1, 0.3, 0.6, 1, 3, 6, 10}` | the bucket settings have already been implemented; just set constants for clarity |
91 | 
92 | 
93 | ### Collecting Plan
94 | 
95 | This section introduces how metrics are collected in complex scenarios. Common
96 | collection methods such as pull mode can be found in `controllers/metrics/metrics.go`.
97 | 
98 | #### gRPC
99 | 
100 | The implementation of `chaos_controller_manager_grpc_client_handling_seconds`
101 | and `chaos_daemon_grpc_server_handling_seconds` will be provided by
102 | [go-grpc-prometheus](https://github.com/grpc-ecosystem/go-grpc-prometheus).
It should be noted that
103 | the metric name needs to be replaced in the original implementation of `chaos-daemon`.
104 | 
105 | ```go
106 | // pkg/chaosdaemon/server.go
107 | func newGRPCServer(reg prometheus.Registerer, ...) (*grpc.Server, error) {
108 | 	withHistogramName := func(name string) func(opts *prometheus.HistogramOpts) {
109 | 		return func(opts *prometheus.HistogramOpts) {
110 | 			opts.Name = name
111 | 		}
112 | 	}
113 | 
114 | 	grpcMetrics := grpc_prometheus.NewServerMetrics()
115 | 	grpcMetrics.EnableHandlingTimeHistogram(
116 | 		// here we set the customized ChaosDaemonGrpcServerBuckets and a customized histogram name
117 | 		grpc_prometheus.WithHistogramBuckets(ChaosDaemonGrpcServerBuckets),
118 | 		withHistogramName("chaos_daemon_grpc_server_handling_seconds"),
119 | 	)
120 | 	reg.MustRegister(grpcMetrics)
121 | 
122 | 	// ...
123 | }
124 | ```
125 | 
126 | For the implementation of `chaos_controller_manager_grpc_client_handling_seconds`,
127 | add an option function to `GrpcBuilder` in `pkg/grpc/utils.go` and register
128 | `grpc_prometheus.DefaultClientMetrics` for `controllermetrics.Registry`:
129 | 
130 | ```go
131 | // pkg/grpc/utils.go
132 | func (it *GrpcBuilder) WithGrpcMetricsCollection() *GrpcBuilder {
133 | 	it.options = append(it.options,
134 | 		grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
135 | 		grpc.WithStreamInterceptor(grpc_prometheus.StreamClientInterceptor),
136 | 	)
137 | 	return it
138 | }
139 | 
140 | // cmd/chaos-controller-manager/main.go
141 | func Run(params RunParams) error {
142 | 	// ...
143 | 	// register grpc_prometheus client metrics
144 | 	controllermetrics.Registry.MustRegister(grpc_prometheus.DefaultClientMetrics)
145 | 	// ...
146 | }
147 | ```
148 | 
149 | #### BPM
150 | 
151 | To collect `chaos_daemon_bpm_controlled_processes`, we observe that in BPM each
152 | identifier corresponds to a process, so we can take the number of identifiers
153 | in BPM as the metric value.
154 | 
155 | #### iptables / ipset / tc
156 | 
157 | For metrics such as `chaos_daemon_iptables_packets`, we should enter the
158 | container network namespace to collect them, e.g.
159 | `/usr/bin/nsenter -n/proc/%d/ns/net -- iptables-save`. In order to collect these
160 | metrics, first we need to list the PIDs of all containers, then run commands
161 | such as `iptables-save -c` using BPM, and parse the output to obtain the information.
162 | 
163 | 1. First, `crclients.ContainerRuntimeInfoClient` needs to provide a new
164 | interface `ListContainerPIDs() []uint32` for each runtime:
165 |    - Docker: call `ContainerAPIClient.ContainerList` to get the PIDs directly.
167 |    - containerd: call `Client.Containers` to get the PIDs directly.
169 |    - CRI-O: call the gRPC interface `ListContainers` to obtain the container IDs,
170 | and then get the PIDs via `GetPidFromContainerID`, which is already
171 | implemented.
172 | 
173 | 2. Then, we need the pod name and container name for each container PID, so
174 | `GetLabelsFromContainerID` should be provided by `crclients.ContainerRuntimeInfoClient`,
175 | similarly to the above.
176 | 3. Run the command and parse the output:
177 |    - For iptables metrics: use [iptables_exporter](https://github.com/retailnext/iptables_exporter/blob/master/iptables/parser.go)
178 | to parse the result of `iptables-save -c` to get the number of chains and
179 | rules, packet number, and byte number.
180 |    - For ipset metrics: parse `ipset list` to obtain the name and members of each
181 | set in the ipset.
182 |    - For tc metrics: parse `tc qdisc` to get the number of rules.
--------------------------------------------------------------------------------
/text/2021-11-17-ui-monorepo.md:
--------------------------------------------------------------------------------
 1 | # Scalable codebase in UI via monorepo
 2 | 
 3 | ## Summary
 4 | 
 5 | Use [Yarn Workspaces](https://classic.yarnpkg.com/lang/en/docs/workspaces/)
 6 | to easily extend the front-end codebase.
 7 | 
 8 | ## Motivation
 9 | 
10 | > Why are we doing this?
11 | 
12 | As the project grows, so does the number of features that need to be supported
13 | on the front-end. But the current architecture of the front-end is a single repo,
14 | which means all dependencies are intertwined. With this system, isolating applications
15 | and dependencies is difficult. You can only put the common logic under a certain
16 | folder, like `lib` or `components`; it's not possible to put the logic outside the
17 | `src` folder.
18 | 
19 | So we decided to use a monorepo to separate the front-end codebase.
20 | 
21 | > What use cases does it support?
22 | 
23 | In the future, the front-end plans to generate
24 | the corresponding TypeScript definitions from CRDs. Code like this is
25 | application-agnostic, and putting it together with the original code can be a huge
26 | hindrance to testing and management.
27 | 
28 | > What is the expected outcome?
29 | 
30 | Before the monorepo, the codebase looks like this:
31 | 
32 | ```shell
33 | ui/
34 | ├── src/
35 | │   ├── api/
36 | │   ├── components/
37 | │   ├── components-mui/
38 | │   ├── lib/
39 | │   ├── pages/
40 | ├── package.json
41 | ```
42 | 
43 | After using the monorepo:
44 | 
45 | ```shell
46 | ui/
47 | ├── app/ # original src
48 | │   ├── api/
49 | │   ├── components/
50 | │   ├── lib/
51 | │   ├── pages/
52 | ├── packages/
53 | │   ├── mui-extends/ # components-mui
54 | │   ├── crd/ # generate ts code from CRDs
55 | ├── package.json
56 | ```
57 | 
58 | ## Detailed design
59 | 
60 | First, we need to treat the current `src` folder as a package; we decided to
61 | rename it to `app` and move it to `ui/app`.
62 | 
63 | Then extract the shared tooling, like `husky` and `lint-staged`, which we will use
64 | at the root:
65 | 
66 | ```shell
67 | yarn add husky lint-staged prettier prettier-plugin-import-sort import-sort-style-eslint -DW # --dev --ignore-workspace-root-check
68 | ```
69 | 
70 | Then add the below into `ui/package.json`:
71 | 
72 | ```json
73 | {
74 |   "workspaces": [
75 |     "app",
76 |     "packages/*"
77 |   ]
78 | }
79 | ```
80 | 
81 | Finally, we are ready to create packages:
82 | 
83 | ```shell
84 | mkdir -p packages/mui-extends
85 | cd packages/mui-extends
86 | npm init
87 | ```
88 | 
89 | > Note:
90 | >
91 | > Some commands will be changed, like:
92 | >
93 | > - `yarn start:default` -> `yarn workspace @ui/app start:default`
94 | 
95 | ## Drawbacks
96 | 
97 | A good explanation:
98 | 
99 | https://fossa.com/blog/pros-cons-using-monorepos/
100 | 
101 | ## Alternatives
102 | 
103 | There should be no better way to manage future non-application-related logic; after
104 | all, the front-end code is also attached to the entire chaos-mesh codebase.
105 | 
106 | ## Unresolved questions
107 | 
108 | No.
109 | 
--------------------------------------------------------------------------------
/text/2021-12-09-logging.md:
--------------------------------------------------------------------------------
 1 | # Logging
 2 | 
 3 | ## Summary
 4 | 
 5 | 
6 | 7 | ## Motivation 8 | 9 | 11 | 12 | Logging in each component in Chaos Mesh is bad. Related issue: 13 | https://github.com/chaos-mesh/chaos-mesh/issues/2149. Because of the mess on 14 | logging, the debugging and profiling of the components are difficult. 15 | 16 | This proposal aims to improve the logging observability of the components and 17 | the development experience on printing logs. 18 | 19 | After passed this proposal, I would write a developing guide for logging in the 20 | repo of Chaos Mesh. 21 | 22 | ## Requirements 23 | 24 | - structured logging 25 | - leveled logging 26 | - runtime level configure 27 | - could logging in everywhere (every line of code) 28 | 29 | ## Detailed design 30 | 31 | 36 | 37 | We are going to use [logr](https://github.com/go-logr/logr) and 38 | [zap](https://github.com/uber-go/zap) (with 39 | [zapr](https://github.com/go-logr/zapr) as the shim between them) to construct 40 | the logging facilities. 41 | 42 | [logr](https://github.com/go-logr/logr) is a library that provides a common 43 | interface for logging, in other words, logging facade. It is designed to be used 44 | by different logging frameworks(likes the SLF4J API in Java), with using logr, 45 | we could switch from one logging backend to another without many changes. 46 | 47 | [zap](https://github.com/uber-go/zap) is a high performance logging framework. 48 | [zapr](https://github.com/go-logr/zapr) is the logr implementation using zap. 49 | 50 | ### Global Logger 51 | 52 | Here would exists a global logger at `pkg/log`, which marked as deprecated and 53 | only for the compatibility with the old code. Global Logger could be accessed by 54 | `log.L()`. 55 | 56 | The initial value of the global logger is `logr.DiscardLogger`, which means no 57 | log would be printed anywhere. The global logger should be replaced ONLY ONCE 58 | when the application starts. 59 | 60 | ### Logger as dependency 61 | 62 | The suggested way to use the logger is to use it as a dependency, and make 63 | dependency explicit. 64 | 65 | GOOD Example: 66 | 67 | ```go 68 | type SomeStruct struct { 69 | state string 70 | otherField string 71 | logger logr.Logger 72 | } 73 | 74 | func NewSomeStruct(state string, otherField string, logger logr.Logger) *SomeStruct { 75 | return &SomeStruct{ 76 | state: state, 77 | otherField: otherField, 78 | logger: logger, 79 | } 80 | } 81 | 82 | func (it *SomeStruct) DoSomething() { 83 | it.logger.Info("Doing something") 84 | } 85 | ``` 86 | 87 | MUST NOT use the implicitly logger in the constructor. 88 | 89 | BAD example: 90 | 91 | ```go 92 | func NewSomeStruct(state string, otherField string) *SomeStruct { 93 | return &SomeStruct{ 94 | state: state, 95 | otherField: otherField, 96 | logger: somewhere.DefaultLogger(), 97 | } 98 | } 99 | ``` 100 | 101 | And logger should be a parameter of the function, not a global variable. 102 | 103 | GOOD example: 104 | 105 | ```go 106 | func doSomething(logger logr.Logger) { 107 | logger.Info("Doing something") 108 | } 109 | ``` 110 | 111 | BAD Example: 112 | 113 | ```go 114 | var logger = somewhere.DefaultLogger() 115 | 116 | func doSomething() { 117 | logger.Info("Doing something") 118 | } 119 | ``` 120 | 121 | > As you could see, it is actually a closure with an extern/global state. 122 | 123 | ### Logging in Chaos Mesh Dashboard and gin 124 | 125 | Chaos Mesh Dashboard use [gin](https://github.com/gin-gonic/gin) as thew web 126 | framework, and there are some logging printed when requests coming in. 
127 | 
128 | There are two gin middlewares for logging, called `Logger()` and `Recovery()`,
129 | included in the `gin.Default()` function.
130 | 
131 | So we should replace these 2 middlewares with other logger middlewares, and
132 | thanks to the community, there are already some libraries for this:
133 | https://github.com/alron/ginlogr and https://github.com/gin-contrib/zap. We
134 | can choose one of them.
135 | 
136 | In the other code of Chaos Mesh Dashboard, we should use the logger in the same
137 | way as in the other components.
138 | 
139 | ### Logging in cli tools
140 | 
141 | For CLI tools, `stdout` and `stderr` are very important for the user experience, so
142 | you cannot print as many logs to `stdout` as a service application would. But this
143 | does not mean that you should not print anything with the logger; you can still use
144 | logging to show progress, or as a profiling tool.
145 | 
146 | ## Practices
147 | 
148 | ### Should I use the global logger
149 | 
150 | NO. At least, avoid using it in new code.
151 | 
152 | The only reason to use the global logger is that you have nowhere to access an
153 | instance of the logger, which means you might be writing some code that relies on
154 | global state, like some "registration pattern", or `func init()`, or some
155 | extremely simple function like `func min(a, b int) int`.
156 | 
157 | Please avoid using global/package-level variables/state when coding; once a
158 | "simple method" needs a logger to print some message, it probably is not
159 | "simple" anymore, so please refactor it.
160 | 
161 | ### Using logger as function parameter is UGLY! Is there another way
162 | 
163 | YES. Consider refactoring the one simple function into a struct with a method, and
164 | introduce the logger as a dependency of the struct.
165 | 
166 | Or use "Functional Options", see:
167 | https://github.com/uber-go/guide/blob/master/style.md#functional-options.
168 | 
169 | If you are facing a choice between keeping the API clean and writing fewer
170 | lines of code, please choose the former.
171 | 
172 | ### Should I log in library codes
173 | 
174 | YES. When you write code as a library, or write exported code under `pkg/`,
175 | logging not only helps development, but also helps the
176 | users understand the logic of the library. Logging is also an important
177 | part of the observability of the library.
178 | 
179 | And using a logging facade (like logr) gives us flexibility without burdening
180 | library users with a certain logging framework.
181 | 
182 | Only use "high level" `V(n).Info()` (n > 0) in library code; do not use
183 | `Error()`, `V(0).Info()` or `Info()`, because `V(0).Info()` (or an undefined
184 | level) and `Error()` would most probably print something to `stdout`, which
185 | might break the design of CLI tools.
186 | 
187 | ### Logger name
188 | 
189 | The name of the logger should not be defined by the component itself, nor
190 | defined in the constructor. It should be defined by the caller of the
191 | constructor.
192 | 
193 | GOOD Example:
194 | 
195 | ```go
196 | func main() {
197 | 	logger := initializeApplicationRootLogger()
198 | 	NewWebModule(logger.WithName("web")).StartServe()
199 | }
200 | ```
201 | 
202 | BAD Example:
203 | 
204 | ```go
205 | func NewWebModule(logger logr.Logger) *WebModule {
206 | 	return &WebModule{
207 | 		logger: logger.WithName("web"),
208 | 	}
209 | }
210 | ```
211 | 
212 | The name of the logger should follow these rules:
213 | 
214 | - use kebab-case for the name
215 | - follow the hierarchical dot-separated pattern shown in the zapr example below
216 | 
217 | For `zapr` (the logr adapter for zap), the name of a logger is prefixed with its
218 | parent logger's name, which means:
219 | 
220 | ```go
221 | rootLogger := initializeApplicationRootLogger()
222 | chaosDaemonLogger := rootLogger.WithName("chaos-daemon")
223 | daemonServerLogger := chaosDaemonLogger.WithName("daemon-server")
224 | ```
225 | 
226 | The name of `chaosDaemonLogger` is `chaos-daemon`, and the name of
227 | `daemonServerLogger` is `chaos-daemon.daemon-server`.
228 | 
229 | ### Which level should I use
230 | 
231 | TL;DR, for most code please follow these suggestions:
232 | 
233 | - use `logger.Error()` for logging errors at the ERROR level
234 | - use `logger.Info()` (or `logger.V(0).Info()`) for logging at the INFO or WARN
235 | level
236 | - use `logger.V(1).Info()` for logging at the DEBUG level
237 | - use `logger.V(5).Info()` for logging at the TRACE level
238 | 
239 | As for the detailed meaning of each V level: we do not restrict the number of
240 | levels for the logger; here is a list of the levels we recommend to use:
241 | 
242 | - Error() - Error should be used to indicate unexpected errors, for example,
243 | unexpected errors returned by subroutine function calls.
244 | 
245 | - Info() - Info should be used as a human journal or diary. It can also be used
246 | to log expected errors as warnings. Info() has multiple levels:
247 |   - V(0) - This is the default level, ALWAYS visible to users.
248 |     - CLI argument handling, print args before the program starts
249 |     - Application configuration
250 |     - Expected errors as warning messages
251 |     - Journal of the application's major behavior
252 |       - service endpoint: resolve requests
253 |       - long-time batch-job: job start, job complete
254 |     - Acquisition of system resources: listen on a port, persist to a file, etc.
255 |   - V(1) - Default log level for debug.
256 |     - Expected errors that repeat frequently and relate to conditions that will
257 | be corrected, like `StatusReasonConflict` when reconciling
258 |     - Component configuration and initialization
259 |   - V(2) - Useful state changes and conditional branches.
260 |     - Changes of state that are useful for debugging
261 |     - Choosing a branch in a conditional statement
262 |   - V(3) - Extended information about changes
263 |     - Logging with state updates like `err := updateState(); err == nil`
264 |     - Logging an error with more context before wrapping and returning it
265 |   - V(4) - Debug level verbosity, any other behavior that changes the state
266 |   - V(5) - Default log level for trace.
267 |     - Progress in a for-loop
268 |     - Things omitted within a "best effort" pattern
269 |     - Falling back to a default value when one is not configured
270 |   - V(6) - Communication with other components
271 |     - Logging in a handler before processing/resolving a request
272 |     - RPC calls to other components/services
273 |   - V(7) - Detailed information about communication
274 |     - Detailed payload and status of a request/response within RPC
275 | 
276 | The number of levels is still limited by the backend logging framework; for
277 | example, you can only create 128 levels in zap.
278 | 
279 | ### Relations between error and logging
280 | 
281 | You can resolve an error by logging it with `Error()`; this usually means the
282 | program cannot handle the error properly, and it lets the user know. But if your
283 | code can handle the error, including "throwing" it to the upper
284 | level, you should not log it with `Error()`.
285 | 
286 | You can also use debug-level `Info()` to log errors, with more detailed context
287 | and behaviors. This does not conflict with the former rule.
288 | 
289 | ## Drawbacks
290 | 
291 | 
292 | 
293 | ## Alternatives
294 | 
295 | 
299 | 
300 | ### Why zap? And why not another logging framework as the logr backend
301 | 
302 | There are still many other logging frameworks supported by logr:
303 | https://github.com/go-logr/logr#implementations-non-exhaustive. I selected zap
304 | only because I am familiar with it, and I think the API of zap is easy to use
305 | and has lots of configuration options for features.
306 | 
307 | Please comment on it if you have any suggestions.
308 | 
309 | ## Unresolved questions
310 | 
311 | 
312 | 
313 | logr just released its v1.x.x stable version, which brings BREAKING CHANGES
314 | compared with v0.x.x: it changes `logr.Logger` from an `interface` to a `struct`, see
315 | https://github.com/go-logr/logr/pull/42.
316 | 
317 | Some libraries we use in Chaos Mesh, like `controller-runtime`, have upgraded to
318 | logr v1.x.x on the master branch, but have not released a stable version. The
319 | latest version of `controller-runtime` is `0.10.3`, and `0.11.0-beta.0` was
320 | released on `2021-11-10`; it still needs several months to be stable.
321 | 
322 | So we have to use logr v0.4.0 for a while, and several months later, we will
323 | upgrade to logr v1.x.x.
324 | 
--------------------------------------------------------------------------------
/text/2021-12-29-openapi-to-typescript-api-client-and-forms.md:
--------------------------------------------------------------------------------
 1 | # OpenAPI to TypeScript API Client and Forms
 2 | 
 3 | > Updated on 2022-11-03:
 4 | >
 5 | > After much practice, I found that using [Orval](https://orval.dev/) instead of
 6 | > [OpenAPITools/openapi-generator](https://github.com/OpenAPITools/openapi-generator)
 7 | > eliminates the dependency on the JRE, which solves the cons mentioned below.
 8 | >
 9 | > So I plan to use Orval in the future to generate the client.
10 | >
11 | > The other parts in this RFC still remain unchanged. :)
12 | 
13 | ## Summary
14 | 
15 | Use the OpenAPI specification to generate a TypeScript API client and forms.
16 | 
17 | ## Motivation
18 | 
19 | Ref: https://github.com/chaos-mesh/chaos-mesh/issues/2615.
20 | 
21 | Currently, the chaos types are split between the front-end and back-end, and we
22 | can't reuse types that have already been defined.
23 | 
24 | You can find these **extra** type definitions in https://github.com/chaos-mesh/chaos-mesh/blob/release-2.1/ui/app/src/components/NewExperimentNext/data/types.ts.
25 | 
26 | This leads to several problems:
27 | 
28 | 1. The front-end needs to be manually synchronized when there is an update to the
29 | type definition.
30 | 2. As `1` continues to happen, there are more and more things to maintain manually.
31 | 
32 | Considering our existing maintenance cost and past experience, we are not able
33 | to synchronize the corresponding descriptions to the front-end in a timely manner.
34 | 
35 | This also works against those who want to contribute to Chaos Mesh, and it would
36 | be great if modifying the API alone were enough to synchronize the changes to the UI.
37 | 
38 | So the best solution for now is to automate the generation of the schemas needed
39 | for the front-end.
40 | 
41 | ## Detailed design
42 | 
43 | I will split the details into two parts:
44 | 
45 | - The first part is the generation of the TypeScript schemas.
46 | - The second part is how to use the generated files to produce a forms skeleton.
47 | 
48 | ### Generate TypeScript Schemas
49 | 
50 | Luckily, there are several tools that can help us generate the schemas. All of
51 | them have pros and cons; I finally chose
52 | [OpenAPITools/openapi-generator](https://github.com/OpenAPITools/openapi-generator),
53 | because:
54 | 
55 | - Pros
56 |   - It's official.
57 |   - It's a very popular tool and has a huge community.
58 | - Cons
59 |   - The Node.js package is just a wrapper around the JAR, so a JRE is needed.
60 | 
61 | Other tools I tried:
62 | 
63 | - [openapi-typescript](https://www.npmjs.com/package/openapi-typescript)
64 | - [openapi-typescript-codegen](https://www.npmjs.com/package/openapi-typescript-codegen)
65 | 
66 | Both of these tools have their own pros and cons, roughly the opposite of those
67 | described above.
68 | 
69 | Then we can create a new package to handle the generation:
70 | 
71 | ```sh
72 | mkdir -p ui/packages/openapi
73 | cd ui/packages/openapi
74 | yarn init
75 | ```
76 | 
77 | Install dependencies and add the `generate` script:
78 | 
79 | ```json
80 | {
81 |   "scripts": {
82 |     "generate": "export TS_POST_PROCESS_FILE='../../node_modules/.bin/prettier --write'; openapi-generator-cli generate -c openapiconfig.json -i ../../../pkg/dashboard/swaggerdocs/swagger.yaml -g typescript-axios -o ../../app/src/openapi --enable-post-process-file"
83 |   }
84 | }
85 | ```
86 | 
87 | This script outputs the files directly into the `ui/app/src/openapi` directory
88 | to avoid additional compilation.
89 | 
90 | ### Use TypeScript Compiler API to Generate Forms
91 | 
92 | To generate forms, we have to think about these things before we write the code:
93 | 
94 | - Construct the interface of a form field.
95 | - How to handle dependencies between fields?
96 | - How to find shared fields?
97 | - Some special cases.
98 | 
99 | For the first item, a front-end form field component should normally contain at
100 | least these properties:
101 | 
102 | - type
103 | - label
104 | - value
105 | - description
106 | 
107 | Converted to JSON:
108 | 
109 | ```json
110 | {
111 |   "field": "text",
112 |   "label": "Name",
113 |   "value": "",
114 |   "helperText": "Fill your name"
115 | }
116 | ```
117 | 
118 | The types we currently have are:
119 | 
120 | ```ts
121 | type FieldType =
122 |   | "text"
123 |   | "textarea"
124 |   | "number"
125 |   | "select"
126 |   | "label"
127 |   | "autocomplete";
128 | ```
129 | 
130 | Most of them are inherited from the HTML input types, except `label` and `autocomplete`.
131 | The `label` is represented as `string[]`. The `autocomplete` can be ignored for
132 | now because none of the generated fields will use it.
133 | 
134 | Done. Next we need to think about how to handle dependencies between fields.
135 | 
136 | #### Dependencies between fields
137 | 
138 | Mostly, we use an `action` field to distinguish exactly what we want a chaos to
139 | do with the injection. So I'll start from here, but how can we find the different actions?
139 | The key problem is that the OpenAPI generator can't convert a Go `type alias` to
140 | TS `enums`. For example:
141 | 
142 | ```go
143 | const (
144 | 	Ec2Stop AWSChaosAction = "ec2-stop"
145 | 	Ec2Restart AWSChaosAction = "ec2-restart"
146 | 	DetachVolume AWSChaosAction = "detach-volume"
147 | )
148 | ```
149 | 
150 | We expect these to convert to:
151 | 
152 | ```ts
153 | enum AWSChaosAction {
154 |   Ec2Stop = "ec2-stop",
155 |   Ec2Restart = "ec2-restart",
156 |   DetachVolume = "detach-volume",
157 | }
158 | ```
159 | 
160 | But unfortunately, what we actually get is only:
161 | 
162 | ```ts
163 | export interface V1alpha1AWSChaosSpec {
164 |   /**
165 |    * @type {string}
166 |    * @memberof V1alpha1AWSChaosSpec
167 |    */
168 |   action?: string;
169 | 
170 |   //...
171 | }
172 | ```
173 | 
174 | This prevents us from using the TS compiler API to read the enum, but the solution
175 | is simple: **since we can't get the key information through the code, we need
176 | to define it manually**.
177 | 
178 | How about defining a JSON file? That was my first thought, but @STRRL reminded
179 | me that **we can write it in the comments**. Yes, like `kubebuilder`'s markers, we
180 | can define our own markers.
181 | 
182 | So the following are all the markers we will be using:
183 | 
184 | - +kubebuilder:validation:Enum=action1;action2;action3
185 | 
186 |   +ui:form:enum=action1;action2;action3
187 | 
188 |   > Reuse the kubebuilder validation marker to indicate what actions we have.
189 |   > For uniformity, you can also write it as `+ui:form:enum`.
190 |   >
191 |   > Besides `action`, this can also be reused to define a `select` field.
192 | 
193 | - +ui:form:when=action=='action1'
194 | 
195 |   > Indicates which action this property belongs to.
196 |   >
197 |   > The value is an `expression` which needs to be evaluated at runtime.
198 | 
199 | - +ui:form:ignore
200 | 
201 |   > Ignores this property.
202 | 
203 | For example:
204 | 
205 | ```go
206 | type AWSChaosSpec struct {
207 | 	// +ui:form:enum=ec2-stop;ec2-restart;detach-volume
208 | 	// +kubebuilder:validation:Enum=ec2-stop;ec2-restart;detach-volume
209 | 	Action AWSChaosAction `json:"action"`
210 | 
211 | 	//...
212 | }
213 | 
214 | type AWSSelector struct {
215 | 	// Endpoint indicates the endpoint of the aws server. Just used it in test now.
216 | 	// +ui:form:ignore
217 | 	// +optional
218 | 	Endpoint *string `json:"endpoint,omitempty"`
219 | 
220 | 	//...
221 | 
222 | 	// DeviceName indicates the name of the device.
223 | 	// +ui:form:when=action=='detach-volume'
224 | 	// +optional
225 | 	DeviceName *string `json:"deviceName,omitempty" webhook:"AWSDeviceName,nilable"`
226 | }
227 | ```
228 | 
229 | The rest of the steps are simple: use regular expressions to read them:
230 | 
231 | ```js
232 | // Part of the code
233 | const UI_FORM_ENUM = /\+ui:form:enum=(.+)\s/;
234 | const KUBEBUILDER_VALIDATION_ENUM = /\+kubebuilder:validation:Enum=(.+)\s/;
235 | 
236 | /**
237 |  * Get enum array from jsdoc comment.
238 |  *
239 |  * @export
240 |  * @param {string} s
241 |  * @return {string[]}
242 |  */
243 | export function getUIFormEnum(s) {
244 |   let matched = s.match(UI_FORM_ENUM) || s.match(KUBEBUILDER_VALIDATION_ENUM);
245 | 
246 |   return matched ? matched[1].split(";") : [];
247 | }
248 | 
249 | // ...
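// Not part of the original RFC: a minimal sketch (assuming `ts` is the
// imported "typescript" module) of how such a Node could be located, by
// walking the generated .ts file with the compiler API's visitor pattern.
function visit(node) {
  // Property signatures like `action?: string;` carry the jsdoc markers.
  if (ts.isPropertySignature(node) && node.jsDoc) {
    handleProperty(node); // hypothetical handler that runs the logic below
  }
  ts.forEachChild(node, visit);
}
// Entry point: ts.forEachChild(sourceFile, visit);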
250 | 
251 | // Assuming that the Node is found
252 | const { escapedText: identifier } = node.name; // identifier
253 | const comment = node.jsDoc[0].comment;
254 | 
255 | if (identifier === "action") {
256 |   // Get all actions
257 |   actions = getUIFormEnum(comment);
258 | }
259 | ```
260 | 
261 | Similarly, we can handle the other markers in this way. So far we have solved this problem.
262 | 
263 | #### Shared fields
264 | 
265 | Now this problem also becomes simple. Since we have defined markers, we can adopt
266 | the default rule:
267 | 
268 | **The fields without `+ui:form:when=action=='xxx'` and `+ui:form:ignore` are shared**.
269 | 
270 | I think we can skip the code details for this part; it is enough to understand
271 | the rule above.
272 | 
273 | #### Chaos without action
274 | 
275 | Here is another case: what if a chaos doesn't have an action, like `KernelChaos`
276 | and `TimeChaos`?
277 | 
278 | Such a chaos will output an empty actions array, `export const actions = []`, as a placeholder.
279 | 
280 | #### Non-primitive type
281 | 
282 | If a field is not a primitive type, TypeScript will use a `TypeReference` to represent
283 | it. Unlike primitive types, once a type reference appears, we need to recursively
284 | resolve the corresponding type symbol. We will use the type checker to
285 | achieve this:
286 | 
287 | ```ts
288 | const program = ts.createProgram([source], {
289 |   target: ts.ScriptTarget.ES2015,
290 | });
291 | const sourceFile = program.getSourceFile(source);
292 | const checker = program.getTypeChecker(); // this is what we need
293 | 
294 | const type = checker.getTypeAtLocation(typeRef); // get the final type
295 | ```
296 | 
297 | So we still need a new field type to represent it. Here is an example:
298 | 
299 | ```js
300 | {
301 |   field: "ref",
302 |   label: "callchain",
303 |   multiple: true,
304 |   children: [
305 |     {
306 |       field: "text",
307 |       label: "funcname",
308 |       value: "",
309 |       helperText: ""
310 |     },
311 |     {
312 |       field: "text",
313 |       label: "parameters",
314 |       value: "",
315 |       helperText: ""
316 |     },
317 |     {
318 |       field: "text",
319 |       label: "predicate",
320 |       value: "",
321 |       helperText: ""
322 |     }
323 |   ]
324 | },
325 | ```
326 | 
327 | When a field is represented as a `ref`, it must contain the `children` key,
328 | which means it will render a group of children in the interface.
329 | 
330 | There is also a `multiple` key to indicate whether the children are rendered repeatedly.
331 | 
332 | #### Result
333 | 
334 | Finally, we will generate `AWSChaos` like this:
335 | 
336 | ```ts
337 | /**
338 |  * This file was auto-generated by @ui/openapi.
339 |  * Do not make direct changes to the file.
340 |  */
341 | 
342 | export const actions = [],
343 |   data = [
344 |     {
345 |       field: "text",
346 |       label: "awsRegion",
347 |       value: "",
348 |       helperText: "AWSRegion defines the region of aws.",
349 |     },
350 |     {
351 |       field: "text",
352 |       label: "deviceName",
353 |       value: "",
354 |       helperText:
355 |         "Optional. DeviceName indicates the name of the device. Needed in detach-volume.",
356 |       when: "action=='detach-volume'",
357 |     },
358 |     {
359 |       field: "text",
360 |       label: "ec2Instance",
361 |       value: "",
362 |       helperText: "Ec2Instance indicates the ID of the ec2 instance.",
363 |     },
364 |     {
365 |       field: "text",
366 |       label: "secretName",
367 |       value: "",
368 |       helperText: "Optional. SecretName defines the name of kubernetes secret.",
369 |     },
370 |     {
371 |       field: "text",
372 |       label: "volumeID",
373 |       value: "",
374 |       helperText:
375 |         "Optional. EbsVolume indicates the ID of the EBS volume. Needed in detach-volume.",
376 |       when: "action=='detach-volume'",
377 |     },
378 |   ];
379 | ```
380 | 
381 | Also `KernelChaos`:
382 | 
383 | ```ts
384 | export const actions = [],
385 |   data = [
386 |     {
387 |       field: "ref",
388 |       label: "failKernRequest",
389 |       children: [
390 |         {
391 |           field: "ref",
392 |           label: "callchain",
393 |           multiple: true,
394 |           children: [
395 |             {
396 |               field: "text",
397 |               label: "funcname",
398 |               value: "",
399 |               helperText: "xxx",
400 |             },
401 |             {
402 |               field: "text",
403 |               label: "parameters",
404 |               value: "",
405 |               helperText: "xxx",
406 |             },
407 |             {
408 |               field: "text",
409 |               label: "predicate",
410 |               value: "",
411 |               helperText: "xxx",
412 |             },
413 |           ],
414 |         },
415 |         {
416 |           field: "text",
417 |           label: "failtype",
418 |           value: 0,
419 |           helperText: "xxx",
420 |         },
421 |         {
422 |           field: "label",
423 |           label: "headers",
424 |           value: [],
425 |           helperText: "xxx",
426 |         },
427 |         {
428 |           field: "text",
429 |           label: "probability",
430 |           value: 0,
431 |           helperText: "xxx",
432 |         },
433 |         {
434 |           field: "text",
435 |           label: "times",
436 |           value: 0,
437 |           helperText: "xxx",
438 |         },
439 |       ],
440 |     },
441 |   ];
442 | 
443 | export default shared;
444 | ```
445 | 
446 | But this is not an ideal result, and there are still some details that need to be
447 | addressed, like:
448 | 
449 | - ~~Some useless markers remain in the `helperText`.~~
450 | - ~~If an action only uses shared fields, the spread operator will be excess.~~
451 | - Output components directly? (enhancement)
452 | - ~~$ref siblings aren't supported in OpenAPI v2~~
453 | 
454 | For example:
455 | 
456 | ```yaml
457 | attr:
458 |   $ref: "#/definitions/v1alpha1.AttrOverrideSpec"
459 |   description: |-
460 |     Attr defines the overrided attribution
461 |     +ui:form:when=action=='attrOverride'
462 |     +optional
463 |   type: object
464 | ```
465 | 
466 | The above definition will be converted to:
467 | 
468 | ```ts
469 | /**
470 |  *
471 |  * @type {V1alpha1AttrOverrideSpec}
472 |  * @memberof V1alpha1IOChaosSpec
473 |  */
474 | attr?: V1alpha1AttrOverrideSpec
475 | ```
476 | 
477 | The `description` will be lost.
478 | 
479 | ## Drawbacks
480 | 
481 | The biggest disadvantage of automation is that we generate a lot of useless structures,
482 | but luckily we can fix this in the compilation phase.
483 | 
484 | There is also the fact that we have to adapt the existing code to what the automation
485 | generates, which can still be a lot of work. Automation reduces the coding burden,
486 | but it can make debugging more difficult.
487 | 
488 | ## Alternatives
489 | 
490 | There is still another way to do these things: use Go code to generate the code.
491 | But it is not the best way, because **it only benefits the part that converts
492 | native Go types to TypeScript types; it is hard to generate real TypeScript
493 | types this way (they would have to be written from scratch)**.
494 | 
495 | ## Unresolved questions
496 | 
497 | Already described in [Result](#result).
--------------------------------------------------------------------------------
/text/2022-01-17-keep-a-changelog.md:
--------------------------------------------------------------------------------
 1 | # Keep A Changelog
 2 | 
 3 | ## Summary
 4 | 
 5 | 
 6 | 
 7 | Proposal for keeping a changelog for https://github.com/chaos-mesh/chaos-mesh.
 8 | 
 9 | ## Motivation
10 | 
11 | 
13 | 
14 | It takes a lot of hand work to collect all the changes into release notes
15 | before releasing a new version. It might be better to keep a changelog, committing
16 | the changes with each PR.
17 | 
18 | Thanks to @yangkeao, who introduced us to https://keepachangelog.com/ and its
19 | best practices.
20 | 
21 | ## Detailed design
22 | 
23 | 
28 | 
29 | This proposal contains no technical details or code; it only
30 | affects several steps in creating and reviewing PRs.
31 | 
32 | Most of the following content is copied from https://keepachangelog.com/; we will
33 | follow its best practices and patterns.
34 | 
35 | ### Guiding Principles
36 | 
37 | - Changelogs are for humans, not machines.
38 | - There should be an entry for every single version.
39 | - The same types of changes should be grouped.
40 | - Versions and sections should be linkable.
41 | - The latest version comes first.
42 | - The release date of each version is displayed.
43 | - Mention whether you follow [Semantic Versioning](https://semver.org/).
44 | 
45 | ### Types of changes
46 | 
47 | - `Added` for new features.
48 | - `Changed` for changes in existing functionality.
49 | - `Deprecated` for soon-to-be removed features.
50 | - `Removed` for now removed features.
51 | - `Fixed` for any bug fixes.
52 | - `Security` in case of vulnerabilities.
53 | 
54 | When creating a new `Unreleased` section, we keep all the types of changes
55 | with "Nothing" entries, and trim the empty entries when releasing a new version.
56 | 
57 | ### When to create new items in the changelog
58 | 
59 | - As a contributor, when opening a new PR, if you think this PR should be
60 |   considered in the changelog/release notes, please create a new item in the
61 |   changelog.
62 | - As a reviewer, if you think a PR that does not update the changelog is
63 |   important enough to deserve an entry, please ask the author of the PR to create one.
64 | - As a contributor, if you find a change that is important enough to deserve a
65 |   changelog entry but does not have one, please open a new PR to update the changelog.
66 | - As a reviewer/committer/maintainer, when releasing a new version, sync the
67 |   changelog with the released version.
68 | 
69 | ### CHANGELOG.md in release-* branches
70 | 
71 | We should maintain a changelog for each active `release-*` branch; for now,
72 | they are `release-2.0` and `release-2.1`. I will create a new file
73 | `CHANGELOG.md` in each branch from the existing release notes. `CHANGELOG.md`
74 | should also contain an `Unreleased` section for the next patch/bugfix release.
75 | 
76 | When we cherry-pick a PR into a `release-*` branch, if the original PR already
77 | has items in the `Unreleased` section, they can also be cherry-picked there.
78 | 
79 | #### When releasing major, minor, or bugfix/patch version
80 | 
81 | The detailed steps for creating a new release should be documented in the release
82 | guide `RELEASE.md`. We should manually change the `Unreleased` section to `[x.y.z] -
83 | YYYY-MM-DD` in the `release-*` branch, and manually update `CHANGELOG.md` in the
84 | `master` branch.
85 | 
86 | So `CHANGELOG.md` in `release-*` only contains sections for minor and
87 | bugfix/patch releases, and `CHANGELOG.md` in `master` contains sections for all
88 | the releases.
89 | 
90 | One principle is "The latest version comes first", and by semantic
91 | versioning's comparison rules, `2.1.2 > 2.0.6`.
So the sections in `CHANGELOG.md` in
92 | `master` should look like:
93 | 
94 | ```markdown
95 | 
96 | - [2.1.2] - 2021-12-29
97 | 
98 | (contents)
99 | 
100 | - [2.1.1] - 2021-12-10
101 | 
102 | (contents)
103 | 
104 | - [2.1.0] - 2021-11-30
105 | 
106 | (contents)
107 | 
108 | - [2.0.6] - 2021-12-29
109 | 
110 | (contents)
111 | 
112 | - [2.0.5] - 2021-11-25
113 | 
114 | (contents)
115 | 
116 | ```
117 | 
118 | ### What would be the first step
119 | 
120 | We will create a file called `CHANGELOG.md` in the `master` branch, with content like:
121 | 
122 | ```markdown
123 | # Chaos Mesh Changelog
124 | 
125 | (descriptions)
126 | 
127 | (guideline, link to this rfc)
128 | 
129 | ## [Unreleased]
130 | 
131 | ### Added
132 | 
133 | - some [#(pr-number)](https://link)
134 | - entries [#(pr-number)](https://link) [#(pr-number)](https://link)
135 | - (I would collect items from commits in master branch)
136 | 
137 | ### Changed
138 | 
139 | - ditto
140 | 
141 | ### Deprecated
142 | 
143 | (leave "Nothing" only in "Unreleased" section)
144 | - Nothing
145 | 
146 | ### Removed
147 | 
148 | - Nothing
149 | 
150 | ### Fixed
151 | 
152 | - Nothing
153 | 
154 | ### Security
155 | 
156 | - Nothing
157 | 
158 | ## [2.1.2] - 2021-12-29
159 | 
160 | ### Changed
161 | 
162 | - some [#(pr-number)](https://link)
163 | - entries [#(pr-number)](https://link) [#(pr-number)](https://link)
164 | - (I would collect items from existing release notes)
165 | 
166 | ### Fixed
167 | 
168 | - ditto
169 | 
170 | ## [2.0.6] - 2021-12-29
171 | 
172 | ### Changed
173 | 
174 | - ditto
175 | 
176 | ### Fixed
177 | 
178 | - ditto
179 | ```
180 | 
181 | And create a `CHANGELOG.md` in each `release-*` branch with similar content.
182 | 
183 | ## Drawbacks
184 | 
185 | 
186 | 
187 | No foreseeable drawbacks block this proposal now.
188 | 
189 | ## Alternatives
190 | 
191 | 
195 | 
196 | No other alternative solutions yet.
197 | 
198 | ## Unresolved questions
199 | 
200 | 
201 | 
202 | Does Chaos Mesh follow Semantic Versioning?
203 | 
204 | I think Chaos Mesh does not follow Semantic Versioning now. We release
205 | major versions more for marketing reasons, and we introduce breaking API
206 | changes in minor releases. It does not matter much now, because
207 | https://github.com/chaos-mesh/chaos-mesh is not designed to be used as a
208 | dependency by other projects. I think we should mention that in the changelog.
209 | 
210 | Regarding "Types of changes", which pattern would be better?
211 | 
212 | Pattern A:
213 | 
214 | - `Added` for new features.
215 | - `Changed` for changes in existing functionality.
216 | - `Deprecated` for soon-to-be removed features.
217 | - `Removed` for now removed features.
218 | - `Fixed` for any bug fixes.
219 | - `Security` in case of vulnerabilities.
220 | 
221 | Pattern B:
222 | 
223 | - `New Features` for new features.
224 | - `Enhancements` for changes in existing functionality.
225 | - `Deprecated` for soon-to-be removed features.
226 | - `Removed` for now removed features.
227 | - `Fixed` for any bug fixes.
228 | - `Security` in case of vulnerabilities.
229 | 
230 | I prefer Pattern A because a "change" might not be an "enhancement". Maybe we could
231 | mix them.
232 | 
--------------------------------------------------------------------------------
/text/2022-02-21-workflow-status-check.md:
--------------------------------------------------------------------------------
 1 | # Status Check in Workflow
 2 | 
 3 | ## Summary
 4 | 
 5 | The status check is responsible for collecting the system status before,
 6 | during, and after the chaos workflow execution, and is used to determine
 7 | whether the chaos workflow is successful or not. The chaos workflow can
 8 | also be stopped automatically when the system becomes unhealthy during
 9 | its execution.
10 | 
11 | ## Motivation
12 | 
13 | Currently, when executing chaos workflows, the user cannot quickly determine
14 | the impact of the chaos workflow on the system.
15 | 
16 | One conceivable path is:
17 | 
18 | 1. click to start the workflow
19 | 1. manually check the key panels on the monitoring system
20 | 1. if the system is observed to become unhealthy
21 | 1. go back to Chaos Dashboard and manually stop the workflow
22 | 
23 | This is clearly a user-unfriendly design. To optimize this process, it is
24 | necessary to introduce status checks into workflows.
25 | 
26 | ## Detailed design
27 | 
28 | ### Concept
29 | 
30 | #### StatusCheck Template
31 | 
32 | `StatusCheck Template` enables a `StatusCheck` to be quickly reused by
33 | multiple workflows. Users can create the `StatusCheck Template` in advance
34 | and then create the `StatusCheck` by referring to the `StatusCheck Template`
35 | when creating the workflow.
36 | 
37 | #### StatusCheck
38 | 
39 | `StatusCheck` defines how the user wants to check the health status of
40 | the system. Users can create a `StatusCheck` by referring to a
41 | `StatusCheck Template` or by customizing it on Chaos Dashboard.
42 | 
43 | ### `StatusCheck` Properties
44 | 
45 | #### General Properties
46 | 
47 | - Execution mode; it can be Continuous or Synchronous
48 | - Overall execution time; it corresponds to the `deadline` of the
49 |   `WorkflowNode`, after which the execution of the `StatusCheck` stops
50 | - Timeout of a single execution; different types of StatusCheck implement this
51 |   differently. For example, in an HTTP StatusCheck it is the response time
52 |   of the request, beyond which the execution is considered to have failed
53 | - Number of retries
54 |   - BTW, a Synchronous StatusCheck also has a retry mechanism, to reduce
55 |     the impact of system jitter
56 | - Retry interval
57 | - Whether a failure aborts the workflow
58 | - How many execution history records are kept
59 | 
60 | Here are the details of the execution mode. Continuous StatusCheck and
61 | Synchronous StatusCheck are both supported as children of a
62 | `Parallel WorkflowNode` or `Serial WorkflowNode`, or as the `EntryNode`
63 | (a StatusCheck as `EntryNode` is not really meaningful, but it can be
64 | written that way).
65 | 
66 | The recommended scenario for a Continuous StatusCheck is as follows:
67 | the Continuous StatusCheck is a child of a Parallel `EntryNode`.
68 | In this scenario, the status check continues
69 | throughout the workflow execution.
70 | 
71 | ```yaml
72 | templates:
73 |   - name: the-entry
74 |     templateType: Parallel
75 |     deadline: 240s
76 |     children:
77 |       - status-check
78 |       - node1
79 |       - node2
80 |   - name: status-check
81 |     templateType: StatusCheck
82 |     ...
83 | ```
84 | 
85 | The YAML example for the rest of the cases is as follows:
86 | 
87 | ```yaml
88 | templates:
89 |   - name: node0
90 |     templateType: Serial
91 |     deadline: 240s
92 |     children:
93 |       - status-check
94 |       - node1
95 |       - node2
96 |   - name: status-check
97 |     templateType: StatusCheck
98 |     ...
99 | ```
100 | 
101 | #### The Status of `StatusCheck`
102 | 
103 | - Conditions of the current StatusCheck
104 | - StatusCheck execution history, including execution times and outcomes
105 | 
106 | ### Status Check with HTTP
107 | 
108 | An HTTP StatusCheck determines the health of the system by the response code,
109 | response time, or response body returned from the request URL.
110 | 
111 | #### HTTP StatusCheck Properties
112 | 
113 | - Request URL
114 |   - For example: system health API, system key API, Grafana alert API
115 | - Request Method
116 | - Request Header
117 |   - For example: AUTH-KEY
118 | - Response Time
119 | - Response Code
120 | - Response Body
121 | 
122 | Here is an example YAML of an HTTP StatusCheck:
123 | 
124 | ```yaml
125 | apiVersion: chaos-mesh.org/v1alpha1
126 | kind: StatusCheck
127 | metadata:
128 |   name: try-workflow-status-check
129 |   annotations:
130 |     "experiment.chaos-mesh.org/abort": "false"
131 |     "experiment.chaos-mesh.org/description": "try-workflow-status-check"
132 | spec:
133 |   mode: Synchronous
134 |   type: HTTP
135 |   deadline: 20s
136 |   timeoutSeconds: 1
137 |   failureThreshold: 3
138 |   periodSeconds: 3
139 |   historyLimit: 10
140 |   abortIfFailed: true
141 |   http:
142 |     url: http://1.1.1.1:8080
143 |     method: GET
144 |     body: ""
145 |     headers:
146 |       - name: a
147 |         value: b
148 |     criteria:
149 |       responseCode: "200-209"
150 | status:
151 |   conditions:
152 |     - type: Abort # ProbeSuccess/Accomplished/DeadlineExceed/Abort
153 |       status: "False"
154 |       reason: "Unknown"
155 |   records:
156 |     - probeTime: 2018-01-01T00:00:00Z
157 |       outcome: Success # Success/Failure
158 | ```
159 | 
160 | HTTP StatusCheck vs. the existing `Workflow HTTP Request Task`:
161 | 
162 | - `Workflow HTTP Request Task` is a one-shot request, not a continuous one
163 | - `Workflow HTTP Request Task` cannot stop the workflow
164 | 
165 | #### How to abort `Workflow`
166 | 
167 | There are two ways to abort a `Workflow`:
168 | 
169 | - When the StatusCheck fails, it can abort the workflow automatically
170 | - Users can abort the Workflow manually, by adding the annotation to the workflow
171 | 
172 | ## Drawbacks
173 | 
174 | - Saving the history of a `StatusCheck` in the `status` of the `StatusCheck`
175 | 
176 |   Since an object in Kubernetes cannot exceed `1M`, we have to consider the
177 |   extreme case of a huge amount of `StatusCheck` history, so we added a
178 |   `HistoryLimit` field to limit the number of records that can be saved.
179 | 
180 | ## Alternatives
181 | 
182 | - `StatusCheck` in `Experiment` or `Schedule`?
183 | 
184 |   For now, putting `StatusCheck` in `Workflow` seems to be the appropriate
185 |   choice. We don't want to inflate `Experiment` or `Schedule` functionality
186 |   too much, so if there are scenarios that require `StatusCheck`,
187 |   then using `Workflow` is recommended.
188 | - `StatusCheck` with other types?
189 | 
190 |   For example, getting data from Prometheus metrics, or executing some
191 |   commands.
192 | 
193 |   If you want to get data from Prometheus metrics, it is more efficient to
194 |   determine whether Grafana triggers an alert through HTTP requests (the data
195 |   from Prometheus needs to be computed with PromQL, while Grafana alerts are
196 |   configured in advance, which seems easier to use).
197 | 
198 |   If the HTTP StatusCheck does not meet your needs, feel free to make
199 |   suggestions (BTW, it is better to explain your usage scenarios,
200 |   so that we can better help you solve the problem).
201 | 
202 | ## Unresolved questions
203 | 
--------------------------------------------------------------------------------