├── LICENSE ├── NOTICE ├── README.adoc ├── docs ├── backups.adoc ├── components.adoc ├── ha.adoc ├── ha_3scale.adoc ├── ha_dbs.adoc └── observability.adoc ├── images ├── 3scale-HA-scenarios.svg ├── 3scale-architecture.png ├── 3scale-ha-active-passive-different-region-synced-databases.png ├── 3scale-ha-active-passive-same-region-shared-databases.png └── 3scale-pods-by-type.png └── sops ├── README.adoc └── alerts ├── apicast_apicast_latency.adoc ├── apicast_http_4xx_error_rate.adoc ├── apicast_request_time.adoc ├── apicast_worker_restart.adoc ├── backend_listener_5xx_requests_high.adoc ├── backend_worker_jobs_count_running_high.adoc ├── container_cpu_high.adoc ├── container_cpu_throttling_high.adoc ├── container_memory_high.adoc ├── container_waiting.adoc ├── pod_crash_looping.adoc ├── pod_not_ready.adoc ├── prometheus_job_down.adoc ├── replication_controller_replicas_mismatch.adoc ├── system_app_5xx_requests_high.adoc ├── zync_5xx_requests_high.adoc ├── zync_que_failed_job_count_high.adoc ├── zync_que_ready_job_count_high.adoc └── zync_que_scheduled_job_count_high.adoc /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Red Hat 3scale API Management Operations 2 | Copyright (c) 2016-2019 Red Hat, Inc. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 
6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | -------------------------------------------------------------------------------- /README.adoc: -------------------------------------------------------------------------------- 1 | = 3scale Operations 2 | 3 | This repository contains documentation and tools to help in the installation, operation, ongoing maintenance and upgrade of an instance of 3scale API Management, whether you are deploying it in a *public cloud*, a *private cloud* or *on-premises*. 4 | 5 | This documentation assumes familiarity with the main components of the 3scale architecture. If you would like to learn more about them, please check their upstream repositories: 6 | 7 | * link:https://github.com/3scale/apicast[apicast] 8 | * link:https://github.com/3scale/apisonator[apisonator (also known as "backend")] 9 | * link:https://github.com/3scale/porta[porta (also known as "system")] 10 | * link:https://github.com/3scale/zync[zync] 11 | 12 | It also assumes that you are installing 3scale on link:https://www.openshift.com/[Red Hat OpenShift] using the link:https://github.com/3scale/3scale-operator[3scale Operator]. 13 | 14 | Sections will be added over time and updated when relevant changes are made in 3scale or 15 | the specific platforms it runs on. 16 | 17 | == Architecture 18 | 19 | image::images/3scale-architecture.png[3scale Architecture] 20 | 21 | == Operations 22 | 23 | Documents in this section describe typical operation procedures once you have your 3scale installation up and running: 24 | 25 | * link:docs/components.adoc[Component responsibilities and impact when down] 26 | * link:docs/backups.adoc[Backup & Restore procedures] 27 | * link:docs/observability.adoc[Observability: Apps & Databases] 28 | * link:docs/ha.adoc[High-Availability] -------------------------------------------------------------------------------- /docs/backups.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = Backup & Restore 5 | 6 | image:https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Warning.svg/156px-Warning.svg.png[] 7 | 8 | * *It is recommended to use link:ha_dbs.adoc[external databases]* instead of the internal databases that are installed by default 9 | * For easier backup/restore management when using external databases on public cloud providers, the recommendation is to set up automated backups using the public cloud provider's integrated backup solution (for example, AWS RDS with its automated daily snapshots and PITR), if available 10 | * If using the internal databases, backup and restore can be done using the link:https://github.com/3scale/3scale-operator[3scale Operator] by following these link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-backup-and-restore.md#3scale-installation-backup-and-restore-using-the-operator[instructions] 11 | * *However, if for any reason you don't want to use the link:https://github.com/3scale/3scale-operator[3scale Operator] to manage backup/restore of the internal databases, in the following sections you can find the deprecated way of doing manual backup and
restore. This is not recommended, use at your own risk* 12 | 13 | == Manual Backup & Restore [Deprecated] 14 | 15 | toc::[] 16 | 17 | This section provides the Operator of a 3scale installation with the information they need for: 18 | 19 | * <<Backup Procedures>> to back up persistent data 20 | * <<Restore Procedures>> to restore from a backup of the persistent data 21 | 22 | With these, in the case of a failure, they can restore 3scale to a correctly running operational state and continue 23 | operating. 24 | 25 | === Persistent Volumes 26 | 27 | In a 3scale deployment on OpenShift, all persistent data is stored either in a storage service in the cluster 28 | (not currently used), in a Persistent Volume (PV) provided to the cluster by the underlying infrastructure, or in a storage 29 | service external to the cluster (be that in the same data center or elsewhere). 30 | 31 | === Considerations 32 | 33 | The backup and restore procedures for persistent data vary depending on the storage used, to ensure the 34 | backups and restores preserve data consistency (e.g. that a partial write or a partial transaction is not captured). 35 | That is, it is not sufficient to back up the underlying Persistent Volumes for a database; instead, the database's own backup 36 | mechanisms should be used. 37 | 38 | Also, some parts of the data are synchronized between different components. One copy is considered the "source of truth" 39 | for the data set, and the other a copy that is not modified locally, just synchronized from the "source of truth". 40 | In these cases, upon restore, the "source of truth" should be restored first and then the copies in other components 41 | synchronized from it. 42 | 43 | == Data Sets 44 | 45 | This section goes into more detail on the different data sets in the different persistent stores, their purposes, 46 | the storage type used, and whether each is the "source of truth" or not. 47 | 48 | The full state of a 3scale deployment is stored across these services and their persistent volumes: 49 | 50 | * `system-mysql`: MySQL database (`mysql-storage`) 51 | * `system-storage`: Volume for files 52 | * `zync-database`: PostgreSQL database for the `zync` component. This uses "HostPath" as storage. If the pod is moved to 53 | another node, the data is lost. That is not a problem because the data consists of sync jobs and does not need to be 100% 54 | persistent. 55 | * `backend-redis`: Redis database (`backend-redis-storage`) 56 | * `system-redis`: Redis database (`system-redis-storage`) 57 | 58 | === `system-mysql` OR `system-database` OR external Oracle database 59 | 60 | This is a relational database for storing the information about users, accounts, APIs, plans, etc. in the 3scale Admin 61 | Console. 62 | 63 | A subset of this information related to Services is synchronized to the `Backend` component and stored in 64 | `backend-redis`. `system-mysql` OR `system-database` OR the external Oracle database is the source of truth for this information. 65 | 66 | === `system-storage` 67 | 68 | This is for storing files to be read and written by the `System` component. They fall into two categories: 69 | 70 | * Configuration files read by the System component at run-time 71 | * Static files (HTML, CSS, JS, etc.) uploaded to System by its CMS feature, for the purpose of creating a Developer 72 | Portal. 73 | 74 | Note that `System` can be scaled horizontally with multiple pods uploading and reading said static files, hence the 75 | need for a `RWX` PersistentVolume.
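If you want to confirm that your cluster actually provisioned shared storage for this volume, you can inspect the access modes of the corresponding PersistentVolumeClaim. This is only a quick sanity check, and it assumes the PVC is named `system-storage`, as in a default 3scale deployment:

[source,bash]
----
# Should print ["ReadWriteMany"] when the shared (RWX) volume was provisioned correctly
oc get pvc system-storage -o jsonpath='{.spec.accessModes}{"\n"}'
----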
76 | 77 | === `zync-database` 78 | 79 | This is a relational database for storing information related to the synchronization of identities between 3scale and 80 | an Identity Provider. 81 | This information is not duplicated in other components, and this is the sole source of truth. 82 | 83 | === `backend-redis` 84 | 85 | This contains multiple data sets used by the `Backend` component: 86 | 87 | * Usages: This is API usage information aggregated by `Backend`. It is used by `Backend` for rate-limiting decisions 88 | and by `System` to display analytics information in the UI or via API. 89 | * Config: This is configuration information about Services, Rate-limits, etc. that is synchronized from `System` via an 90 | internal API. This is NOT the source of truth for this info; `System` and `system-mysql` are. 91 | * AuthKeys: Storage of OAuth keys created directly in `Backend`. This is the source of truth for this information. 92 | * Queues: Queues of background jobs to be executed by worker processes. These are ephemeral and are deleted once 93 | processed. 94 | 95 | === `system-redis` 96 | 97 | This contains queues for jobs to be processed in the background. These are ephemeral and are deleted once processed. 98 | 99 | == Backup Procedures 100 | 101 | === `system-mysql` 102 | 103 | Execute MySQL Backup Command 104 | 105 | [source,bash] 106 | ---- 107 | oc rsh $(oc get pods -l 'deploymentConfig=system-mysql' -o json | jq -r '.items[0].metadata.name') bash -c 'export MYSQL_PWD=${MYSQL_ROOT_PASSWORD}; mysqldump --single-transaction -hsystem-mysql -uroot system' | gzip > system-mysql-backup.gz 108 | ---- 109 | 110 | === `system-database` 111 | 112 | Execute PostgreSQL Backup Command 113 | 114 | [source,bash] 115 | ---- 116 | oc rsh $(oc get pods -l 'deploymentConfig=system-database' -o json | jq '.items[0].metadata.name' -r) bash -c 'pg_dumpall -c --if-exists' | gzip > system-postgres-backup.gz 117 | ---- 118 | 119 | === External Oracle database 120 | 121 | Follow the Oracle Database Backup and Recovery Quick Start Guide: https://docs.oracle.com/cd/B19306_01/backup.102/b14193/toc.htm 122 | 123 | === `system-storage` 124 | 125 | Archive the system-storage files to another storage location.
126 | 127 | [source,bash] 128 | ---- 129 | oc rsync $(oc get pods -l 'deploymentConfig=system-app' -o json | jq '.items[0].metadata.name' -r):/opt/system/public/system ./local/dir 130 | ---- 131 | 132 | === `zync-database` 133 | 134 | Execute PostgreSQL Backup Command 135 | 136 | [source,bash] 137 | ---- 138 | oc rsh $(oc get pods -l 'deploymentConfig=zync-database' -o json | jq '.items[0].metadata.name' -r) bash -c 'pg_dumpall -c --if-exists' | gzip > zync-database-backup.gz 139 | ---- 140 | 141 | === `backend-redis` 142 | 143 | Back up the dump.rdb file from Redis 144 | 145 | [source,bash] 146 | ---- 147 | oc cp $(oc get pods -l 'deploymentConfig=backend-redis' -o json | jq '.items[0].metadata.name' -r):/var/lib/redis/data/dump.rdb ./backend-redis-dump.rdb 148 | ---- 149 | 150 | === `system-redis` 151 | 152 | Back up the dump.rdb file from Redis 153 | 154 | [source,bash] 155 | ---- 156 | oc cp $(oc get pods -l 'deploymentConfig=system-redis' -o json | jq '.items[0].metadata.name' -r):/var/lib/redis/data/dump.rdb ./system-redis-dump.rdb 157 | ---- 158 | 159 | === `Secrets` 160 | [source,bash] 161 | ---- 162 | oc get secrets system-smtp -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-smtp.json 163 | oc get secrets system-seed -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-seed.json 164 | oc get secrets system-database -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-database.json 165 | oc get secrets backend-internal-api -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > backend-internal-api.json 166 | oc get secrets system-events-hook -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-events-hook.json 167 | oc get secrets system-app -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-app.json 168 | oc get secrets system-recaptcha -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-recaptcha.json 169 | oc get secrets system-redis -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-redis.json 170 | oc get secrets zync -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > zync.json 171 | oc get secrets system-master-apicast -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-master-apicast.json 172 | oc get secrets backend-listener -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > backend-listener.json 173 | oc get secrets backend-redis -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > backend-redis.json 174 | oc get secrets system-memcache -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-memcache.json 175 | ---- 176 | 177 | === `ConfigMaps` 178 | [source,bash] 179 | ---- 180 | oc get configmaps system-environment -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-environment.json 181 | oc get configmaps apicast-environment -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > apicast-environment.json 182 | oc get configmaps backend-environment -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > backend-environment.json 183 | oc get configmaps mysql-extra-conf -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > mysql-extra-conf.json 184 | oc get configmaps
mysql-main-conf -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > mysql-main-conf.json 185 | oc get configmaps redis-config -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > redis-config.json 186 | oc get configmaps system -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system.json 187 | ---- 188 | 189 | == Restore Procedures 190 | 191 | === `Template-based deployment` 192 | 193 | Restore secrets before creating deploying template. 194 | 195 | [source,bash] 196 | ---- 197 | oc apply -f system-smtp.json 198 | ---- 199 | 200 | Template parameters will be read from copied secrets and configmaps. 201 | 202 | [source,bash] 203 | ---- 204 | oc new-app --file /opt/amp/templates/amp.yml \ 205 | --param APP_LABEL=$(cat system-environment.json | jq -r '.metadata.labels.app') \ 206 | --param TENANT_NAME=$(cat system-seed.json | jq -r '.data.TENANT_NAME' | base64 -d) \ 207 | --param SYSTEM_DATABASE_USER=$(cat system-database.json | jq -r '.data.DB_USER' | base64 -d) \ 208 | --param SYSTEM_DATABASE_PASSWORD=$(cat system-database.json | jq -r '.data.DB_PASSWORD' | base64 -d) \ 209 | --param SYSTEM_DATABASE=$(cat system-database.json | jq -r '.data.URL' | base64 -d | cut -d '/' -f4) \ 210 | --param SYSTEM_DATABASE_ROOT_PASSWORD=$(cat system-database.json | jq -r '.data.URL' | base64 -d | awk -F '[:@]' '{print $3}') \ 211 | --param WILDCARD_DOMAIN=$(cat system-environment.json | jq -r '.data.THREESCALE_SUPERDOMAIN') \ 212 | --param SYSTEM_BACKEND_USERNAME=$(cat backend-internal-api.json | jq '.data.username' -r | base64 -d) \ 213 | --param SYSTEM_BACKEND_PASSWORD=$(cat backend-internal-api.json | jq '.data.password' -r | base64 -d) \ 214 | --param SYSTEM_BACKEND_SHARED_SECRET=$(cat system-events-hook.json | jq -r '.data.PASSWORD' | base64 -d) \ 215 | --param SYSTEM_APP_SECRET_KEY_BASE=$(cat system-app.json | jq -r '.data.SECRET_KEY_BASE' | base64 -d) \ 216 | --param ADMIN_PASSWORD=$(cat system-seed.json | jq -r '.data.ADMIN_PASSWORD' | base64 -d) \ 217 | --param ADMIN_USERNAME=$(cat system-seed.json | jq -r '.data.ADMIN_USER' | base64 -d) \ 218 | --param ADMIN_EMAIL=$(cat system-seed.json | jq -r '.data.ADMIN_EMAIL' | base64 -d) \ 219 | --param ADMIN_ACCESS_TOKEN=$(cat system-seed.json | jq -r '.data.ADMIN_ACCESS_TOKEN' | base64 -d) \ 220 | --param MASTER_NAME=$(cat system-seed.json | jq -r '.data.MASTER_DOMAIN' | base64 -d) \ 221 | --param MASTER_USER=$(cat system-seed.json | jq -r '.data.MASTER_USER' | base64 -d) \ 222 | --param MASTER_PASSWORD=$(cat system-seed.json | jq -r '.data.MASTER_PASSWORD' | base64 -d) \ 223 | --param MASTER_ACCESS_TOKEN=$(cat system-seed.json | jq -r '.data.MASTER_ACCESS_TOKEN' | base64 -d) \ 224 | --param RECAPTCHA_PUBLIC_KEY="$(cat system-recaptcha.json | jq -r '.data.PUBLIC_KEY' | base64 -d)" \ 225 | --param RECAPTCHA_PRIVATE_KEY="$(cat system-recaptcha.json | jq -r '.data.PRIVATE_KEY' | base64 -d)" \ 226 | --param SYSTEM_REDIS_URL=$(cat system-redis.json | jq -r '.data.URL' | base64 -d) \ 227 | --param SYSTEM_MESSAGE_BUS_REDIS_URL="$(cat system-redis.json | jq -r '.data.MESSAGE_BUS_URL' | base64 -d)" \ 228 | --param SYSTEM_REDIS_NAMESPACE="$(cat system-redis.json | jq -r '.data.NAMESPACE' | base64 -d)" \ 229 | --param SYSTEM_MESSAGE_BUS_REDIS_NAMESPACE="$(cat system-redis.json | jq -r '.data.MESSAGE_BUS_NAMESPACE' | base64 -d)" \ 230 | --param ZYNC_DATABASE_PASSWORD=$(cat zync.json | jq -r '.data.ZYNC_DATABASE_PASSWORD' | base64 -d) \ 231 | --param ZYNC_SECRET_KEY_BASE=$(cat zync.json 
| jq -r '.data.SECRET_KEY_BASE' | base64 -d) \ 232 | --param ZYNC_AUTHENTICATION_TOKEN=$(cat zync.json | jq -r '.data.ZYNC_AUTHENTICATION_TOKEN' | base64 -d) \ 233 | --param APICAST_ACCESS_TOKEN=$(cat system-master-apicast.json | jq -r '.data.ACCESS_TOKEN' | base64 -d) \ 234 | --param APICAST_MANAGEMENT_API=$(cat apicast-environment.json | jq -r '.data.APICAST_MANAGEMENT_API') \ 235 | --param APICAST_OPENSSL_VERIFY=$(cat apicast-environment.json | jq -r '.data.OPENSSL_VERIFY') \ 236 | --param APICAST_RESPONSE_CODES=$(cat apicast-environment.json | jq -r '.data.APICAST_RESPONSE_CODES') \ 237 | --param APICAST_REGISTRY_URL=$(cat system-environment.json | jq -r '.data.APICAST_REGISTRY_URL') 238 | ---- 239 | 240 | === `Operator-based deployments` 241 | 242 | Restore secrets before creating the APIManager resource. 243 | 244 | [source,bash] 245 | ---- 246 | oc apply -f system-smtp.json 247 | oc apply -f system-seed.json 248 | oc apply -f system-database.json 249 | oc apply -f backend-internal-api.json 250 | oc apply -f system-events-hook.json 251 | oc apply -f system-app.json 252 | oc apply -f system-recaptcha.json 253 | oc apply -f system-redis.json 254 | oc apply -f zync.json 255 | oc apply -f system-master-apicast.json 256 | oc apply -f backend-listener.json 257 | oc apply -f backend-redis.json 258 | oc apply -f system-memcache.json 259 | ---- 260 | 261 | Restore configmaps before creating the APIManager resource. 262 | 263 | [source,bash] 264 | ---- 265 | oc apply -f system-environment.json 266 | oc apply -f apicast-environment.json 267 | oc apply -f backend-environment.json 268 | oc apply -f mysql-extra-conf.json 269 | oc apply -f mysql-main-conf.json 270 | oc apply -f redis-config.json 271 | oc apply -f system.json 272 | ---- 273 | 274 | === `system-mysql` 275 | 276 | Copy the MySQL dump to the system-mysql pod 277 | 278 | [source,bash] 279 | ---- 280 | oc cp ./system-mysql-backup.gz $(oc get pods -l 'deploymentConfig=system-mysql' -o json | jq '.items[0].metadata.name' -r):/var/lib/mysql 281 | ---- 282 | 283 | Decompress the Backup File 284 | 285 | [source,bash] 286 | ---- 287 | oc rsh $(oc get pods -l 'deploymentConfig=system-mysql' -o json | jq -r '.items[0].metadata.name') bash -c 'gzip -d ${HOME}/system-mysql-backup.gz' 288 | ---- 289 | 290 | Restore the MySQL DB Backup file 291 | 292 | [source,bash] 293 | ---- 294 | oc rsh $(oc get pods -l 'deploymentConfig=system-mysql' -o json | jq -r '.items[0].metadata.name') bash -c 'export MYSQL_PWD=${MYSQL_ROOT_PASSWORD}; mysql -hsystem-mysql -uroot system < ${HOME}/system-mysql-backup' 295 | ---- 296 | 297 | === `system-database` 298 | 299 | Copy the PostgreSQL Database dump to the system-database pod 300 | 301 | [source,bash] 302 | ---- 303 | oc cp ./system-postgres-backup.gz $(oc get pods -l 'deploymentConfig=system-database' -o json | jq '.items[0].metadata.name' -r):/var/lib/pgsql/ 304 | ---- 305 | 306 | Decompress the Backup File 307 | 308 | [source,bash] 309 | ---- 310 | oc rsh $(oc get pods -l 'deploymentConfig=system-database' -o json | jq -r '.items[0].metadata.name') bash -c 'gzip -d ${HOME}/system-postgres-backup.gz' 311 | ---- 312 | 313 | Restore the PostgreSQL DB Backup file (note that the dump was decompressed in the previous step, so the file no longer has the `.gz` extension) 314 | 315 | [source,bash] 316 | ---- 317 | oc rsh $(oc get pods -l 'deploymentConfig=system-database' -o json | jq -r '.items[0].metadata.name') bash -c 'psql -f ${HOME}/system-postgres-backup' 318 | ---- 319 | 320 | === `system-storage` 321 | 322 | Restore the archived files from a different location.
323 | 324 | [source,bash] 325 | ---- 326 | oc rsync ./local/dir/system/ $(oc get pods -l 'deploymentConfig=system-app' -o json | jq '.items[0].metadata.name' -r):/opt/system/public/system --delete=true 327 | ---- 328 | 329 | 330 | === `zync-database` 331 | 332 | Copy the Zync Database dump to the zync-database pod 333 | 334 | [source,bash] 335 | ---- 336 | oc cp ./zync-database-backup.gz $(oc get pods -l 'deploymentConfig=zync-database' -o json | jq '.items[0].metadata.name' -r):/var/lib/pgsql/ 337 | ---- 338 | 339 | Decompress the Backup File 340 | 341 | [source,bash] 342 | ---- 343 | oc rsh $(oc get pods -l 'deploymentConfig=zync-database' -o json | jq -r '.items[0].metadata.name') bash -c 'gzip -d ${HOME}/zync-database-backup.gz' 344 | ---- 345 | 346 | Restore the PostgreSQL DB Backup file 347 | 348 | [source,bash] 349 | ---- 350 | oc rsh $(oc get pods -l 'deploymentConfig=zync-database' -o json | jq -r '.items[0].metadata.name') bash -c 'psql -f ${HOME}/zync-database-backup' 351 | ---- 352 | 353 | === `backend-redis` 354 | 355 | * After restoring `backend-redis`, a sync of the Config information from `System` should be forced, to ensure the 356 | information in `Backend` is consistent with that in `System` (the source of truth). 357 | 358 | Edit the `redis-config` configmap 359 | 360 | [source,bash] 361 | ---- 362 | oc edit configmap redis-config 363 | ---- 364 | 365 | Comment the `save` commands in the `redis-config` configmap 366 | 367 | [source,bash] 368 | ---- 369 | #save 900 1 370 | #save 300 10 371 | #save 60 10000 372 | ---- 373 | 374 | Set `appendonly` to no in the `redis-config` configmap 375 | 376 | [source,bash] 377 | ---- 378 | appendonly no 379 | ---- 380 | 381 | Re-deploy `backend-redis` to load the new configurations 382 | 383 | [source,bash] 384 | ---- 385 | oc rollout latest dc/backend-redis 386 | ---- 387 | 388 | Rename the `dump.rdb` file 389 | 390 | [source,bash] 391 | ---- 392 | oc rsh $(oc get pods -l 'deploymentConfig=backend-redis' -o json | jq '.items[0].metadata.name' -r) bash -c 'mv ${HOME}/data/dump.rdb ${HOME}/data/dump.rdb-old' 393 | ---- 394 | 395 | Rename the `appendonly.aof` file 396 | 397 | [source,bash] 398 | ---- 399 | oc rsh $(oc get pods -l 'deploymentConfig=backend-redis' -o json | jq '.items[0].metadata.name' -r) bash -c 'mv ${HOME}/data/appendonly.aof ${HOME}/data/appendonly.aof-old' 400 | ---- 401 | 402 | Move the backup file to the Pod 403 | 404 | [source,bash] 405 | ---- 406 | oc cp ./backend-redis-dump.rdb $(oc get pods -l 'deploymentConfig=backend-redis' -o json | jq '.items[0].metadata.name' -r):/var/lib/redis/data/dump.rdb 407 | ---- 408 | 409 | Re-deploy `backend-redis` to load the backup 410 | 411 | [source,bash] 412 | ---- 413 | oc rollout latest dc/backend-redis 414 | ---- 415 | 416 | Edit the `redis-config` configmap 417 | 418 | [source,bash] 419 | ---- 420 | oc edit configmap redis-config 421 | ---- 422 | 423 | Uncomment the `save` commands in the `redis-config` configmap 424 | 425 | [source,bash] 426 | ---- 427 | save 900 1 428 | save 300 10 429 | save 60 10000 430 | ---- 431 | 432 | Set `appendonly` to yes in the `redis-config` configmap 433 | 434 | [source,bash] 435 | ---- 436 | appendonly yes 437 | ---- 438 | 439 | Re-deploy `backend-redis` to reload the default configurations 440 | 441 | [source,bash] 442 | ---- 443 | oc rollout latest dc/backend-redis 444 | ---- 445 | 446 | === `system-redis` 447 | 448 | Edit the `redis-config` configmap 449 | 450 | [source,bash] 451 | ---- 452 | oc edit configmap redis-config 453 | ---- 454 | 455 |
Comment the `save` commands in the `redis-config` configmap 456 | 457 | [source,bash] 458 | ---- 459 | #save 900 1 460 | #save 300 10 461 | #save 60 10000 462 | ---- 463 | 464 | Set `appendonly` to no in the `redis-config` configmap 465 | 466 | [source,bash] 467 | ---- 468 | appendonly no 469 | ---- 470 | 471 | Re-deploy `system-redis` to load the new configurations 472 | 473 | [source,bash] 474 | ---- 475 | oc rollout latest dc/system-redis 476 | ---- 477 | 478 | Rename the `dump.rdb` file 479 | 480 | [source,bash] 481 | ---- 482 | oc rsh $(oc get pods -l 'deploymentConfig=system-redis' -o json | jq '.items[0].metadata.name' -r) bash -c 'mv ${HOME}/data/dump.rdb ${HOME}/data/dump.rdb-old' 483 | ---- 484 | 485 | Rename the `appendonly.aof` file 486 | 487 | [source,bash] 488 | ---- 489 | oc rsh $(oc get pods -l 'deploymentConfig=system-redis' -o json | jq '.items[0].metadata.name' -r) bash -c 'mv ${HOME}/data/appendonly.aof ${HOME}/data/appendonly.aof-old' 490 | ---- 491 | 492 | Move the backup file to the Pod 493 | 494 | [source,bash] 495 | ---- 496 | oc cp ./system-redis-dump.rdb $(oc get pods -l 'deploymentConfig=system-redis' -o json | jq '.items[0].metadata.name' -r):/var/lib/redis/data/dump.rdb 497 | ---- 498 | 499 | Re-deploy `system-redis` to load the backup 500 | 501 | [source,bash] 502 | ---- 503 | oc rollout latest dc/system-redis 504 | ---- 505 | 506 | Edit the `redis-config` configmap 507 | 508 | [source,bash] 509 | ---- 510 | oc edit configmap redis-config 511 | ---- 512 | 513 | Uncomment the `save` commands in the `redis-config` configmap 514 | 515 | [source,bash] 516 | ---- 517 | save 900 1 518 | save 300 10 519 | save 60 10000 520 | ---- 521 | 522 | Set `appendonly` to yes in the `redis-config` configmap 523 | 524 | [source,bash] 525 | ---- 526 | appendonly yes 527 | ---- 528 | 529 | Re-deploy `system-redis` to reload the default configurations 530 | 531 | [source,bash] 532 | ---- 533 | oc rollout latest dc/system-redis 534 | ---- 535 | 536 | === `backend-worker` 537 | 538 | [source,bash] 539 | ---- 540 | oc rollout latest dc/backend-worker 541 | ---- 542 | 543 | === `system-app` 544 | 545 | [source,bash] 546 | ---- 547 | oc rollout latest dc/system-app 548 | ---- 549 | 550 | === `system-sidekiq` 551 | 552 | Resync domains 553 | 554 | [source,bash] 555 | ---- 556 | oc exec -t $(oc get pods -l 'deploymentConfig=system-sidekiq' -o json | jq '.items[0].metadata.name' -r) -- bash -c "bundle exec rake zync:resync:domains" 557 | ---- 558 | 559 | == Open Issues 560 | 561 | * What about System services and sphinx (index)? 562 | * How to handle backup/restore of job queues (of different types). They can be lost or maybe processed twice! 563 | -------------------------------------------------------------------------------- /docs/components.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = Components 5 | 6 | This section explains what the responsibilities of every 3scale component are and what the impact is when each of them is not available.
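To see at a glance which of the components described below are running in your installation, you can list the Pods together with the component they belong to. This is just a quick check, relying on the `deploymentConfig` label used throughout this repository's examples:

[source,bash]
----
# Adds a column showing the DeploymentConfig (component) of each 3scale Pod
oc get pods -L deploymentConfig
----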
7 | 8 | image::../images/3scale-pods-by-type.png[3scale pods by Type] 9 | 10 | toc::[] 11 | 12 | == apicast-production 13 | 14 | === Responsibilities 15 | * 3scale API Gateway 16 | * APIcast enforces traffic policies, either returning an error or proxying the API call to the customer's API backend 17 | * APIcast fetches its operating configuration from the `system-app` component when the gateway starts 18 | * Each incoming/managed API call produces a sync/auth request from the API Gateway to `backend-listener` 19 | * Adds a small amount of latency to each managed API call, since the API Gateway is an extra hop in the traffic flow 20 | 21 | === Impact when down 22 | * Clients will not be able to reach the provider's API using that APIcast instance 23 | 24 | == apicast-staging 25 | 26 | === Responsibilities 27 | * Same as `apicast-production`, but affecting staging traffic 28 | 29 | === Impact when down 30 | * Same as `apicast-production`, but affecting staging traffic 31 | 32 | == backend-listener 33 | 34 | === Responsibilities 35 | * This is the most critical component, responsible for authorizing and rate-limiting requests 36 | 37 | === Impact when down 38 | * APIcast will not be able to tell whether a request should be authorized or not, and simply denies everything 39 | * APIs configured using 3scale are down 40 | * This service disruption can be mitigated by using the APIcast auth caching policy. There are some trade-offs when using this policy (service availability upon possible outages vs. the service being available but possibly using outdated configuration), so make sure you understand the implications before using it 41 | 42 | == backend-worker 43 | 44 | === Responsibilities 45 | * Processes the background jobs created by `backend-listener` 46 | * Runs enqueued jobs, mainly related to traffic reporting 47 | 48 | === Impact when down 49 | * Reported metrics are not applied, so the rate-limiting functionality loses accuracy because those pending reports are not taken into account 50 | * Statistics will not be up to date, and the alerts and errors shown in the admin portal will not be triggered 51 | * Because accounting will not work, authorizations will not be correct 52 | * Jobs will not be processed and will start accumulating in `backend-redis`, which could lead to database out of memory related problems 53 | 54 | == backend-cron 55 | 56 | === Responsibilities 57 | * This is a simple task that acts as a cron-like scheduler to retry failed jobs 58 | * When a `backend-worker` job fails, it is pushed to a "failed jobs" queue so that it can be retried later. Jobs can fail, for example, when there's a Redis timeout 59 | * It is also responsible for deleting the stats of services that have been removed. This is run every 24h 60 | 61 | === Impact when down 62 | * Failed jobs will not be rescheduled 63 | * If it crashes in the middle of the delete process, it will just continue the next time it runs 64 | * If the 3scale installation is working correctly, the failed jobs queue will be empty at almost all times, so `backend-cron` being down is not critical 65 | 66 | == backend-redis 67 | 68 | === Responsibilities 69 | * It is the database used by `backend-listener` and `backend-worker` 70 | * It is used both for data persistence (metrics...) and to store job queues 71 | 72 | === Impact when down 73 | * `backend-listener` and `backend-worker` cannot function without access to the storage, so both components can be considered as down.
Refer to the sections on `backend-listener` and `backend-worker` to review impact when these components are down 74 | 75 | == system-app 76 | 77 | === Responsibilities 78 | * Developer and Admin Portal UI/API 79 | * 3scale APIs (Accounts, Analytics) 80 | 81 | === Impact when down 82 | * Developer and Admin Portal UI/API will not be available 83 | * 3scale APIs (Accounts, Analytics) will not be available 84 | * `apicast` will not be able to retrieve the gateway configuration, so new `apicast` deployments will not work 85 | * Already running `apicast` Pods will continue serving traffic using the latest retrieved configuration (cached) 86 | 87 | == system-sidekiq 88 | 89 | === Responsibilities 90 | * It is the job manager used by `system-app` to process jobs in the background asynchronously 91 | 92 | === Impact when down 93 | * Emails are not sent 94 | * Communication with `backend-listener` breaks: changes in Admin Portal will not propagate to Backend 95 | * Backend alerts will not be triggered 96 | * Webhooks will not be triggered 97 | * Zync will not receive any updates 98 | * Background jobs will not be processed and will start accumulating in `system-redis`, which could lead to database out of memory related problems 99 | 100 | == system-mysql or system-postgresql 101 | 102 | === Responsibilities 103 | * It is the main relational database used by `system-app` 104 | 105 | === Impact when down 106 | * Both `system-app` and `system-sidekiq` components can be considered down if access to the relational database is lost. Refer to the sections on `system-app` and `system-sidekiq` to review impact when these components are down 107 | 108 | == system-redis 109 | 110 | === Responsibilities 111 | * It is the database used by `system-app` to enqueue the jobs consumed by `system-sidekiq` 112 | 113 | === Impact when down 114 | * `system-app` and `system-sidekiq` cannot function without access to the storage, so both components can be considered as down. Refer to the sections on `system-app` and `system-sidekiq` to review impact when these components are down 115 | 116 | == system-memcache 117 | 118 | === Responsibilities 119 | * `system-memcached` is an ephemeral cache of values used to speed-up the performance of the `system-app` web application 120 | 121 | === Impact when down 122 | * `system-app` will run slightly slower (UI page loading times will be worse) while the cache is not accessible. Cache will be rebuilt once the memcached instance is back online 123 | 124 | == system-sphinx 125 | 126 | === Responsibilities 127 | * Full-text search for `system-app` 128 | 129 | === Impact when down 130 | * The search functionality on the `system-app` Admin/Developer Portal (accounts and proxy rules search bars, templates, forum searches...) 
stops working 131 | 132 | == zync 133 | 134 | === Responsibilities 135 | 136 | * Receives events from `system-sidekiq` 137 | * Enqueues those events as new jobs to be processed in the background by `zync-que` 138 | * Those enqueued jobs can be: 139 | - Creation/Update of OpenShift Routes (Admin/Developer portals of each tenant) 140 | - Creation/Update of OpenShift Routes (`apicast-staging` or `apicast-production` domains of each API) 141 | - Synchronization of information with configured 3rd party IDPs 142 | 143 | === Impact when down 144 | * Synchronization of OpenShift Routes for `apicast-staging` and `apicast-production` will not work 145 | * Synchronization of OpenShift Routes for the Admin Portal and the Developer Portal domains will not work 146 | * Synchronization with 3rd party IDPs will not work 147 | * `system-sidekiq` will retry the failed requests for some time 148 | 149 | == zync-que 150 | 151 | === Responsibilities 152 | * Processes the enqueued jobs created by `zync` 153 | * Those jobs can be: 154 | - Creation/Update of OpenShift Routes (Admin/Developer portals of each tenant) 155 | - Creation/Update of OpenShift Routes (`apicast-staging` or `apicast-production` domains of each API) 156 | - Synchronization of information with configured 3rd party IDPs 157 | 158 | === Impact when down 159 | * Synchronization of OpenShift Routes for `apicast-staging` and `apicast-production` will not work 160 | * Synchronization of OpenShift Routes for the Admin Portal and the Developer Portal domains will not work 161 | * Synchronization with 3rd party IDPs will not work 162 | * Jobs will not be processed and will start accumulating in `zync-database`, which could lead to database out of disk space related problems 163 | 164 | == zync-database 165 | 166 | === Responsibilities 167 | * It is the database used by `zync` 168 | * It contains job queues and also some data synchronized from `system-app` 169 | 170 | === Impact when down 171 | * `zync` will not be able to enqueue jobs and `zync-que` will not be able to consume them, so both components can be considered down when database access is lost. Refer to the sections on `zync` and `zync-que` to review impact when these components are down -------------------------------------------------------------------------------- /docs/ha.adoc: -------------------------------------------------------------------------------- 1 | = High Availability 2 | 3 | * link:ha_3scale.adoc[3scale HA in different scenarios] 4 | * link:ha_dbs.adoc[3scale HA with external databases] 5 | -------------------------------------------------------------------------------- /docs/ha_3scale.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = 3scale HA in different scenarios 5 | 6 | toc::[] 7 | 8 | == Highly-available 3scale-operator-based installation 9 | 10 | This section explains how to install 3scale using the link:https://github.com/3scale/3scale-operator[3scale Operator] and how to configure it for a highly-available deployment. If you haven't done so yet, you need to install the 3scale-operator by following link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#installing-3scale[this guide] before proceeding.
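Before going further, you can verify that the operator is in place by checking that its APIManager CustomResourceDefinition exists in the cluster (the CRD name below is assumed from the operator's `apps.3scale.net` API group):
```bash
# Fails if the 3scale operator CRDs have not been installed yet
oc get crd apimanagers.apps.3scale.net
```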
11 | 12 | As usual, to install 3scale using the 3scale-operator, an APIManager custom resource is used (see a simple example link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#basic-installation[here]), but several changes need to be made to ensure a highly-available installation. The following sections delve into each of the changes required. 13 | 14 | === External databases 15 | 16 | The 3scale-operator installs all the databases required by 3scale within the OpenShift cluster by default. For a highly-available installation, however, the databases need to be external to the cluster, so you must deploy them outside the cluster yourself and make the APIManager custom resource aware of them: 17 | 18 | * Instruct the operator not to deploy the critical databases of the 3scale installation. Check the link:ha_dbs.adoc[3scale HA external databases documentation section] for this. 19 | * Notice that the operator does not currently offer an option to deploy `system-memcached` or `system-sphinx` externally. Neither of these components has a critical impact on 3scale when it is down, so the operator does not currently support a highly-available deployment option for them. 20 | 21 | === Scaling the number of replicas 22 | 23 | The number of replicas of each component needs to be at least 2 to support Pod failures. For installations with nodes running in different availability zones, the minimum replica count should match the number of availability zones available. This configuration, used in combination with the node-based Pod anti-affinity rules described in the next section, will ensure that at least one Pod runs in each availability zone. The 3scale components that support more than one replica are: 24 | 25 | - apicast-production 26 | - apicast-staging 27 | - backend-listener 28 | - backend-worker 29 | - system-app 30 | - system-sidekiq 31 | - zync 32 | - zync-que 33 | 34 | Depending on your cluster size, throughput, and latency requirements, you might want to deploy more replicas, at least for the components that are called on each request to the APIs configured in 3scale, that is, `apicast-production`, `backend-listener`, and `backend-worker`. 35 | 36 | Replicas can be configured using the `replicas` attribute in the APIManager spec. 37 | 38 | === Pod Affinity 39 | 40 | Enabling Pod affinities in the operator for every component ensures that Pod replicas from each DeploymentConfig are distributed across different nodes of the cluster and evenly balanced across different availability zones. You can read more about affinities in the link:https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity[Kubernetes docs]. 41 | 42 | To enable affinities you need to follow link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#setting-custom-affinity-and-tolerations[these examples].
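After applying them (or the blocks shown next), you can check that the operator has propagated the settings into the generated DeploymentConfigs. A couple of read-only checks, assuming the default DeploymentConfig names used elsewhere in this repository:
```bash
# Inspect the affinity rendered on the apicast-production DeploymentConfig
oc get dc apicast-production -o jsonpath='{.spec.template.spec.affinity}{"\n"}'
# Inspect its current replica count
oc get dc apicast-production -o jsonpath='{.spec.replicas}{"\n"}'
```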
43 | 44 | The following affinity block, which can be added for example to `apicastProductionSpec` (but also to any other non-database DeploymentConfig spec), adds a *soft podAntiAffinity* configuration using `preferredDuringSchedulingIgnoredDuringExecution` (the scheduler will try to run this set of `apicast-production` Pods on different hosts and in different availability zones, but if that is not possible, it will still allow them to run elsewhere): 45 | ```yaml 46 | affinity: 47 | podAntiAffinity: 48 | preferredDuringSchedulingIgnoredDuringExecution: 49 | - weight: 100 50 | podAffinityTerm: 51 | labelSelector: 52 | matchLabels: 53 | deploymentConfig: apicast-production 54 | topologyKey: kubernetes.io/hostname 55 | - weight: 99 56 | podAffinityTerm: 57 | labelSelector: 58 | matchLabels: 59 | deploymentConfig: apicast-production 60 | topologyKey: topology.kubernetes.io/zone 61 | ``` 62 | In the following example, unlike in the previous one, a *hard podAntiAffinity* configuration is set using `requiredDuringSchedulingIgnoredDuringExecution` (the conditions must be met for a Pod to be scheduled onto a node; otherwise the Pod will not be scheduled at all, which can be risky in certain situations, for example on a cluster with little free capacity where new Pods could not be scheduled). Note that required anti-affinity rules are plain terms, without the `weight` and `podAffinityTerm` wrapper used in the preferred form: 63 | ```yaml 64 | affinity: 65 | podAntiAffinity: 66 | requiredDuringSchedulingIgnoredDuringExecution: 67 | - labelSelector: 68 | matchLabels: 69 | deploymentConfig: apicast-production 70 | topologyKey: kubernetes.io/hostname 71 | - labelSelector: 72 | matchLabels: 73 | deploymentConfig: apicast-production 74 | topologyKey: topology.kubernetes.io/zone 75 | ``` 80 | 81 | === Pod Disruption Budgets 82 | 83 | Enabling Pod disruption budgets (PDBs) ensures that a minimum number of Pods will be available for each component. You can read more about PDBs in the link:https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets[Kubernetes docs]. 84 | 85 | Pod disruption budgets configuration is documented in link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#enabling-pod-disruption-budgets[the user guide]. 86 | 87 | ''' 88 | 89 | The information provided above enables a basic highly-available installation, but you might need to take some extra steps depending on the environment where 3scale is running. The most common scenarios are described next. 90 | 91 | == Single cluster in single availability zone 92 | 93 | With this setup, 3scale will continue working if a node fails, but it will stop working if the availability zone fails. 94 | 95 | Pod affinities do apply (but only using one rule, with `kubernetes.io/hostname`), because there is a single availability zone. 96 | 97 | Pod disruption budgets do apply. 98 | 99 | == Single cluster in multiple availability zones 100 | 101 | The same setup as before, but now nodes are distributed over hosts from different availability zones (physically separated locations or data centers within a cloud region that are tolerant to local failures). 102 | 103 | A minimum of 3 availability zones is recommended to have high availability. 104 | 105 | With this setup, 3scale will continue working even if a node or a whole availability zone fails. 106 | 107 | Pod affinities do apply (with both rules, `kubernetes.io/hostname` and `topology.kubernetes.io/zone`). 108 | 109 | Pod disruption budgets do apply.
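Once replicas, affinities and PDBs are in place, it is worth checking that the Pods really did land on different nodes and availability zones. These read-only commands rely only on labels already used in this document:
```bash
# Show which node each apicast-production replica is running on
oc get pods -l deploymentConfig=apicast-production -o wide

# Show the availability zone label of each node
oc get nodes -L topology.kubernetes.io/zone
```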
110 | 111 | == Multiple clusters in multiple availability zones 112 | 113 | There are several options to install 3scale across several OpenShift clusters and availability zones. 114 | 115 | In the multiple-cluster installation options, clusters work in an *active/passive* configuration, with the *failover* procedure involving a few *manual* steps. Note that there will be service disruption while a human operator performs the required steps to bring the *passive* cluster into *active* mode in case of failure. 116 | 117 | This documentation focuses on deployment using Amazon Web Services (AWS), but the same patterns and configuration options should apply to other public cloud vendors as long as the provider's managed database services offer the required features (support for multiple availability zones, multiple regions, etc.). 118 | 119 | === Common configurations for multiple-cluster installations 120 | 121 | The following configuration items need to be used in any 3scale installation that involves several OpenShift clusters: 122 | 123 | * Use Pod affinities, with both `kubernetes.io/hostname` and `topology.kubernetes.io/zone` rules, in the APIManager custom resource. 124 | * Use Pod disruption budgets in the APIManager custom resource. 125 | * A 3scale installation over multiple clusters *must use the same shared `wildcardDomain`* attribute in the APIManager custom resource spec. Using a different domain for each cluster is not allowed in this installation mode, as the information stored in the database would be conflicting. 126 | * The secrets containing credentials such as tokens and passwords have to be *manually* deployed in all clusters with the same values. By default, the 3scale operator creates them with secure random values on every cluster; in this case, however, you need to have the same credentials in all clusters. The list of secrets and how to configure them can be found in the link:https://github.com/3scale/3scale-operator/blob/master/doc/apimanager-reference.md#apimanager-secrets[3scale Operator docs]. This is the list of secrets that should be mirrored in all clusters: 127 | - backend-internal-api 128 | - system-app 129 | - system-events-hook 130 | - system-master-apicast 131 | - system-seed 132 | * The secrets containing database connection strings (`backend-redis`, `system-database`, `system-redis`, `zync`) have to be *manually* deployed as explained in link:ha_dbs.adoc[external databases]: 133 | - If databases are shared among clusters, they must use the same values on all clusters 134 | - On the other hand, if each cluster has its own databases, they must use different values on each cluster 135 | 136 | === Active-Passive clusters in the same region with shared databases 137 | 138 | image::../images/3scale-ha-active-passive-same-region-shared-databases.png[Active Passive same region shared databases] 139 | 140 | This setup consists of having 2 clusters (or more) in the *same region* and deploying 3scale in *active-passive* mode. One of the clusters will be the *active* one (receiving traffic), whereas the others will be in standby mode without receiving traffic (*passive*), but prepared to assume the *active* role in case there is a failure in the *active* cluster. 141 | 142 | In this installation option, given that only a single region is used, the databases will be shared among all clusters. 143 | 144 | ==== Prerequisites and installation (shared databases) 145 | 146 | .
136 | === Active-Passive clusters on the same region with shared databases
137 |
138 | image::../images/3scale-ha-active-passive-same-region-shared-databases.png[Active Passive same region shared databases.png]
139 |
140 | This setup consists of having 2 clusters (or more) in the *same region* and deploying 3scale in *active-passive* mode. One of the clusters will be the *active* one (receiving traffic), whereas the others will be in standby mode without receiving traffic (*passive*), but prepared to assume the *active* role in case there is a failure in the *active* cluster.
141 |
142 | In this installation option, given that only a single region is used, databases will be shared among all clusters.
143 |
144 | ==== Prerequisites and installation shared databases
145 |
146 | . Create 2 (or more) OpenShift clusters in the *same region* using different availability zones. A minimum of 3 zones is recommended.
147 | . Create all required AWS ElastiCache instances with Multi-AZ enabled:
148 | .. One AWS EC for the *Backend* redis database
149 | .. One AWS EC for the *System* redis database
150 | . Create all required AWS RDS instances with Multi-AZ enabled:
151 | .. One AWS RDS for the *System* database
152 | .. One AWS RDS for the *Zync* database (since 3scale version v2.10)
153 | . Configure an AWS S3 bucket for the *System* assets
154 | . Create a custom domain in AWS Route53 (or your DNS provider) and point it to the OpenShift Router of the *active* cluster (it needs to coincide with the `wildcardDomain` attribute of the APIManager custom resource)
155 | . Install 3scale in the *active* cluster (pointing at the external, shared databases) and then in the *passive* cluster, using an identical APIManager custom resource. After all the Pods are running, change the APIManager of the *passive* cluster to deploy 0 replicas for all the apicast, backend, system and zync Pods. You want 0 replicas so the *passive* cluster does not consume jobs from the *active* databases. You cannot tell the operator to deploy 0 replicas directly, because the deployment would fail due to some Pod dependencies that cannot be met (some Pods check that others are running). That is why, as a workaround, you first deploy normally and then scale down to 0 replicas. This is how it is specified in the APIManager spec:
156 | ```yaml
157 | spec:
158 |   apicast:
159 |     stagingSpec:
160 |       replicas: 0
161 |     productionSpec:
162 |       replicas: 0
163 |   backend:
164 |     listenerSpec:
165 |       replicas: 0
166 |     workerSpec:
167 |       replicas: 0
168 |     cronSpec:
169 |       replicas: 0
170 |   zync:
171 |     appSpec:
172 |       replicas: 0
173 |     queSpec:
174 |       replicas: 0
175 |   system:
176 |     appSpec:
177 |       replicas: 0
178 |     sidekiqSpec:
179 |       replicas: 0
180 | ```
181 |
182 | ==== Manual Failover shared databases [[manual-failover-shared-databases]]
183 |
184 | . In the *active* cluster, scale down the replicas of the *Backend*, *System*, *Zync* and *Apicast* Pods to 0. This cluster becomes the new *passive* cluster, and scaling it down ensures it will not consume jobs from the *active* databases (*downtime starts here*)
185 | . In the *passive* cluster, edit the APIManager to scale up the replicas of the *Backend*, *System*, *Zync* and *Apicast* Pods that were set to 0, so it becomes the new *active* cluster
186 | . In the newly *active* cluster (ex *passive*), recreate the OpenShift Routes created by *Zync*. To do that, run `bundle exec rake zync:resync:domains` from the `system-master` container of the `system-app` Pod. In 3scale v2.9 this command sometimes fails, so retry it until all the Routes are generated (see the sketch after these steps)
187 | . Point the custom domain created in AWS Route53 to the OpenShift Router of the new *active* cluster
188 | . From this moment on, the old *passive* cluster starts receiving traffic and becomes the new *active* one
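For reference, steps 1 to 3 can be driven with `oc` alone. A minimal sketch, where the APIManager name (`example-apimanager`), the namespace (`3scale`), the kubeconfig context names (`active`/`passive`) and the `deploymentConfig=system-app` label are assumptions to adapt to your installation:
```bash
# 1) In the currently active cluster, scale every non-database 3scale workload to 0
#    replicas by patching the APIManager (same replicas: 0 spec as in the installation step).
oc --context active -n 3scale patch apimanager example-apimanager --type merge -p '{
  "spec": {
    "apicast": {"stagingSpec": {"replicas": 0}, "productionSpec": {"replicas": 0}},
    "backend": {"listenerSpec": {"replicas": 0}, "workerSpec": {"replicas": 0}, "cronSpec": {"replicas": 0}},
    "zync":    {"appSpec": {"replicas": 0}, "queSpec": {"replicas": 0}},
    "system":  {"appSpec": {"replicas": 0}, "sidekiqSpec": {"replicas": 0}}
  }}'

# 2) In the passive cluster (about to become active), restore the desired replica counts
#    (1 is just an example value).
oc --context passive -n 3scale patch apimanager example-apimanager --type merge -p '{
  "spec": {
    "apicast": {"stagingSpec": {"replicas": 1}, "productionSpec": {"replicas": 1}},
    "backend": {"listenerSpec": {"replicas": 1}, "workerSpec": {"replicas": 1}, "cronSpec": {"replicas": 1}},
    "zync":    {"appSpec": {"replicas": 1}, "queSpec": {"replicas": 1}},
    "system":  {"appSpec": {"replicas": 1}, "sidekiqSpec": {"replicas": 1}}
  }}'

# 3) In the newly active cluster, regenerate the Zync-managed OpenShift Routes
#    (retry if the rake task fails).
POD=$(oc --context passive -n 3scale get pods -l deploymentConfig=system-app -o name | head -n 1)
oc --context passive -n 3scale exec "$POD" -c system-master -- bundle exec rake zync:resync:domains
```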
189 |
190 | === Active-Passive clusters on different regions with synced databases
191 |
192 | image::../images/3scale-ha-active-passive-different-region-synced-databases.png[Active Passive different region synced databases.png]
193 |
194 | This setup consists of having two clusters (or more) in *different regions* and deploying 3scale in *active-passive* mode. One of the clusters will be the *active* one (receiving traffic), whereas the others will be in standby mode without receiving traffic (*passive*), but prepared to assume the *active* role in case there is a failure in the *active* cluster.
195 |
196 | In this setup, to ensure good database access latency, each cluster will have its own database instances. The databases from the *active* 3scale installation will be replicated to the read-replica databases of the *passive* 3scale installations, so the data is available and up to date in all regions for a possible failover.
197 |
198 | ==== Prerequisites and installation synced databases
199 |
200 | . Create 2 (or more) OpenShift clusters in *different regions* using different availability zones. A minimum of 3 zones is recommended.
201 | . Create all required AWS ElastiCache instances with Multi-AZ enabled *in every region*:
202 | .. Two AWS EC for the *Backend* redis database (one per region)
203 | .. Two AWS EC for the *System* redis database (one per region)
204 | .. In this case, use the link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Redis-Global-Datastore.html[cross-region replication with the Global Datastore feature enabled], so the databases in the *passive* regions will be read-replicas of the master databases in the *active* region
205 | . Create all required AWS RDS instances with Multi-AZ enabled *in every region*:
206 | .. Two AWS RDS for the *System* database (one per region)
207 | .. Two AWS RDS for the *Zync* database (one per region, since 3scale version v2.10)
208 | .. In this case, use link:https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.XRgn.html[cross-region replication], so the databases in the *passive* regions will be read-replicas of the master databases in the *active* region
209 | . Configure an AWS S3 bucket for the *System* assets *in every region*, in this case using link:https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html#crr-scenario[cross-region replication]
210 | . As in the previous scenario, create a custom domain in AWS Route53 (or your DNS provider) and point it to the OpenShift Router of the *active* cluster (it needs to coincide with the `wildcardDomain` attribute of the APIManager custom resource)
211 | . As in the previous scenario, install 3scale in the *active* cluster and then in the *passive* cluster, using an identical APIManager custom resource. After all the Pods are running, change the APIManager of the *passive* cluster to deploy 0 replicas for all the apicast, backend, system and zync Pods. You want 0 replicas so the *passive* cluster does not consume jobs from its databases. You cannot tell the operator to deploy 0 replicas directly, because the deployment would fail due to some Pod dependencies that cannot be met (some Pods check that others are running). That is why, as a workaround, you first deploy normally and then scale down to 0 replicas
212 |
213 | ==== Manual Failover synced databases
214 |
215 | . Execute steps 1, 2 and 3 from <<manual-failover-shared-databases>>
216 | . Every cluster has its own independent databases (read-replicas of the master databases in the *active* region), so it is required to *manually* execute a failover on every database to promote a new master in the *passive* region, which now becomes the *active* region (see the sketch after these steps)
217 | . The database failovers to execute manually are (order does not matter):
218 | .. AWS RDS: *System* and *Zync*
219 | .. AWS ElastiCache: *Backend* and *System*
220 | . Execute steps 4 and 5 from <<manual-failover-shared-databases>>
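As an illustration of step 2, the RDS read replicas can be promoted with the AWS CLI. This is only a sketch: the instance identifiers and the region are hypothetical, and the ElastiCache Global Datastore failover has its own procedure (console or the equivalent `aws elasticache` commands) that is not shown here:
```bash
# Promote the cross-region read replicas in the region that is becoming active.
# Instance identifiers and region are hypothetical placeholders.
aws rds promote-read-replica --region eu-west-1 --db-instance-identifier threescale-system-replica
aws rds promote-read-replica --region eu-west-1 --db-instance-identifier threescale-zync-replica

# Wait until both instances are writable ("available") before scaling up the passive
# cluster, whose database secrets already point at these regional instances.
aws rds wait db-instance-available --region eu-west-1 --db-instance-identifier threescale-system-replica
aws rds wait db-instance-available --region eu-west-1 --db-instance-identifier threescale-zync-replica
```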
221 |
222 | === Active-Active clusters (not supported)
223 |
224 | Having an application work in an *active/active* configuration is always difficult, and in the case of 3scale there are two kinds of limitations:
225 |
226 | - *Soft Databases limitation*: complex, but it can be achieved with some trade-offs; it is mainly the administrator's responsibility (not 3scale's)
227 | - *Hard Application limitation*: this one is a hard limitation imposed by how 3scale manages OpenShift Routes and cannot be worked around
228 |
229 | ==== Soft Databases limitation (out of 3scale scope)
230 |
231 | One of the most difficult parts for any application deployed *active*/*active* (not only 3scale) is having the databases in *active*/*active* mode, and in the case of 3scale there are many different databases involved, each requiring its own *active*/*active* implementation:
232 |
233 | - System-mysql (or system-postgres or system-oracle, you can choose)
234 | - System-redis
235 | - Backend-redis
236 | - Zync-database (postgresql)
237 | - system-sphinx (doesn't need HA)
238 | - system-memcached (doesn't need HA)
239 |
240 | So, from the 3scale point of view, an administrator needs to ensure that the database connection strings configured in the 3scale APIManager custom resource on any OCP cluster always point to a master instance (allowing write operations).
241 |
242 | ____
243 | *NOTE*
244 |
245 | *This database limitation is a soft limitation because, although very complex (implementing active/active for mysql, postgresql or redis is not trivial), it can be achieved.*
246 | ____
247 |
248 | ==== Hard Application limitation: OpenShift Routes
249 |
250 | The main reason why *active*/*active* cannot be achieved, even with *active*/*active* databases, is the OpenShift Routes of the *System* dev/admin portals and of the *APIcasts*. They are managed by the *Zync* component and need to be exactly the same on all OCP clusters.
251 |
252 | This is an example flow, from a client requesting a new *System* dev portal until incoming traffic can eventually reach the system-app pods thanks to a newly created OpenShift Route:
253 |
254 | . System-app receives the request from the UI/API to create a dev/admin portal or apicast OpenShift Route
255 | . System-app enqueues the background job to create those OpenShift Routes into system-redis
256 | . System-sidekiq takes the job from system-redis, processes it and contacts the zync API (zync deployment)
257 | . Zync API (zync deployment) creates a background job in zync-database (postgres)
258 | . Zync-que takes the route-creation job from zync-database (postgres) and creates the final OpenShift Routes in the cluster that processed that job
259 |
260 | With N clusters, say Cluster-A and Cluster-B, a request can be handled by either Cluster-A's system-app or Cluster-B's system-app, so the job is enqueued in that cluster's own database and follows the whole procedure:
261 |
262 | ```
263 | System-app -> system-redis -> system-sidekiq -> zync (api) -> zync-database -> zync-que -> OpenShift Routes creation
264 | ```
265 |
266 | And finally, the OpenShift Routes will be created *only* in the cluster that processed the route-creation job (not in both OCP clusters), so incoming traffic managed by the OpenShift Router will work only on one cluster (the one with the needed OpenShift Routes), which breaks *active*/*active*.
267 |
268 | ____
269 | *NOTE*
270 |
271 | *This application limitation is a hard limitation because the first worker that fetches the job will create the OpenShift Route on the cluster it is running in. So, each cluster will only have a subset of the OpenShift Routes.*
272 | ____
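To see this limitation in practice, you can compare the Routes present in each cluster after creating a few products or developer portals. A minimal check, where the `cluster-a`/`cluster-b` context names and the `3scale` namespace are hypothetical:
```bash
# Each cluster only ends up with the Routes whose creation job it happened to process,
# so the two lists diverge over time.
oc --context cluster-a -n 3scale get routes -o custom-columns=NAME:.metadata.name,HOST:.spec.host
oc --context cluster-b -n 3scale get routes -o custom-columns=NAME:.metadata.name,HOST:.spec.host
```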
273 |
274 | === Active independent clusters
275 |
276 | Since *active*/*active* cannot be achieved within the same 3scale instance, due to the hard limitation imposed by the OpenShift Routes created by *Zync*, what a few customers wanting *active*/*active* have done is deploy independent 3scale instances on different OCP clusters, where every 3scale instance has its own databases, which are not shared and not synced between them.
277 |
278 | Here are some considerations:
279 |
280 | - Each independent 3scale instance will need its own high availability configuration and its own databases, which are not shared between 3scale instances
281 | - All 3scale independent instances over multiple clusters *will use the same shared `wildcardDomain`*
282 | - You need to manage the 3scale configuration with a shared CI/CD, to ensure that every independent 3scale instance has exactly the same configuration (so there is no configuration drift between 3scale instances)
283 | - You need some kind of smart global load balancer (possibly implemented at DNS level) above all OCP clusters, to ensure that the same percentage of traffic goes to every independent 3scale instance on its OCP cluster (for example using a round-robin policy)
284 | - The rate limits will be independent for every 3scale instance. If, for example, you have 2 independent 3scale instances and want a global rate limit of 100 requests/minute, you will need to configure a rate limit of 50 requests/minute on each 3scale instance; with a round-robin DNS policy on the global load balancer you will *theoretically* achieve the intended global rate limit of 100 requests/minute
285 |
286 | With this approach of independent 3scale instances, issues that are unknown at the time of writing this documentation might arise.
--------------------------------------------------------------------------------
/docs/ha_dbs.adoc:
--------------------------------------------------------------------------------
1 | = 3scale HA external databases
2 |
3 | *Using external databases is the recommended 3scale setup* (avoiding the internal databases created by the 3scale Operator by default).
4 |
5 | You can choose between different options when setting up the databases in high availability. One of the first questions that you need to ask yourself is whether you want to deploy them on-premises or whether you would like to use a managed service provided by a cloud provider:
6 |
7 | * Some users might be required by law to have everything on-premises
8 | * For users who are not, and are already using services provided by, for example, Amazon Web Services (AWS), using their managed services also for the 3scale databases might be the most convenient option (making database operation, management, backup and restore easier)
9 |
10 | In order to set up HA for the 3scale databases, you need to tell the operator that internal databases won't be used (and so won't be created in OpenShift), which applies to:
11 |
12 | * The *Backend* and *System* Redis databases
13 | * The relational database used by *System*
14 | * And, starting from 3scale v2.10, also the *Zync* database
15 |
16 | To do so, follow link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#external-databases-installation[these instructions]. These are the required APIManager spec fields to tell the operator to work with external databases:
17 | ```yaml
18 | apiVersion: apps.3scale.net/v1alpha1
19 | kind: APIManager
20 | metadata:
21 |   name: example-apimanager
22 | spec:
23 |   highAvailability:
24 |     enabled: true # backend redis, system redis, system database external databases
25 |     externalZyncDatabaseEnabled: true # zync external database
26 | ```
27 | Then you need to *manually* configure the associated database connection string URLs in the following secrets (see the sketch after this list):
28 |
29 | * backend-redis
30 | * system-database
31 | * system-redis
32 | * zync
33 |
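As an illustration, the connection-string part of those secrets could look like the sketch below. The key names are reproduced from memory from the operator's apimanager-reference linked earlier, the hostnames and credentials are hypothetical placeholders, and the reference lists additional required keys (passwords, tokens) that are omitted here, so always verify against that document:
```yaml
# Sketch only: hypothetical endpoints and credentials; verify key names and the full
# list of required keys against the 3scale operator apimanager-reference.
apiVersion: v1
kind: Secret
metadata:
  name: backend-redis
stringData:
  REDIS_STORAGE_URL: "redis://backend-redis.example.internal:6379/0"
  REDIS_QUEUES_URL: "redis://backend-redis.example.internal:6379/1"
---
apiVersion: v1
kind: Secret
metadata:
  name: system-redis
stringData:
  URL: "redis://system-redis.example.internal:6379/1"
---
apiVersion: v1
kind: Secret
metadata:
  name: system-database
stringData:
  URL: "mysql2://app:password@system-db.example.internal:3306/system"
---
apiVersion: v1
kind: Secret
metadata:
  name: zync
stringData:
  DATABASE_URL: "postgresql://zync:password@zync-db.example.internal:5432/zync_production"
```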
34 | == Backend Redis and System Redis databases
35 |
36 | Redis offers replication. You need to set up multiple instances of Redis. One of them will be the leader and the others the replicas. Writes go to the leader and are propagated to the replicas. When the leader fails, a replica becomes the new leader. When the old leader comes back online, it will become a replica of the new leader.
37 |
38 | In Redis, the replication mechanism is asynchronous. This means that when something is written to the leader, Redis answers the client without waiting for the writes to be effective in the replicas. This is fast, but when there is a failover, you will lose the data that has not been replicated yet.
39 |
40 | In the case of *Backend*, that trade-off makes sense. In the event of a failover, *Backend* might lose some reports. This means that some rate limits might not be accurate and more calls than configured could be let through for a brief period of time. Also, some statistics might not be exact. This is a trade-off that makes sense in the case of *Backend*: it needs to be fast because it is called on every API call (unless there's some auth caching policy enabled in APIcast), failovers should be pretty rare, and the amount of data not yet in the replicas should be pretty low. Aside from that, the *Backend* database contains some information synchronized from *System* that could also be lost, but that can always be recovered by executing a rake task from System.
41 |
42 | Here are some options available to use as the *Backend* and *System* Redis:
43 |
44 | * link:https://redis.io/topics/sentinel[Redis with sentinels]
45 | - This is the option you need to choose if you want to deploy everything on-premises
46 | * link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/WhatIs.html[AWS ElastiCache]
47 | - Check the link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.Redis.Groups.html[docs]
48 | - It is link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html[Multi-AZ]
49 | - It also supports link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Redis-Global-Datastore.html[Multi-region using the "Global Datastore" feature]. Keep in mind that in the multi-region case failovers are not automatic
50 | * link:https://redislabs.com/redis-enterprise-software/overview/[Redis Enterprise]
51 | - Check the link:https://redislabs.com/redis-enterprise/technology/highly-available-redis/[docs]
52 | - It is multi-AZ, and can also be configured across link:https://redislabs.com/redis-enterprise/technology/active-passive-geo-distribution/[Multiple regions]
53 |
54 | Here are some considerations:
55 |
56 | * When choosing and configuring any of those or other options, keep in mind that *Backend* does not support the link:https://redis.io/topics/cluster-tutorial[Redis Cluster mode] (which is different from the usual HA setup with a master and replicas)
57 | * In the case of *Backend*, you can deploy its databases (storage and queues) in a single Redis instance using two different database indexes (typically `db0` and `db1`), but you can also use two different Redis instances if the option that you choose does not support this, or if you prefer to have both usages (storage and queues) on separate instances
58 | * Take into account that using the same Redis instance for both *Backend* and *System* is not supported
59 |
60 | == System and Zync databases
61 |
62 | Here are some options for the relational databases used by *System* and *Zync*. Keep in mind that they need to be two separate databases:
63 |
64 | * link:https://www.crunchydata.com/[Crunchy]
65 | - Read more about it link:https://access.crunchydata.com/documentation/postgres-operator/4.6.1/architecture/high-availability/multi-cluster-kubernetes/[here]
66 | and link:https://access.crunchydata.com/documentation/postgres-operator/4.6.1/advanced/multi-zone-design-considerations/[here]
67 | - It can be deployed using a Kubernetes operator
68 | - By default, the replication is asynchronous, but it can be link:https://access.crunchydata.com/documentation/postgres-operator/4.6.1/architecture/high-availability/[configured to be synchronous]
69 | - It can be configured to be multi-cluster, but in that case the failover is not automatic. Also, the replication in that scenario is synchronous and uses S3 as an intermediate storage
70 | * link:https://aws.amazon.com/es/rds/ha/[AWS RDS]
71 | - The replication is synchronous
72 | - It supports link:https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.XRgn.html[Multi-AZ and Multi-region replicas]. Keep in mind that in the multi-region case failovers are not automatic
--------------------------------------------------------------------------------
/docs/observability.adoc:
--------------------------------------------------------------------------------
1 | :toc:
2 | :toc-placement!:
3 |
4 | = Observability
5 |
6 | toc::[]
7 |
8 | == Application metrics
9 |
10 | You can link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-monitoring-resources.md#enabling-3scale-monitoring[enable 3scale monitoring] by configuring the following APIManager spec fields:
11 |
12 | ```yaml
13 | apiVersion: apps.3scale.net/v1alpha1
14 | kind: APIManager
15 | metadata:
16 |   name: example-apimanager
17 | spec:
18 |   monitoring:
19 |     enabled: true # Mandatory, deploys PodMonitors, GrafanaDashboards and PrometheusRules
20 |     enablePrometheusRules: false # Optional, do not deploy PrometheusRules
21 | ```
22 |
23 | The link:https://github.com/3scale/3scale-operator[3scale Operator] will:
24 |
25 | * Create a *PodMonitor* custom resource for every DeploymentConfig, so prometheus-operator knows how to scrape metrics from every pod
26 | * Create a *GrafanaDashboard* custom resource for every 3scale component:
27 | - There are dashboards for every 3scale component: Backend, System, Zync, Apicast, Apicast Services
28 | - In addition, there are a couple of generic dashboards with kubernetes resource usage by pod and namespace where the 3scale instance is deployed
29 | * Create a *PrometheusRule* custom resource for every 3scale component:
30 | - You can check the default alerts in the link:https://github.com/3scale/3scale-operator/tree/master/doc/prometheusrules[3scale Operator repository]
31 | - As alert management can be heavily customized, and to avoid forcing anyone to use them, they can be excluded from operator management by adding the optional field `enablePrometheusRules: false`
32 | - Having access to the default PrometheusRules custom resources, you can manually deploy the ones you prefer by link:https://github.com/3scale/3scale-operator/tree/master/doc/prometheusrules#tune-the-prometheus-rules-based-on-your-infraestructure[tuning them to your own needs] (updating severity, time duration, thresholds...)
33 | - Bear in mind that every default PrometheusRule has a linked SOP (Standard Operating Procedure) in an annotation; you can check the current alert SOPs in the link:../sops/alerts[sops/alerts] directory
34 |
35 | == Databases metrics
36 |
37 | * 3scale uses different databases (different redis, mysql, postgresql, memcached, sphinx), so any issue on a database might have a huge impact on 3scale performance and availability
38 | * For this reason, it is recommended to set up monitoring on the 3scale databases, along with some prometheus alerts and grafana dashboards to easily check internal database metrics
39 | * You can use the link:https://github.com/3scale-ops/prometheus-exporter-operator[Prometheus Exporter Operator] in order to monitor:
40 | - 3scale databases internal metrics (including alerts and grafana dashboards): redis, mysql, postgresql, memcached, sphinx
41 | - HTTP monitoring (latency, availability, TLS/SSL certificate expiration...) of any 3scale HTTP endpoint
42 | - And, in the case of databases deployed in AWS, monitor AWS CloudWatch metrics from any AWS service (like AWS RDS or AWS ElastiCache...).
Some metrics like CPU or disk space metrics can not be extracted from redis/mysql exporter, so cloudwatch exporter is required -------------------------------------------------------------------------------- /images/3scale-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/3scale/3scale-Operations/51e76a46eb2aae2de6f12df14dfeb559416c4c99/images/3scale-architecture.png -------------------------------------------------------------------------------- /images/3scale-ha-active-passive-different-region-synced-databases.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/3scale/3scale-Operations/51e76a46eb2aae2de6f12df14dfeb559416c4c99/images/3scale-ha-active-passive-different-region-synced-databases.png -------------------------------------------------------------------------------- /images/3scale-ha-active-passive-same-region-shared-databases.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/3scale/3scale-Operations/51e76a46eb2aae2de6f12df14dfeb559416c4c99/images/3scale-ha-active-passive-same-region-shared-databases.png -------------------------------------------------------------------------------- /images/3scale-pods-by-type.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/3scale/3scale-Operations/51e76a46eb2aae2de6f12df14dfeb559416c4c99/images/3scale-pods-by-type.png -------------------------------------------------------------------------------- /sops/README.adoc: -------------------------------------------------------------------------------- 1 | = SOPS (Standard Operating Procedures) 2 | 3 | For each alert defined, there should be some associated documentation that explains what the alert means, what the likely cause is and how to respond to the alert. These SOPs are essential to enabling an SRE team to effectively respond to alerts without needing direct input from an Engineering team. Without these, an SRE team will be aware that there is an ongoing problem, but will not have the necessary knowledge to deal with the problem. 
4 | 5 | Common sections in a SOP may include: 6 | 7 | * Assumptions - how to validate that the alert is actually highlighting a problem, and what prerequisites might be required 8 | * Reference Articles - where to find additional information, or prior art for linked/similar issues 9 | * Corrective Process - what steps to take to resolve the issue 10 | * Success Indicators - how to know when the issue is resolved 11 | * Other Notes - any other relevant information 12 | -------------------------------------------------------------------------------- /sops/alerts/apicast_apicast_latency.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleApicastLatencyHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the `apicast-staging`/`apicast-production` **p99** latency is greater than a given threshold 11 | * Latency metric includes the time the request spent in apicast plus the time it took the upstream to respond 12 | 13 | == Troubleshooting 14 | 15 | * Check if `apicast-staging`/`apicast-production` pods might not be able to deal with incoming traffic due to high CPU usage 16 | - You may need to consider scaling horizontally `apicast-staging`/`apicast-production` deployment 17 | - You may need to consider increasing `apicast-staging`/`apicast-production` deployment resources requests/limits 18 | * Check if upstream response is taking too much time 19 | * Check if the current traffic is normal based on the historical data. An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the `apicast-staging`/`apicast-production` **p99** latency is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/apicast_http_4xx_error_rate.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleApicastHttp4xxErrorRate 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the error rate of `apicast-staging`/`apicast-production` HTTP 4XX requests is greater than a given threshold 11 | 12 | == Troubleshooting 13 | 14 | * Check if `apicast-staging`/`apicast-production` pods might be failing 15 | * Check `apicast-staging`/`apicast-production` pod logs to see if: 16 | - Auth is not valid 17 | - Service/Mapping rule is not being found (either it does not exists in `system-app` or `system-app` is having issues and not returning the required configuration ) 18 | - Upstream is sending a 4XX status code 19 | - Used policy is configured to send 4XX status code 20 | 21 | 22 | == Verification 23 | 24 | * Alert should disappear once the error rate of `apicast-staging`/`apicast-production` HTTP 4XX requests is below the threshold 25 | -------------------------------------------------------------------------------- /sops/alerts/apicast_request_time.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleApicastRequestTime 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if there is a huge number of `apicast-staging`/`apicast-production` requests being processed slow (below a given threshold) 11 | * Request time metric includes only the time the request spent in apicast 12 | 13 | == Troubleshooting 
14 | 15 | * Check if `apicast-staging`/`apicast-production` pods might not be able to deal with incoming traffic due to high CPU usage 16 | - You may need to consider scaling horizontally `apicast-staging`/`apicast-production` deployment 17 | - You may need to consider increasing `apicast-staging`/`apicast-production` deployment resources requests/limits 18 | * Check if the current traffic is normal based on the historical data. An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic 19 | 20 | == Verification 21 | 22 | * Alert should disappear once the number of `apicast-staging`/`apicast-production` slow requests are below the threshold 23 | -------------------------------------------------------------------------------- /sops/alerts/apicast_worker_restart.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleApicastWorkerRestart 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if one of the nginx-worker restarted for any reason 11 | 12 | == Troubleshooting 13 | 14 | * Check why the worker died in the APIcast logs; the usual reasons are: 15 | - Pod hits the memory limits and one of the worker processes was killed. Check why the limits were reached, and if it's needed, increase them 16 | - Node is out of memory, so try to kill containers to schedule to another node 17 | 18 | == Verification 19 | 20 | * Check that the worker died metric does not increase and memory is under the limits. 21 | -------------------------------------------------------------------------------- /sops/alerts/backend_listener_5xx_requests_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleBackendListener5XXRequestsHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `backend-listener` HTTP 5XX requests is greater than a given threshold 11 | 12 | == Troubleshooting 13 | 14 | * Check if `backend-listener` pods might be failing 15 | * Check if `backend-listener` pods might not be able to deal with incoming traffic due to high CPU usage 16 | - You may need to consider scaling horizontally `backend-listener` deployment 17 | - You may need to consider increasing `backend-listener` deployment resources requests/limits 18 | * Check if backend's redis is having issues 19 | * Check if the current traffic is normal based on the historical data. 
An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `backend-listener` HTTP 5XX requests is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/backend_worker_jobs_count_running_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleBackendWorkerJobsCountRunningHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `backend-worker` jobs is greater than a given threshold 11 | 12 | == Troubleshooting 13 | 14 | * Check if `backend-worker` pods might be failing and the jobs are not being consumed at all from backend's redis 15 | * Check if `backend-worker` pods might not be consuming jobs at the required throughput: 16 | - You may need to consider scaling horizontally `backend-worker` deployment 17 | - You may need to consider increasing `backend-worker` deployment resources requests/limits 18 | * Check if the current traffic in `backend-listener` is normal based on the historical data. An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic. Whatever the reason, increased traffic in `backend-listener` generates more jobs for `backend-worker` to process from backend's redis 19 | * Check if backend's redis is having issues 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `backend-worker` jobs is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/container_cpu_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleContainerCPUHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the cpu usage is being very high 11 | 12 | == Troubleshooting 13 | 14 | * Check if CPU resources requests/limits are properly set and appropriate to the historical usage data. If the real usage tends to surpass the requested threshold, you may need to increase the container resources 15 | * Check if maybe you need to scale horizontally the associated deployment with more pods (if possible) to distribute the load 16 | - Some deployments **cannot** be scaled horizontally: `backend-cron` and any db (`backend-redis`,`system-redis`, `system-mysql`, `system-memcache`, `system-sphinx`, `zync-database`) 17 | 18 | == Verification 19 | 20 | * Alert should disappear once CPU usage decreases below the threshold 21 | -------------------------------------------------------------------------------- /sops/alerts/container_cpu_throttling_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleContainerCPUThrottlingHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the cpu usage is being throttled (using more resources than configured) 11 | 12 | == Troubleshooting 13 | 14 | * Check if CPU resources requests/limits are properly set and appropriate to the historical usage data. 
If the real usage tends to surpass the requested threshold, you may need to increase the container resources 15 | 16 | == Verification 17 | 18 | * Alert should disappear once CPU usage is stable 19 | 20 | -------------------------------------------------------------------------------- /sops/alerts/container_memory_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleContainerMemoryHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the memory usage is being very high 11 | 12 | == Troubleshooting 13 | 14 | * Check if Memory resources requests/limits are properly set and appropriate to the historical usage data. If the real usage tends to surpass the requested threshold, you may need to increase the container resources 15 | * Check if maybe you need to scale horizontally the associated deployment with more pods (if possible) to distribute the load 16 | - Some deployments **cannot** be scaled horizontally: `backend-cron` and any db (`backend-redis`,`system-redis`, `system-mysql`, `system-memcache`, `system-sphinx`, `zync-database`) 17 | 18 | == Verification 19 | 20 | * Alert should disappear once Memory usage decreases below the threshold 21 | -------------------------------------------------------------------------------- /sops/alerts/container_waiting.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleContainerWaiting 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if a container has been in `waiting` state for longer than 1 hour 11 | 12 | == Troubleshooting 13 | 14 | * Check if maybe container image registry is not reachable 15 | 16 | == Verification 17 | 18 | * Alert should disappear once container is up and running 19 | 20 | 21 | -------------------------------------------------------------------------------- /sops/alerts/pod_crash_looping.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescalePodCrashLooping 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if a pod is on `CrashLoop` state because one of the containers crashes and is restarted indefinitely 11 | 12 | == Troubleshooting 13 | 14 | * Check pod logs/events to see the reason of the `CrashLoop` 15 | * Check if maybe container resources requests/limits are too low and OOMKiller is acting (container trying to allocate more memory than permitted) 16 | 17 | == Verification 18 | 19 | * Alert should disappear once the pod is up and running 20 | -------------------------------------------------------------------------------- /sops/alerts/pod_not_ready.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescalePodNotReady 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if a pod is **not** on `Ready` state 11 | 12 | == Troubleshooting 13 | 14 | * Execute a describe pod command to check the pod status 15 | 16 | == Verification 17 | 18 | * Alert should disappear once pod passes the readiness probe 19 | -------------------------------------------------------------------------------- /sops/alerts/prometheus_job_down.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescalePrometheusJobDown 5 | 6 | toc::[] 7 | 8 | == 
Description
9 |
10 | * This alert will trigger when a prometheus Job is down, because the `ServiceMonitor`/`PodMonitor` might have issues scraping the application `Service`/`Pods`
11 | * This alert is important because, if you have alerts based on application metrics that cannot be obtained (their associated prometheus job is failing), you will lose visibility of the application
12 |
13 | == Troubleshooting
14 |
15 | * Check if the scraped application (or the app metrics endpoint) might be failing, so the scrape cannot be done
16 | * Check if the `ServiceMonitor`/`PodMonitor` might be misconfigured, pointing to an incorrect label, port, path...
17 | * Check if prometheus is having some problems, or if prometheus has not reloaded its config with a possible new `ServiceMonitor`/`PodMonitor`
18 |
19 | == Verification
20 |
21 | * Alert should disappear once the prometheus target starts scraping all pods
22 |
--------------------------------------------------------------------------------
/sops/alerts/replication_controller_replicas_mismatch.adoc:
--------------------------------------------------------------------------------
1 | :toc:
2 | :toc-placement!:
3 |
4 | = ThreescaleReplicationControllerReplicasMismatch
5 |
6 | toc::[]
7 |
8 | == Description
9 |
10 | * This alert will trigger if a `ReplicationController` does not have the desired number of pods running
11 |
12 | == Troubleshooting
13 |
14 | * Check if the new pods created by the `ReplicationController` might be failing
15 | * Check if there might be an issue with the scheduling due to lack of node resources or tolerations
16 |
17 | == Verification
18 |
19 | * Alert should disappear once all pods are running per `ReplicationController` (running pods = desired pods)
20 |
--------------------------------------------------------------------------------
/sops/alerts/system_app_5xx_requests_high.adoc:
--------------------------------------------------------------------------------
1 | :toc:
2 | :toc-placement!:
3 |
4 | = ThreescaleSystemApp5XXRequestsHigh
5 |
6 | toc::[]
7 |
8 | == Description
9 |
10 | * This alert will trigger if the number of `system-app` HTTP 5XX requests is greater than a given threshold
11 |
12 | == Troubleshooting
13 |
14 | * Check if `system-app` pods might be failing
15 | * Check if `system-app` pods might not be able to deal with incoming traffic due to high CPU usage
16 | - You may need to consider scaling horizontally `system-app` deployment
17 | - You may need to consider increasing `system-app` deployment resources requests/limits
18 | * Check if system's mysql database (or even system's redis) is having issues
19 | * Check if the current traffic is normal based on the historical data.
An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `system-app` HTTP 5XX requests is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/zync_5xx_requests_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleZync5XXRequestsHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `zync` HTTP 5XX requests is greater than a given threshold 11 | 12 | == Troubleshooting 13 | 14 | * Check if `zync` pods might be failing 15 | * Check if `zync` pods might not be able to deal with incoming traffic due to high CPU usage 16 | - You may need to consider scaling horizontally `zync` deployment 17 | - You may need to consider increasing `zync` deployment resources requests/limits 18 | * Check if zync's postgres database is having issues 19 | 20 | == Verification 21 | 22 | * Alert should disappear once the number of `zync` HTTP 5XX requests is below the threshold 23 | -------------------------------------------------------------------------------- /sops/alerts/zync_que_failed_job_count_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleZyncQueFailedJobCountHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `zync-que` **failed** jobs is greater than a given threshold 11 | * **Failed** jobs are the ones that failed at least once, did not run out of attempts to retry and therefore are scheduled for retry any time soon 12 | 13 | == Troubleshooting 14 | 15 | * Check the logs at `zync-que` pods to see the reason of the failures 16 | 17 | == Verification 18 | 19 | * Alert should disappear once the number of `zync-que` **failed** jobs is below the threshold 20 | -------------------------------------------------------------------------------- /sops/alerts/zync_que_ready_job_count_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleZyncQueReadyJobCountHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `zync-que` **ready** jobs is greater than a given threshold 11 | * **Ready** jobs are the ones that are enqueued and ready to be executed ASAP (never failed, nor got expired) 12 | 13 | == Troubleshooting 14 | 15 | * Check if `zync-que` pods might be failing and the jobs are not being consumed at all from zync's postgres 16 | * Check if `zync-que` pods might not be consuming jobs at the required throughput: 17 | - You may need to consider scaling horizontally `zync-que` deployment 18 | - You may need to consider increasing `zync-que` deployment resources requests/limits 19 | * Check if zync's postgres is having issues 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `zync-que` **ready** jobs is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/zync_que_scheduled_job_count_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleZyncQueScheduledJobCountHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * 
This alert will trigger if the number of `zync-que` **scheduled** jobs is greater than a given threshold 11 | * **Scheduled** jobs are the ones that are enqueued to be executed some time in the future, but not now (never failed, nor got expired) 12 | 13 | == Troubleshooting 14 | 15 | * Check if `zync-que` pods might be failing and the jobs are not being consumed at all from zync's postgres 16 | * Check if `zync-que` pods might not be consuming jobs at the required throughput: 17 | - You may need to consider scaling horizontally `zync-que` deployment 18 | - You may need to consider increasing `zync-que` deployment resources requests/limits 19 | * Check if zync's postgres is having issues 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `zync-que` **scheduled** jobs is below the threshold 24 | --------------------------------------------------------------------------------