├── LICENSE ├── NOTICE ├── README.adoc ├── docs ├── backups.adoc ├── components.adoc ├── ha.adoc ├── ha_3scale.adoc ├── ha_dbs.adoc └── observability.adoc ├── images ├── 3scale-HA-scenarios.svg ├── 3scale-architecture.png ├── 3scale-ha-active-passive-different-region-synced-databases.png ├── 3scale-ha-active-passive-same-region-shared-databases.png └── 3scale-pods-by-type.png └── sops ├── README.adoc └── alerts ├── apicast_apicast_latency.adoc ├── apicast_http_4xx_error_rate.adoc ├── apicast_request_time.adoc ├── apicast_worker_restart.adoc ├── backend_listener_5xx_requests_high.adoc ├── backend_worker_jobs_count_running_high.adoc ├── container_cpu_high.adoc ├── container_cpu_throttling_high.adoc ├── container_memory_high.adoc ├── container_waiting.adoc ├── pod_crash_looping.adoc ├── pod_not_ready.adoc ├── prometheus_job_down.adoc ├── replication_controller_replicas_mismatch.adoc ├── system_app_5xx_requests_high.adoc ├── zync_5xx_requests_high.adoc ├── zync_que_failed_job_count_high.adoc ├── zync_que_ready_job_count_high.adoc └── zync_que_scheduled_job_count_high.adoc /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Red Hat 3scale API Management Operations 2 | Copyright (c) 2016-2019 Red Hat, Inc. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 
6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | -------------------------------------------------------------------------------- /README.adoc: -------------------------------------------------------------------------------- 1 | = 3scale Operations 2 | 3 | This repository contains documentation and tools to help in the installation, operation, ongoing maintenance and upgrade of an instance of 3scale API Management, whether you are deploying it in a *public cloud*, a *private cloud* or *on-premises*. 4 | 5 | This documentation assumes familiarity with the main components of the 3scale architecture. If you would like to learn more about them, please check their upstream repositories: 6 | 7 | * link:https://github.com/3scale/apicast[apicast] 8 | * link:https://github.com/3scale/apisonator[apisonator (also known as "backend")] 9 | * link:https://github.com/3scale/porta[porta (also known as "system")] 10 | * link:https://github.com/3scale/zync[zync] 11 | 12 | It also assumes that you are installing 3scale on link:https://www.openshift.com/[Red Hat OpenShift] using the link:https://github.com/3scale/3scale-operator[3scale Operator]. 13 | 14 | Sections will be added over time and updated when relevant changes are made in 3scale or 15 | the specific platforms it runs on. 16 | 17 | == Architecture 18 | 19 | image::images/3scale-architecture.png[3scale Architecture] 20 | 21 | == Operations 22 | 23 | Documents in this section describe typical operation procedures once you have your 3scale installation up and running: 24 | 25 | * link:docs/components.adoc[Component responsibilities and impact when down] 26 | * link:docs/backups.adoc[Backup & Restore procedures] 27 | * link:docs/observability.adoc[Observability: Apps & Databases] 28 | * link:docs/ha.adoc[High-Availability] -------------------------------------------------------------------------------- /docs/backups.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = Backup & Restore 5 | 6 | image:https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Warning.svg/156px-Warning.svg.png[] 7 | 8 | * *It is recommended to use link:ha_dbs.adoc[external databases]* instead of the internal databases that are installed by default 9 | * For easier backup/restore management when using external databases on public cloud providers, the recommendation is to set up automated backups using the public cloud provider's integrated backup solution (for example, AWS RDS with its automated daily snapshots and PITR), if available 10 | * If using the internal databases, backup and restore can be done using the link:https://github.com/3scale/3scale-operator[3scale Operator] by following these link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-backup-and-restore.md#3scale-installation-backup-and-restore-using-the-operator[instructions] 11 | * *However, if for any reason you don't want to use the link:https://github.com/3scale/3scale-operator[3scale Operator] to manage backup/restore of the internal databases, in the following sections you can find the deprecated way of doing manual backup and
restore. This is not recommended, use at your own risk* 12 | 13 | == Manual Backup & Restore [Deprecated] 14 | 15 | toc::[] 16 | 17 | This section provides the Operator of a 3scale installation with the information they need for: 18 | 19 | * <<Backup Procedures>> to back up persistent data 20 | * <<Restore Procedures>> to restore from a backup of the persistent data 21 | 22 | With these, in the case of a failure, they can restore 3scale to a correctly running operational state and continue 23 | operating. 24 | 25 | === Persistent Volumes 26 | 27 | In a 3scale deployment on OpenShift, all persistent data is stored either in a storage service in the cluster 28 | (not currently used), in a Persistent Volume (PV) provided to the cluster by the underlying infrastructure, or in a storage 29 | service external to the cluster (be that in the same data center or elsewhere). 30 | 31 | === Considerations 32 | 33 | The backup and restore procedures for persistent data vary depending on the storage used, to ensure the 34 | backups and restores preserve data consistency (e.g. that a partial write or a partial transaction is not captured). 35 | That is, it is not sufficient to back up the underlying Persistent Volumes for a database; instead, the database's own backup 36 | mechanisms should be used. 37 | 38 | Also, some parts of the data are synchronized between different components. One copy is considered the "source of truth" 39 | for the data set, and the other a copy that is not modified locally, just synchronized from the "source of truth". 40 | In these cases, upon restore, the "source of truth" should be restored first and then the copies in other components 41 | synchronized from it. 42 | 43 | == Data Sets 44 | 45 | This section goes into more detail on the different data sets in the different persistent stores, their purposes, 46 | the storage type used, and whether each is the "source of truth" or not. 47 | 48 | The full state of a 3scale deployment is stored across these services and their persistent volumes: 49 | 50 | * `system-mysql`: MySQL database (`mysql-storage`) 51 | * `system-storage`: Volume for files 52 | * `zync-database`: PostgreSQL database for the `zync` component. This uses "HostPath" as storage. If the pod is moved to 53 | another node, the data is lost. That is not a problem because the data consists of sync jobs and does not need to be 100% 54 | persistent. 55 | * `backend-redis`: Redis database (`backend-redis-storage`) 56 | * `system-redis`: Redis database (`system-redis-storage`) 57 | 58 | === `system-mysql` OR `system-database` OR external Oracle database 59 | 60 | This is a relational database for storing the information about users, accounts, APIs, plans, etc. in the 3scale Admin 61 | Console. 62 | 63 | A subset of this information related to Services is synchronized to the `Backend` component and stored in 64 | `backend-redis`. `system-mysql` OR `system-database` OR the external Oracle database is the source of truth for this information. 65 | 66 | === `system-storage` 67 | 68 | This is for storing files to be read and written by the `System` component. They fall into two categories: 69 | 70 | * Configuration files read by the System component at run-time 71 | * Static files (HTML, CSS, JS, etc.) uploaded to System by its CMS feature, for the purpose of creating a Developer 72 | Portal. 73 | 74 | Note that `System` can be scaled horizontally with multiple pods uploading and reading said static files, hence the 75 | need for a `RWX` PersistentVolume.
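If you want to confirm that your cluster actually provisioned shared storage for this volume, you can inspect the access modes of the corresponding PersistentVolumeClaim. This is only a quick sanity check, and it assumes the PVC is named `system-storage`, as in a default 3scale deployment:

[source,bash]
----
# Should print ["ReadWriteMany"] when the shared (RWX) volume was provisioned correctly
oc get pvc system-storage -o jsonpath='{.spec.accessModes}{"\n"}'
----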
76 | 77 | === `zync-database` 78 | 79 | This is a relational database for storing information related to the synchronization of identities between 3scale and 80 | an Identity Provider. 81 | This information is not duplicated in other components, and this is the sole source of truth. 82 | 83 | === `backend-redis` 84 | 85 | This contains multiple data sets used by the `Backend` component: 86 | 87 | * Usages: This is API usage information aggregated by `Backend`. It is used by `Backend` for rate-limiting decisions 88 | and by `System` to display analytics information in the UI or via API. 89 | * Config: This is configuration information about Services, Rate-limits, etc. that is synchronized from `System` via an 90 | internal API. This is NOT the source of truth for this info; `System` and `system-mysql` are. 91 | * AuthKeys: Storage of OAuth keys created directly in `Backend`. This is the source of truth for this information. 92 | * Queues: Queues of background jobs to be executed by worker processes. These are ephemeral and are deleted once 93 | processed. 94 | 95 | === `system-redis` 96 | 97 | This contains queues for jobs to be processed in the background. These are ephemeral and are deleted once processed. 98 | 99 | == Backup Procedures 100 | 101 | === `system-mysql` 102 | 103 | Execute MySQL Backup Command 104 | 105 | [source,bash] 106 | ---- 107 | oc rsh $(oc get pods -l 'deploymentConfig=system-mysql' -o json | jq -r '.items[0].metadata.name') bash -c 'export MYSQL_PWD=${MYSQL_ROOT_PASSWORD}; mysqldump --single-transaction -hsystem-mysql -uroot system' | gzip > system-mysql-backup.gz 108 | ---- 109 | 110 | === `system-database` 111 | 112 | Execute PostgreSQL Backup Command 113 | 114 | [source,bash] 115 | ---- 116 | oc rsh $(oc get pods -l 'deploymentConfig=system-database' -o json | jq '.items[0].metadata.name' -r) bash -c 'pg_dumpall -c --if-exists' | gzip > system-postgres-backup.gz 117 | ---- 118 | 119 | === External Oracle database 120 | 121 | Follow the Oracle Database Backup and Recovery Quick Start Guide: https://docs.oracle.com/cd/B19306_01/backup.102/b14193/toc.htm 122 | 123 | === `system-storage` 124 | 125 | Archive the system-storage files to another storage location.
126 | 127 | [source,bash] 128 | ---- 129 | oc rsync $(oc get pods -l 'deploymentConfig=system-app' -o json | jq '.items[0].metadata.name' -r):/opt/system/public/system ./local/dir 130 | ---- 131 | 132 | === `zync-database` 133 | 134 | Execute PostgreSQL Backup Command 135 | 136 | [source,bash] 137 | ---- 138 | oc rsh $(oc get pods -l 'deploymentConfig=zync-database' -o json | jq '.items[0].metadata.name' -r) bash -c 'pg_dumpall -c --if-exists' | gzip > zync-database-backup.gz 139 | ---- 140 | 141 | === `backend-redis` 142 | 143 | Back up the dump.rdb file from Redis 144 | 145 | [source,bash] 146 | ---- 147 | oc cp $(oc get pods -l 'deploymentConfig=backend-redis' -o json | jq '.items[0].metadata.name' -r):/var/lib/redis/data/dump.rdb ./backend-redis-dump.rdb 148 | ---- 149 | 150 | === `system-redis` 151 | 152 | Back up the dump.rdb file from Redis 153 | 154 | [source,bash] 155 | ---- 156 | oc cp $(oc get pods -l 'deploymentConfig=system-redis' -o json | jq '.items[0].metadata.name' -r):/var/lib/redis/data/dump.rdb ./system-redis-dump.rdb 157 | ---- 158 | 159 | === `Secrets` 160 | [source,bash] 161 | ---- 162 | oc get secrets system-smtp -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-smtp.json 163 | oc get secrets system-seed -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-seed.json 164 | oc get secrets system-database -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-database.json 165 | oc get secrets backend-internal-api -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > backend-internal-api.json 166 | oc get secrets system-events-hook -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-events-hook.json 167 | oc get secrets system-app -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-app.json 168 | oc get secrets system-recaptcha -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-recaptcha.json 169 | oc get secrets system-redis -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-redis.json 170 | oc get secrets zync -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > zync.json 171 | oc get secrets system-master-apicast -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-master-apicast.json 172 | oc get secrets backend-listener -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > backend-listener.json 173 | oc get secrets backend-redis -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > backend-redis.json 174 | oc get secrets system-memcache -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-memcache.json 175 | ---- 176 | 177 | === `ConfigMaps` 178 | [source,bash] 179 | ---- 180 | oc get configmaps system-environment -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system-environment.json 181 | oc get configmaps apicast-environment -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > apicast-environment.json 182 | oc get configmaps backend-environment -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > backend-environment.json 183 | oc get configmaps mysql-extra-conf -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > mysql-extra-conf.json 184 | oc get configmaps
mysql-main-conf -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > mysql-main-conf.json 185 | oc get configmaps redis-config -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > redis-config.json 186 | oc get configmaps system -o json --export | jq -r 'del(.metadata.ownerReferences,.metadata.selfLink)' > system.json 187 | ---- 188 | 189 | == Restore Procedures 190 | 191 | === `Template-based deployment` 192 | 193 | Restore secrets before creating deploying template. 194 | 195 | [source,bash] 196 | ---- 197 | oc apply -f system-smtp.json 198 | ---- 199 | 200 | Template parameters will be read from copied secrets and configmaps. 201 | 202 | [source,bash] 203 | ---- 204 | oc new-app --file /opt/amp/templates/amp.yml \ 205 | --param APP_LABEL=$(cat system-environment.json | jq -r '.metadata.labels.app') \ 206 | --param TENANT_NAME=$(cat system-seed.json | jq -r '.data.TENANT_NAME' | base64 -d) \ 207 | --param SYSTEM_DATABASE_USER=$(cat system-database.json | jq -r '.data.DB_USER' | base64 -d) \ 208 | --param SYSTEM_DATABASE_PASSWORD=$(cat system-database.json | jq -r '.data.DB_PASSWORD' | base64 -d) \ 209 | --param SYSTEM_DATABASE=$(cat system-database.json | jq -r '.data.URL' | base64 -d | cut -d '/' -f4) \ 210 | --param SYSTEM_DATABASE_ROOT_PASSWORD=$(cat system-database.json | jq -r '.data.URL' | base64 -d | awk -F '[:@]' '{print $3}') \ 211 | --param WILDCARD_DOMAIN=$(cat system-environment.json | jq -r '.data.THREESCALE_SUPERDOMAIN') \ 212 | --param SYSTEM_BACKEND_USERNAME=$(cat backend-internal-api.json | jq '.data.username' -r | base64 -d) \ 213 | --param SYSTEM_BACKEND_PASSWORD=$(cat backend-internal-api.json | jq '.data.password' -r | base64 -d) \ 214 | --param SYSTEM_BACKEND_SHARED_SECRET=$(cat system-events-hook.json | jq -r '.data.PASSWORD' | base64 -d) \ 215 | --param SYSTEM_APP_SECRET_KEY_BASE=$(cat system-app.json | jq -r '.data.SECRET_KEY_BASE' | base64 -d) \ 216 | --param ADMIN_PASSWORD=$(cat system-seed.json | jq -r '.data.ADMIN_PASSWORD' | base64 -d) \ 217 | --param ADMIN_USERNAME=$(cat system-seed.json | jq -r '.data.ADMIN_USER' | base64 -d) \ 218 | --param ADMIN_EMAIL=$(cat system-seed.json | jq -r '.data.ADMIN_EMAIL' | base64 -d) \ 219 | --param ADMIN_ACCESS_TOKEN=$(cat system-seed.json | jq -r '.data.ADMIN_ACCESS_TOKEN' | base64 -d) \ 220 | --param MASTER_NAME=$(cat system-seed.json | jq -r '.data.MASTER_DOMAIN' | base64 -d) \ 221 | --param MASTER_USER=$(cat system-seed.json | jq -r '.data.MASTER_USER' | base64 -d) \ 222 | --param MASTER_PASSWORD=$(cat system-seed.json | jq -r '.data.MASTER_PASSWORD' | base64 -d) \ 223 | --param MASTER_ACCESS_TOKEN=$(cat system-seed.json | jq -r '.data.MASTER_ACCESS_TOKEN' | base64 -d) \ 224 | --param RECAPTCHA_PUBLIC_KEY="$(cat system-recaptcha.json | jq -r '.data.PUBLIC_KEY' | base64 -d)" \ 225 | --param RECAPTCHA_PRIVATE_KEY="$(cat system-recaptcha.json | jq -r '.data.PRIVATE_KEY' | base64 -d)" \ 226 | --param SYSTEM_REDIS_URL=$(cat system-redis.json | jq -r '.data.URL' | base64 -d) \ 227 | --param SYSTEM_MESSAGE_BUS_REDIS_URL="$(cat system-redis.json | jq -r '.data.MESSAGE_BUS_URL' | base64 -d)" \ 228 | --param SYSTEM_REDIS_NAMESPACE="$(cat system-redis.json | jq -r '.data.NAMESPACE' | base64 -d)" \ 229 | --param SYSTEM_MESSAGE_BUS_REDIS_NAMESPACE="$(cat system-redis.json | jq -r '.data.MESSAGE_BUS_NAMESPACE' | base64 -d)" \ 230 | --param ZYNC_DATABASE_PASSWORD=$(cat zync.json | jq -r '.data.ZYNC_DATABASE_PASSWORD' | base64 -d) \ 231 | --param ZYNC_SECRET_KEY_BASE=$(cat zync.json 
| jq -r '.data.SECRET_KEY_BASE' | base64 -d) \ 232 | --param ZYNC_AUTHENTICATION_TOKEN=$(cat zync.json | jq -r '.data.ZYNC_AUTHENTICATION_TOKEN' | base64 -d) \ 233 | --param APICAST_ACCESS_TOKEN=$(cat system-master-apicast.json | jq -r '.data.ACCESS_TOKEN' | base64 -d) \ 234 | --param APICAST_MANAGEMENT_API=$(cat apicast-environment.json | jq -r '.data.APICAST_MANAGEMENT_API') \ 235 | --param APICAST_OPENSSL_VERIFY=$(cat apicast-environment.json | jq -r '.data.OPENSSL_VERIFY') \ 236 | --param APICAST_RESPONSE_CODES=$(cat apicast-environment.json | jq -r '.data.APICAST_RESPONSE_CODES') \ 237 | --param APICAST_REGISTRY_URL=$(cat system-environment.json | jq -r '.data.APICAST_REGISTRY_URL') 238 | ---- 239 | 240 | === `Operator-based deployments` 241 | 242 | Restore secrets before creating the APIManager resource. 243 | 244 | [source,bash] 245 | ---- 246 | oc apply -f system-smtp.json 247 | oc apply -f system-seed.json 248 | oc apply -f system-database.json 249 | oc apply -f backend-internal-api.json 250 | oc apply -f system-events-hook.json 251 | oc apply -f system-app.json 252 | oc apply -f system-recaptcha.json 253 | oc apply -f system-redis.json 254 | oc apply -f zync.json 255 | oc apply -f system-master-apicast.json 256 | oc apply -f backend-listener.json 257 | oc apply -f backend-redis.json 258 | oc apply -f system-memcache.json 259 | ---- 260 | 261 | Restore configmaps before creating the APIManager resource. 262 | 263 | [source,bash] 264 | ---- 265 | oc apply -f system-environment.json 266 | oc apply -f apicast-environment.json 267 | oc apply -f backend-environment.json 268 | oc apply -f mysql-extra-conf.json 269 | oc apply -f mysql-main-conf.json 270 | oc apply -f redis-config.json 271 | oc apply -f system.json 272 | ---- 273 | 274 | === `system-mysql` 275 | 276 | Copy the MySQL dump to the system-mysql pod 277 | 278 | [source,bash] 279 | ---- 280 | oc cp ./system-mysql-backup.gz $(oc get pods -l 'deploymentConfig=system-mysql' -o json | jq '.items[0].metadata.name' -r):/var/lib/mysql 281 | ---- 282 | 283 | Decompress the Backup File 284 | 285 | [source,bash] 286 | ---- 287 | oc rsh $(oc get pods -l 'deploymentConfig=system-mysql' -o json | jq -r '.items[0].metadata.name') bash -c 'gzip -d ${HOME}/system-mysql-backup.gz' 288 | ---- 289 | 290 | Restore the MySQL DB Backup file 291 | 292 | [source,bash] 293 | ---- 294 | oc rsh $(oc get pods -l 'deploymentConfig=system-mysql' -o json | jq -r '.items[0].metadata.name') bash -c 'export MYSQL_PWD=${MYSQL_ROOT_PASSWORD}; mysql -hsystem-mysql -uroot system < ${HOME}/system-mysql-backup' 295 | ---- 296 | 297 | === `system-database` 298 | 299 | Copy the PostgreSQL Database dump to the system-database pod 300 | 301 | [source,bash] 302 | ---- 303 | oc cp ./system-postgres-backup.gz $(oc get pods -l 'deploymentConfig=system-database' -o json | jq '.items[0].metadata.name' -r):/var/lib/pgsql/ 304 | ---- 305 | 306 | Decompress the Backup File 307 | 308 | [source,bash] 309 | ---- 310 | oc rsh $(oc get pods -l 'deploymentConfig=system-database' -o json | jq -r '.items[0].metadata.name') bash -c 'gzip -d ${HOME}/system-postgres-backup.gz' 311 | ---- 312 | 313 | Restore the PostgreSQL DB Backup file (note that the dump was decompressed in the previous step, so the file no longer has the `.gz` extension) 314 | 315 | [source,bash] 316 | ---- 317 | oc rsh $(oc get pods -l 'deploymentConfig=system-database' -o json | jq -r '.items[0].metadata.name') bash -c 'psql -f ${HOME}/system-postgres-backup' 318 | ---- 319 | 320 | === `system-storage` 321 | 322 | Restore the archived files from a different location.
323 | 324 | [source,bash] 325 | ---- 326 | oc rsync ./local/dir/system/ $(oc get pods -l 'deploymentConfig=system-app' -o json | jq '.items[0].metadata.name' -r):/opt/system/public/system --delete=true 327 | ---- 328 | 329 | 330 | === `zync-database` 331 | 332 | Copy the Zync Database dump to the zync-database pod 333 | 334 | [source,bash] 335 | ---- 336 | oc cp ./zync-database-backup.gz $(oc get pods -l 'deploymentConfig=zync-database' -o json | jq '.items[0].metadata.name' -r):/var/lib/pgsql/ 337 | ---- 338 | 339 | Decompress the Backup File 340 | 341 | [source,bash] 342 | ---- 343 | oc rsh $(oc get pods -l 'deploymentConfig=zync-database' -o json | jq -r '.items[0].metadata.name') bash -c 'gzip -d ${HOME}/zync-database-backup.gz' 344 | ---- 345 | 346 | Restore the PostgreSQL DB Backup file 347 | 348 | [source,bash] 349 | ---- 350 | oc rsh $(oc get pods -l 'deploymentConfig=zync-database' -o json | jq -r '.items[0].metadata.name') bash -c 'psql -f ${HOME}/zync-database-backup' 351 | ---- 352 | 353 | === `backend-redis` 354 | 355 | * After restoring `backend-redis`, a sync of the Config information from `System` should be forced, to ensure the 356 | information in `Backend` is consistent with that in `System` (the source of truth). 357 | 358 | Edit the `redis-config` configmap 359 | 360 | [source,bash] 361 | ---- 362 | oc edit configmap redis-config 363 | ---- 364 | 365 | Comment the `save` commands in the `redis-config` configmap 366 | 367 | [source,bash] 368 | ---- 369 | #save 900 1 370 | #save 300 10 371 | #save 60 10000 372 | ---- 373 | 374 | Set `appendonly` to no in the `redis-config` configmap 375 | 376 | [source,bash] 377 | ---- 378 | appendonly no 379 | ---- 380 | 381 | Re-deploy `backend-redis` to load the new configurations 382 | 383 | [source,bash] 384 | ---- 385 | oc rollout latest dc/backend-redis 386 | ---- 387 | 388 | Rename the `dump.rdb` file 389 | 390 | [source,bash] 391 | ---- 392 | oc rsh $(oc get pods -l 'deploymentConfig=backend-redis' -o json | jq '.items[0].metadata.name' -r) bash -c 'mv ${HOME}/data/dump.rdb ${HOME}/data/dump.rdb-old' 393 | ---- 394 | 395 | Rename the `appendonly.aof` file 396 | 397 | [source,bash] 398 | ---- 399 | oc rsh $(oc get pods -l 'deploymentConfig=backend-redis' -o json | jq '.items[0].metadata.name' -r) bash -c 'mv ${HOME}/data/appendonly.aof ${HOME}/data/appendonly.aof-old' 400 | ---- 401 | 402 | Move the backup file to the Pod 403 | 404 | [source,bash] 405 | ---- 406 | oc cp ./backend-redis-dump.rdb $(oc get pods -l 'deploymentConfig=backend-redis' -o json | jq '.items[0].metadata.name' -r):/var/lib/redis/data/dump.rdb 407 | ---- 408 | 409 | Re-deploy `backend-redis` to load the backup 410 | 411 | [source,bash] 412 | ---- 413 | oc rollout latest dc/backend-redis 414 | ---- 415 | 416 | Edit the `redis-config` configmap 417 | 418 | [source,bash] 419 | ---- 420 | oc edit configmap redis-config 421 | ---- 422 | 423 | Uncomment the `save` commands in the `redis-config` configmap 424 | 425 | [source,bash] 426 | ---- 427 | save 900 1 428 | save 300 10 429 | save 60 10000 430 | ---- 431 | 432 | Set `appendonly` to yes in the `redis-config` configmap 433 | 434 | [source,bash] 435 | ---- 436 | appendonly yes 437 | ---- 438 | 439 | Re-deploy `backend-redis` to reload the default configurations 440 | 441 | [source,bash] 442 | ---- 443 | oc rollout latest dc/backend-redis 444 | ---- 445 | 446 | === `system-redis` 447 | 448 | Edit the `redis-config` configmap 449 | 450 | [source,bash] 451 | ---- 452 | oc edit configmap redis-config 453 | ---- 454 | 455 |
Comment the `save` commands in the `redis-config` configmap 456 | 457 | [source,bash] 458 | ---- 459 | #save 900 1 460 | #save 300 10 461 | #save 60 10000 462 | ---- 463 | 464 | Set `appendonly` to no in the `redis-config` configmap 465 | 466 | [source,bash] 467 | ---- 468 | appendonly no 469 | ---- 470 | 471 | Re-deploy `system-redis` to load the new configurations 472 | 473 | [source,bash] 474 | ---- 475 | oc rollout latest dc/system-redis 476 | ---- 477 | 478 | Rename the `dump.rdb` file 479 | 480 | [source,bash] 481 | ---- 482 | oc rsh $(oc get pods -l 'deploymentConfig=system-redis' -o json | jq '.items[0].metadata.name' -r) bash -c 'mv ${HOME}/data/dump.rdb ${HOME}/data/dump.rdb-old' 483 | ---- 484 | 485 | Rename the `appendonly.aof` file 486 | 487 | [source,bash] 488 | ---- 489 | oc rsh $(oc get pods -l 'deploymentConfig=system-redis' -o json | jq '.items[0].metadata.name' -r) bash -c 'mv ${HOME}/data/appendonly.aof ${HOME}/data/appendonly.aof-old' 490 | ---- 491 | 492 | Move the backup file to the Pod 493 | 494 | [source,bash] 495 | ---- 496 | oc cp ./system-redis-dump.rdb $(oc get pods -l 'deploymentConfig=system-redis' -o json | jq '.items[0].metadata.name' -r):/var/lib/redis/data/dump.rdb 497 | ---- 498 | 499 | Re-deploy `system-redis` to load the backup 500 | 501 | [source,bash] 502 | ---- 503 | oc rollout latest dc/system-redis 504 | ---- 505 | 506 | Edit the `redis-config` configmap 507 | 508 | [source,bash] 509 | ---- 510 | oc edit configmap redis-config 511 | ---- 512 | 513 | Uncomment the `save` commands in the `redis-config` configmap 514 | 515 | [source,bash] 516 | ---- 517 | save 900 1 518 | save 300 10 519 | save 60 10000 520 | ---- 521 | 522 | Set `appendonly` to yes in the `redis-config` configmap 523 | 524 | [source,bash] 525 | ---- 526 | appendonly yes 527 | ---- 528 | 529 | Re-deploy `system-redis` to reload the default configurations 530 | 531 | [source,bash] 532 | ---- 533 | oc rollout latest dc/system-redis 534 | ---- 535 | 536 | === `backend-worker` 537 | 538 | [source,bash] 539 | ---- 540 | oc rollout latest dc/backend-worker 541 | ---- 542 | 543 | === `system-app` 544 | 545 | [source,bash] 546 | ---- 547 | oc rollout latest dc/system-app 548 | ---- 549 | 550 | === `system-sidekiq` 551 | 552 | Resync domains 553 | 554 | [source,bash] 555 | ---- 556 | oc exec -t $(oc get pods -l 'deploymentConfig=system-sidekiq' -o json | jq '.items[0].metadata.name' -r) -- bash -c "bundle exec rake zync:resync:domains" 557 | ---- 558 | 559 | == Open Issues 560 | 561 | * What about System services and sphinx (index)? 562 | * How to handle backup/restore of job queues (of different types). They can be lost or maybe processed twice! 563 | -------------------------------------------------------------------------------- /docs/components.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = Components 5 | 6 | This section explains what the responsibilities of every 3scale component are and what the impact is when each of them is not available.
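To see at a glance which of the components described below are running in your installation, you can list the Pods together with the component they belong to. This is just a quick check, relying on the `deploymentConfig` label used throughout this repository's examples:

[source,bash]
----
# Adds a column showing the DeploymentConfig (component) of each 3scale Pod
oc get pods -L deploymentConfig
----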
7 | 8 | image::../images/3scale-pods-by-type.png[3scale pods by Type] 9 | 10 | toc::[] 11 | 12 | == apicast-production 13 | 14 | === Responsibilities 15 | * 3scale API Gateway 16 | * APIcast enforces traffic policies, either returning an error or proxying the API call to the customer's API backend 17 | * APIcast fetches its operating configuration from the `system-app` component when the gateway starts 18 | * Each incoming/managed API call produces a sync/auth request from the API Gateway to `backend-listener` 19 | * Adds a small amount of latency to each managed API call, since the API Gateway is an extra hop in the traffic flow 20 | 21 | === Impact when down 22 | * Clients will not be able to reach the provider's API using that APIcast instance 23 | 24 | == apicast-staging 25 | 26 | === Responsibilities 27 | * Same as `apicast-production`, but affecting staging traffic 28 | 29 | === Impact when down 30 | * Same as `apicast-production`, but affecting staging traffic 31 | 32 | == backend-listener 33 | 34 | === Responsibilities 35 | * This is the most critical component, responsible for authorizing and rate-limiting requests 36 | 37 | === Impact when down 38 | * APIcast will not be able to tell whether a request should be authorized or not, and simply denies everything 39 | * APIs configured using 3scale are down 40 | * This service disruption can be mitigated by using the APIcast auth caching policy. There are some trade-offs when using this policy (service availability upon possible outages vs. the service being available but possibly using outdated configuration), so make sure you understand the implications before using it 41 | 42 | == backend-worker 43 | 44 | === Responsibilities 45 | * Processes the background jobs created by `backend-listener` 46 | * Runs enqueued jobs, mainly related to traffic reporting 47 | 48 | === Impact when down 49 | * Reported metrics are not applied, so the rate-limiting functionality loses accuracy because those pending reports are not taken into account 50 | * Statistics will not be up to date, and the alerts and errors shown in the admin portal will not be triggered 51 | * Because accounting will not work, authorizations will not be correct 52 | * Jobs will not be processed and will start accumulating in `backend-redis`, which could lead to database out of memory related problems 53 | 54 | == backend-cron 55 | 56 | === Responsibilities 57 | * This is a simple task that acts as a cron-like scheduler to retry failed jobs 58 | * When a `backend-worker` job fails, it is pushed to a "failed jobs" queue so that it can be retried later. Jobs can fail, for example, when there's a Redis timeout 59 | * It is also responsible for deleting the stats of services that have been removed. This is run every 24h 60 | 61 | === Impact when down 62 | * Failed jobs will not be rescheduled 63 | * If it crashes in the middle of the delete process, it will just continue the next time it runs 64 | * If the 3scale installation is working correctly, the failed jobs queue will be empty at almost all times, so `backend-cron` being down is not critical 65 | 66 | == backend-redis 67 | 68 | === Responsibilities 69 | * It is the database used by `backend-listener` and `backend-worker` 70 | * It is used both for data persistence (metrics...) and to store job queues 71 | 72 | === Impact when down 73 | * `backend-listener` and `backend-worker` cannot function without access to the storage, so both components can be considered as down.
Refer to the sections on `backend-listener` and `backend-worker` to review impact when these components are down 74 | 75 | == system-app 76 | 77 | === Responsibilities 78 | * Developer and Admin Portal UI/API 79 | * 3scale APIs (Accounts, Analytics) 80 | 81 | === Impact when down 82 | * Developer and Admin Portal UI/API will not be available 83 | * 3scale APIs (Accounts, Analytics) will not be available 84 | * `apicast` will not be able to retrieve the gateway configuration, so new `apicast` deployments will not work 85 | * Already running `apicast` Pods will continue serving traffic using the latest retrieved configuration (cached) 86 | 87 | == system-sidekiq 88 | 89 | === Responsibilities 90 | * It is the job manager used by `system-app` to process jobs in the background asynchronously 91 | 92 | === Impact when down 93 | * Emails are not sent 94 | * Communication with `backend-listener` breaks: changes in Admin Portal will not propagate to Backend 95 | * Backend alerts will not be triggered 96 | * Webhooks will not be triggered 97 | * Zync will not receive any updates 98 | * Background jobs will not be processed and will start accumulating in `system-redis`, which could lead to database out of memory related problems 99 | 100 | == system-mysql or system-postgresql 101 | 102 | === Responsibilities 103 | * It is the main relational database used by `system-app` 104 | 105 | === Impact when down 106 | * Both `system-app` and `system-sidekiq` components can be considered down if access to the relational database is lost. Refer to the sections on `system-app` and `system-sidekiq` to review impact when these components are down 107 | 108 | == system-redis 109 | 110 | === Responsibilities 111 | * It is the database used by `system-app` to enqueue the jobs consumed by `system-sidekiq` 112 | 113 | === Impact when down 114 | * `system-app` and `system-sidekiq` cannot function without access to the storage, so both components can be considered as down. Refer to the sections on `system-app` and `system-sidekiq` to review impact when these components are down 115 | 116 | == system-memcache 117 | 118 | === Responsibilities 119 | * `system-memcached` is an ephemeral cache of values used to speed-up the performance of the `system-app` web application 120 | 121 | === Impact when down 122 | * `system-app` will run slightly slower (UI page loading times will be worse) while the cache is not accessible. Cache will be rebuilt once the memcached instance is back online 123 | 124 | == system-sphinx 125 | 126 | === Responsibilities 127 | * Full-text search for `system-app` 128 | 129 | === Impact when down 130 | * The search functionality on the `system-app` Admin/Developer Portal (accounts and proxy rules search bars, templates, forum searches...) 
stops working 131 | 132 | == zync 133 | 134 | === Responsibilities 135 | 136 | * Receives events from `system-sidekiq` 137 | * Enqueues those events as new jobs to be processed in the background by `zync-que` 138 | * Those enqueued jobs can be: 139 | - Creation/Update of OpenShift Routes (Admin/Developer portals of each tenant) 140 | - Creation/Update of OpenShift Routes (`apicast-staging` or `apicast-production` domains of each API) 141 | - Synchronization of information with configured 3rd party IDPs 142 | 143 | === Impact when down 144 | * Synchronization of OpenShift Routes for `apicast-staging` and `apicast-production` will not work 145 | * Synchronization of OpenShift Routes for the Admin Portal and the Developer Portal domains will not work 146 | * Synchronization with 3rd party IDPs will not work 147 | * `system-sidekiq` will retry the failed requests for some time 148 | 149 | == zync-que 150 | 151 | === Responsibilities 152 | * Processes the enqueued jobs created by `zync` 153 | * Those jobs can be: 154 | - Creation/Update of OpenShift Routes (Admin/Developer portals of each tenant) 155 | - Creation/Update of OpenShift Routes (`apicast-staging` or `apicast-production` domains of each API) 156 | - Synchronization of information with configured 3rd party IDPs 157 | 158 | === Impact when down 159 | * Synchronization of OpenShift Routes for `apicast-staging` and `apicast-production` will not work 160 | * Synchronization of OpenShift Routes for the Admin Portal and the Developer Portal domains will not work 161 | * Synchronization with 3rd party IDPs will not work 162 | * Jobs will not be processed and will start accumulating in `zync-database`, which could lead to database out of disk space related problems 163 | 164 | == zync-database 165 | 166 | === Responsibilities 167 | * It is the database used by `zync` 168 | * It contains job queues and also some data synchronized from `system-app` 169 | 170 | === Impact when down 171 | * `zync` will not be able to enqueue jobs and `zync-que` will not be able to consume them, so both components can be considered down when database access is lost. Refer to the sections on `zync` and `zync-que` to review impact when these components are down -------------------------------------------------------------------------------- /docs/ha.adoc: -------------------------------------------------------------------------------- 1 | = High Availability 2 | 3 | * link:ha_3scale.adoc[3scale HA in different scenarios] 4 | * link:ha_dbs.adoc[3scale HA with external databases] 5 | -------------------------------------------------------------------------------- /docs/ha_3scale.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = 3scale HA in different scenarios 5 | 6 | toc::[] 7 | 8 | == Highly-available 3scale-operator-based installation 9 | 10 | This section explains how to install 3scale using the link:https://github.com/3scale/3scale-operator[3scale Operator] and how to configure it for a highly-available deployment. If you haven't done so yet, you need to install the 3scale-operator by following link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#installing-3scale[this guide] before proceeding.
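Before going further, you can verify that the operator is in place by checking that its APIManager CustomResourceDefinition exists in the cluster (the CRD name below is assumed from the operator's `apps.3scale.net` API group):
```bash
# Fails if the 3scale operator CRDs have not been installed yet
oc get crd apimanagers.apps.3scale.net
```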
11 | 12 | As usual, to install 3scale using the 3scale-operator, an APIManager custom resource is used (see a simple example link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#basic-installation[here]), but several changes need to be made to ensure a highly-available installation. The following sections delve into each of the changes required. 13 | 14 | === External databases 15 | 16 | The 3scale-operator installs all the databases required by 3scale within the OpenShift cluster by default. For a highly-available installation, however, the databases need to be external to the cluster, so you must deploy them outside the cluster yourself and make the APIManager custom resource aware of them: 17 | 18 | * Instruct the operator not to deploy the critical databases of the 3scale installation. Check the link:ha_dbs.adoc[3scale HA external databases documentation section] for this. 19 | * Notice that the operator does not currently offer an option to deploy `system-memcached` or `system-sphinx` externally. Neither of these components has a critical impact on 3scale when it is down, so the operator does not currently support a highly-available deployment option for them. 20 | 21 | === Scaling the number of replicas 22 | 23 | The number of replicas of each component needs to be at least 2 to support Pod failures. For installations with nodes running in different availability zones, the minimum replica count should match the number of availability zones available. This configuration, used in combination with the node-based Pod anti-affinity rules described in the next section, will ensure that at least one Pod runs in each availability zone. The 3scale components that support more than one replica are: 24 | 25 | - apicast-production 26 | - apicast-staging 27 | - backend-listener 28 | - backend-worker 29 | - system-app 30 | - system-sidekiq 31 | - zync 32 | - zync-que 33 | 34 | Depending on your cluster size, throughput, and latency requirements, you might want to deploy more replicas, at least for the components that are called on each request to the APIs configured in 3scale, that is, `apicast-production`, `backend-listener`, and `backend-worker`. 35 | 36 | Replicas can be configured using the `replicas` attribute in the APIManager spec. 37 | 38 | === Pod Affinity 39 | 40 | Enabling Pod affinities in the operator for every component ensures that Pod replicas from each DeploymentConfig are distributed across different nodes of the cluster and evenly balanced across different availability zones. You can read more about affinities in the link:https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity[Kubernetes docs]. 41 | 42 | To enable affinities you need to follow link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#setting-custom-affinity-and-tolerations[these examples].
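After applying them (or the blocks shown next), you can check that the operator has propagated the settings into the generated DeploymentConfigs. A couple of read-only checks, assuming the default DeploymentConfig names used elsewhere in this repository:
```bash
# Inspect the affinity rendered on the apicast-production DeploymentConfig
oc get dc apicast-production -o jsonpath='{.spec.template.spec.affinity}{"\n"}'
# Inspect its current replica count
oc get dc apicast-production -o jsonpath='{.spec.replicas}{"\n"}'
```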
43 | 44 | The following affinity block, which can be added for example to `apicastProductionSpec` (but also to any other non-database DeploymentConfig spec), adds a *soft podAntiAffinity* configuration using `preferredDuringSchedulingIgnoredDuringExecution` (the scheduler will try to run this set of `apicast-production` Pods on different hosts and in different availability zones, but if that is not possible, it will still allow them to run elsewhere): 45 | ```yaml 46 | affinity: 47 | podAntiAffinity: 48 | preferredDuringSchedulingIgnoredDuringExecution: 49 | - weight: 100 50 | podAffinityTerm: 51 | labelSelector: 52 | matchLabels: 53 | deploymentConfig: apicast-production 54 | topologyKey: kubernetes.io/hostname 55 | - weight: 99 56 | podAffinityTerm: 57 | labelSelector: 58 | matchLabels: 59 | deploymentConfig: apicast-production 60 | topologyKey: topology.kubernetes.io/zone 61 | ``` 62 | In the following example, unlike in the previous one, a *hard podAntiAffinity* configuration is set using `requiredDuringSchedulingIgnoredDuringExecution` (the conditions must be met for a Pod to be scheduled onto a node; otherwise the Pod will not be scheduled at all, which can be risky in certain situations, for example on a cluster with little free capacity where new Pods could not be scheduled). Note that required anti-affinity rules are plain terms, without the `weight` and `podAffinityTerm` wrapper used in the preferred form: 63 | ```yaml 64 | affinity: 65 | podAntiAffinity: 66 | requiredDuringSchedulingIgnoredDuringExecution: 67 | - labelSelector: 68 | matchLabels: 69 | deploymentConfig: apicast-production 70 | topologyKey: kubernetes.io/hostname 71 | - labelSelector: 72 | matchLabels: 73 | deploymentConfig: apicast-production 74 | topologyKey: topology.kubernetes.io/zone 75 | ``` 80 | 81 | === Pod Disruption Budgets 82 | 83 | Enabling Pod disruption budgets (PDBs) ensures that a minimum number of Pods will be available for each component. You can read more about PDBs in the link:https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets[Kubernetes docs]. 84 | 85 | Pod disruption budgets configuration is documented in link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#enabling-pod-disruption-budgets[the user guide]. 86 | 87 | ''' 88 | 89 | The information provided above enables a basic highly-available installation, but you might need to take some extra steps depending on the environment where 3scale is running. The most common scenarios are described next. 90 | 91 | == Single cluster in single availability zone 92 | 93 | With this setup, 3scale will continue working if a node fails, but it will stop working if the availability zone fails. 94 | 95 | Pod affinities do apply (but only using one rule, with `kubernetes.io/hostname`), because there is a single availability zone. 96 | 97 | Pod disruption budgets do apply. 98 | 99 | == Single cluster in multiple availability zones 100 | 101 | The same setup as before, but now nodes are distributed over hosts from different availability zones (physically separated locations or data centers within a cloud region that are tolerant to local failures). 102 | 103 | A minimum of 3 availability zones is recommended to have high availability. 104 | 105 | With this setup, 3scale will continue working even if a node or a whole availability zone fails. 106 | 107 | Pod affinities do apply (with both rules, `kubernetes.io/hostname` and `topology.kubernetes.io/zone`). 108 | 109 | Pod disruption budgets do apply.
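Once replicas, affinities and PDBs are in place, it is worth checking that the Pods really did land on different nodes and availability zones. These read-only commands rely only on labels already used in this document:
```bash
# Show which node each apicast-production replica is running on
oc get pods -l deploymentConfig=apicast-production -o wide

# Show the availability zone label of each node
oc get nodes -L topology.kubernetes.io/zone
```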
110 | 111 | == Multiple clusters in multiple availability zones 112 | 113 | There are several options to install 3scale across several OpenShift clusters and availability zones. 114 | 115 | In the multiple-cluster installation options, clusters work in an *active/passive* configuration, with the *failover* procedure involving a few *manual* steps. Note that there will be service disruption while a human operator performs the required steps to bring the *passive* cluster into *active* mode in case of failure. 116 | 117 | This documentation focuses on deployment using Amazon Web Services (AWS), but the same patterns and configuration options should apply to other public cloud vendors as long as the provider's managed database services offer the required features (support for multiple availability zones, multiple regions, etc.). 118 | 119 | === Common configurations for multiple-cluster installations 120 | 121 | The following configuration items need to be used in any 3scale installation that involves several OpenShift clusters: 122 | 123 | * Use Pod affinities, with both `kubernetes.io/hostname` and `topology.kubernetes.io/zone` rules, in the APIManager custom resource. 124 | * Use Pod disruption budgets in the APIManager custom resource. 125 | * A 3scale installation over multiple clusters *must use the same shared `wildcardDomain`* attribute in the APIManager custom resource spec. Using a different domain for each cluster is not allowed in this installation mode, as the information stored in the database would be conflicting. 126 | * The secrets containing credentials such as tokens and passwords have to be *manually* deployed in all clusters with the same values. By default, the 3scale operator creates them with secure random values on every cluster; in this case, however, you need to have the same credentials in all clusters. The list of secrets and how to configure them can be found in the link:https://github.com/3scale/3scale-operator/blob/master/doc/apimanager-reference.md#apimanager-secrets[3scale Operator docs]. This is the list of secrets that should be mirrored in all clusters: 127 | - backend-internal-api 128 | - system-app 129 | - system-events-hook 130 | - system-master-apicast 131 | - system-seed 132 | * The secrets containing database connection strings (`backend-redis`, `system-database`, `system-redis`, `zync`) have to be *manually* deployed as explained in link:ha_dbs.adoc[external databases]: 133 | - If databases are shared among clusters, they must use the same values on all clusters 134 | - On the other hand, if each cluster has its own databases, they must use different values on each cluster 135 | 136 | === Active-Passive clusters in the same region with shared databases 137 | 138 | image::../images/3scale-ha-active-passive-same-region-shared-databases.png[Active Passive same region shared databases] 139 | 140 | This setup consists of having 2 clusters (or more) in the *same region* and deploying 3scale in *active-passive* mode. One of the clusters will be the *active* one (receiving traffic), whereas the others will be in standby mode without receiving traffic (*passive*), but prepared to assume the *active* role in case there is a failure in the *active* cluster. 141 | 142 | In this installation option, given that only a single region is used, the databases will be shared among all clusters. 143 | 144 | ==== Prerequisites and installation (shared databases) 145 | 146 | .
136 | === Active-Passive clusters on the same region with shared databases
137 |
138 | image::../images/3scale-ha-active-passive-same-region-shared-databases.png[Active Passive same region shared databases.png]
139 |
140 | This setup consists of having 2 clusters (or more) in the *same region* and deploying 3scale in *active-passive* mode. One of the clusters will be the *active* one (receiving traffic), whereas the others will be in standby mode without receiving traffic (*passive*), but prepared to assume the *active* role in case there is a failure in the *active* cluster.
141 |
142 | In this installation option, given that only a single region is used, databases will be shared among all clusters.
143 |
144 | ==== Prerequisites and installation shared databases
145 |
146 | . Create 2 (or more) OpenShift clusters in the *same region* using different availability zones. A minimum of 3 zones is recommended.
147 | . Create all required AWS ElastiCache instances with Multi-AZ enabled:
148 | .. One AWS EC for the *Backend* redis database
149 | .. One AWS EC for the *System* redis database
150 | . Create all required AWS RDS instances with Multi-AZ enabled:
151 | .. One AWS RDS for the *System* database
152 | .. One AWS RDS for the *Zync* database (since 3scale version v2.10)
153 | . Configure an AWS S3 bucket for the *System* assets
154 | . Create a custom domain in AWS Route53 (or your DNS provider) and point it to the OpenShift Router of the *active* cluster (it needs to coincide with the `wildcardDomain` attribute of the APIManager custom resource)
155 | . Install 3scale in the *active* cluster (pointing at the external, shared databases) and then in the *passive* cluster, using an identical APIManager custom resource. After all the Pods are running, change the APIManager of the *passive* cluster to deploy 0 replicas for all the apicast, backend, system and zync Pods. You want 0 replicas so the *passive* cluster does not consume jobs from the *active* databases. You cannot tell the operator to deploy 0 replicas directly, because the deployment would fail due to some Pod dependencies that cannot be met (some Pods check that others are running). That is why, as a workaround, you first deploy normally and then scale down to 0 replicas. This is how it is specified in the APIManager spec:
156 | ```yaml
157 | spec:
158 |   apicast:
159 |     stagingSpec:
160 |       replicas: 0
161 |     productionSpec:
162 |       replicas: 0
163 |   backend:
164 |     listenerSpec:
165 |       replicas: 0
166 |     workerSpec:
167 |       replicas: 0
168 |     cronSpec:
169 |       replicas: 0
170 |   zync:
171 |     appSpec:
172 |       replicas: 0
173 |     queSpec:
174 |       replicas: 0
175 |   system:
176 |     appSpec:
177 |       replicas: 0
178 |     sidekiqSpec:
179 |       replicas: 0
180 | ```
181 |
182 | ==== Manual Failover shared databases [[manual-failover-shared-databases]]
183 |
184 | . In the *active* cluster, scale down the replicas of the *Backend*, *System*, *Zync* and *Apicast* Pods to 0. This cluster becomes the new *passive* cluster, and scaling it down ensures it will not consume jobs from the *active* databases (*downtime starts here*)
185 | . In the *passive* cluster, edit the APIManager to scale up the replicas of the *Backend*, *System*, *Zync* and *Apicast* Pods that were set to 0, so it becomes the new *active* cluster
186 | . In the newly *active* cluster (ex *passive*), recreate the OpenShift Routes created by *Zync*. To do that, run `bundle exec rake zync:resync:domains` from the `system-master` container of the `system-app` Pod. In 3scale v2.9 this command sometimes fails, so retry it until all the Routes are generated (see the sketch after these steps)
187 | . Point the custom domain created in AWS Route53 to the OpenShift Router of the new *active* cluster
188 | . From this moment on, the old *passive* cluster starts receiving traffic and becomes the new *active* one
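For reference, steps 1 to 3 can be driven with `oc` alone. A minimal sketch, where the APIManager name (`example-apimanager`), the namespace (`3scale`), the kubeconfig context names (`active`/`passive`) and the `deploymentConfig=system-app` label are assumptions to adapt to your installation:
```bash
# 1) In the currently active cluster, scale every non-database 3scale workload to 0
#    replicas by patching the APIManager (same replicas: 0 spec as in the installation step).
oc --context active -n 3scale patch apimanager example-apimanager --type merge -p '{
  "spec": {
    "apicast": {"stagingSpec": {"replicas": 0}, "productionSpec": {"replicas": 0}},
    "backend": {"listenerSpec": {"replicas": 0}, "workerSpec": {"replicas": 0}, "cronSpec": {"replicas": 0}},
    "zync":    {"appSpec": {"replicas": 0}, "queSpec": {"replicas": 0}},
    "system":  {"appSpec": {"replicas": 0}, "sidekiqSpec": {"replicas": 0}}
  }}'

# 2) In the passive cluster (about to become active), restore the desired replica counts
#    (1 is just an example value).
oc --context passive -n 3scale patch apimanager example-apimanager --type merge -p '{
  "spec": {
    "apicast": {"stagingSpec": {"replicas": 1}, "productionSpec": {"replicas": 1}},
    "backend": {"listenerSpec": {"replicas": 1}, "workerSpec": {"replicas": 1}, "cronSpec": {"replicas": 1}},
    "zync":    {"appSpec": {"replicas": 1}, "queSpec": {"replicas": 1}},
    "system":  {"appSpec": {"replicas": 1}, "sidekiqSpec": {"replicas": 1}}
  }}'

# 3) In the newly active cluster, regenerate the Zync-managed OpenShift Routes
#    (retry if the rake task fails).
POD=$(oc --context passive -n 3scale get pods -l deploymentConfig=system-app -o name | head -n 1)
oc --context passive -n 3scale exec "$POD" -c system-master -- bundle exec rake zync:resync:domains
```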
189 |
190 | === Active-Passive clusters on different regions with synced databases
191 |
192 | image::../images/3scale-ha-active-passive-different-region-synced-databases.png[Active Passive different region synced databases.png]
193 |
194 | This setup consists of having two clusters (or more) in *different regions* and deploying 3scale in *active-passive* mode. One of the clusters will be the *active* one (receiving traffic), whereas the others will be in standby mode without receiving traffic (*passive*), but prepared to assume the *active* role in case there is a failure in the *active* cluster.
195 |
196 | In this setup, to ensure good database access latency, each cluster will have its own database instances. The databases from the *active* 3scale installation will be replicated to the read-replica databases of the *passive* 3scale installations, so the data is available and up to date in all regions for a possible failover.
197 |
198 | ==== Prerequisites and installation synced databases
199 |
200 | . Create 2 (or more) OpenShift clusters in *different regions* using different availability zones. A minimum of 3 zones is recommended.
201 | . Create all required AWS ElastiCache instances with Multi-AZ enabled *in every region*:
202 | .. Two AWS EC for the *Backend* redis database (one per region)
203 | .. Two AWS EC for the *System* redis database (one per region)
204 | .. In this case, use the link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Redis-Global-Datastore.html[cross-region replication with the Global Datastore feature enabled], so the databases in the *passive* regions will be read-replicas of the master databases in the *active* region
205 | . Create all required AWS RDS instances with Multi-AZ enabled *in every region*:
206 | .. Two AWS RDS for the *System* database (one per region)
207 | .. Two AWS RDS for the *Zync* database (one per region, since 3scale version v2.10)
208 | .. In this case, use link:https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.XRgn.html[cross-region replication], so the databases in the *passive* regions will be read-replicas of the master databases in the *active* region
209 | . Configure an AWS S3 bucket for the *System* assets *in every region*, in this case using link:https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html#crr-scenario[cross-region replication]
210 | . As in the previous scenario, create a custom domain in AWS Route53 (or your DNS provider) and point it to the OpenShift Router of the *active* cluster (it needs to coincide with the `wildcardDomain` attribute of the APIManager custom resource)
211 | . As in the previous scenario, install 3scale in the *active* cluster and then in the *passive* cluster, using an identical APIManager custom resource. After all the Pods are running, change the APIManager of the *passive* cluster to deploy 0 replicas for all the apicast, backend, system and zync Pods. You want 0 replicas so the *passive* cluster does not consume jobs from its databases. You cannot tell the operator to deploy 0 replicas directly, because the deployment would fail due to some Pod dependencies that cannot be met (some Pods check that others are running). That is why, as a workaround, you first deploy normally and then scale down to 0 replicas
212 |
213 | ==== Manual Failover synced databases
214 |
215 | . Execute steps 1, 2 and 3 from <<manual-failover-shared-databases>>
216 | . Every cluster has its own independent databases (read-replicas of the master databases in the *active* region), so it is required to *manually* execute a failover on every database to promote a new master in the *passive* region, which now becomes the *active* region (see the sketch after these steps)
217 | . The database failovers to execute manually are (order does not matter):
218 | .. AWS RDS: *System* and *Zync*
219 | .. AWS ElastiCache: *Backend* and *System*
220 | . Execute steps 4 and 5 from <<manual-failover-shared-databases>>
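As an illustration of step 2, the RDS read replicas can be promoted with the AWS CLI. This is only a sketch: the instance identifiers and the region are hypothetical, and the ElastiCache Global Datastore failover has its own procedure (console or the equivalent `aws elasticache` commands) that is not shown here:
```bash
# Promote the cross-region read replicas in the region that is becoming active.
# Instance identifiers and region are hypothetical placeholders.
aws rds promote-read-replica --region eu-west-1 --db-instance-identifier threescale-system-replica
aws rds promote-read-replica --region eu-west-1 --db-instance-identifier threescale-zync-replica

# Wait until both instances are writable ("available") before scaling up the passive
# cluster, whose database secrets already point at these regional instances.
aws rds wait db-instance-available --region eu-west-1 --db-instance-identifier threescale-system-replica
aws rds wait db-instance-available --region eu-west-1 --db-instance-identifier threescale-zync-replica
```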
221 |
222 | === Active-Active clusters (not supported)
223 |
224 | Having an application work in an *active/active* configuration is always difficult, and in the case of 3scale there are two kinds of limitations:
225 |
226 | - *Soft Databases limitation*: complex, but it can be achieved with some trade-offs; it is mainly the administrator's responsibility (not 3scale's)
227 | - *Hard Application limitation*: this one is a hard limitation imposed by how 3scale manages OpenShift Routes and cannot be worked around
228 |
229 | ==== Soft Databases limitation (out of 3scale scope)
230 |
231 | One of the most difficult parts for any application deployed *active*/*active* (not only 3scale) is having the databases in *active*/*active* mode, and in the case of 3scale there are many different databases involved, each requiring its own *active*/*active* implementation:
232 |
233 | - System-mysql (or system-postgres or system-oracle, you can choose)
234 | - System-redis
235 | - Backend-redis
236 | - Zync-database (postgresql)
237 | - system-sphinx (doesn't need HA)
238 | - system-memcached (doesn't need HA)
239 |
240 | So, from the 3scale point of view, an administrator needs to ensure that the database connection strings configured in the 3scale APIManager custom resource on any OCP cluster always point to a master instance (allowing write operations).
241 |
242 | ____
243 | *NOTE*
244 |
245 | *This database limitation is a soft limitation because, although very complex (implementing active/active for mysql, postgresql or redis is not trivial), it can be achieved.*
246 | ____
247 |
248 | ==== Hard Application limitation: OpenShift Routes
249 |
250 | The main reason why *active*/*active* cannot be achieved, even with *active*/*active* databases, is the OpenShift Routes of the *System* dev/admin portals and of the *APIcasts*. They are managed by the *Zync* component and need to be exactly the same on all OCP clusters.
251 |
252 | This is an example flow, from a client requesting a new *System* dev portal until incoming traffic can eventually reach the system-app pods thanks to a newly created OpenShift Route:
253 |
254 | . System-app receives the request from the UI/API to create a dev/admin portal or apicast OpenShift Route
255 | . System-app enqueues the background job to create those OpenShift Routes into system-redis
256 | . System-sidekiq takes the job from system-redis, processes it and contacts the zync API (zync deployment)
257 | . Zync API (zync deployment) creates a background job in zync-database (postgres)
258 | . Zync-que takes the route-creation job from zync-database (postgres) and creates the final OpenShift Routes in the cluster that processed that job
259 |
260 | With N clusters, say Cluster-A and Cluster-B, a request can be handled by either Cluster-A's system-app or Cluster-B's system-app, so the job is enqueued in that cluster's own database and follows the whole procedure:
261 |
262 | ```
263 | System-app -> system-redis -> system-sidekiq -> zync (api) -> zync-database -> zync-que -> OpenShift Routes creation
264 | ```
265 |
266 | And finally, the OpenShift Routes will be created *only* in the cluster that processed the route-creation job (not in both OCP clusters), so incoming traffic managed by the OpenShift Router will work only on one cluster (the one with the needed OpenShift Routes), which breaks *active*/*active*.
267 |
268 | ____
269 | *NOTE*
270 |
271 | *This application limitation is a hard limitation because the first worker that fetches the job will create the OpenShift Route on the cluster it is running in. So, each cluster will only have a subset of the OpenShift Routes.*
272 | ____
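To see this limitation in practice, you can compare the Routes present in each cluster after creating a few products or developer portals. A minimal check, where the `cluster-a`/`cluster-b` context names and the `3scale` namespace are hypothetical:
```bash
# Each cluster only ends up with the Routes whose creation job it happened to process,
# so the two lists diverge over time.
oc --context cluster-a -n 3scale get routes -o custom-columns=NAME:.metadata.name,HOST:.spec.host
oc --context cluster-b -n 3scale get routes -o custom-columns=NAME:.metadata.name,HOST:.spec.host
```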
273 |
274 | === Active independent clusters
275 |
276 | Since *active*/*active* cannot be achieved within the same 3scale instance, due to the hard limitation imposed by the OpenShift Routes created by *Zync*, what a few customers wanting *active*/*active* have done is deploy independent 3scale instances on different OCP clusters, where every 3scale instance has its own databases, which are not shared and not synced between them.
277 |
278 | Here are some considerations:
279 |
280 | - Each independent 3scale instance will need its own high availability configuration and its own databases, which are not shared between 3scale instances
281 | - All 3scale independent instances over multiple clusters *will use the same shared `wildcardDomain`*
282 | - You need to manage the 3scale configuration with a shared CI/CD, to ensure that every independent 3scale instance has exactly the same configuration (so there is no configuration drift between 3scale instances)
283 | - You need some kind of smart global load balancer (possibly implemented at DNS level) above all OCP clusters, to ensure that the same percentage of traffic goes to every independent 3scale instance on its OCP cluster (for example using a round-robin policy)
284 | - The rate limits will be independent for every 3scale instance. If, for example, you have 2 independent 3scale instances and want a global rate limit of 100 requests/minute, you will need to configure a rate limit of 50 requests/minute on each 3scale instance; with a round-robin DNS policy on the global load balancer you will *theoretically* achieve the intended global rate limit of 100 requests/minute
285 |
286 | With this approach of independent 3scale instances, issues that are unknown at the time of writing this documentation might arise.
--------------------------------------------------------------------------------
/docs/ha_dbs.adoc:
--------------------------------------------------------------------------------
1 | = 3scale HA external databases
2 |
3 | *Using external databases is the recommended 3scale setup* (avoiding the internal databases created by the 3scale Operator by default).
4 |
5 | You can choose between different options when setting up the databases in high availability. One of the first questions that you need to ask yourself is whether you want to deploy them on-premises or whether you would like to use a managed service provided by a cloud provider:
6 |
7 | * Some users might be required by law to have everything on-premises
8 | * For users who are not, and are already using services provided by, for example, Amazon Web Services (AWS), using their managed services also for the 3scale databases might be the most convenient option (making database operation, management, backup and restore easier)
9 |
10 | In order to set up HA for the 3scale databases, you need to tell the operator that internal databases won't be used (and so won't be created in OpenShift), which applies to:
11 |
12 | * The *Backend* and *System* Redis databases
13 | * The relational database used by *System*
14 | * And, starting from 3scale v2.10, also the *Zync* database
15 |
16 | To do so, follow link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-user-guide.md#external-databases-installation[these instructions]. These are the required APIManager spec fields to tell the operator to work with external databases:
17 | ```yaml
18 | apiVersion: apps.3scale.net/v1alpha1
19 | kind: APIManager
20 | metadata:
21 |   name: example-apimanager
22 | spec:
23 |   highAvailability:
24 |     enabled: true # backend redis, system redis, system database external databases
25 |     externalZyncDatabaseEnabled: true # zync external database
26 | ```
27 | Then you need to *manually* configure the associated database connection string URLs in the following secrets (see the sketch after this list):
28 |
29 | * backend-redis
30 | * system-database
31 | * system-redis
32 | * zync
33 |
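As an illustration, the connection-string part of those secrets could look like the sketch below. The key names are reproduced from memory from the operator's apimanager-reference linked earlier, the hostnames and credentials are hypothetical placeholders, and the reference lists additional required keys (passwords, tokens) that are omitted here, so always verify against that document:
```yaml
# Sketch only: hypothetical endpoints and credentials; verify key names and the full
# list of required keys against the 3scale operator apimanager-reference.
apiVersion: v1
kind: Secret
metadata:
  name: backend-redis
stringData:
  REDIS_STORAGE_URL: "redis://backend-redis.example.internal:6379/0"
  REDIS_QUEUES_URL: "redis://backend-redis.example.internal:6379/1"
---
apiVersion: v1
kind: Secret
metadata:
  name: system-redis
stringData:
  URL: "redis://system-redis.example.internal:6379/1"
---
apiVersion: v1
kind: Secret
metadata:
  name: system-database
stringData:
  URL: "mysql2://app:password@system-db.example.internal:3306/system"
---
apiVersion: v1
kind: Secret
metadata:
  name: zync
stringData:
  DATABASE_URL: "postgresql://zync:password@zync-db.example.internal:5432/zync_production"
```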
34 | == Backend Redis and System Redis databases
35 |
36 | Redis offers replication. You need to set up multiple instances of Redis. One of them will be the leader and the others the replicas. Writes go to the leader and are propagated to the replicas. When the leader fails, a replica becomes the new leader. When the old leader comes back online, it will become a replica of the new leader.
37 |
38 | In Redis, the replication mechanism is asynchronous. This means that when something is written to the leader, Redis answers the client without waiting for the writes to be effective in the replicas. This is fast, but when there is a failover, you will lose the data that has not been replicated yet.
39 |
40 | In the case of *Backend*, that trade-off makes sense. In the event of a failover, *Backend* might lose some reports. This means that some rate limits might not be accurate and more calls than configured could be let through for a brief period of time. Also, some statistics might not be exact. This is a trade-off that makes sense in the case of *Backend*: it needs to be fast because it is called on every API call (unless there's some auth caching policy enabled in APIcast), failovers should be pretty rare, and the amount of data not yet in the replicas should be pretty low. Aside from that, the *Backend* database contains some information synchronized from *System* that could also be lost, but that can always be recovered by executing a rake task from System.
41 |
42 | Here are some options available to use as the *Backend* and *System* Redis:
43 |
44 | * link:https://redis.io/topics/sentinel[Redis with sentinels]
45 | - This is the option you need to choose if you want to deploy everything on-premises
46 | * link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/WhatIs.html[AWS ElastiCache]
47 | - Check the link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.Redis.Groups.html[docs]
48 | - It is link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html[Multi-AZ]
49 | - It also supports link:https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Redis-Global-Datastore.html[Multi-region using the "Global Datastore" feature]. Keep in mind that in the multi-region case failovers are not automatic
50 | * link:https://redislabs.com/redis-enterprise-software/overview/[Redis Enterprise]
51 | - Check the link:https://redislabs.com/redis-enterprise/technology/highly-available-redis/[docs]
52 | - It is multi-AZ, and can also be configured across link:https://redislabs.com/redis-enterprise/technology/active-passive-geo-distribution/[Multiple regions]
53 |
54 | Here are some considerations:
55 |
56 | * When choosing and configuring any of those or other options, keep in mind that *Backend* does not support the link:https://redis.io/topics/cluster-tutorial[Redis Cluster mode] (which is different from the usual HA setup with a master and replicas)
57 | * In the case of *Backend*, you can deploy its databases (storage and queues) in a single Redis instance using two different database indexes (typically `db0` and `db1`), but you can also use two different Redis instances if the option that you choose does not support this, or if you prefer to have both usages (storage and queues) on separate instances
58 | * Take into account that using the same Redis instance for both *Backend* and *System* is not supported
59 |
60 | == System and Zync databases
61 |
62 | Here are some options for the relational databases used by *System* and *Zync*. Keep in mind that they need to be two separate databases:
63 |
64 | * link:https://www.crunchydata.com/[Crunchy]
65 | - Read more about it link:https://access.crunchydata.com/documentation/postgres-operator/4.6.1/architecture/high-availability/multi-cluster-kubernetes/[here]
66 | and link:https://access.crunchydata.com/documentation/postgres-operator/4.6.1/advanced/multi-zone-design-considerations/[here]
67 | - It can be deployed using a Kubernetes operator
68 | - By default, the replication is asynchronous, but it can be link:https://access.crunchydata.com/documentation/postgres-operator/4.6.1/architecture/high-availability/[configured to be synchronous]
69 | - It can be configured to be multi-cluster, but in that case the failover is not automatic. Also, the replication in that scenario is synchronous and uses S3 as an intermediate storage
70 | * link:https://aws.amazon.com/es/rds/ha/[AWS RDS]
71 | - The replication is synchronous
72 | - It supports link:https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.XRgn.html[Multi-AZ and Multi-region replicas]. Keep in mind that in the multi-region case failovers are not automatic
--------------------------------------------------------------------------------
/docs/observability.adoc:
--------------------------------------------------------------------------------
1 | :toc:
2 | :toc-placement!:
3 |
4 | = Observability
5 |
6 | toc::[]
7 |
8 | == Application metrics
9 |
10 | You can link:https://github.com/3scale/3scale-operator/blob/master/doc/operator-monitoring-resources.md#enabling-3scale-monitoring[enable 3scale monitoring] by configuring the following APIManager spec fields:
11 |
12 | ```yaml
13 | apiVersion: apps.3scale.net/v1alpha1
14 | kind: APIManager
15 | metadata:
16 |   name: example-apimanager
17 | spec:
18 |   monitoring:
19 |     enabled: true # Mandatory, deploys PodMonitors, GrafanaDashboards and PrometheusRules
20 |     enablePrometheusRules: false # Optional, do not deploy PrometheusRules
21 | ```
22 |
23 | The link:https://github.com/3scale/3scale-operator[3scale Operator] will:
24 |
25 | * Create a *PodMonitor* custom resource for every DeploymentConfig, so prometheus-operator knows how to scrape metrics from every pod
26 | * Create a *GrafanaDashboard* custom resource for every 3scale component:
27 | - There are dashboards for every 3scale component: Backend, System, Zync, Apicast, Apicast Services
28 | - In addition, there are a couple of generic dashboards with kubernetes resource usage by pod and namespace where the 3scale instance is deployed
29 | * Create a *PrometheusRule* custom resource for every 3scale component:
30 | - You can check the default alerts in the link:https://github.com/3scale/3scale-operator/tree/master/doc/prometheusrules[3scale Operator repository]
31 | - As alert management can be heavily customized, and to avoid forcing anyone to use them, they can be excluded from operator management by adding the optional field `enablePrometheusRules: false`
32 | - Having access to the default PrometheusRules custom resources, you can manually deploy the ones you prefer by link:https://github.com/3scale/3scale-operator/tree/master/doc/prometheusrules#tune-the-prometheus-rules-based-on-your-infraestructure[tuning them to your own needs] (updating severity, time duration, thresholds...)
33 | - Bear in mind that every default PrometheusRule has a linked SOP (Standard Operating Procedure) in an annotation; you can check the current alert SOPs in the link:../sops/alerts[sops/alerts] directory
34 |
35 | == Databases metrics
36 |
37 | * 3scale uses different databases (different redis, mysql, postgresql, memcached, sphinx), so any issue on a database might have a huge impact on 3scale performance and availability
38 | * For this reason, it is recommended to set up monitoring on the 3scale databases, along with some prometheus alerts and grafana dashboards to easily check internal database metrics
39 | * You can use the link:https://github.com/3scale-ops/prometheus-exporter-operator[Prometheus Exporter Operator] in order to monitor:
40 | - 3scale databases internal metrics (including alerts and grafana dashboards): redis, mysql, postgresql, memcached, sphinx
41 | - HTTP monitoring (latency, availability, TLS/SSL certificate expiration...) of any 3scale HTTP endpoint
42 | - And, in the case of databases deployed in AWS, monitor AWS CloudWatch metrics from any AWS service (like AWS RDS or AWS ElastiCache...).
Some metrics like CPU or disk space metrics can not be extracted from redis/mysql exporter, so cloudwatch exporter is required -------------------------------------------------------------------------------- /images/3scale-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/3scale/3scale-Operations/51e76a46eb2aae2de6f12df14dfeb559416c4c99/images/3scale-architecture.png -------------------------------------------------------------------------------- /images/3scale-ha-active-passive-different-region-synced-databases.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/3scale/3scale-Operations/51e76a46eb2aae2de6f12df14dfeb559416c4c99/images/3scale-ha-active-passive-different-region-synced-databases.png -------------------------------------------------------------------------------- /images/3scale-ha-active-passive-same-region-shared-databases.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/3scale/3scale-Operations/51e76a46eb2aae2de6f12df14dfeb559416c4c99/images/3scale-ha-active-passive-same-region-shared-databases.png -------------------------------------------------------------------------------- /images/3scale-pods-by-type.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/3scale/3scale-Operations/51e76a46eb2aae2de6f12df14dfeb559416c4c99/images/3scale-pods-by-type.png -------------------------------------------------------------------------------- /sops/README.adoc: -------------------------------------------------------------------------------- 1 | = SOPS (Standard Operating Procedures) 2 | 3 | For each alert defined, there should be some associated documentation that explains what the alert means, what the likely cause is and how to respond to the alert. These SOPs are essential to enabling an SRE team to effectively respond to alerts without needing direct input from an Engineering team. Without these, an SRE team will be aware that there is an ongoing problem, but will not have the necessary knowledge to deal with the problem. 
4 | 5 | Common sections in a SOP may include: 6 | 7 | * Assumptions - how to validate that the alert is actually highlighting a problem, and what prerequisites might be required 8 | * Reference Articles - where to find additional information, or prior art for linked/similar issues 9 | * Corrective Process - what steps to take to resolve the issue 10 | * Success Indicators - how to know when the issue is resolved 11 | * Other Notes - any other relevant information 12 | -------------------------------------------------------------------------------- /sops/alerts/apicast_apicast_latency.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleApicastLatencyHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the `apicast-staging`/`apicast-production` **p99** latency is greater than a given threshold 11 | * Latency metric includes the time the request spent in apicast plus the time it took the upstream to respond 12 | 13 | == Troubleshooting 14 | 15 | * Check if `apicast-staging`/`apicast-production` pods might not be able to deal with incoming traffic due to high CPU usage 16 | - You may need to consider scaling horizontally `apicast-staging`/`apicast-production` deployment 17 | - You may need to consider increasing `apicast-staging`/`apicast-production` deployment resources requests/limits 18 | * Check if upstream response is taking too much time 19 | * Check if the current traffic is normal based on the historical data. An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the `apicast-staging`/`apicast-production` **p99** latency is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/apicast_http_4xx_error_rate.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleApicastHttp4xxErrorRate 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the error rate of `apicast-staging`/`apicast-production` HTTP 4XX requests is greater than a given threshold 11 | 12 | == Troubleshooting 13 | 14 | * Check if `apicast-staging`/`apicast-production` pods might be failing 15 | * Check `apicast-staging`/`apicast-production` pod logs to see if: 16 | - Auth is not valid 17 | - Service/Mapping rule is not being found (either it does not exists in `system-app` or `system-app` is having issues and not returning the required configuration ) 18 | - Upstream is sending a 4XX status code 19 | - Used policy is configured to send 4XX status code 20 | 21 | 22 | == Verification 23 | 24 | * Alert should disappear once the error rate of `apicast-staging`/`apicast-production` HTTP 4XX requests is below the threshold 25 | -------------------------------------------------------------------------------- /sops/alerts/apicast_request_time.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleApicastRequestTime 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if there is a huge number of `apicast-staging`/`apicast-production` requests being processed slow (below a given threshold) 11 | * Request time metric includes only the time the request spent in apicast 12 | 13 | == Troubleshooting 
14 | 15 | * Check if `apicast-staging`/`apicast-production` pods might not be able to deal with incoming traffic due to high CPU usage 16 | - You may need to consider scaling horizontally `apicast-staging`/`apicast-production` deployment 17 | - You may need to consider increasing `apicast-staging`/`apicast-production` deployment resources requests/limits 18 | * Check if the current traffic is normal based on the historical data. An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic 19 | 20 | == Verification 21 | 22 | * Alert should disappear once the number of `apicast-staging`/`apicast-production` slow requests are below the threshold 23 | -------------------------------------------------------------------------------- /sops/alerts/apicast_worker_restart.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleApicastWorkerRestart 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if one of the nginx-worker restarted for any reason 11 | 12 | == Troubleshooting 13 | 14 | * Check why the worker died in the APIcast logs; the usual reasons are: 15 | - Pod hits the memory limits and one of the worker processes was killed. Check why the limits were reached, and if it's needed, increase them 16 | - Node is out of memory, so try to kill containers to schedule to another node 17 | 18 | == Verification 19 | 20 | * Check that the worker died metric does not increase and memory is under the limits. 21 | -------------------------------------------------------------------------------- /sops/alerts/backend_listener_5xx_requests_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleBackendListener5XXRequestsHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `backend-listener` HTTP 5XX requests is greater than a given threshold 11 | 12 | == Troubleshooting 13 | 14 | * Check if `backend-listener` pods might be failing 15 | * Check if `backend-listener` pods might not be able to deal with incoming traffic due to high CPU usage 16 | - You may need to consider scaling horizontally `backend-listener` deployment 17 | - You may need to consider increasing `backend-listener` deployment resources requests/limits 18 | * Check if backend's redis is having issues 19 | * Check if the current traffic is normal based on the historical data. 
An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `backend-listener` HTTP 5XX requests is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/backend_worker_jobs_count_running_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleBackendWorkerJobsCountRunningHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `backend-worker` jobs is greater than a given threshold 11 | 12 | == Troubleshooting 13 | 14 | * Check if `backend-worker` pods might be failing and the jobs are not being consumed at all from backend's redis 15 | * Check if `backend-worker` pods might not be consuming jobs at the required throughput: 16 | - You may need to consider scaling horizontally `backend-worker` deployment 17 | - You may need to consider increasing `backend-worker` deployment resources requests/limits 18 | * Check if the current traffic in `backend-listener` is normal based on the historical data. An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic. Whatever the reason, increased traffic in `backend-listener` generates more jobs for `backend-worker` to process from backend's redis 19 | * Check if backend's redis is having issues 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `backend-worker` jobs is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/container_cpu_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleContainerCPUHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the cpu usage is being very high 11 | 12 | == Troubleshooting 13 | 14 | * Check if CPU resources requests/limits are properly set and appropriate to the historical usage data. If the real usage tends to surpass the requested threshold, you may need to increase the container resources 15 | * Check if maybe you need to scale horizontally the associated deployment with more pods (if possible) to distribute the load 16 | - Some deployments **cannot** be scaled horizontally: `backend-cron` and any db (`backend-redis`,`system-redis`, `system-mysql`, `system-memcache`, `system-sphinx`, `zync-database`) 17 | 18 | == Verification 19 | 20 | * Alert should disappear once CPU usage decreases below the threshold 21 | -------------------------------------------------------------------------------- /sops/alerts/container_cpu_throttling_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleContainerCPUThrottlingHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the cpu usage is being throttled (using more resources than configured) 11 | 12 | == Troubleshooting 13 | 14 | * Check if CPU resources requests/limits are properly set and appropriate to the historical usage data. 
If the real usage tends to surpass the requested threshold, you may need to increase the container resources 15 | 16 | == Verification 17 | 18 | * Alert should disappear once CPU usage is stable 19 | 20 | -------------------------------------------------------------------------------- /sops/alerts/container_memory_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleContainerMemoryHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the memory usage is being very high 11 | 12 | == Troubleshooting 13 | 14 | * Check if Memory resources requests/limits are properly set and appropriate to the historical usage data. If the real usage tends to surpass the requested threshold, you may need to increase the container resources 15 | * Check if maybe you need to scale horizontally the associated deployment with more pods (if possible) to distribute the load 16 | - Some deployments **cannot** be scaled horizontally: `backend-cron` and any db (`backend-redis`,`system-redis`, `system-mysql`, `system-memcache`, `system-sphinx`, `zync-database`) 17 | 18 | == Verification 19 | 20 | * Alert should disappear once Memory usage decreases below the threshold 21 | -------------------------------------------------------------------------------- /sops/alerts/container_waiting.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleContainerWaiting 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if a container has been in `waiting` state for longer than 1 hour 11 | 12 | == Troubleshooting 13 | 14 | * Check if maybe container image registry is not reachable 15 | 16 | == Verification 17 | 18 | * Alert should disappear once container is up and running 19 | 20 | 21 | -------------------------------------------------------------------------------- /sops/alerts/pod_crash_looping.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescalePodCrashLooping 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if a pod is on `CrashLoop` state because one of the containers crashes and is restarted indefinitely 11 | 12 | == Troubleshooting 13 | 14 | * Check pod logs/events to see the reason of the `CrashLoop` 15 | * Check if maybe container resources requests/limits are too low and OOMKiller is acting (container trying to allocate more memory than permitted) 16 | 17 | == Verification 18 | 19 | * Alert should disappear once the pod is up and running 20 | -------------------------------------------------------------------------------- /sops/alerts/pod_not_ready.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescalePodNotReady 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if a pod is **not** on `Ready` state 11 | 12 | == Troubleshooting 13 | 14 | * Execute a describe pod command to check the pod status 15 | 16 | == Verification 17 | 18 | * Alert should disappear once pod passes the readiness probe 19 | -------------------------------------------------------------------------------- /sops/alerts/prometheus_job_down.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescalePrometheusJobDown 5 | 6 | toc::[] 7 | 8 | == 
Description
9 |
10 | * This alert will trigger when a prometheus Job is down, because the `ServiceMonitor`/`PodMonitor` might have issues scraping the application `Service`/`Pods`
11 | * This alert is important because, if you have alerts based on application metrics that cannot be obtained (their associated prometheus job is failing), you will lose visibility of the application
12 |
13 | == Troubleshooting
14 |
15 | * Check if the scraped application (or the app metrics endpoint) might be failing, so the scrape cannot be done
16 | * Check if the `ServiceMonitor`/`PodMonitor` might be misconfigured, pointing to an incorrect label, port, path...
17 | * Check if prometheus is having some problems, or if prometheus has not reloaded its config with a possible new `ServiceMonitor`/`PodMonitor`
18 |
19 | == Verification
20 |
21 | * Alert should disappear once the prometheus target starts scraping all pods
22 |
--------------------------------------------------------------------------------
/sops/alerts/replication_controller_replicas_mismatch.adoc:
--------------------------------------------------------------------------------
1 | :toc:
2 | :toc-placement!:
3 |
4 | = ThreescaleReplicationControllerReplicasMismatch
5 |
6 | toc::[]
7 |
8 | == Description
9 |
10 | * This alert will trigger if a `ReplicationController` does not have the desired number of pods running
11 |
12 | == Troubleshooting
13 |
14 | * Check if the new pods created by the `ReplicationController` might be failing
15 | * Check if there might be an issue with the scheduling due to lack of node resources or tolerations
16 |
17 | == Verification
18 |
19 | * Alert should disappear once all pods are running per `ReplicationController` (running pods = desired pods)
20 |
--------------------------------------------------------------------------------
/sops/alerts/system_app_5xx_requests_high.adoc:
--------------------------------------------------------------------------------
1 | :toc:
2 | :toc-placement!:
3 |
4 | = ThreescaleSystemApp5XXRequestsHigh
5 |
6 | toc::[]
7 |
8 | == Description
9 |
10 | * This alert will trigger if the number of `system-app` HTTP 5XX requests is greater than a given threshold
11 |
12 | == Troubleshooting
13 |
14 | * Check if `system-app` pods might be failing
15 | * Check if `system-app` pods might not be able to deal with incoming traffic due to high CPU usage
16 | - You may need to consider scaling horizontally `system-app` deployment
17 | - You may need to consider increasing `system-app` deployment resources requests/limits
18 | * Check if system's mysql database (or even system's redis) is having issues
19 | * Check if the current traffic is normal based on the historical data.
An unexpected change in the traffic pattern can be legit, and will require scaling up, but it also can be due to an abnormal or malicious traffic 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `system-app` HTTP 5XX requests is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/zync_5xx_requests_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleZync5XXRequestsHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `zync` HTTP 5XX requests is greater than a given threshold 11 | 12 | == Troubleshooting 13 | 14 | * Check if `zync` pods might be failing 15 | * Check if `zync` pods might not be able to deal with incoming traffic due to high CPU usage 16 | - You may need to consider scaling horizontally `zync` deployment 17 | - You may need to consider increasing `zync` deployment resources requests/limits 18 | * Check if zync's postgres database is having issues 19 | 20 | == Verification 21 | 22 | * Alert should disappear once the number of `zync` HTTP 5XX requests is below the threshold 23 | -------------------------------------------------------------------------------- /sops/alerts/zync_que_failed_job_count_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleZyncQueFailedJobCountHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `zync-que` **failed** jobs is greater than a given threshold 11 | * **Failed** jobs are the ones that failed at least once, did not run out of attempts to retry and therefore are scheduled for retry any time soon 12 | 13 | == Troubleshooting 14 | 15 | * Check the logs at `zync-que` pods to see the reason of the failures 16 | 17 | == Verification 18 | 19 | * Alert should disappear once the number of `zync-que` **failed** jobs is below the threshold 20 | -------------------------------------------------------------------------------- /sops/alerts/zync_que_ready_job_count_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleZyncQueReadyJobCountHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * This alert will trigger if the number of `zync-que` **ready** jobs is greater than a given threshold 11 | * **Ready** jobs are the ones that are enqueued and ready to be executed ASAP (never failed, nor got expired) 12 | 13 | == Troubleshooting 14 | 15 | * Check if `zync-que` pods might be failing and the jobs are not being consumed at all from zync's postgres 16 | * Check if `zync-que` pods might not be consuming jobs at the required throughput: 17 | - You may need to consider scaling horizontally `zync-que` deployment 18 | - You may need to consider increasing `zync-que` deployment resources requests/limits 19 | * Check if zync's postgres is having issues 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `zync-que` **ready** jobs is below the threshold 24 | -------------------------------------------------------------------------------- /sops/alerts/zync_que_scheduled_job_count_high.adoc: -------------------------------------------------------------------------------- 1 | :toc: 2 | :toc-placement!: 3 | 4 | = ThreescaleZyncQueScheduledJobCountHigh 5 | 6 | toc::[] 7 | 8 | == Description 9 | 10 | * 
This alert will trigger if the number of `zync-que` **scheduled** jobs is greater than a given threshold 11 | * **Scheduled** jobs are the ones that are enqueued to be executed some time in the future, but not now (never failed, nor got expired) 12 | 13 | == Troubleshooting 14 | 15 | * Check if `zync-que` pods might be failing and the jobs are not being consumed at all from zync's postgres 16 | * Check if `zync-que` pods might not be consuming jobs at the required throughput: 17 | - You may need to consider scaling horizontally `zync-que` deployment 18 | - You may need to consider increasing `zync-que` deployment resources requests/limits 19 | * Check if zync's postgres is having issues 20 | 21 | == Verification 22 | 23 | * Alert should disappear once the number of `zync-que` **scheduled** jobs is below the threshold 24 | --------------------------------------------------------------------------------