├── content ├── Services │ ├── RHOBS │ │ ├── use-cases │ │ │ ├── README.md │ │ │ ├── observability.md │ │ │ └── telemetry.md │ │ ├── consumption-billing.md │ │ ├── analytics.md │ │ ├── README.md │ │ ├── configuring-clients-for-observatorium.md │ │ └── rules-and-alerting.md │ └── README.md ├── Proposals │ ├── README.md │ ├── Done │ │ ├── README.md │ │ └── 202106-proposals-process.md │ ├── Accepted │ │ ├── README.md │ │ ├── 202205-oo.md │ │ └── 202106-handbook.md │ ├── Rejected │ │ └── README.md │ └── process.md ├── Projects │ ├── Go │ │ ├── README.md │ │ ├── prom-client-golang.md │ │ └── grpc-middleware.md │ ├── README.md │ ├── Observability │ │ ├── README.md │ │ ├── prometheus.md │ │ ├── prometheusOp.md │ │ ├── observatorium.md │ │ ├── kubestatemetrics.md │ │ ├── thanos.md │ │ └── kube-rbac-proxy.md │ └── prometheus-api-client-python.md ├── assets │ ├── obs.png │ ├── rhobs.png │ ├── slo-def.png │ ├── telemeter.png │ ├── on-call2021.png │ ├── proposal-how.png │ ├── on-call2021-2.png │ ├── proposal-where.png │ ├── screen-preview.png │ ├── handbook-process.png │ ├── rhobs-logo-white.png │ ├── telemeter-analytics.png │ └── rhobs-logo-transparent.png ├── Teams │ ├── README.md │ ├── observability-platform │ │ ├── jira.png │ │ ├── README.md │ │ └── sre.md │ └── onboarding │ │ └── README.md ├── Products │ ├── README.md │ ├── OpenshiftMonitoring │ │ ├── README.md │ │ ├── dashboards.md │ │ ├── alerting.md │ │ ├── instrumentation.md │ │ ├── faq.md │ │ ├── collecting_metrics.md │ │ └── telemetry.md │ └── Observability-Operator │ │ └── README.md └── contributing.md ├── .mdox.validator.yaml ├── web ├── assets │ ├── icons │ │ └── logo.png │ ├── favicons │ │ ├── favicon.ico │ │ ├── tile70x70.png │ │ ├── pwa-192x192.png │ │ ├── pwa-512x512.png │ │ ├── tile150x150.png │ │ ├── tile310x150.png │ │ ├── tile310x310.png │ │ ├── favicon-16x16.png │ │ ├── favicon-32x32.png │ │ ├── apple-touch-icon-180x180.png │ │ ├── _head.html │ │ └── browserconfig.xml │ └── scss │ │ └── _variables_project.scss ├── layouts │ └── 404.html ├── package.json └── config.toml ├── .gitmodules ├── .bingo ├── go.mod ├── mdox.mod ├── bingo.mod ├── .gitignore ├── hugo.mod ├── variables.env ├── README.md └── Variables.mk ├── .gitignore ├── netlify.toml ├── .github └── workflows │ └── docs.yaml ├── README.md ├── Makefile ├── .mdox.yaml └── LICENSE /content/Services/RHOBS/use-cases/README.md: -------------------------------------------------------------------------------- 1 | # Use Cases 2 | -------------------------------------------------------------------------------- /content/Proposals/README.md: -------------------------------------------------------------------------------- 1 | # Proposals 2 | 3 | > In progress. 4 | -------------------------------------------------------------------------------- /content/Projects/Go/README.md: -------------------------------------------------------------------------------- 1 | # Go 2 | 3 | Projects our group is involved in. 4 | -------------------------------------------------------------------------------- /content/Projects/README.md: -------------------------------------------------------------------------------- 1 | # Projects 2 | 3 | Projects our group is involved in. 
4 | -------------------------------------------------------------------------------- /.mdox.validator.yaml: -------------------------------------------------------------------------------- 1 | version: 1 2 | 3 | validators: 4 | - type: ignore 5 | regex: https://.* -------------------------------------------------------------------------------- /content/Projects/Go/prom-client-golang.md: -------------------------------------------------------------------------------- 1 | # prometheus/client_golang 2 | 3 | > In progress. 4 | -------------------------------------------------------------------------------- /content/assets/obs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/obs.png -------------------------------------------------------------------------------- /content/assets/rhobs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/rhobs.png -------------------------------------------------------------------------------- /content/assets/slo-def.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/slo-def.png -------------------------------------------------------------------------------- /web/assets/icons/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/icons/logo.png -------------------------------------------------------------------------------- /content/Projects/Observability/README.md: -------------------------------------------------------------------------------- 1 | # Observability 2 | 3 | Projects our group is involved in. 
4 | -------------------------------------------------------------------------------- /content/assets/telemeter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/telemeter.png -------------------------------------------------------------------------------- /content/Teams/README.md: -------------------------------------------------------------------------------- 1 | # Team 2 | 3 | Place for documentation, processes for Red Hat Monitoring Group Teams 4 | -------------------------------------------------------------------------------- /content/assets/on-call2021.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/on-call2021.png -------------------------------------------------------------------------------- /content/assets/proposal-how.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/proposal-how.png -------------------------------------------------------------------------------- /web/assets/favicons/favicon.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/favicon.ico -------------------------------------------------------------------------------- /content/assets/on-call2021-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/on-call2021-2.png -------------------------------------------------------------------------------- /content/assets/proposal-where.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/proposal-where.png -------------------------------------------------------------------------------- /content/assets/screen-preview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/screen-preview.png -------------------------------------------------------------------------------- /web/assets/favicons/tile70x70.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/tile70x70.png -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "web/themes/docsy"] 2 | path = web/themes/docsy 3 | url = https://github.com/google/docsy.git 4 | -------------------------------------------------------------------------------- /content/Products/README.md: -------------------------------------------------------------------------------- 1 | # Products 2 | 3 | Place for documentation, processes for Red Hat Observability Products. 
4 | -------------------------------------------------------------------------------- /content/assets/handbook-process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/handbook-process.png -------------------------------------------------------------------------------- /content/assets/rhobs-logo-white.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/rhobs-logo-white.png -------------------------------------------------------------------------------- /web/assets/favicons/pwa-192x192.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/pwa-192x192.png -------------------------------------------------------------------------------- /web/assets/favicons/pwa-512x512.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/pwa-512x512.png -------------------------------------------------------------------------------- /web/assets/favicons/tile150x150.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/tile150x150.png -------------------------------------------------------------------------------- /web/assets/favicons/tile310x150.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/tile310x150.png -------------------------------------------------------------------------------- /web/assets/favicons/tile310x310.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/tile310x310.png -------------------------------------------------------------------------------- /content/Services/README.md: -------------------------------------------------------------------------------- 1 | # Services 2 | 3 | Place for documentation, processes for Red Hat Observability Managed Services 4 | -------------------------------------------------------------------------------- /content/assets/telemeter-analytics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/telemeter-analytics.png -------------------------------------------------------------------------------- /web/assets/favicons/favicon-16x16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/favicon-16x16.png -------------------------------------------------------------------------------- /web/assets/favicons/favicon-32x32.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/favicon-32x32.png -------------------------------------------------------------------------------- /content/assets/rhobs-logo-transparent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/assets/rhobs-logo-transparent.png 
-------------------------------------------------------------------------------- /content/Teams/observability-platform/jira.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/content/Teams/observability-platform/jira.png -------------------------------------------------------------------------------- /web/assets/favicons/apple-touch-icon-180x180.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rhobs/handbook/HEAD/web/assets/favicons/apple-touch-icon-180x180.png -------------------------------------------------------------------------------- /content/Projects/Go/grpc-middleware.md: -------------------------------------------------------------------------------- 1 | # go-grpc-middleware 2 | 3 | https://github.com/grpc-ecosystem/go-grpc-middleware 4 | 5 | > In progress. 6 | -------------------------------------------------------------------------------- /content/Proposals/Done/README.md: -------------------------------------------------------------------------------- 1 | # Done 2 | 3 | This is a list of implemented proposals. 4 | 5 | ## Internal Implemented Proposals 6 | 7 | * ... 8 | -------------------------------------------------------------------------------- /.bingo/go.mod: -------------------------------------------------------------------------------- 1 | module _ // Fake go.mod auto-created by 'bingo' for go -moddir compatibility with non-Go projects. Commit this file, together with other .mod files. -------------------------------------------------------------------------------- /.bingo/mdox.mod: -------------------------------------------------------------------------------- 1 | module _ // Auto generated by https://github.com/bwplotka/bingo. DO NOT EDIT 2 | 3 | go 1.16 4 | 5 | require github.com/bwplotka/mdox v0.9.0 6 | -------------------------------------------------------------------------------- /.bingo/bingo.mod: -------------------------------------------------------------------------------- 1 | module _ // Auto generated by https://github.com/bwplotka/bingo. DO NOT EDIT 2 | 3 | go 1.15 4 | 5 | require github.com/bwplotka/bingo v0.5.1 6 | -------------------------------------------------------------------------------- /.bingo/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | # Ignore everything 3 | * 4 | 5 | # But not these files: 6 | !.gitignore 7 | !*.mod 8 | !README.md 9 | !Variables.mk 10 | !variables.env 11 | 12 | *tmp.mod 13 | -------------------------------------------------------------------------------- /web/layouts/404.html: -------------------------------------------------------------------------------- 1 | {{ define "main"}} 2 |
3 | 4 | Page not found 5 | 6 |
7 | {{ end }} -------------------------------------------------------------------------------- /content/Proposals/Accepted/README.md: -------------------------------------------------------------------------------- 1 | # Accepted 2 | 3 | This is a list of accepted proposals. This means proposal was accepted, but not yet implemented. 4 | 5 | ## Internal Accepted Proposals 6 | 7 | * ... 8 | -------------------------------------------------------------------------------- /content/Proposals/Rejected/README.md: -------------------------------------------------------------------------------- 1 | # Rejected 2 | 3 | This is a list of rejected proposals. 4 | 5 | > NOTE: This does not mean we can return to them and accept! 6 | 7 | ## Internal Rejected Proposals 8 | 9 | * ... 10 | -------------------------------------------------------------------------------- /content/Services/RHOBS/consumption-billing.md: -------------------------------------------------------------------------------- 1 | # Consumption Billing 2 | 3 | Consumption billing is done by Red Hat SubWatch. Data is periodically taken from RHOBS metrics Query APIs. If you are internal Red Hat Managed Service, contact SubWatch team to learn how you can use consumption billing. 4 | -------------------------------------------------------------------------------- /.bingo/hugo.mod: -------------------------------------------------------------------------------- 1 | module _ // Auto generated by https://github.com/bwplotka/bingo. DO NOT EDIT 2 | 3 | go 1.16 4 | 5 | // TODO(bwplotka): Only go 1.14 is on Netlify image, 6 | // so hugo can't be newer (it depends on io/fs from Go 1.16). 7 | 8 | require github.com/gohugoio/hugo v0.80.0 // CGO_ENABLED=1 -tags=extended 9 | -------------------------------------------------------------------------------- /content/Products/OpenshiftMonitoring/README.md: -------------------------------------------------------------------------------- 1 | # Openshift Cluster Monitoring 2 | 3 | Openshift Monitoring is composed of Platform Monitoring and User Workload Monitoring. 4 | 5 | The official [OpenShift documentation](https://docs.openshift.com/container-platform/latest/monitoring/monitoring-overview.html#understanding-the-monitoring-stack_monitoring-overview) contains all user-facing information such as usage and configuration. 6 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Binaries for programs and plugins 2 | *.exe 3 | *.exe~ 4 | *.dll 5 | *.so 6 | *.dylib 7 | 8 | # Test binary, built with `go test -c` 9 | *.test 10 | 11 | # Output of the go coverage tool, specifically when used with LiteIDE 12 | *.out 13 | .idea 14 | .bin 15 | .envrc 16 | web/node_modules 17 | web/resources/_gen 18 | web/public 19 | 20 | # Dependency directories (remove the comment below to include it) 21 | # vendor/ 22 | -------------------------------------------------------------------------------- /content/Teams/observability-platform/README.md: -------------------------------------------------------------------------------- 1 | # Observability Platform 2 | 3 | We are the team responsible (mainly) for: 4 | 5 | * [RHOBS](../../Services/RHOBS) including: 6 | * [Telemeter](../../Services/RHOBS/use-cases/telemetry.md) 7 | * [MST tenants](../../Services/RHOBS/use-cases/observability.md). 8 | * [Thanos](../../Projects/Observability/thanos.md). 
9 | * [Observatorium](../../Projects/Observability/observatorium.md) 10 | -------------------------------------------------------------------------------- /web/assets/favicons/_head.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | -------------------------------------------------------------------------------- /web/assets/scss/_variables_project.scss: -------------------------------------------------------------------------------- 1 | $primary: #2B388F; 2 | 3 | .td-sidebar-nav { 4 | font-size: 90% !important; 5 | } 6 | 7 | .td-max-width-on-larger-screens, .td-content > pre, .td-content > .highlight, .td-content > .lead, .td-content > h1, .td-content > h2, .td-content > ul, .td-content > ol, .td-content > p, .td-content > blockquote, .td-content > dl dd, .td-content .footnotes, .td-content > .alert { 8 | max-width: 95% !important; 9 | } 10 | -------------------------------------------------------------------------------- /web/assets/favicons/browserconfig.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | transparent 10 | 11 | 12 | -------------------------------------------------------------------------------- /web/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "web", 3 | "version": "0.1.0", 4 | "description": "I don't know what I do, glad Prem Saraswat is there to help me out", 5 | "main": "index.js", 6 | "scripts": { 7 | "test": "echo \"Error: no test specified\" && exit 1" 8 | }, 9 | "author": "me", 10 | "license": "Apache-2.0", 11 | "dependencies": { 12 | "autoprefixer": "^10.2.5", 13 | "postcss": "^8.2.15", 14 | "postcss-cli": "^8.3.1" 15 | } 16 | } 17 | -------------------------------------------------------------------------------- /content/Products/OpenshiftMonitoring/dashboards.md: -------------------------------------------------------------------------------- 1 | # Dashboards 2 | 3 | ## Targeted audience 4 | 5 | This document is intended for OpenShift developers that want to add visualization dashboards for their operators and operands in the OCP administrator console. 6 | 7 | ## Getting started 8 | 9 | Please refer to the [document](https://docs.google.com/document/d/1UwHwkL-YtrRJYm-A922IeW3wvKEgCR-epeeeh3CBOGs/edit) written by the Observability UI team. 10 | 11 | The team can also be found in the `#forum-observability-ui` Slack channel. 12 | -------------------------------------------------------------------------------- /netlify.toml: -------------------------------------------------------------------------------- 1 | [build] 2 | publish = "web/public" 3 | functions = "functions" 4 | 5 | [build.environment] 6 | # HUGO_VERSION = "..." is set by bingo which allows reproducible local environment. 7 | NODE_VERSION = "15.5.1" 8 | NPM_VERSION = "7.3.0" 9 | 10 | [context.production] 11 | command = "make web WEBSITE_BASE_URL=${URL}" 12 | 13 | [context.deploy-preview] 14 | command = "make web WEBSITE_BASE_URL=${DEPLOY_PRIME_URL}" 15 | 16 | [context.branch-deploy] 17 | command = "make web WEBSITE_BASE_URL=${DEPLOY_PRIME_URL}" 18 | -------------------------------------------------------------------------------- /.bingo/variables.env: -------------------------------------------------------------------------------- 1 | # Auto generated binary variables helper managed by https://github.com/bwplotka/bingo v0.5.1. DO NOT EDIT. 2 | # All tools are designed to be build inside $GOBIN. 
3 | # Those variables will work only until 'bingo get' was invoked, or if tools were installed via Makefile's Variables.mk. 4 | GOBIN=${GOBIN:=$(go env GOBIN)} 5 | 6 | if [ -z "$GOBIN" ]; then 7 | GOBIN="$(go env GOPATH)/bin" 8 | fi 9 | 10 | 11 | BINGO="${GOBIN}/bingo-v0.5.1" 12 | 13 | HUGO="${GOBIN}/hugo-v0.80.0" 14 | 15 | MDOX="${GOBIN}/mdox-v0.9.0" 16 | 17 | -------------------------------------------------------------------------------- /content/Projects/Observability/prometheus.md: -------------------------------------------------------------------------------- 1 | # Prometheus 2 | 3 | Prometheus is a monitoring and alerting system which collects and stores metrics. In the broader sense, it is a collection of tools including (but not limited to) Alertmanager, node_exporter, etc. 4 | 5 | ## Links 6 | 7 | * [GitHub project](https://github.com/prometheus/prometheus) 8 | * [Upstream documentation](https://prometheus.io/docs/) 9 | * [Community](https://prometheus.io/community/) 10 | * [Maintainers](https://github.com/prometheus/prometheus/blob/main/MAINTAINERS.md) 11 | -------------------------------------------------------------------------------- /content/Projects/Observability/prometheusOp.md: -------------------------------------------------------------------------------- 1 | # Prometheus Operator 2 | 3 | The Prometheus Operator is a Kubernetes Operator which manages Prometheus, Alertmanager and ThanosRuler deployments. 4 | 5 | ## Links 6 | 7 | * [GitHub project](https://github.com/prometheus-operator/prometheus-operator) 8 | * [Upstream documentation](https://prometheus-operator.dev/) 9 | * [Contributing guide](https://prometheus-operator.dev/docs/community/contributing/) 10 | * [Maintainers](https://github.com/prometheus-operator/prometheus-operator/blob/main/MAINTAINERS.md) 11 | -------------------------------------------------------------------------------- /content/Products/Observability-Operator/README.md: -------------------------------------------------------------------------------- 1 | # Observability Operator 2 | 3 | [Observability Operator](https://github.com/rhobs/observability-operator) is a Kubernetes operator created in Go to setup and manage multiple highly available Monitoring Stack using Prometheus and Thanos Querier. 4 | 5 | Eventually, this operator may also cover Logging and Tracing. 6 | 7 | The project relies heavily on the [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) library. 8 | 9 | > More details to follow in this document 10 | 11 | Check the observability-operator's [README](https://github.com/rhobs/observability-operator#readme) page for more details. 12 | -------------------------------------------------------------------------------- /content/Projects/prometheus-api-client-python.md: -------------------------------------------------------------------------------- 1 | # prometheus-api-client-python 2 | 3 | This is a python wrapper for the Prometheus API. It also includes some tools for processing metric data using Pandas Dataframes. 
4 | 5 | ### Installation 6 | 7 | To install the latest release: 8 | 9 | `pip install prometheus-api-client` 10 | 11 | ### Links: 12 | * [Source](https://github.com/aicoe/prometheus-api-client-python) 13 | * [Usage Examples](https://github.com/aicoe/prometheus-api-client-python#usage) 14 | * [API Documentation](https://prometheus-api-client-python.readthedocs.io/en/master/source/prometheus_api_client.html) 15 | * [Slack](https://operatefirst.slack.com/archives/C01S2707XKM) 16 | -------------------------------------------------------------------------------- /.github/workflows/docs.yaml: -------------------------------------------------------------------------------- 1 | name: docs 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | tags: 8 | pull_request: 9 | 10 | jobs: 11 | lint: 12 | runs-on: ubuntu-latest 13 | name: Docs formatting and link checking. 14 | steps: 15 | - name: Checkout code into the Go module directory. 16 | uses: actions/checkout@v2 17 | 18 | - name: Install Go 19 | uses: actions/setup-go@v2 20 | with: 21 | go-version: 1.15.x 22 | 23 | - uses: actions/cache@v4 24 | with: 25 | path: ~/go/pkg/mod 26 | key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }} 27 | 28 | - name: Formatting docs. 29 | env: 30 | GOBIN: /tmp/.bin 31 | run: make docs-check 32 | -------------------------------------------------------------------------------- /.bingo/README.md: -------------------------------------------------------------------------------- 1 | # Project Development Dependencies. 2 | 3 | This is directory which stores Go modules with pinned buildable package that is used within this repository, managed by https://github.com/bwplotka/bingo. 4 | 5 | * Run `bingo get` to install all tools having each own module file in this directory. 6 | * Run `bingo get ` to install that have own module file in this directory. 7 | * For Makefile: Make sure to put `include .bingo/Variables.mk` in your Makefile, then use $() variable where is the .bingo/.mod. 8 | * For shell: Run `source .bingo/variables.env` to source all environment variable for each tool. 9 | * For go: Import `.bingo/variables.go` to for variable names. 10 | * See https://github.com/bwplotka/bingo or -h on how to add, remove or change binaries dependencies. 11 | 12 | ## Requirements 13 | 14 | * Go 1.14+ 15 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Handbook 2 | 3 | ![RHOBS](content/assets/rhobs-logo-white.png) 4 | 5 | [![Netlify Status](https://api.netlify.com/api/v1/badges/f0764fff-c6f4-46e5-8b46-e265782f42f1/deploy-status)](https://app.netlify.com/sites/rhobs-handbook/deploys) 6 | 7 | *Welcome to the Red Hat Monitoring Group Handbook* 8 | 9 | ## Goals 10 | 11 | For now check the [Handbook Proposal](content/Proposals/Accepted/202106-handbook.md) 12 | 13 | ## Using 14 | 15 | This handbook was designed to be used in two ways: 16 | 17 | * Through our website `https://rhobs-handbook.netlify.app/`. 18 | * Directly from GitHub UI by navigating to [`content`](https://github.com/rhobs/handbook/tree/main/content) directory. 19 | 20 | ## Contributing 21 | 22 | Anyone is welcome to read and contribute to our handbook! It's licensed with Apache 2, and any team member, another Red Hat team member or even anyone outside Red Hat is welcome to help us to our refine processes, update documentation and FAQs. 23 | 24 | If you were inspired by handbook let us know too! 
25 | 26 | See details on how to contribute in our [Contributing Guide](content/contributing.md) 27 | 28 | ## Initial Author 29 | 30 | [`@bwplotka`](https://bwplotka.dev) 31 | -------------------------------------------------------------------------------- /content/Services/RHOBS/use-cases/observability.md: -------------------------------------------------------------------------------- 1 | # MST 2 | 3 | > For RHOBS Overview see [this document](README.md) 4 | 5 | TBD(https://github.com/rhobs/handbook/issues/23) 6 | 7 | ## Support 8 | 9 | TBD(https://github.com/rhobs/handbook/issues/23) 10 | 11 | ### Escalations 12 | 13 | TBD(https://github.com/rhobs/handbook/issues/23) 14 | 15 | ## Service Level Agreement 16 | 17 | ![SLO](../../../assets/slo-def.png) 18 | 19 | If you manage [“Observatorium”](../../../Projects/Observability/observatorium.md), the Service Level Objectives can go ultra-high in all dimensions, such as availability and data loss. The freshness aspect for reading APIs is trickier, as it also depends on client collection pipeline availability, among other things, which is out of the scope of the Observatorium. 20 | 21 | RHOBS has currently established the following default Service Level Objectives. This is based on the infrastructure dependencies we have listed [here (internal)](https://visual-app-interface.devshift.net/services#/services/rhobs/app.yml). 22 | 23 | > NOTE: We envision future improvements to the quality of service by offering additional SLO tiers for different tenants. Currently, all tenants have the same SLOs. 24 | 25 | > Previous docs (internal): 26 | > * [2019-10-30](https://docs.google.com/document/d/1LN-3yDtXmiDmGi5ZwllklJCg3jx-4ysNv6oUZudFj2g/edit#heading=h.20e6cn146nls) 27 | > * [2021-02-10](https://docs.google.com/document/d/1iGRsFMR9YmWG8Mk95UXU_PAUKvk1_zyNUkevbk7ZnFw/edit#heading=h.bupciudrwmna) 28 | 29 | TBD(https://github.com/rhobs/handbook/issues/23) 30 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | include .bingo/Variables.mk 2 | 3 | WEBSITE_DIR ?= web 4 | WEBSITE_BASE_URL ?= https://rhobs-handbook.netlify.app/ 5 | MD_FILES_TO_FORMAT=$(shell find content -name "*.md") README.md 6 | 7 | help: ## Displays help. 8 | @awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n make \033[36m\033[0m\n\nTargets:\n"} /^[a-zA-Z_-]+:.*?##/ { printf " \033[36m%-10s\033[0m %s\n", $$1, $$2 }' $(MAKEFILE_LIST) 9 | 10 | all: docs web 11 | 12 | .PHONY: docs 13 | docs: ## Format docs, localise links, ensure GitHub format. 14 | docs: $(MDOX) 15 | $(MDOX) fmt --links.localize.address-regex="https://rhobs-handbook.netlify.app/.*" *.md $(MD_FILES_TO_FORMAT) 16 | 17 | .PHONY: docs-check 18 | docs-check: ## Checks if docs are localised and links are correct. 19 | docs-check: $(MDOX) 20 | $(MDOX) fmt --check \ 21 | -l --links.validate.config-file=.mdox.validator.yaml --links.localize.address-regex="https://rhobs-handbook.netlify.app/.*" *.md $(MD_FILES_TO_FORMAT) 22 | 23 | .PHONY: web-pre 24 | web-pre: ## Pre process docs using mdox transform which converts it from GitHub structure to Hugo one. 25 | web-pre: $(MDOX) 26 | @rm -rf $(WEBSITE_DIR)/content # Do it in mdox itself. 27 | $(MDOX) transform --log.level=debug --config-file=.mdox.yaml 28 | 29 | $(WEBSITE_DIR)/node_modules: 30 | @git submodule update --init --recursive 31 | cd $(WEBSITE_DIR)/themes/docsy/ && npm install && rm -rf content 32 | 33 | .PHONY: web 34 | web: ## Build website in publish directory. 
35 | web: $(WEBSITE_DIR)/node_modules $(HUGO) web-pre 36 | cd $(WEBSITE_DIR) && $(HUGO) -b $(WEBSITE_BASE_URL) 37 | 38 | .PHONY: web-serve 39 | web-serve: ## Build website and run in foreground process in localhost:1313 40 | web-serve: $(WEBSITE_DIR)/node_modules $(HUGO) web-pre 41 | cd $(WEBSITE_DIR) && $(HUGO) serve 42 | -------------------------------------------------------------------------------- /content/Services/RHOBS/analytics.md: -------------------------------------------------------------------------------- 1 | # Analytics based on Observability Data 2 | 3 | We use the [`Apache Parquet`](https://parquet.apache.org/) format (e.g [Amazon RedShift](https://aws.amazon.com/redshift/)) to persist metrics data for RHOBS analytics. 4 | 5 | Red Hat maintains community-driven project that is able to transform Prometheus data into Parquet files. See [Obslytics project to learn more](https://github.com/thanos-community/obslytics). It works by consuming (currently internal) RHOBS Store API. 6 | 7 | If you wish to use this tool against RHOBS, contact us to discuss this feature request or add ticket to our `MON` JIRA project. 8 | 9 | ## Existing Production Pipelines 10 | 11 | ### Telemeter Analytics 12 | 13 | DataHub and CCX teams are currently leveraging such analytics pipeline based on [Telemeter](use-cases/telemetry.md). The pipeline looks as follows: 14 | 15 | ![](../../assets/telemeter-analytics.png) 16 | 17 | *[Source](https://docs.google.com/drawings/d/19Z0vtVjlvU_p6aU3hn6PXdN0AWbd2FA4RAC0dT3hU_w/edit)* 18 | 19 | 1. DataHub runs [Thanos Replicate](https://thanos.io/tip/components/tools.md/#bucket-replicate) tool in their PSI cluster, which copies all fresh 2h size blocks from RHOBS, once they appear in RHOBS S3 object storage (~3h after ingestion). 20 | 2. Replicated blocks are copied to DataHub Ceph object storage and stored for unlimited time. 21 | 3. DataHub runs smaller Thanos system. For this part of the system, only the Thanos Store Gateway is needed to provide Store API for further Obslytics step. 22 | 4. Instance of [Obslytics](https://github.com/thanos-community/obslytics) transforms Thanos metrics format to [`Apache Parquet`](https://parquet.apache.org/) and saves it in Ceph Object Storage. 23 | 5. Parquet data is then imported in [Amazon RedShift](https://aws.amazon.com/redshift/), and is joined with other data sources. 24 | -------------------------------------------------------------------------------- /.bingo/Variables.mk: -------------------------------------------------------------------------------- 1 | # Auto generated binary variables helper managed by https://github.com/bwplotka/bingo v0.5.1. DO NOT EDIT. 2 | # All tools are designed to be build inside $GOBIN. 3 | BINGO_DIR := $(dir $(lastword $(MAKEFILE_LIST))) 4 | GOPATH ?= $(shell go env GOPATH) 5 | GOBIN ?= $(firstword $(subst :, ,${GOPATH}))/bin 6 | GO ?= $(shell which go) 7 | 8 | # Below generated variables ensure that every time a tool under each variable is invoked, the correct version 9 | # will be used; reinstalling only if needed. 10 | # For example for bingo variable: 11 | # 12 | # In your main Makefile (for non array binaries): 13 | # 14 | #include .bingo/Variables.mk # Assuming -dir was set to .bingo . 15 | # 16 | #command: $(BINGO) 17 | # @echo "Running bingo" 18 | # @$(BINGO) 19 | # 20 | BINGO := $(GOBIN)/bingo-v0.5.1 21 | $(BINGO): $(BINGO_DIR)/bingo.mod 22 | @# Install binary/ries using Go 1.14+ build command. This is using bwplotka/bingo-controlled, separate go module with pinned dependencies. 
23 | @echo "(re)installing $(GOBIN)/bingo-v0.5.1" 24 | @cd $(BINGO_DIR) && $(GO) build -mod=mod -modfile=bingo.mod -o=$(GOBIN)/bingo-v0.5.1 "github.com/bwplotka/bingo" 25 | 26 | HUGO := $(GOBIN)/hugo-v0.80.0 27 | $(HUGO): $(BINGO_DIR)/hugo.mod 28 | @# Install binary/ries using Go 1.14+ build command. This is using bwplotka/bingo-controlled, separate go module with pinned dependencies. 29 | @echo "(re)installing $(GOBIN)/hugo-v0.80.0" 30 | @cd $(BINGO_DIR) && CGO_ENABLED=1 $(GO) build -tags=extended -mod=mod -modfile=hugo.mod -o=$(GOBIN)/hugo-v0.80.0 "github.com/gohugoio/hugo" 31 | 32 | MDOX := $(GOBIN)/mdox-v0.9.0 33 | $(MDOX): $(BINGO_DIR)/mdox.mod 34 | @# Install binary/ries using Go 1.14+ build command. This is using bwplotka/bingo-controlled, separate go module with pinned dependencies. 35 | @echo "(re)installing $(GOBIN)/mdox-v0.9.0" 36 | @cd $(BINGO_DIR) && $(GO) build -mod=mod -modfile=mdox.mod -o=$(GOBIN)/mdox-v0.9.0 "github.com/bwplotka/mdox" 37 | 38 | -------------------------------------------------------------------------------- /content/Projects/Observability/observatorium.md: -------------------------------------------------------------------------------- 1 | # Observatorium 2 | 3 | Observatorium is an observability system designed to enable the ingestion, storage (short and long term) and querying capabilities for three major observability signals: metrics, logging and tracing. It unifies horizontally scalable, multi-tenant systems like Thanos, Loki, and in the future, Jaeger to deploy them in a single stack with consistent APIs. On top of that it's designed to be managed as a service thanks to consistent tenancy, authorization and rate limiting across all three signals. 4 | 5 | ### Official Documentation 6 | 7 | https://observatorium.io 8 | 9 | ### APIs 10 | 11 | TBD(https://github.com/rhobs/handbook/issues/22) 12 | 13 | #### Read: Metrics 14 | 15 | * GET /api/metrics/v1/api/v1/query 16 | * GET /api/metrics/v1/api/v1/query_range 17 | * GET /api/metrics/v1/api/v1/series 18 | * GET /api/metrics/v1/api/v1/labels 19 | * GET /api/metrics/v1/api/v1/label/ /values 20 | 21 | ### Notable Talks/Blog Posts 22 | 23 | * 04.2021: [Upstream-First High Scale Prometheus Ecosystem](https://www.youtube.com/watch?v=r0fRFH_921E&list=PLj6h78yzYM2PZb0QuIkm6ZY-xTuNA5zRO&index=6) 24 | 25 | ### Bug Trackers 26 | 27 | https://github.com/observatorium/observatorium/issues 28 | 29 | ### Communication Channels 30 | 31 | The CNCF Slack workspace's ([join here](https://cloud-native.slack.com/messages/CHY2THYUU)) channels: 32 | 33 | * `#observatorium` for user related things. 34 | * `#observatorium-dev` for developer related things. 35 | 36 | ### Proposal Process 37 | 38 | TBD 39 | 40 | ### Our Usage 41 | 42 | We use Observatorium as a Service for our [Red Hat Observability Service (RHOBS)](../../Services/RHOBS). 43 | 44 | We also know of several other companies installing Observatorium on their own (as of 2021.07.07): 45 | 46 | * RHACM 47 | * Managed Tenants until they can get access to RHBOBS (e.g. 
Kafka Team) 48 | * IBM 49 | * GitPod 50 | 51 | ### Maintainers 52 | 53 | https://github.com/observatorium/observatorium/blob/main/docs/community/maintainers.md 54 | -------------------------------------------------------------------------------- /content/Projects/Observability/kubestatemetrics.md: -------------------------------------------------------------------------------- 1 | # Kube State Metrics 2 | 3 | `kube-state-metrics` (KSM) is a service that listens to the Kubernetes API server and generates metrics about the state of the objects. It's an add-on agent to generate and expose cluster-level metrics. 4 | 5 | ### Official Documentation 6 | 7 | - Overview: [`README.md`](https://github.com/kubernetes/kube-state-metrics/tree/main/docs) 8 | - Resource-wise documentation: [`/docs`](https://github.com/kubernetes/kube-state-metrics/tree/main/docs) 9 | - Design documentation: [`/docs/design`](https://github.com/kubernetes/kube-state-metrics/tree/main/docs/design) 10 | - Developer documentation: [`/docs/developer`](https://github.com/kubernetes/kube-state-metrics/tree/main/docs/developer) 11 | 12 | ### Informational Media 13 | 14 | - [PromCon 2017: Lightning Talk - `kube-state-metrics` - Frederic Branczyk](https://www.youtube.com/watch?v=nUkHeY48mIQ) 15 | - [Intro and Deep Dive: Kubernetes SIG Instrumentation - David Ashpole & Han Kang, Frederic Branczyk](https://youtu.be/NzoG--2UqEk?t=888) 16 | - [Episode 38: Custom Resources in `kube-state-metrics`](https://www.youtube.com/watch?v=rkaG4M5mo-8) 17 | - [`kube-state-metrics` on Google Cloud](https://cloud.google.com/stackdriver/docs/managed-prometheus/exporters/kube_state_metrics) 18 | 19 | ### Bug Trackers 20 | 21 | - [`/issues`](https://github.com/kubernetes/kube-state-metrics/issues) 22 | - [`issues.redhat.com`](https://issues.redhat.com/browse/MON-2858?jql=project%20%3D%20MON%20AND%20issuetype%20in%20(Bug%2C%20Epic%2C%20Story%2C%20Task%2C%20Sub-task)%20AND%20resolution%20%3D%20Unresolved%20AND%20text%20~%20%22kube-state-metrics%22%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC) 23 | 24 | ### Get Involved 25 | 26 | - [`#kube-state-metrics`](https://kubernetes.slack.com/archives/CJJ529RUY) 27 | - [`#sig-instrumentation`](https://kubernetes.slack.com/archives/C20HH14P7) 28 | - [SIG Instrumentation Biweekly Minutes](https://docs.google.com/document/d/1FE4AQ8B49fYbKhfg4Tx0cui1V0eI4o3PxoqQPUwNEiU) 29 | 30 | ### Internal Usages 31 | 32 | - [`cluster-monitoring-operator/assets/kube-state-metrics/`](https://github.com/openshift/cluster-monitoring-operator/tree/master/assets/kube-state-metrics) 33 | - [`openshift-state-metrics/`](https://github.com/openshift/openshift-state-metrics) 34 | 35 | ### Maintainers 36 | 37 | - [`/OWNERS.md`](https://github.com/kubernetes/kube-state-metrics/blob/main/OWNERS) 38 | 39 | ### Miscellaneous 40 | 41 | - [`/releases`](https://github.com/kubernetes/kube-state-metrics/releases) 42 | -------------------------------------------------------------------------------- /content/Projects/Observability/thanos.md: -------------------------------------------------------------------------------- 1 | # Thanos 2 | 3 | Thanos is a horizontally scalable, multi-tenant monitoring system in a form of distributed time series database that supports Prometheus data format. 
4 | 5 | ### Official Documentation 6 | 7 | https://thanos.io/tip/thanos/getting-started.md 8 | 9 | ### APIs 10 | 11 | * Querying: Prometheus APIs, Remote Read 12 | * Series: Prometheus APIs, gRPC SeriesAPI 13 | * Metric Metadata: Prometheus API, gRPC MetricMetadataAPI 14 | * Rules, Alerts: Prometheus API, gRPC RulesAPI 15 | * Targets: Prometheus API, gRPC TargetsAPI 16 | * Exemplars: Prometheus API, gRPC ExemplarsAPI 17 | * Receiving: Prometheus Remote Write 18 | 19 | ### Tutorials 20 | 21 | https://katacoda.com/thanos 22 | 23 | ### Notable Talks/Blog Posts 24 | 25 | * 12.2020: [Absorbing Thanos Infinite Powers for Multi-Cluster Telemetry](https://www.youtube.com/watch?v=6Nx2BFyr7qQ) 26 | * 12.2020: [Turn It Up to a Million: Ingesting Millions of Metrics with Thanos Receive](https://www.youtube.com/watch?v=5MJqdJq41Ms) 27 | * 02.2019: [FOSDEM + demo](https://fosdem.org/2019/schedule/event/thanos_transforming_prometheus_to_a_global_scale_in_a_seven_simple_steps/) 28 | * 03.2019: [Alibaba Cloud user story](https://www.youtube.com/watch?v=ZS6zMksfipc) 29 | * [CloudNative Deep Dive](https://www.youtube.com/watch?v=qQN0N14HXPM) 30 | * [CloudNative Intro](https://www.youtube.com/watch?v=m0JgWlTc60Q) 31 | * [Prometheus in Practice: HA with Thanos](https://www.slideshare.net/ThomasRiley45/prometheus-in-practice-high-availability-with-thanos-devopsdays-edinburgh-2019) 32 | 33 | * [Banzai Cloud user story](https://banzaicloud.com/blog/multi-cluster-monitoring/) 34 | 35 | ### Bug Trackers 36 | 37 | https://github.com/thanos-io/thanos/issues 38 | 39 | ### Communication Channels 40 | 41 | The CNCF Slack workspace's ([join here](https://cloud-native.slack.com/messages/CHY2THYUU)) channels: 42 | 43 | * `#thanos` for user related things. 44 | * `#thanos-dev` for developer related things. 
45 | 46 | ### Proposal Process 47 | 48 | https://thanos.io/tip/contributing/contributing.md/#adding-new-features--components 49 | 50 | ### Our Usage 51 | 52 | We use Thanos in many places within Red Hat, notably: 53 | 54 | * In [Prometheus Operator (sidecar)](prometheusOp.md) 55 | * In Openshift Platform Monitoring (PM) (see [CMO](../../Products/OpenshiftMonitoring)) 56 | * In Openshift User Workload Monitoring (UWM) 57 | * In [RHOBS](../../Services/RHOBS) (so [Observatorium](observatorium.md)) 58 | 59 | ### Maintainers 60 | 61 | https://thanos.io/tip/thanos/maintainers.md/#core-maintainers-of-this-repository 62 | -------------------------------------------------------------------------------- /.mdox.yaml: -------------------------------------------------------------------------------- 1 | version: 1 2 | 3 | inputDir: "content" 4 | outputDir: "web/content/en" 5 | extraInputGlobs: 6 | - "README.md" 7 | - "web/assets/favicons" 8 | 9 | gitIgnored: true 10 | localLinksStyle: 11 | hugo: 12 | indexFileName: "_index.md" 13 | 14 | transformations: 15 | - glob: "../README.md" 16 | path: /_index.md 17 | frontMatter: 18 | template: | 19 | title: "{{ .Origin.FirstHeader }}" 20 | lastmod: "{{ .Origin.LastMod }}" 21 | menu: 22 | main: 23 | weight: 1 24 | pre: 25 | cascade: 26 | - type: "docs" 27 | _target: 28 | path: "/**" 29 | 30 | - glob: "Proposals/README.md" 31 | path: _index.md 32 | frontMatter: 33 | template: | 34 | title: "{{ .Origin.FirstHeader }}" 35 | lastmod: "{{ .Origin.LastMod }}" 36 | menu: 37 | main: 38 | weight: 2 39 | pre: 40 | 41 | - glob: "Products/OpenshiftMonitoring/instrumentation.md" 42 | frontMatter: 43 | template: | 44 | title: "{{ .Origin.FirstHeader }}" 45 | lastmod: "{{ .Origin.LastMod }}" 46 | weight: 5 47 | - glob: "Products/OpenshiftMonitoring/collecting_metrics.md" 48 | frontMatter: 49 | template: | 50 | title: "{{ .Origin.FirstHeader }}" 51 | lastmod: "{{ .Origin.LastMod }}" 52 | weight: 10 53 | - glob: "Products/OpenshiftMonitoring/alerting.md" 54 | frontMatter: 55 | template: | 56 | title: "{{ .Origin.FirstHeader }}" 57 | lastmod: "{{ .Origin.LastMod }}" 58 | weight: 20 59 | - glob: "Products/OpenshiftMonitoring/dashboards.md" 60 | frontMatter: 61 | template: | 62 | title: "{{ .Origin.FirstHeader }}" 63 | lastmod: "{{ .Origin.LastMod }}" 64 | weight: 25 65 | - glob: "Products/OpenshiftMonitoring/telemetry.md" 66 | frontMatter: 67 | template: | 68 | title: "{{ .Origin.FirstHeader }}" 69 | lastmod: "{{ .Origin.LastMod }}" 70 | weight: 30 71 | - glob: "Products/OpenshiftMonitoring/faq.md" 72 | frontMatter: 73 | template: | 74 | title: "{{ .Origin.FirstHeader }}" 75 | lastmod: "{{ .Origin.LastMod }}" 76 | weight: 40 77 | 78 | - glob: "**/README.md" 79 | path: _index.md 80 | frontMatter: &defaultFrontMatter 81 | template: | 82 | title: "{{ .Origin.FirstHeader }}" 83 | lastmod: "{{ .Origin.LastMod }}" 84 | 85 | - glob: "**.md" 86 | frontMatter: 87 | <<: *defaultFrontMatter 88 | 89 | - glob: "../web/assets/**" 90 | path: "/../../static/**" 91 | 92 | - glob: "**" 93 | path: "/../../static/**" 94 | -------------------------------------------------------------------------------- /web/config.toml: -------------------------------------------------------------------------------- 1 | baseURL = "/" 2 | title = "handbook" 3 | theme = ["docsy"] 4 | 5 | enableRobotsTXT = true 6 | 7 | 8 | # Language settings 9 | contentDir = "content/en" 10 | defaultContentLanguage = "en" 11 | defaultContentLanguageInSubdir = false 12 | enableMissingTranslationPlaceholders = true 13 | 14 | disableKinds = 
["taxonomy", "taxonomyTerm"] 15 | 16 | # Highlighting config 17 | pygmentsCodeFences = true 18 | pygmentsUseClasses = false 19 | pygmentsUseClassic = false 20 | # See https://help.farbox.com/pygments.html 21 | pygmentsStyle = "tango" 22 | 23 | # Image processing configuration. 24 | [imaging] 25 | resampleFilter = "CatmullRom" 26 | quality = 75 27 | anchor = "smart" 28 | 29 | [services.googleAnalytics] 30 | # Comment out the next line to disable GA tracking. Also disables the feature described in [params.ui.feedback]. 31 | # id = "UA-00000000-0" 32 | 33 | # Language configuration 34 | 35 | [languages.en] 36 | title = "Red Hat Monitoring Group Handbook" 37 | description = "Red Hat Monitoring Group Handbook" 38 | languageName = "English" 39 | 40 | [markup.goldmark.renderer] 41 | unsafe = true 42 | 43 | [markup.highlight] 44 | # See a complete list of available styles at https://xyproto.github.io/splash/docs/all.html 45 | style = "tango" 46 | guessSyntax = "true" 47 | 48 | # Comment out if you don't want the "print entire section" link enabled. 49 | [outputs] 50 | section = ["HTML", "print"] 51 | 52 | [params] 53 | copyright = "The Red Hat Monitoring Group Authors" 54 | 55 | # Repository configuration (URLs for in-page links to opening issues and suggesting changes) 56 | # TODO(bwplotka): Changes are suggested to wrong, auto-generated directory, disabled for now. Fix it! 57 | #github_repo = "https://github.com/rhobs/handbook" 58 | github_branch= "main" 59 | # Enable Algolia DocSearch 60 | algolia_docsearch = true 61 | offlineSearch = false 62 | prism_syntax_highlighting = false 63 | 64 | # User interface configuration 65 | [params.ui] 66 | sidebar_menu_compact = false 67 | breadcrumb_disable = false 68 | sidebar_search_disable = false 69 | navbar_logo = true 70 | footer_about_disable = false 71 | 72 | # Adds a H2 section titled "Feedback" to the bottom of each doc. The responses are sent to Google Analytics as events. 73 | # This feature depends on [services.googleAnalytics] and will be disabled if "services.googleAnalytics.id" is not set. 74 | # If you want this feature, but occasionally need to remove the "Feedback" section from a single page, 75 | # add "hide_feedback: true" to the page's front matter. 76 | [params.ui.feedback] 77 | enable = true 78 | # The responses that the user sees after clicking "yes" (the page was helpful) or "no" (the page was not helpful). 79 | yes = 'Glad to hear it! Please tell us how we can improve.' 80 | no = 'Sorry to hear that. Please tell us how we can improve.' 81 | 82 | # Adds a reading time to the top of each doc. 83 | # If you want this feature, but occasionally need to remove the Reading time from a single page, 84 | # add "hide_readingtime: true" to the page's front matter 85 | [params.ui.readingtime] 86 | enable = true 87 | 88 | [[params.links.user]] 89 | name = "GitHub" 90 | url = "https://github.com/rhobs/handbook" 91 | icon = "fab fa-github" 92 | desc = "Ask question, discuss on GitHub!" 93 | 94 | [[menu.main]] 95 | name = "Blog #monitoring" 96 | weight = 50 97 | url = "https://www.openshift.com/blog/tag/monitoring" 98 | -------------------------------------------------------------------------------- /content/Products/OpenshiftMonitoring/alerting.md: -------------------------------------------------------------------------------- 1 | # Alerting 2 | 3 | ## Targeted audience 4 | 5 | This document is intended for OpenShift developers that want to write alerting rules for their operators and operands. 
6 | 7 | ## Configuring alerting rules 8 | 9 | You configure alerting rules based on the metrics being collected for your component(s). To do so, you should create `PrometheusRule` objects in your operator/operand namespace which will also be picked up by the Prometheus operator (provided that the namespace has the `openshift.io/cluster-monitoring="true"` label for layered operators). 10 | 11 | Here is an example of a PrometheusRule object with a single alerting rule: 12 | 13 | ```yaml 14 | apiVersion: monitoring.coreos.com/v1 15 | kind: PrometheusRule 16 | metadata: 17 | name: cluster-example-operator-rules 18 | namespace: openshift-example-operator 19 | spec: 20 | groups: 21 | - name: operator 22 | rules: 23 | - alert: ClusterExampleOperatorUnhealthy 24 | annotations: 25 | description: Cluster Example operator running in pod {{$labels.namespace}}/{{$labels.pods}} is not healthy. 26 | summary: Operator Example not healthy 27 | expr: | 28 | max by(pod, namespace) (last_over_time(example_operator_healthy[5m])) == 0 29 | for: 15m 30 | labels: 31 | severity: warning 32 | ``` 33 | 34 | You can choose to configure all your alerting rules into a single `PrometheusRule` object or split them into different objects (one per component). The mechanism to deploy the object(s) depends on the context: it can be deployed by the Cluster Version Operator (CVO), the Operator Lifecycle Manager (OLM) or your own operator. 35 | 36 | ## Guidelines 37 | 38 | Please refer to the [Alerting Consistency](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md) OpenShift enhancement proposal for the recommendations applying to OCP built-in alerting rules. 39 | 40 | If you need a review of alerting rules from the OCP monitoring team, you can reach them on the `#forum-openshift-monitoring` channel. 41 | 42 | ## Identifying alerting rules without a namespace label 43 | 44 | The enhancement proposal mentioned above states the following for OCP built-in alerts: 45 | 46 | > Alerts SHOULD include a namespace label indicating the source of the alert. 47 | 48 | Unfortunately this isn't something that we can verify by static analysis because the namespace label can come from the PromQL result or be added statically. Nevertheless we can still use the Telemetry data to identify OCP alerts that don't respect this statement. 49 | 50 | First, create an OCP cluster from the latest stable release. 
Once it is installed, run this command to return the list of all OCP built-in alert names: 51 | 52 | ```bash 53 | curl -sk -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)" \ 54 | https://$(oc get routes -n openshift-monitoring thanos-querier -o jsonpath='{.status.ingress[0].host}')/api/v1/rules \ 55 | | jq -cr '.data.groups | map(.rules) | flatten | map(select(.type =="alerting")) | map(.name) | unique |join("|")' 56 | ``` 57 | 58 | Then from https://telemeter-lts.datahub.redhat.com, retrieve the list of all alerts matching the names that fired without a namespace label, grouped by minor release: 59 | 60 | ``` 61 | count by (alertname,version) ( 62 | alerts{alertname=~"",namespace=""} * 63 | on(_id) group_left(version) max by(_id, version) ( 64 | label_replace(id_version_ebs_account_internal:cluster_subscribed{version=~"4.\d\d.*"}, "version", "$1", "version", "^(4.\\d+).*$") 65 | ) 66 | ) 67 | ``` 68 | 69 | You should now track back the non-compliant alerts to their component of origin and file bugs against them ([example](https://issues.redhat.com/browse/OCPBUGS-17191)). 70 | 71 | The exercise should be done at regular intervals, at least once per release cycle. 72 | -------------------------------------------------------------------------------- /content/Products/OpenshiftMonitoring/instrumentation.md: -------------------------------------------------------------------------------- 1 | # Instrumentation guidelines 2 | 3 | This document details good practices to adopt when you instrument your application for Prometheus. It is not meant to be a replacement of the [upstream documentation](https://prometheus.io/docs/practices/instrumentation/) but an introduction focused on the OpenShift use case. 4 | 5 | ## Targeted audience 6 | 7 | This document is intended for OpenShift developers that want to instrument their operators and operands for Prometheus. 8 | 9 | ## Getting started 10 | 11 | To instrument software written in Golang, see the official [Golang client](https://pkg.go.dev/github.com/prometheus/client_golang). For other languages, refer to the [curated list](https://prometheus.io/docs/instrumenting/clientlibs/#client-libraries) of client libraries. 12 | 13 | Prometheus stores all data as time series which are a stream of timestamped values (samples) identified by a metric name and a set of unique labels (a.ka. dimensions or key/value pairs). Its data model is described in details in this [page](https://prometheus.io/docs/concepts/data_model/). Time series would be represented like this: 14 | 15 | ``` 16 | # HELP http_requests_total Total number of HTTP requests by method and handler. 17 | # TYPE http_requests_total counter 18 | http_requests_total{method="GET", handler="/messages"} 500 19 | http_requests_total{method="POST", handler="/messages"} 10 20 | ``` 21 | 22 | Prometheus supports 4 [metric types](https://prometheus.io/docs/concepts/metric_types/): 23 | * Gauge which represents a single numerical value that can arbitrarily go up and down. 24 | * Counter, a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. When querying a counter metric, you usually apply a `rate()` or `increase()` function. 25 | * Histogram which represents observations (usually things like request durations or response sizes) and counts them in configurable buckets. 26 | * Summary which represents observations too but it reports configurable quantiles over a (fixed) sliding time window. 
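As a quick illustration of the Histogram type (a sketch only — the metric name, buckets and `handler` label below are assumptions for illustration, not metrics that any OpenShift component actually exposes), recording request durations with the Golang client looks roughly like this:

```golang
// Sketch: a histogram of request durations with hand-picked buckets.
requestDuration := prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds", // hypothetical metric name
        Help:    "Duration of HTTP requests in seconds.",
        Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
    },
    []string{"handler"},
)
prometheus.MustRegister(requestDuration)

// Record a single request against the /messages handler that took 320ms.
requestDuration.WithLabelValues("/messages").Observe(0.320)
```

The buckets are what make queries such as `histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m]))` possible, which is usually the reason to prefer a Histogram over a Summary.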
In practice, they are rarely used. 27 | 28 | Adding metrics for any operation should be part of the code review process like any other factor that is kept in mind for production ready code. 29 | 30 | To learn more about when to use which metric type, how to name metrics and how to choose labels, read the following documentation. {{% alert color="info" %}}OpenShift follows the outlined conventions whenever possible. Any exceptions should be reviewed and properly motivated.{{% /alert %}} 31 | * [Prometheus naming recommendations](https://prometheus.io/docs/practices/naming/) 32 | * [Prometheus instrumentation](https://prometheus.io/docs/practices/instrumentation/) 33 | * [Kubernetes metric instrumentation guide](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/metric-instrumentation.md) 34 | * [Instrumenting a Go application for Prometheus](https://prometheus.io/docs/guides/go-application/) 35 | 36 | ## Example 37 | 38 | Here is a fictional Go code example instrumented with a Gauge metric and a multi-dimensional Counter metric: 39 | 40 | ```golang 41 | cpuTemp := prometheus.NewGauge(prometheus.GaugeOpts{ 42 | Name: "cpu_temperature_celsius", 43 | Help: "Current temperature of the CPU.", 44 | }) 45 | 46 | hdFailures := prometheus.NewCounterVec( 47 | prometheus.CounterOpts{ 48 | Name: "hd_errors_total", 49 | Help: "Number of hard-disk errors.", 50 | }, 51 | []string{"device"}, 52 | )} 53 | 54 | reg := prometheus.NewRegistry() 55 | reg.MustRegister(cpuTemp, m.hdFailures) 56 | 57 | cpuTemp.Set(55.2) 58 | 59 | // Record 1 failure for the /dev/sda device. 60 | hdFailures.With(prometheus.Labels{"device":"/dev/sda"}).Inc() 61 | // Record 3 failures for the /dev/sdb device. 62 | hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc() 63 | hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc() 64 | hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc() 65 | ``` 66 | 67 | ## Labels 68 | 69 | Defining when to add and when not to add a label to a metric is a [difficult choice](https://prometheus.io/docs/practices/instrumentation/#use-labels). The general rule is: the fewer labels, the better. Every unique combination of label names and values creates a new time series and Prometheus memory usage is mostly driven by the number of times series loaded into RAM during ingestion and querying. A good rule of thumb is to have less than 10 time series per metric name and target. A common mistake is to store dynamic information such as usernames, IP addresses or error messages into a label which can lead to thousands of time series. 70 | 71 | Labels such as `pod`, `service`, `job` and `instance` shouldn't be set by the application. Instead they are discovered at runtime by Prometheus when it queries the Kubernetes API to discover which targets should be scraped for metrics. 72 | 73 | ## Custom collectors 74 | 75 | It is sometimes not feasible to use one of the 4 Metric types, typically when your application already has the information stored for other purpose (for instance, it maintains a list of custom objects retrieved from the Kubernetes API). In this case, the [custom collector](https://pkg.go.dev/github.com/prometheus/client_golang@v1.20.4/prometheus#hdr-Custom_Collectors_and_constant_Metrics) pattern can be useful. 
76 | 77 | You can find an example of this pattern in the [github.com/prometheus-operator/prometheus-operator](https://github.com/prometheus-operator/prometheus-operator/blob/3df0811bdc7c046cb283006d94092e42219a0e2f/pkg/operator/operator.go#L166-L191) project. 78 | 79 | ## Next steps 80 | 81 | * [Collect metrics](collecting_metrics.md) with Prometheus. 82 | * [Configure alerting](alerting.md) with Prometheus. 83 | * [Add dashboards](dashboards.md) to the OCP console. 84 | -------------------------------------------------------------------------------- /content/Services/RHOBS/use-cases/telemetry.md: -------------------------------------------------------------------------------- 1 | # Telemetry (Telemeter) 2 | 3 | > For RHOBS Overview see [this document](README.md) 4 | 5 | Telemeter is the metrics-only hard tenant of the RHOBS service designed as a centralized OpenShift Telemetry pipeline for OpenShift Container Platform. It is an essential part of gathering real-time telemetry for remote health monitoring, automation and billing purposes. 6 | 7 | * [OpenShift Documentation](https://docs.openshift.com/container-platform/latest/support/remote_health_monitoring/about-remote-health-monitoring.html#telemetry-about-telemetry_about-remote-health-monitoring) about the Telemetry service. 8 | * [Internal documentation](https://gitlab.cee.redhat.com/data-hub/dh-docs/-/blob/master/docs/interacting-with-telemetry-data.adoc) for interacting with the Telemetry data. 9 | 10 | ## Product Managers 11 | 12 | * Roger Floren 13 | 14 | ## Big Picture Overview 15 | 16 | ![](../../../assets/telemeter.png) 17 | 18 | *[Source](https://docs.google.com/drawings/d/1eIAxCUS2v8Bt0-Ken2gHnx-Q1u8JNPfi2rMB-Azz5zI/edit)* 19 | 20 | ## Support 21 | 22 | To escalate issues use, depending on issue type: 23 | 24 | * For questions related to the service or kind of data it ingests, use `telemetry-sme@redhat.com` (internal) mail address. For quick questions you can try to use [#forum-telemetry](https://coreos.slack.com/archives/CEG5ZJQ1G) on CoreOS Slack. 25 | * For functional bugs or feature requests use Bugzilla, with Product: `Openshift Container Platform` and `Telemeter` component ([example bug](https://bugzilla.redhat.com/show_bug.cgi?id=1914956)). You can additionally notify us about a new bug on `#forum-telemetry` on CoreOS Slack. 26 | * For functional bugs or feature requests for historical storage (Data Hub), use the PNT Jira project. 27 | 28 | > For the managing team: See our internal [agreement document](https://docs.google.com/document/d/1iAhzVxm2ovqkWxJCLplwR7Z-1gzXhfRKcHqXnpQh9Hg/edit#). 29 | 30 | ### Escalations 31 | 32 | For urgent escalation use: 33 | 34 | * For Telemeter Service Unavailability: `@app-sre-ic` and `@observatorium-oncall` on CoreOS Slack. 35 | * For Historical Data (DataHub) Service Unavailability: `@data-hub-ic` on CoreOS Slack. 36 | 37 | ## Service Level Agreement 38 | 39 | ![SLO](../../../assets/slo-def.png) 40 | 41 | RHOBS has currently established the following default Service Level Objectives. This is based on the infrastructure dependencies we have listed [here (internal)](https://visual-app-interface.devshift.net/services#/services/rhobs/app.yml). 
42 | 43 | > Previous docs (internal): 44 | > * [2019-10-30](https://docs.google.com/document/d/1LN-3yDtXmiDmGi5ZwllklJCg3jx-4ysNv6oUZudFj2g/edit#heading=h.20e6cn146nls) 45 | > * [2021-02-10](https://docs.google.com/document/d/1iGRsFMR9YmWG8Mk95UXU_PAUKvk1_zyNUkevbk7ZnFw/edit#heading=h.bupciudrwmna) 46 | 47 | #### Metrics SLIs 48 | 49 | | API | SLI Type | SLI Spec | Period | SLI Implementation | Dashboard | 50 | |----------|--------------|----------------------------------------|--------|--------------------------------|-----------------------------------------------------------------------------------------------------| 51 | | `/write` | Availability | The % of successful (non 5xx) requests | 28d | Metrics from Observatorium API | [Dashboard](https://grafana.app-sre.devshift.net/d/Tg-mH0rizaSJDKSADJ/telemeter?orgId=1&refresh=1m) | 52 | | `/write` | Latency | The % of requests under X latency | 28d | Metrics from Observatorium API | [Dashboard](https://grafana.app-sre.devshift.net/d/Tg-mH0rizaSJDKSADJ/telemeter?orgId=1&refresh=1m) | 53 | 54 | Read Metrics TBD. 55 | 56 | *Agreements*: 57 | 58 | > NOTE: No entry for your case (e.g. dev/staging) means zero formal guarantees. 59 | 60 | | SLI | Date of Agreement | Tier | SLO | Notes | 61 | |-----------------------|-------------------|--------------------|---------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 62 | | `/write` Availability | 2020/2019 | Internal (default) | 99% success rate for incoming [requests](#write-limits) | This depends on SSO RedHat com SLO (98.5%). 
In worst case (everyone needs to refresh token) we have below 98.5%, in the avg case with caching being 5m (we cannot change it) ~99% (assuming we can preserve 5m) | 63 | | `/write` Latency | 2020/2019 | Internal (default) | 95% of [requests](#write-limits) < 250ms, 99% of [requests](#write-limits) < 1s | | 64 | 65 | ##### Write Limits 66 | 67 | Within our SLO, the write request must match following criteria to be considered valid: 68 | 69 | * Valid remote write requests using official remote write protocol (See conformance test) 70 | * Valid credentials: (explanation TBD(https://github.com/rhobs/handbook/issues/24)) 71 | * Max samples: TBD(https://github.com/rhobs/handbook/issues/24) 72 | * Max series: TBD(https://github.com/rhobs/handbook/issues/24) 73 | * Rate limit: TBD(https://github.com/rhobs/handbook/issues/24) 74 | 75 | TODO: Provide example tune-ed client Prometheus configurations for remote write 76 | -------------------------------------------------------------------------------- /content/Proposals/Accepted/202205-oo.md: -------------------------------------------------------------------------------- 1 | ## Evolution of the Observability Operator (fka MSO) 2 | 3 | * **Owners:** 4 | * [`@dmohr`](https://github.com/danielm0hr) 5 | * [`@jan--f`](https://github.com/jan--f) 6 | 7 | * **Related Tickets:** 8 | * https://issues.redhat.com/browse/MON-2279 9 | * https://issues.redhat.com/browse/MON-2482 10 | * https://issues.redhat.com/browse/MON-2508 (short-term focus) 11 | 12 | * **Other docs:** 13 | * [Original proposal for MSO](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/monitoring-stack-operator.md) 14 | 15 | > TL;DR: As a mid-to-long term vision for the Monitoring Stack Operator we propose to rename MSO as OO (Observability Operator). With the name, we propose to establish OO as *the* open source, (single) cluster-side component for OpenShift observability needs. In conjunction with Observatorium / RHOBS - as *the* multi-cluster, multi-tenant, scalable observability backend component. OO is thought to manage different kinds of cluster-side monitoring, alerting, logging, and tracing (and potentially profiling) stack setups covering the needs of OpenShift variants like the client-side for the fully managed multi-cluster use cases to HyperShift and single node air gapped setups. 16 | 17 | ## Why 18 | 19 | With the rise of new OpenShift variants with very different needs regarding observability, the desire for a consistent way of providing differently configured monitoring stacks grows. Examples: 20 | - Traditional single-cluster on-prem OpenShift deployments need a self-contained monitoring stack where all components (scraping, storage, visualization, alerting, log collection) run on the same cluster. This is the kind of setup [Cluster Monitoring Operator](https://github.com/openshift/cluster-monitoring-operator) was designed for. 21 | - Multi-cluster deployments need a central (aggregated, federated) view on metrics, logs and alerts. Certain components of the stack don't run on the workload clusters but in some central infrastructure. 22 | - Resource-constraint deployments need a stripped down version of the monitoring stack, e.g. only forwarding signals to a central infrastructure or disabling certain parts of the stack completely. 23 | - Mixed deployments (e.g. OpenShift + OpenStack) are conceptually very similar to the multi-cluster use case, also needing a central pane of glass for observability signals. 24 | - Special purpose deployments (e.g. 
OpenShift Data Foundation) have special requirements when it comes to monitoring, that are tricky to align with the existing CMO setup. 25 | - Looking at eventually correlating different observability signals also the cluster-side stack would potentially benefit from a holistic approach for deploying monitoring, logging and tracing components and configuring them in the right way to work together. 26 | - Managed service deployments need a multi tenancy capable way of deploying many similarly built monitoring stacks to a single cluster. This is the short-term focus for OO. 27 | 28 | The proposal is to combine all these (and more) use cases into one single (meta) operator (as recommended by [the operator best practices](https://sdk.operatorframework.io/docs/best-practices/best-practices/#development)) which can be configured with e.g. presets to instruct lower-level operators (like prometheus-operator, potentially Loki operator or Jaeger one) to deploy purpose-built monitoring stacks for different uses cases. This is similar to the original CMO concept but with much higher flexibility, and feature velocity in mind, thanks to not being tied to OpenShift Core versioning. 29 | 30 | Additionally, supporting multiple different ways of deploying monitoring stacks (CMO as the standard OpenShift way, OO for managed services, something else for e.g. HyperShift or edge scenarios, ...) is a burden for the team. Instead, eventually supporting only one way to deploy monitoring stacks - with OO - covering all these use cases makes it a lot simpler and far more consistent. 31 | 32 | ### Pitfalls of the current solution 33 | 34 | CMO is built for traditional self-operated single-cluster focused deployments of OpenShift. It intentionally lacks the flexibility for many other use cases (see above) in order to provide monitoring that is resilient against configuration drift. E.g. the reason for creating OO (MSO) in the first place - supporting managed service uses cases - can't currently be covered by CMO. See the [original MSO proposal](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/monitoring-stack-operator.md) for more details. 35 | 36 | The results of this lack of flexibility can be readily observed: Red Hat teams have built their own solutions for their monitoring use cases, e.g. leveraging community operators or self-written deployments, with varying success, reliability and supportability. 37 | 38 | ## Goals 39 | 40 | Goals and use cases for the solution as proposed in [How](#how): 41 | 42 | * Widen the scope of OO to cover additional use cases besides managed services. 43 | * Replace existing ways of deploying monitoring stacks across Red Hat products with OO. 44 | * Focus on OpenShift use cases primarily but don't exclude vanilla Kubernetes as a possible target. 45 | * Create an API that easily allows common configuration across observability signals. 46 | 47 | ## Non-Goals 48 | 49 | * Create a multi-cluster capable observability operator. 50 | 51 | ## How 52 | 53 | * Define use cases to be covered in detail. 54 | * Prioritize use cases and add needed features one by one. 55 | 56 | ## Alternatives 57 | 58 | 1. Tackle each monitoring use case across Red Hat products one by one and build a custom solution for them. This would lead to many different (but potentially simpler) implementations which need to be supported. 59 | 2. Develop signal specific operators that can handle the required use cases. 
This would likely require an API between those operators to apply common configuration. 60 | 61 | ## Action Plan 62 | 63 | Collection of requirements and prioritization of use cases currently in progress (Q3 2022). 64 | -------------------------------------------------------------------------------- /content/Proposals/process.md: -------------------------------------------------------------------------------- 1 | # Proposals Process 2 | 3 | ## Where to Propose Changes/Where to Submit Proposals? 4 | 5 | As defined in the [handbook proposal](Accepted/202106-handbook.md#pitfalls-of-the-current-solution), the Handbook is meant to be an index for our team resources and a linking point to other distributed projects we maintain or contribute to. 6 | 7 | First, we need to identify whether the idea we have is something we can contribute to an upstream project, or whether it does not fit anywhere else, in which case we can leverage the [Handbook Proposal directory](..) and the [process](#proposal-process). See the algorithm below to find out: 8 | 9 | ![where](../assets/proposal-where.png) 10 | 11 | [Internal Team Drive for Public and Confidential Proposals](https://drive.google.com/drive/folders/1WGqC3gMCxIQlrnjDUYfNUTPYYRI5Cxto) 12 | 13 | ## Proposal Process 14 | 15 | If there is no problem, there is no need for changing anything, and no need for a proposal. This might feel trivial, but we should first ask ourselves this question before even thinking about writing a proposal. 16 | 17 | It takes time to propose an idea, find consensus and implement more significant concepts, so let's not waste time before it's worth it. But, unfortunately, even good ideas sometimes have to wait for a good moment to discuss them. 18 | 19 | Let's assume the idea sounds interesting to you; what to do next? Where to propose it? How to review it? Follow the algorithm below: 20 | 21 | ![where](../assets/proposal-how.png) 22 | 23 | > Note: It's totally ok to reject a proposal if a team member feels the idea is wrong. It's better to explicitly oppose it than to ignore it and leave it in limbo. 24 | 25 | > NOTE: We would love to host the Logging and Tracing Teams if they choose to follow our process, but we don't want to enforce it. We are happy to extend this process from the Monitoring Group handbook to the Observability Group. Still, it has to grow organically (if the Logging and Tracing teams see the value of joining us here). 26 | 27 | ### On Review Process 28 | 29 | As you can see in the above algorithm, if the content relates to any upstream project, it should be proposed, reviewed and potentially implemented together with the community. This does not mean that you can't involve other team members in this effort. Share the proposal with team members, even if they are not part of the maintainers' team on a given project; any feedback and voice is useful and can help move the idea further. 30 | 31 | Similarly to proposals that touch our team only, despite the mandatory approval process from leads, anyone can give feedback! Our process is in fact very similar to [Hashicorp's RFC process](https://works.hashicorp.com/articles/writing-practices-and-culture): 32 | 33 | > Once you’ve written the first draft of an RFC, share it with your team. They’re likely to have the most context on your proposal and its potential impacts, so most of your feedback will probably come at this stage. Any team member can comment on and approve an RFC, but you need explicit approval only from the appropriate team leads in order to move forward.
Once the RFC is approved and shared with stakeholders, you can start implementing the solution. For major projects, also share the RFC to the company-wide email list. While most members of the mailing list will just read the email rather than the full RFC, sending it to the list gives visibility into major decisions being made across the company. 34 | 35 | ## Templates 36 | 37 | ### Google Docs Template 38 | 39 | [Open Source Design Doc Template](https://docs.google.com/document/d/1zeElxolajNyGUB8J6aDXwxngHynh4iOuEzy3ylLc72U/edit#). 40 | 41 | ### Markdown Template: 42 | 43 | ## Your Proposal Title 44 | 45 | * **Owners:** 46 | * `<@author: single champion for the moment of writing>` 47 | 48 | * **Related Tickets:** 49 | * `` 50 | 51 | * **Other docs:** 52 | * `` 53 | 54 | > TL;DR: Give here a short summary of what this document is proposing and what components it is touching. Outline rough idea of proposer's view on proposed changes. 55 | > 56 | > *For example: This design doc is proposing a consistent design template for “example.com” organization.* 57 | 58 | ## Why 59 | 60 | Put here a motivation behind the change proposed by this design document, give context. 61 | 62 | *For example: It’s important to clearly explain the reasons behind certain design decisions in order to have a consensus between team members, as well as external stakeholders. Such a design document can also be used as a reference and knowledge-sharing purposes. That’s why we are proposing a consistent style of the design document that will be used for future designs.* 63 | 64 | ### Pitfalls of the current solution 65 | 66 | What specific problems are we hitting with the current solution? Why it’s not enough? 67 | 68 | *For example, We were missing a consistent design doc template, so each team/person was creating their own. Because of inconsistencies, those documents were harder to understand, and it was easy to miss important sections. This was causing certain engineering time to be wasted.* 69 | 70 | ## Goals 71 | 72 | Goals and use cases for the solution as proposed in [How](#how): 73 | 74 | * Allow easy collaboration and decision making on design ideas. 75 | * Have a consistent design style that is readable and understandable. 76 | * Have a design style that is concise and covers all the essential information. 77 | 78 | ### Audience 79 | 80 | If not clear, the target audience that this change relates to. 81 | 82 | ## Non-Goals 83 | 84 | * Move old designs to the new format. 85 | * Not doing X,Y,Z. 86 | 87 | ## How 88 | 89 | Explain the full overview of the proposed solution. Some guidelines: 90 | 91 | * Make it concise and **simple**; put diagrams; be concrete, avoid using “really”, “amazing” and “great” (: 92 | * How you will test and verify? 93 | * How you will migrate users, without downtime. How we solve incompatibilities? 94 | * What open questions are left? (“Known unknowns”) 95 | 96 | ## Alternatives 97 | 98 | The section stating potential alternatives. Highlight the objections reader should have towards your proposal as they read it. Tell them why you still think you should take this path [[ref](https://twitter.com/whereistanya/status/1353853753439490049)] 99 | 100 | 1. This is why not solution Z... 101 | 102 | ## Action Plan 103 | 104 | The tasks to do in order to migrate to the new idea. 105 | 106 | * [ ] Task one 107 | * [ ] Task two ... 
108 | -------------------------------------------------------------------------------- /content/Services/RHOBS/README.md: -------------------------------------------------------------------------------- 1 | # Red Hat Observability Service 2 | 3 | *Previous Documents:* 4 | 5 | * [Initial Design for RHOBS (Internal Content)](https://docs.google.com/document/d/1cSz_ZbS35mk8Op92xhB9ijW1ivOtJuD1uAzPiBdSUqs/edit) 6 | * [Initial FAQ (Internal Content)](https://docs.google.com/document/d/1_xnJBS3v7n4m229L3tqCqBXzZy55yu6dxCJY-vh_Egs/edit) 7 | 8 | ## What 9 | 10 | Red Hat Observability Service (RHOBS) is a managed, centralized, multi-tenant, scalable backend for observability data. Functionally it is an internal deployment of [Observatorium](../../Projects/Observability/observatorium.md) project. RHOBS is designed to allow ingesting, storing and consuming (visualisations, import, alerting, correlation) observability signals like metrics, logging and tracing. 11 | 12 | This document provides the basic overview of the RHOBS service. If you want to learn about RHOBS architecture, look for [Observatorium](https://observatorium.io) default deployment and its design. 13 | 14 | ### Background 15 | 16 | With the amount of managed Openshift clusters, for Red Hat’s own use as well as for customers, there is a strong need to improve the observability of those clusters and of their workloads to the multi-cluster level. Moreover, with the “clusters as cattle” concept, more automation and complexity there is a strong need for a uniform service gathering observability data including metrics, logs, traces, etc into a remote, centralized location for multi-cluster aggregation and long term storage. 17 | 18 | This need is due to many factors: 19 | 20 | * Managing applications running on more than one cluster, which is a default nowadays (cluster as a cattle), 21 | * Need to offload data and observability services from edge cluster to reduce complexity and cost of edge cluster (e.g remote health capabilities). 22 | * Allow offloading teams from maintaining their own observability service. 23 | 24 | It’s worth noting that there is also a significant benefit to collecting and using multiple signals within one system: 25 | * Correlating signals and creating a smooth and richer debugging UX. 26 | * Sharing common functionality, like rate limiting, retries, auth, etc, which allows a consistent integration and management experience for users. 27 | 28 | The Openshift Monitoring Team began preparing for this shift in 2018 with the [Telemeter](use-cases/telemetry.md) Service. In particular, while creating the second version of the Telemeter Service, we put effort into developing and contributing to open source systems and integration to design [“Observatorium”](../../Projects/Observability/observatorium.md): a multi-signal, multi-tenant system that can be operated easily and cheaply as a Service either by Red Hat or on-premise. After extending the scope of the RHOBS, Telemeter become the first "tenant" of the RHOBS. 29 | 30 | In Summer 2020, the Monitoring Team together with the OpenShift Logging Team added a logging signal to [“Observatorium”](../../Projects/Observability/observatorium.md) and started to manage it for internal teams as the RHOBS. 31 | 32 | ### Current Usage 33 | 34 | RHOBS is running in production and has already been offered to various internal teams, with more extensions and expansions coming in the near future. 35 | 36 | * There is currently no plan to offer RHOBS to external customers. 
37 | * However, anyone is welcome to deploy and manage an RHOBS-like service on their own using [Observatorium](../../Projects/Observability/observatorium.md). 38 | 39 | Usage (state as of 2021.07.01): 40 | 41 | The metric usage is visualised in the following diagram: 42 | 43 | ![RHOBS](../../assets/rhobs.png) 44 | 45 | RHOBS is functionally separated into two main usage categories: 46 | 47 | * Since 2018 we run the Telemeter tenant for the metric signal (hard tenancy, `telemeter-prod-01` cluster). See [telemeter](use-cases/telemetry.md) for details. 48 | * Since 2021 we ingest metrics for selected Managed Services as soft tenants in an independent deployment (separate soft tenant, `telemeter-prod-01` cluster). See [MST](use-cases/observability.md) for details. 49 | 50 | Other signals: 51 | 52 | * Since 2020 we ingest logs for the DPTP team (hard tenancy, `telemeter-prod-01` cluster). 53 | 54 | > Hard vs Soft tenancy: In principle, hard tenancy means that in order to run the system for multiple tenants you run a full stack (possibly still in a single cluster, yet isolated through e.g. namespaces) for each tenant. Soft tenancy means that we reuse the same endpoints and services to handle tenant APIs. For tenants, both tenancy models should be invisible. We chose to deploy Telemeter in a different stack because of Telemeter's critical nature and different functional purpose, which makes Telemeter's performance characteristics a bit different from a normal monitoring setup (more analytics-driven cardinality). 55 | 56 | ### Owners 57 | 58 | RHOBS was initially designed by the `Monitoring Group` and its metric signal is managed and supported by the `Monitoring` Group team (OpenShift organization) together with the AppSRE team (Service Delivery organization). 59 | 60 | Other signals are managed by other teams (each together with AppSRE help): 61 | 62 | * Logging signal: OpenShift Logging Team 63 | * Tracing signal: OpenShift Tracing Team 64 | 65 | The Telemeter client side is maintained by the In-cluster Monitoring team, but managed by the corresponding client cluster owner. 66 | 67 | ### Metric Label Restrictions 68 | 69 | There is a small set of label names that are reserved on the RHOBS side. If a remote write request contains series with those labels, they will be overridden (not honoured). 70 | 71 | Restricted labels: 72 | 73 | * `tenant_id`: Internal tenant ID in the form of a UUID that is injected into all series in various places and constructed from the tenant authorization (HTTP header). 74 | * `receive`: Special label used internally. 75 | * `replica` and `rule_replica`: Special labels used internally to replicate receive and rule data for reliability. 76 | 77 | Special labels: 78 | 79 | * `prometheus_replica`: Use this label for a unique Agent/Prometheus scraper ID if you scrape and push from multiple replicas scraping the same data. RHOBS is configured to deduplicate metrics based on this. 80 | 81 | Recommended labels: 82 | 83 | * `_id` for cluster ID (in the form of a lowercase UUID) is the common way of identifying clusters. 84 | * `prometheus` is the name of the Prometheus (or any other scraper) that was used to collect metrics. 85 | * `job` is the scrape job name, usually dynamically set to an abstracted service name representing the microservice (usually the deployment/statefulset name). 86 | * `namespace` is the Kubernetes namespace. Useful to identify the same microservice across namespaces. 87 | * For more "version"-based granularity (warning - every pod rollout creates new time series):
88 | * `pod` is a pod name 89 | * `instance` is an IP:port (useful when pod has sidecars) 90 | * `image` is image name and tag. 91 | 92 | ### Support 93 | 94 | For support see: 95 | 96 | * [MST escalation flow](use-cases/observability.md#support) and [MST SLA](use-cases/observability.md#service-level-agreement). 97 | * [Telemeter escalation flow](use-cases/telemetry.md#support) and [Telemeter SLA](use-cases/telemetry.md#service-level-agreement). 98 | -------------------------------------------------------------------------------- /content/contributing.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | This document explains the process of modifying this handbook. 4 | 5 | > You were instead looking at how to contribute to the projects we maintain and contribute as part of Monitoring Group? Check out the [Projects list!](Projects/README.md) 6 | 7 | ## Who can Contribute? 8 | 9 | Anyone is welcome to read and contribute to our handbook! It's licensed with Apache 2. Any team member, another Red Hat team member, or even anyone outside Red Hat is welcome to help us refine processes, update documentation and FAQs. 10 | 11 | If the handbook inspired you, let us know too! 12 | 13 | ## What and How You Can Contribute? 14 | 15 | Follow [Modifying Documents Guide](#modifying-documents) and add GitHub Pull Request to [https://github.com/rhobs/handbook](https://github.com/rhobs/handbook): 16 | 17 | * If you find a broken link, typo, unclear statements, gap in the documentation. 18 | * If you find a broken element on `https://rhobs-handbook.netlify.app/` website or via [GitHub UI](https://github.com/rhobs/handbook/tree/main/content). 19 | * If you want to add a new proposal (please also follow the proposal process (TBD)). 20 | * If there is no active and existing PR for a similar work item. [TODO General Contribution Guidelines](https://github.com/rhobs/handbook/issues/7) 21 | 22 | Add GitHub Issue to https://github.com/rhobs/handbook: 23 | 24 | * If you don't want to spend time doing the actual work, you want to report it and potentially do it in your free time. 25 | * If there is no current issue for a similar work item. 26 | 27 | ## Modifying Documents 28 | 29 | This section will guide you through changing, adding or removing anything from handbook content. 30 | 31 | > NOTE: It's ok to start with some large changes in Google Docs first for easier collaboration. Then, you can use [Google Chrome Plugin](https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607) to convert it to markdown later on. 32 | 33 | ### Prerequisites 34 | 35 | * Familiarise with [GitHub Flavored Markdown Spec](https://github.github.com/gfm/). Our source documentation is written in that markdown dialect. 36 | * Check the goals for this handbook defined in the [Handbook proposal](Proposals/Accepted/202106-handbook.md) to ensure you understand the scope of the handbook materials. 37 | * Make sure you have installed Go 1.15+ on your machine. 38 | 39 | ### How to decide where to put the content? 40 | 41 | ![flow](assets/handbook-process.png) 42 | 43 | NOTE: Make sure to not duplicate too much information with the project documentation. Docs are aging quickly. 
44 | 45 | ### Editing the content 46 | 47 | As you can read from the [Handbook proposal](Proposals/Accepted/202106-handbook.md), our website framework ([`mdox`](https://github.com/bwplotka/mdox) and [Hugo](https://gohugo.io/) is designed to allow maintaining the source documentation files written in markdown in website agnostic way. Just write a markdown like you would write it if our main rendering point would be a [GitHub UI](https://github.com/rhobs/handbook/tree/main/content) itself! 48 | 49 | #### Changing Documentation Files 50 | 51 | 1. Go to https://github.com/rhobs/handbook/ and fork it. 52 | 53 | 2. Git clone it on your machine, add a branch to your fork (alternatively, you can use GitHub UI to edit file). 54 | 55 | 3. Add, edit or remove any markdown file in [`./content`](https://github.com/rhobs/handbook/tree/main/content) directory. The structure of directories will map to the menu on the website. 56 | 57 | 4. Format markdown. 58 | 59 | We mandate opinionated markdown formatting on our docs to ensure maintainability. As humans, we can make mistakes, and formatting will quickly fix inconsistent newlines, whitespaces etc. It will also validate if your relative links are pointing to something existing (!). Run `make docs` from repo root. 60 | 61 | > NOTE: Use relative links directly to markdown files when pointing to another file in the `content` directory. Don't link to the same resources on the website. The framework will transform from link to md file into the link to relative URL path on the website. 62 | 63 | > NOTE: The CI job will check the formatting and local links, so before committing, make sure to run `make docs` from the main repo root! 64 | 65 | 1. Create PR 66 | 67 | 2. Once the PR is merged, the content will appear on our website `https://rhobs-handbook.netlify.app/` and [GitHub UI](https://github.com/rhobs/handbook/tree/main/content). 68 | 69 | #### Adding Static Content 70 | 71 | All static content that you want to link to or embed (e.g. images) add to the `content` (ideally to [`content/assets`](https://github.com/rhobs/handbook/tree/main/content/assets) for clean grouping) and relatively link in your main document. 72 | 73 | It's recommended to check if your page renders well before merging (see [section below](#early-previews)). 74 | 75 | > NOTE: If you used some tool for creating diagrams (e.g checkout amazing https://excalidraw.com/), put the source file into assets too. If online tool was used, e.g Google Draw, add link to the markdown so we can edit those easily. 76 | 77 | #### Early Previews 78 | 79 | While editing documentation, you can run the website locally to check how it all looks! 80 | 81 | *Locally:* 82 | 83 | Run `make web-serve` from repo root. This will serve the website on `localhost:1313`. 84 | 85 | > NOTE: Currently, some options reload in runtime e.g. CSS, but content changes require `make web-serve` restart. See https://github.com/bwplotka/mdox/issues/35 that tracks fix for this. 86 | 87 | *From the PR:* 88 | 89 | Click `details` on website preview CI check from netlify: 90 | 91 | ![preview](assets/screen-preview.png) 92 | 93 | > Something does not work? Link works on GitHub but not on the website? Don't try to hack this up. It might bug with our framework. Please raise an issue on https://github.com/rhobs/handbook/ for us to investigate. 94 | 95 | ## Advanced: Under the hood 96 | 97 | We do advanced conversions from source files in `content` dir to the website. Let's do a quick overview of our framework. 
98 | 99 | * [`./web`](https://github.com/rhobs/handbook/tree/main/web) contains resources related to the website. 100 | * [`./web/themes`](https://github.com/rhobs/handbook/tree/main/web/themes) hosts our Hugo theme [`Docsy`](https://themes.gohugo.io/docsy/) as git submodule. 101 | * [`./web/config.toml`](https://github.com/rhobs/handbook/blob/main/web/config.toml) holds Hugo and theme configuration. 102 | * [`./netlify.yaml`](https://github.com/rhobs/handbook/blob/main/netlify.toml) holds our Netlify deployment configuration.) 103 | * [`.mdox.yaml`](https://github.com/rhobs/handbook/blob/main/.mdox.yaml) holds https://github.com/bwplotka/mdox configuration that tells mdox how to *pre-process* our docs. 104 | 105 | When you run `make web` or `make web-server`, a particular process is happening to "render" the website: 106 | 107 | 1. `mdox` using `.mdox.yaml` is transforming the resources from `content` directory, moving into website friendly form (adding frontmatter, adjusting links and many others) and outputs results in `./web/content` and `./web/static` which are never committed (temporary directories). 108 | 2. We run Hugo, which renders the markdown files from `./web/content` and using the rest of the resources in `./web` to render the website in `./web/publish`. 109 | 3. Netlify then knows that website has to be hosted from `./web/publish`. 110 | 111 | ## How to contact us? 112 | 113 | Feel free to put an [issue or discussion topic on our handbook repo](https://github.com/rhobs/handbook), ask @bwplotka or @onprem users on Slack or catch us on Twitter. 114 | -------------------------------------------------------------------------------- /content/Products/OpenshiftMonitoring/faq.md: -------------------------------------------------------------------------------- 1 | # Frequently asked questions 2 | 3 | This serves as a collection of resources that relate to FAQ around configuring/debugging the in-cluster monitoring stack. Particularly it applies to two OpenShift Projects: 4 | 5 | * [Platform Cluster Monitoring - PM](https://docs.openshift.com/container-platform/latest/monitoring/understanding-the-monitoring-stack.html#understanding-the-monitoring-stack_understanding-the-monitoring-stack) 6 | * [User Workload Monitoring - UWM](https://docs.openshift.com/container-platform/latest/monitoring/enabling-monitoring-for-user-defined-projects.html) 7 | 8 | ## How can I (as a monitoring developer) troubleshoot support cases? 9 | 10 | See this [presentation](https://docs.google.com/presentation/d/1SY0xHNO-QMvhMi1kRSJlZuYnkuZ3nFLJvWjG0CM7wgw/edit) to understand which tools are at your disposal. 11 | 12 | ## How do I understand why targets aren't discovered and metrics are missing? 13 | 14 | Both `PM` and `UWM` monitoring stacks rely on the `ServiceMonitor` and `PodMonitor` custom resources in order to tell Prometheus which endpoints to scrape. 15 | 16 | The examples below show the namespace `openshift-monitoring`, which can be replaced with `openshift-user-workload-monitoring` when dealing with `UWM`. 17 | 18 | A detailed description of how the resources are linked exists [here](https://prometheus-operator.dev/docs/operator/troubleshooting/#troubleshooting-servicemonitor-changes), but we will walk through some common issues to debug the case of missing metrics. 19 | 20 | 1. Ensure the `serviceMonitorSelector` in the `Prometheus` CR matches the key in the `ServiceMonitor` labels. 21 | 2. The `Service` you want to scrape *must* have an explicitly named port. 22 | 3. 
The `ServiceMonitor` *must* reference the `port` by this name. 23 | 4. The label selector in the `ServiceMonitor` must match an existing `Service`. 24 | 25 | Assuming this criteria is met but the metrics don't exist, we can try debug the cause. 26 | 27 | There is a possibility Prometheus has not loaded the configuration yet. The following metrics will help to determine if that is in fact the case or if there are errors in the configuration: 28 | 29 | ```bash 30 | prometheus_config_last_reload_success_timestamp_seconds 31 | prometheus_config_last_reload_successful 32 | ``` 33 | 34 | If there are errors with reloading the configuration, it is likely the configuration itself is invalid and examining the logs will highlight this. 35 | 36 | ```bash 37 | oc logs -n openshift-monitoring prometheus-k8s-0 -c 38 | ``` 39 | 40 | Assuming that the reload was a success then the Prometheus should see the configuration. 41 | 42 | ```bash 43 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/status/config | grep "" 44 | ``` 45 | 46 | If the `ServiceMonitor` does not exist in the output, the next step would be to investigate the logs of both `prometheus` and the `prometheus-operator` for errors. 47 | 48 | Assuming it does exist then we know `prometheus-operator` is doing its job. Double check the `ServiceMonitor` definition. 49 | 50 | Check the service discovery endpoint to ensure Prometheus can discover the target. It will need the appropriate RBAC to do so. An example can be found [here](https://github.com/openshift/cluster-monitoring-operator/blob/23201e012586d4864ca23593621f843179c47412/assets/prometheus-k8s/role-specific-namespaces.yaml#L35-L50). 51 | 52 | ## How do I troubleshoot the TargetDown alert? 53 | 54 | First of all, check the [TargetDown runbook](https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/TargetDown.md). 55 | 56 | We have, in the past seen cases where the `TargetDown` alert was firing when all endpoints appeared to be up. The following commands fetch some useful metrics to help identify the cause. 57 | 58 | As the alert fires, get the list of active targets in Prometheus 59 | 60 | ```bash 61 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/targets?state=active > targets.prometheus-k8s-0.json 62 | 63 | oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/targets?state=active > targets.prometheus-k8s-1.json 64 | ``` 65 | 66 | --- 67 | 68 | Reports all targets that Prometheus couldn’t connect to with some reason (timeout, refused, …) 69 | 70 | A `dialer_name` can be passed as a label to limit the query to interesting components. For example `{dialer_name=~".+openshift-.*"}`. 71 | 72 | ```bash 73 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=rate(net_conntrack_dialer_conn_failed_total{}[1h]) > 0' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-0.json 74 | 75 | oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=net_conntrack_dialer_conn_failed_total{} > 1' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-1.json 76 | ``` 77 | 78 | --- 79 | 80 | Identify targets that are slow to serve metrics and may be considered as down. 
81 | 82 | ```bash 83 | oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-0.json 84 | 85 | oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-1.json 86 | ``` 87 | 88 | ## How do I troubleshoot high CPU usage of Prometheus? 89 | 90 | Often, when "high" CPU usage or spikes are identified, it can be a symptom of expensive rules. 91 | 92 | A good place to start the investigation is the `/rules` endpoint of Prometheus: analyse any rules which might contribute to the problem by identifying excessive rule evaluation times. 93 | 94 | A sorted list of rule evaluation times can be gathered with the following: 95 | 96 | ```bash 97 | oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/rules' | jq -r '.data.groups[] | .rules[] | [.evaluationTime, .health, .name] | @tsv' | sort 98 | ``` 99 | 100 | An overview of the time series database can be retrieved with: 101 | 102 | ```bash 103 | oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/status/tsdb' | jq 104 | ``` 105 | 106 | Within Prometheus, the `prometheus_rule_evaluation_duration_seconds` metric can be used to view evaluation time by quantile for each instance. Additionally, the `prometheus_rule_group_last_duration_seconds` metric can be used to determine the longest-evaluating rule groups. 107 | 108 | ## How do I retrieve CPU profiles? 109 | 110 | In cases where excessive CPU usage is being reported, it might be useful to obtain [Pprof profiles](https://github.com/google/pprof/blob/02619b876842e0d0afb5e5580d3a374dad740edb/doc/README.md) from the Prometheus containers over a short time span. 111 | 112 | To gather CPU profiles over a period of 30 minutes, run the following: 113 | 114 | ```bash 115 | SLEEP_MINUTES=5 116 | duration=${DURATION:-30} 117 | while [ $duration -ne 0 ]; do 118 | for i in 0 1; do 119 | echo "Retrieving CPU profile for prometheus-k8s-$i..." 120 | oc exec -n openshift-monitoring prometheus-k8s-$i -c prometheus -- curl -s http://localhost:9090/debug/pprof/profile?seconds="$duration" > cpu.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof; 121 | done 122 | echo "Sleeping for $SLEEP_MINUTES minutes..." 123 | sleep $(( 60 * $SLEEP_MINUTES )) 124 | (( --duration )) 125 | done 126 | ``` 127 | 128 | ## How do I debug high memory usage? 129 | 130 | The following queries might prove useful for debugging.
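For example, to spot the scrape jobs exposing the most samples per scrape, which are often the main drivers of ingestion and memory usage, a query along these lines can help (a sketch based on the standard `scrape_samples_scraped` metric; the `top_jobs.json` output file name is just an example):

```bash
# Sketch: top 10 scrape jobs by number of samples exposed per scrape.
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
  -- curl -s http://localhost:9090/api/v1/query --data-urlencode \
  'query=topk(10, sum by (job, namespace) (scrape_samples_scraped))' > top_jobs.json
```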
131 | 132 | Calculate the ingestion rate over the last two minutes: 133 | 134 | ```bash 135 | oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \ 136 | -- curl -s http://localhost:9090/api/v1/query --data-urlencode \ 137 | 'query=sum by(pod,job,namespace) (max without(instance) (rate(prometheus_tsdb_head_samples_appended_total{namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}[2m])))' > samples_appended.json 138 | ``` 139 | 140 | Calculate "non-evictable" memory: 141 | 142 | ```bash 143 | oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \ 144 | -- curl -s http://localhost:9090/api/v1/query --data-urlencode \ 145 | 'query=sort_desc(sum by (pod,namespace) (max without(instance) (container_memory_working_set_bytes{namespace=~"openshift-monitoring|openshift-user-workload-monitoring", container=""})))' > memory.json 146 | ``` 147 | 148 | ## How do I get memory profiles? 149 | 150 | In cases where excessive memory is being reported, it might be useful to obtain [Pprof profiles](https://github.com/google/pprof/blob/02619b876842e0d0afb5e5580d3a374dad740edb/doc/README.md) from the Prometheus containers over a short time span. 151 | 152 | To gather memory profiles over a period of 30 minutes, run the following: 153 | 154 | ```bash 155 | SLEEP_MINUTES=5 156 | duration=${DURATION:-30} 157 | while [ $duration -ne 0 ]; do 158 | for i in 0 1; do 159 | echo "Retrieving memory profile for prometheus-k8s-$i..." 160 | oc exec -n openshift-monitoring prometheus-k8s-$i -c prometheus -- curl -s http://localhost:9090/debug/pprof/heap > heap.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof; 161 | done 162 | echo "Sleeping for $SLEEP_MINUTES minutes..." 163 | sleep $(( 60 * $SLEEP_MINUTES )) 164 | (( --duration )) 165 | done 166 | ``` 167 | -------------------------------------------------------------------------------- /content/Services/RHOBS/configuring-clients-for-observatorium.md: -------------------------------------------------------------------------------- 1 | # Configuring Clients for Red Hat’s Observatorium Instance 2 | 3 | - **Authors:** 4 | - [`@squat`](https://github.com/squat) 5 | - [`@spaparaju`](https://github.com/spaparaju) 6 | - [`@onprem`](https://github.com/onprem) 7 | 8 | ## Overview 9 | 10 | Teams that have identified a need for collecting their logs and/or metrics into a centralized service that offers querying and dashboarding may choose to send their data to Red Hat’s hosted Observatorium instance. This document details how to configure clients, such as Prometheus, to remote write data for tenants who have been onboarded to Observatorium following the team’s onboarding doc: [Onboarding a Tenant into Observatorium (internal)](https://docs.google.com/document/d/1pjM9RRvij-IgwqQMt5q798B_4k4A9Y16uT2oV9sxN3g/edit). 11 | 12 | ## 0. Register Service Accounts and Configure RBAC 13 | 14 | Before configuring any clients, follow the steps in the [Observatorium Tenant Onboarding doc (internal)](https://docs.google.com/document/d/1pjM9RRvij-IgwqQMt5q798B_4k4A9Y16uT2oV9sxN3g/edit) to register the necessary service accounts and give them the required permissions on the Observatorium platform. The result of this process should be an OAuth client ID and client secret pair for each new service account. Save these credentials somewhere secure. 15 | 16 | ## 1. 
Remote Writing Metrics to Observatorium 17 | 18 | ### Using the Cluster Monitoring Stack 19 | 20 | This section describes the process of sending metrics collected by the Cluster Monitoring stack on an OpenShift cluster to Observatorium. 21 | 22 | #### Background 23 | 24 | In order to remote write metrics from a cluster to Observatorium using the OpenShift Cluster Monitoring stack, the cluster’s Prometheus servers must be configured to authenticate and make requests to the correct URL. The OpenShift Cluster Monitoring ConfigMap exposes a user-editable field for configuring the Prometheus servers to remote write. However, because Prometheus does not support OAuth, it cannot authenticate directly with Observatorium and because the Cluster Monitoring stack is, for all intents and purposes, immutable, the Prometheus Pods cannot be configured with sidecars to do the authentication. For this reason, the Prometheus servers must be configured to remote write through an authentication proxy running on the cluster that in turn is pointed at Observatorium and is able to perform an OAuth flow and set the received access token on proxied requests. 25 | 26 | #### 1. Configure the Cluster Monitoring Stack 27 | 28 | The OpenShift Cluster Monitoring stack provides a ConfigMap that can be used to modify the configuration and behavior of the components. The first step is to modify the “cluster-monitoring-config” ConfigMap in the cluster to include a remote write configuration for Prometheus as shown below: 29 | 30 | ```yaml 31 | apiVersion: v1 32 | kind: ConfigMap 33 | metadata: 34 | name: cluster-monitoring-config 35 | namespace: openshift-monitoring 36 | data: 37 | config.yaml: | 38 | prometheusK8s: 39 | retention: 2h 40 | remoteWrite: 41 | - url: http://token-refresher.openshift-monitoring.svc.cluster.local 42 | queueConfig: 43 | max_samples_per_send: 500 44 | batch_send_deadline: 60s 45 | write_relabel_configs: 46 | - source_labels: [__name__] 47 | - regex: metric_name.* 48 | ``` 49 | 50 | #### 2. Deploy the Observatorium Token-Refresher Proxy 51 | 52 | Because Prometheus does not have built-in support for acquiring OAuth2 access tokens for authorization, which are required by Observatorium, the in-cluster Prometheus must remote write its data through a proxy that is able to fetch access tokens and set them as headers on outbound requests. The Observatorium stack provides such a proxy, which may be deployed to the “openshift-monitoring” namespace and guarded by a NetworkPolicy so that only Prometheus can use the access tokens. The following snippet shows an example of how to deploy the proxy for the stage environment: 53 | 54 | ```bash 55 | export TENANT= 56 | export CLIENT_ID= 57 | export CLIENT_SECRET= 58 | # For staging: 59 | export STAGING=true 60 | cat < **NOTE:** OAuth2 support was added to Prometheus in version 2.27.0. If using an older version of Prometheus without native OAuth2 support, the remote write traffic needs to go through a proxy such as [token-refresher](https://github.com/observatorium/token-refresher/), similar to what is described above for Cluster Monitoring Stack. 182 | 183 | #### 1. Modify the Prometheus configuration 184 | 185 | The configuration file for Prometheus must be patched to include a remote write configuration as shown below. ``, ``, and `` need to be replaced with appropriate values. 
186 | 187 | ```yaml 188 | remote_write: 189 | - url: https://observatorium-mst.api.stage.openshift.com/api/metrics/v1//api/v1/receive 190 | oauth2: 191 | client_id: 192 | client_secret: 193 | token_url: https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token 194 | ``` 195 | -------------------------------------------------------------------------------- /content/Proposals/Accepted/202106-handbook.md: -------------------------------------------------------------------------------- 1 | # 2021-06: Handbook 2 | 3 | * **Owners:** 4 | * [`@bwplotka`](https://github.com/bwplotka) 5 | 6 | * **Other docs:** 7 | * https://works.hashicorp.com/articles/writing-practices-and-culture 8 | * https://about.gitlab.com/handbook/handbook-usage/#wiki-handbooks-dont-scale 9 | * https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/5143 10 | * [Observability Team Process (Internal)](https://docs.google.com/document/d/1eojXStPdq1hYwv36pjE-vKR1q3dlBbpIx5w_L_v2gNo/edit) 11 | * [Monitoring Team Processes (Internal)](https://drive.google.com/drive/folders/1yU5DYhpBLzmp3q9SIhudRMZ1rsmQ_sR2) 12 | 13 | > TL;DR: I would like to propose to put all public documentation pieces related to the Monitoring Group (and not tied to a specific project) in the public GitHub repository called [`handbook`](https://github.com/rhobs/handbook). I propose to review all documents with a similar flow as code and put documents in the form of markdown files that can be read from both GitHub UI and automatically served on https://rhobs-handbook.netlify.app/ website. 14 | > 15 | > The [diagram](#flow-of-addingconsuming-documentation-to-handbook) below shows what fits into this handbook and what should be distributed to the relevant upstream project (e.g developer documentation). 16 | 17 | ## Why 18 | 19 | ##### Documentation is essential 20 | 21 | * Without good team processes documentation, collaboration within the team can be challenging. Members have to figure out what to do on their own, or tribal knowledge has to be propagated. Surprises and conflicts can arise. On-boarding new team members are hard. It's vital given that our Red Hat teams are distributed over the world and working remotely. 22 | * Additionally, it's hard for any internal or external team to discover how to reach us or escalate without noise. 23 | * Without a good team subject matter overview, it's hard to wrap your head around the number of projects we participate in. In addition, each team member is proficient in a different area, and we miss some "index" overview of where to navigate for various project aspects (documentation, contributing, proposals, chats). 24 | * Even if documentation is created, it risks being placed in the wrong place. 25 | * Without a place for written design proposals (those in progress, those accepted and rejected), the team risks repeating iterating over the same ideas or challenging old ideas already researched. 26 | * Without good operational or configuration knowledge, we keep asking the same question about, e.g. how to rollout service X or contribute to X etc. 27 | 28 | ##### Despite strong incentives, writing documentation has proven to be of one the most unwanted task among engineers 29 | 30 | Demotivation is because our (Google Docs based) process tends to create the following obstacles: 31 | 32 | * There are too many side decisions to make, e.g. where to put this documentation, what format to use, how long, how to be on-topic, are we sure this information is not recorded somewhere else? 
Every, even small decision takes our energy and have the risk of procrastination. 33 | * There is no review process, so it's hard to maintain a high quality of those documents. 34 | * Created documentation is tough to consume and discover. 35 | * Because docs are hard to discover, the documentation tends to be often duplicated, has gaps, or is obsolete. 36 | * Documents used to be private, which brings extra demotivation. Some of the information is useful for the public audience. Some of this could be useful for external contributors. It's hard to reuse such private docs without recreating them. 37 | 38 | All of those make people decide NOT to write documentation but rather schedule another meeting and repeat the same information repeatedly. 39 | 40 | On a similar side, anyone looking for information about our teams' work, proposals or project is demotivated to look, find and read our documentation because it's not consistent, not in a single place, hard to discover or not completed. 41 | 42 | ### Pitfalls of the current solution 43 | 44 | * It mainly exists in Google Docs, which has the following issues: 45 | * Not everything is in our Team drive, there are docs not owned by us, created adhoc. 46 | * It's painful to organize them well e.g in directories, since it's so easy so copy, create one. 47 | * Even if it's organized well, it's not easily discoverable. 48 | * Existing Google doc-based documents are hard to consume. The formatting is widely different. Naming is inconsistent. 49 | * *Document creation is rarely actionable*. There is no review process, so the effort of creating a relevant document might be wasted, as the document is lost. This also leads to docs being in the half-completed state, demotivating readers to look at it. 50 | * It's hard to track previous discussions around docs, who approved them (e.g. proposals). 51 | * It's not public, and it's hard to share best practices with other external and internal teams. 52 | 53 | ## Goals 54 | 55 | Goals and use cases for the solution as proposed in [How](#how): 56 | 57 | * Single source of truth for Monitoring Group Team docs like processes, overviews, runbooks, links for internal content. 58 | * Have a consistent documentation format that is readable and understandable. 59 | * Searchable and easily discoverable. 60 | * Process of adding documents should be easy and consistent. 61 | * Automation and normal review process should be in place to ensure high quality (e.g. link checking). 62 | * Allow public collaboration on processes and other docs. 63 | 64 | > NOTE: We would love to host Logging and Tracing Teams if they choose to follow our process, but we don't want to enforce it. We are happy to extend this handbook from Monitoring Group handbook to Observability Group, but it has to grow organically (if Logging, Tracing team will see the value joining us here). 65 | 66 | ### Audience 67 | 68 | The currently planned audience for proposed documentation content is following (in importance order): 69 | 70 | 1. Monitoring Group Team Members. 71 | 2. External Teams at Red Hat. 72 | 3. Teams outside Red Hat, contributors to our projects, potential future hires, people interested in best practices, team processes etc. 73 | 74 | ## Non-Goals 75 | 76 | * Support other formats than `Markdown` e.g. Asciidoc. 77 | * Replace official project or product documentation. 78 | * Precise design proposal process (it will come in a separate proposal). 79 | * Sharing Team Statuses, we use JIRA and GH issues for that. 
80 | 81 | ## How 82 | 83 | The idea is simple: 84 | 85 | Let's make sure we maintain the process of adding/editing documentation **as easy and rewarding** as possible. This will increase the chances team members will document things more often and adopt this as a habit. Produced content will be more likely complete and up-to-date, increasing the chances it will be helpful to our audience, which will reduce the meeting burden. This will make writing docs much more rewarding, which creates a positive loop. 86 | 87 | I propose to use git repository [`handbook`](https://github.com/rhobs/handbook) to put all related team documentation pieces there. Furthermore, I suggest reviewing all documents with a similar flow as code and placing information in the form of markdown files that can be read from both GitHub UI and automatically served on https://rhobs-handbook.netlify.app/ website. 88 | 89 | Pros: 90 | * Matches our [goals](#goals). 91 | * Sharing by default. 92 | * Low barriers to write documents in a consistent format, low barrier to consume it. 93 | * Ensures high quality with local CI and review process. 94 | 95 | Cons: 96 | * Some website maintenance is needed, but we use the same and heavily automated flow in Prometheus-operator, Thanos, Observatorium websites etc. 97 | 98 | The idea of a handbook is not new. Many organizations do this e.g [GitLab](https://about.gitlab.com/handbook/handbook-usage/#wiki-handbooks-dont-scale). 99 | 100 | > NOTE: The website style might be not perfect (https://rhobs-handbook.netlify.app/). Feel free to propose issues, fixes to the overall look and readability! 101 | 102 | ### Flow of Adding/Consuming Documentation to Handbook 103 | 104 | ![flow](../../assets/handbook-process.png) 105 | 106 | If you want to add or edit markdown documentation, refer to [our technical guide](../../contributing.md). 107 | 108 | ## Alternatives 109 | 110 | 1. Organize Team Google Drive with all Google docs we have. 111 | 112 | Pros: 113 | * Great for initial collaboration 114 | 115 | Cons: 116 | * Inconsistent format 117 | * Hard to track approvers 118 | * Never know when the doc is "completed." 119 | * Hard to maintain over time 120 | * Hard to share and reuse outside 121 | 122 | 1. Create Red Hat scoped only, a private handbook. 123 | 124 | Pros: 125 | * No worry if we share something internal? 126 | 127 | Cons: 128 | * We don't have many internal things we don't want to share at the current moment. All our projects and products are public. 129 | * Sharing means we have to duplicate the information, copy it in multiple places. 130 | * Harder to share with external teams 131 | * We can't use open source tools, CIs etc. 132 | 133 | ## Action Plan 134 | 135 | * [X] Create handbook repo and website 136 | * [X] Create website automation (done with [mdox](https://github.com/bwplotka/mdox)) 137 | * [ ] Move existing up-to-date public documents (team processes, project overviews, faqs, design docs) over to the handbook (deadline: End of July). 138 | * Current Team Drive: [internal](https://drive.google.com/drive/folders/1PJHtAtxBUHxbmMx1xftrNSOJEIoYVhyO) 139 | * TIP: You can use [Google Chrome Plugin](https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607) to convert Google Doc into markdown easily. 140 | * [ ] Clean up Team Drive from not used or non-relevant project (or move it to some trash dir). 
141 | -------------------------------------------------------------------------------- /content/Teams/observability-platform/sre.md: -------------------------------------------------------------------------------- 1 | # Observability Platform Team SRE Processes 2 | 3 | This document explains a few processes related to operating our production software together with the AppSRE team. All of these operations can be summarised as the Site Reliability Engineering part of our work. 4 | 5 | ## Goals 6 | 7 | The main goal of our SRE work within this team is to deploy and operate software that meets the functional ([API](../../Projects/Observability/observatorium.md)) and performance ([Telemeter SLO](../../Services/RHOBS/use-cases/telemetry.md#service-level-agreement), [MST SLO](../../Services/RHOBS/use-cases/observability.md#service-level-agreement)) requirements of our customers. 8 | 9 | Currently, we offer an internal service called [RHOBS](../../Services/RHOBS) that is used across the company. 10 | 11 | ## Releasing Changes / Patches 12 | 13 | In order to maintain our software in production reliably, we need to be able to test, release and deploy our changes rapidly. 14 | 15 | We can divide the things we change into a few categories. Let's elaborate on the process for each category: 16 | 17 | ### Software Libraries 18 | 19 | A library is a project that is not deployed directly; rather, it is a dependency of a [micro-service application](#software-for-micro-services) we deploy. For example https://github.com/prometheus/client_golang or https://github.com/grpc-ecosystem/go-grpc-middleware 20 | 21 | *Testing*: 22 | * Linters, unit and integration tests on every PR. 23 | 24 | *Releasing* 25 | * GitHub release using a git tag (RC first, then major/minor or patch release). 26 | 27 | *Deploying* 28 | * Bumping the dependency in the Service. 29 | * Following the Service [release and deploy steps](#software-for-micro-services). 30 | 31 | ### Software for (Micro) Services 32 | 33 | Software for (micro) services usually lives in many separate open source repositories on GitHub, e.g. https://github.com/thanos-io/thanos, https://github.com/observatorium/api. 34 | 35 | *Testing*: 36 | * Linters, unit and integration tests on every PR. 37 | 38 | *Releasing* 39 | * GitHub release using a git tag (RC first, then major/minor or patch release). 40 | * Building docker images on quay.io and dockerhub (backup) using CI for each tag or main branch PR. 41 | 42 | *Deploying* 43 | * Bumping the docker tag version in configuration. 44 | * Following the [configuration release and deployment steps](#configuration). 45 | 46 | ### Configuration 47 | 48 | All of our configuration is rooted in https://github.com/rhobs/configuration (configuration templates written in jsonnet). It can then be overridden by parameters defined in app-interface SaaS files. 49 | 50 | *Testing*: 51 | * [Building jsonnet resources](https://github.com/rhobs/configuration/blob/main/.github/workflows/ci.yml), linting jsonnet, validating OpenShift templates. 52 | * Validation on the App-interface side for compatibility with Kubernetes YAMLs. 53 | 54 | *Releasing* 55 | * Merge to main and get the commit hash you want to deploy. 56 | 57 | *Deploying* 58 | * Make sure you are on the RH VPN.
59 | * Propose a PR to https://gitlab.cee.redhat.com/service/app-interface that changes the ref for the desired environment to the desired commit SHA in the app-interface SaaS file for the desired tenant ([telemeter](https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/rhobs/telemeter/cicd/saas.yaml#L70) or MST), or that changes an environment parameter (an illustrative `saas.yaml` excerpt is sketched at the end of this document). 60 | * Ask the team for review. If the change heavily impacts production, notify AppSRE. 61 | * If only the `saas.yaml` file was changed, a `/lgtm` from the Observability Platform team is enough for the PR to get merged automatically. 62 | * If any other file was changed, an AppSRE engineer has to lgtm it. 63 | * When merged, CI will deploy the changes to the namespace specified in `saas.yaml`, e.g. to production. 64 | 65 | NOTE: Don't change both production and staging in the same MR. NOTE: Deploy to production only changes that were previously in staging (automation for this is TBD). 66 | 67 | You can see the version change: 68 | * On the monitoring dashboard, e.g. https://prometheus.telemeter-prod-01.devshift.net/graph?g0.range_input=1h&g0.expr=thanos_build_info&g0.tab=0 69 | * In the deploy report on the `#team-monitoring-info` Slack channel on CoreOS Slack. 70 | 71 | ### Monitoring Resources 72 | 73 | Grafana Dashboards are defined in https://github.com/rhobs/configuration/tree/main/observability/dashboards and Alerts and Rules in https://github.com/rhobs/configuration/blob/main/observability/prometheusrules.jsonnet 74 | 75 | *Testing*: 76 | * ): 77 | 78 | *Releasing* 79 | * Merge to main and get the commit hash you want to deploy. 80 | 81 | *Deploying* 82 | * Use `synchronize.sh` to create an MR/PR against app-interface. This will copy all generated YAML resources to the proper places. 83 | * Ask the team for review. If the change heavily impacts production, notify AppSRE. 84 | * When merged, CI will deploy the change to production. You can see the version change on the monitoring dashboard too, e.g. https://prometheus.telemeter-prod-01.devshift.net/graph?g0.range_input=1h&g0.expr=thanos_build_info&g0.tab=0. 85 | 86 | ## On-Call Rotation 87 | 88 | Currently, RHOBS services are supported 24h/7d by two teams: 89 | 90 | * `AppSRE` team for infrastructure and "generalist" support. They are our first incident response point. They try to support our stack as far as runbooks and general knowledge allow. 91 | * `Observability Platform` Team (Dev on-call) for incidents impacting the SLA outside of AppSRE's expertise (bug fixes, more complex troubleshooting). We are notified when AppSRE needs us. 92 | 93 | ## Incident Handling 94 | 95 | This is the process we as the Observability Team try to follow during incident response. 96 | 97 | An incident occurs when any of our services violates an SLO we set with our stakeholders. Refer to [Telemeter SLO](../../Services/RHOBS/use-cases/telemetry.md#service-level-agreement) and [MST SLO](../../Services/RHOBS/use-cases/observability.md#service-level-agreement) for details on the SLA. 98 | 99 | NOTE: The following procedure applies to both Production and Staging. Many teams, e.g. SubWatch, depend on a working staging environment, so follow a similar process as for production. The only difference is that *we do not need to mitigate or fix staging issues outside of office hours*. 100 | 101 | *Potential Trigger Points*: 102 | 103 | * You got notified on Slack by the AppSRE team or paged through PagerDuty. 104 | * You got notified about a potential SLA violation by the customer: unexpected responses, things that worked before do not work now, etc.
105 | * You touch production for unrelated reasons and notice something worrying (error logs, an un-monitored resource, etc.). 106 | 107 | 1. If you are not on-call, [notify the Observability Platform on-call engineer](../../Services/RHOBS/use-cases/telemetry.md#escalations). If you are on-call, the on-call engineer is not present, or you agreed that you will handle this incident, go to step 2. 108 | 2. Straight away, create a JIRA ticket for the potential incident. Don't think twice; it's easy to create and essential for tracking the incident later on. Fill in the following parts: 109 | 110 | * Title: Symptom you see. 111 | * Type: Bug 112 | * Priority: Try to assess how important it is. If it impacts production, it's a "Blocker". 113 | * Component: RHOBS 114 | * (Important) Label: `incident` `no-qe` 115 | * Description: Mention how you were notified (ideally with a link to the alert/Slack thread). Mention what you know so far. 116 | 117 | ![jira](jira.png) See example incident tickets [here](https://issues.redhat.com/issues/?jql=project%20%3D%20MON%20AND%20labels%20%3D%20incident) 118 | 119 | 3. If AppSRE is not yet aware, drop a link to the created incident ticket in the #sd-app-sre channel and notify `@app-sre-primary` and `@observatorium-oncall`. Don't repeat yourself; ask everyone to follow the ticket comments. 120 | 121 | * AppSRE may or may not create a dedicated channel, handle communication efforts and start an on-call Zoom meeting. We as the dev team don't need to worry about those elements; go to step 4. 122 | 4. Investigate possible mitigations. Ensure the problem is mitigated before focusing on root cause analysis. 123 | 124 | * Important: Note all performed actions and observations on the created JIRA ticket via comments. This allows anyone to follow up on what was checked and how. It is also essential for a detailed Post Mortem / RCA process later on. 125 | * Note on the JIRA ticket all automation or monitoring gaps you wished you had. This will be useful for actions after the incident. 126 | 5. Confirm with AppSRE that the incident is mitigated. Investigate the root cause. If mitigation is applied and the root cause is known, declare the incident over. 127 | 6. Note the potential long-term fixes and ideas. Close the incident JIRA ticket. 128 | 129 | ### After Incident 130 | 131 | 1. After some time (within a week), start a Post-mortem (RCA) document in Google Docs. Use the following [Google Docs RCA template](https://docs.google.com/document/d/12ZVT35yApp7D-uT4p29cEhS9mpzin4Z-Ufh9eOiiaKU/edit). Put it in our RCA Team Google directory [here](https://drive.google.com/drive/folders/1z1j2ZwljT9jv-aYu7bkXzi03XMdBdFT9). Link it in the JIRA ticket too. 132 | 2. Collaborate on the Post Mortem. Make sure it is *blameless* but accurate. Share it as soon as possible with the team and AppSRE. 133 | 3. Once done, schedule a meeting with the Observability Platform team and optionally AppSRE to discuss the RCA/Post Mortem action items and their effects. 134 | 135 | > Idea: If you have time, before sharing the Post Mortem / RCA perform a "Wheel of Misfortune". Select an on-call engineer who was not participating in the incident and simulate the error by triggering the root cause in a safe environment. Then meet together with the team and allow the engineer to coordinate the simulated incident. Help along the way to share knowledge and insights. This is the best way to on-board people to production topics. 136 | 137 | > NOTE: The freely available SRE book is a good source of general patterns around efficient incident management. Recommended read!
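For orientation, the `ref` bump described in the [Configuration](#configuration) deploying steps happens inside an app-interface SaaS file. The excerpt below is a rough, hand-written sketch only; the field names and paths are assumptions from memory, not a copy of the real `saas.yaml` schema or values (see the link in the deployment steps above for the real file).

```yaml
# Hypothetical, simplified saas.yaml excerpt (illustrative only).
resourceTemplates:
- name: telemeter
  url: https://github.com/rhobs/configuration
  targets:
  - namespace:
      $ref: /services/rhobs/namespaces/telemeter-stage.yml
    ref: <commit SHA from rhobs/configuration main>
  - namespace:
      $ref: /services/rhobs/namespaces/telemeter-production.yml
    ref: <commit SHA that was previously deployed to staging>
```

The two `ref` fields are exactly what the notes above refer to: don't bump production and staging in the same MR, and only promote to production a SHA that already ran in staging.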
138 | -------------------------------------------------------------------------------- /content/Proposals/Done/202106-proposals-process.md: -------------------------------------------------------------------------------- 1 | # 2021-06: Proposal Process 2 | 3 | * **Owners:** 4 | * [`@bwplotka`](https://github.com/bwplotka) 5 | 6 | * **Other docs:** 7 | * [KEP Process](https://github.com/kubernetes/enhancements/blob/master/keps/README.md) 8 | * [Observability Team Process (Internal)](https://docs.google.com/document/d/1eojXStPdq1hYwv36pjE-vKR1q3dlBbpIx5w_L_v2gNo/edit) 9 | 10 | > TL;DR: We would like to propose an improved, official proposal process for the Monitoring Group that clearly states when, where and how to create proposal/enhancement/design documents. 11 | 12 | ## Why 13 | 14 | More extensive architectural, process, or feature decisions are hard to explain, understand and discuss. It takes a lot of time to describe the idea and to motivate interested parties to review it, give feedback and approve it. That's why it is essential to streamline the proposal process. 15 | 16 | Given that we work in highly distributed teams and work with multiple communities, we need to allow asynchronous discussions. This means it's essential to structure the talks into shared documents. Persisting those decisions, once approved or rejected, is equally important, allowing us to understand previous motivations. 17 | 18 | There is a common saying: [`"I've just been around long enough to know where the bodies are buried"`](https://twitter.com/AlexJonesax/status/1400103567822835714). We want to ensure the team-related knowledge is accessible to everyone, every day, no matter if the team member is new or has been part of the team for ten years. 19 | 20 | ### Pitfalls of the current solution 21 | 22 | Currently, the Observability Platform team has a process defined [here (internal)](https://docs.google.com/document/d/1eojXStPdq1hYwv36pjE-vKR1q3dlBbpIx5w_L_v2gNo/edit#heading=h.kpdg1wrd3pcc), whereas the In-Cluster part did not define any official process ([as per here (internal)](https://docs.google.com/document/d/1vbDGcjMjJMTIWcua5Keajla9FzexjLKmVk7zoUc0_MI/edit#heading=h.n0ac5lllvh13)). 23 | 24 | In practice, both teams had a somewhat similar flow: 25 | 26 | * For upstream: Follow the upstream project's contributing guide, e.g. Thanos. 27 | * For downstream: 28 | * Depending on the size: 29 | * Small features can be proposed during the bi-weekly team-sync or directly in Slack. 30 | * If the team can reach consensus in this time, then document the decision somewhere written, e.g. an email, a Slack message to which everyone can add an emoji reaction, etc. 31 | * Add a JIRA ticket to plan this work. 32 | * Large features might need a design doc: 33 | 1. Add a JIRA ticket for creating the design doc 34 | 2. Create a new Google Doc in the team folder based on [this template](https://docs.google.com/document/d/1ddl_dLxjoIvWQuRgLdzL2Gd1EX1mkJQUZ-rgUh-T4d8/edit) 35 | 3. Fill in the sections 36 | 4. Announce it on the team mailing list and Slack channel 37 | 5. Address comments / concerns. 6. Define what "done" means for this proposal, i.e. what is the purpose of this design document: 38 | * Knowledge sharing / Brain dump: This kind of document may not need a thorough review or any official approval 39 | * Long term vision and Execution & Implementation: If approved (with LGTM comments, or in an approved section) by a majority of the team and there are no major concerns, consider it approved. NOTE: The same applies to rejected proposals. 40 | 1.
If the document has no more offline comments and no consensus was reached, schedule a meeting with interested parties. 41 | 2. When the document changes status, move it to the appropriate status folder in the design docs directory of the team folder. If an approved proposal concerns a component with its own directory, e.g. Telemeter, then create a shortcut to the proposal document in the component-specific directory. This helps us find design documents by topic and by status. 42 | 43 | It served us well, but it had the following issues (really similar to the ones stated in the [handbook proposal](../Accepted/202106-handbook.md#pitfalls-of-the-current-solution)): 44 | 45 | * Even if our Google design docs are organized in our team drive, those Google documents are not easily discoverable. 46 | * Existing Google doc-based documents are hard to consume. The formatting is widely different. Naming is inconsistent. 47 | * Document creation is rarely actionable. There is no review process, so the effort of creating a relevant document might be wasted, as the document is lost. This also leads to docs being left in a half-completed state, demotivating readers from looking at them. 48 | * It's hard to track previous discussions around proposals and who approved them. 49 | * It's not public, and it's hard to share good proposals with other external and internal teams. 50 | 51 | ## Goals 52 | 53 | Goals and use cases for the solution as proposed in [How](#how): 54 | 55 | * Allow easy collaboration and decision making on design ideas. 56 | * Have a consistent design style that is readable and understandable. 57 | * Ensure design docs are discoverable for better awareness and knowledge sharing about past decisions. 58 | * Define a clear review and approval process. 59 | 60 | ## Non-Goals 61 | 62 | * Define a process for other documents (see [handbook proposal](../Accepted/202106-handbook.md#pitfalls-of-the-current-solution)) 63 | 64 | ## How 65 | 66 | We want to propose an improved, official proposal process for the Monitoring Group that clearly states *when, where and how* to create proposal/enhancement/design documents. 67 | 68 | Everything starts with a problem statement. It might be missing functionality, confusing existing functionality or broken functionality. It might be an annoying process, or a performance or security issue (or a potential one). 69 | 70 | ### Where to Propose Changes/Where to Submit Proposals? 71 | 72 | As defined in the [handbook proposal](../Accepted/202106-handbook.md#pitfalls-of-the-current-solution), the Handbook is meant to be an index for our team resources and a linking point to the other distributed projects we maintain or contribute to. 73 | 74 | First, we need to identify whether the idea we have is something we can contribute to an upstream project, or whether it does not fit anywhere else, in which case we can leverage the [Handbook Proposal directory](..) and the [process](#handbook-proposal-process). See the algorithm below to find out: 75 | 76 | ![where](../../assets/proposal-where.png) 77 | 78 | [Internal Team Drive for Public and Confidential Proposals](https://drive.google.com/drive/folders/1WGqC3gMCxIQlrnjDUYfNUTPYYRI5Cxto) 79 | 80 | [Templates](../process.md#templates) 81 | 82 | ### Handbook Proposal Process 83 | 84 | If there is no problem, there is no need to change anything and no need for a proposal. This might feel trivial, but we should first ask ourselves this question before even thinking about writing a proposal.
85 | 86 | It takes time to propose an idea, find consensus and implement more significant concepts, so let's not waste time before it's worth it. Unfortunately, even good ideas sometimes have to wait for a good moment to be discussed. 87 | 88 | Let's assume the idea sounds interesting to you; what do you do next, where do you propose it, and how is it reviewed? Follow the algorithm below: 89 | 90 | ![where](../../assets/proposal-how.png) 91 | 92 | > Note: It's totally ok to reject a proposal if a team member feels the idea is wrong. It's better to explicitly oppose it than to ignore it and leave it in limbo. 93 | 94 | > NOTE: We would love to host the Logging and Tracing Teams if they choose to follow our process, but we don't want to enforce it. We are happy to extend this process from the Monitoring Group handbook to the Observability Group. Still, it has to grow organically (if the Logging and Tracing teams see the value of joining us here). 95 | 96 | ### On Review Process 97 | 98 | As you can see in the above algorithm, if the content relates to any upstream project, it should be proposed, reviewed and potentially implemented together with the community. This does not mean that you cannot involve other team members in this effort. Share the proposal with team members even if they are not part of the maintainer team of a given project; any feedback and voice is useful and can help move the idea further. 99 | 100 | Similarly, for proposals that touch our team only, despite the mandatory approval process from leads, anyone can give feedback! Our process is in fact very similar to [Hashicorp's RFC process](https://works.hashicorp.com/articles/writing-practices-and-culture): 101 | 102 | > Once you’ve written the first draft of an RFC, share it with your team. They’re likely to have the most context on your proposal and its potential impacts, so most of your feedback will probably come at this stage. Any team member can comment on and approve an RFC, but you need explicit approval only from the appropriate team leads in order to move forward. Once the RFC is approved and shared with stakeholders, you can start implementing the solution. For major projects, also share the RFC to the company-wide email list. While most members of the mailing list will just read the email rather than the full RFC, sending it to the list gives visibility into major decisions being made across the company. 103 | 104 | ### Summary 105 | 106 | Overall, we want to build a culture where design docs are reviewed within a certain amount of time and authors (team members) are given feedback. This, coupled with recognizing the work and being able to add it to your list of achievements (even if the proposal was rejected), should bring more motivation for people and teams to assess ideas in a structured, sustainable way. 107 | 108 | ## Alternatives 109 | 110 | 1. Organize Team Google Drive with all Google docs we have. 111 | 112 | Pros: 113 | * Great for initial collaboration 114 | 115 | Cons: 116 | * Inconsistent format 117 | * Hard to track approvers 118 | * Never know when the doc is "completed." 119 | * Hard to maintain over time 120 | * Hard to share and reuse outside 121 | 122 | ## Action Plan 123 | 124 | * [X] Explain the process in the [Proposal Process Guide](../process.md) 125 | * [ ] Move existing up-to-date public design docs over to the Handbook (deadline: End of July). 126 | * TIP: You can use this [Google Chrome Plugin](https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607) to convert a Google Doc into markdown easily.
127 | * [ ] Propose a similar process to upstream projects that do not have it. 128 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 
179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /content/Projects/Observability/kube-rbac-proxy.md: -------------------------------------------------------------------------------- 1 | # kube-rbac-proxy 2 | 3 | `kube-rbac-proxy`, as the name suggests, is an HTTP proxy that sits in front of a workload and performs authentication and authorization of incoming requests using the `TokenReview` and `SubjectAccessReview` resources of the Kubernetes API. 4 | 5 | ## Workflow 6 | 7 | The purpose of `kube-rbac-proxy` is to distinguish between calls made by same or different user(s) (or service account(s)) to endpoint(s) and protect them from unauthorized resource access based on their trusted identity (e.g. tokens, TLS certificates, etc.) or the RBACs they hold, respectively. Once the request is authenticated and/or authorized, the proxy forwards the response from the server to the client unmodified. 8 | 9 | ### [**Authentication**](https://github.com/brancz/kube-rbac-proxy/blob/74181c75b8b6fbcde7eff1ca5fae353faac5cfae/pkg/authn/config.go#L33-L39) 10 | 11 | kube-rbac-proxy can be configured with one of the 2 mechanisms for authentication: 12 | 13 | * [OpenID Connect](https://github.com/brancz/kube-rbac-proxy/blob/52e49fbdb75e009db4d02e3986e51fdba0526378/pkg/authn/oidc.go#L45-L63) where kube-rbac-proxy validates the client-provided token against the configured OIDC provider. This mechanism isn't used by the monitoring components. 14 | 15 | * Kubernetes API using bearer tokens or mutual TLS: 16 | * [Delegated authentication](https://github.com/kubernetes/apiserver/blob/8ad2e288d62d02276033ea11ee1efd94bb627836/pkg/authentication/authenticatorfactory/delegating.go#L102-L112) relies on Bearer tokens. The token represents the identity of the user or service account that is making the request and kube-rbac-proxy uses a [`TokenReview` request](https://github.com/kubernetes/apiserver/blob/21bbcb57c672531fe8c431e1035405f9a4b061de/plugin/pkg/authenticator/token/webhook/webhook.go#L51-L53) to verify the identity of the client. 17 | * If kube-rbac-proxy is configured with a client certificate authority, it can also verify the identify of the client presenting a TLS certificate. Some monitoring components use this [mechanism](#downstream-usage) which avoids a round-trip communication with the Kubernetes API server. 
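To make the delegated-authentication flow above more tangible, the sketch below shows roughly what a `TokenReview` exchange looks like: the proxy submits the bearer token it received, and the API server answers with the authenticated identity. This is an illustrative, hand-written object (the placeholder token and the exact set of returned fields are assumptions), not output captured from a real cluster.

```yaml
# Illustrative TokenReview round-trip: "spec" is what kube-rbac-proxy sends,
# "status" is what the API server fills in on a successful authentication.
apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  token: <bearer token taken from the incoming request's Authorization header>
status:
  authenticated: true
  user:
    username: system:serviceaccount:openshift-monitoring:prometheus-k8s
    groups:
    - system:serviceaccounts
    - system:serviceaccounts:openshift-monitoring
    - system:authenticated
```

If `status.authenticated` comes back `false`, the proxy rejects the request (see the `401` behaviour described below).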
18 | 19 | In the case of a failed authentication, an [HTTP `401 Unauthorized` status code](https://github.com/brancz/kube-rbac-proxy/blob/9efde2a776fd516cfa082cc5f2c35c7f9e0e2689/pkg/filters/auth_test.go#L290) is returned (note that, despite its name, `401 Unauthorized` signals an *authentication* failure here, as opposed to the `403 Forbidden` returned on authorization failures). Note that anonymous access is always disabled, and the proxy doesn't rely on HTTP headers to authenticate the request, but it can add them if started with `--auth-header-fields-enabled`. 20 | 21 | Refer to [this page](https://kubernetes.io/docs/reference/access-authn-authz/authentication/) for more information on authentication in Kubernetes. 22 | 23 | ### [**Authorization**](https://github.com/brancz/kube-rbac-proxy/blob/1c7f88b5e951d25a493a175e93515068f5c77f3b/pkg/authz/auth.go#L31C1-L37) 24 | 25 | Once authentication is done, `kube-rbac-proxy` must then decide whether or not to allow the user's request to go through. A [`SubjectAccessReview` request is created](https://github.com/kubernetes/apiserver/blob/21bbcb57c672531fe8c431e1035405f9a4b061de/plugin/pkg/authorizer/webhook/webhook.go#L57-L59) for the API server, which allows for the review of the subject's access to a particular resource. Essentially, it checks whether the authenticated user or service account has sufficient permissions to perform the desired action on the requested resource, based on the RBAC permissions granted to it. If so, the request is forwarded to the endpoint, otherwise it is rejected. It is worth mentioning that the HTTP verbs are internally mapped to their [corresponding RBAC verbs](https://github.com/brancz/kube-rbac-proxy/blob/ccd5bc7fec36f9db0747033c2d698cc75a0e314c/pkg/proxy/proxy.go#L49-L60). Note that static authorization (as described in the [downstream usage](#downstream-usage) section) without SubjectAccessReview is also possible. 26 | 27 | Once the request is authenticated and authorized, it is forwarded to the endpoint. The response from the endpoint is then forwarded back to the client. If the request fails at any point, the proxy returns an error response to the client. If the authorization step fails, i.e., the client doesn't have the required permissions to access the requested resource, `kube-rbac-proxy` returns an [HTTP `403 Forbidden` status code](https://github.com/brancz/kube-rbac-proxy/blob/9efde2a776fd516cfa082cc5f2c35c7f9e0e2689/pkg/filters/auth_test.go#L300) to the client and does not forward the request to the endpoint. 28 | 29 | ## Downstream usage 30 | 31 | ### Inter-component communication 32 | 33 | In the context of monitoring, we're talking here about metric scrapes. These communications are usually secured using Mutual TLS (mTLS), which is a two-way authentication mechanism (see [configuring Prometheus to scrape metrics](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/)). 34 | 35 | Initially, the server (the scraped component, fronted by `kube-rbac-proxy`) provides its digital certificate to the client (Prometheus), which validates the server's identity. The process is then reciprocated, as the client shares its digital certificate for authentication by the server. Following the successful completion of these authentication steps, a secure channel for encrypted communication is established, ensuring that data transfer between the entities is duly safeguarded. 36 | 37 | ```yaml 38 | apiVersion: apps/v1 39 | kind: Deployment 40 | ...
41 | spec: 42 | template: 43 | spec: 44 | containers: 45 | - name: kube-rbac-proxy 46 | image: quay.io/brancz/kube-rbac-proxy:v0.8.0 47 | args: 48 | - "--tls-cert-file=/etc/tls/private/tls.crt" 49 | - "--tls-private-key-file=/etc/tls/private/tls.key" 50 | - "--client-ca-file=/etc/tls/client/client-ca.crt" 51 | ... 52 | ``` 53 | 54 | CMO specifies the aforementioned CA certificate in the [metrics-client-ca](https://github.com/openshift/cluster-monitoring-operator/blob/6795f509e77cce6d24e5a3e371a432ca22e1a8e7/assets/cluster-monitoring-operator/metrics-client-ca.yaml) ConfigMap which is used to define client certificates for every `kube-rbac-proxy` container that's safeguarding a component. The component's `Service` endpoints are secured using the generated TLS `Secret` annotating it with the `service.beta.openshift.io/serving-cert-secret-name`. Internally, this requests the [`service-ca`](https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/authentication/configuring-certificates#understanding-service-serving_service-serving-certificate) controller to generate a `Secret` containing a certificate and key pair for the `${service.name}.${service.namespace}.svc`. These TLS manifests are then used in various component `ServiceMonitors` to define their TLS configurations, and within CMO to ensure a "mutual" acknowledgement between the two. 55 | 56 | Static authorization involves configuring `kube-rbac-proxy` to allow access to certain resources or non-resources which are evaluated against the `Role` or `ClusterRole` RBAC permissions the user or the service account has. The example below demonstrates how this can be employed to give access to a known `ServiceAccount` to the `/metrics` endpoint. `/metrics` endpoints exposed by various monitoring components are protected this way. Note that after the initial user or service account authentication, the request is matched against a comma-separated list of paths, as defined by the [`--allow-path`](https://github.com/brancz/kube-rbac-proxy/blob/067e14a2d1ecdfe8c18da6b0a0507cd4684e2c1c/cmd/kube-rbac-proxy/app/options/options.go#L83) flag, [like so](https://github.com/openshift/cluster-monitoring-operator/blob/4e8efd9864bff3ff46499e86fef8bba1e0178f54/assets/alertmanager/alertmanager.yaml#L96). 57 | 58 | ```yaml 59 | apiVersion: v1 60 | kind: Secret 61 | ... 62 | stringData: 63 | # "path" is the path to match against the request path. 64 | # "resourceRequest" is a boolean indicating whether the request is for a resource or not. 65 | # "user" is the user to match against the request user. 66 | # "verb" is the verb to match against the corresponding request RBAC verb. 67 | config.yaml: |- 68 | "authorization": 69 | "static": 70 | - "path": "/metrics" 71 | "resourceRequest": false 72 | "user": 73 | "name": "system:serviceaccount:openshift-monitoring:prometheus-k8s" 74 | "verb": "get" 75 | ``` 76 | 77 | For more details, refer to the `kube-rbac-proxy`'s [static authorization](https://github.com/brancz/kube-rbac-proxy/blob/4a44b610cd12c4cfe076a2b306283d0598c1bb7a/examples/static-auth/README.md#L169) example. 78 | 79 | For more information on collecting metrics in such cases, refer to [this section](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#exposing-metrics-for-prometheus) of the handbook. 80 | 81 | ### Securing API endpoints 82 | 83 | `kube-rbac-proxy` is also used to secure API endpoints such as Prometheus, Alertmanager and Thanos. 
In this case, the proxy is configured to authenticate requests based on bearer tokens and to perform authorization with `SubjectAccessReview`. 84 | 85 | The following components use the same method in their `kube-rbac-proxy` configurations `Secrets` to authorize the `/metrics` endpoint and restrict it to `GET` requests only: 86 | 87 | * `alertmanager-kube-rbac-proxy-metric` (`alertmanager`) 88 | * `openshift-user-workload-monitoring` (`alertmanager-user-workload`) 89 | * `kube-state-metrics-kube-rbac-proxy-config` (`kube-state-metrics`) 90 | * `node-exporter-kube-rbac-proxy-config` (`node-exporter`) 91 | * `openshift-state-metrics-kube-rbac-proxy-config` (`openshift-state-metrics`) 92 | * `kube-rbac-proxy` (`prometheus-k8s`) (additionally the `/federate` endpoint, for the telemeter as well as its own client) 93 | * `prometheus-operator-kube-rbac-proxy-config` (`prometheus-operator`) 94 | * `prometheus-operator-uwm-kube-rbac-proxy-config` (`prometheus-operator`) 95 | * `kube-rbac-proxy-metrics` (`prometheus-user-workload`) 96 | * `telemeter-client-kube-rbac-proxy-config` (`telemeter-client`) 97 | * `thanos-querier-kube-rbac-proxy-metrics` (`thanos-querier`) 98 | * `thanos-ruler-kube-rbac-proxy-metrics` (`thanos-ruler`) 99 | 100 | On the other hand, the example below depicts restricted access to a resource, i.e., `monitoring.coreos.com/prometheusrules` in the `openshift-monitoring` namespace. 101 | 102 | ```yaml 103 | apiVersion: v1 104 | kind: Secret 105 | ... 106 | stringData: 107 | # "resourceAttributes" describes attributes available for resource request authorization. 108 | # "rewrites" describes how SubjectAccessReview may be rewritten on a given request. 109 | # "rewrites.byQueryParameter" describes which HTTP URL query parameter is to be used to rewrite a SubjectAccessReview 110 | # on a given request. 111 | config.yaml: |- 112 | "authorization": 113 | "resourceAttributes": 114 | "apiGroup": "monitoring.coreos.com" 115 | "namespace": "{{ .Value }}" 116 | "resource": "prometheusrules" 117 | "rewrites": 118 | "byQueryParameter": 119 | "name": "namespace" 120 | ``` 121 | 122 | The following components use the same method in their `kube-rbac-proxy` configuration `Secrets` to authorize the respective resources: 123 | * `alertmanager-kube-rbac-proxy` (`alertmanager`): `prometheusrules` 124 | * `alertmanager-kube-rbac-proxy-tenancy` (`alertmanager-user-workload`): `prometheusrules` 125 | * `kube-rbac-proxy-federate` (`prometheus-user-workload`): `namespaces` 126 | * `thanos-querier-kube-rbac-proxy-rules` (`thanos-querier`): `prometheusrules` 127 | * `thanos-querier-kube-rbac-proxy` (`thanos-querier`): `pods` 128 | 129 | Note that all applicable omitted configuration settings are interpreted as wildcards. 130 | 131 | ## Configuration 132 | 133 | Details on configuring `kube-rbac-proxy` under different scenarios can be found in the repository's [/examples](https://github.com/brancz/kube-rbac-proxy/tree/9f436d46699dfd425f2682e4338069642b682892/examples) section. 
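Complementing the `TokenReview` sketch earlier, the following is a rough illustration of the `SubjectAccessReview` that the proxy asks the API server to evaluate for a namespace-scoped request, such as the `prometheusrules` tenancy endpoints above. Again, this is a hand-written, illustrative object (the namespace and user are placeholders), not captured cluster output.

```yaml
# Illustrative SubjectAccessReview: "may this user GET prometheusrules in my-namespace?"
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: system:serviceaccount:my-namespace:example-tenant   # placeholder identity
  resourceAttributes:
    group: monitoring.coreos.com
    resource: prometheusrules
    namespace: my-namespace   # taken from the rewritten "namespace" query parameter
    verb: get                 # HTTP GET mapped to the RBAC "get" verb
status:
  allowed: true
```

A `status.allowed: false` answer results in the `403 Forbidden` response described in the [Workflow](#workflow) section.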
134 | 135 | ## Debugging 136 | 137 | In addition to enabling debug logs or compiling a custom binary with debugging capabilities (`-gcflags="all=-N -l"`), users can: 138 | * [put the CMO into unmanaged state](https://github.com/openshift/cluster-monitoring-operator/blob/e1803adfa64f9ef424cd6e10790791adbed25eb4/hack/local-cmo.sh#L68-L103) to enable a higher verbose level using `-v=12` (or higher), or, 139 | * grep [the audit logs](https://docs.openshift.com/container-platform/4.13/security/audit-log-view.html) for more information about requests or responses concerning `kube-rbac-proxy`. 140 | -------------------------------------------------------------------------------- /content/Products/OpenshiftMonitoring/collecting_metrics.md: -------------------------------------------------------------------------------- 1 | # Collecting metrics with Prometheus 2 | 3 | This document explains how to ingest metrics into the OpenShift Platform monitoring stack. **It only applies for the OCP core components and Red Hat certified operators.** 4 | 5 | For user application monitoring, please refer to the [official OCP documentation](https://docs.openshift.com/container-platform/latest/monitoring/enabling-monitoring-for-user-defined-projects.html). 6 | 7 | ## Targeted audience 8 | 9 | This document is intended for OpenShift developers that want to expose Prometheus metrics from their operators and operands. Readers should be familiar with the architecture of the [OpenShift cluster monitoring stack](https://docs.openshift.com/container-platform/latest/monitoring/monitoring-overview.html#understanding-the-monitoring-stack_monitoring-overview). 10 | 11 | ## Exposing metrics for Prometheus 12 | 13 | Prometheus is a monitoring system that pulls metrics over HTTP, meaning that monitored targets need to expose an HTTP endpoint (usually `/metrics`) which will be queried by Prometheus at regular intervals (typically every 30 seconds). 14 | 15 | To avoid leaking sensitive information to potential attackers, all OpenShift components scraped by the in-cluster monitoring Prometheus should follow these requirements: 16 | * Use HTTPS instead of plain HTTP. 17 | * Implement proper authentication (e.g. verify the identity of the requester). 18 | * Implement proper authorization (e.g. authorize requests issued by the Prometheus service account or users with GET permission on the metrics endpoint). 19 | 20 | All the requirements are well defined in the [OpenShift developer documentation](https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#metrics). As described in the [Client certificate scraping](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/client-cert-scraping.md) enhancement proposal, we recommend that the components rely on client TLS certificates for authentication/authorization. This is more efficient and robust than using bearer tokens because token-based authn/authz add a dependency (and additional load) on the Kubernetes API. 21 | 22 | To this goal, the Cluster monitoring operator provisions a TLS client certificate for the in-cluster Prometheus. The client certificate is issued for the `system:serviceaccount:openshift-monitoring:prometheus-k8s` Common Name (CN) and signed by the `kubernetes.io/kube-apiserver-client` [signer](https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers). 
The certificate can be verified using the certificate authority (CA) bundle located at the `client-ca-file` key of the `kube-system/extension-apiserver-authentication` ConfigMap. 23 | 24 | {{% alert color="info" %}} In practice the Cluster Monitoring Operator creates a CertificateSigningRequest object for the `prometheus-k8s` service account which is automatically approved by the cluster-policy-controller. Once the certificate is issued by the controller, CMO provisions a secret named `metrics-client-certs` which contains the TLS certificate and key (respectively under `tls.crt` and `tls.key` keys in the secret). CMO also rotates the certificate before it gets expired.{{% /alert %}} 25 | 26 | There are several options available depending on which framework your component is built. 27 | 28 | ### library-go 29 | 30 | If your component already relies on `*ControllerCommandConfig` from `github.com/openshift/library-go/pkg/controller/controllercmd`, it should automatically expose a TLS-secured `/metrics` endpoint which has an hardcoded authorizer for the `system:serviceaccount:openshift-monitoring:prometheus-k8s` service account (refer to [github.com/openshift/library-go/pkg/authorization/hardcodedauthorizer](https://pkg.go.dev/github.com/openshift/library-go/pkg/authorization/hardcodedauthorizer) package and [usage](https://github.com/openshift/library-go/blob/24668b1349e6276ebfa9f9e49c780559284defed/pkg/controller/controllercmd/builder.go#L277-L279)). Requests authenticating with bearer tokens are allowed if the authenticated user has the `get` verb permission on the `/metrics` (non-resource) URL. 31 | 32 | Example: the [Cluster Kubernetes API Server Operator](https://github.com/openshift/cluster-kube-apiserver-operator/). 33 | 34 | ### kube-rbac-proxy sidecar 35 | 36 | The "simplest" option when the component doesn't rely on `github.com/openshift/library-go` (and switching to library-go isn't an option) is to run a [`kube-rbac-proxy`](https://github.com/openshift/kube-rbac-proxy) sidecar in the same pod as the application being monitored. 37 | 38 | Here is an example of a sidecar container's definition to be added to the Pod's template of the Deployment (or Daemonset): 39 | 40 | ```yaml 41 | - args: 42 | - --secure-listen-address=0.0.0.0:8443 43 | - --upstream=http://127.0.0.1:8081 # make sure that the upstream container listens on 127.0.0.1. 44 | - --config-file=/etc/kube-rbac-proxy/config.yaml 45 | - --tls-cert-file=/etc/tls/private/tls.crt 46 | - --tls-private-key-file=/etc/tls/private/tls.key 47 | - --client-ca-file=/etc/tls/client/client-ca-file 48 | - --logtostderr=true 49 | - --allow-paths=/metrics 50 | image: quay.io/brancz/kube-rbac-proxy:v0.11.0 # usually replaced by CVO by the OCP kube-rbac-proxy image reference. 51 | name: kube-rbac-proxy 52 | ports: 53 | - containerPort: 8443 54 | name: metrics 55 | resources: 56 | requests: 57 | cpu: 1m 58 | memory: 15Mi 59 | securityContext: 60 | allowPrivilegeEscalation: false 61 | capabilities: 62 | drop: 63 | - ALL 64 | terminationMessagePolicy: FallbackToLogsOnError 65 | volumeMounts: 66 | - mountPath: /etc/kube-rbac-proxy 67 | name: secret-kube-rbac-proxy-metric 68 | readOnly: true 69 | - mountPath: /etc/tls/private 70 | name: secret-kube-rbac-proxy-tls 71 | readOnly: true 72 | - mountPath: /etc/tls/client 73 | name: metrics-client-ca 74 | readOnly: true 75 | [...] 76 | - volumes: 77 | # Secret created by the service CA operator. 
78 | # We assume that the Kubernetes service exposing the application's pods has the 79 | # "service.beta.openshift.io/serving-cert-secret-name: kube-rbac-proxy-tls" 80 | # annotation. 81 | - name: secret-kube-rbac-proxy-tls 82 | secret: 83 | secretName: kube-rbac-proxy-tls 84 | # Secret containing the kube-rbac-proxy configuration (see below). 85 | - name: secret-kube-rbac-proxy-metric 86 | secret: 87 | secretName: secret-kube-rbac-proxy-metric 88 | # ConfigMap containing the CA used to verify the client certificate. 89 | - name: metrics-client-ca 90 | configMap: 91 | name: metrics-client-ca 92 | ``` 93 | 94 | {{% alert color="info"%}}The `metrics-client-ca` ConfigMap needs to be created by your component and synced from the `kube-system/extension-apiserver-authentication` ConfigMap.{{% /alert %}} 95 | 96 | Here is a Secret containing the kube-rbac-proxy's configuration (it allows only HTTPS requets to the `/metrics` endpoint for the Prometheus service account): 97 | 98 | ```yaml 99 | apiVersion: v1 100 | kind: Secret 101 | metadata: 102 | name: secret-kube-rbac-proxy-metric 103 | namespace: openshift-example 104 | stringData: 105 | config.yaml: |- 106 | "authorization": 107 | "static": 108 | - "path": "/metrics" 109 | "resourceRequest": false 110 | "user": 111 | "name": "system:serviceaccount:openshift-monitoring:prometheus-k8s" 112 | "verb": "get" 113 | type: Opaque 114 | ``` 115 | 116 | Example: [node-exporter](https://github.com/openshift/cluster-monitoring-operator/blob/e51a06ffdb974003d4024ade3545f5e5e6efe157/assets/node-exporter/daemonset.yaml#L65-L98) from the Cluster Monitoring operator. 117 | 118 | ### controller-runtime (>= v0.16.0) 119 | 120 | Starting with v0.16.0, the `controller-runtime` framework provides a way to expose and secure a `/metrics` endpoint using TLS with minimal effort. 121 | 122 | Refer to https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/metrics/server for details about TLS configuration and check the next section to understand how it needs to be configured. 123 | 124 | As an example, you can refer to the [Observability Operator](https://github.com/rhobs/observability-operator/blob/2d192ea23b05bf4e088cf331666fd616efbb3072/pkg/operator/operator.go#L143-L212) implementation. 125 | 126 | ### Roll your own HTTPS server 127 | 128 | {{% alert color="info" %}}You don't use `library-go`, `controller-runtime` >= v0.16.0 or don't want to run a `kube-rbac-proxy` sidecar.{{% /alert %}} 129 | 130 | In such situations, you need to implement your own HTTPS server for `/metrics`. As explained before, it needs to require and verify the TLS client certificate using the root CA stored under the `client-ca-file` key of the `kube-system/extension-apiserver-authentication` ConfigMap. 131 | 132 | In practice, the server should: 133 | * Set TLSConfig's `ClientAuth` field to `RequireAndVerifyClientCert`. 134 | * Reload the root CA when the source ConfigMap is updated. 135 | * Reload the server's certificate and key when they are updated. 136 | 137 | Example: https://github.com/openshift/cluster-monitoring-operator/pull/1870 138 | 139 | ## Configuring Prometheus to scrape metrics 140 | 141 | To tell the Prometheus pods running in the `openshift-monitoring` namespace (e.g. `prometheus-k8s-{0,1}`) to scrape the metrics from your operator/operand pods, you should use `ServiceMonitor` and/or `PodMonitor` custom resources. 142 | 143 | The workflow is: 144 | * Add the `openshift.io/cluster-monitoring: "true"` label to the namespace where the scraped targets live. 
145 | * **Important: only OCP core components and Red Hat certified operators can set this label on namespaces.** 146 | * OCP core components can set the label on their namespaces in the CVO manifets directly. 147 | * For OLM operators: 148 | * There's no automatic way to enforce the label (yet). 149 | * The OCP console will display a checkbox at installation time to enable cluster monitoring for the operator if you add the `operatorframework.io/cluster-monitoring=true` annotation to the operator's CSV and if the installation namespace starts with `openshift-`. 150 | * For CLI installations, the requirement should be detailed in the installation procedure ([example](https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/logging/cluster-logging-deploying#logging-loki-cli-install_cluster-logging-deploying) for the Logging operator). 151 | * Add Role and RoleBinding to give the prometheus-k8s service account access to pods, endpoints and services in your namespace. 152 | * In case of ServiceMonitor: 153 | * Create a Service object selecting the scraped pods. 154 | * Create a ServiceMonitor object targeting the Service. 155 | * In case of PodMonitor: 156 | * Create a PodMonitor object targeting the pods. 157 | 158 | Below is an fictitious example using a ServiceMonitor object to scrape metrics from pods deployed in the `openshift-example` namespace. 159 | 160 | **Role and RoleBinding manifests** 161 | 162 | ```yaml 163 | apiVersion: rbac.authorization.k8s.io/v1 164 | kind: Role 165 | metadata: 166 | name: prometheus-k8s 167 | namespace: openshift-example 168 | rules: 169 | - apiGroups: 170 | - "" 171 | resources: 172 | - services 173 | - endpoints 174 | - pods 175 | verbs: 176 | - get 177 | - list 178 | - watch 179 | --- 180 | apiVersion: rbac.authorization.k8s.io/v1 181 | kind: RoleBinding 182 | metadata: 183 | name: prometheus-k8s 184 | namespace: openshift-example 185 | roleRef: 186 | apiGroup: rbac.authorization.k8s.io 187 | kind: Role 188 | name: prometheus-k8s 189 | subjects: 190 | - kind: ServiceAccount 191 | name: prometheus-k8s 192 | namespace: openshift-monitoring 193 | ``` 194 | 195 | **Service manifest** 196 | 197 | ```yaml 198 | apiVersion: v1 199 | kind: Service 200 | metadata: 201 | annotations: 202 | # This annotation tells the service CA operator to provision a Secret 203 | # holding the certificate + key to be mounted in the pods. 204 | # The Secret name is "" (e.g. "secret-my-app-tls"). 205 | service.beta.openshift.io/serving-cert-secret-name: tls-my-app-tls 206 | labels: 207 | app.kubernetes.io/name: my-app 208 | name: metrics 209 | namespace: openshift-example 210 | spec: 211 | ports: 212 | - name: metrics 213 | port: 8443 214 | targetPort: metrics 215 | # Select all Pods in the same namespace that have the `app.kubernetes.io/name: my-app` label. 216 | selector: 217 | app.kubernetes.io/name: my-app 218 | type: ClusterIP 219 | ``` 220 | 221 | **ServiceMonitor manifest** 222 | 223 | ```yaml 224 | apiVersion: monitoring.coreos.com/v1 225 | kind: ServiceMonitor 226 | metadata: 227 | name: my-app 228 | namespace: openshift-example 229 | spec: 230 | endpoints: 231 | - interval: 30s 232 | # Matches the name of the service's port. 233 | port: metrics 234 | scheme: https 235 | tlsConfig: 236 | # The name of the server (CN) in the server's certificate. 
252 | ## Next steps
253 |
254 | * [Configure alerting](alerting.md) with Prometheus.
255 | * [Send Telemetry metrics](telemetry.md).
256 |
--------------------------------------------------------------------------------
/content/Teams/onboarding/README.md:
--------------------------------------------------------------------------------
1 | # Team member onboarding
2 |
3 |
4 |
5 | - [Team member onboarding](#team-member-onboarding)
6 |   - [Team Background](#team-background)
7 |   - [People relevant to Prometheus who you should know about:](#people-relevant-to-prometheus-who-you-should-know-about)
8 |   - [Thanos](#thanos)
9 |   - [Talks](#talks)
10 |   - [First days (accounts & access)](#first-days-accounts--access)
11 |   - [First weeks](#first-weeks)
12 |     - [General](#general)
13 |     - [Watch these talks](#watch-these-talks)
14 |     - [[optional] Additional information & Exploration](#optional-additional-information--exploration)
15 |     - [Who’s Who?](#whos-who)
16 |     - [First project](#first-project)
17 |   - [First Months](#first-months)
18 |     - [Second project](#second-project)
19 |   - [Glossary](#glossary)
20 |
21 | Welcome to the Monitoring Team! We created this document to help guide you through your onboarding.
22 |
23 | Please fork this repo and propose changes to the content as a pull request if something is not accurate, outdated or unclear. Use your fork to track the progress of your onboarding tasks, e.g. by keeping a copy in your fork with comments about your completion status. Please read the doc thoroughly and check off each item as you complete it. We constantly want to improve our onboarding experience and your feedback helps future team members.
24 |
25 | This document contains some links to Red Hat internal documents and you won't be able to access them without a Red Hat associate login. We are still trying to keep as much information as possible in the open.
26 |
27 | ## Team Background
28 |
29 | Team Monitoring mainly focuses on Prometheus, Thanos, Observatorium, and their integration into Kubernetes and OpenShift. We are also responsible for the hosted multi-tenant Observability service which powers services such as OpenShift Telemetry and OSD metrics.
30 |
31 | Prometheus is a monitoring project initiated at SoundCloud in 2012. It became public and widely usable in early 2015. Since then, it has found adoption across many industries. In early 2016 development diversified away from SoundCloud as CoreOS hired one of the core developers. Prometheus is not a single application but an entire ecosystem of well-separated components, which work well together but can be used individually.
32 |
33 | CoreOS was acquired by Red Hat in early 2018. The CoreOS monitoring team became the Red Hat Monitoring Team, which has since evolved into the “Observability group”. The group was divided into two teams:
34 |
35 | 1. The In-Cluster Observability Team
36 | 2. The Observability Platform team (aka RHOBS or Observatorium team)
37 |
38 | You may encounter references to these two teams. In early 2021 we decided to combine the efforts of these teams more closely in order to avoid working in silos and to ensure we have a well-functioning end-to-end product experience across both projects. We are, however, still split into separate scrum teams for efficiency. We are now collectively known as the “Monitoring Team” and each team works together across the various domains.
39 |
40 | We work on all important aspects of the upstream ecosystem and a seamless monitoring experience for Kubernetes using Prometheus. Part of that is integrating our open source efforts into the OpenShift Container Platform (OCP), the commercial Red Hat Kubernetes distribution.
41 |
42 | ## People relevant to Prometheus who you should know about:
43 |
44 | - Julius Volz ([@juliusv](https://github.com/juliusv)): Previously worked at SoundCloud, where he developed Prometheus. Now working as an independent contractor and organizing [PromCon](http://promcon.io/) (the Prometheus community conference). He also worked with weave.works on a prototype for remote/long-term time series storage and with InfluxDB on Flux PromQL support, and he contributed the new React-based Prometheus UI. He also created a new company, [PromLens](https://promlens.com/), offering a rich PromQL UI.
45 | - Bjoern Rabenstein ([@beorn7](https://github.com/beorn7)): Worked at SoundCloud but now works at Grafana. He's active again upstream and is the maintainer of pushgateway and client_golang (the Prometheus Go client library).
46 | - Frederic Branczyk ([@brancz](https://github.com/brancz)): Joined CoreOS in 2016 to work on Prometheus. A core team member since then and one of the minds behind our team's vision. Left Red Hat in 2020 to start his new company around continuous profiling (PolarSignals). He is still very active upstream.
47 | - Julien Pivotto ([@roidelapluie](https://github.com/roidelapluie)): prometheus/prometheus maintainer. Very active in other upstream projects (Alertmanager, …).
48 |
49 | ## Thanos
50 |
51 | Thanos is a monitoring system which was created based on Prometheus principles. It is a distributed version of Prometheus where every function of Prometheus (scraping, querying, storage, recording, alerting, and compaction) can be deployed as a separate, horizontally scalable component. This allows more flexible deployments and capabilities beyond single clusters. Thanos also supports object storage as the main storage option, allowing cheap long-term retention for metrics. In the end, it exposes the same (yet extended) Prometheus APIs and uses gRPC to communicate between components.
52 |
53 | Thanos was created in 2017 because of the scalability limits of Prometheus. At that point a similar project, Cortex, was emerging too, but it was overly complex at the time. In November 2017, Fabian Reinartz ([@fabxc](https://github.com/fabxc), consulting for Improbable at that time) and Bartek Plotka ([@bwplotka](https://github.com/bwplotka)) teamed up to create Thanos based on the Prometheus storage format. Around February 2018 the project was shown at the Prometheus meetup in London, and in summer 2018 it was announced at PromCon 2018. In 2019, our team at Red Hat, led at that point by Frederic Branczyk ([@brancz](https://github.com/brancz)), contributed essential pieces allowing Thanos to receive Prometheus metrics via remote write (push model). Since then, we have been able to leverage Thanos for Telemetry gathering and later for in-cluster monitoring, too.
54 |
55 | When working with Thanos you will most likely interact with Bartek Plotka ([@bwplotka](https://github.com/bwplotka)) and other team members.
56 |
57 | These are the people you’ll be in most contact with when working upstream. If you run into any communication issues, please let us know as soon as possible.
58 |
59 | ## Talks
60 |
61 | Advocating for sane monitoring and alerting practices (especially focused on Kubernetes environments) and how Prometheus implements them is part of our team’s work. That can happen internally or on public channels. If you are comfortable giving talks on the topic or on some specific work we have done, let us know so we can plan ahead to find you a speaking opportunity at meetups or conferences. If you are not comfortable but want to break this barrier, let us know as well; we can help you get more comfortable with public speaking step by step. If you want to submit a CFP for a talk, please add it to this [spreadsheet](https://docs.google.com/spreadsheets/d/1eo_JVND3k4ZnL25kgnhITSE2DBkyw8fwg3MyCXMjdYU/edit#gid=1880565406) and inform your manager.
62 |
63 | ## First days (accounts & access)
64 |
65 | 1. Follow up on [administrative tasks](https://docs.google.com/document/d/1bJSrlyc-e7bcOxV4sjx3FesMNVgdwNxUzMvIYywbt-0).
66 | 2. Understand the meetings the team attends:
67 |
68 | Ask your manager to be added to the [Observability Program](https://calendar.google.com/calendar/u/0?cid=cmVkaGF0LmNvbV91N3YwbGt2cnRuM2wwbWJmMnF2M2VkMm12MEBncm91cC5jYWxlbmRhci5nb29nbGUuY29t) calendar. Ensure you attend the following recurring meetings:
69 |
70 | * Team syncs
71 | * Sprint retro/planning
72 | * Sprint reviews
73 | * Weekly architecture call
74 | * 1on1 with manager
75 | * Weekly 1on1 with your mentor ([mentors are tracked here](https://docs.google.com/spreadsheets/d/1SpdBbZChBNuPHVtbCjOch1mfZGUuCjkrp7yyCClL9kk/edit#gid=0))
76 |
77 | ## First weeks
78 |
79 | Set up your computer and development environment and do some research. Feel free to come back to these items on an ongoing basis as needed. There is no need to complete them all at once.
80 |
81 | ### General
82 |
83 | 1. Review our product documentation (this is very important): [Understanding the monitoring stack | Monitoring | OpenShift Container Platform 4.10](https://docs.openshift.com/container-platform/4.10/monitoring/monitoring-overview.html#understanding-the-monitoring-stack_monitoring-overview)
84 | 2. Review our team’s process doc: [Monitoring Team Process](https://docs.google.com/document/d/1vbDGcjMjJMTIWcua5Keajla9FzexjLKmVk7zoUc0_MI/edit#heading=h.n0ac5lllvh13)
85 | 3. Review how others should formally submit requests to our team: [Requests: Monitoring Team](https://docs.google.com/document/d/10orRGt5zlmZ-XsXQNY-sg6lOzWDCrPmHP68Oi-ETU9I)
86 | 4. If you haven’t already, buy this book and make a plan to finish it over time (you can expense it): *“Site Reliability Engineering: How Google Runs Production Systems”*. An online version of the book can be found here: [https://sre.google/books/](https://sre.google/books/).
87 | 5. Ensure you attend a meeting with your team lead or architect to get a general overview of our in-cluster OpenShift technology stack.
88 | 6. Ensure you attend a meeting with your team lead or architect to get a general overview of our hosted Observatorium/Telemetry stack.
89 | 7. Bookmark this spreadsheet as a reference for all [OpenShift release dates](https://docs.google.com/spreadsheets/d/19bRYespPb-AvclkwkoizmJ6NZ54p9iFRn6DGD8Ugv2c/edit#gid=0). Alternatively, you can add the [OpenShift Release Dates](https://calendar.google.com/calendar/embed?src=c_188dvhrfem5majheld63i20a7rslg%40resource.calendar.google.com&ctz=America%2FNew_York) calendar.
90 |
91 | ### Watch these talks
92 |
93 | * [Prometheus introduction](https://www.youtube.com/watch?v=PzFUwBflXYc) by Julius Volz (project’s cofounder) @ KubeCon EU 2020
94 | * [The Zen of Prometheus](https://www.youtube.com/watch?v=Nqp4fjw_omU), by Kemal (ex-Observability Platform team) @ PromCon 2020
95 | * [The RED Method: How To Instrument Your Services](https://www.youtube.com/watch?v=zk77VS98Em8) by Tom Wilkie @ GrafanaCon EU 2018
96 | * [Thanos: Prometheus at Scale](https://www.youtube.com/watch?v=q9j8vpgFkoY) by Lucas and Bartek (Observability Platform team) @ DevConf 2020
97 | * [Instrumenting Applications and Alerting with Prometheus](https://www.youtube.com/watch?v=sHKWD8XnmmY), by Simon (Cluster Observability team) @ OSSEU 2019
98 | * [PromQL for mere mortals](https://www.youtube.com/watch?v=hTjHuoWxsks) by Ian Billett (Observability Platform team) @ PromCon 2019
99 | * [Life of an alert (Alertmanager)](https://www.youtube.com/watch?v=PUdjca23Qa4), @ PromCon 2018
100 | * [Best practices and pitfalls](https://www.youtube.com/watch?v=_MNYuTNfTb4) @ PromCon 2017
101 | * [Deep Dive: Kubernetes Metric APIs using Prometheus](https://www.youtube.com/watch?v=cIoOAbzhR7k)
102 | * [Monitoring Kubernetes with prometheus-operator](https://www.youtube.com/watch?v=MuHPMXCGiLc) by Lili @ Cloud Native Computing Berlin meetup 2021
103 | * [Using Jsonnet to Package Together Dashboards, Alerts and Exporters](https://www.youtube.com/watch?v=b7-DtFfsL6E) by Tom Wilkie (Grafana Labs) @ KubeCon Europe 2018
104 | * [(Internal) Observatorium Deep Dive](https://drive.google.com/drive/u/0/folders/1NHhgoYi5y58wJpi_qp49tx1V9TQcRQTj), January 2021 by Kemal
105 | * [(Internal) How to Get Reviewers to Block your Changes](https://drive.google.com/file/d/1KOWv5A2qAoO1CfbfhCIJW1wcOnGpfTST/view), March 2022 by Assaf
106 |
107 | ### [optional] Additional information & Exploration
108 |
109 | * [https://www.linkedin.com/learning/](https://www.linkedin.com/learning/)
110 |   * Red Hat has a corporate subscription to LinkedIn Learning with great introductory courses on many topics relevant to our team
111 | * [Prometheus Monitoring](https://www.youtube.com/channel/UC4pLFely0-Odea4B2NL1nWA/videos) channel on YouTube
112 | * PromLabs [PromQL cheat sheet](https://promlabs.com/promql-cheat-sheet/)
113 | * [prometheus-example-app](https://github.com/brancz/prometheus-example-app)
114 | * [Kubernetes-sample-controller](https://github.com/kubernetes/sample-controller)
115 |
116 | The team uses various tools; get familiar with them by reading through the documentation and trying them out:
117 |
118 | * [Go](https://golang.org/)
119 | * [Jsonnet](https://jsonnet.org/)
120 | * [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/)
121 | * [Kubernetes](https://kubernetes.io/)
122 | * [Thanos](https://thanos.io/)
123 | * [Observatorium](https://github.com/observatorium)
124 |
125 | ### Who’s Who?
126 |
127 | * For all the teams and people in OpenShift Engineering, see this [Team Member Tracking spreadsheet](https://docs.google.com/spreadsheets/d/1M4C41fX2J1nBXhqPdtwd8UP4RAx98NA4ByIUv-0Z0Ds/edit?usp=drive_web&ouid=116712625969749019622). Bookmark this and refer to it as needed.
128 | * Schedule a meeting with your manager to go over the team organizational structure.
129 |
130 | ### First project
131 |
132 | Your first project should ideally:
133 |
134 | * Provide an interesting set of related tasks that make you familiar with various aspects of internal and external parts of Prometheus and OpenShift.
135 | * Encourage discussion with other upstream maintainers and/or people at Red Hat.
136 | * Be aligned with the area of Prometheus and more generally the monitoring stack you want to work on.
137 | * Have a visible impact for you and others.
138 |
139 | Here’s a list of potential starter projects; talk to us to discuss them in more detail and figure out which one suits you.
140 |
141 | (If you are not a new hire, please add/remove projects as appropriate)
142 |
143 | * Set up Prometheus, Alertmanager, and node-exporter (a minimal Prometheus configuration sketch is included at the end of this section)
144 |   * As binaries on your machine (Bonus: Compile them yourself)
145 |   * As containers
146 | * Set up Prometheus as a StatefulSet on vanilla Kubernetes (minikube or your tool of choice)
147 | * Try the Prometheus Operator on vanilla Kubernetes (minikube or your tool of choice)
148 | * Try kube-prometheus on vanilla Kubernetes
149 | * Try the cluster-monitoring-operator on OpenShift (easiest is through the cluster-bot on Slack)
150 |
151 | During the project, keep the feedback cycles with other people as long or as short as you feel comfortable with. If you are not sure, ask! Try to briefly check in with the team regularly.
152 |
153 | Try to submit any coding work in small batches. This makes it easier for us to review and realign quickly.
154 |
155 | Everyone gets stuck sometimes. There are various smaller issues around the Prometheus and Alertmanager upstream repositories and the different monitoring operators. If you need a bit of distance, tackle one of them for a while and then get back to your original problem. This will also help you to get a better overview. If you are still stuck, just ask someone and we’ll discuss things together.
156 |
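To get the first starter project going, a minimal Prometheus configuration is enough. The sketch below assumes Prometheus, Alertmanager and node-exporter run locally on their default ports (9090, 9093 and 9100); adjust it to your setup:

```yaml
# prometheus.yml: minimal local configuration scraping Prometheus itself,
# Alertmanager and node-exporter.
global:
  scrape_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]

scrape_configs:
- job_name: prometheus
  static_configs:
  - targets: ["localhost:9090"]
- job_name: alertmanager
  static_configs:
  - targets: ["localhost:9093"]
- job_name: node
  static_configs:
  - targets: ["localhost:9100"]
```

Start Prometheus with `./prometheus --config.file=prometheus.yml` and check http://localhost:9090/targets to confirm the targets are being scraped.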
157 | ## First Months
158 |
159 | * If you will be starting out working more closely with the in-cluster stack, be sure to review this document as well: [In-Cluster Monitoring Onboarding](https://docs.google.com/document/d/16Uzd8OLkBdN0H4KxqQIr7HTPYZWHKz3WdGtOlB6Rcdk/edit#). Otherwise, if you are starting out more focused on the Observatorium service, review this doc: [Observatorium Platform Onboarding](https://docs.google.com/document/d/1RXSJYpx2x3bje6fwy2PEUSOgDrBlxq24A5vh2mHcxnk/edit#)
160 | * Try to get *something* (anything) merged into one of our repositories
161 | * Begin your 2nd project
162 | * Create a PR for the master onboarding doc (this one) with improvements you think would help others
163 |
164 | ### Second project
165 |
166 | After your starter project is done, we’ll discuss how it went and what your future projects will be. By then you'll hopefully have a good overview of which areas you are interested in and what their priority is. Discuss with your team lead or manager what your next project will be.
167 |
168 | ## Glossary
169 |
170 | Our team's glossary can be found [here](https://docs.google.com/document/d/1bJSrlyc-e7bcOxV4sjx3FesMNVgdwNxUzMvIYywbt-0/edit#heading=h.9lupa64ck0pj).
171 |
--------------------------------------------------------------------------------
/content/Products/OpenshiftMonitoring/telemetry.md:
--------------------------------------------------------------------------------
1 | # Sending metrics via Telemetry
2 |
3 | ## Targeted audience
4 |
5 | This document is intended for OpenShift developers who want to ship new metrics to the Red Hat Telemetry service.
6 |
7 | ## Background
8 |
9 | Before going into the details, a few words about [Telemetry](https://rhobs-handbook.netlify.app/services/rhobs/use-cases/telemetry.md/) and the process to add a new metric.
10 |
11 | **What is Telemetry?**
12 |
13 | Telemetry is a system operated and hosted by Red Hat that collects data from connected clusters to enable subscription management automation, monitor the health of clusters, assist with support, and improve the customer experience.
14 |
15 | **What does sending metrics via Telemetry mean?**
16 |
17 | You should send metrics via Telemetry when you want and need to see these metrics for **all** OpenShift clusters. This is primarily for gaining insights into how OpenShift is used, troubleshooting and monitoring the fleet of clusters. Users can already see these metrics in their clusters via Prometheus even when they are not available via Telemetry.
18 |
19 | **How are metrics shipped via Telemetry?**
20 |
21 | Only metrics which are already collected by the [in-cluster monitoring stack](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#in-cluster-monitoring-stack) can be shipped via Telemetry. The `telemeter-client` pod running in the `openshift-monitoring` namespace collects metrics from the `prometheus-k8s` service every 4m30s using the `/federate` endpoint and ships the samples to the Telemetry endpoint using a custom protocol.
22 |
23 | **How long will it take for my new telemetry metrics to show up?**
24 |
25 | Please start this process and involve the monitoring team as early as possible. The process described in this document includes a thorough review of the underlying metrics and labels. The monitoring team will try to understand your use case and perhaps propose improvements and optimizations. Metric, label and rule names will be reviewed to ensure they are [following best practices](https://prometheus.io/docs/practices/naming/). This can take several review rounds over multiple weeks.
26 |
27 | ## Requirements
28 |
29 | Shipping metrics via Telemetry is only possible for components running in namespaces with the `openshift.io/cluster-monitoring=true` label. In practice, this means that your component falls into one of these two categories:
30 | * Your operator/operand is included in the OCP payload (e.g. it is a core/platform component).
31 | * Your operator/operand is deployed via OLM and has been certified by Red Hat.
32 |
33 | Your component should already be instrumented and scraped by the [in-cluster monitoring stack](#in-cluster-monitoring-stack) using `ServiceMonitor` and/or `PodMonitor` objects.
34 |
35 | ## Sending metrics via Telemetry step-by-step
36 |
37 | The overall process is as follows:
38 | 1. Request approval from the monitoring team.
39 | 2. Configure recording rules using `PrometheusRule` objects (a sketch is shown after this list).
40 | 3. Modify the configuration of the Telemeter client in the [Cluster Monitoring Operator](https://github.com/openshift/cluster-monitoring-operator/) repository to collect the new metrics.
41 | 4. Synchronize the Telemeter server's configuration from the Cluster Monitoring Operator project.
42 | 5. Wait for the Telemeter server's configuration to be rolled out to production.
43 |
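To illustrate step 2, a recording rule typically pre-aggregates the metric so that only a handful of time series are sent. The `PrometheusRule` below is a sketch with made-up names (namespace, rule group, metric and recorded series); the aggregation drops the per-pod and per-instance labels:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-telemetry-rules
  namespace: openshift-example
spec:
  groups:
  - name: my-app.telemetry.rules
    rules:
    # Aggregate away the per-pod labels so that a single time series
    # per cluster is shipped via Telemetry.
    - record: cluster:my_app_feature_usage:sum
      expr: sum(my_app_feature_usage)
```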
44 | ### Request approval
45 |
46 | The first step is to identify which metrics you want to send via Telemetry and what the [cardinality](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#what-is-the-cardinality-of-a-metric) of these metrics is (i.e. how many timeseries will be sent in total). Typically you start with metrics that show how your component is being used. In practice, we recommend starting by shipping no more than:
47 | * 1 to 3 metrics.
48 | * 1 to 10 timeseries per metric.
49 | * 10 timeseries in total.
50 |
51 | If you are above these limits, you have two choices:
52 | * (recommended) aggregate the metrics before sending. For instance: sum all values for a given metric.
53 | * request an exception from the monitoring team. The exception requires approval from upper management so make sure that your request is well motivated!
54 |
55 | Finally, your metric **MUST NOT** contain any personally identifiable information (names, email addresses, information about user workloads).
56 |
57 | Use the following information to file one JIRA ticket per metric in the [MON project](https://issues.redhat.com//secure/CreateIssueDetails!init.jspa?pid=12323177&issuetype=3&labels=telemetry-review-request&summary=Send+metric+...+via+Telemetry&description=h1.%20Request%20for%20sending%20data%20via%20telemetry%0A%0AThe%20goal%20is%20to%20collect%20metrics%20about%20...%20because%20...%0A%0Ah2.%20%3CMetric%20name%3E%0A%0A%3CMetric%20name%3E%20represents%20...%0A%0ALabels%0A%2A%20%3Clabel%201%3E%2C%20possible%20values%20are%20...%0A%2A%20%3Clabel%202%3E%2C%20possible%20values%20are%20...%0A%0AThe%20cardinality%20of%20the%20metric%20is%20at%20most%20%3CX%3E.%0A%0AComponent%20exposing%20the%20metric%3A%20https%3A%2F%2Fgithub.com%2F%3Corg%3E%2F%3Cproject%3E%0Ah2.&priority=4):
58 |
59 | * Type: `Task`
60 | * Title: `Send metric via Telemetry`
61 | * Label: `telemetry-review-request`
62 | * Description template:
63 |
64 | ```
65 | h1. Request for sending data via telemetry
66 |
67 | The goal is to collect metrics about ... because ...
68 |
69 |  represents ...
70 |
71 | Labels
72 | *